Sha256: d2ccd61337382e3a2ac0ee23d553537d6c2e6527c6c84f2c1137233d7aecd269

Contents?: true

Size: 449 Bytes

Versions: 2

Compression:

Stored size: 449 Bytes

Contents

require 'unicode_utils/each_word'
require 'tf-idf-similarity/token'

module TfIdfSimilarity
  # A tokenizer using UnicodeUtils to tokenize a text.
  #
  # @see https://github.com/lang/unicode_utils
  class Tokenizer
    # Tokenizes a text.
    #
    # @param [String] text
    # @return [Array<Token>] an array of Token objects, one per word segment
    def tokenize(text)
      UnicodeUtils.each_word(text).map do |word|
        Token.new(word)
      end
    end
  end
end
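
The following is a minimal, self-contained sketch of how `tokenize` behaves, written so it runs without either gem installed. `SketchToken` and `SketchTokenizer` are hypothetical stand-ins, not part of tf-idf-similarity, and the regex split on `\b` is only an assumption used in place of `UnicodeUtils.each_word`, whose segmentation may differ. The point it illustrates is that calling `map` with a block on an enumerator yields an Array.

```ruby
# Hypothetical stand-in for TfIdfSimilarity::Token, which is defined in
# tf-idf-similarity/token and not shown in this file.
SketchToken = Struct.new(:text)

class SketchTokenizer
  # Stand-in for UnicodeUtils.each_word (an assumption for illustration only):
  # splits on regex word boundaries and returns an Enumerator when no block
  # is given.
  def each_word(text)
    return enum_for(:each_word, text) unless block_given?

    text.split(/\b/).each { |word| yield word }
  end

  # Same shape as Tokenizer#tokenize above: map each word segment to a token.
  def tokenize(text)
    each_word(text).map do |word|
      SketchToken.new(word)
    end
  end
end

tokenizer = SketchTokenizer.new
tokens = tokenizer.tokenize("Hello world")

puts tokenizer.each_word("Hello world").class # Enumerator
puts tokens.class                             # Array
puts tokens.map(&:text).inspect
```

With the real gems, the equivalent call would presumably be `TfIdfSimilarity::Tokenizer.new.tokenize(text)`. Note that whitespace and punctuation segments are wrapped in tokens too, so any filtering has to happen on the resulting tokens.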

Version data entries

2 entries across 2 versions & 1 rubygems

Version                    Path
tf-idf-similarity-0.3.0    lib/tf-idf-similarity/tokenizer.rb
tf-idf-similarity-0.2.0    lib/tf-idf-similarity/tokenizer.rb