Sha256: 359f067a567a6217297ee2a22d0b4e91672d55f856a772be296e2a176747b058

Contents?: true

Size: 830 Bytes

Versions: 3

Compression:

Stored size: 830 Bytes

Contents

require 'set'

module BowTfidf
  class Tokenizer
    SPLIT_REGEX = /[\s\n\t\.,\-\!:()\/%\\+\|@^<«>*'~;=»\?—•$”\"’\[£“■‘\{#®♦°™€¥\]©§\}–]/
    # Tokens shorter than TOKEN_MIN_LENGTH or longer than TOKEN_MAX_LENGTH
    # characters are discarded.
    TOKEN_MIN_LENGTH = 3
    TOKEN_MAX_LENGTH = 15

    attr_reader :tokens

    def initialize
      @tokens = Set[]
    end

    def call(text)
      raise(ArgumentError, 'String instance expected') unless text.is_a?(String)

      raw_tokens = split(text)

      raw_tokens.each do |token|
        process_token(token)
      end

      tokens
    end

    private

    def split(text)
      text.split(SPLIT_REGEX)
    end

    def process_token(token)
      return if token.length < TOKEN_MIN_LENGTH
      return if token.length > TOKEN_MAX_LENGTH
      return unless token.match?(/\D/) # skip tokens that contain only digits

      tokens << token.downcase
    end
  end
end
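
The filtering rules above (length bounds, digits-only rejection, downcasing, and accumulation into a single Set) can be sketched in a standalone snippet. This is a minimal illustration, not the gem's code: `SketchTokenizer` and its shortened `SIMPLE_SPLIT` pattern are hypothetical stand-ins so the example runs without loading `bow_tfidf`.

```ruby
require 'set'

# Condensed stand-in for BowTfidf::Tokenizer so this snippet runs on its own.
# SIMPLE_SPLIT is a hypothetical, shortened pattern, not the gem's SPLIT_REGEX.
class SketchTokenizer
  SIMPLE_SPLIT = /[\s.,!?;:()\-]/
  MIN_LENGTH = 3
  MAX_LENGTH = 15

  attr_reader :tokens

  def initialize
    @tokens = Set[]
  end

  def call(text)
    raise(ArgumentError, 'String instance expected') unless text.is_a?(String)

    text.split(SIMPLE_SPLIT).each do |token|
      next if token.length < MIN_LENGTH || token.length > MAX_LENGTH
      next unless token.match?(/\D/) # digits-only tokens are dropped

      tokens << token.downcase
    end

    tokens
  end
end

tokenizer = SketchTokenizer.new

# "in" is too short and "2024" is digits only; "100m" survives because it
# contains a non-digit character.
first = tokenizer.call('The quick-brown fox, in 2024, ran 100m!').to_a

# The same instance keeps its Set between calls, so earlier tokens remain
# and the duplicate "fox" is not added twice.
second = tokenizer.call('Fox again').to_a
```

Because the tokens live in an instance-level Set, reusing one tokenizer across documents yields the union of their vocabularies; a fresh instance per document would be needed to keep them separate.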

Version data entries

3 entries across 3 versions & 1 rubygem

Version Path
bow_tfidf-0.1.2 lib/bow_tfidf/tokenizer.rb
bow_tfidf-0.1.1 lib/bow_tfidf/tokenizer.rb
bow_tfidf-0.1.0 lib/bow_tfidf/tokenizer.rb