Sha256: 359f067a567a6217297ee2a22d0b4e91672d55f856a772be296e2a176747b058
Contents?: true
Size: 830 Bytes
Versions: 3
Compression:
Stored size: 830 Bytes
Contents
module BowTfidf
  class Tokenizer
    SPLIT_REGEX = /[\s\n\t\.,\-\!:()\/%\\+\|@^<«>*'~;=»\?—•$”\"’\[£“■‘\{#®♦°™€¥\]©§\}–]/
    TOKEN_MIN_LENGTH = 3
    TOKEN_MAX_LENGTH = 15

    attr_reader :tokens

    def initialize
      @tokens = Set[]
    end

    def call(text)
      raise(ArgumentError, 'String instance expected') unless text.is_a?(String)

      raw_tokens = split(text)
      raw_tokens.each do |token|
        process_token(token)
      end

      tokens
    end

    private

    def split(text)
      text.split(SPLIT_REGEX)
    end

    def process_token(token)
      return if token.length < TOKEN_MIN_LENGTH
      return if token.length > TOKEN_MAX_LENGTH
      return if token.scan(/\D/).empty? # skip if str contains only digits

      tokens << token.downcase
    end
  end
end
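The filtering rules in the tokenizer above (length bounds, digit-only rejection, lowercasing) can be illustrated with a small standalone sketch. This is not the gem's code: `SIMPLE_SPLIT`, `MIN_LEN`, `MAX_LEN`, and `sketch_tokenize` are hypothetical names introduced here, and the separator pattern is a deliberately reduced stand-in for `SPLIT_REGEX`.

```ruby
require 'set'

# Hypothetical, simplified stand-in for SPLIT_REGEX: whitespace plus a few
# common punctuation marks. The original character class is much larger.
SIMPLE_SPLIT = /[\s.,\-!:()\/?]/
MIN_LEN = 3   # mirrors TOKEN_MIN_LENGTH
MAX_LEN = 15  # mirrors TOKEN_MAX_LENGTH

# Returns the set of lowercased tokens that pass the length and digit filters.
def sketch_tokenize(text)
  raise ArgumentError, 'String instance expected' unless text.is_a?(String)

  text.split(SIMPLE_SPLIT).each_with_object(Set.new) do |token, kept|
    next if token.length < MIN_LEN    # too short, e.g. "of", "is"
    next if token.length > MAX_LEN    # too long
    next if token.match?(/\A\d+\z/)   # digits only, e.g. "2024"
    kept << token.downcase
  end
end

RESULT = sketch_tokenize('The 2024 release of Ruby-3.3 is out, at last!')
# keeps: the, release, ruby, out, last
```

One difference worth noting: the sketch builds a fresh set on every call, whereas the original class stores `@tokens` on the instance, so repeated `call` invocations on the same `Tokenizer` accumulate tokens into one set.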
Version data entries
3 entries across 3 versions and 1 rubygem
Version | Path |
---|---|
bow_tfidf-0.1.2 | lib/bow_tfidf/tokenizer.rb |
bow_tfidf-0.1.1 | lib/bow_tfidf/tokenizer.rb |
bow_tfidf-0.1.0 | lib/bow_tfidf/tokenizer.rb |