Sha256: 53c12ad811d2a1a64d91f173d9a83c10dff2695003cd9f1964d6dbb41a5761cf
Size: 1.72 KB
Versions: 4
Compression:
Stored size: 1.72 KB
Contents
module Tokenizers
  # The base indexing tokenizer.
  #
  # Override in indexing subclasses and define in configuration.
  #
  class Index < Base

    # Default handling definitions. Override in config.
    #
    removes_characters(//)
    stopwords(//)
    contracts_expressions(//, '')
    splits_text_on(/\s/)
    normalizes_words([])
    removes_characters_after_splitting(//)

    # Default indexing preprocessing hook.
    #
    # Does:
    # 1. Umlaut substitution.
    # 2. Downcasing.
    # 3. Remove illegal expressions.
    # 4. Contraction.
    # 5. Remove non-single stopwords. (Stopwords that occur with other words)
    #
    def preprocess text
      text = substituter.substitute text if substituter?
      text.downcase!
      remove_illegals text
      contract text
      # we do not remove single stopwords for an entirely different
      # reason than in the query tokenizer.
      # An indexed thing with just name "UND" (a stopword) should not lose its name.
      #
      remove_non_single_stopwords text
      text
    end

    # Default indexing pretokenizing hook.
    #
    # Does:
    # 1. Split the text into words.
    # 2. Normalize each word.
    #
    # TODO Rename into wordize? Or somesuch?
    #
    def pretokenize text
      words = split text
      words.collect! do |word|
        normalize_with_patterns word
        word
      end
    end

    # Does not actually return a token, but a
    # symbol "token".
    #
    def token_for text
      symbolize text
    end

    # Rejects tokens if they are too short (or blank).
    #
    # Override in subclasses to redefine behaviour.
    #
    def reject tokens
      tokens.reject! { |token| token.to_s.size < 2 }
    end

  end
end
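The hooks above depend on `Base`, which this listing does not show. As a rough illustration of how the documented steps could fit together, here is a minimal standalone sketch. The class name `SketchIndexTokenizer`, its constructor options, the example patterns, and the order in which `tokenize` calls the hooks are all assumptions made for this example, not Picky's actual implementation. Umlaut substitution, contraction, and pattern normalization are left out.

```ruby
# Hypothetical stand-in for the indexing pipeline described above.
# Names, patterns, and call order are assumptions, not Picky's API.
class SketchIndexTokenizer
  def initialize(illegals: /[^a-z0-9\s]/, stopwords: /\b(and|the|und)\b/, splitter: /\s/)
    @illegals  = illegals   # characters removed before splitting
    @stopwords = stopwords  # words removed unless they stand alone
    @splitter  = splitter   # where the text is split into words
  end

  # Assumed order: preprocess, pretokenize, reject, then symbolize.
  def tokenize(text)
    words = pretokenize(preprocess(text.dup))
    reject(words).map { |word| token_for(word) }
  end

  # Downcase, strip illegal characters, drop non-single stopwords.
  def preprocess(text)
    text.downcase!
    text.gsub!(@illegals, '')
    remove_non_single_stopwords(text)
  end

  # A text consisting of one word keeps that word even if it is a
  # stopword, so an indexed name like "UND" is not lost.
  def remove_non_single_stopwords(text)
    return text if text.strip !~ /\s/
    text.gsub!(@stopwords, '')
    text
  end

  # Split the text into words.
  def pretokenize(text)
    text.split(@splitter)
  end

  # Drop blank and one-character words.
  def reject(words)
    words.reject { |word| word.to_s.size < 2 }
  end

  # A "token" here is just a symbol.
  def token_for(word)
    word.to_sym
  end
end

tokenizer = SketchIndexTokenizer.new
p tokenizer.tokenize("The Cat and the Hat!") # => [:cat, :hat]
p tokenizer.tokenize("UND")                  # => [:und]
```

The sketch uses the non-mutating `Array#reject` because `Array#reject!` returns `nil` when no element is removed, so a caller relying on the return value of the `reject` hook shown above would need to guard against that.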
Version data entries
4 entries across 4 versions & 1 rubygems
Version | Path |
---|---|
picky-0.3.0 | lib/picky/tokenizers/index.rb |
picky-0.2.4 | lib/picky/tokenizers/index.rb |
picky-0.2.3 | lib/picky/tokenizers/index.rb |
picky-0.2.2 | lib/picky/tokenizers/index.rb |