Sha256: 3219c28752689274e3bec4f5b4c50edf88d6873e671c4d02e8e925af86a8edda
Contents?: true
Size: 1.91 KB
Versions: 3
Compression:
Stored size: 1.91 KB
Contents
# coding: utf-8

# A token.
#
# @note We can add more filters from Solr and stem using Porter's Snowball.
#
# @see https://github.com/aurelian/ruby-stemmer
# @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
# @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
# @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
class TfIdfSimilarity::Token < String
  # Returns a falsy value if all its characters are numbers, punctuation,
  # whitespace or control characters.
  #
  # @note Some implementations ignore one and two-letter words.
  #
  # @return [Boolean] whether the string is a token
  def valid?
    !self[%r{
      \A
      (
        \d          | # number
        [[:cntrl:]] | # control character
        [[:punct:]] | # punctuation
        [[:space:]]   # whitespace
      )+
      \z
    }x]
  end

  # Returns a lowercase string.
  #
  # @return [Token] a lowercase string
  #
  # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseFilterFactory
  def lowercase_filter
    self.class.new(defined?(UnicodeUtils) ? UnicodeUtils.downcase(self) : tr(
      "ÀÁÂÃÄÅĀĂĄÇĆĈĊČÐĎĐÈÉÊËĒĔĖĘĚĜĞĠĢĤĦÌÍÎÏĨĪĬĮĴĶĹĻĽĿŁÑŃŅŇŊÒÓÔÕÖØŌŎŐŔŖŘŚŜŞŠŢŤŦÙÚÛÜŨŪŬŮŰŲŴÝŶŸŹŻŽ",
      "àáâãäåāăąçćĉċčðďđèéêëēĕėęěĝğġģĥħìíîïĩīĭįĵķĺļľŀłñńņňŋòóôõöøōŏőŕŗřśŝşšţťŧùúûüũūŭůűųŵýŷÿźżž"
    ).downcase)
  end

  # Returns a string with no English possessive or periods in acronyms.
  #
  # @return [Token] a string with no English possessive or periods in acronyms
  #
  # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ClassicFilterFactory
  def classic_filter
    self.class.new(self.gsub('.', '').chomp("'s"))
  end
end
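The listing above only defines the filters. As a rough illustration of how they might be chained, the sketch below re-declares a trimmed copy of the class so it runs without the gem installed. The `tokens_for` helper and the sample sentence are hypothetical and are not part of tf-idf-similarity. The `UnicodeUtils` branch and the accented-character table are left out on the assumption that `String#downcase` is Unicode-aware (Ruby 2.4 and later).

```ruby
# Trimmed, standalone copy of the Token class for illustration only.
# Assumption: Ruby 2.4+, where String#downcase handles non-ASCII letters,
# so the UnicodeUtils branch and the tr table are omitted here.
module TfIdfSimilarity
  class Token < String
    # false when every character is a digit, control character,
    # punctuation mark or whitespace; true otherwise.
    def valid?
      !self[/\A(\d|[[:cntrl:]]|[[:punct:]]|[[:space:]])+\z/]
    end

    def lowercase_filter
      self.class.new(downcase)
    end

    # A String pattern in gsub is matched literally, so '.' removes periods only.
    def classic_filter
      self.class.new(gsub('.', '').chomp("'s"))
    end
  end
end

# Hypothetical helper (not part of the gem): split on whitespace,
# drop invalid tokens, then lowercase and apply the classic filter.
def tokens_for(text)
  text.split
      .map { |word| TfIdfSimilarity::Token.new(word) }
      .select(&:valid?)
      .map { |token| token.lowercase_filter.classic_filter }
end

result = tokens_for("The U.S.A.'s GDP grew 3 % in 2012 !")
p result
# prints ["the", "usa", "gdp", "grew", "in"]
```

Each filter wraps its result in `self.class.new`, so the chain keeps returning `Token` instances rather than plain strings, which is what makes `lowercase_filter.classic_filter` possible.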
Version data entries
3 entries across 3 versions & 1 rubygems
Version | Path |
---|---|
tf-idf-similarity-0.1.3 | lib/tf-idf-similarity/token.rb |
tf-idf-similarity-0.1.2 | lib/tf-idf-similarity/token.rb |
tf-idf-similarity-0.1.1 | lib/tf-idf-similarity/token.rb |