Sha256: 3219c28752689274e3bec4f5b4c50edf88d6873e671c4d02e8e925af86a8edda

Size: 1.91 KB

Versions: 3

Stored size: 1.91 KB

Contents

# coding: utf-8

# A token.
#
# @note We can add more filters from Solr and stem using Porter's Snowball.
#
# @see https://github.com/aurelian/ruby-stemmer
# @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
# @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
# @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
class TfIdfSimilarity::Token < String
  # Returns false if every character is a digit, punctuation, whitespace or
  # a control character, and true otherwise.
  #
  # @note Some implementations also ignore one- and two-letter words.
  # @note An empty string matches none of the classes, so it counts as valid.
  #
  # @return [Boolean] whether the string is a token
  def valid?
    !self[%r{
      \A
        (
         \d           | # number
         [[:cntrl:]]  | # control character
         [[:punct:]]  | # punctuation
         [[:space:]]    # whitespace
        )+
      \z
    }x]
  end

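  # Returns a lowercase copy of the token.
  #
  # Uses UnicodeUtils.downcase if UnicodeUtils is loaded. Otherwise, maps
  # common accented Latin capitals with tr before calling String#downcase.
  #
  # @return [Token] a lowercase string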
  #
  # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseFilterFactory
  def lowercase_filter
    self.class.new(defined?(UnicodeUtils) ? UnicodeUtils.downcase(self) : tr(
      "ÀÁÂÃÄÅĀĂĄÇĆĈĊČÐĎĐÈÉÊËĒĔĖĘĚĜĞĠĢĤĦÌÍÎÏĨĪĬĮĴĶĹĻĽĿŁÑŃŅŇŊÒÓÔÕÖØŌŎŐŔŖŘŚŜŞŠŢŤŦÙÚÛÜŨŪŬŮŰŲŴÝŶŸŹŻŽ",
      "àáâãäåāăąçćĉċčðďđèéêëēĕėęěĝğġģĥħìíîïĩīĭįĵķĺļľŀłñńņňŋòóôõöøōŏőŕŗřśŝşšţťŧùúûüũūŭůűųŵýŷÿźżž"
    ).downcase)
  end

  # Returns a string with all periods removed (e.g. in acronyms) and a
  # trailing English possessive ('s) stripped.
  #
  # @return [Token] a string with no periods or trailing English possessive
  #
  # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ClassicFilterFactory
  def classic_filter
    self.class.new(delete('.').chomp("'s"))
  end
end
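
The following is a minimal usage sketch, not part of the gem, showing how the three methods above might be chained to normalize text. DemoToken is a hypothetical stand-in for TfIdfSimilarity::Token so the example runs without the gem, and normalize is a hypothetical helper. The stand-in's lowercase_filter assumes a Ruby whose String#downcase is Unicode-aware (2.4 or later), so it skips the UnicodeUtils and tr fallback.

```ruby
# coding: utf-8

# Hypothetical stand-in for TfIdfSimilarity::Token (assumption: same behavior
# as the class above, minus the UnicodeUtils/tr fallback).
class DemoToken < String
  # False if the string consists only of digits, control characters,
  # punctuation or whitespace.
  def valid?
    !self[/\A(\d|[[:cntrl:]]|[[:punct:]]|[[:space:]])+\z/]
  end

  # Lowercase copy, wrapped back into the token class.
  def lowercase_filter
    self.class.new(downcase)
  end

  # Removes every period, then a trailing "'s".
  def classic_filter
    self.class.new(delete('.').chomp("'s"))
  end
end

# Hypothetical helper: split on whitespace, drop non-tokens, then apply the
# classic filter followed by the lowercase filter.
def normalize(text)
  text.split
      .map { |word| DemoToken.new(word) }
      .select(&:valid?)
      .map { |token| token.classic_filter.lowercase_filter }
end

tokens = normalize("The U.S.A.'s 42 cats -- 3.14!")
puts tokens.inspect
```

Here "42", "--" and "3.14!" are rejected by valid? because they contain only digits and punctuation, and "U.S.A.'s" loses its periods before the possessive is stripped. Running classic_filter before lowercase_filter is a choice made for this sketch; the two filters are independent, so the order does not change the result for this input.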

Version data entries

3 entries across 3 versions & 1 rubygems

Version Path
tf-idf-similarity-0.1.3 lib/tf-idf-similarity/token.rb
tf-idf-similarity-0.1.2 lib/tf-idf-similarity/token.rb
tf-idf-similarity-0.1.1 lib/tf-idf-similarity/token.rb