Sha256: 3032b41a022aa7293662c5c3708f65ba4633d26ea8afecc16b12c53d5a3bbea5

Contents?: true

Size: 1.62 KB

Versions: 1

Compression:

Stored size: 1.62 KB

Contents

# coding: utf-8
require 'delegate'

# A token.
#
# @note We can add more filters from Solr and stem using Porter's Snowball.
#
# @see https://github.com/aurelian/ruby-stemmer
# @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
# @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
# @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
module TfIdfSimilarity
  class Token < ::SimpleDelegator
    # Returns a falsy value if all its characters are numbers, punctuation,
    # whitespace or control characters.
    #
    # @note Some implementations ignore one and two-letter words.
    #
    # @return [Boolean] whether the string is a token
    def valid?
      !self[%r{
        \A
          (
           \d           | # number
           [[:cntrl:]]  | # control character
           [[:punct:]]  | # punctuation
           [[:space:]]    # whitespace
          )+
        \z
      }x]
    end

    # Returns a lowercase string.
    #
    # @return [Token] a lowercase string
    #
    # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseFilterFactory
    def lowercase_filter
      self.class.new(UnicodeUtils.downcase(self))
    end

    # Returns a string with no English possessive or periods in acronyms.
    #
    # @return [Token] a string with no English possessive or periods in acronyms
    #
    # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ClassicFilterFactory
    def classic_filter
      self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
    end
  end
end

Version data entries

1 entries across 1 versions & 1 rubygems

Version Path
tf-idf-similarity-0.1.6 lib/tf-idf-similarity/token.rb