Sha256: 57b73d2450d8484b5e97220070a77a1059c08e3e1baa89169f9f91f83a31b949

# coding: utf-8
require 'delegate'
require 'unicode_utils/downcase'
require 'unicode_utils/each_word'

module TfIdfSimilarity
  # A token: a string wrapper with a validity check and Solr-style filters.
  #
  # @note We can add more filters from Solr and stem using Porter's Snowball.
  #
  # @see https://github.com/aurelian/ruby-stemmer
  # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
  # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
  # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
  class Token < ::SimpleDelegator
    # Returns false if the string consists entirely of digits, punctuation,
    # whitespace or control characters.
    #
    # @note Some implementations also ignore one- and two-letter words.
    #
    # @return [Boolean] whether the string is a token
    def valid?
      !self[%r{
        \A
          (
           \d           | # number
           [[:cntrl:]]  | # control character
           [[:punct:]]  | # punctuation
           [[:space:]]    # whitespace
          )+
        \z
      }x]
    end

    # Returns a lowercase string.
    #
    # @return [Token] a lowercase string
    #
    # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseFilterFactory
    def lowercase_filter
      self.class.new(UnicodeUtils.downcase(self))
    end

    # Returns a string with all periods removed (as in acronyms) and any
    # trailing English possessive ('s) stripped.
    #
    # @return [Token] a string with no periods and no trailing English possessive
    #
    # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ClassicFilterFactory
    def classic_filter
      self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
    end

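    # Returns a lowercase string with no periods and no trailing English possessive.
    #
    # @return [String] the lowercased, filtered string
    def to_s
      # Inline the logic of #lowercase_filter and #classic_filter instead of
      # calling them, to avoid allocating intermediate Token objects.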
      UnicodeUtils.downcase(self).gsub('.', '').sub(/['`’]s\z/, '')
    end
  end
end
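
The following is a minimal usage sketch, not part of the gem. It assumes a hypothetical stand-in class, `SketchToken`, that mirrors `Token` but replaces `UnicodeUtils.downcase` with the built-in `String#downcase` (Unicode-aware on Ruby 2.4 and later), so it runs without the `unicode_utils` gem. It illustrates how `valid?` discards tokens made only of digits or punctuation, and how `to_s` normalizes the rest.

```ruby
# coding: utf-8
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token (assumption: String#downcase
# is used in place of UnicodeUtils.downcase, which requires Ruby 2.4+ for
# non-ASCII case mapping).
class SketchToken < ::SimpleDelegator
  # Matches strings made only of digits, control characters, punctuation
  # or whitespace.
  NON_WORD = /\A(?:\d|[[:cntrl:]]|[[:punct:]]|[[:space:]])+\z/

  def valid?
    !self[NON_WORD]
  end

  def lowercase_filter
    self.class.new(downcase)
  end

  def classic_filter
    self.class.new(delete('.').sub(/['`’]s\z/, ''))
  end

  def to_s
    downcase.delete('.').sub(/['`’]s\z/, '')
  end
end

words  = ["Ruby's", 'U.S.A.', '2013', '...', 'ÉCOLE']
tokens = words.map { |word| SketchToken.new(word) }

# Keep only real tokens, then normalize each one to a plain String.
kept = tokens.select(&:valid?).map(&:to_s)
# kept == ["ruby", "usa", "école"]
```

Note that an empty string passes `valid?`, because the pattern requires at least one character; callers that may produce empty tokens would need to filter them separately.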

Version data entries

2 entries across 2 versions & 1 rubygems

Version Path
tf-idf-similarity-0.3.0 lib/tf-idf-similarity/token.rb
tf-idf-similarity-0.2.0 lib/tf-idf-similarity/token.rb