Sha256: ce7669328ecab071a6a90d60d34b0daee96f5774084aab3ceaa9e520f9c9eac2

Contents?: true

Size: 649 Bytes

Versions: 1

Compression:

Stored size: 649 Bytes

Contents

module Boilerpipe
  class UnicodeTokenizer
    INVISIBLE_SEPARATOR = "\u2063"
    WORD_BOUNDARY = Regexp.new('\b')
    NOT_WORD_BOUNDARY = Regexp.new("[\u2063]*([\\\"'\\.,\\!\\@\\-\\:\\;\\$\\?\\(\\)/])[\u2063]*")

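    # Tokenizes text in five steps:
    # 1. insert an invisible separator (U+2063) at every word boundary
    # 2. remove the separators around punctuation (quotes, . , ! @ - : ; $ ? ( ) /)
    #    so that punctuation stays attached to the adjacent word
    # 3. collapse each run of spaces and invisible separators into a single space
    # 4. trim leading and trailing whitespace
    # 5. split into tokens on spaces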

    def self.tokenize(text)
      text.gsub(WORD_BOUNDARY, INVISIBLE_SEPARATOR)
        .gsub(NOT_WORD_BOUNDARY, '\1')
        .gsub(/[ \u2063]+/, ' ')
        .strip
        .split(/[ ]+/)
    end
  end
end
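
The pipeline in `tokenize` is easier to follow when each intermediate string is visible. The following is a minimal sketch, not part of boilerpipe-ruby: the three constants are copied from the listing above so that the snippet runs without the gem installed, and `trace_tokenize` is a hypothetical helper name introduced only for illustration.

```ruby
# Constants copied from the listing above so the sketch is self-contained.
INVISIBLE_SEPARATOR = "\u2063"
WORD_BOUNDARY = Regexp.new('\b')
NOT_WORD_BOUNDARY = Regexp.new("[\u2063]*([\\\"'\\.,\\!\\@\\-\\:\\;\\$\\?\\(\\)/])[\u2063]*")

# Hypothetical helper (an assumption, not in the gem): runs the same steps
# as UnicodeTokenizer.tokenize but returns every intermediate stage.
def trace_tokenize(text)
  # Step 1: mark every word boundary with U+2063.
  marked = text.gsub(WORD_BOUNDARY, INVISIBLE_SEPARATOR)
  # Step 2: drop the markers next to listed punctuation so it stays attached.
  attached = marked.gsub(NOT_WORD_BOUNDARY, '\1')
  # Steps 3 and 4: collapse spaces/markers into single spaces, then trim.
  spaced = attached.gsub(/[ \u2063]+/, ' ').strip
  # Step 5: split on spaces.
  tokens = spaced.split(/[ ]+/)
  { marked: marked, attached: attached, spaced: spaced, tokens: tokens }
end

result = trace_tokenize("Hello, world! a+b")
puts result[:spaced]          # Hello, world! a + b
puts result[:tokens].inspect  # ["Hello,", "world!", "a", "+", "b"]
```

In this trace the comma and exclamation mark stay attached to their words because they appear in the `NOT_WORD_BOUNDARY` character class, while `+` is not listed and therefore becomes a token of its own.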

Version data entries

1 entry across 1 version & 1 rubygem

Version Path
boilerpipe-ruby-0.0.1 lib/boilerpipe/util/unicode_tokenizer.rb