Sha256: ce7669328ecab071a6a90d60d34b0daee96f5774084aab3ceaa9e520f9c9eac2
Contents?: true
Size: 649 Bytes
Versions: 1
Compression:
Stored size: 649 Bytes
Contents
module Boilerpipe
  class UnicodeTokenizer
    INVISIBLE_SEPARATOR = "\u2063"
    WORD_BOUNDARY = Regexp.new('\b')
    NOT_WORD_BOUNDARY = Regexp.new("[\u2063]*([\\\"'\\.,\\!\\@\\-\\:\\;\\$\\?\\(\\)/])[\u2063]*")

    # replace word boundaries with 'invisible separator'
    # strip invisible separators from non-word boundaries
    # replace spaces or invisible separators with a single space
    # trim
    # split words on single space
    def self.tokenize(text)
      text.gsub(WORD_BOUNDARY, INVISIBLE_SEPARATOR)
          .gsub(NOT_WORD_BOUNDARY, '\1')
          .gsub(/[ \u2063]+/, ' ')
          .strip
          .split(/[ ]+/)
    end
  end
end
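The five steps listed in the comments can be tried in isolation. The following is a minimal standalone sketch, not part of boilerpipe-ruby: the two regular expressions are copied from the file so the snippet runs on its own, while `SEP` and `trace_tokenize` are hypothetical names introduced here for illustration. It only exercises ASCII input.

```ruby
# Standalone sketch of the tokenizer pipeline. SEP and trace_tokenize are
# hypothetical names; the patterns are copied from the listing above.
SEP = "\u2063" # U+2063 INVISIBLE SEPARATOR
WORD_BOUNDARY = Regexp.new('\b')
NOT_WORD_BOUNDARY = Regexp.new("[\u2063]*([\\\"'\\.,\\!\\@\\-\\:\\;\\$\\?\\(\\)/])[\u2063]*")

def trace_tokenize(text)
  marked   = text.gsub(WORD_BOUNDARY, SEP)         # 1. mark every word boundary
  glued    = marked.gsub(NOT_WORD_BOUNDARY, '\1')  # 2. drop marks around listed punctuation
  squeezed = glued.gsub(/[ \u2063]+/, ' ')         # 3. collapse spaces and marks to one space
  squeezed.strip.split(/[ ]+/)                     # 4. trim, 5. split on spaces
end

p trace_tokenize("Hello, world!") # => ["Hello,", "world!"]
p trace_tokenize("foo-bar a+b")   # => ["foo-bar", "a", "+", "b"]
```

Punctuation named in `NOT_WORD_BOUNDARY` (such as `,`, `!` and `-`) stays attached to the neighbouring word, whereas a character outside that list (here `+`) keeps its separators and ends up as a token of its own.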
Version data entries
1 entry across 1 version & 1 rubygem
Version | Path |
---|---|
boilerpipe-ruby-0.0.1 | lib/boilerpipe/util/unicode_tokenizer.rb |