Sha256: 72984ce2f49f9b07875bff37c6c35c6f9753d579bde18fcce7db21f5882f5b1b
Contents?: true
Size: 873 Bytes
Versions: 2
Compression:
Stored size: 873 Bytes
Contents
# A full-text extractor trained on http://krdwrd.org/
# https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf
# Works well with SimpleEstimator, too.

module Boilerpipe::Filters
  class CanolaClassifier
    def self.process(doc)
      return doc if doc.text_blocks.size < 1

      empty = Boilerpipe::Document::TextBlock.empty_start
      text_blocks = [empty] + doc.text_blocks + [empty]

      text_blocks.each_cons(3) do |slice|
        prev, current, nxt = *slice
        current.content = classify(prev, current, nxt)
      end

      doc
    end

    def self.classify(prev, current, nxt)
      current.link_density > 0 && nxt.num_words > 11 \
      || current.num_words > 19 \
      || nxt.num_words > 6 && nxt.link_density == 0 && prev.link_density == 0 && ( current.num_words > 6 || prev.num_words > 7 || nxt.num_words > 19 )
    end
  end
end
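The sliding-window rule above can be tried without the gem installed. The following is a minimal sketch, not the library's API: `Block`, `EMPTY`, `canola_content?` and `label_blocks` are hypothetical stand-ins, and the padding block is assumed to have zero words and zero link density. The boolean rule is copied from `classify`, with parentheses added only to make Ruby's `&&`-before-`||` precedence explicit.

```ruby
# Hypothetical stand-in for Boilerpipe::Document::TextBlock; only the
# fields read by the rule are modelled.
Block = Struct.new(:num_words, :link_density, :content)

# Assumed neutral padding block (zero words, zero link density).
EMPTY = Block.new(0, 0.0, false).freeze

# Same rule as CanolaClassifier.classify, grouped explicitly.
def canola_content?(prev, current, nxt)
  (current.link_density > 0 && nxt.num_words > 11) ||
    current.num_words > 19 ||
    (nxt.num_words > 6 && nxt.link_density == 0 && prev.link_density == 0 &&
      (current.num_words > 6 || prev.num_words > 7 || nxt.num_words > 19))
end

# Pad both ends so every real block sits in the middle of a 3-block window.
def label_blocks(blocks)
  return blocks if blocks.empty?

  ([EMPTY] + blocks + [EMPTY]).each_cons(3) do |prev, current, nxt|
    current.content = canola_content?(prev, current, nxt)
  end
  blocks
end

blocks = [
  Block.new(2, 1.0),   # short navigation link
  Block.new(4, 1.0),   # linked block directly before a long paragraph
  Block.new(30, 0.0),  # long paragraph
  Block.new(9, 0.0),   # medium paragraph followed by a short block
  Block.new(3, 0.0)    # short footer
]

labels = label_blocks(blocks).map(&:content)
puts labels.inspect # => [false, true, true, false, false]
```

In this sample the second block is kept even though it is fully linked, because the first clause of the rule fires when a linked block is followed by a block of more than 11 words.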
Version data entries
2 entries across 2 versions & 1 rubygems
Version | Path |
---|---|
boilerpipe-ruby-0.4.0 | lib/boilerpipe/filters/canola_classifier.rb |
boilerpipe-ruby-0.3.0 | lib/boilerpipe/filters/canola_classifier.rb |