Sha256: 3d036a0d350b5c6fbe40b68d5dbb51bf5546cbdb45efc95a68785e36f6713c57

Contents?: true

Size: 867 Bytes

Versions: 5

Compression:

Stored size: 867 Bytes

Contents

# A full-text extractor trained on the krdwrd Canola corpus (http://krdwrd.org/).
# Corpus description:
# https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf
# Works well with SimpleEstimator, too.

module Boilerpipe::Filters
  class CanolaClassifier
    def self.process(doc)
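      # Nothing to classify in a document without text blocks.
      return doc if doc.text_blocks.empty?

      # Pad both ends with an empty sentinel block so the first and last
      # real blocks also have a previous and a next neighbour.
      empty = Boilerpipe::Document::TextBlock.empty_start
      text_blocks = [empty] + doc.text_blocks + [empty]

      # Slide a three-block window over the list and classify the middle block.
      text_blocks.each_cons(3) do |prev, current, nxt|
        current.content = classify(prev, current, nxt)
      end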

      doc
    end

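    # Returns true when the current block should be treated as content,
    # based on the word counts and link densities of the block and its
    # two neighbours.
    def self.classify(prev, current, nxt)
      (current.link_density > 0 && nxt.num_words > 11) ||
        current.num_words > 19 ||
        (nxt.num_words > 6 && nxt.link_density == 0 && prev.link_density == 0 &&
          (current.num_words > 6 || prev.num_words > 7 || nxt.num_words > 19))
    end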
  end
end
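
The decision rule above can be tried without the gem. The following is a minimal sketch, not part of boilerpipe-ruby: `Block` is a hypothetical stand-in for `Boilerpipe::Document::TextBlock` that models only `num_words`, `link_density` and `content`, `canola_content?` is an illustrative name, and the edge sentinel is assumed to have zero words and zero link density, which is what `TextBlock.empty_start` is presumed to return.

```ruby
# Hypothetical stand-in for Boilerpipe::Document::TextBlock; only the
# attributes the classifier reads or writes are modeled.
Block = Struct.new(:num_words, :link_density, :content)

# Same boolean rule as CanolaClassifier.classify, split into named parts.
def canola_content?(prev, current, nxt)
  linked_before_long_block = current.link_density > 0 && nxt.num_words > 11
  long_block = current.num_words > 19
  plain_neighbours = nxt.num_words > 6 &&
                     nxt.link_density == 0 && prev.link_density == 0
  enough_words = current.num_words > 6 || prev.num_words > 7 || nxt.num_words > 19

  linked_before_long_block || long_block || (plain_neighbours && enough_words)
end

# Assumed sentinel: zero words, zero link density.
edge = Block.new(0, 0.0, false)

blocks = [
  Block.new(25, 0.0, false), # long paragraph
  Block.new(8,  0.0, false), # short block between two link-free neighbours
  Block.new(12, 0.0, false), # medium block followed by a short footer
  Block.new(3,  1.0, false)  # footer consisting only of links
]

# Pad with the sentinel and classify the middle block of each window.
([edge] + blocks + [edge]).each_cons(3) do |prev, current, nxt|
  current.content = canola_content?(prev, current, nxt)
end

results = blocks.map(&:content)
p results # => [true, true, false, false]
```

The third block is rejected even though it has 12 words: none of the clauses fire unless the block itself exceeds 19 words, it contains links and is followed by a block of more than 11 words, or its next neighbour has more than 6 words.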

Version data entries

5 entries across 5 versions & 1 rubygem

Version Path
boilerpipe-ruby-0.5.0 lib/boilerpipe/filters/canola_classifier.rb
boilerpipe-ruby-0.4.4 lib/boilerpipe/filters/canola_classifier.rb
boilerpipe-ruby-0.4.3 lib/boilerpipe/filters/canola_classifier.rb
boilerpipe-ruby-0.4.2 lib/boilerpipe/filters/canola_classifier.rb
boilerpipe-ruby-0.4.1 lib/boilerpipe/filters/canola_classifier.rb