Sha256: 5d21f87c902aa70ff705029acd99e67e38ec252f1e8b620263528947e90d39d4

Contents?: true

Size: 1.08 KB

Versions: 5

Compression:

Stored size: 1.08 KB

Contents

# encoding: utf-8

#  Classifies TextBlocks as content/not-content through rules that have been determined
#  using the C4.8 machine learning algorithm, as described in the paper
#  "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly
#  using number of words per block and link density per block.

module Boilerpipe::Filters
  class NumWordsRulesClassifier

    def self.process(doc)
      empty = Boilerpipe::Document::TextBlock.empty_start
      text_blocks = [empty] + doc.text_blocks + [empty]

      text_blocks.each_cons(3) do |slice|
        prev, current, nxt = *slice
        current.content = classify(prev, current, nxt)
      end

      doc
    end

    private

    def self.classify(prev, current, nxt)
      return false if current.link_density > 0.333333

      if prev.link_density <= 0.555556
        return true if current.num_words > 16

        return true if nxt.num_words > 15
        return true if prev.num_words > 4
      else
        return true if current.num_words > 40
        return true if nxt.num_words > 17
      end

      false
    end

  end
end

Version data entries

5 entries across 5 versions & 1 rubygems

Version Path
boilerpipe-ruby-0.4.0 lib/boilerpipe/filters/num_words_rules_classifier.rb
boilerpipe-ruby-0.3.0 lib/boilerpipe/filters/num_words_rules_classifier.rb
boilerpipe-ruby-0.2.0 lib/boilerpipe/filters/num_words_rules_classifier.rb
boilerpipe-ruby-0.1.1 lib/boilerpipe/filters/num_words_rules_classifier.rb
boilerpipe-ruby-0.1.0 lib/boilerpipe/filters/num_words_rules_classifier.rb