Sha256: 286d2ec951bcf2c26914ecfed519773ae841432f1020252056d127b41f8b1e0c

Contents?: true

Size: 521 Bytes

Versions: 7

Compression:

Stored size: 521 Bytes

Contents

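module Boilerpipe::Extractors
  # Extracts the single largest content block from an HTML document.
  class LargestContentExtractor
    # Parses the HTML string, runs the filter pipeline, and returns the
    # text of whatever the pipeline left marked as content.
    def self.text(contents)
      doc = ::Boilerpipe::SAX::BoilerpipeHTMLParser.parse(contents)
      ::Boilerpipe::Extractors::LargestContentExtractor.process doc
      doc.content
    end

    # Applies the filters to doc in place, in order, and returns doc.
    def self.process(doc)
      filters = ::Boilerpipe::Filters
      # Classify each text block as content or boilerplate.
      filters::NumWordsRulesClassifier.process doc
      # Fuse content blocks that lie at most one block apart.
      filters::BlockProximityFusion::MAX_DISTANCE_1.process doc
      # Keep only the largest remaining content block.
      filters::KeepLargestBlockFilter::INSTANCE.process doc

      doc
    end
  end
end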
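
The last step of the pipeline above, keeping only the largest content block, can be illustrated with a minimal self-contained sketch. The Block struct and the keep_largest_block method below are hypothetical stand-ins written for illustration; they are not the gem's own classes, and the gem's KeepLargestBlockFilter may differ in detail. The sketch assumes "largest" means the content block with the most whitespace-separated words.

```ruby
# Hypothetical stand-in for a text block (NOT the gem's own block class).
# content is true when the block is currently classified as content.
Block = Struct.new(:text, :content) do
  # Number of whitespace-separated words in the block.
  def num_words
    text.split.size
  end
end

# Assumed behavior: among blocks marked as content, keep the one with the
# most words and mark every other block as boilerplate.
def keep_largest_block(blocks)
  largest = blocks.select(&:content).max_by(&:num_words)
  # equal? compares object identity, so only the chosen block stays content.
  blocks.each { |b| b.content = b.equal?(largest) }
  blocks
end

blocks = [
  Block.new("Home About Contact", false),
  Block.new("A short teaser paragraph.", true),
  Block.new("The main article body, which is considerably longer than the teaser.", true)
]

result = keep_largest_block(blocks).select(&:content).map(&:text)
puts result
```

With the gem itself installed, the extractor shown above would presumably be called as Boilerpipe::Extractors::LargestContentExtractor.text(html), where html is an HTML string supplied by the caller.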

Version data entries

7 entries across 7 versions and 1 rubygem

Version Path
boilerpipe-ruby-0.5.0 lib/boilerpipe/extractors/largest_content_extractor.rb
boilerpipe-ruby-0.4.4 lib/boilerpipe/extractors/largest_content_extractor.rb
boilerpipe-ruby-0.4.3 lib/boilerpipe/extractors/largest_content_extractor.rb
boilerpipe-ruby-0.4.2 lib/boilerpipe/extractors/largest_content_extractor.rb
boilerpipe-ruby-0.4.1 lib/boilerpipe/extractors/largest_content_extractor.rb
boilerpipe-ruby-0.4.0 lib/boilerpipe/extractors/largest_content_extractor.rb
boilerpipe-ruby-0.3.0 lib/boilerpipe/extractors/largest_content_extractor.rb