Sha256: 59c8229341062711050b4b4e6ebe7c314e565a77f0f797192471604b2169e66e

Contents?: true

Size: 674 Bytes

Versions: 3

Stored size: 674 Bytes

Contents

module Boilerpipe::Extractors
  class DefaultExtractor

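    # Parses the HTML in +contents+, runs the filter pipeline over the
    # resulting document, and returns the extracted text.
    def self.text(contents)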
      doc = ::Boilerpipe::SAX::BoilerpipeHTMLParser.parse(contents)
      ::Boilerpipe::Extractors::DefaultExtractor.process doc
      doc.content
    end

    def self.process(doc)
      filters = ::Boilerpipe::Filters
      # merge adjacent blocks with equal text_density
      filters::SimpleBlockFusionProcessor.process doc

      # merge text blocks that are at most one block apart
      filters::BlockProximityFusion::MAX_DISTANCE_1.process doc

      # mark text blocks as content / non-content using the boilerpipe algorithm
      filters::DensityRulesClassifier.process doc

      doc
    end
  end
end
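
The extractor above follows a simple pattern: every filter exposes `process(doc)`, mutates the document in place, and the extractor calls the filters in a fixed order before reading `doc.content`. The sketch below illustrates only that pattern. `Block`, `Doc`, `FuseEqualDensity`, `ThresholdClassifier`, and the threshold value are hypothetical stand-ins invented for this example; they are not the boilerpipe-ruby classes and do not reproduce its actual rules.

```ruby
# Hypothetical stand-ins for illustration; not the boilerpipe-ruby classes.
Block = Struct.new(:text, :density, :is_content)

class Doc
  attr_accessor :blocks

  def initialize(blocks)
    @blocks = blocks
  end

  # Join the text of every block that a classifier marked as content.
  def content
    blocks.select(&:is_content).map(&:text).join("\n")
  end
end

# Merges neighbouring blocks whose density values are equal.
module FuseEqualDensity
  def self.process(doc)
    doc.blocks = doc.blocks.each_with_object([]) do |block, merged|
      last = merged.last
      if last && last.density == block.density
        last.text = "#{last.text} #{block.text}"
      else
        merged << block.dup
      end
    end
    doc
  end
end

# Marks a block as content when its density exceeds an assumed threshold.
module ThresholdClassifier
  THRESHOLD = 5.0 # made-up value for this sketch

  def self.process(doc)
    doc.blocks.each { |block| block.is_content = block.density > THRESHOLD }
    doc
  end
end

# Filters run in order, each one mutating the same document.
PIPELINE = [FuseEqualDensity, ThresholdClassifier].freeze

def extract_text(blocks)
  doc = Doc.new(blocks)
  PIPELINE.each { |filter| filter.process(doc) }
  doc.content
end

blocks = [
  Block.new("Home | About", 2.0, false),
  Block.new("First paragraph.", 9.0, false),
  Block.new("Second paragraph.", 9.0, false)
]
puts extract_text(blocks)
```

Because each filter shares the same `process(doc)` interface, reordering or swapping steps only requires changing the list of filters, which is the same property the extractor relies on when it calls its three filters in sequence.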

Version data entries

3 entries across 3 versions and 1 rubygem

Version Path
boilerpipe-ruby-0.4.0 lib/boilerpipe/extractors/default_extractor.rb
boilerpipe-ruby-0.3.0 lib/boilerpipe/extractors/default_extractor.rb
boilerpipe-ruby-0.2.0 lib/boilerpipe/extractors/default_extractor.rb