Sha256: ac2c1f4f5d827bbb6ef287081fc1ac7706f13e8dacb461c8cc19fccb4f439478
Contents?: true
Size: 673 Bytes
Versions: 5
Compression:
Stored size: 673 Bytes
Contents
module Boilerpipe::Extractors
  class DefaultExtractor
    def self.text(contents)
      doc = ::Boilerpipe::SAX::BoilerpipeHTMLParser.parse(contents)
      ::Boilerpipe::Extractors::DefaultExtractor.process doc
      doc.content
    end

    def self.process(doc)
      filters = ::Boilerpipe::Filters

      # merge adjacent blocks with equal text_density
      filters::SimpleBlockFusionProcessor.process doc

      # merge text blocks next to each other
      filters::BlockProximityFusion::MAX_DISTANCE_1.process doc

      # marks text blocks as content / non-content using boilerpipe alg
      filters::DensityRulesClassifier.process doc

      doc
    end
  end
end
Version data entries
5 entries across 5 versions & 1 rubygems