Sha256: a376390590c2b195726ca74cf62f2c9c5ed975dec3923b711ab0e63003edd296
Contents?: true
Size: 738 Bytes
Versions: 1
Compression:
Stored size: 738 Bytes
Contents
# A full-text extractor which extracts the largest text component of a page.
# For news articles, it may perform better than the DefaultExtractor, but
# usually worse than ArticleExtractor.

module Boilerpipe::Extractors
  class KeepEverythingWithKMinWordsExtractor
    def self.text(min, contents)
      doc = ::Boilerpipe::SAX::BoilerpipeHTMLParser.parse(contents)
      ::Boilerpipe::Extractors::KeepEverythingWithKMinWordsExtractor.process min, doc
      doc.content
    end

    def self.process(min, doc)
      ::Boilerpipe::Filters::SimpleBlockFusionProcessor.process doc
      ::Boilerpipe::Filters::MarkEverythingContentFilter.process doc
      ::Boilerpipe::Filters::MinWordsFilter.process min, doc
      doc
    end
  end
end
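The `process` method above chains three filters: block fusion, marking every block as content, and a minimum-word-count filter. As a rough illustration of the last two steps only, here is a self-contained sketch that does not use the gem. The `Block` struct and the helper method names are hypothetical stand-ins, not boilerpipe-ruby classes. The sketch assumes that the min-words step drops blocks with fewer than `min` words, and it omits block fusion entirely.

```ruby
# Hypothetical stand-in for a parsed text block; NOT the gem's own class.
Block = Struct.new(:text, :content) do
  # Count whitespace-separated words in the block.
  def num_words
    text.split.size
  end
end

# Sketch of the "mark everything as content" step: flag every block.
def mark_everything_content(blocks)
  blocks.each { |b| b.content = true }
end

# Sketch of the min-words step (assumed behavior): un-flag any content
# block that has fewer than `min` words.
def min_words_filter(min, blocks)
  blocks.each { |b| b.content = false if b.content && b.num_words < min }
end

# Run both steps over plain strings and join the surviving blocks.
def keep_everything_with_k_min_words(min, texts)
  blocks = texts.map { |t| Block.new(t, false) }
  mark_everything_content(blocks)
  min_words_filter(min, blocks)
  blocks.select(&:content).map(&:text).join("\n")
end

puts keep_everything_with_k_min_words(
  3, ["Home", "This paragraph has five words", "Log in"]
)
```

With a threshold of 3, the one-word and two-word navigation fragments are discarded and only the longer paragraph remains, which is the general idea behind keeping everything that has at least k words.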
Version data entries
1 entry across 1 version & 1 rubygem
Version | Path |
---|---|
boilerpipe-ruby-0.4.0 | lib/boilerpipe/extractors/keep_everything_with_k_min_words_extractor.rb |