Sha256: 36de2d47fb59264b52b6edb55358336ad96bf9c1e9573ef4286d8222a1ef3dd7
Contents?: true
Size: 455 Bytes
Versions: 2
Compression:
Stored size: 455 Bytes
Contents
module Boilerpipe::SAX
  class BoilerpipeHTMLParser
    def self.parse(text)
      # strip out tags that cause issues
      text = Preprocessor.strip(text)

      # use nokogiri to fix any bad tags, errors - keep experimenting with this
      text = Nokogiri::HTML(text).to_html

      handler = HTMLContentHandler.new
      noko_parser = Nokogiri::HTML::SAX::Parser.new(handler)
      noko_parser.parse(text)
      handler.text_document
    end
  end
end
Version data entries
2 entries across 2 versions & 1 rubygems
Version | Path |
---|---|
boilerpipe-ruby-0.5.0 | lib/boilerpipe/sax/boilerpipe_html_parser.rb |
boilerpipe-ruby-0.4.4 | lib/boilerpipe/sax/boilerpipe_html_parser.rb |