Sha256: 36de2d47fb59264b52b6edb55358336ad96bf9c1e9573ef4286d8222a1ef3dd7

Contents?: true

Size: 455 Bytes

Versions: 2

Compression:

Stored size: 455 Bytes

Contents

module Boilerpipe::SAX
  class BoilerpipeHTMLParser
    def self.parse(text)
      # strip out tags that cause issues
      text = Preprocessor.strip(text)

      # round-trip through Nokogiri's DOM parser to repair malformed markup
      # (unclosed or mis-nested tags) before SAX parsing; still experimental
      text = Nokogiri::HTML(text).to_html
      handler = HTMLContentHandler.new
      noko_parser = Nokogiri::HTML::SAX::Parser.new(handler)
      noko_parser.parse(text)
      handler.text_document
    end
  end
end

Version data entries

2 entries across 2 versions & 1 rubygem

Version                Path
boilerpipe-ruby-0.5.0  lib/boilerpipe/sax/boilerpipe_html_parser.rb
boilerpipe-ruby-0.4.4  lib/boilerpipe/sax/boilerpipe_html_parser.rb