Sha256: 9cf88fe7e92765a2452465e3e56f1638a38a9cd494b2559d1cba9c95500c73b2

Contents?: true

Size: 658 Bytes

Versions: 4

Compression:

Stored size: 658 Bytes

Contents

require 'nokogiri'
module Boilerpipe::SAX
  class BoilerpipeHTMLParser
    def self.parse(text)

      # script bug - delete script tags, including tags with attributes,
      # empty bodies, or bodies that span several lines
      text = text.gsub(/<script\b[^>]*>.*?<\/script>/im, '')

      # nokogiri uses libxml2 on MRI and NekoHTML on JRuby
      # on MRI, &nbsp is not handled when the semicolon is missing,
      # so add the semicolon before parsing
      text = text.gsub(/(&nbsp) /, '\1; ')

      # round-trip through nokogiri to repair bad tags and other markup
      # errors (still experimental)
      text = Nokogiri::HTML(text).to_html

      handler = HTMLContentHandler.new
      noko_parser = Nokogiri::HTML::SAX::Parser.new(handler)
      noko_parser.parse(text)
      handler.text_document
    end
  end
end
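
The two string-level cleanup steps in `parse` can be tried on their own, without Nokogiri. The following is a minimal sketch, not part of boilerpipe-ruby: the module name `PreprocessSketch` and its method names are hypothetical, and the script pattern is an assumption that also covers tags with attributes and multi-line bodies.

```ruby
# Hypothetical helper module for illustration only; these names do not
# exist in boilerpipe-ruby.
module PreprocessSketch
  # Assumed pattern: matches <script ...>...</script> case-insensitively (i)
  # and lets "." cross newlines (m), so multi-line scripts are removed too.
  SCRIPT_BLOCK = /<script\b[^>]*>.*?<\/script>/im

  # Remove whole script elements, including their contents.
  def self.strip_scripts(html)
    html.gsub(SCRIPT_BLOCK, '')
  end

  # Append the missing semicolon to "&nbsp" when it is followed by a space,
  # mirroring the substitution used in the parser above.
  def self.repair_nbsp(html)
    html.gsub(/(&nbsp) /, '\1; ')
  end

  # Apply both steps in the same order as the parser.
  def self.clean(html)
    repair_nbsp(strip_scripts(html))
  end
end
```

For example, `PreprocessSketch.clean("<p>a&nbsp b</p><script src=\"a.js\"></script>")` returns `"<p>a&nbsp; b</p>"`. Regex-based stripping is only a pre-pass here; the actual tag repair is still left to the HTML parser.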

Version data entries

4 entries across 4 versions and 1 rubygem

Version Path
boilerpipe-ruby-0.3.0 lib/boilerpipe/sax/boilerpipe_html_parser.rb
boilerpipe-ruby-0.2.0 lib/boilerpipe/sax/boilerpipe_html_parser.rb
boilerpipe-ruby-0.1.1 lib/boilerpipe/sax/boilerpipe_html_parser.rb
boilerpipe-ruby-0.1.0 lib/boilerpipe/sax/boilerpipe_html_parser.rb