Sha256: aeba94bd9e7d408b4e8dd78bd2a340dcb91bf48db66b933c4b50d3c4d8e46c57
Contents?: true
Size: 645 Bytes
Versions: 1
Compression:
Stored size: 645 Bytes
Contents
require 'nokogiri'

module Boilerpipe::SAX
  class BoilerpipeHTMLParser
    def self.parse(text)
      # script bug - delete script tags
      text.gsub!(/\<script>.+?<\/script>/i, '')

      # nokogiri uses libxml for mri and nekohtml for jruby
      # mri doesn't remove &nbsp when missing the semicolon
      text.gsub!(/(&nbsp) /, '\1; ')

      # use nokogiri to fix any bad tags, errors - keep experimenting with this
      text = Nokogiri::HTML(text).to_html

      handler = HTMLContentHandler.new
      noko_parser = Nokogiri::HTML::SAX::Parser.new(handler)
      noko_parser.parse(text)
      handler.text_document
    end
  end
end
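
The two `gsub!` calls at the top of `parse` can be tried on their own, without Nokogiri. The sketch below is illustrative and not part of the gem: `preprocess_html` is a hypothetical helper name, and it assumes the entity the second pattern targets is `&nbsp` followed by a space, which is what the comment about the missing semicolon suggests. Unlike `parse`, it works on a copy so the caller's string is not mutated.

```ruby
# Hypothetical helper (not in boilerpipe-ruby): mirrors only the two
# string clean-up steps that BoilerpipeHTMLParser.parse runs before parsing.
def preprocess_html(text)
  cleaned = text.dup # parse mutates its argument; this sketch does not

  # Drop attribute-less <script>...</script> pairs that sit on a single line.
  cleaned.gsub!(/\<script>.+?<\/script>/i, '')

  # Assumption: the entity being repaired is "&nbsp" followed by a space.
  # Append the missing semicolon so the parser can decode the entity.
  cleaned.gsub!(/(&nbsp) /, '\1; ')

  cleaned
end

sample = '<p>one&nbsp two</p><script>alert(1)</script><p>three</p>'
puts preprocess_html(sample)
# prints: <p>one&nbsp; two</p><p>three</p>
```

Note that the script pattern only matches a bare `<script>` tag whose body contains no newline, since `.` does not match newlines without the `m` flag. Tags with attributes, such as `<script type="text/javascript">`, pass through unchanged.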
Version data entries
1 entry across 1 version & 1 rubygem
Version | Path |
---|---|
boilerpipe-ruby-0.4.0 | lib/boilerpipe/sax/boilerpipe_html_parser.rb |