Ruby Readability
================

Ruby Readability is a tool for extracting the primary readable content of a
webpage. It is a Ruby port of arc90's readability project.

Build Status
------------

[Build Status](https://travis-ci.org/cantino/ruby-readability)

Install
-------

Command line:

    (sudo) gem install ruby-readability

Bundler:

    gem "ruby-readability", :require => 'readability'

Example
-------

    require 'rubygems'
    require 'readability'
    require 'open-uri'

    # URI.open comes from open-uri (Ruby 2.5+); plain `open` no longer accepts URLs on Ruby 3.
    source = URI.open('http://lab.arc90.com/experiments/readability/').read
    puts Readability::Document.new(source).content

Options
-------

You may provide options to `Readability::Document.new`, including:

* `:tags`: the base whitelist of tags to sanitize, defaults to `%w[div p]`;
* `:remove_empty_nodes`: remove `<p>` tags that have no text content; also
removes `<p>` tags that contain only images;
* `:attributes`: whitelist of allowed attributes;
* `:debug`: provide debugging output, defaults false;
* `:encoding`: if the page is of a known encoding, you can specify it; if left
unspecified, the encoding will be guessed (only in Ruby 1.9.x). If you wish
to disable guessing, supply `:do_not_guess_encoding => true`;
* `:html_headers`: in Ruby 1.9.x these will be passed to the
`guess_html_encoding` gem to aid with guessing the HTML encoding;
* `:ignore_image_format`: image formats to ignore, for use with `#images`. For example:
`:ignore_image_format => ["gif", "png"]`;
* `:min_image_height`: set a minimum image height for `#images`;
* `:min_image_width`: set a minimum image width for `#images`.
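
As a minimal sketch of how the options above might be combined, the snippet below builds an options hash from the documented keys and passes it to `Readability::Document.new`. The helper names `readability_options` and `extract_content` are hypothetical and not part of the gem, and the keyword arguments are assumptions made for illustration.

```ruby
# Hypothetical helper (not part of ruby-readability): builds an options hash
# using only option keys documented in the list above.
def readability_options(encoding: nil, guess_encoding: true, debug: false)
  opts = {
    :tags  => %w[div p], # the documented default whitelist, spelled out
    :debug => debug
  }
  # Pass the encoding through only when the caller knows it.
  opts[:encoding] = encoding if encoding
  # Turn off encoding guessing only when explicitly asked to.
  opts[:do_not_guess_encoding] = true unless guess_encoding
  opts
end

# Hypothetical wrapper: loads the ruby-readability gem only when called.
def extract_content(html, **kwargs)
  require 'readability'
  Readability::Document.new(html, readability_options(**kwargs)).content
end
```

With the gem installed, `extract_content(source, encoding: 'UTF-8')` would return the extracted content for a page whose encoding is already known; `'UTF-8'` is only an example value.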

Command Line Tool
-----------------

Readability comes with a command-line tool for experimentation in
`bin/readability`.

    Usage: readability [options] URL
        -d, --debug                      Show debug output
        -i, --images                     Keep images and links
        -h, --help                       Show this message

Images
------

You can get a list of images in the content area with `Document#images`. This
feature requires that the `fastimage` gem be installed.

    rbody = Readability::Document.new(body, :tags => %w[div p img a], :attributes => %w[src href], :remove_empty_nodes => false)
    rbody.images
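
The image-related options can be passed in the same call. The following is a rough sketch only: `image_options` and `content_images` are hypothetical helpers, not part of the gem, and any size thresholds passed to them are the caller's own choice rather than library defaults.

```ruby
# Hypothetical helper (not part of ruby-readability): options for Document#images,
# combining the tag/attribute whitelist shown above with the image filters.
def image_options(min_width:, min_height:, ignore_formats: [])
  {
    :tags                => %w[div p img a],
    :attributes          => %w[src href],
    :remove_empty_nodes  => false,
    :min_image_width     => min_width,
    :min_image_height    => min_height,
    :ignore_image_format => ignore_formats
  }
end

# Hypothetical wrapper: needs the ruby-readability and fastimage gems when called.
def content_images(html, **kwargs)
  require 'readability'
  Readability::Document.new(html, image_options(**kwargs)).images
end
```

For instance, `content_images(body, min_width: 200, min_height: 100, ignore_formats: %w[gif])` would ask for images of at least 200x100 pixels and skip GIFs; those numbers are illustrative assumptions.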

Related Projects
----------------

* [newspaper](https://github.com/codelucas/newspaper) is an advanced news extraction, article extraction, and content curation library for Python.

Potential Issues
----------------

If you're on a Mac and are getting segmentation faults, see the discussion at