MetaInspector is a gem for web scraping purposes. You give it a URL, and it lets you easily get its title, links, and meta tags.
= Installation
Install the gem from RubyGems:
gem install metainspector
This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.
= Usage
Initialize a scraper instance for a URL, like this:
page = MetaInspector::Scraper.new('http://w3clove.com')
or, for short, use the convenience alias:
page = MetaInspector.new('http://w3clove.com')
If you don't include the scheme in the URL, http:// will be used
by default:
page = MetaInspector.new('w3clove.com')
By default, MetaInspector times out after 20 seconds of waiting for a page to respond.
You can set a different timeout with a second parameter, like this:
page = MetaInspector.new('w3clove.com', 5) # this would wait just 5 seconds before timing out
Then you can see the scraped data like this:
page.url              # URL of the page
page.scheme           # Scheme of the page (http, https)
page.host             # Hostname of the page (like w3clove.com, without the scheme)
page.root_url         # Root URL (scheme + host, like http://w3clove.com/)
page.title            # Title of the page, as a string
page.links            # Array of strings, with every link found on the page, as found (so some can be relative)
page.meta_description # Meta description, as a string
page.description      # Meta description, or the first long paragraph if no meta description is found
page.meta_keywords    # Meta keywords, as a string
page.image            # Most relevant image, if defined with og:image
page.images           # Array of strings, with every img found on the page as an absolute URL
page.feed             # RSS or Atom feed links found in the meta data, as an array
page.meta_og_title    # OpenGraph title
page.meta_og_image    # OpenGraph image
MetaInspector uses dynamic methods for meta tag discovery: a call like page.meta_foo is converted into a search for a meta tag named "foo", returning its content attribute. So all of these will work:
page.meta_description # <meta name="description" content="..." />
page.meta_keywords    # <meta name="keywords" content="..." />
page.meta_robots      # <meta name="robots" content="..." />
page.meta_generator   # <meta name="generator" content="..." />
It will also work for meta tags of the form <meta http-equiv="name" ... />, like the following:
page.meta_content_language # <meta http-equiv="content-language" ... />
page.meta_Content_Type     # <meta http-equiv="Content-Type" ... />
Please note that MetaInspector is case sensitive, so page.meta_Content_Type is not the same as page.meta_content_type.
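For the curious, this dynamic lookup can be pictured as a method_missing hook that turns the method name into a meta tag search. The following is a minimal illustrative sketch (TinyScraper is a made-up class, not MetaInspector's actual implementation), assuming a Nokogiri-parsed document:

  require 'nokogiri'

  class TinyScraper
    def initialize(html)
      @doc = Nokogiri::HTML(html)
    end

    # meta_description  => looks up <meta name="description" ...>
    # meta_Content_Type => looks up <meta http-equiv="Content-Type" ...>
    def method_missing(name, *args)
      return super unless name.to_s =~ /\Ameta_(.+)/
      key = $1
      tag = @doc.at_xpath("//meta[@name='#{key}']") ||
            @doc.at_xpath("//meta[@http-equiv='#{key}']")
      tag && tag['content']
    end
  end

  page = TinyScraper.new('<html><head><meta name="description" content="Hello" /></head></html>')
  page.meta_description # => "Hello"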
You can also access most of the scraped data as a hash:
page.to_hash # { "url"=>"http://w3clove.com", "title" => "W3CLove :: site-wide markup validation tool", ... }
The full scraped document is accessible from:
page.document        # Raw HTML of the page, as a string
page.parsed_document # Nokogiri::HTML::Document that you can use to get any element from the page
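For example, you can run any Nokogiri query on the parsed document to get elements MetaInspector doesn't expose directly; the h1 selector below is just an illustration:

  page = MetaInspector.new('http://w3clove.com')
  page.parsed_document.css('h1').map(&:text) # text of every <h1> on the page
  page.parsed_document.at_css('title').text  # the raw <title> text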
= Error handling
You can check if the page has been successfully parsed with:
page.parsed? # Will return true if everything looks OK
In case there have been any errors, you can check them with:
page.errors # Will return an array with the error messages
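Put together, a defensive scraping script could look like this sketch:

  page = MetaInspector.new('http://w3clove.com')
  if page.parsed?
    puts page.title
  else
    page.errors.each { |error| puts error }
  end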
= Examples
You can find some sample scripts in the samples folder, including a basic scraper and a spider that follows external links using a queue. What follows is an example session in irb:
$ irb
>> require 'metainspector'
=> true
>> page = MetaInspector.new('http://w3clove.com')
=> #<MetaInspector::Scraper:0x... @url="http://w3clove.com">
>> page.title
=> "W3CLove :: site-wide markup validation tool"
>> page.meta_description
=> "Site-wide markup validation tool. Validate the markup of your whole site with just one click."
>> page.meta_keywords
=> "html, markup, validation, validator, tool, w3c, development, standards, free"
>> page.links.size
=> 15
>> page.links[4]
=> "/plans-and-pricing"
>> page.document.class
=> String
>> page.parsed_document.class
=> Nokogiri::HTML::Document
= ZOMG Fork! Thank you!
You're welcome to fork this project and send pull requests. I'd especially like to thank:
* Ryan Romanchuk https://github.com/rromanchuk
* Edmund Haselwanter https://github.com/ehaselwanter
* Jonathan Hernández https://github.com/ionmx
* Oriol Gual https://github.com/oriolgual
= To Do
* Get page.base_dir from the URL
* Distinguish between external and internal links, returning page.links for all of them as found, and page.external_links and page.internal_links converted to absolute URLs (see the sketch after this list)
* If keywords seem to be separated by blank spaces, replace them with commas
* Check content type, process only HTML pages, don't try to scrape TAR files like http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2 or video files like http://isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
* Autodiscover all available meta tags
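As an illustration of the internal/external distinction mentioned above, one possible sketch (this partitioning is hypothetical, not part of MetaInspector's current API):

  require 'uri'

  # Hypothetical sketch: treat a link as internal if it has no host
  # (a relative link) or its host matches the page's host.
  internal, external = page.links.partition do |link|
    host = (URI.parse(link).host rescue nil)
    host.nil? || host == page.host
  end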
Copyright (c) 2009-2011 Jaime Iniesta, released under the MIT license