MetaInspector is a gem for web scraping purposes. You give it a URL, and it lets you easily get its title, links, and meta tags.

= Installation

Install the gem from RubyGems:

  gem install metainspector

This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.

= Usage

Initialize a scraper instance for a URL, like this:

  page = MetaInspector::Scraper.new('http://w3clove.com')

or, for short, a convenience alias is also available:

  page = MetaInspector.new('http://w3clove.com')

If you don't include the scheme on the URL, http:// will be used by default:

  page = MetaInspector.new('w3clove.com')

By default, MetaInspector times out after 20 seconds of waiting for a page to respond. You can set a different timeout with a second parameter, like this:

  page = MetaInspector.new('w3clove.com', 5) # this would wait just 5 seconds to timeout

Then you can see the scraped data like this:

  page.url              # URL of the page
  page.scheme           # Scheme of the page (http, https)
  page.host             # Hostname of the page (like w3clove.com, without the scheme)
  page.root_url         # Root URL (scheme + host, like http://w3clove.com/)
  page.title            # Title of the page, as a string
  page.links            # Array of strings, with every link found on the page as an absolute URL
  page.meta_description # Meta description, as a string
  page.description      # Returns the meta description, or the first long paragraph if no meta description is found
  page.meta_keywords    # Meta keywords, as a string
  page.image            # Most relevant image, if defined with og:image
  page.images           # Array of strings, with every img found on the page as an absolute URL
  page.feed             # RSS or Atom feed links found in the meta data, as an array
  page.meta_og_title    # OpenGraph title
  page.meta_og_image    # OpenGraph image

MetaInspector uses dynamic methods for meta tag discovery, so all of these will work: each is converted to a search for a meta tag with the corresponding name, returning its content attribute:

  page.meta_description # <meta name="description" content="..." />
  page.meta_keywords    # <meta name="keywords" content="..." />
  page.meta_robots      # <meta name="robots" content="..." />
  page.meta_generator   # <meta name="generator" content="..." />

It will also work for meta tags of the form <meta http-equiv="name" content="value" />, like the following:

  page.meta_content_language # <meta http-equiv="content-language" content="..." />
  page.meta_Content_Type     # <meta http-equiv="Content-Type" content="..." />

Please notice that MetaInspector is case-sensitive, so page.meta_Content_Type is not the same as page.meta_content_type.

You can also access most of the scraped data as a hash:

  page.to_hash # { "url" => "http://w3clove.com", "title" => "W3CLove :: site-wide markup validation tool", ... }

The full scraped document is accessible from:

  page.document        # Raw HTML document, as a string
  page.parsed_document # Nokogiri doc that you can use to get any element from the page

= Error handling

You can check if the page has been successfully parsed with:

  page.parsed? # Will return true if everything looks OK

In case there have been any errors, you can check them with:

  page.errors # Will return an array with the error messages

= Examples

You can find some sample scripts in the samples folder, including a basic scraping script and a spider that follows external links using a queue. What follows is an example of use from irb:

  $ irb
  >> require 'metainspector'
  => true

  >> page = MetaInspector.new('http://w3clove.com')
  => #<MetaInspector::Scraper:0x... @url="http://w3clove.com">

  >> page.title
  => "W3CLove :: site-wide markup validation tool"

  >> page.meta_description
  => "Site-wide markup validation tool. Validate the markup of your whole site with just one click."

  >> page.meta_keywords
  => "html, markup, validation, validator, tool, w3c, development, standards, free"

  >> page.links.size
  => 15

  >> page.links[4]
  => "/plans-and-pricing"

  >> page.document.class
  => String

  >> page.parsed_document.class
  => Nokogiri::HTML::Document
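For a slightly bigger illustration, here is a minimal sketch of a queue-based spider in the spirit of the one in the samples folder. This is an illustrative sketch, not the actual sample script: it uses only the API documented above (MetaInspector.new, parsed?, errors, title, links) and assumes page.links returns absolute URLs, as described in the Usage section.

  require 'rubygems'
  require 'metainspector'

  # Start from a seed URL and crawl up to a fixed number of pages.
  queue   = ['http://w3clove.com']
  visited = []

  while (url = queue.shift) && visited.size < 10
    next if visited.include?(url)
    visited << url

    page = MetaInspector.new(url, 5) # 5-second timeout, as shown above

    unless page.parsed?
      puts "Could not parse #{url}: #{page.errors.join(', ')}"
      next
    end

    puts "#{page.title} (#{page.links.size} links)"

    # Enqueue the links found on this page for later visits
    queue.concat(page.links)
  end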
= ZOMG Fork! Thank you!

You're welcome to fork this project and send pull requests. I want to especially thank:

* Ryan Romanchuk https://github.com/rromanchuk
* Edmund Haselwanter https://github.com/ehaselwanter
* Jonathan Hernández https://github.com/ionmx
* Oriol Gual https://github.com/oriolgual

= To Do

* Get page.base_dir from the URL
* Distinguish between external and internal links, returning page.links for all of them as found, and page.external_links and page.internal_links converted to absolute URLs
* If keywords seem to be separated by blank spaces, replace them with commas
* Check the content type and process only HTML pages; don't try to scrape TAR files like http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2 or video files like http://isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
* Autodiscover all available meta tags

Copyright (c) 2009-2011 Jaime Iniesta, released under the MIT license