# SpiderCrawl

A Ruby gem that crawls a domain and gives you information about the pages it visits. With the help of Nokogiri, SpiderCrawl parses each page and returns its title, links, CSS, words, and much more! You can also customize what you want to do before and after each fetch request. Long story short: feed a URL to SpiderCrawl and it will crawl and scrape the content for you.

## Installation

Add this line to your application's Gemfile:

```ruby
gem 'spidercrawl'
```

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install spidercrawl

## Usage

Start crawling a domain by calling __Spiderman.shoot__(*url*), which returns a list of the pages it has crawled and scraped:

    pages = Spiderman.shoot('http://forums.hardwarezone.com.sg/hwm-magazine-publication-38/')

To restrict the crawl to URLs that match a pattern, pass a regular expression:

    pages = Spiderman.shoot('http://forums.hardwarezone.com.sg/hwm-magazine-publication-38/', :pattern => Regexp.new('^http:\/\/forums\.hardwarezone\.com\.sg\/hwm-magazine-publication-38\/?(.*\.html)?$'))

Access the following scraped data (a combined sketch appears in the Example section at the end of this README):

    pages.each do |page|
      page.url            # URL of the page
      page.scheme         # Scheme of the page (http, https, etc.)
      page.host           # Hostname of the page
      page.base_url       # Root URL of the page
      page.doc            # Nokogiri document
      page.headers        # Response headers of the page
      page.title          # Title of the page
      page.links          # Every link found in the page, returned as an array
      page.internal_links # Only internal links, returned as an array
      page.external_links # Only external links, returned as an array
      page.emails         # Every email found in the page, returned as an array
      page.images         # Every image found in the page, returned as an array
      page.words          # Every word that appears in the page, returned as an array
      page.css            # CSS stylesheets used in the page, returned as an array
      page.content        # Contents of the HTML document as a string
      page.content_type   # Content type of the page
      page.text           # Text of the page without HTML tags
      page.response_code  # HTTP response code of the page
      page.response_time  # HTTP response time of the page
      page.crawled_time   # Time the page was crawled/fetched, in milliseconds since epoch
    end

## Dependencies

+ Colorize
+ Curb
+ Nokogiri
+ Typhoeus

## Contributing

1. Fork it ( https://github.com/belsonheng/spidercrawl/fork )
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request

## License

SpiderCrawl is released under the [MIT license](https://github.com/belsonheng/spidercrawl/blob/master/LICENSE.txt).
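
## Example

Putting the pieces together, here is a short end-to-end sketch that uses only `Spiderman.shoot` and the page attributes documented in the Usage section; the URL and regular expression are simply the sample values from above, so swap in your own target domain.

```ruby
require 'spidercrawl'

# Crawl the forum, following only links that match the given pattern
pages = Spiderman.shoot('http://forums.hardwarezone.com.sg/hwm-magazine-publication-38/',
                        :pattern => Regexp.new('^http:\/\/forums\.hardwarezone\.com\.sg\/hwm-magazine-publication-38\/?(.*\.html)?$'))

# Print a one-line summary of each crawled page
pages.each do |page|
  puts "#{page.response_code} #{page.title} " \
       "(#{page.internal_links.size} internal links, #{page.words.size} words)"
end
```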