# Elasticrawl Command line tool for launching Hadoop jobs using AWS EMR (Elastic MapReduce) to process Common Crawl data. Elasticrawl can be used with [crawl data](http://commoncrawl.org/the-data/get-started/) from April 2014 onwards. | Crawl Name | Month | Web Pages | Segments | -------------- |:---------:|:----------:|:-------: | [CC-MAIN-2015-06](http://blog.commoncrawl.org/2015/03/january-2015-crawl-archive-available/) | January 2015 | ~ 1.82 billion | 98 | [CC-MAIN-2014-52](http://blog.commoncrawl.org/2015/01/december-2014-crawl-archive-available/) | December 2014 | ~ 2.08 billion | 314 | [CC-MAIN-2014-49](http://blog.commoncrawl.org/2014/12/november-2014-crawl-archive-available/) | November 2014 | ~ 1.95 billion | 136 | [CC-MAIN-2014-35](http://blog.commoncrawl.org/2014/09/august-2014-crawl-data-available/) | August 2014 | ~ 2.8 billion | 111 | [CC-MAIN-2014-23](http://blog.commoncrawl.org/2014/08/july-2014-crawl-data-available/) | July 2014 | ~ 3.6 billion | 253 | [CC-MAIN-2014-15](http://blog.commoncrawl.org/2014/07/april-2014-crawl-data-available/) | April 2014 | ~ 2.3 billion | 70 Common Crawl announce new crawls on their [blog](http://blog.commoncrawl.org/). Ships with a default configuration that launches the [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs. This is an implementation of the standard Hadoop Word Count example. This [blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) has a walkthrough of running the example jobs on the November 2014 crawl. ## Installation Deployment packages are available for Linux and OS X, unfortunately Windows isn't supported yet. Download the package, extract it and run the elasticrawl command from the package directory. ```bash # OS X https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.3-osx.tar.gz # Linux (64-bit) https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.3-linux-x86_64.tar.gz # Linux (32-bit) https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.3-linux-x86.tar.gz # e.g. curl -O https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.3-osx.tar.gz tar -xzf elasticrawl-1.1.3-osx.tar.gz cd elasticrawl-1.1.3-osx/ ./elasticrawl --help ``` ## Commands ### elasticrawl init The init command takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created and will store your data and logs. ```bash ~$ ./elasticrawl init your-s3-bucket Enter AWS Access Key ID: ************ Enter AWS Secret Access Key: ************ ... Bucket s3://elasticrawl-test created Config dir /Users/ross/.elasticrawl created Config complete ``` ### elasticrawl parse The parse command takes in the crawl name and an optional number of segments and files to parse. ```bash ~$ ./elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3 Segments Segment: 1416400372202.67 Files: 150 Segment: 1416400372490.23 Files: 124 Job configuration Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment Cluster configuration Master: 1 m1.medium (Spot: 0.12) Core: 2 m1.medium (Spot: 0.12) Task: -- Launch job? (y/n) y Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB ``` ### elasticrawl combine The combine command takes in the results of previous parse jobs and produces a combined set of results. ```bash ~$ ./elasticrawl combine --input-jobs 1420124830792 Job configuration Combining: 2 segments Cluster configuration Master: 1 m1.medium (Spot: 0.12) Core: 2 m1.medium (Spot: 0.12) Task: -- Launch job? (y/n) y Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL ``` ### elasticrawl status The status command shows crawls and your job history. ```bash ~$ ./elasticrawl status Crawl Status CC-MAIN-2014-49 Segments: to parse 134, parsed 2, total 136 Job History (last 10) 1420124830792 Launched: 2015-01-01 15:07:10 Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment ``` ### elasticrawl reset The reset comment resets a crawl so it is parsed again. ```bash ~$ ./elasticrawl reset CC-MAIN-2014-49 Reset crawl? (y/n) y CC-MAIN-2014-49 Segments: to parse 136, parsed 0, total 136 ``` ### elasticrawl destroy The destroy command deletes your S3 bucket and the ~/.elasticrawl directory. ```bash ~$ ./elasticrawl destroy WARNING: Bucket s3://elasticrawl-test and its data will be deleted Config dir /home/vagrant/.elasticrawl will be deleted Delete? (y/n) y Bucket s3://elasticrawl-test deleted Config dir /home/vagrant/.elasticrawl deleted Config deleted ``` ## Configuring Elasticrawl The elasticrawl init command creates the ~/elasticrawl/ directory which contains * [aws.yml](https://github.com/rossf7/.elasticrawl/blob/master/templates/aws.yml) - stores your AWS access credentials. Or you can set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY * [cluster.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/cluster.yml) - configures the EC2 instances that are launched to form your EMR cluster * [jobs.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/jobs.yml) - stores your S3 bucket name and the config for the parse and combine jobs ## Development Elasticrawl is developed in Ruby and requires Ruby 2.0.0 or later (Ruby 2.1 is recommended). The sqlite3 and nokogiri gems have C extensions which mean you may need to install development headers. [![Gem Version](https://badge.fury.io/rb/elasticrawl.png)](http://badge.fury.io/rb/elasticrawl) [![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl) [![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 2.0.0, 2.1.5, 2.2.0 The deployment packages are created using [Traveling Ruby](http://phusion.github.io/traveling-ruby/). The deploy packages contain a Ruby 2.1 interpreter, Gems and the compiled C extensions. The [traveling-elasticrawl](https://github.com/rossf7/traveling-elasticrawl) repository has a Rake task that automates building the deployment packages. ## TODO * Add support for Streaming and Pig jobs ## Thanks * Thanks to everyone at Common Crawl for making this awesome dataset available! * Thanks to Robert Slifka for the [elasticity](https://github.com/rslifka/elasticity) gem which provides a nice Ruby wrapper for the EMR REST API. * Thanks to Phusion for creating Traveling Ruby. ## Contributing 1. Fork it 2. Create your feature branch (`git checkout -b my-new-feature`) 3. Commit your changes (`git commit -am 'Add some feature'`) 4. Push to the branch (`git push origin my-new-feature`) 5. Create new Pull Request ## License This code is licensed under the MIT license.