README.md in elasticity-2.3.1 vs README.md in elasticity-2.4
- old
+ new
@@ -1,104 +1,118 @@
-Elasticity provides programmatic access to Amazon's Elastic Map Reduce service. The aim is to conveniently map the EMR REST API calls to higher level operations that make working with job flows more productive and more enjoyable.
+Elasticity provides programmatic access to Amazon's Elastic Map Reduce service. The aim is to conveniently abstract away the complex EMR REST API and make working with job flows more productive and more enjoyable.
[![Build Status](https://secure.travis-ci.org/rslifka/elasticity.png)](http://travis-ci.org/rslifka/elasticity) REE, 1.8.7, 1.9.2, 1.9.3
Elasticity provides two ways to access EMR:
* **Indirectly through a JobFlow-based API**. This README discusses the Elasticity API.
-* **Directly through access to the EMR REST API**. The less-discussed hidden darkside... I use this to enable the Elasticity API though it is not documented save for RubyDoc available at the the RubyGems [auto-generated documentation site](http://rubydoc.info/gems/elasticity/frames). Be forewarned: Making the calls directly requires that you understand how to structure EMR requests at the Amazon API level and from experience I can tell you there are more fun things you could be doing :) Scroll to the end for more information on the Amazon API.
+* **Directly through access to the EMR REST API**. The less-discussed hidden dark side... I use this to enable the Elasticity API. RubyDoc can be found at the RubyGems [auto-generated documentation site](http://rubydoc.info/gems/elasticity/frames). Be forewarned: making the calls directly requires that you understand how to structure EMR requests at the Amazon API level, and from experience I can tell you there are more fun things you could be doing :) Scroll to the end for more information on the Amazon API.
# Installation
-```ruby
- gem install elasticity
```
+gem install elasticity
+```
or in your Gemfile
-```ruby
- gem 'elasticity', '~> 2.0'
```
+gem 'elasticity', '~> 2.0'
+```
This will ensure that you protect yourself from API changes, which will only be made in major revisions.
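The pessimistic constraint ```~> 2.0``` is what provides that protection: it allows any 2.x release while excluding 3.0 and above. A quick sketch of that behavior using RubyGems' own requirement class (the version numbers are illustrative):

```ruby
require 'rubygems'

# '~> 2.0' means '>= 2.0 and < 3.0' - patch and minor bumps are allowed,
# but a major revision (where breaking API changes land) is not.
constraint = Gem::Requirement.new('~> 2.0')

constraint.satisfied_by?(Gem::Version.new('2.4.0'))  # allowed
constraint.satisfied_by?(Gem::Version.new('3.0.0'))  # excluded
```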
-# Kicking Off a Job
+# Roughly, What Am I Getting Myself Into?
-When using the EMR UI, there are several sample jobs that Amazon supplies. The assets for these sample jobs are hosted on S3 and publicly available meaning you can run this code as-is (supplying your AWS credentials appropriately) and ```JobFlow#run``` will return the ID of the job flow.
+If you're familiar with the AWS EMR UI, you'll recall there are sample jobs Amazon supplies to help us get familiar with EMR. Here's how you'd kick off the "Cloudburst (Custom Jar)" sample job with Elasticity. You can run this code as-is (supplying your AWS credentials and an output location) and ```JobFlow#run``` will return the ID of the job flow.
```ruby
require 'elasticity'
# Create a job flow with your AWS credentials
jobflow = Elasticity::JobFlow.new('AWS access key', 'AWS secret key')
+# Omit credentials to use the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables
+# jobflow = Elasticity::JobFlow.new
+
# This is the first step in the jobflow - running a custom jar
step = Elasticity::CustomJarStep.new('s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar')
# Here are the arguments to pass to the jar
-step.arguments = %w(s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br s3n://elasticmapreduce/samples/cloudburst/input/100k.br s3n://slif-output/cloudburst/output/2012-06-22 36 3 0 1 240 48 24 24 128 16)
+step.arguments = %w(s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br s3n://elasticmapreduce/samples/cloudburst/input/100k.br s3n://OUTPUT_BUCKET/cloudburst/output/2012-06-22 36 3 0 1 240 48 24 24 128 16)
# Add the step to the jobflow
jobflow.add_step(step)
# Let's go!
jobflow.run
```
-Note that this example is only for ```CustomJarStep```. ```PigStep``` and ```HiveStep``` will have different means of passing parameters.
+Note that this example is only for ```CustomJarStep```. Other steps will have different means of passing parameters.
# Working with Job Flows
Job flows are the center of the EMR universe. The general order of operations is:
1. Create a job flow.
1. Specify options.
1. (optional) Configure instance groups.
1. (optional) Add bootstrap actions.
- 1. Create steps.
+ 1. Add steps.
+ 1. (optional) Upload assets.
1. Run the job flow.
1. (optional) Add additional steps.
1. (optional) Shutdown the job flow.
## 1 - Create a Job Flow
Only your AWS credentials are needed.
```ruby
+# Manually specify AWS credentials
jobflow = Elasticity::JobFlow.new('AWS access key', 'AWS secret key')
+
+# Use the standard environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY)
+jobflow = Elasticity::JobFlow.new
```
If you want to access a job flow that's already running:
```ruby
+# Manually specify AWS credentials
jobflow = Elasticity::JobFlow.from_jobflow_id('AWS access key', 'AWS secret key', 'jobflow ID', 'region')
+
+# Use the standard environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY)
+jobflow = Elasticity::JobFlow.from_jobflow_id(nil, nil, 'jobflow ID', 'region')
```
This is useful if you'd like to attach to a running job flow and add more steps, etc. The ```region``` parameter is necessary because job flows are only accessible from the API when you connect to the same endpoint that created them (e.g. us-west-1). If you don't specify the ```region``` parameter, us-east-1 is assumed.
-## 2 - Specifying Job Flow Options
+## 2 - Specifying Options
Configure job flow options, shown below with their default values. Note that these defaults are subject to change - they are reasonable defaults at the time(s) I work on them (e.g. the latest version of Hadoop).
These options are sent up as part of job flow submission (i.e. ```JobFlow#run```), so be sure to configure these before running the job.
```ruby
+jobflow.name = 'Elasticity Job Flow'
+
jobflow.action_on_failure = 'TERMINATE_JOB_FLOW'
+jobflow.keep_job_flow_alive_when_no_steps = false
jobflow.ami_version = 'latest'
-jobflow.ec2_key_name = 'default'
-jobflow.ec2_subnet_id = nil
-jobflow.hadoop_version = '0.20.205'
-jobflow.keep_job_flow_alive_when_no_steps = true
+jobflow.hadoop_version = '1.0.3'
jobflow.log_uri = nil
-jobflow.name = 'Elasticity Job Flow'
+
+jobflow.ec2_key_name = nil
+jobflow.ec2_subnet_id = nil
jobflow.placement = 'us-east-1a'
jobflow.instance_count = 2
jobflow.master_instance_type = 'm1.small'
jobflow.slave_instance_type = 'm1.small'
```
-## 3 - Configuring Instance Groups (optional)
+## 3 - Configure Instance Groups (optional)
Technically this is optional since Elasticity creates MASTER and CORE instance groups for you (one m1.small instance in each). If you'd like your jobs to finish in an appreciable amount of time, you'll want to at least add a few instances to the CORE group :)
### The Easy Way™
@@ -140,26 +154,48 @@
ig.set_spot_instances(0.25) # Makes this a SPOT group with a $0.25 bid price
jobflow.set_core_instance_group(ig)
```
-## 4 - Adding Bootstrap Actions (optional)
+## 4 - Add Bootstrap Actions (optional)
Bootstrap actions are run as part of setting up the job flow, so be sure to configure these before running the job.
+### Bootstrap Actions
+
+With the basic ```BootstrapAction``` you specify everything about the action - the script, options and arguments.
+
```ruby
+action = Elasticity::BootstrapAction.new('s3n://my-bucket/my-script', '-g', '100')
+jobflow.add_bootstrap_action(action)
+```
+
+### Hadoop Bootstrap Actions
+
+`HadoopBootstrapAction` handles passing Hadoop configuration options through.
+
+```ruby
[
Elasticity::HadoopBootstrapAction.new('-m', 'mapred.map.tasks=101'),
Elasticity::HadoopBootstrapAction.new('-m', 'mapred.reduce.child.java.opts=-Xmx200m'),
Elasticity::HadoopBootstrapAction.new('-m', 'mapred.tasktracker.map.tasks.maximum=14')
].each do |action|
jobflow.add_bootstrap_action(action)
end
```
-## 5 - Adding Steps
+### Hadoop File Bootstrap Actions
+With EMR's current limit of 15 bootstrap actions, chances are you're going to create a configuration file full of your options and opt to use that instead of passing all the options individually. In that case, use the ```HadoopFileBootstrapAction```, supplying the location of your configuration file.
+
+```ruby
+action = Elasticity::HadoopFileBootstrapAction.new('s3n://my-bucket/job-config.xml')
+jobflow.add_bootstrap_action(action)
+```
+
+## 5 - Add Steps
+
Each type of step has ```#name``` and ```#action_on_failure``` fields that can be overridden. Apart from that, steps are configured differently - exhaustively described below.
### Adding a Pig Step
```ruby
@@ -233,22 +269,39 @@
jar_step.arguments = ['arg1', 'arg2']
jobflow.add_step(jar_step)
```
-## 6 - Running the Job Flow
+## 6 - Upload Assets (optional)
+This isn't part of ```JobFlow```; more of an aside :) Elasticity provides a very basic means of uploading assets to S3 so that your EMR job has access to them. For example, a TSV file with a range of valid values, join tables, etc.
+
+```ruby
+# Specify the bucket and AWS credentials
+s3 = Elasticity::SyncToS3.new('my-bucket', 'access', 'secret')
+
+# Use the standard environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY)
+# s3 = Elasticity::SyncToS3.new('my-bucket')
+
+# Recursively sync the contents of '/some/parent/dir' under the remote location 'remote-dir/this-job/assets'
+s3.sync('/some/parent/dir', 'remote-dir/this-job/assets')
+```
+
+If a file already exists remotely, its MD5 checksum is compared with the local copy's; files whose checksums match are skipped. Now you can use something like ```s3n://my-bucket/remote-dir/this-job/assets/join.tsv``` in your EMR jobs.
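The skip-if-unchanged idea can be pictured roughly like this (a standalone sketch using Ruby's ```Digest``` standard library, not Elasticity's actual implementation; ```needs_upload?``` is a hypothetical helper name):

```ruby
require 'digest'

# Hypothetical helper illustrating the idea: upload only when the local
# file's MD5 digest differs from the checksum recorded for the remote copy.
def needs_upload?(local_path, remote_md5)
  Digest::MD5.file(local_path).hexdigest != remote_md5
end
```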
+
+## 7 - Run the Job Flow
+
Submit the job flow to Amazon, storing the ID of the running job flow.
```ruby
jobflow_id = jobflow.run
```
-## 7 - Adding Additional Steps (optional)
+## 8 - Add Additional Steps (optional)
Steps can be added to a running jobflow just by calling ```#add_step``` on the job flow exactly how you add them prior to submitting the job.
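For example, reusing the classes shown earlier (the jobflow ID, jar location, and arguments here are placeholders):

```ruby
require 'elasticity'

# Attach to a job flow that's already running (credentials come from the
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables)
jobflow = Elasticity::JobFlow.from_jobflow_id(nil, nil, 'jobflow ID', 'us-east-1')

# Add a step exactly as you would before submission
step = Elasticity::CustomJarStep.new('s3n://my-bucket/my-jar.jar')
step.arguments = ['arg1', 'arg2']
jobflow.add_step(step)
```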
-## 8 - Shutting Down the Job Flow (optional)
+## 9 - Shut Down the Job Flow (optional)
By default, job flows are set to terminate when there are no more running steps. You can tell the job flow to stay alive when it has nothing left to do:
```ruby
jobflow.keep_job_flow_alive_when_no_steps = true