# Rodimus
[![Gem Version](https://badge.fury.io/rb/rodimus.svg)](http://badge.fury.io/rb/rodimus) [![Build Status](https://travis-ci.org/nevern02/rodimus.svg?branch=master)](https://travis-ci.org/nevern02/rodimus)

ETL stands for Extract-Transform-Load. Sometimes, you have data in Source A
that needs to be moved to Destination B.  Along the way, it needs to be
manipulated in some way.  This is a common scenario when working with a data
warehouse.  There are lots of ETL solutions in the wild, but very few of them
are open source.  None of them (that I know of) are written in Ruby.  So, I
started hacking on one for my own use.

__Why the name?__ Rodimus Prime is one of the leaders of the Autobots, and he
has a cool name.  Naming a data transformation library after a Transformer
increases the coolness factor.  It's science.

## Installation

Add this line to your application's Gemfile:

    gem 'rodimus'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install rodimus

## Usage

tl;dr: See the examples directory for the quickest path to success.

```ruby
require 'rodimus'
require 'csv'
require 'json'

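# First step: reads rows from the CSV source.
class CsvInput < Rodimus::Step
  # Hook that runs before the step starts; points the step's input at the file.
  def before_run_set_incoming
    @incoming = CSV.open('examples/worldbank-sample.csv')
    @incoming.readline # skip the headers
  end

  # Serialize each parsed CSV row (an array) as JSON for the next step.
  def process_row(row)
    row.to_json
  end
end

# Last step: formats each row as a sentence and writes it to standard out.
class FormattedText < Rodimus::Step
  # Hook that runs before the step starts; redirects the step's output.
  def before_run_set_stdout
    @outgoing = STDOUT.dup
  end

  def process_row(row)
    data = JSON.parse(row)
    "In #{data.first} during #{data[1]}, CO2 emissions were #{data[2]} metric tons per capita."
  end
end

# Build the transformation, add the steps in order, and run it.
t = Rodimus::Transformation.new
s1 = CsvInput.new
s2 = FormattedText.new
t.steps << s1
t.steps << s2
t.run
puts "Transformation complete!"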
```

A transformation is an operation that consists of one or more steps.  Each
step may manipulate the data in some way.  Typically, the first step is
reserved for reading from your data source, and the last step is used to write
to the new destination.

In Rodimus, you create a transformation object and then add one or more steps
to its array of steps.  You typically create steps by writing your own classes
that inherit from `Rodimus::Step`.  When the transformation is run, a new
process is forked for each step.  On platforms that support native threads
(JRuby, Rubinius), threads are used instead of forked processes.  Adjacent
steps are connected with pipes; the first step has no incoming pipe and the
last step has no outgoing pipe, since those are the source and destination
steps.  Each step consumes rows of data from its incoming pipe, performs some
operation on each row, and writes the result to its outgoing pipe.
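
To make the process-and-pipe model above more concrete, here is a minimal
sketch that uses only Ruby's standard `IO.pipe` and `fork`.  It is not Rodimus's
actual implementation: `run_pipeline` and the lambda-based steps are
hypothetical names made up for illustration, and the sketch assumes a platform
where `fork` is available (e.g. MRI on Linux or macOS) and a small data set.

```ruby
# Hypothetical sketch of "one forked process per step, joined by pipes".
# This is NOT Rodimus code; run_pipeline is an illustrative name.
def run_pipeline(source_lines, steps)
  return source_lines if steps.empty?

  upstream = nil # the first step reads the source directly, not a pipe
  pids = []

  steps.each do |step|
    down_r, down_w = IO.pipe

    pids << fork do
      down_r.close # the child only writes to its outgoing pipe
      input = upstream ? upstream.each_line : source_lines.each
      input.each { |row| down_w.puts(step.call(row.chomp)) }
      down_w.close # signals end-of-data to the next step
      upstream&.close
      exit!(0) # leave the child without running the parent's at_exit hooks
    end

    # The parent must close its copies, or downstream readers never see EOF.
    upstream&.close
    down_w.close
    upstream = down_r
  end

  # Here the parent stands in for the destination and collects the output.
  output = upstream.read
  upstream.close
  pids.each { |pid| Process.wait(pid) }
  output.lines.map(&:chomp)
end

upcase  = ->(row) { row.upcase }
exclaim = ->(row) { "#{row}!" }
result  = run_pipeline(%w[a b], [upcase, exclaim])
# result == ["A!", "B!"]
```

The detail that is easy to get wrong is closing unused pipe ends in both the
parent and the children: a reader only sees end-of-file once every copy of the
matching write end has been closed.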

There are several methods on the `Rodimus::Step` class that can be overridden
for custom processing behavior before, during, or after each row is handled.
If those aren't enough, you're also free to manipulate the input/output
objects (e.g. to redirect to standard out).
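
The overridable-method idea can be sketched without Rodimus at all.  The
example below is a hypothetical stand-in, not the real `Rodimus::Step`: the
`SketchStep` class, its `call_hooks` helper, and the `after_run_` prefix are
assumptions made for illustration.  Only `process_row` and the `before_run_`
naming are taken from the usage example above.

```ruby
require 'stringio'

# Hypothetical stand-in for a step base class; NOT Rodimus::Step itself.
class SketchStep
  attr_accessor :incoming, :outgoing

  def run
    call_hooks('before_run_')
    incoming.each_line do |line|
      result = process_row(line.chomp)
      outgoing.puts(result) unless result.nil?
    end
    call_hooks('after_run_') # assumed prefix, for illustration only
  end

  # Default behavior: pass the row through unchanged.
  def process_row(row)
    row
  end

  private

  # Call every public method whose name starts with the given prefix.
  def call_hooks(prefix)
    hooks = methods.select { |name| name.to_s.start_with?(prefix) }
    hooks.sort.each { |name| send(name) }
  end
end

# A custom step overrides only the pieces it cares about.
class Shout < SketchStep
  def before_run_write_header
    outgoing.puts('BEGIN')
  end

  def process_row(row)
    row.upcase
  end

  def after_run_write_footer
    outgoing.puts('END')
  end
end

step = Shout.new
step.incoming = StringIO.new("one\ntwo\n")
step.outgoing = StringIO.new # swap in any IO, e.g. STDOUT.dup
step.run
step.outgoing.string # => "BEGIN\nONE\nTWO\nEND\n"
```

Because the input and output are plain IO-like objects, redirecting a step is
just a matter of assigning a different object before the step runs.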

The Rodimus approach is to provide a minimal, flexible framework upon which
custom ETL solutions can be built.  ETL is complex, and there tend to be many
subtle differences between projects that make things like establishing
conventions and encouraging code reuse difficult.  Rodimus is an attempt to
codify those things that are probably useful to a majority of ETL projects,
with as little overhead as possible.

If you'd like to know the thought process behind Rodimus, check out [this 
blog post](http://www.blrice.net/blog/2014/06/03/etl-with-ruby-and-rodimus/).

## Contributing

1. Fork it ( http://github.com/nevern02/rodimus/fork )
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request