README.md in ETL-0.0.1 vs README.md in ETL-1.0.0.rc
- old
+ new
@@ -1,8 +1,8 @@
# ETL
-TODO: Write a gem description
+Extract, transform, and load data with Ruby!
## Installation
Add this line to your application's Gemfile:
@@ -14,16 +14,372 @@
Or install it yourself as:
$ gem install ETL
-## Usage
+## ETL Dependencies
-TODO: Write usage instructions here
+ETL depends on having a database connection object that __must__ respond
+to `#query`. The [mysql2](https://github.com/brianmario/mysql2) gem is a good option.
+You can also proxy another library using Ruby's `SimpleDelegator` and add a `#query`
+method if need be.
+The gem comes bundled with a default logger. If you'd like to write your own,
+make sure that it implements `#debug` and `#info`. For more information
+on what is logged and when, view the [logger details](#logger-details).
+
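The `SimpleDelegator` approach mentioned above can be sketched as follows. This is a minimal illustration, not code from the gem: `LegacyClient` and its `#execute` method are hypothetical stand-ins for a database library that lacks `#query`, and `QueryableConnection` is an assumed name.

```ruby
require "delegate"

# Hypothetical stand-in for a database library whose query method is
# named #execute rather than #query.
class LegacyClient
  def execute(sql)
    "ran: #{sql}"
  end
end

# Proxy that forwards every call to the wrapped client and adds #query,
# the one method ETL requires of its connection object.
class QueryableConnection < SimpleDelegator
  def query(sql)
    __getobj__.execute(sql)
  end
end

connection = QueryableConnection.new(LegacyClient.new)
connection.query("SELECT 1")
```

Because the proxy delegates everything else, the wrapped client's other methods remain available on `connection`.
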
+### Basic ETL
+
+Assume that we have a database connection represented by `connection`.
+
+To run a basic ETL that is composed of sequential SQL statements, start by
+creating a new ETL instance:
+
+```ruby
+etl = ETL.new(description: "a description of what this ETL does",
+ connection: connection)
+```
+which can then be configured:
+
+```ruby
+etl.config do |etl|
+ etl.ensure_destination do |etl|
+ # For most ETLs you may want to ensure that the destination exists, so the
+ # #ensure_destination block is ideally suited to fulfill this requirement.
+ #
+ # By way of example:
+ #
+ etl.query %[
+ CREATE TABLE IF NOT EXISTS some_database.some_destination_table (
+ user_id INT UNSIGNED NOT NULL,
+ created_date DATE NOT NULL,
+ total_amount INT SIGNED NOT NULL,
+ message VARCHAR(100) DEFAULT NULL,
+ PRIMARY KEY (user_id),
+ KEY (user_id, created_date),
+ KEY (created_date)
+ )]
+ end
+
+ etl.before_etl do |etl|
+ # All pre-ETL work is performed in this block.
+ #
+ # This can be thought of as a before-ETL hook that will fire only once. When
+ # you are not leveraging the ETL iteration capabilities, the value of this
+ # block vs the #etl block is not very clear. We will see how and when to
+ # leverage this block effectively when we introduce iteration.
+ #
+ # As an example, let's say we want to get rid of all entries that have an
+ # amount less than zero before moving on to our actual etl:
+ #
+ etl.query %[DELETE FROM some_database.some_source_table WHERE amount < 0]
+ end
+
+ etl.etl do |etl|
+ # Here is where the magic happens! This block contains the main ETL
+ # operation.
+ #
+ # For example:
+ #
+ etl.query %[
+ REPLACE INTO some_database.some_destination_table
+ SELECT
+ user_id
+ , DATE(created_at) AS created_date
+ , SUM(amount) AS total_amount
+ FROM
+ some_database.some_source_table sst
+ GROUP BY
+ sst.user_id
+ , DATE(sst.created_at)]
+ end
+
+ etl.after_etl do |etl|
+ # All post-ETL work is performed in this block.
+ #
+ # Again, to finish up with an example:
+ #
+ etl.query %[
+ UPDATE some_database.some_destination_table
+ SET message = "WOW"
+ WHERE total_amount > 100]
+ end
+end
+```
+
+At this point it is possible to run the ETL instance via:
+
+```ruby
+etl.run
+```
+which executes `#ensure_destination`, `#before_etl`, `#etl`, and `#after_etl` in
+that order.
+
+### ETL with iteration
+
+To add in iteration, simply supply `#start`, `#step`, and `#stop` blocks. This
+is useful when dealing with large data sets or when executing queries that,
+while optimized, are still slow.
+
+Again, to kick things off:
+
+```ruby
+etl = ETL.new(description: "a description of what this ETL does",
+ connection: connection)
+```
+
+where `connection` is the same as described above.
+
+Next we can configure the ETL:
+
+```ruby
+# assuming we have the ETL instance from above
+etl.config do |etl|
+ etl.ensure_destination do |etl|
+ # For most ETLs you may want to ensure that the destination exists, so the
+ # #ensure_destination block is ideally suited to fulfill this requirement.
+ #
+ # By way of example:
+ #
+ etl.query %[
+ CREATE TABLE IF NOT EXISTS some_database.some_destination_table (
+ user_id INT UNSIGNED NOT NULL,
+ created_date DATE NOT NULL,
+ total_amount INT SIGNED NOT NULL,
+ message VARCHAR(100) DEFAULT NULL,
+ PRIMARY KEY (user_id),
+ KEY (user_id, created_date),
+ KEY (created_date)
+ )]
+ end
+
+ etl.before_etl do |etl|
+ # All pre-ETL work is performed in this block.
+ #
+ # Now that we are leveraging iteration the #before_etl block becomes
+ # more useful as a way to execute an operation once before we begin
+ # our iteration.
+ #
+ # As an example, let's say we want to get rid of all entries that have an
+ # amount less than zero before moving on to our actual etl:
+ #
+ etl.query %[
+ DELETE FROM some_database.some_source_table
+ WHERE amount < 0]
+ end
+
+ etl.start do |etl|
+ # This defines where the ETL should start. This can be a flat number
+ # or date, or even SQL / other code can be executed to produce a starting
+ # value.
+ #
+ # Usually, this is the last known entry for the destination table with
+ # some sensible default if the destination does not yet contain data.
+ #
+ # As an example:
+ #
+ res = etl.query %[
+ SELECT COALESCE(MAX(created_date), '1970-01-01') AS the_max
+ FROM some_database.some_destination_table]
+
+ res.to_a.first['the_max']
+ end
+
+ etl.step do |etl|
+ # The step block defines the size of the iteration block. To iterate by
+ # ten records, the step block should be set to return 10.
+ #
+ # As an alternative example, to set the iteration to go 10,000 units
+ # at a time, the following value should be provided:
+ #
+ # 10_000 (Note: an underscore is used for readability)
+ #
+ # As an example, to iterate 7 days at a time (note: `7.days` is provided
+ # by ActiveSupport, not by core Ruby):
+ #
+ 7.days
+ end
+
+ etl.stop do |etl|
+ # The stop block defines when the iteration should halt.
+ # Again, this can be a flat value or code. Either way, one value *must* be
+ # returned.
+ #
+ # As a flat value:
+ #
+ # 1_000_000
+ #
+ # Or a date value:
+ #
+ # Time.now.to_date
+ #
+ # Or as a code example:
+ #
+ res = etl.query %[
+ SELECT DATE(MAX(created_at)) AS the_max
+ FROM some_database.some_source_table]
+
+ res.to_a.first['the_max']
+ end
+
+ etl.etl do |etl, lbound, ubound|
+ # The etl block is the main part of the framework. Note: there are
+ # two extra args with the iterator this time around: "lbound" and "ubound"
+ #
+ # "lbound" is the lower bound of the current iteration. When iterating
+ # from 0 to 10 and stepping by 2, the lbound would equal 2 on the
+ # second iteration.
+ #
+ # "ubound" is the upper bound of the current iteration. Continuing with the
+ # example above, when iterating from 0 to 10 and stepping by 2, the ubound would
+ # equal 4 on the second iteration.
+ #
+ # These args can be used to "window" SQL queries or other code operations.
+ #
+ # As a first example, to iterate over a set of ids:
+ #
+ # etl.query %[
+ # REPLACE INTO some_database.some_destination_table
+ # SELECT
+ # user_id
+ # , SUM(amount) AS total_amount
+ # FROM
+ # some_database.some_source_table sst
+ # WHERE
+ # sst.user_id > #{lbound} AND sst.user_id <= #{ubound}
+ # GROUP BY
+ # sst.user_id]
+ #
+ # To "window" a SQL query using dates:
+ #
+ etl.query %[
+ REPLACE INTO some_database.some_destination_table
+ SELECT
+ user_id
+ , DATE(created_at) AS created_date
+ , SUM(amount) AS total_amount
+ FROM
+ some_database.some_source_table sst
+ WHERE
+ -- Note the usage of quotes surrounding the lbound and ubound vars.
+ -- This is required when dealing with dates / datetimes
+ sst.created_at >= '#{lbound}' AND sst.created_at < '#{ubound}'
+ GROUP BY
+ sst.user_id
+ , DATE(sst.created_at)]
+
+ # Note that there is no SQL sanitization here, so there is *potential* for SQL
+ # injection. That said, you'll likely be using this gem in an internal
+ # tool, so hopefully your co-workers are not looking to sabotage your ETL
+ # pipeline. Just be aware of this and handle it as you see fit.
+ end
+
+ etl.after_etl do |etl|
+ # All post-ETL work is performed in this block.
+ #
+ # Again, to finish up with an example:
+ #
+ etl.query %[
+ UPDATE some_database.some_destination_table
+ SET message = "WOW"
+ WHERE total_amount > 100]
+ end
+end
+```
+
+At this point it is possible to run the ETL instance via:
+
+```ruby
+etl.run
+```
+which executes `#ensure_destination`, `#before_etl`, `#etl`, and `#after_etl` in
+that order.
+
+Note that `#etl` executes `#start` and `#stop` once and memoizes the result of
+each. It then iterates from the value `#start` returned up to the value `#stop`
+returned, advancing each time by whatever `#step` evaluates to.
+
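The iteration described above can be pictured with a small sketch. It is not the gem's implementation: `each_window` is a hypothetical helper, and capping the final window at the stop value is an assumption made here for illustration.

```ruby
# Hypothetical helper, not part of the gem: yields [lbound, ubound] pairs
# that walk from start to stop in increments of step.
def each_window(start, stop, step)
  return enum_for(:each_window, start, stop, step) unless block_given?
  raise ArgumentError, "step must be positive" unless step > 0

  lbound = start
  while lbound < stop
    # Assumption: the last window is capped at stop rather than overshooting.
    ubound = [lbound + step, stop].min
    yield lbound, ubound
    lbound = ubound
  end
end

each_window(0, 10, 2).each do |lbound, ubound|
  puts "window: #{lbound} to #{ubound}"
end
```

The second window here is 2 to 4, matching the `lbound` and `ubound` values given in the `#etl` block comments. The same helper works with `Date` values and an integer step, since `Date#+` advances by days.
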
+## Logger Details
+
+A logger must support two methods: `#debug` and `#info`.
+
+Both methods should accept a single hash argument. The argument will contain:
+
+- `:emitter` => a reference to the ETL instance's `self`
+- `:event_type` => a symbol that identifies the type of event being logged. You
+ can use this value to determine which other data is available
+
+When `:event_type` is equal to `:query_start`, you'll have the following
+available in the hash argument:
+
+- `:sql` => the sql that is going to be run
+
+These events are logged at the debug level.
+
+When `:event_type` is equal to `:query_complete`, you'll have the following
+available in the hash argument:
+
+- `:sql` => the sql that was run
+- `:runtime` => how long the query took to execute
+
+These events are logged at the info level.
+
+Following from this you could implement a simple logger as:
+
+```ruby
+class PutsLogger
+ def info data
+ @data = data
+ write!
+ end
+
+ def debug data
+ @data = data
+ write!
+ end
+
+private
+
+ def write!
+ case (event_type = @data.delete(:event_type))
+ when :query_start
+ output = "#{@data[:emitter].description} is about to run\n"
+ output += "#{@data[:sql]}\n"
+ when :query_complete
+ output = "#{@data[:emitter].description} executed:\n"
+ output += "#{@data[:sql]}\n"
+ output += "query completed at #{Time.now} and took #{@data[:runtime]}s\n"
+ else
+ output = "no special logging for #{event_type} event_type yet\n"
+ end
+ puts output
+ @data = nil
+ end
+end
+```
+
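As an alternative to printing with `puts`, the same two methods can forward events to Ruby's standard library `Logger`. The sketch below is an assumption-laden illustration: `StdlibLoggerAdapter` and `format_event` are hypothetical names, and it relies only on the hash keys documented above (`:event_type`, `:sql`, `:runtime`).

```ruby
require "logger"

# Hypothetical adapter, not part of the gem: forwards ETL log events to a
# standard library Logger at the matching level.
class StdlibLoggerAdapter
  def initialize(logger = Logger.new($stdout))
    @logger = logger
  end

  def debug(data)
    @logger.debug(format_event(data))
  end

  def info(data)
    @logger.info(format_event(data))
  end

  private

  # Builds a single-line message from the documented hash keys.
  def format_event(data)
    parts = ["event=#{data[:event_type]}"]
    parts << "runtime=#{data[:runtime]}s" if data.key?(:runtime)
    parts << "sql=#{data[:sql].to_s.strip}" if data.key?(:sql)
    parts.join(" ")
  end
end
```

Unlike `PutsLogger`, this adapter does not mutate the hash it receives, so the caller's data is left intact.
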
## Contributing
-1. Fork it
-2. Create your feature branch (`git checkout -b my-new-feature`)
-3. Commit your changes (`git commit -am 'Added some feature'`)
-4. Push to the branch (`git push origin my-new-feature`)
-5. Create new Pull Request
+If you would like to contribute code to ETL you can do so through GitHub by
+forking the repository and sending a pull request.
+
+When submitting code, please make every effort to follow existing conventions
+and style in order to keep the code as readable as possible.
+
+Before your code can be accepted into the project you must also sign the
+[Individual Contributor License Agreement (CLA)][1].
+
+
+ [1]: https://spreadsheets.google.com/spreadsheet/viewform?formkey=dDViT2xzUHAwRkI3X3k5Z0lQM091OGc6MQ&ndplr=1
+
+## License
+
+Copyright 2013 Square Inc.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.