README.md in wukong-load-0.0.2 vs README.md in wukong-load-0.1.0

- old
+ new

@@ -1,9 +1,9 @@
 # Wukong-Load

 This Wukong plugin makes it easy to load data from the command-line
-into various.
+into various data stores.

 It is assumed that you will independently deploy and configure each
 data store yourself (but see [Ironfan](http://github.com/infochimps-labs/ironfan)).

 Once you've done that, and once you've written some dataflows with
@@ -17,11 +17,11 @@
 ## Installation & Setup

 Wukong-Load can be installed as a RubyGem:

 ```
-$ sudo gem install wukong-hadoop
+$ sudo gem install wukong-load
 ```

 ## Usage

 Wukong-Load provides a command-line program `wu-load` you can use to
@@ -37,58 +37,39 @@
 $ wu-load store_name --help
 ```

 Further details will depend on the data store you're writing to.

-### Elasticsearch Usage
+### Expected Input

+All input to `wu-load` should be newline-separated, JSON-formatted,
+hash-like records. For some data stores, keys in the record may be
+interpreted as metadata about the record or about how to route the
+record within the data store.
+
+## Elasticsearch Usage
+
 Lets you load JSON-formatted records into an
 [Elasticsearch](http://www.elasticsearch.org) database. See full
 options with

 ```
 $ wu-load elasticsearch --help
 ```

-#### Expected Input
+### Connecting

-All input to `wu-load` should be newline-separated, JSON-formatted,
-hash-like record. Some keys in the record will be interpreted as
-metadata about the record or about how to route the record within the
-database but the entire record will be written to the database
-unmodified.
+`wu-load` tries to connect to an Elasticsearch server at a default
+host (localhost) and port (9200). You can change these:

-A (pretty-printed for clarity -- the real record shouldn't contain
-newlines) record like
-
-```json
-{
-  "_index": "publications"
-  "_type": "book",
-  "ISBN": "0553573403",
-  "title": "A Game of Thrones",
-  "author": "George R. R. Martin",
-  "description": "The first of half a hundred novels to come out since...",
-  ...
-}
 ```
-
-might use the `_index` and `_type` fields as metadata but the
-**whole** record will be written.
-
-#### Connecting
-
-`wu-load` has a default host (localhost) and port (9200) it tries to
-connect to but you can change these:
-
-```
 $ cat data.json | wu-load elasticsearch --host=10.122.123.124 --port=80
 ```

 All queries will be sent to this address.

-#### Routing
+### Routing

 Elasticsearch stores data in several *indices* which each contain
 *documents* of various *types*.

 `wu-load` loads each document into default index (`wukong`) and type
@@ -96,16 +77,101 @@

 ```
 $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=publication --es_type=book
 ```

-##### Creates vs. Updates
+A record with an `_index` or `_es_type` field will override these
+default settings. You can change the names of the fields used.

+### Creates vs. Updates
+
 If an input document contains a value for the field `_id` then that
 value will be as the ID of the record when written, possibly
 overwriting a record that already exists -- an update.

 You can change the field you use for the Elasticsearch ID property:

 ```
 $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=media --es_type=books --id_field="ISBN"
+```
+
+## Kafka Usage
+
+Lets you load JSON-formatted records into a
+[Kafka](http://kafka.apache.org/) queue. See full options with
+
+```
+$ wu-load kafka --help
+```
+
+### Connecting
+
+`wu-load` tries to connect to a Kafka broker at a default host
+(localhost) and a port (9092). You can change these:
+
+```
+$ cat data.json | wu-load kafka --host=10.122.123.124 --port=1234
+```
+
+All records will be sent to this address.
+
+### Routing
+
+Kafka stores data in several named *queues*. Each queue can have
+several numbered *partitions*.
+
+`wu-load` loads each record into the default queue (`test`) and
+partition (0), but you can change these:
+
+```
+$ cat data.json | wu-load kafka --host=10.123.123.123 --topic=messages --partition=6
+```
+
+A record with a `_topic` or `_partition` field will override these
+default settings. You can change the names of the fields used.
+
+## MongoDB Usage
+
+Lets you load JSON-formatted records into an
+[MongoDB](http://www.mongodb.org) database. See full options with
+
+```
+$ wu-load mongodb --help
+```
+
+### Connecting
+
+`wu-load` tries to connect to an MongoDB server at a default host
+(localhost) and port (27017). You can change these:
+
+```
+$ cat data.json | wu-load mongodb --host=10.122.123.124 --port=1234
+```
+
+All queries will be sent to this address.
+
+### Routing
+
+MongoDB stores *documents* in several *databases* which each contain
+*collections*.
+
+`wu-load` loads each document into default database (`wukong`) and
+collection (`streaming_record`), but you can change these:
+
+```
+$ cat data.json | wu-load mongodb --host=10.123.123.123 --database=publication --collection=book
+```
+
+A record with a `_database` or `_collection` field will override these
+default settings. You can change the names of the fields used.
+
+### Creates vs. Updates
+
+If an input document contains a value for the field `_id` then that
+value will be as the ID of the record when written, possibly
+overwriting a record that already exists -- an update.
+
+You can change the field you use for the MongoDB ID property:
+
+```
+$ cat data.json | wu-load mongodb --host=10.123.123.123 --database=media --collection=books --id_field="ISBN"
 ```
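
The "Expected Input" section added in 0.1.0 asks for newline-separated, JSON-formatted, hash-like records, and the old pretty-printed example noted that a real record shouldn't contain newlines. The following is a minimal sketch of producing such input with Ruby's standard `json` library. It is not part of wukong-load: the helper name `to_input_line` and the sample values are hypothetical, and only the `_index`, `_es_type` and `_id` keys come from the README text above.

```ruby
require 'json'

# Hypothetical sample records. The `_index`, `_es_type` and `_id` keys are the
# fields the README names for Elasticsearch; the values are illustrative only.
RECORDS = [
  { "_index" => "media", "_es_type" => "books", "_id" => "0553573403",
    "title" => "A Game of Thrones", "author" => "George R. R. Martin" },
  # No routing keys here, so the loader's defaults would apply.
  { "title" => "A record with\nan embedded newline" }
].freeze

# Serialize one hash-like record as a single line of JSON.
# JSON.generate emits compact output and escapes newlines inside strings,
# so each record stays on exactly one line.
def to_input_line(record)
  raise ArgumentError, "record must be hash-like" unless record.respond_to?(:to_hash)
  JSON.generate(record.to_hash)
end

LINES = RECORDS.map { |record| to_input_line(record) }
puts LINES
```

Saved under a hypothetical name such as `make_records.rb`, the output could be piped the same way the README pipes `data.json`:

```
$ ruby make_records.rb | wu-load elasticsearch
```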