README.textile in wukong-0.1.1 vs README.textile in wukong-0.1.4

- old
+ new

@@ -10,10 +10,26 @@
 Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.
 
 The main documentation -- including tutorials and tips for working with big data -- lives on the "Wukong Pages":http://mrflip.github.com/wukong and there is some supplemental information on the "wukong wiki.":http://wiki.github.com/mrflip/wukong
 
+h2. Install
+
+Wukong is still under active development. The newest version is available at
+
+    http://github.com/mrflip/wukong
+
+A gem is available from "github:":http://gems.github.com
+
+    gem install mrflip-wukong --source=http://gems.github.com
+
+or from "gemcutter":http://gemcutter.org
+
+    gem install wukong --source=http://gemcutter.org
+
+Phil Ripperger has prepared "instructions on getting wukong to work on the Amazon AWS cloud.":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart Thanks Phil!
+
 h2. How to write a Wukong script
 
 Here's a script to count words in a text stream:
 
 <pre><code>
   require 'wukong'
@@ -92,10 +108,65 @@
     end
   end
   Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
 </code></pre>
 
+h3. Advanced Patterns
+
+Wukong has a good collection of map/reduce patterns. For example, it's quite common to accumulate all records for a given key and emit some result based on the whole group.
+
+The AccumulatingReducer calls start! on the first record for each key, calls accumulate() on every example for that key (including the first), and calls finalize() once the last record for that key is seen.
+
+Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.
+
+<pre><code>  #
+  # Roll up all values for each key into a single line
+  #
+  class GroupByReducer < Wukong::Streamer::AccumulatingReducer
+    attr_accessor :values
+
+    # Start with an empty list
+    def start! *args
+      self.values = []
+    end
+
+    # Aggregate each value in turn
+    def accumulate key, value
+      self.values << value
+    end
+
+    # Emit the key and all values, tab-separated
+    def finalize
+      yield [key, values].flatten
+    end
+  end
+</code></pre>
+
+So given adjacency pairs for the following directed friend graph:
+
+<pre><code>
+  @jerry @elaine
+  @elaine @jerry
+  @jerry @kramer
+  @kramer @jerry
+  @kramer @bobsacamato
+  @kramer @newman
+  @jerry @superman
+  @newman @kramer
+  @newman @elaine
+  @newman @jerry
+</code></pre>
+
+You'd end up with
+
+<pre><code>
+  @elaine @jerry
+  @jerry @elaine @kramer @superman
+  @kramer @bobsacamato @jerry @newman
+  @newman @elaine @jerry @kramer
+</code></pre>
+
 h3. More info
 
 There are many useful examples (including an actually-useful version of the WordCount script) in examples/ directory.
 
 h2. Setup
@@ -107,10 +178,9 @@
   * or create a file 'config/wukong-site.yaml' with a line that points to the top-level directory of your hadoop install:
       @:hadoop_home: /usr/local/share/hadoop@
 2. Add wukong's @bin/@ directory to your $PATH, so that you may use its filesystem shortcuts.
 
-
 h2. How to run a Wukong script
 
 To run your script using local files and no connection to a hadoop cluster,