README.textile in wukong-1.5.4 vs README.textile in wukong-2.0.0

- old
+ new

@@ -17,23 +17,11 @@
 * "Wutils":http://mrflip.github.com/wukong/wutils.html -- command-line utilities for working with data from the command line
 * Links and tips for "configuring and working with hadoop":http://mrflip.github.com/wukong/hadoop-tips.html
 * Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
 * "More info":http://mrflip.github.com/wukong/moreinfo.html
 
-h2. Imminent Changes
 
-I'm pushing to release "Wukong 3.0 the actual 1.0 release".
-
-* For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
-* Methods on TypedStruct to
-
-  * Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
-  * Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
-  * May make some things that are derived classes into mixin'ed modules
-  * Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
-
-
 h2. Help!
 
 Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
 
 h2. Install

@@ -190,9 +178,67 @@
 @elaine @jerry @jerry @elaine @kramer @superman @kramer @bobsacamato @jerry @newman @newman @elaine @jerry @kramer
 </code></pre>
+
+h2. Gotchas
+
+h4. RecordStreamer dies on blank lines with "wrong number of arguments"
+
+If your lines don't always have a full complement of fields, and you define #process() to take fixed named arguments, then Ruby will complain when some of them don't show up:
+
+<pre>
+  class MyUnhappyMapper < Wukong::Streamer::RecordStreamer
+    # this will fail if the line has more or fewer than 3 fields:
+    def process x, y, z
+      p [x, y, z]
+    end
+  end
+</pre>
+
+The cleanest way I know to fix this is with recordize, which you should recall always returns an array of fields:
+
+<pre>
+  class MyHappyMapper < Wukong::Streamer::RecordStreamer
+    # extracts three fields always; any missing fields are nil, any extra fields discarded
+    # @example
+    #   recordize("a")             # ["a", nil, nil]
+    #   recordize("a\tb\tc")       # ["a", "b", "c"]
+    #   recordize("a\tb\tc\td")    # ["a", "b", "c"]
+    def recordize raw_record
+      x, y, z = super(raw_record)
+      [x, y, z]
+    end
+
+    # Now all lines produce exactly three args
+    def process x, y, z
+      p [x, y, z]
+    end
+  end
+</pre>
+
+If you want to preserve any extra fields, use the extra (limit) argument to String#split():
+
+<pre>
+  class MyMoreThanHappyMapper < Wukong::Streamer::RecordStreamer
+    # extracts three fields always; any missing fields are nil, the final field will contain a tab-separated string of all trailing fields
+    # @example
+    #   recordize("a")             # ["a", nil, nil]
+    #   recordize("a\tb\tc")       # ["a", "b", "c"]
+    #   recordize("a\tb\tc\td")    # ["a", "b", "c\td"]
+    def recordize raw_record
+      x, y, z = raw_record.split("\t", 3)
+      [x, y, z]
+    end
+
+    # Now all lines produce exactly three args
+    def process x, y, z
+      p [x, y, z]
+    end
+  end
+</pre>
+
 h2. Why is it called Wukong?
 
 Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill: