README.textile in wukong-1.5.4 vs README.textile in wukong-2.0.0
- old
+ new
@@ -17,23 +17,11 @@
* "Wutils":http://mrflip.github.com/wukong/wutils.html -- utilities for working with data from the command line
* Links and tips for "configuring and working with hadoop":http://mrflip.github.com/wukong/hadoop-tips.html
* Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
* "More info":http://mrflip.github.com/wukong/moreinfo.html
-h2. Imminent Changes
-I'm pushing to release "Wukong 3.0 the actual 1.0 release".
-
-* For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
-* Methods on TypedStruct to
-
- * Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
- * Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
- * May make some things that are derived classes into mixin'ed modules
- * Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
-
-
h2. Help!
Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
h2. Install
@@ -190,9 +178,67 @@
@elaine @jerry
@jerry @elaine @kramer @superman
@kramer @bobsacamato @jerry @newman
@newman @elaine @jerry @kramer
</code></pre>
+
+h2. Gotchas
+
+h4. RecordStreamer dies on blank lines with "wrong number of arguments"
+
+If your lines don't always have a full complement of fields and you define #process() to take a fixed number of named arguments, Ruby raises an ArgumentError ("wrong number of arguments") whenever a line has too few or too many fields:
+
+<pre>
+ class MyUnhappyMapper < Wukong::Streamer::RecordStreamer
+ # this will fail if the line has more or fewer than 3 fields:
+ def process x, y, z
+ p [x, y, z]
+ end
+ end
+</pre>
+
+The cleanest fix I know is to override #recordize, which always returns an array of fields:
+
+<pre>
+ class MyHappyMapper < Wukong::Streamer::RecordStreamer
+ # extracts three fields always; any missing fields are nil, any extra fields discarded
+ # @example
+ # recordize("a") # ["a", nil, nil]
+ # recordize("a\tb\tc") # ["a", "b", "c"]
+ # recordize("a\tb\tc\td") # ["a", "b", "c"]
+ def recordize raw_record
+ x, y, z = super(raw_record)
+ [x, y, z]
+ end
+
+ # Now all lines produce exactly three args
+ def process x, y, z
+ p [x, y, z]
+ end
+ end
+</pre>
+
+If you want to preserve any extra fields, pass a limit as the second argument to String#split; the last field then holds everything past the limit:
+
+<pre>
+ class MyMoreThanHappyMapper < Wukong::Streamer::RecordStreamer
+ # extracts three fields always; any missing fields are nil, the final field will contain a tab-separated string of all trailing fields
+ # @example
+ # recordize("a") # ["a", nil, nil]
+ # recordize("a\tb\tc") # ["a", "b", "c"]
+ # recordize("a\tb\tc\td") # ["a", "b", "c\td"]
+ def recordize raw_record
+ x, y, z = raw_record.split("\t", 3)
+ [x, y, z]
+ end
+
+ # Now all lines produce exactly three args
+ def process x, y, z
+ p [x, y, z]
+ end
+ end
+</pre>
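
Both fixes above rest on two plain-Ruby behaviors: multiple assignment pads missing values with nil and drops extras, and String#split with a limit leaves the remainder of the line in the last field. Here is a minimal sketch you can run without Wukong to check them; @normalize_fields@ is a hypothetical helper made up for this illustration, not part of the Wukong API:

```ruby
# Hypothetical helper, for illustration only (not part of Wukong).
# Always returns exactly three fields from a tab-separated line.
def normalize_fields(raw_record, keep_extra = false)
  # With a limit of 3, split stops after two separators, so the
  # third element keeps any trailing tab-separated fields intact.
  fields = keep_extra ? raw_record.split("\t", 3) : raw_record.split("\t")
  # Multiple assignment: missing values become nil, extras are dropped.
  x, y, z = fields
  [x, y, z]
end

p normalize_fields("")                  # blank line: all three fields are nil
p normalize_fields("a")                 # missing fields padded with nil
p normalize_fields("a\tb\tc\td")        # extra field discarded
p normalize_fields("a\tb\tc\td", true)  # extra field kept in the last slot
```

A blank line splits to an empty array, so all three fields come back nil and a three-argument #process() is still called with the right arity.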
+
h2. Why is it called Wukong?
Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill: