h1. Wukong Utility Scripts

h2. Stupid command-line tricks

h3. Histogram

Given data with a date column:

<code><pre>
message	235623	20090423012345	Now is the winter of our discontent Made glorious summer by this son of York
message	235623	20080101230900	These pretzels are making me THIRSTY!
...
</pre></code>

You can calculate the number of messages sent per day with

<code><pre>
cat messages | cuttab 3 | cutc 8 | sort | uniq -c
</pre></code>

Here @cuttab 3@ selects the date column and @cutc 8@ keeps only its first eight characters (the YYYYMMDD part). (See the wuhist command, below.)

h3. Simple intersection, union, etc

For two datasets (batch_1 and batch_2) with unique entries (no repeated lines):

* Their union is simple: @cat batch_1 batch_2 | sort -u@
* Their intersection: @cat batch_1 batch_2 | sort | uniq -c | egrep -v '^ *1 '@ This concatenates the two sets and filters out everything that occurred only once.
* For the complement of the intersection (entries found in only one of the two sets), use @... | egrep '^ *1 '@ instead.
* In both cases, if the files are each internally sorted, the command-line sort takes a --merge flag: @sort --merge -u batch_1 batch_2@

h2. Command Listing

h3. cutc

@cutc [colnum]@

Cuts from the beginning of each line to the given column (default 200). A tab counts as one character, so the right margin can still be ragged.

Ex.

<code><pre>
echo -e 'foo\tbar\tbaz' | cutc 6
foo	ba
</pre></code>

h3. cuttab

@cuttab [colspec]@

Cuts the given tab-separated columns. You can give a comma-separated list of numbers or ranges such as 1-4. Columns are numbered from 1.

Ex.

<code><pre>
echo -e 'foo\tbar\tbaz' | cuttab 1,3
foo	baz
</pre></code>

h3. hdp-*

These perform the corresponding commands on the HDFS filesystem. In general, where they accept command-line flags, they take the GNU-style ones, not the hadoop-style ones: so, @hdp-du -s dir@ or @hdp-rm -r foo/@

* @hdp-cat@
* @hdp-catd@ -- cats the files in a directory whose names don't start with '_'. Use this for a pile of @.../part-00000@ files
* @hdp-du@
* @hdp-get@
* @hdp-kill@
* @hdp-ls@
* @hdp-mkdir@
* @hdp-mv@
* @hdp-ps@
* @hdp-put@
* @hdp-rm@
* @hdp-sync@

h3. hdp-sort, hdp-stream, hdp-stream-flat

* @hdp-sort@
* @hdp-stream@
* @hdp-stream-flat@

<code><pre>
hdp-stream input_filespec output_file map_cmd reduce_cmd num_key_fields
</pre></code>

h3. tabchar

Outputs a single tab character.

h3. wuhist

Occasionally it is useful to gather a lexical histogram of a single column:

Ex.

<code><pre>
$ echo -e 'foo\nbar\nbar\nfoo\nfoo\nfoo\n7' | ./wuhist
4	foo
2	bar
1	7
</pre></code>

(The output will have a tab between the first and second column, for further processing.)

h3. wulign

Intelligently formats a tab-separated file into aligned columns (while remaining tab-separated for further processing). See README-wulign.textile.

h3. hdp-parts_to_keys.rb

A *very* clumsy script to rename reduced hadoop output files by their initial key.

If your output has an initial key in the first column and you pass it through hdp-sort, the records will be distributed across reducers and thus across output files. (Because of the way hadoop hashes the keys, there's no guarantee that each file will get a distinct key. You could have 2 keys with a million entries each and they could land sequentially on the same reducer, always fun.)

If you're willing to roll the dice, this script will rename the files according to the first key in the first line.
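
The renaming idea can be sketched in plain shell. This is only an illustration under stated assumptions, not the actual hdp-parts_to_keys.rb: it works on a local copy of the part files rather than on HDFS, and the function name @rename_parts_by_first_key@ is hypothetical. It takes the first tab-separated field of each file's first line and uses it as the new file name, skipping any file whose key would collide with an existing name.

```shell
# Hypothetical helper, NOT the real hdp-parts_to_keys.rb:
# renames local part-* files in a directory after the first key of their first line.
rename_parts_by_first_key() {
  dir="$1"
  for f in "$dir"/part-*; do
    [ -f "$f" ] || continue              # glob matched nothing, or not a regular file
    [ -s "$f" ] || continue              # leave empty part files alone
    key=$(head -n 1 "$f" | cut -f 1)     # first tab-separated field of line 1
    [ -n "$key" ] || continue            # no usable key on the first line
    case "$key" in
      */*) continue ;;                   # a key containing '/' cannot be a file name
    esac
    if [ -e "$dir/$key" ]; then
      # Another file already claimed this key; do not overwrite it.
      echo "skipping $f: $dir/$key already exists" >&2
      continue
    fi
    mv "$f" "$dir/$key"
  done
}
```

Calling @rename_parts_by_first_key ./output@ on a directory of fetched part files renames each one in place. As in the description above, a file that holds several keys is still named after the first one only.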