Sha256: 2c5b4d9c183ac3b637377ba2e47160dff8deea5d5c94ffdf1106f01c40049224
Contents?: true
Size: 1.9 KB
Versions: 2
Stored size: 1.9 KB
Contents
Utility

* columnizing / reconstituting
* Set up with JRuby
* Allow for direct HDFS operations
* Make the dfs commands slightly less stupid
* add more standard options
* Allow for combiners
* JobStarter / JobSteps
* might as well take dumbo's command line args

BUGS:

* Can't do multiple input files in local mode

Patterns to implement:

* Stats reducer (takes sum, avg, max, min, std.dev of a numeric field)
* Make StructRecordizer work generically with other reducers (specifically AccumulatingReducer)

Example graph scripts:

* Multigraph
* Pagerank (done)
* Breadth-first search
* Triangle enumeration
* Clustering

Example scripts (from http://www.cloudera.com/resources/learning-mapreduce):

1. Find the [number of] hits by 5 minute timeslot for a website given its access logs.
2. Find the pages with over 1 million hits in a day for a website given its access logs.
3. Find the pages that link to each page in a collection of webpages.
4. Calculate the proportion of lines that match a given regular expression for a collection of documents.
5. Sort tabular data by a primary and secondary column.
6. Find the most popular pages for a website given its access logs.

/can use

---------------------------------------------------------------------------

Add statistics helpers

* including "running standard deviation":http://www.johndcook.com/standard_deviation.html

---------------------------------------------------------------------------

Make wutils: tsv-oriented implementations of the coreutils (e.g. uniq, sort, cut, nl, wc, split, ls, df and du) to intrinsically accept and emit tab-separated records.

More example Hadoop algorithms:

* Bigram counts: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/bigrams.html
* Inverted index construction: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/indexer.html
* Pagerank: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/pagerank.html
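The stats reducer and the "running standard deviation" helper listed above could share one accumulator. The following is only a sketch in plain Ruby, not existing Wukong code: the names @RunningStats@ and @stats_for_column@ are hypothetical, and the update rule is Welford's single-pass method, which keeps a running mean and a running sum of squared deviations so no values need to be stored. It assumes tab-separated input lines and reports the sample (n - 1) standard deviation.

```ruby
# Hypothetical helper, not part of Wukong's API: accumulates
# sum, mean, min, max and standard deviation in a single pass.
class RunningStats
  attr_reader :count, :sum, :min, :max

  def initialize
    @count = 0
    @sum   = 0.0
    @mean  = 0.0
    @m2    = 0.0 # running sum of squared deviations from the mean
    @min   = nil
    @max   = nil
  end

  # Fold one numeric value into the running totals (Welford's update).
  def add(value)
    x = value.to_f
    @count += 1
    @sum   += x
    delta   = x - @mean
    @mean  += delta / @count
    @m2    += delta * (x - @mean)
    @min = x if @min.nil? || x < @min
    @max = x if @max.nil? || x > @max
    self
  end

  # Mean of the values seen so far, or nil if none were added.
  def mean
    @count.zero? ? nil : @mean
  end

  # Sample variance (divides by n - 1); 0.0 for fewer than two values.
  def variance
    @count > 1 ? @m2 / (@count - 1) : 0.0
  end

  def stddev
    Math.sqrt(variance)
  end
end

# Assumed record layout: one tab-separated record per line, with the
# numeric field at zero-based index `column`.
def stats_for_column(lines, column)
  stats = RunningStats.new
  lines.each do |line|
    field = line.chomp.split("\t")[column]
    stats.add(Float(field))
  end
  stats
end
```

In a streaming reducer the same accumulator would be reset whenever the key changes; here @stats_for_column(STDIN, 1)@ would summarize the second column of whatever is piped in.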
Version data entries
2 entries across 2 versions & 1 rubygems
Version | Path |
---|---|
wukong-0.1.4 | doc/TODO.textile |
wukong-0.1.1 | doc/TODO.textile |