Utility

* columnizing / reconstituting

* Set up with JRuby
* Allow for direct HDFS operations
* Make the dfs commands slightly less stupid
* Add more standard options
* Allow for combiners
* JobStarter / JobSteps
* Might as well take dumbo's command-line args

BUGS:

* Can't do multiple input files in local mode

Patterns to implement:

* Stats reducer (takes sum, avg, max, min, std.dev of a numeric field)
* Make StructRecordizer work generically with other reducers (specifically AccumulatingReducer)
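
A rough sketch of what the stats reducer might compute, in plain Ruby rather than against wukong's reducer classes (whose interface isn't spelled out here). The names @summarize@ and @stats_by_key@ and the output keys are made up for illustration, and the population standard deviation (divide by n) is an assumption.

```ruby
# Hypothetical helper: summary statistics for one group of numeric values.
# Uses the population standard deviation (divide by n), which is an assumption.
def summarize(values)
  nums = values.map { |v| Float(v) }
  return nil if nums.empty?
  n   = nums.size
  sum = nums.sum
  avg = sum / n
  var = nums.sum { |x| (x - avg)**2 } / n
  { count: n, sum: sum, avg: avg, min: nums.min, max: nums.max, stddev: Math.sqrt(var) }
end

# Group tab-separated "key<TAB>value" lines by key, then summarize each group.
# A real reducer would see the keys already sorted; this sketch groups in memory.
def stats_by_key(lines)
  groups = Hash.new { |h, k| h[k] = [] }
  lines.each do |line|
    key, value = line.chomp.split("\t", 2)
    groups[key] << value
  end
  groups.transform_values { |vals| summarize(vals) }
end
```

Collecting each group in memory keeps the sketch short; a streaming version would want a one-pass update instead.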

Example graph scripts:

* Multigraph
* Pagerank (done)
* Breadth-first search
* Triangle enumeration
* Clustering
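
For the triangle enumeration item, an in-memory sketch of the idea: for each node, pair up its higher-numbered neighbors and keep the pairs that are themselves connected. This is not a Hadoop job, the method name @triangles@ is made up, and it assumes an undirected graph given as pairs of comparable node ids.

```ruby
require 'set'

# Enumerate triangles in an undirected graph given as [a, b] edge pairs.
# Each triangle is emitted once, as a sorted triple.
def triangles(edges)
  adj = Hash.new { |h, k| h[k] = Set.new }
  edges.each do |a, b|
    next if a == b            # ignore self-loops
    adj[a] << b
    adj[b] << a
  end
  found = []
  adj.keys.sort.each do |u|
    # Only look at neighbors greater than u so each triangle appears once.
    higher = adj[u].select { |v| v > u }.sort
    higher.combination(2) do |v, w|
      found << [u, v, w] if adj[v].include?(w)
    end
  end
  found
end
```

In a map/reduce setting the pairing step would be one job (emit candidate pairs per node) and the connectivity check a second job joined against the edge list.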

Example scripts (from http://www.cloudera.com/resources/learning-mapreduce):

1. Find the [number of] hits by 5 minute timeslot for a website given its access logs.

2. Find the pages with over 1 million hits in a day for a website given its access logs.

3. Find the pages that link to each page in a collection of webpages.

4. Calculate the proportion of lines that match a given regular expression for a collection of documents.

5. Sort tabular data by a primary and secondary column.

6. Find the most popular pages for a website given its access logs.
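
As a starting point for exercise 1, a sketch in plain Ruby of the map step (line to 5-minute slot) and the reduce step (count per slot). It assumes Apache common log format timestamps; the names @timeslot@ and @hits_by_timeslot@ are made up, and nothing here uses wukong's own classes.

```ruby
require 'time'

# Assumes Apache common log format, e.g.
#   127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_TIME_RE = %r{\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]}

# Map step: reduce a log line to the start of its slot, in epoch seconds.
# Returns nil for lines without a recognizable timestamp.
def timeslot(line, slot_seconds = 300)
  m = LOG_TIME_RE.match(line)
  return nil unless m
  t = Time.strptime(m[1], '%d/%b/%Y:%H:%M:%S %z')
  t.to_i - (t.to_i % slot_seconds)
end

# Reduce step: count hits per slot, skipping unparseable lines.
def hits_by_timeslot(lines)
  counts = Hash.new(0)
  lines.each do |line|
    slot = timeslot(line)
    counts[slot] += 1 if slot
  end
  counts
end
```

Keying on epoch seconds keeps the slots sortable as plain integers; format them with @Time.at(slot).utc@ on output.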

/can use


---------------------------------------------------------------------------

Add statistics helpers

* including "running standard deviation":http://www.johndcook.com/standard_deviation.html
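
A possible shape for the running standard deviation helper, using the one-pass update usually attributed to Welford. The class name @RunningStats@ and its methods are made up, and the choice of sample variance (divide by n - 1) is an assumption.

```ruby
# Running mean and standard deviation without storing the values.
class RunningStats
  attr_reader :count, :mean

  def initialize
    @count = 0
    @mean  = 0.0
    @m2    = 0.0   # sum of squared deviations from the current mean
  end

  # Fold one value into the running totals; returns self so calls can chain.
  def push(x)
    @count += 1
    delta   = x - @mean
    @mean  += delta / @count
    @m2    += delta * (x - @mean)
    self
  end

  # Sample variance (divides by n - 1); 0.0 with fewer than two values.
  def variance
    @count > 1 ? @m2 / (@count - 1) : 0.0
  end

  def stddev
    Math.sqrt(variance)
  end
end
```

Usage would look something like @stats = RunningStats.new; values.each { |v| stats.push(v) }; stats.stddev@.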


---------------------------------------------------------------------------

Make wutils: tsv-oriented implementations of the coreutils (e.g. uniq, sort, cut, nl, wc, split, ls, df and du) that intrinsically accept and emit tab-separated records.
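
To make the wutils idea concrete, a sketch of two of them as plain Ruby methods. Both names (@tsv_cut@, @tsv_uniq_count@) and their behavior are hypothetical, not an existing wukong interface.

```ruby
# Hypothetical tsv-aware `cut`: keep the given 1-based fields of each record,
# in the order requested. Missing fields come out as empty strings.
def tsv_cut(lines, fields)
  lines.map do |line|
    cols = line.chomp.split("\t", -1)   # -1 keeps trailing empty fields
    fields.map { |f| cols[f - 1].to_s }.join("\t")
  end
end

# Hypothetical tsv-aware `uniq -c`: collapse runs of identical adjacent
# records, emitting the run length as a new first field.
def tsv_uniq_count(lines)
  lines.map(&:chomp).chunk_while { |a, b| a == b }.map do |run|
    [run.size, run.first].join("\t")
  end
end
```

Emitting the count as a leading tab-separated field (rather than the padded column coreutils @uniq -c@ prints) keeps the output directly usable as input to the next tool.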

More example Hadoop algorithms:

* Bigram counts: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/bigrams.html
* Inverted index construction: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/indexer.html
* Pagerank: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/pagerank.html
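
The bigram count is the simplest of these to sketch. The version below is plain in-memory Ruby, not the Cloud9 exercise's solution; the method name @bigram_counts@ is made up and the tokenization (lowercase, runs of letters, digits and apostrophes) is a naive assumption. Bigrams do not span line boundaries.

```ruby
# Count adjacent word pairs (bigrams) within each line.
def bigram_counts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    words = line.downcase.scan(/[a-z0-9']+/)
    # each_cons(2) yields every adjacent pair: the mapper's emitted keys.
    words.each_cons(2) { |a, b| counts[[a, b]] += 1 }
  end
  counts
end
```

In a streaming job the mapper would emit one @first<TAB>second<TAB>1@ record per pair and the reducer would sum them; the hash here stands in for both.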

Version data entries

2 entries across 2 versions & 1 rubygems

Version Path
wukong-0.1.4 doc/TODO.textile
wukong-0.1.1 doc/TODO.textile