--- layout: default title: mrflip.github.com/wukong - TODO collapse: false --- h1(gemheader). Wukong More Info ** "Why is it called Wukong?":#name ** "Don't Use Wukong, use this instead":#whateverdude ** "Further Reading and useful links":#links ** "Note on Patches/Pull Requests":#patches ** "What's up with Wukong::AndPig?":#andpig ** "Map/Reduce Algorithms":#algorithms ** "TODOs":#TODO
h2(#name). Why is it called Wukong? Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill: bq. Sun Wukong (孙悟空), known in the West as the Monkey King, is the main character in the classical Chinese epic novel Journey to the West. In the novel, he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from India. bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn (8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling 108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations, which allows him to transform into various animals and objects; he is, however, shown with slight problems transforming into other people, since he is unable to complete the transformation of his tail. He is a skilled fighter, capable of holding his own against the best generals of heaven. Each of his hairs possesses magical properties, and is capable of transforming into a clone of the Monkey King himself, or various weapons, animals, and other objects. He also knows various spells in order to command wind, part water, conjure protective circles against demons, freeze humans, demons, and gods alike. -- ["Sun Wukong's Wikipedia entry":http://en.wikipedia.org/wiki/Wukong] The "Jaime Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.
h2(#algorithms). Map/Reduce Algorithms Example graph scripts: * Multigraph * Pagerank (done) * Breadth-first search * Triangle enumeration * Clustering h3. K-Nearest Neighbors More example hadoop algorithms: * Bigram counts: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/bigrams.html * Inverted index construction: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/indexer.html * Pagerank : http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/pagerank.html * SIPs, Median, classifiers and more : http://matpalm.com/ * Brad Heintz's "Distributed Computing with Ruby":http://www.bradheintz.com/no1thing/talks/ demonstrates Travelling Salesman in map/reduce. * "Clustering billions of images with large scale nearest neighbor search":http://scholar.google.com/scholar?cluster=2473742255769621469&hl=en uses three map/reduce passes: ** Subsample to build a "spill tree" that roughly localizes each object ** Use the spill tree on the full dataset to group each object with its potential neighbors ** Calculate the metrics and emit only the k-nearest neighbors Example example scripts (from http://www.cloudera.com/resources/learning-mapreduce): 1. Find the [number of] hits by 5 minute timeslot for a website given its access logs. 2. Find the pages with over 1 million hits in day for a website given its access logs. 3. Find the pages that link to each page in a collection of webpages. 4. Calculate the proportion of lines that match a given regular expression for a collection of documents. 5. Sort tabular data by a primary and secondary column. 6. Find the most popular pages for a website given its access logs.
h2(#whateverdude). Don't Use Wukong, use this instead There are several worthy Hadoop|Streaming Frameworks: * infochimps.org's "Wukong":http://github.com/mrflip/wukong -- ruby; object-oriented *and* record-oriented * NYTimes' "MRToolkit":http://code.google.com/p/mrtoolkit/ -- ruby; much more log-oriented * Freebase's "Happy":http://code.google.com/p/happy/ -- python; the most performant, as it can use Jython to make direct API calls. * Last.fm's "Dumbo":http://wiki.github.com/klbostee/dumbo -- python Most people use Wukong / one of the above (or straight Java Hadoop, poor souls) for heavy lifting, and several of the following hadoop tools for efficiency: * Pig OR * Hive -- hive is more SQL-ish, Pig is more elegant (in a brushed-metal kind of way). I greatly prefer Pig, because I hate SQL; you may feel differently. * Sqoop * Mahout
h2(#links). Further Reading and useful links: * "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart - dive right in with Wukong, Hadoop and the Amazon Elastic MapReduce cloud. Once you get bored with the command line, this is the fastest path to Wukong power. * "Distributed Computing with Ruby":http://www.bradheintz.com/no1thing/talks/ has some raw ruby, some Wukong and some JRuby/Hadoop integration -- it demonstrates a Travelling Salesman in map/reduce. Cool! * "Hadoop, The Definitive Guide":http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979 * "Running Hadoop On Ubuntu Linux (Single-Node Cluster)":http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) and "unning Hadoop On Ubuntu Linux (Multi-Node Cluster).":http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) * "Running Hadoop MapReduce on Amazon EC2 and S3":http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873 * "Hadoop Overview by Doug Cutting":http://video.google.com/videoplay?docid=-4912926263813234341 - the founder of the Hadoop project. (49m video) * "Cluster Computing and Map|Reduce":http://www.youtube.com/results?search_query=cluster+computing+and+mapreduce ** "Lecture 1: Overview":http://www.youtube.com/watch?v=yjPBkvYh-ss ** "Lecture 2 (technical): Map|Reduce":http://www.youtube.com/watch?v=-vD6PUdf3Js ** "Lecture 3 (technical): GFS (Google File System)":http://www.youtube.com/watch?v=5Eib_H_zCEY ** "Lecture 4 (theoretical): Canopy Clustering":http://www.youtube.com/watch?v=1ZDybXl212Q ** "Lecture 5 (theoretical): Breadth-First Search":http://www.youtube.com/watch?v=BT-piFBP4fE * "Cloudera Hadoop Training:":http://www.cloudera.com/hadoop-training ** "Thinking at Scale":http://www.cloudera.com/hadoop-training-thinking-at-scale ** "Mapreduce and HDFS":http://www.cloudera.com/hadoop-training-mapreduce-hdfs ** "A Tour of the Hadoop Ecosystem":http://www.cloudera.com/hadoop-training-ecosystem-tour ** "Programming with Hadoop":http://www.cloudera.com/hadoop-training-programming-with-hadoop ** "Hadoop and Hive: introduction":http://www.cloudera.com/hadoop-training-hive-introduction ** "Hadoop and Hive: tutorial":http://www.cloudera.com/hadoop-training-hive-tutorial ** "Hadoop and Pig: Introduction":http://www.cloudera.com/hadoop-training-pig-introduction ** "Hadoop and Pig: Tutorial":http://www.cloudera.com/hadoop-training-pig-tutorial ** "Mapreduce Algorithms":http://www.cloudera.com/hadoop-training-mapreduce-algorithms ** "Exercise: Getting started with Hadoop":http://www.cloudera.com/hadoop-training-exercise-getting-started-with-hadoop ** "Exercise: Writing mapreduce programs":http://www.cloudera.com/hadoop-training-exercise-writing-mapreduce-programs ** "Cloudera Blog":http://www.cloudera.com/blog/ * "Hadoop Wiki: Hadoop Streaming":http://wiki.apache.org/hadoop/HadoopStreaming * "Hadoop Docs: Hadoop Streaming":http://hadoop.apache.org/common/docs/current/streaming.html * A "dimwitted screed on Ruby, Hadoop and Starling":http://www.theregister.co.uk/2008/08/11/hadoop_dziuba/ seemingly written with jockstrap on head.
h2(#patches). Note on Patches/Pull Requests * Fork the project. * Make your feature addition or bug fix. * Add tests for it. This is important so I don't break it in a future version unintentionally. * Commit, do not mess with rakefile, version, or history. (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull) * Send me a pull request. Bonus points for topic branches.
h2(#andpig). What's up with Wukong::AndPig? @Wukong::AndPig@ is a small library to more easily generate code for the "Pig":http://hadoop.apache.org/pig data analysis language. See its "README":http://github.com/mrflip/wukong/tree/master/lib/wukong/and_pig/README.textile for more. It's **not really being worked on**, and you should probably **ignore it**.
h2(#todo). TODOs Utility * columnizing / reconstituting * Set up with JRuby * Allow for direct HDFS operations * Make the dfs commands slightly less stupid * add more standard options * Allow for combiners * JobStarter / JobSteps * might as well take dumbo's command line args BUGS: * Can't do multiple input files in local mode Patterns to implement: * Stats reducer ** basic sum, avg, max, min, std.dev of a numeric field ** the "running standard deviation":http://www.johndcook.com/standard_deviation.html * Efficient median (and other order statistics) * Make StructRecordizer work generically with other reducers (spec. AccumulatingReducer) Make wutils: tsv-oriented implementations of the coreutils (eg uniq, sort, cut, nl, wc, split, ls, df and du) to instrinsically accept and emit tab-separated records.