h1. Wukong Wukong is Ruby for Hadoop -- it makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it. Treat your dataset like a * stream of lines when it's efficient to process by lines * stream of field arrays when it's efficient to deal directly with fields * stream of lightweight objects when it's efficient to deal with objects Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line. The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com/wukong Please feel free to add supplemental information to the "wukong wiki.":http://wiki.github.com/mrflip/wukong * "Install and set up wukong":http://mrflip.github.com/wukong/INSTALL.html * "Tutorial":http://mrflip.github.com/wukong/tutorial.html * "Usage notes":http://mrflip.github.com/wukong/usage.html * "Wutils":http://mrflip.github.com/wukong/wutils.html -- command-line utilies for working with data from the command line * Links and tips for "configuring and working with hadoop":http://mrflip.github.com/wukong/hadoop-tips.html * Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop) * "More info":http://mrflip.github.com/wukong/moreinfo.html h2. Imminent Changes I'm pushing to release "Wukong 3.0 the actual 1.0 release". * For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently * Methods on TypedStruct to * Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented * Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything. * May make some things that are derived classes into mixin'ed modules * Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though. * h2. Help! Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code h2. Install ** "Main Install and Setup Documentation":http://mrflip.github.com/wukong/INSTALL.html ** h3. Get the code We're still actively developing wukong. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/wukong pre. $ git clone git://github.com/mrflip/wukong A gem is available from "gemcutter:":http://gemcutter.org/gems/wukong pre. $ sudo gem install wukong --source=http://gemcutter.org (don't use the gems.github.com version -- it's way out of date.) You can instead download this project in either "zip":http://github.com/mrflip/wukong/zipball/master or "tar":http://github.com/mrflip/wukong/tarball/master formats. h3. Dependencies and setup To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/wukong/INSTALL.html and then read the "usage notes":http://mrflip.github.com/wukong/usage.html h2. How to write a Wukong script ** "Tutorial By Example":http://mrflip.github.com/wukong/tutorial.html ** Here's a script to count words in a text stream:
    require 'wukong'
    module WordCount
      class Mapper < Wukong::Streamer::LineStreamer
        # Emit each word in the line.
        def process line
          words = line.strip.split(/\W+/).reject(&:blank?)
          words.each{|word| yield [word, 1] }
        end
      end
      
      class Reducer < Wukong::Streamer::ListReducer
        def finalize
          yield [ key, values.map(&:last).map(&:to_i).sum ]
        end
      end
    end
    
    Wukong::Script.new(
      WordCount::Mapper,
      WordCount::Reducer
      ).run # Execute the script
The first class, the Mapper, eats lines and craps @[word, count]@ records: word is the /key/, its count is the /value/. In the reducer, the values for each key are stacked up into a list; then the record(s) yielded by @#finalize@ are emitted. There are many other ways to write the reducer (most of them are better) -- see the ["examples":examples/] h3. Structured data stream You can also use structs to treat your dataset as a stream of objects:
    require 'wukong'
    require 'my_blog' #defines the blog models
    # structs for our input objects
    Tweet = Struct.new( :id, :created_at, :twitter_user_id,
      :in_reply_to_user_id, :in_reply_to_status_id, :text )
    TwitterUser  = Struct.new( :id, :username, :fullname,
      :homepage, :location, :description )
    module TwitBlog
      class Mapper < Wukong::Streamer::RecordStreamer
        # Watch for tweets by me
        MY_USER_ID = 24601
        #
        # If this is a tweet is by me, convert it to a Post.
        #
        # If it is a tweet not by me, convert it to a Comment that
        # will be paired with the correct Post.
        #
        # If it is a TwitterUser, convert it to a User record and
        # a user_location record
        #
        def process record
          case record
          when TwitterUser
            user     = MyBlog::User.new.merge(record) # grab the fields in common
            user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
            yield user
            yield user_loc
          when Tweet
            if record.twitter_user_id == MY_USER_ID
              post = MyBlog::Post.new.merge record
              post.link = "http://twitter.com/statuses/show/#{record.id}"
              post.body = record.text
              post.title = record.text[0..65] + "..."
              yield post
            else
              comment = MyBlog::Comment.new.merge record
              comment.body    = record.text
              comment.post_id = record.in_reply_to_status_id
              yield comment
            end
          end
        end
      end
    end
    Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
h3. Advanced Patterns Wukong has a good collection of map/reduce patterns. Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.
    #
    # Roll up all values for each key into a single line
    #
    class GroupByReducer < Wukong::Streamer::AccumulatingReducer
      attr_accessor :values

      # Start with an empty list
      def start! *args
        self.values = []
      end

      # Aggregate each value in turn 
      def accumulate key, value
        self.values << value
      end

      # Emit the key and all values, tab-separated
      def finalize
        yield [key, values].flatten
      end
    end
So given adjacency pairs for the following directed friend graph:

    @jerry      @elaine
    @elaine     @jerry
    @jerry      @kramer
    @kramer     @jerry
    @kramer     @bobsacamato
    @kramer     @newman
    @jerry      @superman
    @newman     @kramer
    @newman     @elaine
    @newman     @jerry
You'd end up with

    @elaine     @jerry
    @jerry      @elaine      @kramer     @superman
    @kramer     @bobsacamato @jerry      @newman
    @newman     @elaine      @jerry      @kramer   
h2. Why is it called Wukong? Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill: bq. Sun Wukong (孙悟空), known in the West as the Monkey King, is the main character in the classical Chinese epic novel Journey to the West. In the novel, he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from India. bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn (8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling 108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations, which allows him to transform into various animals and objects; he is, however, shown with slight problems transforming into other people, since he is unable to complete the transformation of his tail. He is a skilled fighter, capable of holding his own against the best generals of heaven. Each of his hairs possesses magical properties, and is capable of transforming into a clone of the Monkey King himself, or various weapons, animals, and other objects. He also knows various spells in order to command wind, part water, conjure protective circles against demons, freeze humans, demons, and gods alike. -- ["Sun Wukong's Wikipedia entry":http://en.wikipedia.org/wiki/Wukong] The "Jaime Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.
h2. More info There are many useful examples in the examples/ directory. h3. Credits Monkeyshines was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org Patches submitted by: * gemified by Ben Woosley (ben.woosley with the gmails) * ruby interpreter path fix by "Yuichiro MASUI":http://github.com/masuidrive - masui at masuidrive.jp - http://blog.masuidrive.jp/ Thanks to: * "Fredrik Möllerstrand (@lenbust)":http://twitter.com/lenbust for the examples/contrib/jeans working example * "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback * "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial. h3. Help! Send monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code