h2. There's lots of data

There are two disruptive trends:

* We're instrumenting every realm of human activity
** Conversation
** Relationships
* We have linearly scaling multiprocessing
** Old frontier computing: expensive, N log N, SUUUUUUCKS
** New frontier computing: cheap, scalable, and fun

h2. Wukong + Hadoop can help

h2. Map|Reduce

h3. cat input.tsv | mapper.sh | sort | reducer.sh > output.tsv

That one-line pipeline is the whole mental model: Hadoop Streaming is a distributed, fault-tolerant version of @map | sort | reduce@.

* A toy histogram, binning Twitter users by the year-month of their date column (@cuttab N@ and @cutc A-B@ are shorthand for @cut -f N@ and @cut -c A-B@):

bc. cat twitter_users.tsv | cuttab 3 | cutc 1-6 | sort | uniq -c > histogram.tsv

The same pipeline, annotated:

bc. cat twitter_users.tsv |    # stream in the raw data
  cuttab 3 |                   # extract the date column
  cutc 1-6 |                   # chop off all but the year-month
  sort |                       # sort, to ensure locality
  uniq -c > histogram.tsv      # roll up lines with their count, into the output file

h3. Word Count

mapper -- output each word on its own line:

bc. $stdin.readlines.each{|line| puts line.split(/[^\w]+/) }

reducer -- every word is _guaranteed_ to land in the same place, next to its friends, so we can just output the repetition count for each distinct line:

bc. uniq -c

(A runnable plain-Ruby version of this pair is sketched in the appendix below.)

h3. Word Count by Person

* Partition keys vs. reduce keys: plain word count reduces on @[word]@ and emits @[word, count]@; word count by person reduces on the compound key @[word, user_id]@ and emits @[word, user_id, count]@. Hadoop lets you choose the partition key independently of the sort key, so every record for a given word can still land on the same reducer. (A sketch follows in the appendix below.)

h2. Global Structure

h3. Enumerating neighborhoods

* build each node's adjacency list
* join each edge on its center node
* emit the list of 3-paths (neighbor - center - neighbor); a sketch follows in the appendix below

h2. Mechanics, HDFS

HDFS chops each file into large blocks (64MB by default) and replicates each block across several datanodes; the scheduler then tries to place every map task on a node that already holds its block, moving the computation to the data rather than the data to the computation.

h2. More Reading

h3. Hadoop

* "Hadoop: The Definitive Guide":http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979
* "Cloudera Blog":http://www.cloudera.com/blog/

h3. Hadoop|Streaming Frameworks

* infochimps.org's "Wukong":http://github.com/mrflip/wukong -- ruby; object-oriented *and* record-oriented
* NYTimes' "MRToolkit":http://code.google.com/p/mrtoolkit/ -- ruby; much more log-oriented
* Freebase's "Happy":http://code.google.com/p/happy/ -- python; the most performant of the bunch, since it runs under Jython and can make direct Hadoop API calls
* Last.fm's "Dumbo":http://wiki.github.com/klbostee/dumbo -- python

h3. Hadoop Infrastructure

Even if you have a bunch of machines with spare cycles, lots of RAM, and a shared filesystem... do yourself a favor and start out using the "Cloudera AMIs on Amazon's EC2 cloud.":http://www.cloudera.com/hadoop-ec2 There are an overwhelming number of fiddly little parameters, and you'll be glad for the packaged user experience before you get into server setup.

Actually, if it's still June 2009 when you read this, profile your scripts with Wukong on the command line and kill some time until Hadoop 0.20 comes out. It will be a) more fun, b) much more robust (trust me, at "v0.20" you want to live on the bleeding edge), and c) you won't have to suffer through migrating your HDFS two weeks after setup.
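h2. Appendix: runnable sketches

These are minimal plain-Ruby sketches of the examples above. They use only the standard library, so the same scripts run in the local @cat | map | sort | reduce@ pipeline and, unchanged, under Hadoop Streaming. All file names are made up for illustration; Wukong wraps this same pattern in reusable streamer classes.

h3. Word Count, as two scripts

bc. #!/usr/bin/env ruby
# wordcount_mapper.rb (hypothetical file name)
# Emit each word on its own line.
$stdin.each_line do |line|
  line.split(/[^\w]+/).each do |word|
    puts word.downcase unless word.empty?
  end
end

The reducer is the stream equivalent of @uniq -c@: since its input arrives sorted, identical words are adjacent, and it just counts each run.

bc. #!/usr/bin/env ruby
# wordcount_reducer.rb (hypothetical file name)
# Count each run of identical, adjacent lines.
current, count = nil, 0
$stdin.each_line do |line|
  word = line.chomp
  if word == current
    count += 1
  else
    puts [current, count].join("\t") unless current.nil?
    current, count = word, 1
  end
end
puts [current, count].join("\t") unless current.nil?

Run it locally exactly as in the pipeline above:

bc. cat input.tsv | ./wordcount_mapper.rb | sort | ./wordcount_reducer.rb > output.tsv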
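h3. Word Count by Person

Only the mapper changes: it emits the compound @[word, user_id]@ key, and the reducer above works unchanged, yielding @[word, user_id, count]@. This sketch assumes input lines of @user_id TAB text@; the layout and file name are my assumptions.

bc. #!/usr/bin/env ruby
# wordcount_by_user_mapper.rb (hypothetical file name)
# Input:  user_id TAB text
# Output: word TAB user_id  -- the compound [word, user_id] key
$stdin.each_line do |line|
  user_id, text = line.chomp.split("\t", 2)
  next if text.nil?
  text.split(/[^\w]+/).each do |word|
    puts [word.downcase, user_id].join("\t") unless word.empty?
  end
end

To partition on @word@ alone while sorting on the full compound key -- so all of a word's records reach the same reducer -- Hadoop Streaming's key-field options look roughly like this (option names from the 0.20-era streaming docs; double-check against your version):

bc. hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar    \
  -D stream.num.map.output.key.fields=2                                 \
  -D mapred.text.key.partitioner.options=-k1,1                          \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner    \
  -file wordcount_by_user_mapper.rb -mapper wordcount_by_user_mapper.rb \
  -file wordcount_reducer.rb        -reducer wordcount_reducer.rb       \
  -input tweets -output wordcounts_by_user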
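h3. Enumerating 3-paths

Key each edge by both endpoints; after the sort, the reducer sees each node alongside its full adjacency list and can emit every pair of neighbors as a path through that center. The @a TAB b@ edge-list format and the file names are my assumptions.

bc. #!/usr/bin/env ruby
# adjacency_mapper.rb (hypothetical file name)
# Input: one undirected edge per line, "a TAB b".
# Emit the edge keyed by each endpoint, so every node collects all its neighbors.
$stdin.each_line do |line|
  a, b = line.chomp.split("\t")
  next if a.nil? || b.nil?
  puts [a, b].join("\t")
  puts [b, a].join("\t")
end

bc. #!/usr/bin/env ruby
# three_paths_reducer.rb (hypothetical file name)
# Input arrives sorted by center node: "center TAB neighbor".
# Collect each center's adjacency list, then emit every unordered pair
# of neighbors as the 3-path  neighbor TAB center TAB neighbor.
def emit_paths(center, neighbors)
  neighbors.combination(2) { |a, b| puts [a, center, b].join("\t") }
end
center, neighbors = nil, []
$stdin.each_line do |line|
  node, neighbor = line.chomp.split("\t")
  if node != center
    emit_paths(center, neighbors) unless center.nil?
    center, neighbors = node, []
  end
  neighbors << neighbor
end
emit_paths(center, neighbors) unless center.nil?

One caveat: a celebrity node's adjacency list has to fit in the reducer's memory, so in practice you'd cap the degree or handle the few monster nodes separately.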