---
layout: default
title:  Usage notes
---

h1(gemheader). {{ site.gemname }} %(small):: usage%

** "How to run a Wukong script":#running
** "How to test your scripts":#testing
** "Wukong Plays nicely with others":#playnice
** "Schema export":#schema_export to Pig or SQL
** "Wukong's internal workflow":#workflow
** "Using wukong with internal streaming":#stayinruby
** "Using wukong to Batch-Process ActiveRecord Objects":#activerecord


<notextile><div class="toggle"></notextile>

h2(#running). How to run a Wukong script

To run your script using local files and no connection to a hadoop cluster,

pre. your/script.rb --run=local path/to/input_files path/to/output_dir

To run the command across a Hadoop cluster,

pre. your/script.rb --run=hadoop path/to/input_files path/to/output_dir

You can set the default in the config/wukong-site.yaml file, and then just use @--run@ instead of @--run=something@ --it will just use the default run mode.

If you're running @--run=hadoop@, all file paths are HDFS paths. If you're running @--run=local@, all file paths are local paths.  (your/script path, of course, lives on the local filesystem).

You can supply arbitrary command line arguments (they wind up as key-value pairs in the options path your mapper and reducer receive), and you can use the hadoop syntax to specify more than one input file:

pre. ./path/to/your/script.rb --any_specific_options --options=can_have_vals \
   --run "input_dir/part_*,input_file2.tsv,etc.tsv" path/to/output_dir

Note that all @--options@ must precede (in any order) all non-options.

<notextile></div><div class="toggle"></notextile>

h2(#testing). How to test your scripts

To run mapper on its own:

pre. cat ./local/test/input.tsv | ./examples/word_count.rb --map | more

or if your test data lies on the HDFS,

pre. hdp-cat test/input.tsv | ./examples/word_count.rb --map | more

Next graduate to running @--run=local@ mode so you can inspect the reducer.

<notextile></div><div class="toggle"></notextile>

h2(#playnice). Wukong Plays nicely with others

Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.  It even has limited support for "martinis":http://datamapper.org (Datamapper) and "express trains":http://wiki.rubyonrails.org/rails/pages/ActiveRecord (ActiveRecord).

* "Export Wukong classes to SQL or Pig":#schema_export -- easily bulk-load and define SQL tables, or kickstart your pig scripts
* "Batch-Process records from ActiveRecord":#activerecord (the datamapper case is similar)
* Cascade Mappers and Reducers "purely in ruby":#stayinruby -- reportedly useful in an "ETL":http://en.wikipedia.org/wiki/Extract,_transform,_load context.

h3(#schema_export). Schema export to Pig or SQL

There is preliminary support for dumping wukong classes as schemata for other tools. For example, given the following:

{% highlight ruby %}
    require "wukong" ;
    require "wukong/schema"
    User = TypedStruct.new(
        [:id,                     Integer],
        [:scraped_at,             Bignum],
        [:screen_name,            String],
        [:followers_count,        Integer],
        [:created_at,             Bignum]
        );
{% endhighlight %}

You can make a snippet for loading into pig with @puts User.load_pig@:

<pre>    LOAD users.tsv AS ( rsrc:chararray, id: int, scraped_at: long, screen_name: chararray, followers_count: int, created_at: long )</pre>

Export to SQL with @puts User.sql_create_table ; puts User.sql_load_mysql@:

{% highlight sql %}
    CREATE TABLE         `users` (
      `id`                  INT,
      `scraped_at`          BIGINT,
      `screen_name`         VARCHAR(255) CHARACTER SET ASCII,
      `followers_count`     INT,
      `created_at`          BIGINT
      )  ;
    ALTER TABLE            `user` DISABLE KEYS;
    LOAD DATA LOCAL INFILE 'user.tsv'
      REPLACE INTO TABLE   `user`
      COLUMNS
        TERMINATED BY           '\t'
        OPTIONALLY ENCLOSED BY  ''
        ESCAPED BY              ''
      LINES STARTING BY     'user'
      ( @dummy,
        `id`, `scraped_at`, `screen_name`, `followers_count`, `created_at`
      );
    ALTER TABLE `user` ENABLE KEYS ;
    SELECT 'user', NOW(), COUNT(*) FROM `user`;
{% endhighlight %}

<notextile></div><div class="toggle"></notextile>

h2(#workflow). Wukong's internal workflow

Here's a somewhat detailed overview of a wukong script's internal workflow.

# You call @./myscript.rb --run infile outfile@
# Execution begins in the run method of the Script class (@wukong/script.rb@). It launches (depending on if you're local or remote) one of
** @cat infile | ./myscript.rb --map | sort | ./myscript.rb --reduce > outfile@
** @hadoop [a_crapton_of_streaming_args] -mapper './myscript.rb --map' -reducer './myscript.rb --reduce' @
# In either case, the effect is to spawn the exact same script you ran at the command line: one or more times with the --map command in place of the --run command, and one or more times with the --reduce command in place of the --run command. %(quiet)(well, unless you specify no reducers or a :map_command or something)%

# With the @--map@ or @--reduce@ flag given, the Script flag turns over control to the corresponding class: either @mapper_klass.new(self.options).stream@ or @reducer_klass.new(self.options).stream@

When in @--map@ or @--reduce@ mode (we'll just use @--map@ as an example):

# The mapper_klass is usually a subclass of @Streamer::Base@, but in actual fact it can be anything that initializes from a hash of options and responds to #stream.
# The default #stream method
** calls the before_stream hook
** reads each line from stdin ; #recordizes it ; passes it (if non-nil) to #process ; and emits each object yielded by #process
** calls its after_stream hook
# You typically leave #stream alone and just override #process.
# The accumulator classes build on these patterns (they're proper subclasses of Streamer::Base), but are used differently. With an accumulator, you should implement some or all of
** #start! -- called at the start of each accumulation, passing in the first record for that key
** #accumulate -- called on each record (including that first one)
** #finalize --  called when the last key of this accumulation is seen.
** #get_key -- called on each record to recover its key.


h3(#stayinruby). Using wukong with internal streaming

If you're using wukong in local mode, you may not want to spawn new processes all over the place.  Or your records may arrive not from the command line but from, say, a database call.

In that case, just override #stream.  The original:

{% highlight ruby %}
      #
      # Pass each record to +#process+
      #
      def stream
        before_stream
        $stdin.each do |line|
          record = recordize(line.chomp)
          next unless record
          process(*record) do |output_record|
            emit output_record
          end
        end
        after_stream
      end
{% endhighlight %}

h3(#activerecord). Using wukong to Batch-Process ActiveRecord Objects

Here's a stream method, overridden to batch-process ActiveRecord objects (untested sample code):

{% highlight ruby %}
    class Mapper < Wukong::Streamer
      # Set record_klass to the ActiveRecord class you'd like to batch process
      cattr_accessor :record_klass
      # Size of each batch to pull from the database
      cattr_accessor :batch_size

      #
      # Grab records from the database in batches,
      # pass each record to +#process+
      #
      # Everything downstream of this is agnostic of the fact that
      # records are coming from the database and not $stdin
      #
      def stream
        before_stream
        record_klass.find_in_batches(:batch_size => batch_size ) do |record_batch|
          record_batch.each do |record|
            process(record.id, record) do |output_record|
              emit output_record
            end
          end
        end
        after_stream
      end

      # ....
    end
{% endhighlight %}

<notextile></div></notextile>