# Wukong Deploy Pack
The [Infochimps Platform](http://www.infochimps.com) is an end-to-end,
managed solution for building Big Data applications. It integrates
best-of-breed technologies like [Hadoop](http://hadoop.apache.org/),
[Storm](https://github.com/nathanmarz/storm),
[Kafka](http://incubator.apache.org/kafka/),
[MongoDB](http://www.mongodb.org/),
[ElasticSearch](http://www.elasticsearch.org/),
[HBase](http://hbase.apache.org/), &c. and provides simple interfaces
for accessing these powerful tools.
Computation, analytics, scripting, &c. are all handled by
[Wukong](http://github.com/infochimps-labs/wukong/tree/3.0.0) within the
platform. Wukong is an abstract framework for defining computations
on data. Wukong processors and flows can run in many different
execution contexts including:
* locally on the command-line for testing or development purposes
* as a Hadoop mapper or reducer for batch analytics or ETL
* within Storm as part of a real-time data flow
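
To make the idea of context-independent processors concrete, here is a minimal sketch in plain Ruby. It deliberately does not use Wukong's own API: the names `Upcaser`, `LengthFilter`, and `run_flow` are hypothetical and exist only for illustration. Each processor receives one record and yields zero or more records, so the same objects could in principle be driven by a local loop, a Hadoop task, or a Storm flow.

```ruby
# Hypothetical processors (not Wukong classes): each takes one record
# and yields zero or more output records.
class Upcaser
  def process(record)
    yield record.upcase
  end
end

class LengthFilter
  def initialize(min)
    @min = min
  end

  def process(record)
    yield record if record.length >= @min
  end
end

# Hypothetical local driver: pushes every record through the processors
# in order, collecting whatever each stage yields as input for the next.
def run_flow(processors, records)
  processors.reduce(records) do |input, processor|
    output = []
    input.each { |record| processor.process(record) { |out| output << out } }
    output
  end
end

puts run_flow([LengthFilter.new(4), Upcaser.new], %w[ape monkey gorilla])
```

Because the processors know nothing about where records come from, only the driver would need to change between execution contexts.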
On the Infochimps Platform, developers write all of their processors,
flows, and jobs inside a deploy pack.
The deploy pack can be thought of as a container for all the necessary
Wukong code and plugins useful in the context of an Infochimps
Platform application. It includes the following libraries:
* wukong-hadoop: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
* wukong-storm: Run Wukong processors within the Storm framework. Model flows locally before you run them.
* wukong-load: Load the output data from your local Wukong jobs and flows into a variety of different data stores.
* wonderdog: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
## Installation
The deploy pack is installed as a RubyGem:
```
$ sudo gem install wukong-deploy
```
## Usage
Wukong-Deploy provides a command-line tool `wu-deploy` which can be
used to create or interact with deploy packs.
### Creating a New Deploy Pack
Create a new deploy pack:
```
$ wu-deploy new my_app
Within /home/user/my_app:
create .
create app/models
create app/processors
...
```
This will create a directory `my_app` in the current directory.
Passing the `dry_run` option will print what would happen without
actually doing anything:
```
$ wu-deploy new my_app --dry_run
Within /home/user/my_app:
create .
create app/models
create app/processors
...
```
You'll be prompted if there is a conflict. You can pass the `force`
option to always overwrite files and the `skip` option to never
overwrite files.
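
The interplay of `dry_run`, `force`, and `skip` described above can be sketched in plain Ruby. This is not `wu-deploy`'s implementation: the helper `write_file` and its return values are hypothetical, and where the real tool would prompt on a conflict, this sketch simply reports `:conflict` and leaves the file alone.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper, not part of wu-deploy. Decides what to do with one
# file and returns :create, :overwrite, :skip, or :conflict.
def write_file(path, content, dry_run: false, force: false, skip: false)
  action =
    if !File.exist?(path)
      :create
    elsif force
      :overwrite
    elsif skip
      :skip
    else
      :conflict # a real tool would prompt the user here
    end

  # A dry run only reports the action; nothing touches the disk.
  if !dry_run && %i[create overwrite].include?(action)
    FileUtils.mkdir_p(File.dirname(path))
    File.write(path, content)
  end
  action
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "app", "models", "example.rb")
  puts write_file(path, "v1", dry_run: true)
  puts write_file(path, "v1")
  puts write_file(path, "v2", skip: true)
  puts write_file(path, "v2", force: true)
end
```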
### Working with an Existing Deploy Pack
If your current directory is within an existing deploy pack you can
start up an IRB console with the deploy pack's environment already
loaded:
```
$ wu-deploy console
irb(main):001:0>
```
## File Structure
A deploy pack is a repository with the following
[Rails](http://rubyonrails.org/)-like file structure:
```
├── app
│   ├── models
│   ├── processors
│   ├── flows
│   └── jobs
├── config
│   ├── environment.rb
│   ├── application.rb
│   ├── initializers
│   ├── settings.yml
│   └── environments
│       ├── development.yml
│       ├── production.yml
│       └── test.yml
├── data
├── Gemfile
├── Gemfile.lock
├── lib
├── log
├── Rakefile
├── spec
│   ├── spec_helper.rb
│   └── support
└── tmp
```
Let's look at it piece by piece:
* app: The directory with all the action. It's where you define:
  * models: Your domain models or "nouns", which define and wrap the different kinds of data elements in your application. They are built using whatever framework you like (defaults to [Gorillib](http://github.com/infochimps-labs/gorillib)).
  * processors: Your fundamental operations or "verbs", which are passed records and parse, filter, augment, normalize, or split them.
  * flows: Chain together processors into streaming flows for ingestion, real-time processing, or [complex event processing](http://en.wikipedia.org/wiki/Complex_event_processing) (CEP).
  * jobs: Pair processors together to create batch jobs to run in Hadoop.
* config: Where you place all application configuration for all environments.
  * environment.rb: Defines the runtime environment for all code, requiring and configuring all Wukong framework code. You shouldn't have to edit this file directly.
  * application.rb: Requires and configures libraries specific to your application. Here you choose a model framework and pick what application code gets loaded by default (vs. auto-loaded).
  * initializers: Holds any files you need to load before application.rb. Useful for requiring and configuring external libraries.
  * settings.yml: Defines application-wide settings.
  * environments: Defines environment-specific settings in YAML files named after the environment. Overrides config/settings.yml.
* data: Holds sample data in flat files. You'll develop and test your application using this data.
* Gemfile and Gemfile.lock: Define how libraries are resolved with [Bundler](http://gembundler.com/).
* lib: Holds any code you want to use in your application but that isn't "part of" your application (like vendored libraries, Rake tasks, &c.).
* log: A good place to stash logs.
* Rakefile: Defines [Rake](http://rake.rubyforge.org/) tasks for developing, testing, and deploying your application.
* spec: Holds all your [RSpec](http://rspec.info/) unit tests.
  * spec_helper.rb: Loads libraries you'll use during testing and includes spec helper libraries from Wukong.
  * support: Holds support code for your tests.
* tmp: A good place to stash temporary files.
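
As a rough illustration of how environment-specific settings can override `config/settings.yml`, the sketch below merges two YAML documents using only Ruby's standard library. The setting names (`log_level`, `elasticsearch`) and the helpers `deep_merge` and `load_settings` are hypothetical, and the recursive merge of nested keys is an assumption rather than a description of how Wukong itself resolves settings.

```ruby
require "yaml"

# Stand-ins for config/settings.yml and config/environments/production.yml.
# All keys here are hypothetical examples.
SETTINGS_YML = <<~YAML
  log_level: info
  elasticsearch:
    host: localhost
    port: 9200
YAML

PRODUCTION_YML = <<~YAML
  log_level: warn
  elasticsearch:
    host: es.example.com
YAML

# Recursively merge two hashes; values from `override` win on conflict.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Parse both documents (an empty document parses to nil, hence `|| {}`)
# and layer the environment settings over the application-wide ones.
def load_settings(base_yaml, env_yaml)
  deep_merge(YAML.safe_load(base_yaml) || {}, YAML.safe_load(env_yaml) || {})
end

puts load_settings(SETTINGS_YML, PRODUCTION_YML).inspect
```

In this sketch a key set only in the application-wide file, such as `elasticsearch.port`, survives the merge, while keys present in both files take the environment's value.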