# Welcome to the Infochimps Platform!
The [Infochimps Platform](http://www.infochimps.com) is an end-to-end,
managed solution for building Big Data applications. It integrates
best-of-breed technologies like [Hadoop](http://hadoop.apache.org/),
[Storm](https://github.com/nathanmarz/storm),
[Kafka](http://incubator.apache.org/kafka/),
[MongoDB](http://www.mongodb.org/),
[ElasticSearch](http://www.elasticsearch.org/),
[HBase](http://hbase.apache.org/), &c. and provides simple interfaces
for accessing these powerful tools.
Computation, analytics, scripting, &c. are all handled by
[Wukong](http://github.com/infochimps-labs/wukong) within the
platform. Wukong is an abstract framework for defining computations
on data. Wukong processors and flows can run in many different
execution contexts including:
* locally on the command-line for testing or development purposes
* as a Hadoop mapper or reducer for batch analytics or ETL
* within Storm as part of a real-time data flow
The Infochimps Platform uses the concept of a deploy pack for
developers to develop all their processors, flows, and jobs within.
The deploy pack can be thought of as a container for all the necessary
Wukong code and plugins useful in the context of an Infochimps
Platform application. It includes the following libraries:
* wukong: The core framework for writing processors and chaining them together.
* wukong-hadoop: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
* wonderdog: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
* wukong-deploy: Code for coordinating Wukong and its plugins in a deploy pack.
**This is your deploy pack!** You will build your data processing
pipelines and Hadoop jobs within this repo.
## Setup
### Dependencies
In order to install and run a deploy pack you need the following
dependencies:
#### Ruby 1.9.x
Wukong and the deploy pack framework will only run on Ruby 1.9. There
are a lot of [online
instructions](http://www.ruby-lang.org/en/downloads/) you can use to
get Ruby 1.9 (and RubyGems) installed and configured on your local
system.
If you use [rvm](https://rvm.io/) or
[rbenv](https://github.com/sstephenson/rbenv) to manage your Ruby
installations, make sure you install all gems appropriately and invoke
bundler appropriately in what follows.
#### Git
You'll need [Git](http://git-scm.com/) to push/pull your deploy pack
code to/from the Infochimps Platform.
### Creating/Cloning the Deploy Pack
The first thing you need to do to get started is get a local copy of
this deploy on your computer. If you have already been giving a
deploy pack by Infochimps then you'll want to clone it:
```
$ git clone
```
If you are creating a deploy pack from scratch you'll want to use the
`wu-deploy` tool to create the scaffold of your deploy pack for you:
```
$ sudo gem install wukong-deploy
$ wu-deploy new
```
Once you have the deploy pack on disk, you can install the
dependencies and
### Installation
From within the root of your deploy pack run the following commands
```
$ sudo gem install bundler
$ bundle install
```
If you're using [rbenv](https://github.com/sstephenson/rbenv) you may
want to run `rbenv exec bundle install`.
### Configuration
#### Configuring the Environment
Before any of the `wu` programs can run, the Ruby process must first
boot up, require Wukong and all necessary dependencies (such as
'event-machine') and plugins (such as the deploy pack plugin
`wukong-deploy`), and then hand over control to the `wu` program.
The following Ruby files are loaded in order. Each file is
responsible for configuring some part of this runtime environment:
1. `config/environment` -- requires the rest of the files and adds any additional environmental code
2. `config/application` -- defines the load order of external libraries, Wukong plugins, and application code
3. `config/boot` -- defines how and where the Ruby process will look for code dependencies (through Bundler)
4. `config/initializers/*.rb` -- non-Wukong configuration for external libraries or application code can live here
#### Configuring the Application
The application a given deploy pack is running can be configured at
several different layers.
The simplest layer is settings passed to `wu` programs on the
command-line. These settings have the highest precedence and will
always be read.
When booting any of the `wu` tools the deploy pack will also read and
merge settings from the following configuration files, in order of
**increasing** precedence:
1. `config/settings.yml`
2. `config/settings/*.yml` if present, without any guarantee as to order
3. `config/environments/[environment].yml`
4. `config/environments/[environment]/*.yml` if present, without any guarantee as to order
5. `config/deploy.yml` if present (this file should be ignored by version control)
6. `config/environments/deploy-[environment].yml` if present (this file should be ignored by version control)
Finally, if interaction with Vayacondios is turned on, settings will
also be read from a Vayacondios stash (see the processors: Your fundamental operations or "verbs", which are passed records and parse, filter, augment, normalize, or split them.
* flows: Chain together processors into streaming flows for ingestion, real-time processing, or [complex event processing](http://en.wikipedia.org/wiki/Complex_event_processing) (CEP)
* jobs: Pair processors together to create batch jobs to run in Hadoop
* config: Where you place all application configuration for all environments
* environment.rb: Defines the runtime environment for all code, requiring and configuring all Wukong framework code. You shouldn't have to edit this file directly.
* application.rb: Require and configure libraries specific to your application. Choose a model framework, pick what application code gets loaded by default (vs. auto-loaded).
* initializers: Holds any files you need to load before application.rb here. Useful for requiring and configuring external libraries.
* settings.yml: Defines application-wide settings.
* environments: Defines environment-specific settings in YAML files named after the environment. Overrides config/settings.yml.
* data: Holds sample data in flat files. You'll develop and test your application using this data.
* Gemfile and Gemfile.lock: Defines how libraries are resolved with [Bundler](http://gembundler.com/).
* lib: Holds any code you want to use in your application but that isn't "part of" your application (like vendored libraries, Rake tasks, &c.).
* log: A good place to stash logs.
* Rakefile: Defines [Rake](http://rake.rubyforge.org/) tasks for the development, test, and deploy of your application.
* spec: Holds all your [RSpec](http://rspec.info/) unit tests.
* spec_helper.rb: Loads libraries you'll use during testing, includes spec helper libraries from Wukong.
* support: Holds support code for your tests.
* tmp: A good place to stash temporary files.
## Writing your first models, processors, flows, and jobs
Before you start developing, it might be helpful to read up on some of
the underlying documentation for Wukong and its plugins, specifically:
* on [Wukong](http://github.com/infochimps-labs/wukong/tree/3.0.0) so you understand the basic idea of a processor and how to glue processors together
* on [Wukong-Hadoop](http://github.com/infochimps-labs/wukong-hadoop) so you understand how to move between local and Hadoop modes for batch analytics
## Interacting with Vayacondios
[Vayacondios](http://github.com/infochimps-labs/vayacondios) is a
program which makes it easy to for clients to announce events or read
and write settings to and from a central server.
The basic objects of Vayacondios are **stash** and the **event**:
* a **stash** is an "object", a "configuration", or "setting" designed to be shared among many services
* an **event** is a "fact", "measurement", or "metric" announced by an arbitrary service, possibly related to some stash
Stashes and events are organized in two levels.
The top-level is the **organization**. Data from multiple
organizations is stored together but accessed separately by a running
Vayacondios server. An organization could be the name of a user,
workgroup, application, or service using Vayacondios.
The next level is the **topic**. Each topic within Vayacondios has a
single stash and can have multiple events. An "object" like a server,
a database, an application, a service, or a user maps to the concept
of "topic".
Every `wu` tool running within a deploy pack takes an additional
option `--vcd` which turns on or off interactions with Vayacondios.
This option can be specified at runtime on the command-line as well as
via a configuration file. When not running "in Vayacondios mode"
(with `--vcd` was not passed), interactions with Vayacondios will be
logged instead of transmitted and received.
### Configuring Vayacondios access
If you don't intend to interact with a Vayacondios server, you can
just set `vcd` to `false` for your whole environment and skip this
section (as is done, for example, in the `test` environment by
default).
If you intend to interact with Vayacondios then you need to also
specify the `vcd_host` and `vcd_port` options which otherwise default
to the usual Vayacondios server port running on localhost.
```yaml
---
# in config/environments/production.yml
vcd_host: 10.123.123.123
vcd_port: 9000
```
Vayacondios also requires that all events and stashes are stored under
a given organization name. The Vayacondios organization, which will
likely be shared across all environments of your application, is
usually set at the top-level:
```yaml
---
# in config/settings.yml
organization: my_company
```
### Handle out of band event data with Events
Despite being designed to be powerful and scalable, Vayacondios is not
the appropriate store for high-volume, high-throughput,
mission-critical data which must be persisited over the long-term.
Instead it should be used for "out of band" data, which is typically
much smaller in volume and throughput than the main body of a
dataflow. Examples of such out of band events include:
* signalling some intermittend or runtime error
* warning that some event was bad or suspicious
* logging an error
* registering some periodic metric
* signaling a change in state
Announcements can be made from anywhere within the Wukong framework by
accessing the `Wukong::Deploy.vayacondios_client` object but the most
common approach is to announce events within a processor or within a
dataflow.
#### Announcing from a processor
The `Wukong::Processor#announce` method can be used to directly send
an event to Vayacondios on a given topic.
```ruby
Wukong.processor(:parser) do
def process line
yield parse!(line)
rescue ParseError => e
announce "parser.errors", line: line
end
end
```
It's important when setting up an announcement like this that you
consider how often this piece of code will actually send events to
Vayacondios. If a `ParseError` is triggered once in every 10,000
lines, this may be perfectly fine to be running in production. If 1
in 10 lines causes a similar error, this may not be the right
approach.
#### Announcing from a dataflow
The `announce` processor can be used to send all announce all incoming
events to Vayacondios. Here's an example flow which makes use of it:
```ruby
Wukong.dataflow(:parse_source) do
parser |
[
select(&:valid?) | ... | to_json,
select(&:invalid?) | announce(topic: "invalid_records")
]
end
```
Just as in the above example with a processor, it's important that the
flow through the announce processor is not incredibly high-volume.
The `announce` processor is terminal; it yields no output records.
### Allow dynamic configuration with Stashes
The deploy pack inside a backend system like Hadoop or Storm can fetch
stashes from Vayacondios during runtime. Other systems external to
the deploy pack can simultaneously be writing data into these same
stashes in Vayacondios, allowing for a lightweight, two-way
communication stream between the deploy pack and arbitrary external
resources, mediated by a key-value store (the Vayacondios stash).
Stashes can be read and written from anywhere within the Wukong
framework by accessing the `Wukong::Deploy.vayacondios_client` object
but there are two special places where encapsulated, remote settings
are very useful.
#### Dynamic settings for the deploy pack itself
Each deploy pack, as an application, can fetch a stash of settings
from Vayacondios and use this as bootup time in the same way it uses a
configuration file ond disk. All that is required is a Vayacondios
stash topic name. This is furnished by providing to the deploy pack
an `application` name in a configuration file, usually the top-level
one:
```yaml
---
# in config/settings.yml, for example
application: my_app
```
When any `wu` tool is launched within the deploy pack with the `--vcd`
option (possibly set an an environment-wide level via a configuration
file) then remote settings from Vayacondios for the `application` will
be pulled at boot-time and merged into the local settings from
configuration files and the command-line.
#### Dynamic settings for processors
The processor `tokenizer` in the deploy pack with application name
`my_app` defaults to using the stash with topic
`processors.my_app-tokenizer` in Vayacondios to store its settings
(this can be changed by overriding the `Wukong::Processor#vcd_topic`
method).
These settings, if they exist, can be retrieved and merged into the
processor's current fields at anytime using the
`Wukong::Processor#update_settings`. A common use case is to want to
update a processor's fields every 30 seconds, or similar. This is
most easily accomplished via the
`Wukong::Processor#update_settings_every` method. Here's an example
```ruby
Wukong.processor(:tagger) do
field :tags, Array, doc: "List of tags to check", default: []
def setup
update_settings_every(30)
end
def process record
tags.each do |tag|
...
end
end
end
```
The `tags` field of this processor will be updated every 30 seconds
with the latest values from Vayacondios.
The `Wukong::Processor#save_settings` and
`Wukong::Processor#save_settings_every` and methods can be used to
save settings from a processor **to** Vayacondios.