# Welcome to the Infochimps Platform! The [Infochimps Platform](http://www.infochimps.com) is an end-to-end, managed solution for building Big Data applications. It integrates best-of-breed technologies like [Hadoop](http://hadoop.apache.org/), [Storm](https://github.com/nathanmarz/storm), [Kafka](http://incubator.apache.org/kafka/), [MongoDB](http://www.mongodb.org/), [ElasticSearch](http://www.elasticsearch.org/), [HBase](http://hbase.apache.org/), &c. and provides simple interfaces for accessing these powerful tools. Computation, analytics, scripting, &c. are all handled by [Wukong](http://github.com/infochimps-labs/wukong) within the platform. Wukong is an abstract framework for defining computations on data. Wukong processors and flows can run in many different execution contexts including: * locally on the command-line for testing or development purposes * as a Hadoop mapper or reducer for batch analytics or ETL * within Storm as part of a real-time data flow The Infochimps Platform uses the concept of a deploy pack for developers to develop all their processors, flows, and jobs within. The deploy pack can be thought of as a container for all the necessary Wukong code and plugins useful in the context of an Infochimps Platform application. It includes the following libraries: * wukong: The core framework for writing processors and chaining them together. * wukong-hadoop: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them. * wonderdog: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data. * wukong-deploy: Code for coordinating Wukong and its plugins in a deploy pack. **This is your deploy pack!** You will build your data processing pipelines and Hadoop jobs within this repo. ## Setup ### Dependencies In order to install and run a deploy pack you need the following dependencies: #### Ruby 1.9.x Wukong and the deploy pack framework will only run on Ruby 1.9. There are a lot of [online instructions](http://www.ruby-lang.org/en/downloads/) you can use to get Ruby 1.9 (and RubyGems) installed and configured on your local system. If you use [rvm](https://rvm.io/) or [rbenv](https://github.com/sstephenson/rbenv) to manage your Ruby installations, make sure you install all gems appropriately and invoke bundler appropriately in what follows. #### Git You'll need [Git](http://git-scm.com/) to push/pull your deploy pack code to/from the Infochimps Platform. ### Creating/Cloning the Deploy Pack The first thing you need to do to get started is get a local copy of this deploy on your computer. If you have already been giving a deploy pack by Infochimps then you'll want to clone it: ``` $ git clone ``` If you are creating a deploy pack from scratch you'll want to use the `wu-deploy` tool to create the scaffold of your deploy pack for you: ``` $ sudo gem install wukong-deploy $ wu-deploy new ``` Once you have the deploy pack on disk, you can install the dependencies and ### Installation From within the root of your deploy pack run the following commands ``` $ sudo gem install bundler $ bundle install ``` If you're using [rbenv](https://github.com/sstephenson/rbenv) you may want to run `rbenv exec bundle install`. ### Configuration #### Configuring the Environment Before any of the `wu` programs can run, the Ruby process must first boot up, require Wukong and all necessary dependencies (such as 'event-machine') and plugins (such as the deploy pack plugin `wukong-deploy`), and then hand over control to the `wu` program. The following Ruby files are loaded in order. Each file is responsible for configuring some part of this runtime environment: 1. `config/environment` -- requires the rest of the files and adds any additional environmental code 2. `config/application` -- defines the load order of external libraries, Wukong plugins, and application code 3. `config/boot` -- defines how and where the Ruby process will look for code dependencies (through Bundler) 4. `config/initializers/*.rb` -- non-Wukong configuration for external libraries or application code can live here #### Configuring the Application The application a given deploy pack is running can be configured at several different layers. The simplest layer is settings passed to `wu` programs on the command-line. These settings have the highest precedence and will always be read. When booting any of the `wu` tools the deploy pack will also read and merge settings from the following configuration files, in order of **increasing** precedence: 1. `config/settings.yml` 2. `config/settings/*.yml` if present, without any guarantee as to order 3. `config/environments/[environment].yml` 4. `config/environments/[environment]/*.yml` if present, without any guarantee as to order 5. `config/deploy.yml` if present (this file should be ignored by version control) 6. `config/environments/deploy-[environment].yml` if present (this file should be ignored by version control) Finally, if interaction with Vayacondios is turned on, settings will also be read from a Vayacondios stash (see the processors: Your fundamental operations or "verbs", which are passed records and parse, filter, augment, normalize, or split them. * flows: Chain together processors into streaming flows for ingestion, real-time processing, or [complex event processing](http://en.wikipedia.org/wiki/Complex_event_processing) (CEP) * jobs: Pair processors together to create batch jobs to run in Hadoop * config: Where you place all application configuration for all environments * environment.rb: Defines the runtime environment for all code, requiring and configuring all Wukong framework code. You shouldn't have to edit this file directly. * application.rb: Require and configure libraries specific to your application. Choose a model framework, pick what application code gets loaded by default (vs. auto-loaded). * initializers: Holds any files you need to load before application.rb here. Useful for requiring and configuring external libraries. * settings.yml: Defines application-wide settings. * environments: Defines environment-specific settings in YAML files named after the environment. Overrides config/settings.yml. * data: Holds sample data in flat files. You'll develop and test your application using this data. * Gemfile and Gemfile.lock: Defines how libraries are resolved with [Bundler](http://gembundler.com/). * lib: Holds any code you want to use in your application but that isn't "part of" your application (like vendored libraries, Rake tasks, &c.). * log: A good place to stash logs. * Rakefile: Defines [Rake](http://rake.rubyforge.org/) tasks for the development, test, and deploy of your application. * spec: Holds all your [RSpec](http://rspec.info/) unit tests. * spec_helper.rb: Loads libraries you'll use during testing, includes spec helper libraries from Wukong. * support: Holds support code for your tests. * tmp: A good place to stash temporary files. ## Writing your first models, processors, flows, and jobs Before you start developing, it might be helpful to read up on some of the underlying documentation for Wukong and its plugins, specifically: * on [Wukong](http://github.com/infochimps-labs/wukong/tree/3.0.0) so you understand the basic idea of a processor and how to glue processors together * on [Wukong-Hadoop](http://github.com/infochimps-labs/wukong-hadoop) so you understand how to move between local and Hadoop modes for batch analytics ## Interacting with Vayacondios [Vayacondios](http://github.com/infochimps-labs/vayacondios) is a program which makes it easy to for clients to announce events or read and write settings to and from a central server. The basic objects of Vayacondios are **stash** and the **event**: * a **stash** is an "object", a "configuration", or "setting" designed to be shared among many services * an **event** is a "fact", "measurement", or "metric" announced by an arbitrary service, possibly related to some stash Stashes and events are organized in two levels. The top-level is the **organization**. Data from multiple organizations is stored together but accessed separately by a running Vayacondios server. An organization could be the name of a user, workgroup, application, or service using Vayacondios. The next level is the **topic**. Each topic within Vayacondios has a single stash and can have multiple events. An "object" like a server, a database, an application, a service, or a user maps to the concept of "topic". Every `wu` tool running within a deploy pack takes an additional option `--vcd` which turns on or off interactions with Vayacondios. This option can be specified at runtime on the command-line as well as via a configuration file. When not running "in Vayacondios mode" (with `--vcd` was not passed), interactions with Vayacondios will be logged instead of transmitted and received. ### Configuring Vayacondios access If you don't intend to interact with a Vayacondios server, you can just set `vcd` to `false` for your whole environment and skip this section (as is done, for example, in the `test` environment by default). If you intend to interact with Vayacondios then you need to also specify the `vcd_host` and `vcd_port` options which otherwise default to the usual Vayacondios server port running on localhost. ```yaml --- # in config/environments/production.yml vcd_host: 10.123.123.123 vcd_port: 9000 ``` Vayacondios also requires that all events and stashes are stored under a given organization name. The Vayacondios organization, which will likely be shared across all environments of your application, is usually set at the top-level: ```yaml --- # in config/settings.yml organization: my_company ``` ### Handle out of band event data with Events Despite being designed to be powerful and scalable, Vayacondios is not the appropriate store for high-volume, high-throughput, mission-critical data which must be persisited over the long-term. Instead it should be used for "out of band" data, which is typically much smaller in volume and throughput than the main body of a dataflow. Examples of such out of band events include: * signalling some intermittend or runtime error * warning that some event was bad or suspicious * logging an error * registering some periodic metric * signaling a change in state Announcements can be made from anywhere within the Wukong framework by accessing the `Wukong::Deploy.vayacondios_client` object but the most common approach is to announce events within a processor or within a dataflow. #### Announcing from a processor The `Wukong::Processor#announce` method can be used to directly send an event to Vayacondios on a given topic. ```ruby Wukong.processor(:parser) do def process line yield parse!(line) rescue ParseError => e announce "parser.errors", line: line end end ``` It's important when setting up an announcement like this that you consider how often this piece of code will actually send events to Vayacondios. If a `ParseError` is triggered once in every 10,000 lines, this may be perfectly fine to be running in production. If 1 in 10 lines causes a similar error, this may not be the right approach. #### Announcing from a dataflow The `announce` processor can be used to send all announce all incoming events to Vayacondios. Here's an example flow which makes use of it: ```ruby Wukong.dataflow(:parse_source) do parser | [ select(&:valid?) | ... | to_json, select(&:invalid?) | announce(topic: "invalid_records") ] end ``` Just as in the above example with a processor, it's important that the flow through the announce processor is not incredibly high-volume. The `announce` processor is terminal; it yields no output records. ### Allow dynamic configuration with Stashes The deploy pack inside a backend system like Hadoop or Storm can fetch stashes from Vayacondios during runtime. Other systems external to the deploy pack can simultaneously be writing data into these same stashes in Vayacondios, allowing for a lightweight, two-way communication stream between the deploy pack and arbitrary external resources, mediated by a key-value store (the Vayacondios stash). Stashes can be read and written from anywhere within the Wukong framework by accessing the `Wukong::Deploy.vayacondios_client` object but there are two special places where encapsulated, remote settings are very useful. #### Dynamic settings for the deploy pack itself Each deploy pack, as an application, can fetch a stash of settings from Vayacondios and use this as bootup time in the same way it uses a configuration file ond disk. All that is required is a Vayacondios stash topic name. This is furnished by providing to the deploy pack an `application` name in a configuration file, usually the top-level one: ```yaml --- # in config/settings.yml, for example application: my_app ``` When any `wu` tool is launched within the deploy pack with the `--vcd` option (possibly set an an environment-wide level via a configuration file) then remote settings from Vayacondios for the `application` will be pulled at boot-time and merged into the local settings from configuration files and the command-line. #### Dynamic settings for processors The processor `tokenizer` in the deploy pack with application name `my_app` defaults to using the stash with topic `processors.my_app-tokenizer` in Vayacondios to store its settings (this can be changed by overriding the `Wukong::Processor#vcd_topic` method). These settings, if they exist, can be retrieved and merged into the processor's current fields at anytime using the `Wukong::Processor#update_settings`. A common use case is to want to update a processor's fields every 30 seconds, or similar. This is most easily accomplished via the `Wukong::Processor#update_settings_every` method. Here's an example ```ruby Wukong.processor(:tagger) do field :tags, Array, doc: "List of tags to check", default: [] def setup update_settings_every(30) end def process record tags.each do |tag| ... end end end ``` The `tags` field of this processor will be updated every 30 seconds with the latest values from Vayacondios. The `Wukong::Processor#save_settings` and `Wukong::Processor#save_settings_every` and methods can be used to save settings from a processor **to** Vayacondios.