templates/README.md.erb in wukong-deploy-0.1.1 vs templates/README.md.erb in wukong-deploy-0.2.0

- old
+ new

@@ -92,60 +92,89 @@ If you're using [rbenv](https://github.com/sstephenson/rbenv) you may want to run `rbenv exec bundle install`. ### Configuration -Your deploy pack doesn't need any configuration out of the box. As -you begin to extend it you may add functionality which benefits from -the ability to be configured. +#### Configuring the Environment -Put any configuration you want shared across all environments into the -file `config/settings.yml`. Override this with environment-specific -configuration in the appropriate file within `config/environments`. +Before any of the `wu` programs can run, the Ruby process must first +boot up, require Wukong and all necessary dependencies (such as +'event-machine') and plugins (such as the deploy pack plugin +`wukong-deploy`), and then hand over control to the `wu` program. -As an example, you may write a processor like this: +The following Ruby files are loaded in order. Each file is +responsible for configuring some part of this runtime environment: +1. `config/environment` -- requires the rest of the files and adds any additional environmental code +2. `config/application` -- defines the load order of external libraries, Wukong plugins, and application code +3. `config/boot` -- defines how and where the Ruby process will look for code dependencies (through Bundler) +4. `config/initializers/*.rb` -- non-Wukong configuration for external libraries or application code can live here + +#### Configuring the Application + +The application a given deploy pack is running can be configured at +several different layers. + +The simplest layer is settings passed to `wu` programs on the +command-line. These settings have the highest precedence and will +always be read. + +When booting any of the `wu` tools the deploy pack will also read and +merge settings from the following configuration files, in order of +**increasing** precedence: + +1. `config/settings.yml` +2. 
`config/settings/*.yml` if present, without any guarantee as to order +3. `config/environments/[environment].yml` +4. `config/environments/[environment]/*.yml` if present, without any guarantee as to order +5. `config/deploy.yml` if present (this file should be ignored by version control) +6. `config/environments/deploy-[environment].yml` if present (this file should be ignored by version control) + +Finally, if interaction with Vayacondios is turned on, settings will +also be read from a Vayacondios stash (see the <a +href="#vayacondios">Vayacondios section</a> below). + +Completely merged and resolved configuration settings are accessible +globally (once the Wukong framework has been booted) via the +`Wukong::Deploy.settings` object. Any piece of code in a model, +processor, dataflow, or elsewhere can read and write to this object. + +Processors will *automatically* read settings for their fields from a +subhash within this global settings object. Given a processor like + ```ruby -Wukong.procesor(:configurable_decorator) do - field :suffix, String, :default => '.' - def process record - yield [record, suffix].join +Wukong.processor(:tokenizer) do + field :min_length, Integer, default: 2 + def process line + ... end end ``` -This processor's `suffix` property can be set on the command-line: +you can override the value of its `min_length` field by putting +the following section into any one of the configuration files above: -``` -$ cat input -1 -2 -3 -$ cat input | wu-local configurable_decorator -1. -2. -3. -$ cat input | wu-local configurable_decorator --suffix=',' -1, -2, -3, - -You can also set the same property in a configuration file, scoped by -the name of the processor: - ```yaml -# in config/settings.yml --- +# in config/settings.yml, for example -configurable_decorator: - suffix: , +tokenizer: + min_length: 5 ``` -which lets you the `--suffix` flag on the command-line while still -overriding the default setting. 
You can also put such settings in -environment specific files within `config/environments`. +which would now make the command +``` +$ cat corpus.txt | wu local tokenizer +``` + +have the same effect as + +``` +$ cat corpus.txt | wu local tokenizer --min_length=5 +``` + ## File Structure A deploy pack is a repository with the following [Rails](http://rubyonrails.org/)-like file structure: @@ -204,5 +233,202 @@ Before you start developing, it might be helpful to read up on some of the underlying documentation for Wukong and its plugins, specifically: * on [Wukong](http://github.com/infochimps-labs/wukong/tree/3.0.0) so you understand the basic idea of a processor and how to glue processors together * on [Wukong-Hadoop](http://github.com/infochimps-labs/wukong-hadoop) so you understand how to move between local and Hadoop modes for batch analytics + + +<a name="vayacondios"></a> +## Interacting with Vayacondios + +[Vayacondios](http://github.com/infochimps-labs/vayacondios) is a +program which makes it easy for clients to announce events or read +and write settings to and from a central server. + +The basic objects of Vayacondios are the **stash** and the **event**: + +* a **stash** is an "object", a "configuration", or "setting" designed to be shared among many services +* an **event** is a "fact", "measurement", or "metric" announced by an arbitrary service, possibly related to some stash + +Stashes and events are organized in two levels. + +The top level is the **organization**. Data from multiple +organizations is stored together but accessed separately by a running +Vayacondios server. An organization could be the name of a user, +workgroup, application, or service using Vayacondios. + +The next level is the **topic**. Each topic within Vayacondios has a +single stash and can have multiple events. An "object" like a server, +a database, an application, a service, or a user maps to the concept +of "topic". 
+ +Every `wu` tool running within a deploy pack takes an additional +option `--vcd` which turns on or off interactions with Vayacondios. +This option can be specified at runtime on the command-line as well as +via a configuration file. When not running "in Vayacondios mode" +(that is, when `--vcd` is not passed), interactions with Vayacondios will be +logged instead of transmitted and received. + +### Configuring Vayacondios access + +If you don't intend to interact with a Vayacondios server, you can +just set `vcd` to `false` for your whole environment and skip this +section (as is done, for example, in the `test` environment by +default). + +If you intend to interact with Vayacondios then you need to also +specify the `vcd_host` and `vcd_port` options which otherwise default +to the usual Vayacondios server port running on localhost. + +```yaml +--- +# in config/environments/production.yml +vcd_host: 10.123.123.123 +vcd_port: 9000 +``` + +Vayacondios also requires that all events and stashes are stored under +a given organization name. The Vayacondios organization, which will +likely be shared across all environments of your application, is +usually set at the top level: + +```yaml +--- +# in config/settings.yml +organization: my_company +``` + +### Handle out of band event data with Events + +Despite being designed to be powerful and scalable, Vayacondios is not +the appropriate store for high-volume, high-throughput, +mission-critical data which must be persisted over the long term. +Instead it should be used for "out of band" data, which is typically +much smaller in volume and throughput than the main body of a +dataflow. 
Examples of such out of band events include: + +* signaling some intermittent or runtime error +* warning that some event was bad or suspicious +* logging an error +* registering some periodic metric +* signaling a change in state + +Announcements can be made from anywhere within the Wukong framework by +accessing the `Wukong::Deploy.vayacondios_client` object but the most +common approach is to announce events within a processor or within a +dataflow. + +#### Announcing from a processor + +The `Wukong::Processor#announce` method can be used to directly send +an event to Vayacondios on a given topic. + +```ruby +Wukong.processor(:parser) do + def process line + yield parse!(line) + rescue ParseError => e + announce "parser.errors", line: line + end +end +``` + +It's important when setting up an announcement like this that you +consider how often this piece of code will actually send events to +Vayacondios. If a `ParseError` is triggered once in every 10,000 +lines, this may be perfectly fine to run in production. If 1 +in 10 lines causes a similar error, this may not be the right +approach. + +#### Announcing from a dataflow + +The `announce` processor can be used to announce all incoming +events to Vayacondios. Here's an example flow which makes use of it: + +```ruby +Wukong.dataflow(:parse_source) do + parser | + [ + select(&:valid?) | ... | to_json, + select(&:invalid?) | announce(topic: "invalid_records") + ] +end +``` + +Just as in the above example with a processor, it's important that the +flow through the announce processor is not incredibly high-volume. + +The `announce` processor is terminal; it yields no output records. + +### Allow dynamic configuration with Stashes + +The deploy pack inside a backend system like Hadoop or Storm can fetch +stashes from Vayacondios during runtime. 
Other systems external to +the deploy pack can simultaneously be writing data into these same +stashes in Vayacondios, allowing for a lightweight, two-way +communication stream between the deploy pack and arbitrary external +resources, mediated by a key-value store (the Vayacondios stash). + +Stashes can be read and written from anywhere within the Wukong +framework by accessing the `Wukong::Deploy.vayacondios_client` object +but there are two special places where encapsulated, remote settings +are very useful. + +#### Dynamic settings for the deploy pack itself + +Each deploy pack, as an application, can fetch a stash of settings +from Vayacondios and use this at bootup time in the same way it uses a +configuration file on disk. All that is required is a Vayacondios +stash topic name. This is furnished by providing to the deploy pack +an `application` name in a configuration file, usually the top-level +one: + +```yaml +--- +# in config/settings.yml, for example + +application: my_app +``` + +When any `wu` tool is launched within the deploy pack with the `--vcd` +option (possibly set at an environment-wide level via a configuration +file) then remote settings from Vayacondios for the `application` will +be pulled at boot-time and merged into the local settings from +configuration files and the command-line. + +#### Dynamic settings for processors + +The processor `tokenizer` in the deploy pack with application name +`my_app` defaults to using the stash with topic +`processors.my_app-tokenizer` in Vayacondios to store its settings +(this can be changed by overriding the `Wukong::Processor#vcd_topic` +method). + +These settings, if they exist, can be retrieved and merged into the +processor's current fields at any time using the +`Wukong::Processor#update_settings` method. A common use case is to want to +update a processor's fields every 30 seconds, or similar. This is +most easily accomplished via the +`Wukong::Processor#update_settings_every` method. 
Here's an example: + +```ruby +Wukong.processor(:tagger) do + field :tags, Array, doc: "List of tags to check", default: [] + + def setup + update_settings_every(30) + end + + def process record + tags.each do |tag| + ... + end + end +end +``` + +The `tags` field of this processor will be updated every 30 seconds +with the latest values from Vayacondios. + +The `Wukong::Processor#save_settings` and +`Wukong::Processor#save_settings_every` methods can be used to +save settings from a processor **to** Vayacondios.
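
As a rough mental model of the round trip between a processor's fields and its stash, the sketch below uses a plain in-memory Hash in place of a Vayacondios server. `InMemoryStash`, `SketchProcessor`, and their constructors are hypothetical stand-ins and are not the Wukong or Vayacondios API; only the method names `vcd_topic`, `update_settings`, and `save_settings` and the topic pattern `processors.my_app-tokenizer` come from the text above. The sketch also assumes that remote values win over local ones when merged, which the text does not state.

```ruby
# Hypothetical stand-in for a Vayacondios server: maps topic names to stashes.
class InMemoryStash
  def initialize
    @stashes = {}
  end

  # Return the stash stored under a topic, or an empty Hash if none exists.
  def get(topic)
    @stashes.fetch(topic, {})
  end

  # Store a copy of the given settings under a topic.
  def set(topic, settings)
    @stashes[topic] = settings.dup
  end
end

# Hypothetical stand-in for a processor with configurable fields.
class SketchProcessor
  attr_reader :fields

  def initialize(application, name, stash, fields)
    @application = application
    @name        = name
    @stash       = stash
    @fields      = fields
  end

  # Follows the default topic naming described above.
  def vcd_topic
    "processors.#{@application}-#{@name}"
  end

  # Merge remote settings into the current fields.
  # Assumption: remote values take precedence over local ones.
  def update_settings
    @fields = @fields.merge(@stash.get(vcd_topic))
  end

  # Write the current fields back to the stash.
  def save_settings
    @stash.set(vcd_topic, @fields)
  end
end

stash     = InMemoryStash.new
tokenizer = SketchProcessor.new("my_app", "tokenizer", stash, { "min_length" => 2 })

# An external system writes a new value into the processor's stash...
stash.set(tokenizer.vcd_topic, { "min_length" => 5 })

# ...and the processor picks it up the next time it refreshes.
tokenizer.update_settings
```

In a real deploy pack the periodic variants (`update_settings_every`, `save_settings_every`) would call the equivalent of these methods on a timer, so nothing in the `process` method needs to know where a field's value came from.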