README.md in rflow-1.0.0a2 vs README.md in rflow-1.0.0a3

- old
+ new

@@ -18,25 +18,38 @@
 supports generalized connection types and message serialization,
 however only two are in current use, namely ZeroMQ connections and
 Avro serialization.
 
 RFlow currently runs as a single-threaded, evented system on top of
-[Eventmachine](http://rubyeventmachine.com/), meaning that any code
+[EventMachine](http://rubyeventmachine.com/), meaning that any code
 should be coded in an asynchronous style so as to not block the
-Eventmachine reactor (and thus block all the other components). There
-is currently work being done to "shard" the workflow among multiple
-processes and/or threads.
+EventMachine reactor (and thus block all the other components). Use
+`EM.defer` and other such patterns, along with EventMachine plugins
+for various servers and clients, to work in this style and defer
+computation to background threads.
+RFlow component workflows may be split into `shards` to improve
+parallelism. Each shard is currently represented by a separate process,
+though threads may be supported in the future. Multiple copies of a
+shard may be instantiated, which will cooperate to round-robin
+incoming messages.
+
 Some of the long-term goals of RFlow are to allow for components and
 portions of the workflow to be defined in any language that supports
-Avro and ZeroMQ, which are numerous.
+Avro and ZeroMQ, which are numerous. For this reason, the official
+specification of an RFlow workflow is a SQLite database containing
+information on its components, connections, ports, settings, etc.
+There is a Ruby DSL that aids in populating the database but the intent
+is that multiple processes and languages could access and manipulate
+the database form.
 
 ## Developer Notes
 
 You will need ZeroMQ preinstalled. Currently, EventMachine only supports
 v3.2.4, not v4.x, so install that version. Older versions like 2.2 will not
-work.
+work. (You will probably get errors saying arcane things like
+`assertion failed, mailbox.cpp(84)`).
 
 ## Definitions
 
 * __Component__ - the basic unit of RFlow computation. Each component
   is a shared-nothing, individual computation module that
@@ -48,12 +61,12 @@
   Ports can be "keyed" or "indexed" to allow better multiplexing of
   messages out/in a single port, as well as allow a single port to be
   accessed by an array.
 
 * __Connection__ - a directed link between an output port and an input
-  port. RFlow supports generalized connection types, however only
-  ZeroMQ IPC links are currently used.
+  port. RFlow supports generalized connection types; however, only
+  ZeroMQ links are currently used.
 
 * __Message__ - a bit of serialized data that is sent out an output
   port and received on an input port. Due to the serialization,
   message types and schemas are explicitly defined. In a departure
   from "pure" FBP, RFlow supports sending multiple message types via a
@@ -61,11 +74,10 @@
 * __Workflow__ - the common name for the digraph created when the
   components (nodes) are wired together via connections to their
   respective output/input ports.
 
-
 ## Component Examples
 
 The following describes the API of an RFlow component:
 
 ```ruby
@@ -88,23 +100,23 @@
 input and output ports.
 
 * `configure!` (called with a hash configuration) is called after the
   component is instantiated but before the workflow has been wired or
   any messages have been sent. Note that this is called outside the
-  Eventmachine reactor.
+  EventMachine reactor.
 * `run!` is called after all the components have been wired together
   with connections and the entire workflow has been created. For a
   component that is a source of messages, this is where messages will
   be sent. For example, if the component is reading from a file, this
   is where the file will be opened, the contents read into a message,
   and the message sent out the output port. `run!` is called within
-  the Eventmachine reactor.
+  the EventMachine reactor.
 * `process_message` is an evented callback that is called whenever the
   component receives a message on one of its input ports.
-  `process_message` is called withing the Eventmachine reactor
+  `process_message` is called within the EventMachine reactor
 * `shutdown!` is called when the flow is being terminated, and is
   meant to allow the components to do penultimate processing and send
   any final messages. All components in a flow will be told to
   `shutdown!` before they are told to `cleanup!`.
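To illustrate the lifecycle and the asynchronous style described above, here is a minimal sketch of a pass-through component that pushes its slow work onto a background thread with `EM.defer`. The `SlowRelay` class, its port names, and the `delay_seconds` option are hypothetical and only illustrate the callbacks; they are not part of RFlow itself.

```ruby
require 'rflow'
require 'eventmachine'

# Hypothetical component: forwards each message after simulating slow work
# off the reactor thread, so other components keep running.
class SlowRelay < RFlow::Component
  input_port :in
  output_port :out

  # Called outside the EventMachine reactor, before the workflow is wired.
  def configure!(config)
    @delay = (config['delay_seconds'] || 1).to_f
  end

  # Called inside the reactor once everything is wired; a pure relay has
  # nothing to start here.
  def run!; end

  # Called inside the reactor for each incoming message. Defer the slow
  # part to a background thread; the callback runs back on the reactor.
  def process_message(input_port, input_port_key, connection, message)
    work = proc { sleep @delay; message }            # pretend this is expensive
    done = proc { |result| out.send_message result } # emit once finished
    EM.defer(work, done)
  end

  # Last chance to emit any final messages before cleanup!.
  def shutdown!; end
end
```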
@@ -184,25 +196,24 @@
 ```ruby
 class RFlow::Components::FileOutput < RFlow::Component
   input_port :in
-  attr_accessor :output_file_path, :output_file
+  attr_accessor :output_file_path
 
   def configure!(config)
     self.output_file_path = config['output_file_path']
-    self.output_file = File.new output_file_path, 'w+'
   end
 
   def process_message(input_port, input_port_key, connection, message)
-    output_file.puts message.data.data_object.inspect
-    output_file.flush
+    File.open(output_file_path, 'a') do |f|
+      f.flock(File::LOCK_EX)
+      f.puts message.data.data_object.inspect
+      f.flush
+      f.flock(File::LOCK_UN)
+    end
   end
-
-  def cleanup
-    output_file.close
-  end
 end
 ```
 
 ## RFlow Messages
@@ -312,52 +323,54 @@
 ## RFlow Workflow Configuration
 
 RFlow currently stores its configuration in a SQLite database which is
 internally accessed via ActiveRecord. Given that SQLite is a rather
 simple and standard interface, non-RFlow components could
-access it and determine their respsective ZMQ connections.
+access it and determine their respective ZMQ connections.
 
 DB schemas for the configuration database are in
 [lib/rflow/configuration/migrations](lib/rflow/configuration/migrations)
 and define the complete workflow configuration. Note that each of the
 tables uses a UUID primary key, and UUIDs are used within RFlow to
 identify specific components.
 
 * settings - general application settings, such as log levels, app
-  names, directories, etc
+  names, directories, etc.
+* shards - a list of the shards defined for the workflow, including
+  UUID, type, and number of workers for the shard
+
 * components - a list of the components including its name,
-  specification (Ruby class), and options. Note that the options are
+  specification (Ruby class), shard, and options. Note that the options are
   serialized to the database as YAML, and components should understand
   that the round-trip through the database might not be perfect
   (e.g. Ruby symbols might become strings). A component also has a
   number of input ports and output ports.
 * ports - belonging to a component (via `component_uuid` foreign key),
-  also has a `type` colum for ActiveRecord STI, which gets set to
+  also has a `type` column for ActiveRecord STI, which gets set to
   either a `RFlow::Configuration::InputPort` or
   `RFlow::Configuration::OutputPort`.
 * connections - a connection between two ports via foreign keys
   `input_port_uuid` and `output_port_uuid`. Like ports, connections
-  are typed via AR STI (`RFlow::Configuration::ZMQConnection` or
-  `RFlow::Configuration::AMQPConnection`) and have a YAML serialized
-  `options` hash. A connection also (potentially) defines the port
-  keys.
+  are typed via AR STI (`RFlow::Configuration::ZMQConnection` and
+  `RFlow::Configuration::BrokeredZMQConnection` are the only
+  supported values for now) and have a YAML serialized `options`
+  hash. A connection also (potentially) defines the port keys.
 
 RFlow also provides a RubyDSL for a configuration-like file to be used
 to load the database:
 
 ```ruby
 RFlow::Configuration::RubyDSL.configure do |config|
   # Configure the settings, which include paths for various files, log
   # levels, and component specific stuffs
-  config.setting('rflow.log_level', 'DEBUG')
-  config.setting('rflow.application_directory_path', '../tmp')
+  config.setting 'rflow.log_level', 'DEBUG'
+  config.setting 'rflow.application_directory_path', '../tmp'
+  config.setting 'rflow.application_name', 'testapp'
 
-  config.setting('rflow.application_name', 'testapp')
-
   # Instantiate components
   config.component 'generate_ints1', 'RFlow::Components::GenerateIntegerSequence', {
     'start' => 0, 'finish' => 10, 'step' => 3,
@@ -384,13 +397,85 @@
   config.connect 'filter#out' => 'output1#in'
   config.connect 'filter#filtered' => 'output2#in'
 end
 ```
 
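Because the loaded configuration is an ordinary SQLite file, a process in another language (or a plain script) could read the same information directly. Below is a rough Ruby sketch; the database filename `config.sqlite`, and the column names it queries, are assumptions inferred from the table descriptions above and should be checked against the migrations in `lib/rflow/configuration/migrations`.

```ruby
require 'sqlite3'

# Illustrative only: walk the workflow database the way a non-RFlow
# participant might, to discover connections and their ZeroMQ options.
# Column names (uuid, name, options) are assumed, not guaranteed.
db = SQLite3::Database.new('config.sqlite')

sql = <<-SQL
  SELECT connections.type, connections.options, outp.name, inp.name
  FROM connections
  JOIN ports outp ON outp.uuid = connections.output_port_uuid
  JOIN ports inp  ON inp.uuid  = connections.input_port_uuid
SQL

db.execute(sql).each do |type, options, from_port, to_port|
  # options is a YAML-serialized hash (e.g. socket addresses)
  puts "#{from_port} -> #{to_port} (#{type}): #{options}"
end
```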
+## Parallelism
+
+RFlow supports parallelizing workflows and splitting them into multiple
+`shard`s. By default, components declared in the Ruby DSL exist in the
+default shard, named `DEFAULT`. There is only one worker for the default
+shard.
+
+ZeroMQ communication between components in the same shard uses ZeroMQ's
+`inproc` socket type for maximum performance. ZeroMQ communication between
+components in different shards is accomplished with a ZeroMQ `ipc` socket.
+In the case of a many-to-many connection (many workers in a producing
+shard and many workers in a consuming shard), a ZeroMQ message broker
+process is created to route the messages appropriately. Senders round-robin
+to receivers and receivers fair-queue the messages from the senders.
+Load balancing based on receiver responsiveness is not currently implemented.
+
+To define a custom shard in the Ruby DSL, use the `shard` method. For
+example:
+
+```ruby
+RFlow::Configuration::RubyDSL.configure do |config|
+  # Configure the settings, which include paths for various files, log
+  # levels, and component specific stuffs
+  config.setting 'rflow.log_level', 'DEBUG'
+  config.setting 'rflow.application_directory_path', '../tmp'
+  config.setting 'rflow.application_name', 'testapp'
+
+  config.shard 'integers', :process => 2 do |shard|
+    shard.component 'generate_ints1', 'RFlow::Components::GenerateIntegerSequence', {
+      'start' => 0,
+      'finish' => 10,
+      'step' => 3,
+      'interval_seconds' => 1
+    }
+    shard.component 'generate_ints2', 'RFlow::Components::GenerateIntegerSequence', {
+      'start' => 20,
+      'finish' => 30
+    }
+  end
+
+  # another style of specifying type and count; count defaults to 1
+  config.shard 'filters', :type => :process, :count => 1 do |shard|
+    shard.component 'filter', 'RFlow::Components::RubyProcFilter', {
+      'filter_proc_string' => 'lambda {|message| true}'
+    }
+  end
+
+  # another way of specifying type
+  config.process 'filters', :count => 2 do |shard|
+    shard.component 'output1', 'RFlow::Components::FileOutput', {
+      'output_file_path' => '/tmp/out1'
+    }
+  end
+
+  # this component will be created in the DEFAULT shard
+  config.component 'output2', 'RFlow::Components::FileOutput', {
+    'output_file_path' => '/tmp/out2'
+  }
+
+  # Wire components together
+  config.connect 'generate_ints1#out' => 'filter#in'
+  config.connect 'generate_ints2#out' => 'filter#in'
+  config.connect 'filter#filtered' => 'replicate#in'
+  config.connect 'filter#out' => 'output1#in'
+  config.connect 'filter#filtered' => 'output2#in'
+end
+```
+
+At runtime, shards with no components defined will have no workers and
+will not be started. (So, if you put all components in a custom shard,
+no `DEFAULT` workers will be seen.)
+
 ## Command-Line Operation
 
 RFlow includes the `rflow` binary that can load a database from a Ruby
-DSL, as well as start/stop the wokflow application as a daemon.
+DSL, as well as start/stop the workflow application as a daemon.
 
 Invoking the `rflow` binary without any options will give a brief help:
 
 ```
 Usage: rflow [options] (start|stop|status|load)
     -d, --database DB                Config database (sqlite) path (GENERALLY REQUIRED)