README.md in rocketjob-0.9.1 vs README.md in rocketjob-1.0.0

- old
+ new

@@ -1,331 +1,21 @@
-# rocketjob[![Build Status](https://secure.travis-ci.org/rocketjob/rocketjob.png?branch=master)](http://travis-ci.org/rocketjob/rocketjob) ![](http://ruby-gem-downloads-badge.herokuapp.com/rocketjob?type=total)
+# rocketjob [![Gem Version](https://badge.fury.io/rb/rocketjob.svg)](http://badge.fury.io/rb/rocketjob) [![Build Status](https://secure.travis-ci.org/rocketjob/rocketjob.png?branch=master)](http://travis-ci.org/rocketjob/rocketjob) ![](http://ruby-gem-downloads-badge.herokuapp.com/rocketjob?type=total)
 
-High volume, priority based, background job processing solution for Ruby.
+High volume, priority based, distributed, background job processing solution for Ruby.
 
 ## Status
 
-Beta - Feedback on the API is welcome. API may change.
+Production Ready
 
-Already in use in production internally processing large files with millions
+Already in use in production processing large files with millions
 of records, as well as large jobs to walk through large databases.
 
-## Why?
+## Documentation
 
-We have tried for years to make both `resque` and, more recently, `sidekiq`
-work for large, high performance batch processing.
-Even `sidekiq-pro` was purchased and used in an attempt to process large batches.
+* [Guide](http://rocketjob.io/)
+* [API Reference](http://www.rubydoc.info/gems/rocketjob/)
-
-Unfortunately, after all the pain and suffering with the existing asynchronous
-worker solutions, none of them worked in our production environment without
-significant hand-holding and constant support. Mysteriously, the odd record/job
-would disappear when processing hundreds of millions of jobs, with no indication
-of where those lost jobs went.
-
-In our environment we cannot lose even a single job or record, as all data is
-business critical. The existing batch processing solutions do not supply any way
-to collect the output from batch processing, so every job needs custom
-code to collect its output. rocketjob has built-in support to collect the results
-of any batch job.
-
-High availability and high throughput were limited by how much we could push
-through `redis`. Being a single-threaded process, it is constrained to a single
-CPU. Putting `redis` on a large multi-core box does not help, since it will not
-use more than one CPU at a time.
-Additionally, `redis` is constrained to the amount of physical memory available
-on the server.
-`redis` worked very well while processing stayed below around 100,000 jobs a day;
-when our workload suddenly increased to over 100,000,000 a day it could not keep
-up. Its single CPU would often hit 100% utilization when running many `sidekiq-pro`
-servers. We also had to store the actual job data in a separate MySQL database, since
-it would not fit in memory on the `redis` server.
-
-`rocketjob` was created out of necessity due to the constant support burden. End-users were
-constantly contacting the development team to ask about the status of "hung" or
-"incomplete" jobs, as part of our DevOps role.
-
-Another significant production support challenge was trying to get `resque` or `sidekiq`
-to process the batch jobs in a very specific order. Switching from queue-based
-to priority-based job processing means that all jobs are processed in the order of
-their priority, not according to which queues are defined on which servers and in
-what quantity. This approach has allowed us to significantly increase the CPU and IO
-utilization across all worker machines. The traditional queue-based approach required
-constant tweaking in the production environment to try to balance workload without
-overwhelming any one server.
-
-End-users are now able to modify the priority of their various jobs at runtime
-so that they can get that business critical job out first, instead of having to
-wait for other jobs of the same type/priority to finish first.
-
-Since `rocketjob` uploads the entire file, or all data for processing, it does not
-require jobs to store the data in other databases.
-Additionally, `rocketjob` supports encryption and compression of any data uploaded
-into Sliced Jobs to ensure PCI compliance and to prevent sensitive data from being
-exposed, either at rest in the data store, or in flight as it is being read or
-written to the backend data store.
-Often large files received for processing contain sensitive data that must not be exposed
-in the backend job store. Having this capability built in ensures all our jobs
-properly secure sensitive data.
-
-Since moving to `rocketjob` our production support burden has diminished and now we can
-focus on writing code again. :)
-
-## Introduction
-
-`rocketjob` is a global [priority based queue](https://en.wikipedia.org/wiki/Priority_queue).
-All jobs are placed in a single global queue and the job with the highest priority
-is processed first. Jobs with the same priority are processed on a first-in,
-first-out (FIFO) basis.
-
-This differs from the traditional approach of separate queues for jobs, which
-quickly becomes cumbersome when there are, for example, over a hundred different
-types of jobs.
-
-The global priority based queue ensures that the workers are utilized to their
-capacity without requiring constant manual intervention.
-
-`rocketjob` is designed to handle the hundreds of millions of concurrent jobs
-that are often encountered in high volume batch processing environments.
-It is designed from the ground up to support large batch file processing.
-For example, a single file that contains millions of records to be processed
-as quickly as possible without impacting other jobs with a higher priority.
-
-## Management
-
-The companion project [rocketjob mission control](https://github.com/rocketjob/rocket_job_mission_control)
-contains the Rails Engine that can be loaded into your Rails project to add
-a web interface for viewing and managing `rocketjob` jobs.
-
-`rocketjob mission control` can also be run stand-alone in a shell Rails application.
-
-Separating `rocketjob mission control` into its own gem means it does not
-have to be loaded where `rocketjob` jobs are defined or run.
-
-## Jobs
-
-Simple single task jobs:
-
-Example job to run in a separate worker process:
-
-```ruby
-class MyJob < RocketJob::Job
-  # Method called asynchronously by the worker
-  def perform(email_address, message)
-    # For example, send an email to the supplied address with the supplied message
-    send_email(email_address, message)
-  end
-end
-```
-
-To queue the above job for processing:
-
-```ruby
-MyJob.perform_later('jack@blah.com', 'lets meet')
-```
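
The runtime priority changes described above can be sketched with the block form of `perform_later` that appears in the `DirmonJob` examples further down; assuming user-defined jobs accept the same block, a job can be queued ahead of other work like this (lower numbers mean higher priority, as in the `DirmonJob` examples):

```ruby
# A minimal sketch: queue MyJob with a higher-than-default priority.
# The block form of perform_later is borrowed from the DirmonJob
# examples below; treat it as illustrative, not definitive API.
MyJob.perform_later('jack@blah.com', 'lets meet') do |job|
  job.priority = 5
end
```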
-
-## Directory Monitoring
-
-A common task with many batch processing systems is to look for the appearance of
-new files and kick off jobs to process them. `DirmonJob` is a job designed to do
-this task.
-
-`DirmonJob` runs every 5 minutes by default, looking for new files that have appeared
-based on configured entries called `DirmonEntry`. Ultimately these entries will be
-configurable via `rocketjob_mission_control`, the web management interface for `rocketjob`.
-
-Example, creating a `DirmonEntry`:
-
-```ruby
-RocketJob::DirmonEntry.new(
-  path:              'path_to_monitor/*',
-  job:               'Jobs::TestJob',
-  arguments:         [ { input: 'yes' } ],
-  properties:        { priority: 23, perform_method: :event },
-  archive_directory: '/exports/archive'
-)
-```
-
-The attributes of `DirmonEntry`:
-
-* path <String>
-
-Wildcard path to search for files in.
-For details on valid path values, see: http://ruby-doc.org/core-2.2.2/Dir.html#method-c-glob
-
-Examples:
-
-  * input_files/process1/*.csv*
-  * input_files/process2/**/*
-
-* job <String>
-
-Name of the job to start.
-
-* arguments <Array>
-
-Any user supplied arguments for the method invocation.
-All keys must be UTF-8 strings. The values can be any valid BSON type:
-
-  * Integer
-  * Float
-  * Time (UTC)
-  * String (UTF-8)
-  * Array
-  * Hash
-  * True
-  * False
-  * Symbol
-  * nil
-  * Regular Expression
-
-_Note_: Date is not supported; convert it to a UTC Time.
-
-* properties <Hash>
-
-Any job properties to set.
-
-Example, override the default job priority:
-
-```ruby
-{ priority: 45 }
-```
-
-* archive_directory
-
-Archive directory to move the file to before the job is started. It is important to
-move the file before it is processed so that it is not picked up again for processing.
-If no archive_directory is supplied, the file will be moved to a folder called '_archive'
-in the same folder as the file itself.
-
-If the `path` above is a relative path, the relative path structure will be
-maintained when the file is moved to the archive path.
-
-* enabled <Boolean>
-
-Allow a monitoring entry to be disabled so that it is ignored by `DirmonJob`.
-This feature is useful for operations to temporarily stop processing files
-from a particular source, without having to completely delete the `DirmonEntry`.
-It can also be used to create a `DirmonEntry` without it becoming immediately
-active.
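
Tying the `enabled` flag to the example above, a minimal sketch of registering an entry that stays dormant until explicitly switched on; the `save!` call and the `set` update are assumptions based on the MongoMapper-backed models used elsewhere in this README:

```ruby
# Register a DirmonEntry that DirmonJob will ignore until it is enabled.
entry = RocketJob::DirmonEntry.new(
  path:    'input_files/process1/*.csv*',
  job:     'Jobs::TestJob',
  enabled: false
)
entry.save!

# Later, once the file source is ready to be processed:
entry.set(enabled: true)
```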
-
-### Starting the directory monitor
-
-The directory monitor job only needs to be started once per installation by running
-the following code:
-
-```ruby
-RocketJob::Jobs::DirmonJob.perform_later
-```
-
-The polling interval to check for new files can be modified when starting the job
-for the first time by adding:
-
-```ruby
-RocketJob::Jobs::DirmonJob.perform_later do |job|
-  job.check_seconds = 180
-end
-```
-
-The default priority for `DirmonJob` is 40; to increase its priority:
-
-```ruby
-RocketJob::Jobs::DirmonJob.perform_later do |job|
-  job.check_seconds = 300
-  job.priority = 25
-end
-```
-
-Once `DirmonJob` has been started, its priority and check interval can be
-changed at any time as follows:
-
-```ruby
-RocketJob::Jobs::DirmonJob.first.set(check_seconds: 180, priority: 20)
-```
-
-`DirmonJob` will automatically schedule a new instance of itself to run in
-the future after it completes each scan/run. If successful, the current job instance
-will destroy itself.
-
-In this way it avoids having a single directory monitor process that constantly
-sits there monitoring folders for changes. More importantly, it avoids the "single
-point of failure" that is typical of earlier directory monitoring solutions.
-Every time `DirmonJob` runs and scans the paths for new files it could be running
-on a different worker. If any server/worker is removed or shut down it will not stop
-`DirmonJob`, since it will just run on another worker instance.
-
-There can only be one `DirmonJob` instance `queued` or `running` at a time. Any
-attempt to start a second instance will result in an exception.
-
-If an exception occurs while running `DirmonJob`, a failed job instance will remain
-in the job list for problem determination. The failed job cannot be restarted and
-should be destroyed when no longer needed.
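
A minimal sketch of that cleanup, assuming the MongoMapper query interface used elsewhere in this README (`first`, `set`) also allows filtering on a job's state, and that the state is stored as `:failed` (inferred from the `queued`/`running` states mentioned above):

```ruby
# Find a failed DirmonJob instance and, once the cause has been
# investigated, destroy it and queue a fresh instance.
failed_job = RocketJob::Jobs::DirmonJob.where(state: :failed).first
if failed_job
  failed_job.destroy
  RocketJob::Jobs::DirmonJob.perform_later
end
```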
-
-## Rails Configuration
-
-MongoMapper will already configure itself in Rails environments. `rocketjob` can
-be configured to use a separate MongoDB instance from the Rails application as follows:
-
-For example, we may want `RocketJob::Job` to be stored in a Mongo database that
-is replicated across data centers, whereas we may not want to replicate the
-`RocketJob::SlicedJob` slices due to their sheer volume.
-
-```ruby
-config.before_initialize do
-  # Share the common mongo configuration file
-  config_file = root.join('config', 'mongo.yml')
-  if config_file.file?
-    config = YAML.load(ERB.new(config_file.read).result)
-    if (rocketjob = config["#{Rails.env}_rocketjob"])
-      options          = (rocketjob['options'] || {}).symbolize_keys
-      options[:logger] = SemanticLogger::DebugAsTraceLogger.new('Mongo:rocketjob')
-      RocketJob::Config.mongo_connection = Mongo::MongoClient.from_uri(rocketjob['uri'], options)
-    end
-    # It is also possible to store the jobs themselves in a separate MongoDB database
-    if (work = config["#{Rails.env}_rocketjob_work"])
-      options          = (work['options'] || {}).symbolize_keys
-      options[:logger] = SemanticLogger::DebugAsTraceLogger.new('Mongo:rocketjob_work')
-      RocketJob::Config.mongo_work_connection = Mongo::MongoClient.from_uri(work['uri'], options)
-    end
-  else
-    puts "\nmongo.yml config file not found: #{config_file}"
-  end
-end
-```
-
-For an example config file, `config/mongo.yml`, see [mongo.yml](https://github.com/rocketjob/rocketjob/blob/master/test/config/mongo.yml)
-
-## Standalone Configuration
-
-When running `rocketjob` in a standalone environment without Rails, the MongoDB
-connections will need to be set up as follows:
-
-```ruby
-options = {
-  pool_size:    50,
-  pool_timeout: 5,
-  logger:       SemanticLogger::DebugAsTraceLogger.new('Mongo:Work'),
-}
-
-# For example, when using a replica-set for high availability
-uri = 'mongodb://mongo1.site.com:27017,mongo2.site.com:27017/production_rocketjob'
-RocketJob::Config.mongo_connection = Mongo::MongoClient.from_uri(uri, options)
-
-# Use a separate database, or even server, for `RocketJob::SlicedJob` slices
-uri = 'mongodb://mongo1.site.com:27017,mongo2.site.com:27017/production_rocketjob_slices'
-RocketJob::Config.mongo_work_connection = Mongo::MongoClient.from_uri(uri, options)
-```
-
-## Requirements
-
-MongoDB V2.6 or greater; V3 is recommended.
-
-* V2.6 includes a feature that allows lookups using the `$or` clause to use an index
-
-## Meta
-
-* Code: `git clone git://github.com/rocketjob/rocketjob.git`
-* Home: <https://github.com/rocketjob/rocketjob>
-* Bugs: <http://github.com/rocketjob/rocketjob/issues>
-* Gems: <http://rubygems.org/gems/rocketjob>
+## Versioning
 
 This project uses [Semantic Versioning](http://semver.org/).
 
 ## Author