README.md in rocketjob-0.7.0 vs README.md in rocketjob-0.8.0

- old
+ new

@@ -1,12 +1,12 @@
-# rocketjob
+# rocketjob[![Build Status](https://secure.travis-ci.org/rocketjob/rocketjob.png?branch=master)](http://travis-ci.org/rocketjob/rocketjob) ![](http://ruby-gem-downloads-badge.herokuapp.com/rocketjob?type=total)

 High volume, priority based, background job processing solution for Ruby.

 ## Status

-Alpha - Feedback on the API is welcome. API will change.
+Beta - Feedback on the API is welcome. API may change.

 Already in use in production internally processing large files with millions
 of records, as well as large jobs to walk through large databases.

 ## Why?
@@ -89,11 +89,11 @@
 For example a single file that contains millions of records to be processed
 as quickly as possible without impacting other jobs with a higher priority.

 ## Management

-The companion project [rocketjob mission control](https://github.com/lambcr/rocket_job_mission_control)
+The companion project [rocketjob mission control](https://github.com/rocketjob/rocket_job_mission_control)
 contains the Rails Engine that can be loaded into your Rails project to add
 a web interface for viewing and managing `rocketjob` jobs.

 `rocketjob mission control` can also be run stand-alone in a shell Rails application.

@@ -120,41 +120,218 @@

 ```ruby
 MyJob.perform_later('jack@blah.com', 'lets meet')
 ```

-## Configuration
+## Directory Monitoring

-MongoMapper will already configure itself in Rails environments. Sometimes we want
-to use a different Mongo Database instance for the records and results.
+A common task with many batch processing systems is to look for the appearance of
+new files and kick off jobs to process them. `DirmonJob` is a job designed to do
+this task.

-For example, the RocketJob::Job can be stored in a Mongo Database that is replicated
-across data centers, whereas we may not want to replicate record and result data
-due to its sheer volume.
+`DirmonJob` runs every 5 minutes by default, looking for new files that have appeared
+based on configured entries called `DirmonEntry`. Ultimately these entries will be
+configurable via `rocketjob_mission_control`, the web management interface for `rocketjob`.

+Example, creating a `DirmonEntry`
+
 ```ruby
+RocketJob::DirmonEntry.new(
+  path: 'path_to_monitor/*',
+  job: 'Jobs::TestJob',
+  arguments: [ { input: 'yes' } ],
+  properties: { priority: 23, perform_method: :event },
+  archive_directory: '/exports/archive'
+)
+```
+
+The attributes of DirmonEntry:
+
+* path <String>
+
+Wildcard path to search for files in.
+For details on valid path values, see: http://ruby-doc.org/core-2.2.2/Dir.html#method-c-glob
+
+Example:
+
+  * input_files/process1/*.csv*
+  * input_files/process2/**/*
+
+* job <String>
+
+Name of the job to start
+
+* arguments <Array>
+
+Any user supplied arguments for the method invocation
+All keys must be UTF-8 strings. The values can be any valid BSON type:
+
+  * Integer
+  * Float
+  * Time (UTC)
+  * String (UTF-8)
+  * Array
+  * Hash
+  * True
+  * False
+  * Symbol
+  * nil
+  * Regular Expression
+
+_Note_: Date is not supported, convert it to a UTC time
+
+* properties <Hash>
+
+Any job properties to set.
+
+Example, override the default job priority:
+
+```ruby
+{ priority: 45 }
+```
+
+* archive_directory
+
+Archive directory to move the file to before the job is started. It is important to
+move the file before it is processed so that it is not picked up again for processing.
+If no archive_directory is supplied the file will be moved to a folder called '_archive'
+in the same folder as the file itself.
+
+If the `path` above is a relative path the relative path structure will be
+maintained when the file is moved to the archive path.
+
+* enabled <Boolean>
+
+Allow a monitoring entry to be disabled so that it is ignored by `DirmonJob`.
+This feature is useful for operations to temporarily stop processing files
+from a particular source, without having to completely delete the `DirmonEntry`.
+It can also be used to create a `DirmonEntry` without it becoming immediately
+active.
+```
+
+### Starting the directory monitor
+
+The directory monitor job only needs to be started once per installation by running
+the following code:
+
+```ruby
+RocketJob::Jobs::DirmonJob.perform_later
+```
+
+The polling interval to check for new files can be modified when starting the job
+for the first time by adding:
+```ruby
+RocketJob::Jobs::DirmonJob.perform_later do |job|
+  job.check_seconds = 180
+end
+```
+
+The default priority for `DirmonJob` is 40. To increase its priority:
+```ruby
+RocketJob::Jobs::DirmonJob.perform_later do |job|
+  job.check_seconds = 300
+  job.priority = 25
+end
+```
+
+Once `DirmonJob` has been started its priority and check interval can be
+changed at any time as follows:
+
+```ruby
+RocketJob::Jobs::DirmonJob.first.set(check_seconds: 180, priority: 20)
+```
+
+The `DirmonJob` will automatically re-schedule a new instance of itself to run in
+the future after it completes each scan/run. If successful the current job instance
+will destroy itself.
+
+In this way it avoids having a single Directory Monitor process that constantly
+sits there monitoring folders for changes. More importantly it avoids a "single
+point of failure" that is typical for earlier directory monitoring solutions.
+Every time `DirmonJob` runs and scans the paths for new files it could be running
+on a new worker. If any server/worker is removed or shut down it will not stop
+`DirmonJob` since it will just run on another worker instance.
+
+There can only be one `DirmonJob` instance `queued` or `running` at a time. Any
+attempt to start a second instance will result in an exception.
+
+If an exception occurs while running `DirmonJob`, a failed job instance will remain
+in the job list for problem determination. The failed job cannot be restarted and
+should be destroyed if no longer needed.
+
+## Rails Configuration
+
+MongoMapper will already configure itself in Rails environments. `rocketjob` can
+be configured to use a separate MongoDB instance from the Rails application as follows:
+
+For example, we may want `RocketJob::Job` to be stored in a Mongo Database that
+is replicated across data centers, whereas we may not want to replicate the
+`RocketJob::SlicedJob` slices due to their sheer volume.
+
+```ruby
 config.before_initialize do
-  # If this environment has a separate Work server
   # Share the common mongo configuration file
   config_file = root.join('config', 'mongo.yml')
   if config_file.file?
-    if config = YAML.load(ERB.new(config_file.read).result)["#{Rails.env}_work"]
+    config = YAML.load(ERB.new(config_file.read).result)
+    if config["#{Rails.env}_rocketjob"]
       options = (config['options']||{}).symbolize_keys
-      # In the development environment the Mongo driver generates a lot of
-      # network trace log data, move its debug logging to :trace
-      options[:logger] = SemanticLogger::DebugAsTraceLogger.new('Mongo:Work')
+      options[:logger] = SemanticLogger::DebugAsTraceLogger.new('Mongo:rocketjob')
+      RocketJob::Config.mongo_connection = Mongo::MongoClient.from_uri(config['uri'], options)
+    end
+    # It is also possible to store the jobs themselves in a separate MongoDB database
+    if config["#{Rails.env}_rocketjob_work"]
+      options = (config['options']||{}).symbolize_keys
+      options[:logger] = SemanticLogger::DebugAsTraceLogger.new('Mongo:rocketjob_work')
       RocketJob::Config.mongo_work_connection = Mongo::MongoClient.from_uri(config['uri'], options)
-
-      # It is also possible to store the jobs themselves in a separate MongoDB database
-      # RocketJob::Config.mongo_connection = Mongo::MongoClient.from_uri(config['uri'], options)
     end
   else
     puts "\nmongo.yml config file not found: #{config_file}"
   end
 end
 ```

+For an example config file, `config/mongo.yml`, see [mongo.yml](https://github.com/rocketjob/rocketjob/blob/master/test/config/mongo.yml)
+
+## Standalone Configuration
+
+When running `rocketjob` in a standalone environment without Rails, the MongoDB
+connections will need to be set up as follows:
+
+```ruby
+options = {
+  pool_size: 50,
+  pool_timeout: 5,
+  logger: SemanticLogger::DebugAsTraceLogger.new('Mongo:Work'),
+}
+
+# For example when using a replica-set for high availability
+uri = 'mongodb://mongo1.site.com:27017,mongo2.site.com:27017/production_rocketjob'
+RocketJob::Config.mongo_connection = Mongo::MongoClient.from_uri(uri, options)
+
+# Use a separate database, or even server for `RocketJob::SlicedJob` slices
+uri = 'mongodb://mongo1.site.com:27017,mongo2.site.com:27017/production_rocketjob_slices'
+RocketJob::Config.mongo_work_connection = Mongo::MongoClient.from_uri(uri, options)
+```
+
 ## Requirements

 MongoDB V2.6 or greater. V3 is recommended

 * V2.6 includes a feature to allow lookups using the `$or` clause to use an index
+
+## Meta
+
+* Code: `git clone git://github.com/rocketjob/rocketjob.git`
+* Home: <https://github.com/rocketjob/rocketjob>
+* Bugs: <http://github.com/rocketjob/rocketjob/issues>
+* Gems: <http://rubygems.org/gems/rocketjob>
+
+This project uses [Semantic Versioning](http://semver.org/).
+
+## Author
+
+[Reid Morrison](https://github.com/reidmorrison) :: @reidmorrison
+
+## Contributors
+
+* [Chris Lamb](https://github.com/lambcr)