# Ferry

Ferry is a data migration and data manipulation tool that seeks to quickly and easily reduce overhead when dealing with big data problems.

## TO-DO

- [ ] Refactoring before public release
- [x] Define action-items for refactor
- [x] Provide working example(s) of using ferry
- [ ] Public release fine-tuning
- [ ] Tests
- [ ] Testing input for migrate method (max_workers, batch_size)
- [ ] Testing that there is an ActiveRecord::Relation object being passed to find_in_batches
- [ ] Migration Scenarios - dummy class migration
- [ ] Refactor logging logic into Logger class
- [x] Initial revision
- [ ] Review

## Installation

Add this line to your application's Gemfile:

    gem 'ferry'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install ferry

## Usage

Usage is pending. See the examples below, or submit PRs with your ideas.

## Example(s)

###### 3 September 2014

Use Case Ideas

Note: Demo app can initially function with RoR and Postgres.

Manipulation Use Cases

- CRUD for Columns
- Copy & Paste Columns
- CRUD for Rows
- Understanding relationships between generating migrations and migration files in place

Migration

- Exporting data to various file formats (.csv, .sql, .yml)
- Importing data from various file formats
- Migrating data to third party hosts (Amazon S3, Oracle)
- Migrating data to a different database

Important things to consider and remember

- Rolling back on errors / mishaps during migrations and manipulations
- Host documentation site via GitHub pages

###### 30 August 2014

Below is an initial implementation of how ferry will work:

```
# encoding: UTF-8
require 'consortium'

task :load_wm_design do
  class WmDesign < Design
    self.table_name = :wm_design
  end
end

namespace :consortium_example do
  desc "writes design cigs to individual xml files using consortium"
  task :write_local => [:load_wm_design] do
    hostname = Socket.gethostname
    FileUtils.mkdir "consortium_migration_#{hostname}" unless Dir["consortium_migration_#{hostname}"].present?
    homedir = "consortium_migration_#{hostname}"
    range = Design.where("savedate > ?", 15.hours.ago.strftime("%d.%m.%Y %H").to_datetime)
    consortium_runtime = Benchmark.measure do
      range.migrate({max_workers: 4, batch_size: 500}) do |collection|
        collection.each do |design|
          cons_place_design_content_in_batch(design, homedir, design.composite_id)
        end
      end
    end
    puts consortium_runtime
  end

  private

  # Writes one design to disk; on failure, marks the file as failed and re-raises.
  def cons_place_design_content_in_batch(design, homedir, composite_id)
    create_xml_file(homedir, composite_id, design)
  rescue Exception => e
    path = "#{homedir}/#{composite_id}.xml"
    # Only rename if the file was actually created before the error.
    File.rename(path, "#{path}.failed") if File.exist?(path)
    raise e
  end

  # Writes the design content to <homedir>/<composite_id>.xml and sets the
  # file's mtime to the design's last update (or creation) time.
  def create_xml_file(homedir, composite_id, design)
    updated_at = (design.updated_at || design.created_at).to_time
    path = "#{homedir}/#{composite_id}.xml"
    File.open(path, 'w') { |file| file.puts design.content }
    FileUtils.touch path, :mtime => updated_at
  end
end
```

###### 29 July 2014

Version 0.0.1 is functional with the rake task defined here: https://github.com/customink/design_content_migration/blob/master/lib/tasks/ferry_example.rake#L10

Please manually install ferry from your locally cloned repo ...

```
git clone git@github.com:customink/ferry.git
cd ferry
gem build ferry.gemspec
gem install ferry
```

add it to your app's Gemfile

```
gem 'ferry'
```

and then

```
bundle install
```

as it has not been pushed to rubygems.org yet.

Tests - Coming soon to an editor near me!

###### 28 July 2014

Ferry should not support Oracle.

###### 25 July 2014

After a few more reviews with @metaskills, @gilr00y, @jdlehman, and @danielwheeler1987, Ferry will extend ActiveRecord with a "migrate" method (the search for a better name is still in progress). From there, we will pass the same relation to find_in_batches and hand each batch to a worker, which will plow through it via a yield call from the task.
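
The flow described above (batch a relation, then yield each batch to a worker) can be sketched in plain Ruby. This is only an illustration under stated assumptions: `MigrateSketch` is a hypothetical name, an Array with `each_slice` stands in for an ActiveRecord relation with `find_in_batches`, and threads stand in for workers. It is not Ferry's actual implementation.

```ruby
# Hypothetical sketch: a plain-Ruby stand-in for the proposed "migrate" method.
# Array#each_slice plays the role of ActiveRecord's find_in_batches, and
# threads play the role of workers; none of this is Ferry's actual API.
module MigrateSketch
  # Splits +records+ into batches and yields each batch to the caller's block,
  # running at most +max_workers+ batches concurrently.
  def self.migrate(records, max_workers: 2, batch_size: 3, &block)
    raise ArgumentError, "a block is required" unless block
    raise ArgumentError, "max_workers must be positive" unless max_workers.positive?
    raise ArgumentError, "batch_size must be positive" unless batch_size.positive?

    # Queue up every batch before the workers start pulling from it.
    queue = Queue.new
    records.each_slice(batch_size) { |batch| queue << batch }

    workers = Array.new(max_workers) do
      Thread.new do
        loop do
          batch = begin
            queue.pop(true) # non-blocking pop; raises ThreadError when empty
          rescue ThreadError
            break
          end
          block.call(batch)
        end
      end
    end
    workers.each(&:join)
    nil
  end
end

# Usage: collect the ids seen by the block, guarded by a Mutex.
seen = []
lock = Mutex.new
MigrateSketch.migrate((1..10).to_a, max_workers: 2, batch_size: 4) do |batch|
  lock.synchronize { seen.concat(batch) }
end
puts seen.sort.inspect
```

Validating `max_workers` and `batch_size` up front mirrors the input checks listed in the TO-DO; a real version would also need to manage database connections per worker, which this sketch leaves out.
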

Tests will include validating the data passed into the worker (log) and checking that an ActiveRecord::Relation is being passed to find_in_batches.

###### 23 July 2014

After a few chats with @gilr00y and @jdlehman, Ferry may extend ActiveRecord with a "migrate" method we could call on an ActiveRecord object. From there, that object would call an Engine instance with the appropriate fields to kick off the actual data migration. There is some logic and layer duplication between the Engine class and the "migrate" method that extends ActiveRecord. We are still working out how to concisely write the logic that manages forking connections and engine init calls.

```
require "ferry/version"
require 'models/engine'
require 'models/logger'

module Ferry
  class ActiveRecord
    def self.migrate(&block)
      yield
    end
  end
end
```

This implementation should be able to run something like this ...

```
# Design sketch only: Engine and worker are not final APIs.
engine = Engine.new(
  # start_id and end_id; .first.id is used because a relation has no single id
  Design.where("savedate > ?", 6.months.ago.strftime("%d.%m.%Y %H").to_datetime).first.id,
  Design.where("savedate > ?", 3.months.ago.strftime("%d.%m.%Y %H").to_datetime).first.id,
  100_000,     # chunk_size
  1_000,       # batch_size
  "log/ferry"  # log
)

Design.where("savedate > ?", 130.hours.ago.strftime("%d.%m.%Y %H").to_datetime).migrate(
  engine.run do |start_id, end_id, chunk_size, batch_size, log|
    worker.run do |start_id, chunk_size, batch_size, log|
      worker_end_id = start_id + chunk_size - 1
      # AND is portable SQL; && is not accepted by every database
      Design.where("id >= ? AND id <= ?", start_id, worker_end_id).find_in_batches(batch_size: batch_size) do |batch|
        # move and manipulate data as you please
      end
      start_id += batch_size
    end
  end
)
```

###### 22 July 2014

After installing ferry on your local machine or bundling it from your Gemfile, define your chunker in your migration task as follows:

```
require 'ferry'

namespace :example do
  task "my_migration_task" do
    ferry = Engine.new(
      :max_workers => 8,               # number of workers
      :start_id    => Model.first.id,  # where we are starting, e.g. 2910
      :end_id      => Model.last.id,   # where we are ending, e.g. 8190
      :chunk_size  => 42,              # size of the chunks that workers will process
      :working_dir => "path/to/working_dir"
    )
    ferry.run do |start_id, chunk_size, log|
      begin
        work = Model.select(:id).where("? <= id and id < ?", start_id, start_id + chunk_size)
        rows_to_process = work.count
        log.puts("rows_to_process: #{rows_to_process}")
        work.find_in_batches(:batch_size => 1_000) do |batch|
          # doing things and logging stuff as you please ...
        end
      rescue Exception => e
        # The failing row is not known here, so log the chunk's starting id.
        log.puts "Broken in chunk starting at id #{start_id}"
        raise e
      end
    end
  end
end
```

## Contributing

1. Fork it ( https://github.com/[my-github-username]/ferry/fork )
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request