# Advance

Advance is a framework for building data transformation pipelines. Advance allows you to concisely script your data transformation process and to incrementally build and easily debug that process. Each data transformation is a step, and the results of each step become the input to the next step. The artifacts of each step are preserved in directories named after the step. When the results of a step are not right, just adjust the Advance script, delete the step directory with the bad data, and rerun the script. Previously successful steps are skipped, so the script moves quickly to the incomplete step. Similarly, when steps fail, the results are preserved in directories prefixed with "tmp_". This isolates incomplete step data and ensures that the step is re-processed when the problem is resolved.

Your project utilizing Advance contains a primary Ruby script that imports Advance and includes your data transformation steps, which we will call "your Advance script." Each step describes a command to be run on your data. These commands can be prepackaged Advance scripts, Unix commands (like `split`, `cut`, etc.), or scripts/commands that you create in whatever language is convenient for you. Advance invokes these scripts one by one, much like you would at the command line. Advance logs the exact command that is invoked so that you can run it yourself to check the output manually and to debug failures.

Advance steps are composed of a step processing type function, followed by a slug for the step, followed by the command or script. For example:

```ruby
single :unzip_7z_raw_data_file, "7z x {previous_file}"
single :split_files, "split -l 10000 -a 3 {previous_file} gps_data_"
multi :add_local_time, "cat {file_path} | add_local_time.rb timestamp local_time US/Pacific > {file}"
# ...
```

The step processing functions are `single` and `multi`. `single` applies the command to the last output, which should be a single file.
`multi` speeds processing of multiple files by doing work in parallel (via the [TeamEffort gem][1]).

[1]: https://rubygems.org/gems/team_effort

> _[Advance][2]: To help the progress of (something); to further._

[2]: https://en.wiktionary.org/wiki/advance

## Installation

Advance is meant to augment a standalone Ruby script. The advance gem needs to be available to your instance of Ruby. Here are two ways to make Advance available to your script:

* simply install the gem:

      $ gem install advance

* install [bundler][3], and add Advance to your `Gemfile`:

[3]: https://rubygems.org/gems/bundler

```ruby
source "https://rubygems.org"

gem "advance"
# other gems...
```

## Usage

You will likely need multiple supporting scripts. Ideally you will create your Advance script and your supporting scripts in a single directory.

Creating your Advance script is an incremental process. Start with a single step, run the script, and check the results. When the output is as you expect, add the next step. After you add a step to your script, you can simply rerun the script. Previously successful steps are skipped and your script moves on to the first incomplete step. When the results are not what you expect, just delete the step directory with the bad data, adjust your step, and rerun. Advance will rerun that step and all subsequent steps.

Steps have 3 components:

* a step processing type (`single` or `multi`)
* a slug describing the step (as a Ruby symbol)
* the command that transforms the data

Advance adds the bin dir of the Advance gem to PATH, so that you can invoke the supporting Advance scripts in your pipeline without specifying the full path of the script. Advance also adds the path of _your Advance script_ to PATH so that you can invoke scripts in the same directory as your main script without specifying the full path of the script. Of course, you can invoke any script if the path to the script is fully specified or its directory is already on PATH.

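
The skip-and-rerun behavior described in this section can be pictured with a small sketch. This is not Advance's actual code: `run_step_once` is a hypothetical helper, and the directory naming (the bare slug for a finished step, `tmp_` plus the slug for one in progress) and the cleanup of a leftover `tmp_` directory before a retry are assumptions made for illustration.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper, not part of the Advance API. It works in a
# "tmp_"-prefixed directory and renames it to the step directory only
# when the block finishes without raising.
def run_step_once(slug, root: Dir.pwd)
  final_dir = File.join(root, slug.to_s)
  return :skipped if Dir.exist?(final_dir) # step already succeeded

  tmp_dir = File.join(root, "tmp_#{slug}")
  FileUtils.rm_rf(tmp_dir)                 # assumption: retry from a clean slate
  FileUtils.mkdir_p(tmp_dir)

  yield tmp_dir                            # the step writes its output here; may raise
  FileUtils.mv(tmp_dir, final_dir)         # reached only when the block succeeded
  :ran
end

# Demo in a throwaway directory so nothing is left behind.
Dir.mktmpdir do |root|
  write_part = ->(dir) { File.write(File.join(dir, "part_aaa"), "data") }
  p run_step_once(:split_files, root: root, &write_part) # prints :ran
  p run_step_once(:split_files, root: root, &write_part) # prints :skipped
end
```

In this sketch, deleting the `split_files` directory makes the next call run the step again, and a block that raises leaves only `tmp_split_files` behind, so the step is never mistaken for complete.
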
**Specifying Script Input and Output**

Since your command is transforming data, you need a way to specify the input file or directory and the output file name. Advance provides a few tokens that can be inserted in the command string for this purpose:

* **`{previous_file}`** indicates the output file from the previous step when the previous step produced a single output file. In the first step, it indicates the initial input file, which Advance looks for in the current working directory.
* **`{file_path}`** indicates an output file from the previous step when the previous step generated multiple output files and the current step is a `multi` step.
* **`{file}`** indicates an output file name, which is the basename from `{file_path}`. Commands often process multiple files from previous steps, generating multiple output files. Those output files are placed in the step directory.
* **`{previous_dir}`** indicates the directory of the previous step.

**Example Script**

```ruby
#!/usr/bin/env ruby

require "advance"
include Advance

ensure_bin_on_path # ensures the directory for this script is on
                   # the path so that related scripts can be referenced
                   # without paths

single :unzip_7z_raw_data_file, "7z x {previous_file}" # uses 7z to inflate a file in the current dir
single :split_files, "split -l 10000 -a 3 {previous_file} gps_data_" # split the file
multi :add_local_time, "cat {file_path} | add_local_time.rb timestamp local_time US/Pacific > {file}" # adds a local_time column to a csv
```

**Running Your Script**

When running your pipeline, it is helpful to have a directory that contains only the single, initial file.

1. Move to your data directory with your single initial file.
2. Invoke your script from there.

## Contributing

We ♥️ contributions!

Found a bug? Ideally submit a pull request. And if that's not possible, make a bug report.

Did you create a data transformation script? Please consider adding it to the script collection in Advance by submitting a pull request.

Do you find the Advance documentation lacking? Please help us improve it.

Can you translate the Advance documentation to your language?

Bug reports and pull requests are welcome on GitHub at https://github.com/doctorjane/advance.

## License

The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).