**Raka** is a **DSL** (Domain Specific Language) on top of **Rak**e for defining rules and running d**a**t**a** processing workflows. Raka is specifically designed for data processing with improved pattern matching, scopes, language extensions and lots of conventions to prevent verbosity.

## Installation

Raka is a library based on rake. Though rake is cross-platform, raka may not work on Windows since it relies on some shell facilities. Ruby is available for most \*nix systems including macOS, so the only task is to install raka:

```bash
gem install raka
```

## Quick Start

First create a file named `main.raka`, then import and initialize the DSL:

```ruby
require 'raka'

dsl = Raka.new(self,
  output_types: [:txt], input_types: [:txt]
)
```

Then the code below will define two simple rules:

```ruby
txt._.first50 = shell* "cat $< | head -n 50 > $@"
txt.sort = [txt.input] | shell* "cat $(dep0) | sort -rn > $@"
```

For testing let's prepare an input file named `input.txt`:

```bash
seq 1000 > input.txt
```

Invoke:

```bash
raka first50__sort.txt
```

Raka will read data from *input.txt*, sort the numbers in descending order and copy the first 50 lines to *first50__sort.txt*.

The workflow here is as follows:

1. Try to find *first50__sort.txt*: it does not exist.
2. Rule `txt._.first50` is matched, with `_` standing for `sort` (target `txt.sort.first50`).
3. Look for the input file *sort.txt*: it does not exist.
4. Rule with target `txt.sort` is matched.
5. This rule has no input but depends on the target `txt.input`.
6. File *input.txt* exists. Use it.
7. Run rule `txt.sort` and create *sort.txt*.
8. Run rule `txt.sort.first50` and create *first50__sort.txt*.

We may want to skip the sort step, and invoke:

```bash
raka first50__input.txt
```

Raka will read data from *input.txt* and copy the first 50 lines to *first50__input.txt*.

This illustrates some basic ideas but may not be particularly interesting. The following is a slightly more complex example which covers more features.
```ruby
require 'raka'

dsl = Raka.new(self,
  output_types: %i[csv pdf], input_types: %i[csv],
  lang: ['lang/shell', 'lang/python'])

py_template = <<~PYTHON
  import os.path
  import pandas as pd

  def write_variety(input, output, variety):
      print(variety)
      folder = os.path.dirname(output)
      if len(folder) > 0:
          os.makedirs(folder, exist_ok=True)
      df = pd.read_csv(input)
      df[df['class'] == variety].to_csv(output)
PYTHON
py.config script_template: py_template

groups = %i[virginica versicolor]
csv(groups.join('|')).iris = [csv.iris_all] | py* %(write_variety('$<', '$@', 'Iris-$(target_scope)'))
csv.iris_all = shell* %(curl -L https://datahub.io/machine-learning/iris/r/iris.csv > $@)

dsl.scope(*groups) do
  pdf.iris.plot['plot_(\S+)_(\S+)'] = py do |rask|
    <<-PYTHON
    import seaborn as sns
    from matplotlib import pyplot as plt
    df = pd.read_csv('#{rask.input}')
    ax = sns.displot(x=df['#{rask.captures.plot0}#{rask.captures.plot1}'])
    ax.set_axis_labels('#{rask.captures.plot0} #{rask.captures.plot1}', 'frequency')
    plt.savefig('#{rask.output}')
    PYTHON
  end
end

task figures: (groups.product(%w[sepal petal], %w[length width]).map do |info|
  "_out/#{info[0]}/plot_#{info[1]}_#{info[2]}__iris.pdf"
end)
```

In this example, we download a classical dataset named *iris.csv*, use Python code to extract two varieties, *virginica* and *versicolor*, and generate frequency histograms for both varieties. To invoke the script, we run in a terminal:

```bash
raka -j 8 -v figures
```

The option `-j 8` indicates we want to parallelize the tasks with at most 8 concurrent processes where possible. The option `-v` lets raka print detailed information so we can view the generated Python code. The tool will then act as follows:

1. Match `figures` with the last rule, which is a normal rake task.
2. The prerequisites include 8 figures, none of which exists yet. Take *_out/versicolor/plot_petal_length__iris.pdf* as an example from now on.
3. Rule `pdf.iris.plot['plot_(\S+)_(\S+)']...` is matched, where "petal" is bound to `plot0` and "length" is bound to `plot1`.
4. Neither of the 2 possible input files, *_out/versicolor/iris.csv* and *_out/versicolor/iris.pdf*, can be found. But the rule `csv(groups.join('|')).iris = ...` (`csv('virginica|versicolor').iris`) can be matched for the former, where the target scope is matched as `versicolor`.
5. The only dependency `csv.iris_all` is resolved as *_out/iris_all.csv*. The path does not contain `versicolor` since the target scope only applies to the target.
6. Rule `csv.iris_all` is matched without any dependencies.
7. The protocol `shell` replaces the automatic variable `$@` with `_out/iris_all.csv` to build a curl command and download the iris dataset from [datahub.io](https://datahub.io).
8. Now raka goes back to generate the output *_out/versicolor/iris.csv* by executing the code generated by the `python` protocol, which extracts rows where the class field equals "Iris-versicolor".
9. Raka goes back to generate the output *_out/versicolor/plot_petal_length__iris.pdf* by executing the code generated by the `python` protocol, which draws a histogram to depict the distribution of petal length.
10. Raka continues to generate plot files until all 8 figures exist.

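
The 8 prerequisites mentioned in step 2 come from the Cartesian product of groups, flower parts and measures in the `figures` task. As a quick check outside raka, the same expression can be evaluated in plain Ruby. This is only a sketch: `figure_paths` is a hypothetical helper of ours, and the block parameter names `group`, `part` and `measure` are chosen for readability.

```ruby
# Hypothetical helper (not part of raka): rebuilds the prerequisite list
# of the `figures` task from the example above.
def figure_paths(groups)
  groups.product(%w[sepal petal], %w[length width]).map do |group, part, measure|
    "_out/#{group}/plot_#{part}_#{measure}__iris.pdf"
  end
end

# One path per line: 2 groups x 2 parts x 2 measures.
puts figure_paths(%i[virginica versicolor])
```

The list contains *_out/versicolor/plot_petal_length__iris.pdf*, the file followed in the walkthrough above.
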

As an example, the generated Python code in step 9 is:

```python
import sys
import os.path
import pandas as pd

def write_variety(input, output, variety):
    print(variety)
    folder = os.path.dirname(output)
    if len(folder) > 0:
        os.makedirs(folder, exist_ok=True)
    df = pd.read_csv(input)
    df[df['class'] == variety].to_csv(output)

import seaborn as sns
from matplotlib import pyplot as plt
df = pd.read_csv('_out/versicolor/iris.csv')
ax = sns.displot(x=df['petallength'])
ax.set_axis_labels('petal length', 'frequency')
plt.savefig('_out/versicolor/plot_petal_length__iris.pdf')
```

The rule-based system, the strategy of executing tasks only when necessary, and the capable host language make it fairly easy to adjust experiments during exploration. For example, if we also want to apply the experiments to the *setosa* class, we can just change the line `groups = %i[virginica versicolor]` to `groups = %i[virginica versicolor setosa]`. The command `raka -j 8 -v figures` will generate 4 figures for the new class, without re-executing tasks for the other two classes.

## Why Raka

Data processing tasks can involve plenty of steps, each with its dependencies. Compared to bare Rake or the more classical Make, Raka offers the following advantages:

1. Advanced pattern matching and template resolving to define general rules and maximize code reuse.
2. Extensible and context-aware protocol architecture.
3. Multilingual. Other programming languages can be easily embedded.
4. Auto dependency and naming by conventions.
5. Scopes to ease comparative studies.
6. Terser syntax.

... and more.

Compared to more complex, GUI-based solutions (perhaps classified as scientific-workflow software) like Kepler, etc., Raka has the following advantages:

1. Lightweight and easy to set up, especially on platforms with ruby preinstalled.
2. Easy to deploy, version-control, back up or share workflows since the workflows are merely text files.
3. Easy to reuse modules or create reusable modules, which are merely plain ruby code snippets (or in other languages with protocols).
4. Expressive so a few lines of code can replace many manual operations.

## Documentation

### Conceptual Model

A raka rule consists of target, dependencies, actions and

### Syntax Definition

It is possible to use Raka with little knowledge of ruby / rake, though a minimal understanding is highly recommended. The formal syntax of a rule can be defined as follows (W3C EBNF form):

```ebnf
rule ::= target "=" (dependencies "|")* action ("|" post_target)*

target ::= ext "." ltoken ("." ltoken)*

dependencies ::= "[]" | "[" dependency ("," dependency)* "]"

dependency ::= rexpr | template

post_target ::= rexpr | template

rexpr ::= ext "." rtoken ("." rtoken)*

ltoken ::= word | word "[" pattern "]"

rtoken ::= word | word "(" template ")"

word ::= ("_" | letter) ( letter | digit | "_" )*

action ::= ("shell" | "r" | "psql" | "py" ) ("*" template | block ) | "run" block
```

The corresponding railroad diagrams are:

**rule**

![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/rule.svg)

**target**

![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/target.svg)

**dependencies**

![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/dependencies.svg)

**dependency**

![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/dependency.svg)

**post_target**

![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/post_target.svg)

**rexpr**

![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/rexpr.svg)

**ltoken**

![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/ltoken.svg)

**rtoken**

![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/rtoken.svg)

**word**

![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/word.svg)

**action**

![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/action.svg)

The definition is concise but several details are omitted for simplicity:

1. **BLOCK** and **HASH** are ruby's block and hash objects.
2. A **template** is just a ruby string, but with some placeholders (see the next section for details).
3. A **pattern** is just a ruby string which represents a regex (see the next section for details).
4. The listed protocols are merely what we offer now. The list can be greatly extended.
5. Nearly any concept in the syntax can be replaced by a suitable ruby variable.

### Pattern matching and template resolving

When defining a rule like `target = `, the left side represents a pattern and the right side contains specifications for extra dependencies, actions and some targets to create thereafter. When raking a target file, the left sides of the rules are examined one by one until a rule is matched. The regex-based matching process also supports named captures so that some variables can be extracted for use on the right side.

The specifications on the right side of a rule can contain templates. The "holes" in the templates will be filled by automatic variables and variables captured when matching the left side.

#### Pattern matching

To match a given _file_ with a `target`, the extension is matched first. The substrings of the file name between "\_\_" are mapped to tokens separated by `.`, in reverse order. After that, each substring is matched to the corresponding token or the regex in `[]`. For example, the rule

```ruby
pdf.buildings.indicator['\S+'].top['top_(\d+)']
```

can match "top_50\_\_node_num\_\_buildings.pdf". The logical process is:

1. The extension `pdf` matches.
2. The substrings and the tokens are paired and they all match:
   - `buildings ~ buildings`
   - `'\S+' ~ node_num`
   - `top_(\d+) ~ top_50`
3. Two levels of captures are made. First, 'node_num' is captured as `indicator` and 'top_50' is captured as `top`. Second, '50' is captured as `top0` since `\d+` is wrapped in parentheses and is the first group.

One can write the special token `_` to match any token.
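
The logical process above can be sketched in a few lines of plain Ruby. This is an illustration under our own assumptions, not raka's implementation: `match_target` is a hypothetical helper, a token is represented as a `[name, pattern]` pair with `nil` standing for a plain word, plain words are compared exactly, and patterns are anchored only at the start of the substring.

```ruby
# Illustrative sketch only: `match_target` is a hypothetical helper, not raka's API.
# tokens: array of [name, pattern]; a nil pattern means a plain word token.
def match_target(file, ext, tokens)
  stem, dot, file_ext = File.basename(file).rpartition('.')
  return nil if dot.empty? || file_ext != ext

  # Substrings between "__" map to the tokens in reverse order.
  parts = stem.split('__').reverse
  return nil unless parts.length == tokens.length

  captures = {}
  tokens.zip(parts).each do |(name, pattern), part|
    if pattern.nil?
      next if name == '_' # the special token matches anything
      return nil unless part == name

      next
    end

    # Assumption: patterns are matched from the start of the substring.
    m = Regexp.new("\\A(?:#{pattern})").match(part)
    return nil if m.nil?

    captures[name] = m[0] # first level: the token itself
    m.captures.each_with_index { |text, i| captures["#{name}#{i}"] = text } # second level
  end
  captures
end

RULE_TOKENS = [['buildings', nil], ['indicator', '\S+'], ['top', 'top_(\d+)']].freeze

captures = match_target('top_50__node_num__buildings.pdf', 'pdf', RULE_TOKENS)
puts captures.map { |key, value| "#{key}=#{value}" }.join(', ')
# prints: indicator=node_num, top=top_50, top0=50
```

The helper returns `nil` when the extension, the number of tokens or any single token fails to match, so a caller could try rules one by one until a non-`nil` result appears.
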
Since raka uses prefix matching, something like `token0['']` can also match any token and additionally capture it in `token0`. The end-of-line symbol `$` can be used to match the whole token, e.g., `token0['word$']` will not match `word_bench`.

#### Template resolving

In some places of `rexpr`, templates can be written instead of strings, so that they can represent different values at runtime. There are two types of variables that can be used in templates. The first is automatic variables, which are just like `$@` in Make or `task.name` in Rake. We even preserve some Make conventions for easier migration. All automatic variables begin with `$`. The possible automatic variables are:

| symbol                                    | description                                                                                                   |
| ----------------------------------------- | ------------------------------------------------------------------------------------------------------------- |
| \$@, \$(output)                           | the output file                                                                                               |
| \$<, \$(input)                            | the input file defined in the chained target                                                                  |
| \$^, \$(deps)                             | all dependencies concatenated by comma (including input)                                                      |
| \$(dep0), \$(dep1), ...                   | the i-th dependency (input is $(dep0))                                                                        |
| \$(input_stem)                            | stem of the input file                                                                                        |
| \$(output_stem)                           | stem of the output file                                                                                       |
| \$(func)                                  | the token added to input to generate output, e.g., stat in csv.data.stat                                      |
| \$(ext)                                   | extension of the output file                                                                                  |
| \$(scope)                                 | scope for current task, i.e. the common directory for output, input and dependencies                          |
| \$(target_scope)                          | the inline scope defined in target                                                                            |
| \$(target_scope0), \$(target_scope1), ... | the i-th captured value by inline scope defined in target                                                     |
| \$(rule_scope0), \$(rule_scope1), ...     | the i-th scope defined at rule level by nested calls of the dsl.scope function (i grows from the inside out)  |

The other type of variables are those captured during pattern matching, which can be referred to using `%{var}`.
In the example of the [pattern matching](#pattern-matching) section, `%{indicator}` will be replaced by `node_num`, `%{top}` will be replaced by `top_50` and `%{top0}` will be replaced by `50`. In that case, a template such as `'calculate top %{top0} of %{indicator} for $@'` will be resolved as `'calculate top 50 of node_num for top_50__node_num__buildings.pdf'`.

Templates can appear in various places. For dependencies and post targets, tokens with parentheses can contain templates, like `csv._('%{indicator}')`. The symbol of a token with parentheses is of no use and is generally replaced with an underscore. It is also possible to write a template literal directly, e.g. `'%{indicator}.csv'`. Templates can also be applied in actions, but this depends on the implementation of each protocol.

### Actions and protocols

Raka invokes **actions** when the input and all dependencies are present. Generally, users define an action that generates the output. To maximize flexibility, users can feed code in an arbitrary programming language to the corresponding **protocol**. The protocol will then transform and execute the code. Raka natively supports the host (ruby) protocol and several foreign protocols including shell, python, psql, and r. The host protocol is special and just executes the given ruby block. All other protocols accept either a templated code string given with the asterisk operator or a block producing a templated code string. The following illustrates examples for each protocol.
In the host protocol and the block versions of other protocols, a raka task (the *rask* variable) is provided, which offers the following properties:

| property              | description                                                                          |
| --------------------- | ------------------------------------------------------------------------------------ |
| output                | the output file                                                                      |
| input                 | the input file defined in the chained target                                         |
| deps                  | the dependencies (input is deps[0])                                                  |
| func                  | the token added to input to generate output, e.g., stat in csv.data.stat             |
| ext                   | extension of the output file                                                         |
| captures              | captured text during pattern matching, key-value                                     |
| scope                 | scope for current task, i.e. the common directory for output, input and dependencies |
| target_scope          | the inline scope defined in target                                                   |
| target_scope_captures | captured values by inline scope defined in target                                    |
| rule_scopes           | the scopes defined at rule level by nested calls of the dsl.scope function           |

```ruby
require 'raka'
require 'csv'
require 'ostruct' # needed for OpenStruct below

dsl = Raka.new(
  self,
  output_types: %i[table view csv],
  lang: ['lang/psql', 'lang/shell', 'lang/python', 'lang/r']
)

csv.iris_all = shell* %(curl -L https://datahub.io/machine-learning/iris/r/iris.csv > $@)

# host (ruby) protocol
csv.rb_out = [csv.iris_all] | run do |rask|
  # CSV.filter writes every row regardless of the block's result,
  # so the rows are selected explicitly here.
  table = CSV.read(rask.deps[0], headers: true)
  CSV.open(rask.output, 'w') do |out|
    out << table.headers
    table.each { |row| out << row if row['class'] == 'Iris-versicolor' }
  end
end

# python protocol
csv.py_out = [csv.iris_all] | py* %(
  import pandas as pd
  df = pd.read_csv('$(dep0)')
  df[df['class'] == 'Iris-versicolor'].to_csv('$@')
)

# python protocol (block)
csv.py_out2 = [csv.iris_all] | py do |rask|
  <<-PYTHON
  import pandas as pd
  df = pd.read_csv('#{rask.deps[0]}')
  df[df['class'] == 'Iris-versicolor'].to_csv('#{rask.output}')
  PYTHON
end

# r protocol
csv.r_out = [csv.iris_all] | r* %(
  df <- read.csv("$(dep0)")
  write.csv(df[(df$class == "Iris-versicolor"),], file="$@")
)

# r protocol (block)
csv.r_out2 = [csv.iris_all] | r do |rask|
  <<-R
  df <- read.csv("#{rask.deps[0]}")
  write.csv(df[(df$class == "Iris-versicolor"),], file="#{rask.output}")
  R
end

# shell protocol: keep the header line, then the matching rows
csv.shell_out = [csv.iris_all] | shell* %(
  cat <(head -1 $(dep0)) <(grep "Iris-versicolor" $(dep0)) > $@
)

# shell protocol (block)
csv.shell_out2 = [csv.iris_all] | shell do |rask|
  "cat <(head -1 #{rask.deps[0]}) <(grep 'Iris-versicolor' #{rask.deps[0]}) > #{rask.output}"
end

# psql protocol
pg = OpenStruct.new(
  user: 'postgres',
  port: 5433,
  host: '127.0.0.1',
  db: 'postgres',
  password: 'postgres'
)
psql.config conn: pg, create: :mview

table.iris_all = [csv.iris_all] | psql(create: nil)* %(
  DROP TABLE IF EXISTS $(output_stem);
  CREATE TABLE $(output_stem) (
    sepallength float,
    sepalwidth float,
    petallength float,
    petalwidth float,
    class varchar
  );
  \\COPY $(output_stem) FROM '$(dep0)' CSV HEADER;
)

table.psql_out = [table.iris_all] | psql* %(
  SELECT * FROM $(dep0_stem) WHERE class='Iris-versicolor'
)

# psql protocol (block)
table.psql_out2 = [table.iris_all] | psql do |rask|
  <<-SQL
  SELECT * FROM #{dsl.stem(rask.deps[0])} WHERE class='Iris-versicolor'
  SQL
end
```

### Initialization and options

These APIs are bound to an instance of the DSL; you can create the object at the top:

```ruby
dsl = DSL.new(, )
```

The argument `` should be the *self* of a running Rakefile. In most cases you can directly write:

```ruby
dsl = DSL.new(self, )
```

Two important fields of `options` are `output_types` and `input_types`. For each item in `output_types`, you will get a global function to bootstrap a rule. For example, with

```ruby
dsl = DSL.new(self, { output_types: [:csv, :pdf] })
```

you can write rules like:

```ruby
csv.data = ...
pdf.graph = ...
```

which will match */data.csv* and */graph.pdf*

The `input_types` option determines the strategy for finding inputs. All possible input types are tried when resolving an input file in a chained target.
For example, raka will try to find both *numbers.csv* and *numbers.table* for a rule like `table.numbers.mean = …` if `input_types = [:csv, :table]`.

### Scope

Scopes define constraints which help users create rules more precisely. A scope generally refers to a folder and can appear in several places.

**Task scope** is the scope when executing a task, a.k.a. **scope**. When a rule is matched given a desired output, a task is generated and its scope is the common folder of the output and all dependencies. For example, a rule `csv.out = [csv.in] | ...` can be matched given *out/out.csv*, and the task scope is resolved as *out/*. The task will thus search for *out/in.csv* as its dependency.

**Rule scope** is the scope that restricts the possible task scopes, given by `Raka::scope`. In the following example, the rule scopes are

**Target scope.**

## Rakefile Template

## Write your own protocols

## Compare to other tools

Raka borrows some ideas from Drake but not much (currently mainly the name "protocol"). Briefly, we have different visions and perhaps different suitable scenarios.