= Piglet Piglet is a DSL for writing Pig Latin scripts in Ruby: a = load 'input' b = a.group :c store b, 'output' The code above will be translated to the following Pig Latin: relation_2 = LOAD 'input'; relation_1 = GROUP relation_2 BY c; STORE relation_1 INTO 'output'; Piglet aims to look like Pig Latin while allowing for things like loops and control of flow that are missing from Pig. I started working on Piglet out of frustration that my Pig scripts started to be very repetitive. Pig lacks control of flow and mechanisms to apply the same set of operations on multiple relations. Piglet is my way of adding those features. == Usage It can be used either as a command line tool for translating a file of Piglet code into Pig Latin, or you can use it inline in a Ruby script: === Command line usage If piggy.rb contains store(load('input')), 'output') then running piglet piggy.rb will output relation_1 = LOAD 'input'; STORE relation_1 INTO 'output'; === Programmatic interface require 'piglet' @piglet = Piglet::Interpreter.new @piglet.interpret do store(load('input'), 'output') end puts @piglet.to_pig_latin or puts Piglet::Interpreter.new { store(load('input'), 'output') }.to_pig_latin or @interpreter = Piglet::Interpreter.new puts @interpreter.to_pig_latin { store(load('input'), 'output') } will print relation_1 = LOAD 'input'; STORE relation_1 INTO 'output'; to standard out. == Examples of what it can do a = load 'input', :schema => [:a, :b, :c] b = a.group :c c = b.foreach { |r| [r[0], r[1].a.max, r[1].b.max] } store c, 'output' will result in the following Pig Latin: relation_3 = LOAD 'input' AS (a, b, c); relation_2 = GROUP relation_3 BY c; relation_1 = FOREACH relation_2 GENERATE $0, MAX($1.a), MAX($1.b); STORE relation_1 INTO 'output'; == Syntax There are two kinds of operators in Piglet: load & store operators, and relation operators. Load & store are called as functions with no receiver, like this: load('input') store(a, 'out') describe(b) illustrate(c) dump(d) explain(e) and those are also all the load & store operators. They mirror the Pig Latin operators +LOAD+, +STORE+, +DESCRIBE+, +ILLUSTRATE+, +DUMP+ and +EXPLAIN+. Relation operators are called as methods on relations. Relations are created by the +load+ operator, and can be stored in regular variables: a = load('input', :schema => [:x, :y, :z]) b = a.group(:x) store(b, 'out') Unlinke Pig Latin, operators can be chained: a = load('input', :schema => [:x, :y, :z]) b = a.sample(3).group(:x) store(b, 'out') In fact, a whole script can be written without using variables at all: store(load('input', :schema => [:x, :y, :z]).sample(3).group(:x)) The relation operators are meant to be close to the Pig Latin syntax, but there are obvious limitations and tradeoffs, see the documentation for the Piglet::Relation::Relation mixin for syntax examples. === load When loading a relation you can specify the schema by passing the :schema option to +load+. The syntax of the schema specification is not perfect at this time: if you don't care about types you can pass an array of symbols or strings, like this: load('input', :schema => %w(a b c d)) load('input', :schema => [:a, :b, :c, :d]) But if you want types, then you need to pass an array of arrays, where the inner arrays contain the field name and the field type: load('input', :schema => [[:a, :chararray], [:b, :long]]) This is a bit inconvenient. I would like to use a hash, like this: {:a => :chararray, :b => :long}, but since the order of the keys isn't guaranteed in Ruby 1.8, it's not possible. I'm working on something better. If you need to specify tuples or bags in a schema you can use the special syntax [:field_name, :tuple, [[:a, :int], [:b, :float]]], i.e. the field name, the field type (:tuple or :bag) and the schema of the tuple or bag. See “Types & schemas” below for more info. You can also specify a load function by passing the :using option: load('input', :using => :pig_storage) load('input', :using => 'MyOwnFunction') Piglet knows to translate :pig_storage to PigStorage, as well as the other pre-defined load and store functions: :binary_serializer, :binary_deserializer, :bin_storage, :pig_dump and :text_loader. == store, dump, describe, etc. +store+ works similarily to +load+, but it takes a relation as its first argument, and the path to the output as second. It too takes the option :using, with the same values as +load+. +dump+, +describe+, +illustrate+ and +explain+ all take a relation as sole argument. +explain+ can be called without argument (see the Pig Latin manual for what +EXPLAIN+ without argument does). == +cross+, +distinct+, +limit+, +sample+, +union+ These operators are the most straightforward in Piglet. To do the equivalent of b = DISTINCT a; you write b = a.distinct in Piglet. More examples: a.cross(b) # => CROSS a, b a.limit(4) # => LIMIT a 4 a.sample(0.1) # => SAMPLE a 0.1 a.union(b, c) # => UNION a, b, c you get the pattern. == +order+ +order+ works more or less like the operators above, with some extra features: to specify ascending or descending order you can pass an array with two elements instead of a field name -- the first element is the field name, the second :asc or :desc: a.order(:x, [:y, :desc]) # => ORDER a BY x, y DESC == +group+ In light of the above +group+ works exactly as you would expect: a.group(:b) becomes GROUP a BY b. You can specify which fields to group by either by passing them as separate arguments, or by passing an array as the first parameter. These statements are equivalent: a.group(:x, :y) a.group([:x, :y]) a.group(%w(x y)) == +filter+ +filter+ works a little bit different from the operators discussed above. It takes a block in which you specify the arguments to the operator. The block receives a parameter which is the relation that the operation is performed on -- this may sound odd, but since operations can be chained in Piglet there are situations where you otherwise wouldn't have a reference to the relation, e.g. a.limit(4).filter { |r| … }. The thing that sets +filter+ apart from the operators above is it needs to support field expressions. For example the x == 3 in FILTER a BY x == 3. Piglet supports simple field operators like == or % quite transparently, but more complex expressions can be less elegant, see ”Limitations” below. For example a.filter { |r| r.x == 3 } works fine, but a.filter { |r| r.x != 3 } doesn't (it has to do with how Ruby parses expressions, unfortunately). To do not equals you can either do r.x.ne(3) or (r.x == 3).not. See “Limitations” below for more info on field expressions. The way field expressions are done in Piglet is that you ask the relation (the object passed to the block) for a field, and then call methods on that object to build up an expression. Some Ruby operators can be used, but other operations are only available as methods, again, see “Limitations” below for a complete reference. a.filter { |r| r.x == 3 } # => FILTER a BY x == 3 a.filter { |r| (r.x > 4).or(r.y < 2) } # => FILTER a BY x > 4 OR r < 2 == +foreach+ FOREACH … GENERATE is probably the most complex operator in Pig Latin. Piglet tries its best to support most of it, but there are things that are still missing -- see “Limitations”. Most things should work without problems though. The operator in Piglet is called simply +foreach+, and just as +filter+ it takes a block, which receives the relation as a parameter. In contrast to +filter+, +foreach+ should return an array of field references and expressions. This array describes the schema of the new relation. The expressions used in +foreach+ are usually not the same as those used in +filter+, although all are of course available in both situations. In +foreach+ common operators to use are the aggregate functions (called “eval functions” in the Pig Latin manual) like +MAX+, +MIN+, +COUNT+, +SUM+, etc. In Piglet these are method calls on field objects. Let's look at an example (I like to use lots of whitespace and newlines for +foreach+ operations, because otherwise it gets very messy): a.foreach do |r| [ r.x.max, r.y.min, r.z.count, r.w + r.q ] end this would be translated into: FOREACH a GENERATE MAX(x), MIN(y), COUNT(z), w + q; pretty straight forward. What if you want to give the fields of the new relation proper names? In Pig Latin you would write MAX(x) AS (x_max), and in Piglet you can write r.x.max.as(:x_max). This is such a common thing to do that I'm thinking of adding some kind of feature that automatically adds AS clauses where appropriate, but it's not there yet. +foreach+ is a very complex beast, and this is just an overview, so I'll just give you a few more examples that are not obvious: Literal values can be specified using +literal+: a.foreach { |r| [literal('hello').as(:hello)] } # => FOREACH a GENERATE 'hello' AS hello Binary conditionals, a.k.a. the ternary operator are supported through +test+ (unfortunately the Ruby ternary operator can't be overridden): a.foreach { |r| [test(r.x == 3, r.y, r.z)] } # => FOREACH a GENERATE (x == 3 ? y : z) The first argument to +test+ is the test expression, the second is the if-true expression and the third is the if-false expression. == +split+ The syntax of +split+ shouldn't be surprising if you've read this far, but there's perhaps some details that aren't obvious. To split a relation into a number of parts you call +split+ on the relation and pass a block in which you specify the expressions describing each shard. Just as with +filter+ and +foreach+ the block receives the relation as an argument. +split+ returns an array containing the relation shards and you can use parallel assignment to make it look really nice: b, c = a.split { |r| [r.x > 2, r.y == 3] } # => SPLIT a INTO b IF x > 2, c IF y == 3 == +cogroup+ & +join+ Thes two operators are the different ways to join relations in Pig Latin. They take the relations to join, and the keys to join them. In Piglet you specify the join expression using a hash: the keys are the relations, and the values are the fields on which to join: a.join(b => :y, a => :x) # => JOIN b BY x, a BY y a.cogroup(b => :y, a => :x) # => COGROUP b BY x, a BY y Notice that you have to specify the +a+ relation twice: you call the method on it, but you also have to pass it as a key to the join description. I'm working on an alternative syntax. If you're joining on more than one field, simply pass an array of field names: a.join(b => [:y, :z], a => [:x, :w]) # => JOIN b BY (y, z), a BY (x, w) I'm not absolutely sure that it is legal to join or cogroup on more than one field, the Pig Latin manual isn't entirely clear on this, but Piglet supports it for the time being. COGROUP lets you specify INNER and OUTER for join fields, and in Piglet you can do this by passing :inner or :outer as the last element in the array that is the value in the join description: a.cogroup(b => [:y, :inner], a => [:z, :outer]) # => COGROUP b BY y INNER, a BY z OUTER == :parallel For some operators in Pig Latin you can specify the PARALLEL keyword to tell Pig how many reducers For the +cogroup+, +cross+, +distinct+, +group+, +join+ and +order+ you can pass :parallel => n as the last parameter to specify the amount of parallelism, e.g. a.group(:x, :y, :z, :parallel => 5). === Putting it all together Let's look at a more complex example: students = load('students.txt', :schema => [%w(student chararray), %w(age int), %w(grade int)]) top_acheivers = students.filter { |r| r.grade == 5 } name_and_age = top_acheivers.foreach { |r| [r.student.as(:name), r.age] } name_by_age = name_and_age.group(:age) count_by_age = name_by_age.foreach { |r| [r[0].as(:age), r[1].name.count.as(:count)]} store(count_by_age, 'student_counts_by_age.txt', :using => :pig_storage) We load the file students.txt as a relation with three fields: student, a string, age an integer and grade another integer. Next we filter out the top acheivers with +filter+. +filter+ takes a block and that block gets a referece to the relation (the one +filter+ was called on), the result of the block will be the filter expression, in this case it's grade == 5. When we have the top acheivers we want to make a projection to remove the grades field, since we will not use it in the next set of calculations. In Pig Latin this is done with FOREACH … GENERATE, which is just +foreach+ in Piglet. Like +filter+, +foreach+ takes a block that gets a reference to the relation. The result of the block should be an array of expressions, and in this case it's [r.student.as(:name), r.age], which means the student field from the relation, renamed to "name" and the age field. The resulting relation will have two fields: "name" and "age". On the next line we group the relation by the age field, and following that we do another projection, this time on the grouped relation. Remember that when doing a grouping in Pig you get a relation that in this case looks like this: (group:int, name_by_age:{name:chararray, age:int}). In the block passed to +foreach+ we use r[0] and r[1] to reference the first and second fields of name_by_age, equivalent to $0 and $1 in Pig Latin. In Pig Latin you could also have used the names group and name_by_age, but for a number of reasons you can't do that in Piglet (r.group unfortunately refers to the group method, and the relation isn't actually called name_by_age after Piglet has translated the code into Pig Latin). The expression r[1].name.count.as(:count) means take the "name" field from the relation in the second field of the relation ($1.name), run the COUNT operator on it, and rename it count, i.e. COUNT($1.name) AS count. Finally we store the result in a file called student_counts_by_age.txt, using PigStorage (which isn't strictly necessary to specify since it's the default. If you have a custom method you can pass its name as a string instead of :pig_storage). Piglet will translate this into the following Pig Latin: relation_5 = LOAD 'students.txt' AS (student:chararray, age:int, grade:int); relation_4 = FILTER relation_5 BY grade == 5; relation_3 = FOREACH relation_4 GENERATE student AS name, age; relation_2 = GROUP relation_3 BY age; relation_1 = FOREACH relation_2 GENERATE $0 AS age, COUNT($1.name) AS count; STORE relation_1 INTO 'student_counts_by_age.txt' USING PigStorage; === Going beyond Pig Latin My goal with Piglet was to add control of flow and reuse mechanisms to Pig, so I'd better show some of the things you can do: input = load('input', :schema => %w(country browser site visit_duration)) %w(country browser site).each do |dimension| grouped = input.group(dimension).foreach do |r| [r[0], r[1].visit_duration.sum] end store(grouped, "output-#{dimension}") end We load a file that contains an ID field, three dimensions (country, browser and site) and a metric: the duration of a visit. This could be data from a the logs of a set of websites, or an ad server. What we want to do is to sum the the visit_duration field for each of the three dimensions. This would be a big tedious in Pig Latin: input = LOAD 'input' AS (country browser site visit_duration); by_country = GROUP input BY country; by_browser = GROUP input BY browser; by_site = GROUP input BY site; sum_by_country = FOREACH by_country GENERATE $0, SUM($1.visit_duration); sum_by_browser = FOREACH by_browser GENERATE $0, SUM($1.visit_duration); sum_by_site = FOREACH by_site GENERATE $0, SUM($1.visit_duration); STORE sum_by_country INTO 'output-country; STORE sum_by_browser INTO 'output-browser; STORE sum_by_site INTO 'output-site; But in Piglet it's as simple as looping over the names of the dimensions. You could even define a method that encapsulates the grouping, summing and storing (although in this case it would be a bit overkill): def sum_dimension(relation, dimension) grouped = relation.group(dimension).foreach do |r| [r[0], r[1].visit_duration.sum] end store(grouped, "output-#{dimension}") end input = load('input', :schema => %w(country browser site visit_duration)) %w(country browser site).each do |dimension| sum_dimension(input, dimension) end You can even define your own relation operations if you want, just add them to Piglet::Relation::Relation: module Piglet::Relation::Relation # Returns a list of sampled relations for each given sample size def samples(*sizes) sizes.map { |s| sample(s) } end end and then use them just as any other operator: small, medium, large = input.samples(0.01, 0.1, 0.5) nifty, huh? === Types & schemas Piglet knows the schema of relations, so you can do something else that Pig lacks: introspection. This lets you do things like like this code, which counts the unique values of all fields in a relation: relation = load('in', :schema => [:a, :b, :c]) relation.schema.field_names.each do |field| grouped = relation.group(field) counted = grouped.foreach { |r| [r[1].count] } store(counted, "out-#{field}") end This feature obviously only works if you have specified a schema in the call to #load. There are currently many limitations to this feature, so use it with caution. Since the schema support isn't completely reliable Piglet does not enforce schemas, and it does not warn you if you try to access a field that doesn't exist. If it had, it would probably be more annoying and limiting than it would be worth. == Limitations The aim is to support most of Pig Latin, but currently there are some limitations. The following Pig operators are supported: * +COGROUP+ * +CROSS+ * +DESCRIBE+ * +DISTINCT+ * +DUMP+ * +EXPLAIN+ * +FILTER+ * FOREACH … GENERATE * +GROUP+ * +ILLUSTRATE+ * +JOIN+ * +LIMIT+ * +LOAD+ * +ORDER+ * +SAMPLE+ * +SPLIT+ * +STORE+ * +UNION+ The following are currently not supported (but will be soon): * +STREAM+ * +DEFINE+ * +DECLARE+ * +REGISTER+ The file commands (+cd+, +cat+, etc.) will probably not be supported for the forseeable future. All the aggregate functions are supported: * +AVG+ * +CONCAT+ * +COUNT+ * +DIFF+ * +IsEmpty+ * +MAX+ * +MIN+ * +SIZE+ * +SUM+ * +TOKENIZE+ Piglet only supports most arithmetic and logic operators (see below) on fields -- but check the output and make sure that it's doing what you expect because some it's tricky to see where Piglet hijacks the operators and when it's Ruby that is running the show. I'm doing the best I can, but there are many things that can't be done, at least not in Ruby 1.8. Piglet does support these field operators: * == (equality) * > (greater than) * < (less than) * >= (greater or equal to) * <= (less than or equal to) * % (modulo) * + (addition) * - (subtraction) * * (multiplication) * / (division) It also has these operators, see below for explanations: * #not (logical negation) * #neg (numerical negation) * #ne (not equals) * #test (binary conditionals) Piglet does not support: * != (not equals, you have to use == and a NOT, e.g. (a == b).not, which will be translated as NOT (a == b) or you can use #ne, which will translate to !=, e.g. a.ne(b) will become a != b. May be supported in the future, but only in Ruby 1.9) * ? : (the ternary operator) * - (negation, but you can use #neg on a field expression to get the same result, e.g. a.neg will be translated as -a. May be supported in the future, but only in Ruby 1.9) * key#value (map dereferencing, may be supported in the future) === Why aren't the aliases in the Pig Latin the same as the variable names in the Piglet script? When you run +piglet+ on a Piglet script the aliases in the output will be relation_1, relation_2, relation_3, and so on, instead of the names of the variables of the Piglet script -- like in the example at the top of this document. The names +a+ and +b+ are lost in translation, this is unfortunate but hard to avoid. Firstly there is no way to discover the names of variables, and secondly there is no correspondence between a statement in a Piglet script and a statement in Pig Latin, a.union(b, c).sample(3).group(:x) is at least three statements in Pig Latin. It simply wouldn't be worth the extra complexity of trying to infer some variable names and reuse them as aliases in the Pig Latin output. In the future I may add a way of manually suggesting relation aliases, so that the Pig Latin output is more readable. You may also wonder why the relation aliases aren't in consecutive order. The reason is that they get their names in the order they are evaluated, and the interpreter walks the relation ancestry upwards from a +store+ (and it only evaluates a relation once). === Why aren’t all operations included in the output? If you try this Piglet code: a = load 'input' b = a.group :c You might be surprised that Piglet will not output anything. In fact, Piglet only creates Pig Latin operations on relations that will somehow be outputed. Unless there is a +store+, +dump+, +describe+, +illustrate+ or +explain+ that outputs a relation, the operations applied to that relation and its ancestors will not be included. When you call +group+, +filter+ or any of the other methods that can be applied to a relation a datastructure that encodes these operations is created. When a relation is passed to +store+ or one of the other output operators the Piglet interpreter traverses the datastructure backwards, building the Pig Latin operations needed to arrive at the relation that should be passed to the output operator. This is similar to how Pig itself interprets a Pig Latin script. As a side effect of using +store+ and the other output operators as the trigger for creating the needed relational operations any relations that are not ancestors of relations that are outputed will not be included in the Pig Latin output. On the other hand, they would be no-ops when run by Pig anyway. == Copyright © 2009-2010 Theo Hultberg / Iconara. See LICENSE for details.