# ---------------------------------------------------------------------------
# 
# = CROSS
# 
# Computes the cross product of two or more relations.
# 
# == Syntax
# 
#   alias = CROSS alias, alias [, alias …] [PARALLEL n];
#   
# == Terms
# 
# alias::
#   The name of a relation.
# 
# PARALLEL n::
#   Increase the parallelism of a job by specifying the number of reduce tasks,
#   n. The optimal number of parallel tasks depends on the amount of memory on
#   each node and the memory required by each of the tasks. To determine n, use
#   the following as a general guideline:
#     n = (nr_nodes - 1) * 0.45 * nr_GB
#   where nr_nodes is the number of nodes used and nr_GB is the amount of physical
#   memory on each node.
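#   For example, with 10 nodes and 8 GB of physical memory per node, this
#   guideline suggests n = (10 - 1) * 0.45 * 8 = 32.4, or roughly 32 reduce
#   tasks.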
# 
#   Note the following:
#   * Parallel only affects the number of reduce tasks. Map parallelism is
#     determined by the input file, one map for each HDFS block.
#   * If you don’t specify parallel, you still get the same map parallelism but
#     only one reduce task.
# 
# == Usage
# 
# Use the CROSS operator to compute the cross product (Cartesian product) of two
# or more relations.
# 
# CROSS is an expensive operation and should be used sparingly.
# 
# == Example
# 
# Suppose we have relations A and B.
# 
#   (A)          (B)
#   -----------  --------
#   (1, 2, 3)    (2, 4)
#   (4, 2, 1)    (8, 9)
#                (1, 3)
# 
# In this example the cross product of relation A and B is computed.
# 
#    	X = CROSS A, B;
# 
# Relation X looks like this.
# 
#   (1, 2, 3, 2, 4)
#   (1, 2, 3, 8, 9)
#   (1, 2, 3, 1, 3)
#   (4, 2, 1, 2, 4)
#   (4, 2, 1, 8, 9)
#   (4, 2, 1, 1, 3)
#


# ---------------------------------------------------------------------------
# 
# DISTINCT
# 
# Removes duplicate tuples in a relation.
# 
# == Syntax
# 
# alias = DISTINCT alias [PARALLEL n];             
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# PARALLEL n::
#   Increase the parallelism of a job by specifying the number of reduce tasks,
#   n. The optimal number of parallel tasks depends on the amount of memory on
#   each node and the memory required by each of the tasks. To determine n, use
#   the following as a general guideline:
#     n = (nr_nodes - 1) * 0.45 * nr_GB
#   where nr_nodes is the number of nodes used and nr_GB is the amount of physical
#   memory on each node.
# 
#   Note the following:
#   * Parallel only affects the number of reduce tasks. Map parallelism is
#     determined by the input file, one map for each HDFS block.
#   * If you don’t specify parallel, you still get the same map parallelism but
#     only one reduce task.
# 
# == Usage
# 
# Use the DISTINCT operator to remove duplicate tuples in a relation. DISTINCT
# does not preserve the original order of the contents (to eliminate duplicates,
# Pig must first sort the data). You cannot use DISTINCT on a subset of fields. To
# do this, use FOREACH … GENERATE to select the fields, and then use DISTINCT.
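# 
# A minimal sketch of that workaround (assuming relation A has fields f1, f2,
# and f3, and we want the distinct values of f1 only):
# 
#   B = FOREACH A GENERATE f1;
#   X = DISTINCT B;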
# 
# == Example
# 
# Suppose we have relation A.
# 
#      (A)       
#   ---------
#   (8, 3, 4)
#   (1, 2, 3)                        
#   (4, 3, 3)                                                                                    
#   (4, 3, 3)
#   (1, 2, 3)
# 
# In this example all duplicate tuples are removed.
# 
#    	X = DISTINCT A;
# 
# Relation X looks like this.
# 
#   (1, 2, 3)
#   (4, 3, 3)
#   (8, 3, 4)
#

# ---------------------------------------------------------------------------
# 
# FILTER
# 
# Selects tuples (rows) from a relation based on some condition.
# 
# == Syntax
# 
# alias = FILTER alias  BY expression;
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# BY::
#   Required keyword.
# 
# expression::
#   An expression.
# 
# == Usage
# 
# Use the FILTER operator to work with tuples (rows) of data. FILTER is commonly
# used to select the data that you want; or, conversely, to filter out (remove)
# the data you don’t want.
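# 
# For instance, to remove rather than keep matching tuples, negate the condition.
# A minimal sketch (assuming relation A has an integer field f1, as in the
# examples below):
# 
#   X = FILTER A BY NOT (f1 == 8);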
# 
# Note: If you want to work with specific fields (columns) of data, use the
# FOREACH … GENERATE operation.
# 
# == Examples
# 
# Suppose we have relation A.
# 
#   (A: f1:int, f2:int, f3:int)
#   ----------------
#   (1, 2, 3)
#   (4, 2, 1)
#   (8, 3, 4)
#   (4, 3, 3)
#   (7, 2, 5)
#   (8, 4, 3)
# 
# In this example the condition states that if the third field equals 3, then add
# the tuple to relation X.
# 
#    	X = FILTER A BY f3 == 3;
# 
# Relation X looks like this.
# 
#   (1, 2, 3)
#   (4, 3, 3)
#   (8, 4, 3)
# 
# In this example the condition states that if the first field equals 8 or if the
# sum of fields f2 and f3 is not greater than the first field, then add the tuple
# to relation X.
# 
#    	X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
# 
# Relation X looks like this.
# 
#   (4, 2, 1)
#   (8, 3, 4)
#   (7, 2, 5)
#   (8, 4, 3)
#

# ---------------------------------------------------------------------------
# 
# FOREACH … GENERATE
# 
# Generates data transformations based on fields (columns) of data.
# 
# == Syntax
# 
# alias  = FOREACH { gen_blk | nested_gen_blk } [AS schema];
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# gen_blk::
#   FOREACH … GENERATE used with a non-nested relation. Use this syntax:
#   
#     alias = FOREACH alias GENERATE expression [expression ….]
# 
# nested_gen_blk::
#   FOREACH … GENERATE used with a nested relation. Use this syntax:
# 
#     alias = FOREACH nested_alias {
#        alias = nested_op; [alias = nested_op; …]
#        GENERATE expression [expression ….]
#     };
# 
#   where:
#   * The nested block is enclosed in opening and closing braces { … }.
#   * The GENERATE keyword must be the last statement within the nested block.
# 
# expression::
#   An expression.
# 
# nested_alias::
#   If one of the fields (columns) in a relation is a bag, the bag can be treated
#   as an inner or a nested relation.
# 
# nested_op::
#   Allowable operations include FILTER, ORDER, and DISTINCT.
# 
#   The FOREACH … GENERATE operation itself is not allowed since this could lead
#   to an arbitrary number of nesting levels.
# 
# AS::
#   Keyword.
# 
# schema::
#   A schema using the AS keyword (see Schemas).
# 
# * If the FLATTEN keyword is used, enclose the schema in parentheses.
# 
# * If the FLATTEN keyword is not used, don't enclose the schema in parentheses.
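# 
# A minimal sketch of both forms (assuming relation A has integer fields a1 and
# a2, and a bag field b whose tuples hold two integers):
# 
#   X = FOREACH A GENERATE a1+a2 AS sum:int;
#   Y = FOREACH A GENERATE FLATTEN(b) AS (b1:int, b2:int);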
# 
# == Usage
# 
# Use the FOREACH … GENERATE operation to work with individual fields (columns)
# of data. The FOREACH … GENERATE operation works with non-nested and nested
# relations.
# 
# A statement with a non-nested relation A could look like this.
# 
# X = FOREACH A GENERATE f1;
# 
# A statement with a nested relation A could look like this.
# 
# X = FOREACH B {
#         S = FILTER A BY 'xyz';
#         GENERATE COUNT(S.$0);
# }
# 
# Note: FOREACH … GENERATE works with fields (columns) of data. If you want to
# work with entire tuples (rows) of data, use the FILTER operation.
# 
# == Examples
# 
# Suppose we have relations A and B, and derived relation C (where
# C = COGROUP A BY a1 INNER, B BY b1 INNER;).
# 
# (A: a1:int, a2:int, a3:int)
# ---------------------------
# (1, 2, 3)
# (4, 2, 1)
# (8, 3, 4)
# (4, 3, 3)
# (7, 2, 5)
# (8, 4, 3)
# 
# (B: b1:int, b2:int)
# -------------------
# (2, 4)
# (8, 9)
# (1, 3)
# (2, 7)
# (2, 9)
# (4, 6)
# (4, 9)
# 
# (C: c1, c2, c3)
# ---------------
# (1, {(1, 2, 3)}, {(1, 3)})
# (4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4, 9)})
# (8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
# 
# == Example: Projection
# 
# In this example the asterisk (*) is used to project all fields from relation A
# to relation X (this is similar to SQL Select *). Relations A and X are
# identical.
# 
# X = FOREACH A GENERATE *;
# 
# In this example two fields from relation A are projected to form relation X.
# 
# X = FOREACH A GENERATE a1, a2;
# 
# Relation X looks like this.
# 
# (1, 2)
# (4, 2)
# (8, 3)
# (4, 3)
# (7, 2)
# (8, 4)
# 
# == Example: Nested Projection
# 
# Note: See GROUP for information about the "group" field in relation C.
# 
# In this example if one of the fields in the input relation is a tuple, bag or map, we can perform projection on that field.
# 
# X = FOREACH C GENERATE group, B.b2;
# 
# Relation X looks like this.
# 
# (1, {(3)})
# (4, {(6), (9)})
# (8, {(9)})
# 
# In this example multiple nested columns are retained.
# 
# X = FOREACH C GENERATE group, A.(a1, a2);
# 
# Relation X looks like this.
# 
# (1, {(1, 2)})
# (4, {(4, 2), (4, 3)})
# (8, {(8, 3), (8, 4)})
# 
# == Example: Schema
# 
# In this example two fields in relation A are summed to form relation X. A schema is defined for the projected field.
# 
# X = FOREACH A GENERATE a1+a2 AS f1:int;
# 
# Y = FILTER X by f1 > 10;
# 
# Relations X and Y look like this.
# 
# (X)      (Y)
# ------   ------
# (3)      (11)
# (6)      (12)
# (11)
# (7)
# (9)
# (12)
# 
# == Example: Applying Functions
# 
# Note: See GROUP for information about the "group" field in relation C.
# 
# In this example the built-in function SUM() is used to sum a set of numbers in a bag.
# 
# X = FOREACH C GENERATE group, SUM (A.a1);
# 
# Relation X looks like this.
# 
# (1, 1)
# (4, 8)
# (8, 16)
# 
# == Example: Flattening
# 
# Note: See GROUP for information about the "group" field in relation C.
# 
# In this example the FLATTEN keyword is used to eliminate nesting.
# 
# X = FOREACH C GENERATE group, FLATTEN(A);
# 
# Relation X looks like this.
# 
# (1, 1, 2, 3)
# (4, 4, 2, 1)
# (4, 4, 3, 3)
# (8, 8, 3, 4)
# (8, 8, 4, 3)
# 
# Another FLATTEN example.
# 
# X = FOREACH C GENERATE group, FLATTEN(A.a3);
# 
# Relation X looks like this.
# 
# (1, 3)
# (4, 1)
# (4, 3)
# (8, 4)
# (8, 3)
# 
# Another FLATTEN example.
# 
# X = FOREACH C GENERATE FLATTEN(A.(a1, a2)), FLATTEN(B.$1);
# 
# Relation X looks like this. Note that for the group '4' in C, there are two
# tuples in each bag. Thus, when both bags are flattened, the cross product of
# these tuples is returned; that is, tuples (4, 2, 6), (4, 3, 6), (4, 2, 9), and
# (4, 3, 9).
# 
# (1, 2, 3)
# (4, 2, 6)
# (4, 3, 6)
# (4, 2, 9)
# (4, 3, 9)
# (8, 3, 9)
# (8, 4, 9)
# 
# == Example: Nested Block
# 
# Suppose we have relation A and derived relation B (where B = GROUP A BY url;).
# Since relation B contains tuples with bags it can be treated as a nested
# relation.
# 
# A (url:chararray, outlink:chararray)          
# ---------------------------------------------  
# (www.ccc.com,www.hjk.com)     
# (www.ddd.com,www.xyz.org)     
# (www.aaa.com,www.cvn.org)                     
# (www.www.com,www.kpt.net)
# (www.www.com,www.xyz.org)
# (www.ddd.com,www.xyz.org)
#  
# 
# B
# ---------------------------------------------  
#  (www.aaa.com,{(www.aaa.com,www.cvn.org)})
#  (www.ccc.com,{(www.ccc.com,www.hjk.com)})
#  (www.ddd.com,{(www.ddd.com,www.xyz.org),(www.ddd.com,www.xyz.org)})
#  (www.www.com,{(www.www.com,www.kpt.net),(www.www.com,www.xyz.org)})
# 
# In this example we perform two of the allowed Pig operations, FILTER (FA) and
# DISTINCT (DA), as well as projection (PA). Note that the last statement in the
# nested block must be GENERATE.
# 
# X = FOREACH B {
#         FA = FILTER A BY outlink == 'www.xyz.org';
#         PA = FA.outlink;
#         DA = DISTINCT PA;
#         GENERATE group, COUNT(DA);
# }
# 
# Relation X looks like this.
# 
# (www.ddd.com,1L)
# (www.www.com,1L)


# ---------------------------------------------------------------------------
# 
# GROUP
# 
# Groups the data in a single relation.
# 
# == Syntax
# 
#   alias = GROUP alias
#   	    [BY {[field_alias [, field_alias]] | * | [expression] } ]
#             [ALL] [PARALLEL n];
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# BY::
#   Keyword. Use this clause to group the relation by fields or by expression.
# 
# field_alias::
#   The name of a field in a relation. This is the group key or key field.
# 
#   A relation can be grouped by a single field (f1) or by the composite value of
#   multiple fields (f1,f2).
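# 
#   For example, a minimal sketch (assuming relation A has fields f1 and f2):
# 
#     X = GROUP A BY f1;           -- single group key
#     Y = GROUP A BY (f1, f2);     -- composite group key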
# 
# *::
#   The asterisk. A designator for all fields in the relation.
# 
# expression::
#   An expression.
# 
# ALL::
#   Keyword. Use ALL if you want all tuples to go to a single group; for example,
#   when doing aggregates across entire relations.
# 
# PARALLEL n::
#   Increase the parallelism of a job by specifying the number of reduce tasks,
#   n. The optimal number of parallel tasks depends on the amount of memory on
#   each node and the memory required by each of the tasks. To determine n, use
#   the following as a general guideline:
#     n = (nr_nodes - 1) * 0.45 * nr_GB
#   where nr_nodes is the number of nodes used and nr_GB is the amount of physical
#   memory on each node.
# 
#   Note the following:
#   * Parallel only affects the number of reduce tasks. Map parallelism is
#     determined by the input file, one map for each HDFS block.
#   * If you don’t specify parallel, you still get the same map parallelism but
#     only one reduce task.
# 
# == Usage
# 
# The GROUP operator groups together tuples that have the same group key (key
# field). The result of a GROUP operation is a relation that includes one tuple
# per group. This tuple contains two fields:
# 
# * The first field is named "group" (do not confuse this with the GROUP operator)
#   and is the same type as the group key.
# 
# * The second field takes the name of the original relation and is type bag.
# 
# Suppose we have the following data:
# 
#   john    25  3.6
#   george  25  2.9
#   anne    27  3.9
#   julia   28  3.6
# 
# And, suppose we perform the LOAD and GROUP statements shown below. We can use
# the DESCRIBE operator to view the schemas for relation Y. We can use DUMP to
# view the contents of Y.
# 
# Note that relation Y has two fields. The first field is named "group" and is
# type int (the same as age). The second field takes the name of the original
# relation "X" and is type bag (that can contain tuples with three elements of
# type chararray, int, and float).
# 
# Statements
# 
#   X = LOAD 'data' AS (name:chararray, age:int, gpa:float);
#   Y = GROUP X BY age;
#   DESCRIBE Y;
#   Y: {group: int,X: {name: chararray,age: int,gpa: float}}
#   DUMP Y;
# 
#   (25,{(john,25,3.6F),(george,25,2.9F)})
#   (27,{(anne,27,3.9F)})
#   (28,{(julia,28,3.6F)})
#  
# As shown in this FOREACH statement, we can refer to the fields in relation Y by
# their names "group" and "X".
#   
#   Z = FOREACH Y GENERATE group, COUNT(X);
# 
# Relation Z looks like this.
# 
#   (25,2L)
#   (27,1L)
#   (28,1L)
# 
# == Examples
# 
# Suppose we have relation A.
#   
#   A: (owner:chararray, pet:chararray)
#   -----------------
#   (Alice, turtle)
#   (Alice, goldfish)
#   (Alice, cat)
#   (Bob, dog)
#   (Bob, cat)
# 
# In this example tuples are grouped using the field "owner."
# 
#   X = GROUP A BY owner;
# 
# Relation X looks like this. "group" is the name of the first field. "A" is the
# name of the second field.
#   
#   (Alice, {(Alice, turtle), (Alice, goldfish), (Alice, cat)})
#   (Bob, {(Bob, dog), (Bob, cat)})
# 
# In this example tuples are grouped using the ALL keyword. Field "A" is then
# counted and projected to form relation Y.
#     
#   X = GROUP A ALL;
#   Y = FOREACH X GENERATE COUNT(A);
# 
# Relation X looks like this. "group" is the name of the first field. "A" is the
# name of the second field.
# 
#   (all,{(Alice,turtle),(Alice,goldfish),(Alice,cat),(Bob,dog),(Bob,cat)})
# 
# Relation Y looks like this.
# 
#   (5L)
# 
# Suppose we have relation S.
#   
#   S: (f1:chararray, f2:int, f3:int)
#   -----------------
#   (r1, 1, 2)
#   (r2, 2, 1)
#   (r3, 2, 8)
#   (r4, 4, 4)
# 
# In this example tuples are grouped using an expression, f2*f3.
# 
#   X = GROUP S BY f2*f3;
# 
# Relation X looks like this. The first field is named "group". The second field
# is named "S".
# 
#   (2, {(r1, 1, 2), (r2, 2, 1)})
#   (16, {(r3, 2, 8), (r4, 4, 4)})


# ---------------------------------------------------------------------------
# 
# JOIN
# 
# Joins two or more relations based on common field values.
# 
# == Syntax
# 
# alias = JOIN alias BY field_alias,
#              alias BY field_alias [, alias BY field_alias …]
# 	     [PARALLEL n];                   
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# BY::
#   Keyword.
# 
# field_alias::
#   The name of a field in a relation. The alias and field_alias specified in the
#   BY clause must correspond.
# 
#   Example:
#     X = JOIN relationA BY fieldA, relationB BY fieldB, relationC BY fieldC;
# 
# PARALLEL n::
#   Increase the parallelism of a job by specifying the number of reduce tasks,
#   n. The optimal number of parallel tasks depends on the amount of memory on
#   each node and the memory required by each of the tasks. To determine n, use
#   the following as a general guideline:
#     n = (nr_nodes - 1) * 0.45 * nr_GB
#   where nr_nodes is the number of nodes used and nr_GB is the amount of physical
#   memory on each node.
# 
#   Note the following:
#   * Parallel only affects the number of reduce tasks. Map parallelism is
#     determined by the input file, one map for each HDFS block.
#   * If you don’t specify parallel, you still get the same map parallelism but
#     only one reduce task.
# 
# == Usage
# 
# Use the JOIN operator to join two or more relations based on common field
# values. The JOIN operator always performs an inner join.
# 
# Note: The JOIN and COGROUP operators perform similar functions. JOIN creates a
# flat set of output records while COGROUP creates a nested set of output records.
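# 
# A minimal sketch of the contrast (assuming relations A and B with join keys a1
# and b1, as in the example below):
# 
#   J = JOIN A BY a1, B BY b1;       -- flat output tuples
#   C = COGROUP A BY a1, B BY b1;    -- nested output tuples (group plus two bags)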
# 
# == Example
# 
# Suppose we have relations A and B.
# 
# (A: a1, a2, a3)    (B: b1, b2)
# ---------------    -----------
# (1, 2, 3)          (2, 4)
# (4, 2, 1)          (8, 9)
# (8, 3, 4)          (1, 3)
# (4, 3, 3)          (2, 7)
# (7, 2, 5)          (2, 9)
# (8, 4, 3)          (4, 6)
#                    (4, 9)
# 
# In this example relations A and B are joined on their first fields.
# 
# X = JOIN A BY a1, B BY b1;
# 
# Relation X looks like this.
# 
# (1, 2, 3, 1, 3)
# (4, 2, 1, 4, 6)
# (4, 3, 3, 4, 6)
# (4, 2, 1, 4, 9)
# (4, 3, 3, 4, 9)
# (8, 3, 4, 8, 9)
# (8, 4, 3, 8, 9)
#


# ---------------------------------------------------------------------------
# 
# LIMIT
# 
# Limits the number of output tuples.
# 
# == Syntax
# 
# alias = LIMIT alias  n;
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# n::
#   The number of tuples.
# 
# == Usage
# 
# Use the LIMIT operator to limit the number of output tuples (rows). If the
# specified number of output tuples is equal to or exceeds the number of tuples in
# the relation, the output will include all tuples in the relation.
# 
# There is no guarantee which tuples will be returned, and the tuples that are
# returned can change from one run to the next. A particular set of tuples can be
# requested using the ORDER operator followed by LIMIT.
# 
# Note: The LIMIT operator allows Pig to avoid processing all tuples in a
# relation. In most cases a query that uses LIMIT will run more efficiently than
# an identical query that does not use LIMIT. It is always a good idea to use
# LIMIT if you can.
# 
# == Examples
# 
# Suppose we have relation A.
#   
#   (A: f1:int, f2:int, f3:int)    
#   -----------------                  
#   (1, 2, 3)
#   (4, 2, 1)
#   (8, 3, 4)
#   (4, 3, 3)
#   (7, 2, 5)
#   (8, 4, 3)
# 
# In this example output is limited to 3 tuples.
# 
#   X = LIMIT A 3;
# 
# Relation X could look like this (there is no guarantee which three tuples will
# be output).
#   
#   (1, 2, 3)
#   (4, 3, 3)
#   (7, 2, 5)
# 
# In this example the ORDER operator is used to order the tuples and the LIMIT
# operator is used to output the first three tuples.
# 
#   B = ORDER A BY f1 DESC, f2 ASC;
#   X = LIMIT B 3;
# 
# Relation B and relation X look like this.
#   
#   (B)          (X)
#   -----------  -----------
#   (8, 3, 4)    (8, 3, 4)
#   (8, 4, 3)    (8, 4, 3)
#   (7, 2, 5)    (7, 2, 5)
#   (4, 2, 1)
#   (4, 3, 3)
#   (1, 2, 3)


# ---------------------------------------------------------------------------
# 
# LOAD
# 
# Loads data from the file system.
# 
# == Syntax
# 
# LOAD 'data' [USING function] [AS schema];              
# 
# == Terms
# 
# 'data'::
#   The name of the file or directory, in single quotes.
# 
#   If you specify a directory name, all the files in the directory are loaded.
# 
#   You can use Hadoop-supported globbing to specify files at the file system or
#   directory level (see the Hadoop glob documentation for details on globbing
#   syntax).
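# 
#   For example, a minimal sketch (the path and glob pattern are hypothetical):
# 
#     A = LOAD '/data/logs/2008-*/part-*';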
# 
# USING::
#   Keyword.
# 
# function::
#   The load function.
# 
#   PigStorage is the default load/store function and does not need to be
#   specified. This function reads/writes simple newline-separated records with
#   delimiter-separated fields. The function has one parameter, the field
#   delimiter (tab ('\t') is the default delimiter).
# 
#   If the data is stored in a special format that the Pig load functions cannot
#   parse, you can write your own load function.
# 
# AS::
#   Keyword.
# 
# schema::
#   A schema using the AS keyword, enclosed in parentheses (see Schemas).
# 
# == Usage
# 
# Use the LOAD operator to load data from the file system.
# 
# == Examples
# 
# Suppose we have a data file called myfile.txt. The fields are tab-delimited. The
# records are newline-separated.
#   
#   1          2          3
#   4          2          1
#   8          3          4
# 
# In this example the default load function, PigStorage, loads data from
# myfile.txt into relation A. Note that, because no schema is specified, the
# fields are not named and all fields default to type bytearray. The two
# statements are equivalent.
#   
#   A = LOAD 'myfile.txt';
#   A = LOAD 'myfile.txt' USING PigStorage('\t');
# 
# Relation A looks like this.
#   
#   (1, 2, 3)
#   (4, 2, 1)
#   (8, 3, 4)
# 
# In this example a schema is specified using the AS keyword. The two statements
# are equivalent.
#   
#   A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);
#   A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);


# ---------------------------------------------------------------------------
# 
# ORDER
# 
# Sorts a relation based on one or more fields.
# 
# == Syntax
# 
# alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC]
#           [, field_alias [ASC|DESC] …] } [PARALLEL n];
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# BY::
#   Required keyword.
# 
# *::
#   Represents all fields in the relation.
# 
# ASC::
#   Sort in ascending order.
# 
# DESC::
#   Sort in descending order.
# 
# field_alias::
#   A field in the relation.
# 
# PARALLEL n::
#   Increase the parallelism of a job by specifying the number of reduce tasks,
#   n. The optimal number of parallel tasks depends on the amount of memory on
#   each node and the memory required by each of the tasks. To determine n, use
#   the following as a general guideline:
#     n = (nr_nodes - 1) * 0.45 * nr_GB
#   where nr_nodes is the number of nodes used and nr_GB is the amount of physical
#   memory on each node.
# 
#   Note the following:
#   * Parallel only affects the number of reduce tasks. Map parallelism is
#     determined by the input file, one map for each HDFS block.
#   * If you don’t specify parallel, you still get the same map parallelism but
#     only one reduce task.
# 
# == Usage
# 
# In Pig, relations are logically unordered.
# 
# * If you order relation A to produce relation X (X = ORDER A BY * DESC;),
#   relations A and X still contain the same data.
# 
# * If you retrieve the contents of relation X, they are guaranteed to be in the
#   order you specified (descending).
# 
# * However, if you further process relation X, there is no guarantee that the
#   contents will be processed in the order you specified.
# 
# == Examples
# 
# Suppose we have relation A.
# 
#   (A: f1, f2, f3)     
#   -----------------                  
#   (1, 2, 3)                        
#   (4, 2, 1)                                    
#   (8, 3, 4)                                    
#   (4, 3, 3)
#   (7, 2, 5)            
#   (8, 4, 3)
# 
# In this example relation A is sorted by the third field, f3, in descending order.
# 
#   X = ORDER A BY f3 DESC;
# 
# Relation X could look like this (note that the order of the three tuples ending
# in 3 can vary).
# 
#   (7, 2, 5)
#   (8, 3, 4)
#   (1, 2, 3)
#   (4, 3, 3)
#   (8, 4, 3)
#   (4, 2, 1)


# ---------------------------------------------------------------------------
# 
# SPLIT
# 
# Partitions a relation into two or more relations.
# 
# == Syntax
# 
# SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression …];
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# INTO::
#   Required keyword.
# 
# IF::
#   Required keyword.
# 
# expression::
#   An expression.
# 
# == Usage
# 
# Use the SPLIT operator to partition a relation into two or more relations based
# on some expression. Depending on the expression:
# 
# * A tuple may be assigned to more than one relation.
# 
# * A tuple may not be assigned to any relation.
# 
# == Example
# 
# Suppose we have relation A.
# 
# (A: f1, f2, f3)     
# -----------------                              
# (1, 2, 3)
# (4, 5, 6)
# (7, 8, 9)
# 
# In this example relation A is split into three relations, X, Y, and Z.
# 
# SPLIT A INTO X IF f1 < 7, Y IF f2 == 5, Z IF (f3 < 6 OR f3 > 6);
# 
# Relations X, Y, and Z look like this.
# 
# (X)          (Y)          (Z)
# ----------   ----------   ----------
# (1, 2, 3)    (4, 5, 6)    (1, 2, 3)
# (4, 5, 6)                 (7, 8, 9)


# ---------------------------------------------------------------------------
# 
# STORE
# 
# Stores data to the file system.
# 
# == Syntax
# 
# STORE alias INTO 'directory' [USING function];
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# INTO::
#   Required keyword.
# 
# 'directory'::
#   The name of the storage directory, in quotes. If the directory already
#   exists, the STORE operation will fail.
# 
#   The output data files, named part-nnnnn, are written to this directory.
# 
# USING::
#   Keyword. Use this clause to name the store function.
# 
# function::
#   The store function.
# 
#   PigStorage is the default load/store function and does not need to be
#   specified. This function reads/writes simple newline-separated records with
#   delimiter-separated fields. The function has one parameter, the field
#   delimiter (tab ('\t') is the default delimiter).
# 
#   If you want to store the data in a special format that the Pig load/store
#   functions cannot handle, you can write your own store function.
# 
# == Usage
# 
# Use the STORE operator to store data on the file system.
# 
# == Example
# 
# Suppose we have relation A.
# 
# (A)
# ----------------
# (1, 2, 3)
# (4, 2, 1)
# (8, 3, 4)
# (4, 3, 3)
# (7, 2, 5)
# (8, 4, 3)
# 
# In this example the contents of relation A are written to file part-00000
# located in directory myoutput.
# 
# STORE A INTO 'myoutput' USING PigStorage('*');
# 
# The part-00000 file looks like this. Fields are delimited with the asterisk (*)
# character and records are separated by newlines.
# 
# 1*2*3
# 4*2*1
# 8*3*4
# 4*3*3
# 7*2*5
# 8*4*3
#


# ---------------------------------------------------------------------------
# 
# STREAM
# 
# Sends data to an external script or program.
# 
# == Syntax
# 
# alias = STREAM alias [, alias …] THROUGH {`command` | cmd_alias } [AS schema] ;
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# THROUGH::
#   Keyword.
# 
# `command`::
#   A command, including the arguments, enclosed in backticks (where a command is
#   anything that can be executed).
# 
# cmd_alias::
#   The name of a command created using the DEFINE operator.
# 
# AS::
#   Keyword.
# 
# schema::
#   A schema using the AS keyword, enclosed in parentheses (see Schemas).
# 
# == Usage
# 
# Use the STREAM operator to send data through an external script or program.
# Multiple stream operators can appear in the same Pig script. The stream
# operators can be adjacent to each other or have other operations in between.
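# 
# For instance, a minimal sketch of two adjacent stream operators (the script
# names are hypothetical):
# 
# A = LOAD 'data';
# B = STREAM A THROUGH `script1.pl`;
# C = STREAM B THROUGH `script2.pl`;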
# 
# When used with a command, a stream statement could look like this:
# 
# A = LOAD 'data';
# B = STREAM A THROUGH `stream.pl -n 5`;
# 
# When used with a cmd_alias, a stream statement could look like this, where cmd
# is the defined alias.
# 
# A = LOAD 'data';
# DEFINE cmd `stream.pl -n 5`;
# B = STREAM A THROUGH cmd;
# 
# === About Data Guarantees
# 
# Data guarantees are determined based on the position of the streaming operator in the Pig script.
# 
# * Unordered data – No guarantee for the order in which the data is delivered to
#   the streaming application.
# 
# * Grouped data – The data for the same grouped key is guaranteed to be provided
#   to the streaming application contiguously.
# 
# * Grouped and ordered data – The data for the same grouped key is guaranteed to
#   be provided to the streaming application contiguously. Additionally, the data
#   within the group is guaranteed to be sorted by the provided secondary key.
# 
# In addition to position, data grouping and ordering can be determined by the
# data itself. However, you need to know the property of the data to be able to
# take advantage of its structure.
# 
# == Example: Data Guarantees
# 
# In this example the data is unordered.
# 
# A = LOAD 'data';
# B = STREAM A THROUGH `stream.pl`;
# 
# In this example the data is grouped.
# 
# A = LOAD 'data';
# B = GROUP A BY $1;
# C = FOREACH B GENERATE FLATTEN(A);
# D = STREAM C THROUGH `stream.pl`;
# 
# In this example the data is grouped and ordered.
# 
# A = LOAD 'data';
# B = GROUP A BY $1;
# C = FOREACH B {
#       D = ORDER A BY ($3, $4);
#       GENERATE D;
# }
# E = STREAM C THROUGH `stream.pl`;
# 
# == Example: Schemas
# 
# In this example a schema is specified as part of the STREAM statement.
# 
#   X = STREAM A THROUGH `stream.pl` AS (f1:int, f2:int, f3:int);
# 
# == Additional Examples
# 
#   See DEFINE for additional examples.


# ---------------------------------------------------------------------------
# 
# UNION
# 
# Computes the union of two or more relations.
# 
# == Syntax
# 
# alias = UNION alias, alias [, alias …];
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# == Usage
# 
# Use the UNION operator to compute the union of two or more relations. The UNION operator:
# 
# * Does not preserve the order of tuples. Both the input and output relations are
#   interpreted as unordered bags of tuples.
# 
# * Does not ensure (as databases do) that all tuples adhere to the same schema or
#   that they have the same number of fields. In a typical scenario, however, this
#   should be the case; therefore, it is the user's responsibility to either (1)
#   ensure that the tuples in the input relations have the same schema or (2) be
#   able to process varying tuples in the output relation.
# 
# * Does not eliminate duplicate tuples.
# 
# == Example
# 
# Suppose we have relations A and B.
# 
# (A)          (B)
# -----------  --------
# (1, 2, 3)    (2, 4)
# (4, 2, 1)    (8, 9)
#              (1, 3)
# 
# In this example the union of relation A and B is computed.
# 
# X = UNION A, B;
# 
# Relation X looks like this.
# 
# (1, 2, 3)
# (4, 2, 1)
# (2, 4)
# (8, 9)
# (1, 3)
#


# ---------------------------------------------------------------------------
# 
# Diagnostic Operators
# 
# DESCRIBE
# 
# Returns the schema of an alias.
# 
# == Syntax
# 
# DESCRIBE alias;                                                                      
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# == Usage
# 
# Use the DESCRIBE operator to review the schema of a particular alias.
# 
# == Example
# 
# In this example a schema is specified using the AS clause.
#   
#   A = LOAD 'students' AS (name:chararray, age:int, gpa:float);
#   B = FILTER A BY name matches 'John%';
#   C = GROUP B BY name;
#   D = FOREACH C GENERATE COUNT(B.age);
#   DESCRIBE A;
#   A: {name: chararray,age: int,gpa: float}
#   DESCRIBE B;
#   B: {name: chararray,age: int,gpa: float}
#   DESCRIBE C;
#   C: {group: chararray,B: {name: chararray,age: int,gpa: float}}
#   DESCRIBE D;
#   D: {long}
# 
# In this example no schema is specified. All data items default to type bytearray.
# 
#   grunt> a = LOAD '/data/students';
#   grunt> b = FILTER a BY $0 matches 'John%';
#   grunt> c = GROUP b BY $0;
#   grunt> d = FOREACH c GENERATE COUNT(b.$1);
#   grunt> DESCRIBE a;
# 
# Schema for a unknown.
# 
#   grunt> DESCRIBE b;
#   2008-12-05 01:17:15,316 [main] WARN  org.apache.pig.PigServer - bytearray is implicitly cast to chararray under LORegexp Operator
# 
# Schema for b unknown.
# 
#   grunt> DESCRIBE c;
#   2008-12-05 01:17:23,343 [main] WARN  org.apache.pig.PigServer - bytearray is implicitly caste to chararray under LORegexp Operator
# 
# c: {group: bytearray,b: {null}}
# 
#   grunt> DESCRIBE d;
#   2008-12-05 03:04:30,076 [main] WARN  org.apache.pig.PigServer - bytearray is implicitly caste to chararray under LORegexp Operator
# 
# d: {long}
# 
# ---------------------------------------------------------------------------
# 
# DUMP
# 
# Displays the contents of an alias.
# 
# == Syntax
# 
# DUMP alias;                                                           
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# == Usage
# 
# Use the DUMP operator to display the contents of an alias. You can use DUMP as a
# debugging device to make sure the correct results are being generated.
# 
# == Example
# 
# In this example a dump is performed after each statement.
# 
#   A = LOAD 'students' AS (name:chararray, age:int, gpa:float);
#   DUMP A;
#   B = FILTER A BY name matches 'John%';
#   DUMP B;
#   C = GROUP B BY name;
#   DUMP C;
#   D = FOREACH C GENERATE COUNT(B.age);
#   DUMP D;
# 
# ---------------------------------------------------------------------------
# 
# EXPLAIN
# 
# Displays execution plans.
# 
# == Syntax
# 
# EXPLAIN alias;                            
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# == Usage
# 
# Use the EXPLAIN operator to review the logical, physical, and map reduce
# execution plans that are used to compute the specified relation.
# 
# * The logical plan shows a pipeline of operators to be executed to build the
#   relation. Type checking and backend-independent optimizations (such as
#   applying filters early on) also apply.
# 
# * The physical plan shows how the logical operators are translated to
#   backend-specific physical operators. Some backend optimizations also apply.
# 
# * The map reduce plan shows how the physical operators are grouped into map
#   reduce jobs.
# 
# == Example
# 
# In this example the EXPLAIN operator produces all three plans. (Note that only a
# portion of the output is shown in this example.)
# 
# A = LOAD 'students' AS (name:chararray, age:int, gpa:float);
# B = GROUP A BY name;
# C = FOREACH B GENERATE COUNT(A.age);
# EXPLAIN C;
# 
#  
# Logical Plan:
# 
# Store xxx-Fri Dec 05 19:42:29 UTC 2008-23 Schema: {long} Type: Unknown
# |
# |---ForEach xxx-Fri Dec 05 19:42:29 UTC 2008-15 Schema: {long} Type: bag
# etc …
# 
# -----------------------------------------------
# Physical Plan:
# -----------------------------------------------
# Store(fakefile:org.apache.pig.builtin.PigStorage) - xxx-Fri Dec 05 19:42:29 UTC 2008-40
# |
# |---New For Each(false)[bag] - xxx-Fri Dec 05 19:42:29 UTC 2008-39
#     |   |
#     |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - xxx-Fri Dec 05
# etc …
#  
# --------------------------------------------------
# | Map Reduce Plan                                |
# --------------------------------------------------
# MapReduce node xxx-Fri Dec 05 19:42:29 UTC 2008-41
# Map Plan
# Local Rearrange[tuple]{chararray}(false) - xxx-Fri Dec 05 19:42:29 UTC 2008-34
# |   |
# |   Project[chararray][0] - xxx-Fri Dec 05 19:42:29 UTC 2008-35
# etc …  
# 
# ---------------------------------------------------------------------------
# 
# ILLUSTRATE
# 
# Displays a step-by-step execution of a sequence of statements.
# 
# == Syntax
# 
# ILLUSTRATE alias;                                                 
# 
# == Terms
# 
# alias::
#   The name of a relation.
# 
# == Usage
# 
# Use the ILLUSTRATE operator to review how data items are transformed through a
# sequence of Pig Latin statements.
# 
# ILLUSTRATE accesses the ExampleGenerator algorithm, which can select an
# appropriate and concise set of example data items automatically. It does a
# better job than random sampling would do; for example, random sampling suffers
# from the drawback that selective operations such as filters or joins can
# eliminate all the sampled data items, giving you empty results, which are of no
# help with debugging.
# 
# With the ILLUSTRATE operator you can test your programs on small datasets and
# get faster turnaround times. The ExampleGenerator algorithm uses Pig's Local
# mode (rather than Hadoop mode) which means that illustrative example data is
# generated in near real-time.
# 
# == Example
# 
# Suppose we have a data file called 'visits.txt'.
# 
#   Amy     cnn.com         20080218
#   Fred    harvard.edu     20081204
#   Amy     bbc.com         20081205
#   Fred    stanford.edu    20081206
# 
# In this example we count the number of sites a user has visited since
# 12/1/08. The ILLUSTRATE statement will show how the results for num_user_visits
# are derived.
# 
#   visits = LOAD 'visits.txt' AS (user:chararray, url:chararray, timestamp:chararray);
#   recent_visits = FILTER visits BY timestamp >= '20081201';
#   user_visits = GROUP recent_visits BY user;
#   num_user_visits = FOREACH user_visits GENERATE COUNT(recent_visits);
#   ILLUSTRATE num_user_visits;
# 
# The output from the ILLUSTRATE statement looks like this.
# 
# ------------------------------------------------------------------------
# 
# | visits     | user: bytearray | url: bytearray | timestamp: bytearray |
# ------------------------------------------------------------------------
# |            | Amy             | cnn.com        | 20080218             |
# |            | Fred            | harvard.edu    | 20081204            |
# |            | Amy             | bbc.com        | 20081205             |
# |            | Fred            | stanford.edu   | 20081206            |
# ------------------------------------------------------------------------
# 
# -------------------------------------------------------------------------------
# | recent_visits     | user: chararray | url: chararray | timestamp: chararray |
# -------------------------------------------------------------------------------
# |                   | Fred            | harvard.edu    | 20081204             |
# |                   | Amy             | bbc.com        | 20081205             |
# |                   | Fred            | stanford.edu   | 20081206             |
# -------------------------------------------------------------------------------
# 
# ------------------------------------------------------------------------------------------------------------------
# | user_visits     | group: chararray | recent_visits: bag({user: chararray,url: chararray,timestamp: chararray}) |
# ------------------------------------------------------------------------------------------------------------------
# |                 | Amy              | {(Amy, bbc.com, 20081205)}                                                |
# |                 | Fred             | {(Fred, harvard.edu, 20081204), (Fred, stanford.edu, 20081206)}           |
# ------------------------------------------------------------------------------------------------------------------
# 
# -------------------------------
# | num_user_visits     | long  |
# -------------------------------
# |                     | 1     |
# |                     | 2     |
# -------------------------------
#

# ---------------------------------------------------------------------------
# 
# DEFINE
# 
# Assigns an alias to a function or command.
# 
# == Syntax
# 
# DEFINE alias {function | [`command` [input] [output] [ship] [cache]] };
# 
# == Terms
# 
# alias::
#   The name for the function or command.
# 
# function::
#   The name of a function.
# 
#   Use this option to define functions for use with the FOREACH and FILTER
#   operators.
# 
# `command`::
#   A command, including the arguments, enclosed in backticks (where a command is
#   anything that can be executed).
# 
#   Use this option to define commands for use with the STREAM operator.
# 
# input::
#   INPUT ( {stdin | 'path'} [USING serializer] [, {stdin | 'path'} [USING serializer] …] )
# 
#   Where:
# 
#   * INPUT – Keyword.
#   * 'path' – A file path, enclosed in single quotes.
#   * USING – Keyword.
#   * serializer – A function that converts data from tuples to stream format.
#     PigStorage is the default serializer. You can also write your own UDF.
# 
# output::
#   OUTPUT ( {stdout | stderr | 'path'} [USING deserializer] [, {stdout | stderr | 'path'} [USING deserializer] …] )
# 
#   Where:
#   
#   * OUTPUT – Keyword.
#   * 'path' – A file path, enclosed in single quotes.
#   * USING – Keyword.
#   * deserializer – A function that converts data from stream format to tuples. PigStorage is the default deserializer. You can also write your own UDF.
# 
# ship::
#   SHIP('path' [, 'path' …])
# 
#   Where:
#   
#   * SHIP – Keyword.
#   * 'path' – A file path, enclosed in single quotes.
# 
# cache::
#   CACHE('dfs_path#dfs_file' [, 'dfs_path#dfs_file' …])
#   
#   Where:
#   
#   * CACHE – Keyword.
#   * 'dfs_path#dfs_file' – A file path/file name on the distributed file system,
#     enclosed in single quotes. Example: '/mydir/mydata.txt#mydata.txt'
# 
# 
# == Usage
# 
# Use the DEFINE statement to assign a name (alias) to a function or to a command.
# 
# Use DEFINE to specify a function when:
# 
# * The function has a long package name that you don't want to include in a
#   script, especially if you call the function several times in that script.
# 
# * The constructor for the function takes parameters (see the first example
#   below). If you need to use different constructor parameters for different
#   calls to the function you will need to create multiple defines – one for each
#   parameter set.
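# 
#   For example, a minimal sketch (myfunc.MyEvalfunc is taken from the example
#   later in this section; the second parameter set is hypothetical):
# 
#     DEFINE myFuncFoo myfunc.MyEvalfunc('foo');
#     DEFINE myFuncBar myfunc.MyEvalfunc('bar');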
# 
# Use DEFINE to specify a command when the streaming command specification is
# complex or requires additional parameters (input, output, and so on).
# 
# === About Input and Output
# 
# Serialization is needed to convert data from tuples to a format that can be
# processed by the streaming application. Deserialization is needed to convert the
# output from the streaming application back into tuples.
# 
# PigStorage, the default serialization/deserialization function, converts tuples
# to tab-delimited lines. Pig's BinarySerializer and BinaryDeserializer functions
# treat the entire file as a byte stream (no formatting or interpretation takes
# place). You can also write your own serialization/deserialization functions.
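# 
# For instance, a minimal sketch using the binary functions named above (assuming
# the streaming command reads and writes raw bytes):
# 
#   DEFINE Y `stream.pl` INPUT(stdin USING BinarySerializer) OUTPUT(stdout USING BinaryDeserializer);
#   X = STREAM A THROUGH Y;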
# 
# === About Ship
# 
# Use the ship option to send streaming binary and supporting files, if any, from
# the client node to the compute nodes. Pig does not automatically ship
# dependencies; it is your responsibility to explicitly specify all the
# dependencies and to make sure that the software the processing relies on (for
# instance, perl or python) is installed on the cluster. Supporting files are
# shipped to the task's current working directory and only relative paths should
# be specified. Any pre-installed binaries should be specified in the path.
# 
# Only files, not directories, can be specified with the ship option. One way to
# work around this limitation is to tar all the dependencies into a tar file that
# accurately reflects the structure needed on the compute nodes, then have a
# wrapper for your script that un-tars the dependencies prior to execution.
# 
# Note that the ship option has two components: the source specification, provided
# in the ship clause, is the view of your machine; the command specification is
# the view of the cluster. The only guarantee is that the shipped files are
# available in the current working directory of the launched job and that your
# current working directory is also on the PATH environment variable.
# 
# Shipping files to relative paths or absolute paths is not supported since you
# might not have permission to read/write/execute from arbitrary paths on the
# cluster.
# 
# === About Cache
# 
# The ship option works with binaries, jars, and small datasets. However, loading
# larger datasets at run time for every execution can severely impact
# performance. Instead, use the cache option to access large files already moved
# to and available on the compute nodes. Only files, not directories, can be
# specified with the cache option.
# 
# == Example: Input/Output
# 
# In this example PigStorage is the default serialization/deserialization
# function. The tuples from relation A are converted to tab-delimited lines that
# are passed to the script.
# 
#   X = STREAM A THROUGH `stream.pl`;
# 
# In this example PigStorage is used as the serialization/deserialization
# function, but a comma is used as the delimiter.
# 
#   DEFINE Y `stream.pl` INPUT(stdin USING PigStorage(',')) OUTPUT (stdout USING PigStorage(','));
#   X = STREAM A THROUGH Y;
# 
# In this example user-defined serialization/deserialization functions are used
# with the script.
# 
#   DEFINE Y `stream.pl` INPUT(stdin USING MySerializer) OUTPUT (stdout USING MyDeserializer);
#   X = STREAM A THROUGH Y;
# 
# == Example: Ship/Cache
# 
# In this example ship is used to send the script to the cluster compute nodes.
# 
#   DEFINE Y `stream.pl` SHIP('/work/stream.pl');
#   X = STREAM A THROUGH Y;
# 
# In this example cache is used to specify a file located on the cluster compute
# nodes.
# 
#   DEFINE Y `stream.pl data.gz` SHIP('/work/stream.pl') CACHE('/input/data.gz#data.gz');
#   X = STREAM A THROUGH Y;
# 
# == Example: Logging
# 
# In this example the streaming stderr is stored in the _logs/<dir> directory of
# the job's output directory. Because the job can have multiple streaming
# applications associated with it, you need to ensure that different directory
# names are used to avoid conflicts. Pig stores up to 100 tasks per streaming job.
# 
#   DEFINE Y `stream.pl` stderr('<dir>' limit 100);
#   X = STREAM A THROUGH Y;
# 
# In this example a function is defined for use with the FOREACH … GENERATE
# operator.
# 
#   grunt> REGISTER /src/myfunc.jar;
#   grunt> define myFunc myfunc.MyEvalfunc('foo');
#   grunt> A = LOAD 'students';
#   grunt> B = FOREACH A GENERATE myFunc($0);
# 
# In this example a command is defined for use with the STREAM operator.
# 
#   grunt> A = LOAD 'data';
#   grunt> DEFINE cmd `stream_cmd -input file.dat`;
#   grunt> B = STREAM A THROUGH cmd;
# 


# ---------------------------------------------------------------------------
# 
# = REGISTER
# 
# Registers a JAR file so that the UDFs in the file can be used.
# 
# == Syntax
# 
# REGISTER alias;
# 
# == Terms
# 
# alias::
#   The path of a Java JAR file. Do not place the name in quotes.
# 
# == Usage
# 
# Use the REGISTER statement to specify the path of a Java JAR file containing UDFs.
# 
# For more information about UDFs, see the User Defined Function Guide. Note that
# Pig currently only supports functions written in Java.
# 
# == Example
# 
# In this example REGISTER states that myfunc.jar is located in the /src
# directory.
# 
#   grunt> REGISTER /src/myfunc.jar;
#   grunt> A = LOAD 'students';
#   grunt> B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
#