To get started, read about the Pig Latin Statements and Data Items.
Then, review the Relational Operators and the Built-in Functions.
Pig Latin provides the following operators and data types, each described in detail in this manual:

Arithmetic Operators: addition +, subtraction -, multiplication *, division /, modulo %, bincond ?
Comparison Operators: equal ==, not equal !=, less than <, greater than >, less than or equal to <=, greater than or equal to >=, pattern matching matches
Null Operators: is null, is not null
Boolean Operators: and, or, not
Dereference Operators: tuple dereference . , map dereference #
Sign Operators: positive +, negative -
Cast Operators: (type)$0, (type)alias
Simple Data Types (scalars): int, long, float, double
Simple Data Types (arrays): chararray, bytearray
Complex Data Types: tuple, bag, map
Conventions for the syntax and code examples included in the Pig Latin Reference Manual are described here.
Convention |
Description |
Example |
( ) |
Parentheses enclose one or more items. Parentheses are also used to indicate the tuple data type. |
Multiple items: (1, abc, (2,4,6) ) |
[ ] |
Straight brackets enclose one or more optional items. Straight brackets are also used to indicate the map data type. In this case <> is used to indicate optional items. |
Optional items: [INNER | OUTER] |
{ } |
Curly brackets enclose two or more items, one of which is required. Curly brackets are also used to indicate the bag data type. In this case <> is used to indicate required items. |
Two items, one required: { gen_blk | nested_gen_blk } |
… |
Horizontal ellipsis points indicate that you can repeat a portion of the code. |
Pig Latin syntax statement: cat path [ path …] |
UPPERCASE lowercase |
In general, uppercase type indicates elements the system supplies. In general, lowercase type indicates elements that you supply. Note: Pig Latin operators, commands, and keywords are not case sensitive. The names (aliases) of relations, fields, and functions are case sensitive. |
Pig Latin statement: A = LOAD 'data' AS (f1:int); · LOAD, AS supplied by system · A, f1 are names (aliases) · data supplied by you |
italics |
Italic type indicates placeholders or variables for which you must supply values. |
Pig Latin syntax: alias = LIMIT alias n; You supply the values alias and n. |
A Pig Latin
statement is an operator that takes a relation as input and produces another
relation as output. (This definition applies to all Pig Latin operators except
LOAD and STORE which read data from and write data to the file system.) Pig
Latin statements can span multiple lines and must end with a semi-colon ( ; ).
Pig Latin
statements are generally organized in the following manner. A LOAD statement
reads data from the file system. A series of "transformation"
statements process the data. A STORE statement writes data back to the file
system (or, a DUMP statement displays output to the screen.)
You can run Pig
Latin statements manually (using the Grunt shell) or place statements in a
script. Pig processes statements this way:
· First, Pig validates the syntax and semantics of all of the statements.
· Next, if Pig has encountered a DUMP or STORE operator, Pig will execute the statements.
In this example,
Pig will validate the LOAD, GROUP and FOREACH statements but will not execute
these statements.
grunt> A = LOAD 'data' USING PigStorage()
>>     AS (a:int, b:int, c:int);
grunt> B = GROUP A BY a;
grunt> C = FOREACH B GENERATE COUNT($0); |
In this example,
Pig will validate the LOAD, GROUP, FOREACH, and DUMP statements. Then, if no
errors are encountered, Pig will execute these statements.
grunt> A = LOAD 'data' USING PigStorage()
>>     AS (a:int, b:int, c:int);
grunt> B = GROUP A BY a;
grunt> C = FOREACH B GENERATE COUNT($0);
grunt> DUMP C; |
As noted, Pig Latin statements work with relations. Relations have tuples (or rows). Tuples have fields (or columns).
Relations are
referred to by name (where the name is called an alias). Fields are referred to
by name (where the name is called an alias or field_alias) or by positional
notation if no schema is defined.
Pig Latin
operators, commands, and keywords are not case sensitive. The names of
relations, fields, and Pig Latin functions are case sensitive.
In the example
above, note the following:
·
The
first statement spans two lines.
·
The
names (aliases) for relations A, B, and C are case sensitive.
·
The
names (aliases) for fields a, b, and c are case sensitive.
·
Function
names PigStorage and COUNT are case sensitive.
·
Operators
and keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE, and DUMP are not
case sensitive. They can also be written as load, using, as, group, by, etc.
·
In
the FOREACH statement, the field in relation B is referred to by positional
notation ($0).
A data item is a
field in a tuple. A data item can be any of the supported data types. You can refer to data items in various ways. Suppose
we have the following tuple, which has three fields.
tuple:( 1, {(2,3),(4,5),(5,7)}, ['apache'#'search'] ) |
The three fields
are separated out in the table below.
You have the
option to assign or not assign names (aliases) to fields.
·
If
you assign names to fields you can use any word except the Pig keywords. For
example: f1, f2, f3 or a, b, c or name, age, gpa.
·
If
you don't assign names to fields, you can refer to the fields using positional
notation, which begins with zero (0). For example: $0, $1, $2.
                    | First Field: int | Second Field: bag   | Third Field: map
Field Value         | 1                | {(2,3),(4,5),(5,7)} | ['apache'#'search']
Name (alias)        | f1               | f2                  | f3
Positional Notation | $0               | $1                  | $2
In addition to
positional notation and names, here are some other ways to refer to data items:
Referral Method     | Example                       | Field Value      | Notes
Constant            | 1.0 or 'apache.org'           | n/a              | See Constants.
Positional Notation | $0                            | 1                | See the table above. Always starts with $0.
Name (alias)        | f1                            | 1                | See the table above. Can be any word except the Pig keywords.
Projection          | f2.$0                         | {(2),(4),(5)}    |
Map Lookup          | f3#'apache'                   | 'search'         |
Function            | SUM(f2.$0)                    | 2+4+5 = 11       | See Eval Functions.
Expression          | COUNT(f2) + f1 / '2.0'        | 3 + 1 / 2.0 = 3.5 | See Eval Functions.
Bincond             | (f1 == '1' ? '2' : COUNT(f2)) | '2', since f1 == '1' is true; if f1 were not equal to '1', the value would be COUNT(f2) = 3 | See Arithmetic Operators.
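As a minimal sketch of these referral methods (the file name and schema here are hypothetical; they mirror the example tuple above):

-- Hypothetical load of the example tuple shown above.
A = LOAD 'data' AS (f1:int, f2:bag{t:(x:int, y:int)}, f3:map[]);
-- Projection, map lookup, and an eval function applied to the fields.
X = FOREACH A GENERATE f1, f2.$0, f3#'apache', SUM(f2.x); |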
Simple Data Types |
|
Scalars |
|
int |
Signed 32-bit integer |
long |
Signed 64-bit integer |
float |
32-bit floating point |
double |
64-bit floating point |
Arrays |
|
chararray |
Character array (string) in UTF-8 (Unicode) format |
bytearray |
Byte array (blob) |
Complex Data Types |
|
tuple |
An ordered set of data items (see Tuple) |
bag |
A collection of tuples (see Bag) |
map |
A set of key value pairs (see Map) |
Note the following general observations about data types:
·
A loader produces the data of the type specified
by the schema. If any of the data does not conform to the schema, the loader
generates a null value.
A = LOAD 'data' AS (name:chararray, age:int, gpa:float); |
·
If an explicit cast is specified and the
conversion from the source type is not available, a type-checker error will
occur and the processing would be aborted. For instance:
A = LOAD 'data' AS (name:chararray, age:int, gpa:float); B = FOREACH A GENERATE (int)name; |
· Implicit casts are not applied for those types for which conversions are not available. The following would result in a parsing error.
A = LOAD 'data' AS (name:chararray, age:int, gpa:float); B = FOREACH A GENERATE name + gpa; |
A tuple (row) is an
ordered set of data items.
( data_item [, data_item …] ) |
( ) |
Tuples are enclosed in parentheses ( ). |
data_item |
A single data item (datum), referred to as a field or column (of fields). A data item can be any Pig data type. |
Note the following:
· Relations can have duplicate tuples.
· The tuples in a relation can have a different number of data items. If Pig tries to access a data item that does not exist, a null value is returned.
· The data items in tuples can have different data types.
In this example the data contains four tuples (rows) with differing numbers of fields. When Pig processes the data, null values are substituted for the missing data.
data ------------------------- Bob dog Tom Jane cat 50 |
A = LOAD 'data' AS (f1:chararray, f2:chararray, f3:int); ------------------------- ( (Bob,dog,) (Tom,,) (Jane,cat,50) |
B = FOREACH A GENERATE f3; ------------------------- (25) () () (50) |
A bag is a
collection of tuples.
{ tuple
[, tuple …] } |
{ } |
Bags are enclosed in curly brackets { }. |
tuple |
A tuple. |
The usage rules that apply to tuples (duplicate tuples, different number of data items, and different data types) also apply to the tuples within a bag. However, for Pig to effectively process a relation containing bags, the schemas of the tuples within those bags should be the same.
For example, if half of the tuples include chararray fields and while the other half include float fields, only half of the tuples will participate in any kind of computation because the chararray fields will be converted to null.
In this example each bag contains three tuples all of which have the same data type (int).
{ (1), (1, 2, 3), (25, 50) } { (2), (4, 5, 6), (50, 75) } |
A map is a set of
key value pairs.
[ key#value
<, key#value …> ] |
[ ] |
Maps are enclosed in straight brackets [ ]. |
# |
Key value pairs are denoted by the pound sign #. |
key |
Must be a scalar data type. Must be a unique value. |
value |
Any data type. |
Key values within a relation must be unique.
In this example the map includes two key value pairs.
[ name#john, phone#5551212 ] |
Operator |
Symbol |
Notes |
addition |
+ |
|
subtraction |
- |
|
multiplication |
* |
|
division |
/ |
|
modulo |
% |
Returns the remainder of a divided by b. |
bincond |
? |
The value of the data item is one of two options, chosen according to a condition. |
X = FOREACH A GENERATE f1, f2, f1%f2; |
Suppose we have relation A.
(1,{(2,3),(4,6),(5,7)}) (2,{(2,3),(4,6),(5,7)}) (1,{(2,3),(4,6)}) |
In this example the bincond condition states that if f1 equals 1 then the value is 2; if f1 does not equal 1, then the value is COUNT(f2).
X = FOREACH A GENERATE f1, (f1==1?2:COUNT(f2)); |
Relation X looks like this.
(1,2L) (2,3L) (1,2L) |
* bytearray cast as this data type
          | bag   | tuple   | map   | int   | long  | float | double | chararray | bytearray
bag       | error | error   | error | error | error | error | error  | error     | error
tuple     |       | not yet | error | error | error | error | error  | error     | error
map       |       |         | error | error | error | error | error  | error     | error
int       |       |         |       | int   | long  | float | double | error     | cast as int
long      |       |         |       |       | long  | float | double | error     | cast as long
float     |       |         |       |       |       | float | double | error     | cast as float
double    |       |         |       |       |       |       | double | error     | cast as double
chararray |       |         |       |       |       |       |        | error     | error
bytearray |       |         |       |       |       |       |        |           | cast as double
* bytearray cast as this data type
          | bag   | tuple | map   | int     | long    | float   | double  | chararray | bytearray
bag       | error | error | error | not yet | not yet | not yet | not yet | error     | error
tuple     |       | error | error | not yet | not yet | not yet | not yet | error     | error
map       |       |       | error | error   | error   | error   | error   | error     | error
int       |       |       |       | int     | long    | float   | double  | error     | cast as int
long      |       |       |       |         | long    | float   | double  | error     | cast as long
float     |       |       |       |         |         | float   | double  | error     | cast as float
double    |       |       |       |         |         |         | double  | error     | cast as double
chararray |       |       |       |         |         |         |         | error     | error
bytearray |       |       |       |         |         |         |         |           | cast as double
          | int | long | bytearray
int       | int | long | cast as int
long      |     | long | cast as long
bytearray |     |      | error
Operator |
Symbol |
Notes |
equal |
== |
|
not equal |
!= |
|
less than |
< |
|
greater than |
> |
|
less than or equal to |
<= |
|
greater than or equal to |
>= |
|
pattern matching |
matches |
Regular expression matching. Use the Java format for regular expressions. |
Use the comparison operators with numeric and string data.
X = FILTER A BY (f1 == 8); |
X = FILTER A BY (f2 == 'apache'); |
X = FILTER A BY (f1 matches '.*apache.*'); |
* bytearray cast as this data type
          | bag   | tuple                | map                  | int     | long    | float   | double  | chararray | bytearray
bag       | error | error                | error                | error   | error   | error   | error   | error     | error
tuple     |       | boolean (see Note 1) | error                | error   | error   | error   | error   | error     | error
map       |       |                      | boolean (see Note 2) | error   | error   | error   | error   | error     | error
int       |       |                      |                      | boolean | boolean | boolean | boolean | error     | cast as boolean
long      |       |                      |                      |         | boolean | boolean | boolean | error     | cast as boolean
float     |       |                      |                      |         |         | boolean | boolean | error     | cast as boolean
double    |       |                      |                      |         |         |         | boolean | error     | cast as boolean
chararray |       |                      |                      |         |         |         |         | boolean   | cast as boolean
bytearray |       |                      |                      |         |         |         |         |           | boolean
Note 1: boolean. (Tuple A is equal to tuple B if they have the same size s, and for all 0 <= i < s, A[i] == B[i].)
Note 2: boolean. (Map A is equal to map B if A and B have the same number of entries, and for every key k1 in A with a value of v1, there is a key k2 in B with a value of v2, such that k1 == k2 and v1 == v2.)
          | bag   | tuple | map   | int     | long    | float   | double  | chararray | bytearray
bag       | error | error | error | error   | error   | error   | error   | error     | error
tuple     |       | error | error | error   | error   | error   | error   | error     | error
map       |       |       | error | error   | error   | error   | error   | error     | error
int       |       |       |       | boolean | boolean | boolean | boolean | error     | boolean (bytearray cast as int)
long      |       |       |       |         | boolean | boolean | boolean | error     | boolean (bytearray cast as long)
float     |       |       |       |         |         | boolean | boolean | error     | boolean (bytearray cast as float)
double    |       |       |       |         |         |         | boolean | error     | boolean (bytearray cast as double)
chararray |       |       |       |         |         |         |         | boolean   | boolean (bytearray cast as chararray)
bytearray |       |       |       |         |         |         |         |           | boolean
          | chararray | bytearray
chararray | boolean   | boolean (bytearray cast as chararray)
bytearray |           | boolean
Operator |
Symbol |
Notes |
is null |
is null |
|
is not null |
is not null |
|
X = FILTER A BY f1 is not null; |
The null operators can be applied to all data types. For more information, see Nulls.
Operator |
Symbol |
Notes |
AND |
and |
|
OR |
or |
|
NOT |
not |
|
X = FILTER A BY (f1==8) OR (NOT (f2+f3 >
f1)); |
Operator |
Symbol |
Notes |
tuple dereference |
. (dot) |
a.$0 |
map dereference |
# |
mymap#mykey |
Note the following:
·
Tuple dereferencing can be done by name or
position. For example mytuple.myfield and mytuple.$0 are both
valid dereferences.
If the dot operator is applied to a bytearray, the bytearray will be assumed to
be a tuple.
·
Tuple dereferencing can be done at the relation
or field level. For example, if you have relation A with three fields
(f1,f2,and f3) you
can state A.f1. Or, if you have a bag with three tuples B{(1),(2),(3)} you can
state B.$0
·
Map dereferencing must be done by key, for
example mymap#mykey
If the pound operator is applied to a bytearray, the bytearray
is assumed to be a map. If the key being looked up does not exist, the empty
string is returned.
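A minimal sketch of both dereference forms (the file name, field names, and key are hypothetical):

-- t is a tuple field and m is a map field in this hypothetical schema.
A = LOAD 'data' AS (t:tuple(a:int, b:int), m:map[]);
-- Dereference the tuple by name and by position, and the map by key.
X = FOREACH A GENERATE t.a, t.$1, m#'mykey'; |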
Operator |
Symbol |
Notes |
positive |
+ |
Has no effect. |
negative (negation) |
- |
Changes the sign of a positive or negative number. |
A = LOAD 'data' as (x, y, z); B = FOREACH A GENERATE -x, y; |
Data Type | Result of Sign Operator
bag       | error
tuple     | error
map       | error
int       | int
long      | long
float     | float
double    | double
chararray | error
bytearray | double (as double)
Pig Latin supports casts as shown in this table.
from \ to | bag   | tuple | map   | int   | long  | float | double | chararray | bytearray
bag       |       | error | error | error | error | error | error  | error     | error
tuple     | error |       | error | error | error | error | error  | error     | error
map       | error | error |       | error | error | error | error  | error     | error
int       | error | error | error |       | yes   | yes   | yes    | yes       | error
long      | error | error | error | yes   |       | yes   | yes    | yes       | error
float     | error | error | error | yes   | yes   |       | yes    | yes       | error
double    | error | error | error | yes   | yes   | yes   |        | yes       | error
chararray | error | error | error | yes   | yes   | yes   | yes    |           | error
bytearray | yes   | yes   | yes   | yes   | yes   | yes   | yes    | yes       |
(data_type) data_item |
(data_type ) |
Any Pig Latin data type except bytearray (you cannot cast to type bytearray) enclosed in parentheses ( ). |
data_item |
A single data item (datum), referred to as a field. The data item can be represented in positional notation or as an alias. For example: (int)$0 or (int)f1 |
Please note the following:
· A field can be explicitly cast. Once cast, the field remains that type (it is not automatically cast back). In this example $0 is explicitly cast to int.
B = FOREACH A GENERATE (int)$0 + 1; |
· Where possible implicit casts are performed. In this example $0 is cast to int (regardless of underlying data) and $1 is cast to double.
B = FOREACH A GENERATE $0 + 1, $1 + 1.0; |
· When two bytearrays are used in arithmetic expressions or with built-in aggregate functions (such as SUM) they are implicitly cast to double. If the underlying data is really int or long, you’ll get better performance by declaring the type or explicitly casting the data.
· Downcasts may cause loss of data. For example casting from long to int may drop bits.
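A small sketch of these two points (the file and field names are hypothetical): f1 is declared as int so SUM works on integer data rather than defaulting to double, and the explicit (int) cast deliberately downcasts a double, dropping any fractional part.

A = LOAD 'data' AS (f1:int, d:double);
B = GROUP A ALL;
C = FOREACH B GENERATE SUM(A.f1);   -- int input, no implicit cast to double
D = FOREACH A GENERATE (int)d;      -- downcast; the fractional part is lost |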
In Pig Latin,
nulls are implemented using the SQL definition of null as unknown or
non-existent. Nulls can occur naturally in data or can be the result of an
operation.
Pig Latin
operators interact with nulls as shown in this table.
Operator |
Interaction |
Comparison operators: ==, !=, >, <, >=, <= |
If either sub-expression is null, the result is null. |
Comparison operator: matches |
If either the string being matched against or the string defining the match is null, the result is null. |
Arithmetic operators: +, -, *, /, % (modulo), ? (bincond) |
If either sub-expression is null, the resulting expression is null. |
Null operator: is null |
If the tested value is null, returns true; otherwise, returns false. |
Null operator: is not null |
If the tested value is not null, returns true; otherwise, returns false. |
Dereference operators: tuple (.) or map (#) |
If the de-referenced tuple or map is null, returns null. |
Cast operator |
Casting a null from one type to another type results in a null. |
Functions: AVG, MIN, MAX, SUM |
These functions ignore nulls. |
Function: COUNT |
This function counts all values, including nulls. |
Function: CONCAT |
If either sub-expression is null, the resulting expression is null. |
Function: SIZE |
If the tested object is null, returns null. |
For Boolean
sub-expressions, note the results when nulls are used with these operators:
· FILTER operator – If a Boolean sub-expression results in null value, the filter does not pass them through (if X is null, !X is also null, and the filter will reject both).
· Bincond operator – If a Boolean sub-expression results in null value, the resulting expression is null (see the interactions above for Arithmetic operators)
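A minimal sketch of the FILTER behavior described above (the field name is hypothetical): tuples where f1 is null satisfy neither condition, so they appear in neither X nor Y.

A = LOAD 'data' AS (f1:int);
X = FILTER A BY f1 > 5;         -- null > 5 evaluates to null, so the tuple is rejected
Y = FILTER A BY NOT (f1 > 5);   -- NOT null is also null, so the tuple is rejected again |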
As noted, the
COUNT function counts all values, including nulls. If you don't want the
function to count null values, you can use one of the methods shown here.
In this example the
is
not null operator is used to
filter (remove) all null values before subsequent operations, including the
COUNT function, are applied.
A = LOAD 'data'; B = FILTER A BY $1 is not null; C = GROUP B BY $0; D = FOREACH C GENERATE group, COUNT(B.$1); |
Suppose you have written a function, RemoveNulls, to filter null values. In this example RemoveNulls is used to filter nulls values for the COUNT function only.
A = LOAD 'data'; B = GROUP A BY $0; D = FOREACH B GENERATE group, COUNT(RemoveNulls($1)); |
Nulls can be used as constant
expressions in place of expressions of any type.
In this example a and null are projected.
A = LOAD 'data' AS (a, b, c); B = FOREACH A GENERATE a, null; |
In this
example of an outer join, if the join key is missing from a table it is
replaced by null.
A = LOAD '/studenttab10k' AS (name: chararray, age: int, gpa: float); B = LOAD 'votertab10k' AS (name: chararray, age: int, registration: chararray, donation: float); C = COGROUP A BY name, B BY name; D = FOREACH C GENERATE FLATTEN((IsEmpty(A)
? null : A)), FLATTEN((IsEmpty(B) ? null : B)); |
Like any other expression, null constants can be implicitly or
explicitly cast.
In this example both a and null will be implicitly cast to double.
A = LOAD 'data' AS (a, b, c); B = FOREACH A GENERATE a + null; |
In this example both a and null
will be cast to int, a implicitly, and null explicitly.
A = LOAD 'data' AS (a, b, c); B = FOREACH A GENERATE a + (int)null; |
As noted, nulls
can be the result of an operation. These operations can produce null values:
· Division by zero
· Returns from user defined functions (UDFs)
· Dereferencing a map key that does not exist in a map. For example, given a map info containing [name#john, phone#5551212] if a user tries to use info#address a null is returned.
· Accessing a field that does not exist in a tuple. As a further explanation, see the examples below.
In this example if
some rows in 'data' contain only a single column, nulls are injected for these
rows.
A = LOAD 'data' AS (f1, f2); B = FOREACH A GENERATE $1; |
Similarly, in
this example if a load statement does not include a declared schema, nulls are
injected if some rows contain only a single column.
A = LOAD 'data'; B = FOREACH A GENERATE $1; |
In this example an
error is generated because the requested column ($2) is outside of the declared
schema (remember, positional notation begins with $0).
A = LOAD 'data' AS (f1, f2); B = FOREACH A
GENERATE $2; |
As noted,
nulls can occur naturally in the data. If nulls are part of the data, it is the responsibility of the load function
to handle them correctly. Keep in mind that what is considered a null value is
loader-specific; however, the load function should always communicate null
values to Pig by producing Java nulls.
The Pig Latin
load functions (for example, PigStorage and TextLoader) produce null values
wherever data is missing. For example, empty strings (chararrays) are not
loaded; instead, they are replaced by nulls.
PigStorage is the
default load function for the LOAD operator. In this example the is not null operator is used to filter names with
null values.
A = LOAD 'data' AS (name, age, gpa); B = FILTER A BY name is not null; |
Pig
provides constant representations for all data types except bytearrays.
|
Constant Example |
Notes |
Simple Data Types |
|
|
Scalars |
|
|
int |
19 |
|
long |
19L |
|
float |
19.2F or 1.92e2f |
|
double |
19.2 or 1.92e2 |
|
Arrays |
|
|
chararray |
'hello world' |
|
bytearray |
|
Not applicable. |
Complex Data Types |
|
|
tuple |
(19, 2, 1) |
A constant in this form creates a tuple |
bag |
{ (19, 2), (1, 2) } |
A constant in this form creates a bag |
map |
[ 'name' # 'John', 'ext' # 5555 ] |
A constant in this form creates a map |
Please
note the following:
· Chararray (string) constants can be specified as UTF-8 characters (for example, f1 matches 'abc' ).
· Any numeric constant consisting of just digits (for example, 123) is assigned the type of int.
· To specify a long constant, l or L must be appended to the number (for example, 12345678L). If the l or L is not specified, but the number is too large to fit into an int, the problem will be detected at parse time and the processing is terminated.
· Any numeric constant with decimal point (for example, 1.5) and/or exponent (for example, 5e+1) is treated as double unless it ends with f or F in which case it is assigned type float (for example, 1.5f).
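As a small sketch of the literal forms described above (the relation and field are hypothetical), each constant below is typed according to these rules:

A = LOAD 'data' AS (f1:int);
X = FOREACH A GENERATE 19, 19L, 19.2F, 1.92e2, 'hello world';
-- 19 is an int, 19L a long, 19.2F a float, 1.92e2 a double, 'hello world' a chararray |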
The data type definitions for tuples, bags,
and maps apply to constants:
· A tuple can be any data type
· A bag is a collection of tuples
· A map key must be a scalar; a map value can be any data type
Complex
constants can be used in the same places scalar constants can be used, that is,
in FILTER and GENERATE statements.
A = LOAD 'data' USING MyStorage() AS (T: tuple(name:chararray, age: int)); B = FILTER A BY T == ('john', 25); D = FOREACH B GENERATE T.name, [25#5.6], {(1, 5, 18)}; |
In Pig Latin, expressions
are language constructs used with the FILTER, FOREACH, GROUP, and SPLIT operators.
Expressions are
written in conventional mathematical infix notation and are adapted to
the UTF-8 character set. Depending on the context, expressions can include:
·
Any
Pig data type (simple data types, complex data types)
·
Any
Pig operator (arithmetic, comparison, null, boolean, dereference, sign, and
cast)
·
Any
Pig built-in function.
· Any user-defined function (UDF) written in Java.
In a Pig Latin statement, an arithmetic expression could look like this:
X = GROUP A by f2*f3; |
A string expression could look like this, where a and b are both chararrays:
X = FOREACH A GENERATE CONCAT(a,b); |
A boolean expression could look like this:
X = FILTER A BY (f1==8) OR (NOT (f2+f3 >
f1)); |
Schemas enable you to assign names to and declare types for fields. Schemas are defined using the AS keyword with the LOAD, STREAM, and FOREACH statements.
Schemas are optional but we encourage you to use
them whenever possible; type declarations result in better parse-time error
checking and more efficient code execution.
If you don't define a schema, the fields are not named and default to type bytearray. You can refer to un-named fields using positional notation (see Data Items). You can change the field type using the cast operators (see Cast Operators).
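A minimal sketch of working without a schema (the file name is hypothetical): fields are addressed positionally and cast where a specific type is needed.

A = LOAD 'data';                          -- no schema, so all fields default to bytearray
X = FOREACH A GENERATE (int)$0 + 1, $1;   -- cast $0 to int before doing arithmetic |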
With LOAD and STREAM statements, the schema following the AS keyword must be enclosed in parentheses.
In this example the LOAD statement includes a schema definition for simple data types.
A = LOAD 'data' AS (f1:int, f2:int); |
With FOREACH statements, the schema following the AS keyword must be enclosed in parentheses when the FLATTEN keyword is used. Otherwise, the schema should not be enclosed in parentheses.
In this example the FOREACH statement includes the FLATTEN keyword and a schema for simple data types.
X = FOREACH C GENERATE FLATTEN(A) AS (f1:int, f2:int, f3:int); |
In this example the FOREACH statement includes a schema for simple data types.
X = FOREACH A GENERATE f1+f2 AS x1:int; |
Simple data types include int, long, float, double, chararray, and bytearray (see Data Types).
( alias[:type] [, alias[:type] …] ) |
alias |
The name assigned to the field. |
type |
(Optional) The simple data type assigned to the field. The alias and type are separated by a colon ( : ). If the type is omitted, the field defaults to type bytearray. |
( , ) |
Multiple fields are enclosed in parentheses and separated by commas. |
In this example the schema defines multiple fields.
A = LOAD 'data' AS (name:chararray, age:int, gpa:float); |
In this example the schema defines multiple fields. Field "b" will default to bytearray because no type is declared.
A = LOAD 'data' AS (a:int, b, c:int); |
Complex data types include tuples, bags, and maps (see Data Types).
A tuple (row) is an
ordered set of data items.
alias[:tuple] ( alias[:type] [, alias[:type] …] ) |
alias |
The name assigned to the tuple. |
:tuple |
(Optional) The data type, tuple (case insensitive). |
( ) |
The designation for a tuple, a set of parentheses. |
alias[:type] |
The constituents of the tuple, where the schema definition rules for the corresponding type applies to the constituents of the tuple: · alias – the name assigned to the field · type (optional) – the simple or complex data type assigned to the field |
In this example the schema defines one field as a tuple. Field "f3" will default to type bytearray. Both statements are equivalent.
A = LOAD 'data' AS (T: tuple (f1:int, f2: int, f3: long)); A = LOAD 'data' AS (T: (f1:int, f2: int, f3: long)); |
In this example the schema defines two fields as tuples.
A = LOAD 'data' AS (X:TUPLE (x1:int, x2: int, x3: int), Y:TUPLE (y1:int, y2: int, y3:int)); |
A bag is a collection of tuples.
alias[:bag] {tuple} |
alias |
The name assigned to the bag. |
:bag |
(Optional) The data type, bag (case insensitive). |
{ } |
The designation for a bag, a set of curly brackets. |
tuple |
A tuple (see Tuple Schema). |
In this example the schema defines a field as a bag. This bag contains one tuple. Both statements are equivalent.
A = LOAD 'data' AS (B: bag {T: tuple(f1:int, f2:int, f3:long)}); A = LOAD 'data' AS (B: {T: (f1:int, f2:int, f3:long)}); |
In this example the schema defines a field as a bag. The tuples in the bag must conform to tuple schemas (T1 and T2) described in the bag schema (B).
A = LOAD 'data' AS (B: bag {T1: tuple(a:int, b:int), T2:tuple(x:int, y:int)}); |
In this example the schema defines two fields as bags. The tuples in the bags must conform to the tuple schemas (T1 and T2) described in the bag schemas (B1 and B2).
A = LOAD 'data' AS (B1: bag {T1: tuple(a:int, b:int)}, B2: bag {T2:tuple(x:int, y:int)}); |
A map is a set of key value pairs.
alias<:map> [ ] |
alias |
The name assigned to the map. |
:map |
(Optional) The data type, map (case insensitive). |
[ ] |
The designation for a map, a set of straight brackets [ ]. |
In this example the schema defines a field as a map. Both statements are equivalent.
A = LOAD 'data' AS (M:map []); A = LOAD 'data' AS (M:[]); |
You can define
schemas for data that includes multiple types.
In this example the schema defines three fields.
The field names are T, B, and M. The field types are tuple, bag, and map. The
two statements are equivalent.
A = LOAD 'mydata' USING MyStorage() AS (T:tuple(f1:int, f2:int, f3:long), B:bag{t:tuple(x:int)}, M: map[] ); A = LOAD 'mydata' USING MyStorage() AS (T:(f1:int, f2:int, f3:long), B:{t:(x:int)}, M: [] ); |
The Pig Latin keywords, built-in function names, symbols, and preprocessor statements are listed here.

Keywords (A-E): and, all, as, asc, bag, by, bytearray, cache, cat, cd, chararray, cogroup, copyFromLocal, copyToLocal, cp, cross, distinct, define, desc, describe, double, du, dump, e, E, explain
Keywords (F-L): f, F, filter, flatten, float, foreach, generate, group, help, if, illustrate, inner, input, int, into, is, join, kill, l, L, limit, load, long, ls
Keywords (M-U): map, matches, mkdir, mv, not, null, or, order, outer, output, parallel, pwd, quit, register, rm, rmf, set, ship, split, stderr, stdin, stdout, store, stream, through, tuple, union, using

Functions: AVG, BinaryDeserializer, BinarySerializer, BinStorage, CONCAT, COUNT, DIFF, MIN, MAX, PigDump, PigStorage, SIZE, SUM, TextLoader, TOKENIZE

Symbols: == != < > <= >=, + - * / %, ? $ . # ( ) [ ] { }

Preprocessor Statements: %declare, %default
Groups the data in two or more
relations.
alias = COGROUP alias BY field_alias [INNER | OUTER] , alias BY field_alias [INNER | OUTER] [PARALLEL n] ; |
alias |
The name of a relation. |
field_alias |
The name of one or more fields in a relation. If multiple fields are specified, separate them with commas and enclose them in parentheses; for example, X = COGROUP A BY (f1, f2);. The number of fields specified in each BY clause must match; for example, X = COGROUP A BY (a1, a2), B BY (b1, b2); |
BY |
Keyword. |
INNER |
Keyword. |
OUTER |
Keyword. |
PARALLEL n |
Increase the parallelism of a job by specifying the number of reduce tasks, n. The optimal number of parallel tasks depends on the amount of memory on each node and the memory required by each of the tasks. To determine n, use the following as a general guideline: n = (nr_nodes - 1) * 0.45 * nr_GB where nr_nodes is the number of nodes used and nr_GB is the amount of physical memory on each node. Note the following: · Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block. · If you don’t specify parallel, you still get the same map parallelism but only one reduce task. |
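As a hypothetical illustration of this guideline, a cluster with 10 nodes and 16 GB of physical memory per node would give n = (10 - 1) * 0.45 * 16 = 64.8, so about 65 reduce tasks.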
The COGROUP
operator groups the data in two or more relations based on the common field
values.
Note: The COGROUP and JOIN operators perform similar functions. COGROUP creates
a nested set of output tuples while JOIN creates a flat set of output tuples.
Suppose we have
two relations, A and B.
A: (owner:chararray, pet:chararray)
---------------
(Alice, turtle)
(Alice, goldfish)
(Alice, cat)
(Bob, dog)
(Bob, cat)

B: (friend1:chararray, friend2:chararray)
---------------------
(Cindy, Alice)
(Mark, Alice)
(Paul, Bob)
(Paul, Jane) |
In this example
tuples are co-grouped using field “owner” from relation A and field “friend2” from
relation B as the key fields. The DESCRIBE operator shows the schema for
relation X, which has three fields: "group", "A", and "B" (for an explanation, see GROUP).
X = COGROUP A BY owner, B BY friend2; DESCRIBE X; X: {group: chararray,A: {owner: chararray,pet: chararray},B: {friend1: chararray,friend2: chararray}} |
Relation X looks
like this. A tuple is created for each unique key field. The tuple includes the
key field and two bags. The first bag is the tuples from the first relation
with the matching key field. The second bag is the tuples from the second
relation with the matching key field. If no tuples match the key field, the bag
is empty.
(Alice,
{(Alice, turtle), (Alice, goldfish), (Alice, cat)}, {(Cindy, Alice), (Mark,
Alice)}) (Bob,
{(Bob, dog), (Bob, cat)}, {(Paul, Bob)}) (Jane, {}, {(Paul, Jane)}) |
In this example
tuples are co-grouped and the INNER keyword is used to ensure that only bags
with at least one tuple are returned.
X = COGROUP A BY owner INNER, B BY friend2 INNER; |
Relation X looks
like this.
(Alice, {(Alice, turtle), (Alice, goldfish), (Alice, cat)}, {(Cindy, Alice), (Mark, Alice)}) (Bob,
{(Bob, dog), (Bob, cat)}, {(Paul, Bob)}) |
In this example
tuples are co-grouped and the INNER keyword is used asymmetrically on only one
of the relations.
X = COGROUP A BY owner, B BY friend2 INNER; |
Relation X looks
like this.
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)}) (Jane,{},{(Paul,Jane)}) (Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)}) |
Computes the
cross product of two or more relations.
alias = CROSS alias, alias [, alias …] [PARALLEL n]; |
alias |
The name of a relation. |
PARALLEL n |
Increase the parallelism of a job by specifying the number of reduce tasks, n. The optimal number of parallel tasks depends on the amount of memory on each node and the memory required by each of the tasks. To determine n, use the following as a general guideline: n = (nr_nodes - 1) * 0.45 * nr_GB where nr_nodes is the number of nodes used and nr_GB is the amount of physical memory on each node. Note the following: · Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block. · If you don’t specify parallel, you still get the same map parallelism but only one reduce task. |
Use the CROSS
operator to compute the cross product (Cartesian product) of two or more
relations.
CROSS is an
expensive operation and should be used sparingly.
Suppose we have
relations A and B.
(A) (B) ----------- -------- (1, 2, 3) (2, 4) (4, 2, 1) (8, 9) (1, 3) |
In this example the
cross product of relation A and B is computed.
X = CROSS A, B; |
Relation X looks
like this.
(1, 2, 3,
2, 4) (1, 2, 3,
8, 9) (1, 2, 3,
1, 3) (4, 2, 1,
2, 4) (4, 2, 1, 8, 9) (4, 2, 1,
1, 3) |
Removes duplicate
tuples in a relation.
alias = DISTINCT alias [PARALLEL n]; |
alias |
The name of the relation. |
PARALLEL n |
Increase the parallelism of a job by specifying the number of reduce tasks, n. The optimal number of parallel tasks depends on the amount of memory on each node and the memory required by each of the tasks. To determine n, use the following as a general guideline: n = (nr_nodes - 1) * 0.45 * nr_GB where nr_nodes is the number of nodes used and nr_GB is the amount of physical memory on each node. Note the following: · Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block. · If you don’t specify parallel, you still get the same map parallelism but only one reduce task. |
Use the DISTINCT operator
to remove duplicate tuples in a relation. DISTINCT does not preserve the original
order of the contents (to eliminate duplicates, Pig must first sort the data). You
cannot use DISTINCT on a subset of fields. To do this, use FOREACH … GENERATE
to select the fields, and then use DISTINCT.
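A minimal sketch of that workaround (the relation and field names are hypothetical):

P = FOREACH A GENERATE f1, f2;   -- project only the fields of interest
X = DISTINCT P;                  -- then remove duplicate (f1, f2) pairs |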
Suppose we have
relation A.
(A) ----------------- (8, 3, 4) (1, 2, 3) (4, 3, 3) (4, 3, 3) (1, 2, 3) |
In this example
all duplicate tuples are removed.
X = DISTINCT A; |
Relation X looks
like this.
(1, 2, 3) (4, 3, 3) (8, 3, 4) |
Selects tuples
(rows) from a relation based on some condition.
alias = FILTER alias BY expression; |
alias |
The name of the relation. |
BY |
Required keyword. |
expression |
An expression. |
Use the FILTER operator to work with tuples (rows) of data. FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don’t want.
Note: If you want to work with
specific fields (columns) of data, use the FOREACH …GENERATE operation.
Suppose we have
relation A.
(A: f1:int,
f2:int, f3:int) ---------------- (1, 2, 3) (4, 2, 1) (8, 3, 4) (4, 3, 3) (7, 2, 5) (8, 4, 3) |
In this example
the condition states that if the third field equals 3, then add the tuple to
relation X.
X = FILTER A BY f3 == 3; |
Relation X looks
like this.
(1, 2, 3) (4, 3, 3) (8, 4, 3) |
In this example
the condition states that if the first field equals 8 or if the sum of fields
f2 and f3 is not greater than first field, then add the tuple to relation X.
X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1)); |
Relation X looks
like this.
(4, 2, 1) (8, 3, 4) (7, 2, 5) (8, 4, 3) |
Generates data
transformations based on fields (columns) of data.
alias = FOREACH { gen_blk | nested_gen_blk } [AS schema]; |
alias |
The name of a relation. |
gen_blk |
FOREACH … GENERATE used with a non-nested relation. Use this syntax: alias = FOREACH alias GENERATE expression [expression ….] |
nested_gen_blk |
FOREACH … GENERATE used with a nested relation. Use this syntax: alias = FOREACH nested_alias { alias = nested_op; [alias = nested_op; …] GENERATE expression [expression ….] }; Where: The nested block is enclosed in opening and closing brackets { … }. The GENERATE keyword must be the last statement within the nested block. |
expression |
An expression. |
nested_alias |
If one of the fields (columns) in a relation is a bag, the bag can be treated as an inner or a nested relation. |
nested_op |
Allowable operations include FILTER, ORDER, and DISTINCT. The FOREACH … GENERATE operation itself is not allowed since this could lead to an arbitrary number of nesting levels. |
AS |
Keyword. |
schema |
A schema using the AS keyword (see Schemas). · If the FLATTEN keyword is used, enclose the schema in parentheses. · If the FLATTEN keyword is not used, don't enclose the schema in parentheses. |
Use the FOREACH …GENERATE
operation to work with individual fields (columns) of data. The FOREACH …GENERATE
operation works with non-nested and nested relations.
A statement with
a non-nested relation A could look like this.
X = FOREACH A GENERATE f1; |
A statement with
a nested relation A could look like this.
X = FOREACH B { S = FILTER A by 'xyz'; GENERATE COUNT (S.$0); } |
Note: FOREACH … GENERATE works with fields (columns) of data. If you want to
work with entire tuples (rows) of data, use the FILTER operation.
Suppose we have relations
A and B, and derived relation C (where C = COGROUP A BY a1 INNER, B BY b1 INNER;).
(A: a1:int,
a2:int, a3:int) ----------------- (1, 2, 3) (4, 2, 1) (8, 3, 4) (4, 3, 3) (7, 2, 5) (8, 4, 3) |
(B: b1:int,
b2:int) --------------- (2, 4) (8, 9) (1, 3) (2, 7) (2, 9) (4, 6) (4, 9) |
(C: group, A, B) --------------------- (1, {(1, 2, 3)}, {(1, 3)}) (4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4,
9)}) (8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)}) |
In this example
the asterisk (*) is used to project all fields from relation A to relation X (this
is similar to SQL Select *). Relations A and X are identical.
X = FOREACH A GENERATE *; |
In this example
two fields from relation A are projected to form relation X.
X = FOREACH A GENERATE a1, a2; |
Relation X looks
like this.
(1, 2) (4, 2) (8, 3) (4, 3) (7, 2) (8, 4) |
Note: See GROUP
for information about the "group" field in relation C.
In this example
if one of the fields in the input relation is a tuple, bag or map, we can perform
projection on that field.
X = FOREACH C GENERATE group, B.b2; |
Relation X looks
like this.
(1, {(3)}) (4, {(6), (9)}) (8, {(9)}) |
In this example
multiple nested columns are retained.
X = FOREACH C GENERATE group, A.(a1, a2); |
Relation X looks
like this.
(1, {(1, 2)}) (4, {(4, 2), (4, 3)}) (8, {(8, 3), (8, 4)}) |
In this example
two fields in relation A are summed to form relation X. A schema is defined for
the projected field.
X = FOREACH A GENERATE a1+a2 AS f1:int; Y = FILTER X by f1 > 10; |
Relations X and Y
look like this.
(X) (Y) ----- ------ (3) (11) (6) (12) (11) (7) (9) (12) |
Note: See GROUP
for information about the "group" field in relation C.
In this example
the built-in function SUM() is used to sum a set of numbers in a bag.
X = FOREACH C GENERATE group, SUM (A.a1); |
Relation X looks
like this.
(1, 1) (4, 8) (8, 16) |
Note: See GROUP
for information about the "group" field in relation C.
In this example
the FLATTEN keyword is used to eliminate nesting.
X = FOREACH C GENERATE group, FLATTEN(A); |
Relation X looks
like this.
(1, 1, 2, 3) (4, 4, 2, 1) (4, 4, 3, 3) (8, 8, 3, 4) (8, 8, 4, 3) |
Another FLATTEN
example.
X = FOREACH C GENERATE group, FLATTEN(A.a3); |
Relation X looks
like this.
(1, 3) (4, 1) (4, 3) (8, 4) (8, 3) |
Another FLATTEN
example.
X = FOREACH C GENERATE FLATTEN(A.(a1, a2)), FLATTEN(B.$1); |
Relation X looks
like this. Note that for the group '4' in C, there are two tuples in each bag.
Thus, when both bags are flattened, the cross product of these tuples is
returned; that is, tuples (4, 2, 6), (4, 3, 6), (4, 2, 9), and (4, 3, 9).
(1, 2, 3) (4, 2, 6) (4, 3, 6) (4, 2, 9) (4, 3, 9) (8, 3, 9) (8, 4, 9) |
Suppose we have
relation A and derived relation B (where B = GROUP A BY url;). Since relation B
contains tuples with bags it can be treated as a nested relation.
A (url:chararray, outlink:chararray) --------------------------------------------- (www.ccc.com,www.hjk.com) (www.ddd.com,www.xyz.org) (www.aaa.com,www.cvn.org) (www.www.com,www.kpt.net) (www.www.com,www.xyz.org) (www.ddd.com,www.xyz.org) B --------------------------------------------- (www.aaa.com,{(www.aaa.com,www.cvn.org)}) (www.ccc.com,{(www.ccc.com,www.hjk.com)}) (www.ddd.com,{(www.ddd.com,www.xyz.org),(www.ddd.com,www.xyz.org)}) (www.www.com,{(www.www.com,www.kpt.net),(www.www.com,www.xyz.org)})
|
In this example
we perform two of the allowed Pig operations, FILTER (FA) and DISTINCT (DA), as
well as projection (PA). Note that the last statement in the nested block must
be GENERATE.
X = FOREACH B { FA = FILTER A BY outlink == 'www.xyz.org'; PA = FA.outlink; DA = DISTINCT PA; GENERATE group, COUNT(DA); } |
Relation X looks
like this.
(www.ddd.com,1L) (www.www.com,1L) |
Groups the data
in a single relation.
alias = GROUP
alias [BY {[field_alias [, field_alias]] | * | [expression] } ] [ALL]
[PARALLEL n]; |
alias |
The name of a relation. |
BY |
Keyword. Use this clause to group the relation by fields or by expression. |
field_alias |
The name of a field in a relation. This is the group key or key field. A relation can be grouped by a single field (f1) or by the composite value of multiple fields (f1,f2). |
* |
The asterisk. A designator for all fields in the relation. |
expression |
An expression. |
ALL |
Keyword. Use ALL if you want all tuples to go to a single group; for example, when doing aggregates across entire relations. |
PARALLEL n |
Increase the parallelism of a job by specifying the number of reduce tasks, n. The optimal number of parallel tasks depends on the amount of memory on each node and the memory required by each of the tasks. To determine n, use the following as a general guideline: n = (nr_nodes - 1) * 0.45 * nr_GB where nr_nodes is the number of nodes used and nr_GB is the amount of physical memory on each node. Note the following: · Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block. · If you don’t specify parallel, you still get the same map parallelism but only one reduce task. |
The GROUP operator groups together tuples that have the same group key (key
field). The result of a GROUP operation is a relation that includes one tuple
per group. This tuple contains two fields:
·
The
first field is named "group" (do not confuse this with the GROUP
operator) and is the same type of the group key.
·
The
second field takes the name of the original relation and is type bag.
Suppose we have the following data:
john 25 3.6 george 25 2.9 anne 27 3.9 julia 28 3.6 |
And, suppose we perform the LOAD and GROUP statements shown below. We can
use the DESCRIBE operator to view the schemas for relation Y. We can use DUMP
to view the contents of Y.
Note that relation Y has two fields. The first field is named
"group" and is type int (the same as age). The second field takes the
name of the original relation "X" and is type bag (that can contain
tuples with three elements of type chararray, int, and float).
Statements |
X = LOAD 'data' AS (name:chararray, age:int, gpa:float); Y = GROUP X BY age; DESCRIBE Y; Y: {group: int,X: {name: chararray,age: int,gpa: float}} DUMP Y; (25,{(john,25,3.6F),(george,25,2.9F)}) (27,{(anne,27,3.9F)}) (28,{(julia,28,3.6F)}) |
As shown in this FOREACH statement, we can
refer to the fields in relation Y by their names "group" and
"X".
Z = FOREACH Y GENERATE group, COUNT(X); |
Relation Z looks like this.
(25,2L) (27,1L) (28,1L) |
Suppose we have relation
A.
A: (owner:chararray, pet:chararray)
-----------------
(Alice, turtle)
(Alice, goldfish)
(Alice, cat)
(Bob, dog)
(Bob, cat) |
In this example
tuples are grouped using the field "owner."
X = GROUP A BY owner; |
Relation X looks
like this. "group" is the name of the first field. "A" is
the name of the second field.
(Alice, {(Alice, turtle), (Alice, goldfish), (Alice, cat)}) (Bob, {(Bob, dog), (Bob, cat)}) |
In this example tuples are grouped using the ALL keyword. Field
"A" is then counted and
projected to form relation Y.
X = GROUP A ALL; Y = FOREACH X GENERATE
COUNT(A); |
Relation X looks like this. "group" is the name of the first
field. "A" is the name of the second field.
(all, {(Alice, turtle), (Alice, goldfish), (Alice, cat), (Bob, dog), (Bob, cat)}) |
Relation Y looks
like this.
(5L) |
Suppose we have
relation S.
S: (f1:chararray, f2:int, f3:int) ----------------- (r1, 1, 2) (r2, 2, 1) (r3, 2, 8) (r4, 4, 4) |
In this example tuples are grouped using an
expression, f2*f3.
X = GROUP S BY f2*f3; |
Relation X looks like this. The first field is named "group". The second field is named "S".
(2, {(r1, 1, 2), (r2, 2, 1)}) (16, {(r3, 2, 8), (r4, 4, 4)}) |
Joins two or more
relations based on common field values.
alias = JOIN alias BY field_alias, alias BY field_alias [, alias BY field_alias …] [PARALLEL n]; |
alias |
The name of a relation. |
BY |
Keyword. |
field_alias |
The name of a field in a relation. The alias and field_alias specified in the BY clause must correspond. Example: X = JOIN relationA BY fieldA, relationB by fieldB, relationC by fieldC; |
PARALLEL n |
Increase the parallelism of a job by specifying the number of reduce tasks, n. The optimal number of parallel tasks depends on the amount of memory on each node and the memory required by each of the tasks. To determine n, use the following as a general guideline: n = (nr_nodes - 1) * 0.45 * nr_GB where nr_nodes is the number of nodes used and nr_GB is the amount of physical memory on each node. Note the following: · Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block. · If you don’t specify parallel, you still get the same map parallelism but only one reduce task. |
Use the JOIN
operator to join two or more relations based on common field values. The JOIN
operator always performs an inner join.
Note: The JOIN and COGROUP operators perform similar functions. JOIN creates a
flat set of output records while COGROUP creates a nested set of output
records.
Suppose we have
relations A and B.
(A: a1, a2,
a3) (B: b1, b2) ----------------- --------------- (1, 2, 3) (2, 4) (4, 2, 1) (8, 9) (8, 3, 4) (1, 3) (4, 3, 3) (2, 7) (7, 2, 5) (2, 9) (8, 4, 3) (4, 6) (4, 9) |
In this example
relations A and B are joined on their first fields.
X = JOIN A BY a1, B BY b1; |
Relation X looks
like this.
(1, 2, 3, 1, 3) (4, 2, 1, 4, 6) (4, 3, 3, 4, 6) (4, 2, 1, 4, 9) (4, 3, 3, 4, 9) (8, 3, 4, 8, 9) (8, 4, 3, 8, 9) |
Limits the number
of output tuples.
alias = LIMIT alias n; |
alias |
The name of a relation. |
n |
The number of tuples. |
Use the LIMIT
operator to limit the number of output tuples (rows). If the specified number
of output tuples is equal to or exceeds the number of tuples in the relation,
the output will include all tuples in the relation.
There is no
guarantee which tuples will be returned, and the tuples that are returned can
change from one run to the next. A particular set of tuples can be requested
using the ORDER operator followed by LIMIT.
Note: The LIMIT operator allows Pig to avoid processing all tuples in a
relation. In most cases a query that uses LIMIT will run more efficiently than
an identical query that does not use LIMIT. It is always a good idea to use
limit if you can.
Suppose we have
relation A.
(A: f1:int,
f2:int, f3:int) ----------------- (1, 2, 3) (4, 2, 1) (8, 3, 4) (4, 3, 3) (7, 2, 5) (8, 4, 3) |
In this example
output is limited to 3 tuples.
X = LIMIT A 3; |
Relation X could
look like this (there is no guarantee which three tuples will be output).
(1, 2, 3) (4, 3, 3) (7, 2, 5) |
In this example
the ORDER operator is used to order the tuples and the LIMIT operator is used
to output the first three tuples.
B = ORDER A BY f1 DESC, f2 ASC; X = LIMIT B 3; |
Relation B and
relation X look like this.
(B) (X) ----------- ----------- (8, 3, 4) (8, 3, 4) (8, 4, 3) (8, 4, 3) (7, 2, 5) (7, 2, 5) (4, 2, 1) (4, 3, 3) (1, 2, 3) |
Loads data from
the file system.
LOAD 'data' [USING function] [AS schema]; |
'data' |
The name of the file or directory, in single quotes. If you specify a directory name, all the files in the directory are loaded. You can use Hadoop-supported globbing to specify files at the file system or directory levels (see the Hadoop glob documentation). |
USING |
Keyword. |
function |
The load function. PigStorage is the default load/store function and does not need to be specified. This function reads/writes simple newline-separated records with delimiter-separated fields. The function has one parameter, the field delimiter; tab ('\t') is the default delimiter. If the data is stored in a special format that the Pig load functions cannot parse, you can write your own load function. |
AS |
Keyword. |
schema |
A schema using the AS keyword, enclosed in parentheses (see Schemas). |
Use the LOAD
operator to load data from the file system.
Suppose we have a
data file called myfile.txt. The fields are tab-delimited. The records are
newline-separated.
1 2 3 4 2 1 8 3 4 |
In this example the
default load function, PigStorage, loads data from myfile.txt into relation A. Note
that, because no schema is specified, the fields are not named and all fields
default to type bytearray. The two statements are equivalent.
A = LOAD 'myfile.txt'; A = LOAD 'myfile.txt' USING PigStorage('\t'); |
Relation A looks
like this.
(1, 2, 3) (4, 2, 1) (8, 3, 4) |
In this example a
schema is specified using the AS keyword. The two statements are equivalent.
A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int); A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int); |
Sorts a relation based
on one or more fields.
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n]; |
alias |
The name of a relation. |
BY |
Required keyword. |
* |
Represents all fields in the relation. |
ASC |
Sort in ascending order. |
DESC |
Sort in descending order. |
field_alias |
A field in the relation. |
PARALLEL n |
Increase the parallelism of a job by specifying the number of reduce tasks, n. The optimal number of parallel tasks depends on the amount of memory on each node and the memory required by each of the tasks. To determine n, use the following as a general guideline: n = (nr_nodes - 1) * 0.45 * nr_GB where nr_nodes is the number of nodes used and nr_GB is the amount of physical memory on each node. Note the following: · Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block. · If you don’t specify parallel, you still get the same map parallelism but only one reduce task. |
In Pig, relations
are logically unordered.
· If you order relation A to produce relation X (X = ORDER A BY * DESC;), relations A and X still contain the same data.
· If you retrieve the contents of relation X, they are guaranteed to be in the order you specified (descending).
· However, if you further process relation X, there is no guarantee that the contents will be processed in the order you specified.
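A small sketch of these points (the field names are hypothetical): retrieving X directly honors the requested order, but a relation derived from X may not.

X = ORDER A BY f1 DESC;
DUMP X;                     -- output is guaranteed to be in descending f1 order
Y = FILTER X BY f2 > 2;     -- no guarantee Y is processed or stored in that order |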
Suppose we have
relation A.
(A: f1, f2,
f3) ----------------- (1, 2, 3) (4, 2, 1) (8, 3, 4) (4, 3, 3) (7, 2, 5) (8, 4, 3) |
In this example
relation A is sorted by the third field, f3 in descending order.
X = ORDER A BY f3 DESC; |
Relation X could
look like this (note that the order of the three tuples ending in 3 can vary).
(7, 2, 5) (8, 3, 4) (1, 2, 3) (4, 3, 3) (8, 4, 3) (4, 2, 1) |
Partitions a
relation into two or more relations.
SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression …]; |
alias |
The name
of a relation. |
INTO |
Required
keyword. |
IF |
Required
keyword. |
expression |
An expression. |
Use the SPLIT operator to partition the contents of a relation into two or more relations based on some expression. Depending on the conditions stated in the expression:
· A tuple may be assigned to more than one relation.
· A tuple may not be assigned to any relation.
Suppose we have
relation A.
(A: f1, f2,
f3) ----------------- (1, 2, 3) (4, 5, 6) (7, 8, 9) |
In this example
relation A is split into three relations, X, Y, and Z.
SPLIT A INTO X IF f1< 7, Y IF f2==5, Z IF (f3<6 OR f3>6); |
Relations X, Y,
and Z look like this.
(X) (Y) (Z) ---------- ----------- ----------- (1, 2, 3) (4, 5, 6) (1, 2, 3) (4, 5, 6) (7, 8, 9) |
Stores data to
the file system.
STORE alias INTO 'directory' [USING function]; |
alias |
The name of a relation. |
INTO |
Required keyword. |
'directory' |
The name of the storage directory, in quotes. If the directory already exists, the STORE operation will fail. The output data files, named part-nnnnn, are written to this directory. |
USING |
Keyword. Use this clause to name the store function. |
function |
The store function. PigStorage is the default load/store function and does not need to be specified. This function reads/writes simple newline-separated records with delimiter-separated fields. The function has one parameter, the field delimiter; tab ('\t') is the default delimiter. If you want to store the data in a special format that the Pig load/store functions cannot handle, you can write your own store function. |
Use the STORE
operator to store data on the file system.
Suppose we have
relation A.
(A) ---------------- (1, 2, 3) (4, 2, 1) (8, 3, 4) (4, 3, 3) (7, 2, 5) (8, 4, 3) |
In this example
the contents of relation A are written to file part-00000 located in directory
myoutput.
STORE A INTO 'myoutput' USING PigStorage('*'); |
The part-00000 file looks like this. Fields are delimited with the asterisk (*) character and records are separated by newlines.
1*2*3 4*2*1 8*3*4 4*3*3 7*2*5 8*4*3 |
Sends data to an external
script or program.
alias = STREAM alias [, alias …] THROUGH {`command` | cmd_alias } [AS schema] ; |
alias |
The name of a relation. |
THROUGH |
Keyword. |
`command` |
A command, including the arguments, enclosed in back tics (where a command is anything that can be executed). |
cmd_alias |
The name of a command created using the DEFINE operator. |
AS |
Keyword. |
schema |
A schema using the AS keyword, enclosed in parentheses (see Schemas). |
Use the STREAM
operator to send data through an external script or program. Multiple stream
operators can appear in the same Pig script. The stream operators can be
adjacent to each other or have other operations in between.
When used with a command, a stream statement could look
like this:
A = LOAD 'data'; B = STREAM A THROUGH `stream.pl -n 5`; |
When used with a cmd_alias, a stream statement could look
like this, where cmd is the defined alias.
A = LOAD 'data'; DEFINE cmd `stream.pl -n 5`; B = STREAM A THROUGH cmd; |
Data guarantees are determined based on the position of the streaming operator in the Pig script.
· Unordered data – No guarantee for the order in which the data is delivered to the streaming application.
· Grouped data – The data for the same grouped key is guaranteed to be provided to the streaming application contiguously.
· Grouped and ordered data – The data for the same grouped key is guaranteed to be provided to the streaming application contiguously. Additionally, the data within the group is guaranteed to be sorted by the provided secondary key.
In addition to position, data grouping
and ordering can be determined by the data itself. However, you need to know
the property of the data to be able to take advantage of its structure.
In this example
the data is unordered.
A = LOAD 'data'; B = STREAM A THROUGH `stream.pl`; |
In this example
the data is grouped.
A = LOAD 'data'; B = GROUP A BY $1; C = FOREACH B GENERATE FLATTEN(A); D = STREAM C THROUGH `stream.pl`; |
In this example
the data is grouped and ordered.
A = LOAD 'data'; B = GROUP A BY $1; C = FOREACH B { D = ORDER A BY ($3, $4); GENERATE D; } E = STREAM C THROUGH `stream.pl`; |
In this example a schema is specified as part of the STREAM statement.
X = STREAM A THROUGH `stream.pl` AS (f1:int, f2:int, f3:int); |
See DEFINE
for additional examples.
Computes the
union of two or more relations.
alias = UNION alias, alias [, alias …]; |
alias |
The name of a relation. |
Use the UNION
operator to compute the union of two or more relations. The UNION operator:
· Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.
· Does not ensure (as databases do) that all tuples adhere to the same schema or that they have the same number of fields. In a typical scenario, however, this should be the case; therefore, it is the user's responsibility to either (1) ensure that the tuples in the input relations have the same schema or (2) be able to process varying tuples in the output relation (see the sketch after this list).
· Does not eliminate duplicate tuples.
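If the input relations do not share the same schema or number of fields, one way to handle it (a sketch, not something this manual prescribes; the field names f1 and f2 are illustrative) is to project each input to a common set of fields with FOREACH before taking the union:
A1 = FOREACH A GENERATE f1, f2;  -- keep only the fields shared with B
B1 = FOREACH B GENERATE f1, f2;
X = UNION A1, B1; |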
Suppose we have
relations A and B.
(A) (B) ----------- -------- (1, 2, 3) (2, 4) (4, 2, 1) (8, 9) (1, 3) |
In this example
the union of relation A and B is computed.
X = UNION A, B; |
Relation X looks
like this.
(1, 2, 3) (4, 2, 1) (2, 4) (8, 9) (1, 3) |
Returns the
schema of an alias.
DESCRIBE alias; |
alias |
The name of a relation. |
Use the DESCRIBE
operator to review the schema of a particular alias.
In this example a
schema is specified using the AS clause.
A = LOAD 'students' AS (name:chararray, age:int, gpa:float);
B = FILTER A BY name matches 'John%';
C = GROUP B BY name;
D = FOREACH C GENERATE COUNT(B.age);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
DESCRIBE B;
B: {name: chararray,age: int,gpa: float}
DESCRIBE C;
C: {group: chararray,B: {(name: chararray,age: int,gpa: float)}}
DESCRIBE D;
D: {long} |
In this example
no schema is specified. All data items default to type bytearray.
grunt> a = LOAD
'/data/students'; grunt> b =
FILTER a BY $0 matches 'John%'; grunt> c = GROUP
b BY $0; grunt> d = FOREACH
c GENERATE COUNT(b.$1); grunt> DESCRIBE
a; Schema for a unknown. grunt> DESCRIBE b; 2008-12-05 01:17:15,316 [main] WARN org.apache.pig.PigServer - bytearray is
implicitly cast to chararray under LORegexp Operator Schema for b unknown. grunt> DESCRIBE
c; 2008-12-05 01:17:23,343 [main] WARN org.apache.pig.PigServer - bytearray is
implicitly cast to chararray under LORegexp Operator c: {group: bytearray,b: {null}} grunt> DESCRIBE
d; 2008-12-05 03:04:30,076 [main] WARN org.apache.pig.PigServer - bytearray is
implicitly cast to chararray under LORegexp Operator d: {long} |
Displays the
contents of an alias.
DUMP alias; |
alias |
The name of
a relation. |
Use the DUMP
operator to display the contents of an alias. You can use DUMP as a debugging
device to make sure the correct results are being generated.
In this example a
dump is performed after each statement.
A = LOAD 'students'
AS (name:chararray, age:int, gpa:float); DUMP A; B = FILTER A BY
name matches 'John%'; DUMP B; C = GROUP B BY
name; DUMP C; D = FOREACH C
GENERATE COUNT(B.age); DUMP D; |
Displays
execution plans.
EXPLAIN alias; |
alias |
The name of a relation. |
Use the EXPLAIN
operator to review the logical, physical, and map reduce execution plans that
are used to compute the specified relation.
· The logical plan shows a pipeline of operators to be executed to build the relation. Type checking and backend-independent optimizations (such as applying filters early on) also apply.
· The physical plan shows how the logical operators are translated to backend-specific physical operators. Some backend optimizations also apply.
· The map reduce plan shows how the physical operators are grouped into map reduce jobs.
In this example the
EXPLAIN operator produces all three plans. (Note that only a portion of the
output is shown in this example.)
A = LOAD 'students' AS (name:chararray, age:int, gpa:float); B = GROUP A BY
name; C = FOREACH B
GENERATE COUNT(A.age); EXPLAIN C; Logical Plan: Store xxx-Fri Dec 05 19:42:29 UTC 2008-23 Schema:
{long} Type: Unknown | |---ForEach xxx-Fri Dec 05 19:42:29 UTC 2008-15 Schema:
{long} Type: bag etc … ----------------------------------------------- Physical Plan: ----------------------------------------------- Store(fakefile:org.apache.pig.builtin.PigStorage) -
xxx-Fri Dec 05 19:42:29 UTC 2008-40 | |---New For Each(false)[bag] - xxx-Fri Dec 05 19:42:29
UTC 2008-39 | | |
POUserFunc(org.apache.pig.builtin.COUNT)[long] - xxx-Fri Dec 05 etc … -------------------------------------------------- | Map Reduce Plan | -------------------------------------------------- MapReduce node xxx-Fri Dec 05 19:42:29 UTC 2008-41 Map Plan Local Rearrange[tuple]{chararray}(false) - xxx-Fri Dec
05 19:42:29 UTC 2008-34 | | |
Project[chararray][0] - xxx-Fri Dec 05 19:42:29 UTC 2008-35 etc … |
Displays a
step-by-step execution of a sequence of statements.
ILLUSTRATE alias; |
alias |
The name of a relation. |
Use the ILLUSTRATE
operator to review how data items are transformed through a sequence of Pig Latin
statements.
ILLUSTRATE
accesses the ExampleGenerator algorithm which can
select an appropriate and concise set of example data items automatically. It
does a better job than random sampling would do; for example, random sampling
suffers from the drawback that selective operations such as filters or joins
can eliminate all the sampled data items, giving you empty results which is of
no help with debugging.
With
the ILLUSTRATE operator you can test your programs on small datasets and get
faster turnaround times. The ExampleGenerator algorithm uses Pig's Local mode
(rather than Hadoop mode) which means that illustrative example data is
generated in near real-time.
Suppose we have a
data file called 'visits.txt'.
Amy cnn.com 20080218 Fred harvard.edu 20081204 Amy bbc.com 20081205 Fred stanford.edu 20081206 |
In this example
we count the number of sites a user has visited since 12/1/08. The ILLUSTRATE
statement will show how the results for num_user_visits are derived.
visits = LOAD 'visits.txt' AS (user:chararray, url:chararray, timestamp:chararray); recent_visits = FILTER visits BY timestamp >= '20081201'; user_visits = GROUP recent_visits BY user; num_user_visits = FOREACH user_visits GENERATE COUNT(recent_visits); ILLUSTRATE num_user_visits; |
The output from
the ILLUSTRATE statement looks like this.
------------------------------------------------------------------------ | visits |
user: bytearray | url: bytearray | timestamp: bytearray | ------------------------------------------------------------------------ | |
Amy | cnn.com | 20080218 | | |
Fred | harvard.edu | 20081204 | | |
Amy | bbc.com | 20081205 | | |
Fred | stanford.edu | 20081206 | ------------------------------------------------------------------------ ------------------------------------------------------------------------------- | recent_visits
| user: chararray | url: chararray | timestamp: chararray | ------------------------------------------------------------------------------- |
| Fred | harvard.edu | 20081204 | |
| Amy | bbc.com | 20081205 | |
| Fred |
stanford.edu | 20081206 | ------------------------------------------------------------------------------- ------------------------------------------------------------------------------------------------------------------ | user_visits
| group: chararray | recent_visits: bag({user: chararray,url:
chararray,timestamp: chararray}) | ------------------------------------------------------------------------------------------------------------------ |
| Amy | {(Amy,
bbc.com, 20081205)}
| |
| Fred | {(Fred, harvard.edu, 20081204), (Fred,
stanford.edu, 20081206)} | ------------------------------------------------------------------------------------------------------------------ ------------------------------- | num_user_visits
| long | ------------------------------- |
| 1 | |
| 2 | ------------------------------- |
Assigns an alias to a function or command.
DEFINE alias {function | [`command` [input] [output] [ship] [cache]] }; |
alias |
The name for the function or command. |
function |
The name of a function. Use this option to define functions for use with the FOREACH and FILTER operators. |
`command` |
A command, including the arguments, enclosed in backticks (where a command is anything that can be executed). Use this option to define commands for use with the STREAM operator. |
input |
INPUT ( {stdin | 'path'} [USING serializer]
[, {stdin | 'path'} [USING serializer] …] ) Where: · INPUT – Keyword. · 'path' – A file path, enclosed in single quotes. · USING – Keyword. · serializer – A function that converts data from tuples to stream format. PigStorage is the default serializer. You can also write your own UDF. |
output |
OUTPUT ( {stdout | stderr | 'path'}
[USING deserializer] [, {stdout | stderr | 'path'} [USING deserializer]
…] ) Where: · OUTPUT – Keyword. · 'path' – A file path, enclosed in single quotes. · USING – Keyword. · deserializer – A function that converts data from stream format to tuples. PigStorage is the default deserializer. You can also write your own UDF. |
ship |
SHIP('path' [, 'path' …]) Where: · SHIP – Keyword. · 'path' – A file path, enclosed in single quotes. |
cache |
CACHE('dfs_path#dfs_file' [, 'dfs_path#dfs_file'
…]) Where: · CACHE – Keyword. · 'dfs_path#dfs_file' – A file path/file name on the distributed file system, enclosed in single quotes. Example: '/mydir/mydata.txt#mydata.txt' |
Use the DEFINE
statement to assign a name (alias) to a function or to a command.
Use DEFINE to
specify a function when:
· The function has a long package name that you don't want to include in a script, especially if you call the function several times in that script.
· The constructor for the function takes parameters (see the first example below). If you need to use different constructor parameters for different calls to the function, you will need to create multiple defines, one for each parameter set (see the sketch after this list).
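As a sketch of the second bullet (reusing the myfunc.MyEvalfunc UDF from the example later in this section; the constructor arguments 'foo' and 'bar' and the alias names are illustrative), create one DEFINE per parameter set and use each alias like any other function:
DEFINE myFuncFoo myfunc.MyEvalfunc('foo');  -- assumes the jar has been registered, as in the later example
DEFINE myFuncBar myfunc.MyEvalfunc('bar');  -- same UDF, different constructor argument
A = LOAD 'students';
B = FOREACH A GENERATE myFuncFoo($0), myFuncBar($0); |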
Use DEFINE to specify a command when the streaming command specification is complex or requires additional parameters (input, output, and so on).
Serialization is needed to convert data from tuples to a format that can be processed by the streaming application. Deserialization is needed to convert the output from the streaming application back into tuples.
PigStorage, the default serialization/deserialization function, converts tuples to tab-delimited lines. Pig's BinarySerializer and BinaryDeserializer functions treat the entire file as a byte stream (no formatting or interpretation takes place). You can also write your own serialization/deserialization functions.
Use the ship option to send streaming binary and supporting
files, if any, from the client node to the compute nodes. Pig does not automatically ship
dependencies; it is your responsibility to explicitly specify all the
dependencies and to make sure that the software the processing relies on (for
instance, perl or python) is installed on the cluster. Supporting
files are shipped to the task's current working directory and only relative
paths should be specified. Any pre-installed binaries should be specified in
the path.
Only files, not
directories, can be specified with the ship option. One way to work around this
limitation is to tar all the dependencies into a tar file that accurately
reflects the structure needed on the compute nodes, then have a wrapper for your
script that un-tars the dependencies prior to execution.
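A minimal sketch of that workaround (the names wrapper.sh and deps.tar.gz are hypothetical; wrapper.sh is assumed to un-tar deps.tar.gz into the current working directory and then run the real program):
DEFINE cmd `wrapper.sh` SHIP('/work/wrapper.sh', '/work/deps.tar.gz');
B = STREAM A THROUGH cmd; |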
Note that the ship option has two components: the source specification, provided in the ship clause, is the view of your machine; the command specification is the view of the cluster. The only guarantee is that the shipped files are available in the current working directory of the launched job and that your current working directory is also on the PATH environment variable.
Shipping files to relative paths or absolute paths is not supported since you might not have permission to read/write/execute from arbitrary paths on the clusters.
The ship option works with binaries, jars, and small
datasets. However, loading larger datasets at run time for every execution can
severely impact performance. Instead, use the cache option to access large
files already moved to and available on the compute nodes. Only files, not directories, can be specified with
the cache option.
In this example PigStorage is the default serialization/deserialization function. The tuples from relation A are converted to tab-delimited lines that are passed to the script.
X = STREAM A THROUGH `stream.pl`; |
In this example PigStorage is used as the serialization/deserialization function, but a
comma is used as the delimiter.
DEFINE Y `stream.pl` INPUT(stdin USING PigStorage(',')) OUTPUT (stdout USING PigStorage(',')); X = STREAM A THROUGH Y; |
In this example
user-defined serialization/deserialization functions are used with the script.
DEFINE Y `stream.pl` INPUT(stdin USING MySerializer) OUTPUT (stdout USING MyDeserializer); X = STREAM A THROUGH Y; |
In this example
ship is used to send the script to the cluster compute nodes.
DEFINE Y `stream.pl` SHIP('/work/stream.pl'); X = STREAM A THROUGH Y; |
In this example
cache is used to specify a file located on the cluster compute nodes.
DEFINE Y `stream.pl data.gz` SHIP('/work/stream.pl') CACHE('/input/data.gz#data.gz'); X = STREAM A THROUGH Y; |
In this example the streaming stderr is stored in the _logs/<dir> directory of the job's output directory. Because the job can have multiple streaming applications associated with it, you need to ensure that different directory names are used to avoid conflicts. Pig stores up to 100 tasks per streaming job.
DEFINE Y `stream.pl` stderr('<dir>' limit 100); X = STREAM A THROUGH Y; |
In this example a
function is defined for use with the FOREACH …GENERATE operator.
grunt> REGISTER /src/myfunc.jar grunt> define myFunc myfunc.MyEvalfunc('foo'); grunt> A = LOAD 'students'; grunt> B = FOREACH A GENERATE myFunc($0); |
In this example a
command is defined for use with the STREAM operator.
grunt> A = LOAD 'data';
grunt> DEFINE cmd `stream_cmd -input file.dat`;
grunt> B = STREAM A THROUGH cmd; |
Registers a JAR file so that the UDFs in the file can be
used.
REGISTER alias; |
alias |
The path of a Java
JAR file. Do not place the name in quotes. |
Use the REGISTER
statement to specify the path of a Java JAR file containing UDFs.
For more
information about UDFs, see the User Defined Function Guide.
Note that Pig currently only supports functions written in Java.
In this example
REGISTER states that myfunc.jar is located in the /src directory.
/src $ java
-jar pig.jar - grunt> REGISTER /src/myfunc.jar; grunt> A
= LOAD 'students'; grunt> B
= FOREACH A GENERATE myfunc.MyEvalFunc($0); |
Computes the
average of the numeric values in a single-column bag.
AVG(expression) |
expression |
Any expression whose result is a bag. The elements of the bag should be data type int, long, float, or double. |
Use the AVG function
to compute the average of the numeric values in a single-column bag.
Suppose we have
relation A.
A (name:chararray, session:chararray, gpa:float) ----------------------------------------------------------------- (John,fl,3.9F) (John,wt,3.7F) (John,sp,4.0F) (John,sm,3.8F) (Mary,fl,3.8F) (Mary,wt,3.9F) (Mary,sp,4.0F) (Mary,sm,4.0F) |
In this example
the average GPA is computed.
B = GROUP A by name; C = FOREACH B GENERATE A.name, AVG(A.gpa); |
Relation C looks
like this.
({(John),(John),(John),(John)},3.850000023841858) ({(Mary),(Mary),(Mary),(Mary)},3.925000011920929) |
 | int | long | float | double | chararray | bytearray |
AVG | long | long | double | double | error | cast as double |
Concatenates two
fields of type chararray or two fields of type bytearray.
CONCAT (expression, expression) |
expression |
An expression with data types chararray or bytearray. |
Use the CONCAT
function to concatenate two elements. The data type of the two elements must be
the same, either chararray or bytearray.
Suppose we have
relation A.
A (f1:chararray, f2:chararray, f3:chararray) --------------------------------------------------------- (apache,open,source) (hadoop,map,reduce) (pig,pig,latin) |
In this example
fields f2 and f3 are concatenated.
X = FOREACH A GENERATE CONCAT(f2,f3); |
Relation X looks
like this.
(opensource) (mapreduce) (piglatin) |
 | chararray | bytearray |
chararray | chararray | cast as chararray |
bytearray |  | bytearray |
Computes the
number of elements in a bag.
COUNT(expression) |
expression |
An expression with data type bag. |
Use the COUNT
function to compute the number of elements in a bag.
Suppose we have
relation A.
A (f1:int, f2:bag) ---------------------------- (1,{(1,2,3)}) (4,{(4,2,1),(4,3,3)}) (7,{(7,2,5)}) (8,{(8,3,4),(8,4,3)}) |
In this example
the tuples in the bag are counted.
X = FOREACH A GENERATE COUNT(f2); |
Relation X looks
like this.
(1L) (2L) (1L) (2L) |
 | int | long | float | double | chararray | bytearray |
COUNT | long | long | long | long | long | long |
Compares the
content of two bags.
DIFF (expression, expression) |
expression |
An expression with any data type. |
The DIFF function
compares the content of two bags and returns a bag that contains the tuples
that are not in both bags.
Suppose we have
relation A.
A (f1:chararray, f2:chararray, f3:chararray) --------------------------------------------------------- (apache,open,source) (hadoop,map,reduce) (pig,pig,latin) |
In this example fields f1 and f2 are compared.
X = FOREACH A GENERATE DIFF(f1,f2); |
Relation X looks
like this.
({(apache),(open)}) ({(hadoop),(map)}) ({}) |
Computes the maximum
of the numeric values or chararrays in a single-column bag.
MAX(expression) |
expression |
An expression with data types int, long, float, double, or chararray. |
Use the MAX
function to compute the maximum of the numeric values or chararrays in a
single-column bag.
Suppose we have
relation A.
A (name:chararray, session:chararray, gpa:float) ------------------------------------------------------------------ (John,fl,3.9F) (John,wt,3.7F) (John,sp,4.0F) (John,sm,3.8F) (Mary,fl,3.8F) (Mary,wt,3.9F) (Mary,sp,4.0F) (Mary,sm,4.0F) |
In this example
the maximum GPA is computed.
B = GROUP A by name; C = FOREACH B GENERATE A.name, MAX(A.gpa); |
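Relation C could look like this (a sketch based on the data above; the exact rendering of the float values may differ).
({(John),(John),(John),(John)},4.0F) ({(Mary),(Mary),(Mary),(Mary)},4.0F) |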
 | int | long | float | double | chararray | bytearray |
MAX | int | long | float | double | chararray | cast as double |
Computes the
minimum of the numeric values or chararrays in a single-column bag.
MIN(expression) |
expression |
An expression with data types int, long, float, double, or chararray. |
Use the MIN
function to compute the minimum of a set of numeric values or chararrays in a
single-column bag.
Suppose we have
relation A.
A (name:chararray, session:chararray, gpa:float) ------------------------------------------------------------------ (John,fl,3.9F) (John,wt,3.7F) (John,sp,4.0F) (John,sm,3.8F) (Mary,fl,3.8F) (Mary,wt,3.9F) (Mary,sp,4.0F) (Mary,sm,4.0F) |
In this example
the minimum GPA is computed.
B = GROUP A by name; C = FOREACH B GENERATE A.name, MIN(A.gpa); |
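Relation C could look like this (a sketch based on the data above; the exact rendering of the float values may differ).
({(John),(John),(John),(John)},3.7F) ({(Mary),(Mary),(Mary),(Mary)},3.8F) |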
 | int | long | float | double | chararray | bytearray |
MIN | int | long | float | double | chararray | cast as double |
Computes the
number of data items based on the data type.
SIZE(expression) |
expression |
An expression with any data type. |
Use the SIZE
function to compute the number of data items based on the data type (see the
Types Tables below).
Suppose we have
relation A.
A (f1:chararray, f2:chararray, f3:chararray) ----------------------------------------------------------- (apache,open,source) (hadoop,map,reduce) (pig,pig,latin) |
In this example
the number of characters in the first field (f1) of each tuple is counted.
X = FOREACH A GENERATE SIZE(f1); |
Relation X looks
like this.
(6L) (6L) (3L) |
int | returns 1 |
long | returns 1 |
float | returns 1 |
double | returns 1 |
chararray | returns number of characters in the array |
bytearray | returns number of bytes in the array |
tuple | returns number of fields in the tuple |
bag | returns number of tuples in bag |
map | returns number of key/value pairs in map |
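As a sketch of the bag entry in the table above (reusing the bag-valued relation A from the COUNT example earlier, so the schema f1:int, f2:bag is assumed from that section), SIZE applied to a bag field returns the number of tuples in the bag, just as COUNT does:
Y = FOREACH A GENERATE SIZE(f2);
-- yields (1L) (2L) (1L) (2L) for the COUNT example data |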
Computes the sum of
the numeric values in a single-column bag.
SUM(expression) |
expression |
An expression with data types int, long, float, double, or bytearray cast as double. |
Use the SUM
function to compute the sum of a set of numeric values in a single-column bag.
Suppose we have
relation A.
A (owner:chararray, pet_type:chararray, pet_num:int) ---------------------------------------------------------------- (Alice,turtle,1) (Alice,goldfish,5) (Alice,cat,2) (Bob,dog,2) (Bob,cat,2) |
In this example
the number of pets for each owner is computed.
B = GROUP A by owner; C = FOREACH B GENERATE A.owner, SUM(A.pet_num); |
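Relation C could look like this (a sketch based on the sample data shown above).
({(Alice),(Alice),(Alice)},8L) ({(Bob),(Bob)},4L) |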
 | int | long | float | double | chararray | bytearray |
SUM | long | long | double | double | error | cast as double |
Splits a string
and outputs a bag of words.
TOKENIZE(expression) |
expression |
An expression with data type chararray. |
Use the TOKENIZE
function to split a string of words (all words in a single tuple) into a bag of
words (each word in a single tuple). The following characters are
considered to be word separators: space, double quote ("), comma (,),
parenthesis (()), star (*).
Suppose we have
relation A.
A (f1:chararray) ------------------------------------- (Here is the first string.) (Here is the second string.) (Here is the third string.) |
In this example
the strings in each row are split.
X = FOREACH A GENERATE TOKENIZE(f1); |
Relation X looks
like this.
({(Here),(is),(the),(first),(string.)}) ({(Here),(is),(the),(second),(string.)}) ({(Here),(is),(the),(third),(string.)}) |
Converts a file
to a byte stream.
BinarySerializer() |
none |
no parameters |
Use the
BinarySerializer with the DEFINE operator to convert a file to a byte stream.
No formatting or interpretation takes place.
In this example
the BinarySerializer and BinaryDeserializer are used to convert data to and from
streaming format.
DEFINE Y `stream.pl` INPUT(stdin USING BinarySerializer()) OUTPUT (stdout USING BinaryDeserializer()); X = STREAM A THROUGH Y; |
Converts a byte
stream into a file.
BinaryDeserializer() |
none |
no parameters |
Use the
BinaryDeserializer with the DEFINE operator to convert a byte stream into a
file. No formatting or interpretation takes place.
In this example
the BinarySerializer and BinaryDeserializer are used to convert data to and from
streaming format.
DEFINE Y `stream.pl` INPUT(stdin USING BinarySerializer()) OUTPUT (stdout USING BinaryDeserializer()); X = STREAM A THROUGH Y; |
Loads and stores
data in machine-readable format.
BinStorage() |
none |
no parameters |
BinStorage works
with data that is represented on disk in machine-readable format.
BinStorage is
used internally by Pig to store the temporary data that is created between multiple
map/reduce jobs.
In this example
BinStorage is used with the LOAD and STORE functions.
A = LOAD 'data'
USING BinStorage(); STORE X into 'output' USING BinStorage(); |
Loads and stores data in UTF-8 format.
PigStorage(field_delimiter) |
field_delimiter |
Parameter. The default field delimiter is tab ('\t'). You can specify other characters as field delimiters. |
PigStorage works
with structured text files in UTF-8 format.
For load
statements, PigStorage expects delimiter-separated fields and newline-separated
records. If you use the default field delimiter, tab ('\t'), it does not need
to be specified. For store statements, PigStorage outputs data as
delimiter-separated fields and newline-separated records. If you use the
default field delimiter, tab ('\t'), it does not need to be specified.
PigStorage works
with both simple and complex data types and is the default function for the
LOAD and STORE operators (which means you can choose to include or not include
the function in the load statement).
In this example
PigStorage expects student_data to contain tab-separated fields and
newline-separated records. The statements are equivalent.
A = LOAD 'student_data' USING PigStorage('\t') AS (name: chararray, age:int, gpa: float); A = LOAD 'student_data' AS (name: chararray, age:int, gpa: float); |
In this example PigStorage stores the contents of X into files with fields that are delimited with an asterisk ( * ). The STORE function specifies that the files will be located in a directory named output and that the files will be named part-nnnnn (for example, part-00000).
STORE X INTO 'output' USING PigStorage('*'); |
Stores data in UTF-8
format.
PigDump() |
none |
no
parameters |
PigDump stores
data as tuples in human-readable UTF-8 format.
In this example
PigDump is used with the STORE function.
STORE X INTO 'output' USING PigDump(); |
Loads
unstructured data in UTF-8 format.
TextLoader() |
none |
no parameters |
TextLoader works
with unstructured data in UTF-8 format. Each resulting tuple contains a single
field with one line of input text. TextLoader cannot be used to store data.
In this example TextLoader
is used with the LOAD function.
A = LOAD 'data' USING TextLoader(); |
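As an illustrative sketch (the contents of the 'data' file are assumed here, not part of the original example), each line of the input file becomes a tuple with one field containing the whole line:
A = LOAD 'data' USING TextLoader();
DUMP A;
-- (This is the first line.)
-- (This is the second line.) |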
Prints the content of one or more files to the screen.
cat path [ path …] |
path |
The location of a file or directory. |
The cat command
is similar to the Unix cat command. If multiple files are specified, content
from all files is concatenated together. If multiple directories are specified,
content from all files in all directories is concatenated together.
In this example
the students file in the data directory is printed.
grunt> cat data/students joe smith john anne white grunt> |
Changes the current directory to another directory.
cd [dir] |
dir |
The name of
the directory you want to navigate to. |
The cd command is
similar to the Unix cd command and can be used to navigate the file system. If a
directory is specified, this directory is made your current working directory
and all other operations happen relative to this directory. If no directory
is specified, your home directory (/user/NAME) becomes the current working
directory.
In this example
we navigate to the /data directory.
grunt> cd /data grunt> |
Copies a file or
directory from the local file system to HDFS.
copyFromLocal src_path dst_path |
src_path |
The path on
the local file system for a file or directory |
dst_path |
The path on
HDFS. |
The copyFromLocal
command enables you to copy a file or a directory from the local file system to the
Hadoop Distributed File System (HDFS). If a directory is specified, it is
recursively copied over. Dot "." can be used to specify that the new
file/directory should be created in the current working directory and retain
the name of the source file/directory.
In this example a
file (students) and a
directory (/data/tests) are copied from the local file system to HDFS.
grunt> copyFromLocal /data/students students grunt> ls students /data/students <r 3> 8270 grunt> copyFromLocal /data/tests new_tests grunt> ls new_tests /data/new_tests/test1.data<r 3> 664 /data/new_tests/test2.data<r 3> 344 /data/new_tests/more_data |
Copies a file or
directory from HDFS to a local file system.
copyToLocal src_path dst_path |
src_path |
The path on HDFS. |
dst_path |
The path on the local file system for a file or directory. |
The copyToLocal
command enables you to copy a file or a directory from the Hadoop Distributed File
System (HDFS) to a local file system. If a directory is specified, it is
recursively copied over. Dot "." can be used to specify that the new
file/directory should be created in the current working directory (directory
from which the script was executed or grunt shell started) and retain the name
of the source file/directory.
In this example
two files are copied from HDFS to the local file system.
grunt> copyToLocal students /data grunt> copyToLocal data /data/mydata |
Copies a file or
directory within HDFS.
cp src_path dst_path |
src_path |
The path on
HDFS. |
dst_path |
The path on
HDFS. |
The cp command is
similar to the Unix cp command and enables you to copy files or directories
within DFS. If a directory is specified, it is recursively copied over. Dot
"." can be used to specify that the new file/directory should be
created in the current working directory and retain the name of the source
file/directory.
In this example a
file (students) is copied to another file (students_save).
cp students students_save |
Lists the
contents of a directory.
ls [path] |
path |
The name of
the path/directory. |
The ls command is
similar to the Unix ls command and enables you to list the contents of a
directory. If a path is specified, the command lists the content of the specified
directory. Otherwise, the content of the current working directory is listed.
In this example
the contents of the data directory are listed.
grunt> ls /data /data/DDLs <dir> /data/count <dir> /data/data <dir> /data/schema <dir> grunt> |
Creates a new
directory.
mkdir path |
path |
The name of
the path/directory. |
The mkdir command
is similar to the Unix mkdir command and enables you to create a new directory.
If you specify a directory or path that does not exist, it will be created.
In this example a
directory and subdirectory are created.
grunt> mkdir data/20070905 |
Moves a file or
directory within the Hadoop Distributed File System (HDFS).
mv src_path dst_path |
src_path |
The path on
HDFS. |
dst_path |
The path on
HDFS. |
The mv command is
similar to the Unix mv command and moves a file or directory within HDFS. It is
equivalent to performing a cp and then deleting the source file or directory as soon as it is copied.
If a directory is
specified, it is recursively moved. Dot "." can be used to specify
that the new file/directory should be created in the current working directory
and retain the name of the source file/directory.
In this example
the output directory is moved to output2; the original output directory no longer exists.
grunt> mv output output2 grunt> ls output File
or directory output does not exist. grunt> ls output2 /data/output2/map-000000<r 3> 508844 /data/output2/output3 <dir> /data/output2/part-00000<r 3> 0 |
Prints the name of the current working directory.
pwd |
none |
no parameters |
The pwd command
is identical to the Unix pwd command and prints the name of the current working
directory.
In this example
the name of the current working directory is /data.
grunt> pwd /data grunt> |
Removes one or
more files or directories.
rm path [path…] |
path |
The name of
the path/directory/file. |
The rm command is
similar to the Unix rm command and enables you to remove one or more files or
directories.
Note: This command recursively removes a directory even if it is not empty, does not
prompt for confirmation, and the removed data is not recoverable.
In this example several files are removed.
grunt> rm /data/students grunt> rm students students_sav |
Forcibly removes
one or more files or directories.
rmf path [path …] |
path |
The name of
the path/directory/file. |
The rmf command
is similar to the Unix rm -f command and enables you to forcibly remove one or
more files or directories.
Note: This command recursively removes a directory even if it is not empty, does not
prompt for confirmation, and the removed data is not recoverable.
In this example several files are forcibly removed.
grunt> rmf /data/students grunt> rmf students students_sav |
Prints a list of
Pig commands.
help |
none |
no parameters |
The help command
prints a list of Pig commands.
In this example
the list of available Pig commands is displayed.
grunt> help Commands: <pig latin statement>; store <alias> into <filename> [using <functionSpec>] dump <alias> etc …. |
Kills a job.
kill jobid |
jobid |
The job id. |
The kill command enables you to kill a job based on a job id.
In this example
the job with id job_0001 is killed.
grunt> kill job_0001 |
Quits from the
Pig grunt shell.
quit |
none |
no parameters |
The quit command
enables you to quit or exit the Pig grunt shell.
In this example
the quit command exits the Pig grunt shell.
grunt> quit |
Assigns values to
keys used in Pig.
set key 'value' |
key |
Key (see
table). Case sensitive. |
value |
Value for
key (see table). Case sensitive. |
The set command
enables you to assign values to keys, as shown here:
Key |
Value |
Description |
debug |
on/off |
enables/disables debug-level logging |
job.name |
single quoted string that contains the name |
sets user-specified name for the job |
In this example
debug is set on and the job is assigned a name.
grunt>
set debug on grunt> set job.name 'my job' |