Pig Latin Reference Manual

To get started, read about the Pig Latin Statements and Data Items.

Then, review the Relational Operators and the Built-in Functions.

 

Conventions

Pig Latin Statements

Data Items

Data Types
    Simple Data Types
        Scalars: int, long, float, double
        Arrays: chararray, bytearray
    Complex Data Types: tuple, bag, map

Nulls

Constants

Expressions

Schemas

Keywords

Arithmetic Operators: addition (+), subtraction (-), multiplication (*), division (/), modulo (%), bincond (?)

Comparison Operators: equal (==), not equal (!=), less than (<), greater than (>), less than or equal to (<=), greater than or equal to (>=), pattern matching (matches)

Null Operators: is null, is not null

Boolean Operators: and, or, not

Dereference Operators: tuple dereference (.), map dereference (#)

Sign Operators: positive (+), negative (-)

Cast Operators: (type)$0, (type)alias

Relational Operators: cogroup, cross, distinct, filter, foreach, group, join, limit, load, order, split, store, stream, union

UDF Statements: define, register

Diagnostic Operators: describe, dump, explain, illustrate

Built-in Functions
    Eval Functions: AVG, CONCAT, COUNT, DIFF, MIN, MAX, SIZE, SUM, TOKENIZE
    Load/Store Functions: BinaryDeserializer, BinarySerializer, BinStorage, PigStorage, PigDump, TextLoader

File Commands: cat, cd, copyFromLocal, copyToLocal, cp, ls, mkdir, mv, pwd, rm, rmf

Utility Commands: help, kill, quit, set

Conventions

Conventions for the syntax and code examples included in the Pig Latin Reference Manual are described here.

( )

Parentheses enclose one or more items. Parentheses are also used to indicate the tuple data type.

Example, multiple items:

(1, abc, (2,4,6) )

[ ]

Straight brackets enclose one or more optional items. Straight brackets are also used to indicate the map data type. In this case <> is used to indicate optional items.

Example, optional items:

[INNER | OUTER]

{ }

Curly brackets enclose two or more items, one of which is required. Curly brackets are also used to indicate the bag data type. In this case <> is used to indicate required items.

Example, two items, one required:

{ gen_blk | nested_gen_blk }

…

Horizontal ellipsis points indicate that you can repeat a portion of the code.

Example, Pig Latin syntax statement:

cat path [path …]

UPPERCASE, lowercase

In general, uppercase type indicates elements the system supplies; lowercase type indicates elements that you supply.

Note: Pig Latin operators, commands, and keywords are not case sensitive. The names (aliases) of relations, fields, and functions are case sensitive.

Example, Pig Latin statement:

A = LOAD 'data' AS (f1:int);

·         LOAD, AS are supplied by the system

·         A, f1 are names (aliases)

·         data is supplied by you

italics

Italic type indicates placeholders or variables for which you must supply values.

Example, Pig Latin syntax:

alias = LIMIT alias n;

You supply the values for alias and n.

Pig Latin Statements

A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. (This definition applies to all Pig Latin operators except LOAD and STORE which read data from and write data to the file system.) Pig Latin statements can span multiple lines and must end with a semi-colon ( ; ).

Pig Latin statements are generally organized in the following manner. A LOAD statement reads data from the file system. A series of "transformation" statements process the data. A STORE statement writes data back to the file system (or, a DUMP statement displays output to the screen.)
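
For instance, a minimal script following this organization could look like the sketch below (the file names 'input' and 'output' are placeholders, not part of the syntax):

A = LOAD 'input';                    -- read data from the file system
B = FILTER A BY $0 is not null;      -- a "transformation" statement
STORE B INTO 'output';               -- write the result back (or: DUMP B;)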

You can run Pig Latin statements manually (using the Grunt shell) or place statements in a script. Pig processes statements this way:

·         First, Pig validates the syntax and semantics of all of the statements.

·         Next, if Pig encounters a DUMP or STORE operator, it executes the statements.

In this example, Pig will validate the LOAD, GROUP and FOREACH statements but will not execute these statements.

grunt> A = LOAD 'data' USING PigStorage()

>> AS (a:int, b:int, c:int);

grunt> B = GROUP A BY a;

grunt> C = FOREACH B GENERATE COUNT ($0);

In this example, Pig will validate the LOAD, GROUP, FOREACH, and DUMP statements. Then, if no errors are encountered, Pig will execute these statements.

grunt> A = LOAD 'data' USING PigStorage()

>> AS (a:int, b:int, c:int);

grunt> B = GROUP A BY a;

grunt> C = FOREACH B GENERATE COUNT ($0);

grunt> DUMP C;

As noted, Pig Latin statements work with relations. Relations have tuples (or rows). Tuples have fields (or columns).

Relations are referred to by name (where the name is called an alias). Fields are referred to by name (where the name is called an alias or field_alias) or by positional notation if no schema is defined.

Pig Latin operators, commands, and keywords are not case sensitive. The names of relations, fields, and Pig Latin functions are case sensitive.

In the example above, note the following:

·         The first statement spans two lines.

·         The names (aliases) for relations A, B, and C are case sensitive.

·         The names (aliases) for fields a, b, and c are case sensitive.

·         Function names PigStorage and COUNT are case sensitive.

·         Operators and keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE, and DUMP are not case sensitive. They can also be written as load, using, as, group, by, etc.

·         In the FOREACH statement, the field in relation B is referred to by positional notation ($0).

Data Items

A data item is a field in a tuple. A data item can be any of the supported data types. You can refer to data items in various ways. Suppose we have the following tuple, which has three fields.

tuple:( 1, {(2,3),(4,5),(5,7)}, ['apache'#'search'] )         

The three fields are separated out in the table below.

You have the option to assign or not assign names (aliases) to fields.

·         If you assign names to fields you can use any word except the Pig keywords. For example: f1, f2, f3 or a, b, c or name, age, gpa.

·         If you don't assign names to fields, you can refer to the fields using positional notation, which begins with zero (0). For example: $0, $1, $2.

 

 

                      First Field: int    Second Field: bag       Third Field: map
Field Value           1                   {(2,3),(4,5),(5,7)}     ['apache'#'search']
Name (alias)          f1                  f2                      f3
Positional Notation   $0                  $1                      $2

In addition to positional notation and names, here are some other ways to refer to data items:

Constant

Example: 1.0 or 'apache.org'. Field value: n/a. See Constants.

Positional Notation

Example: $0. Field value: 1. See the table above; always starts with $0.

Name (alias)

Example: f1. Field value: 1. See the table above; can be any word except the Pig keywords.

Projection

Example: f2.$0. Field value: {(2),(4),(5)}. See Dereference Operators.

Map Lookup

Example: f3#'apache'. Field value: 'search'. See Dereference Operators.

Function

Example: SUM(f2.$0). Field value: 2+4+5 = 11. See Eval Functions.

Expression

Example: COUNT(f2) + f1/'2.0'. Field value: 3 + 1/2.0 = 3.5. See Eval Functions.

Bincond

Example: (f1 == '1' ? '2' : COUNT(f2)). Field value: '2', since f1 == '1' is true; if f1 were not equal to '1', the value of this data item would be COUNT(f2) = 3. See Arithmetic Operators.

Data Types

Simple Data Types

Scalars
int          Signed 32-bit integer
long         Signed 64-bit integer
float        32-bit floating point
double       64-bit floating point

Arrays
chararray    Character array (string) in UTF-8 (Unicode) format
bytearray    Byte array (blob)

Complex Data Types
tuple        An ordered set of data items (see Tuple)
bag          A collection of tuples (see Bag)
map          A set of key value pairs (see Map)

Note the following general observations about data types:

·        A loader produces the data of the type specified by the schema. If any of the data does not conform to the schema, the loader generates a null value.

A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
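
For instance (a hypothetical illustration), suppose 'data' contains these rows:

john    25      3.6
jane    xyz     4.0

The value 'xyz' does not conform to the declared int type for age, so the loader substitutes a null and relation A becomes:

(john,25,3.6F)
(jane,,4.0F)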

 

·        If an explicit cast is specified and the conversion from the source type is not available, a type-checker error will occur and processing will be aborted. For instance:

A = LOAD 'data' AS (name:chararray, age:int, gpa:float);

B = FOREACH A GENERATE (int)name;

 

·         Implicit casts are not applied for those types for which conversions are not available. The following would result in a parsing error.

A = LOAD 'data' AS (name:chararray, age:int, gpa:float);

B = FOREACH A GENERATE name + gpa;

Tuple

A tuple (row) is an ordered set of data items.

Syntax

( data_item [, data_item …] ) 

Terms

(  )

Tuples are enclosed in parentheses ( ).

data_item

A single data item (datum), referred to as a field or column (of fields).

A data item can be any Pig data type.

Usage

Note the following:

·         Relations can have duplicate tuples.

·         The tuples in a relation can have a different number of data items. If Pig tries to access a data item that does not exist, a null value is returned.

·         The data items in tuples can have different data types.

Example

In this example the data contains four tuples (rows) with differing numbers of fields. When Pig processes the data, null values are substituted for the missing data.

data

-------------------------

Alice   turtle  25

Bob     dog

Tom

Jane    cat     50

A = LOAD 'data' AS (f1:chararray, f2:chararray, f3:int);

-------------------------

(Alice,turtle,25)

(Bob,dog,)

(Tom,,)

(Jane,cat,50)

B = FOREACH A GENERATE f3;

-------------------------

(25)

()

()

(50)

Bag

A bag is a collection of tuples.

Syntax

{ tuple [, tuple …] }

Terms

{  }

Bags are enclosed in curly brackets { }.

tuple

A tuple.

Usage

The usage rules that apply to tuples (duplicate tuples, different number of data items, and different data types) also apply to the tuples within a bag. However, for Pig to effectively process a relation containing bags, the schemas of the tuples within those bags should be the same.

For example, if half of the tuples include chararray fields while the other half include float fields, only half of the tuples will participate in any kind of computation because the chararray fields will be converted to null.

Example

In this example each bag contains three tuples, whose fields are all of the same data type (int).

{ (1), (1, 2, 3), (25, 50) }

{ (2), (4, 5, 6), (50, 75) }
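
By contrast, here is a hypothetical bag in which the tuples do not share a schema. If this bag were used in a numeric computation, the chararray fields would be converted to null and only the numeric tuple would participate:

{ (1, 2), ('apache', 'search') }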

Map

A map is a set of key value pairs.

Syntax (<> denotes optional)

[ key#value <, key#value …> ]

Terms

[ ]

Maps are enclosed in straight brackets [ ].

#

Key value pairs are denoted by the pound sign #.

key

Must be a scalar data type. Must be a unique value.

value

Any data type.

Usage

Key values within a relation must be unique.

Example

In this example the map includes two key value pairs.

[ name#john, phone#5551212 ]

Arithmetic Operators

Operator         Symbol    Notes
addition         +
subtraction      -
multiplication   *
division         /
modulo           %         Returns the remainder of a divided by b (a % b).
bincond          ?         The value of the data item is one of two options, chosen according to a condition.

Example: modulo

X = FOREACH A GENERATE f1, f2, f1%f2;

Example: bincond

Suppose we have relation A.

(1,{(2,3),(4,6),(5,7)})

(2,{(2,3),(4,6),(5,7)})

(1,{(2,3),(4,6)})

In this example the bincond condition states that if f1 equals 1 then the value is 2; if f1 does not equal 1, then the value is COUNT(f2).

X = FOREACH A GENERATE f1, (f1==1?2:COUNT(f2));

Relation X looks like this.

(1,2L)

(2,3L)

(1,2L)

Types Table: addition (+) and subtraction (-) operators

* bytearray cast as this data type

            bag     tuple     map     int     long    float   double    chararray   bytearray
bag         error   error     error   error   error   error   error     error       error
tuple               not yet   error   error   error   error   error     error       error
map                           error   error   error   error   error     error       error
int                                   int     long    float   double    error       cast as int *
long                                          long    float   double    error       cast as long *
float                                                 float   double    error       cast as float *
double                                                        double    error       cast as double *
chararray                                                               error       error
bytearray                                                                           cast as double *

(The table is symmetric; blank cells in the lower triangle mirror the entries shown.)

Types Table: multiplication (*) and division (/) operators

* bytearray cast as this data type

            bag     tuple     map     int       long      float     double    chararray   bytearray
bag         error   error     error   not yet   not yet   not yet   not yet   error       error
tuple               error     error   not yet   not yet   not yet   not yet   error       error
map                           error   error     error     error     error     error       error
int                                   int       long      float     double    error       cast as int *
long                                            long      float     double    error       cast as long *
float                                                     float     double    error       cast as float *
double                                                              double    error       cast as double *
chararray                                                                     error       error
bytearray                                                                                 cast as double *

Types Table: modulo (%) operator

            int           long           bytearray
int         int           long           cast as int
long                      long           cast as long
bytearray                                error

Comparison Operators

Operator                    Symbol     Notes
equal                       ==
not equal                   !=
less than                   <
greater than                >
less than or equal to       <=
greater than or equal to    >=
pattern matching            matches    Regular expression matching. Use the Java format for regular expressions.

Use the comparison operators with numeric and string data.

Example: numeric

X = FILTER A BY (f1 == 8);

Example: string

X = FILTER A BY (f2 == 'apache');

Example: matches

X = FILTER A BY (f1 matches '.*apache.*');

Types Table: numeric equal (==) and not equal (!=) operators

* bytearray cast as this data type

            bag     tuple                  map                    int       long      float     double    chararray   bytearray
bag         error   error                  error                  error     error     error     error     error       error
tuple               boolean (see Note 1)   error                  error     error     error     error     error       error
map                                        boolean (see Note 2)   error     error     error     error     error       error
int                                                               boolean   boolean   boolean   boolean   error       cast as boolean
long                                                                        boolean   boolean   boolean   error       cast as boolean
float                                                                                 boolean   boolean   error       cast as boolean
double                                                                                          boolean   error       cast as boolean
chararray                                                                                                 boolean     cast as boolean
bytearray                                                                                                             boolean

Note 1: boolean (Tuple A is equal to tuple B if they have the same size s, and for all 0 <= i < s, A[i] == B[i])

Note 2: boolean (Map A is equal to map B if A and B have the same number of entries, and for every key k1 in A with
a value of v1, there is a key k2 in B with a value of v2, such that k1 == k2 and v1 == v2)

Types Table: numeric inequality operators (>, <, >=, <=)

            bag     tuple   map     int       long      float     double    chararray   bytearray
bag         error   error   error   error     error     error     error     error       error
tuple               error   error   error     error     error     error     error       error
map                         error   error     error     error     error     error       error
int                                 boolean   boolean   boolean   boolean   error       boolean (bytearray cast as int)
long                                          boolean   boolean   boolean   error       boolean (bytearray cast as long)
float                                                   boolean   boolean   error       boolean (bytearray cast as float)
double                                                            boolean   error       boolean (bytearray cast as double)
chararray                                                                   boolean     boolean (bytearray cast as chararray)
bytearray                                                                               boolean
Types Table: matches operator

            chararray   bytearray
chararray   boolean     boolean (bytearray cast as chararray)
bytearray               boolean

Null Operators

Operator      Symbol        Notes
is null       is null
is not null   is not null

Example

X = FILTER A BY f1 is not null;

Types Table

The null operators can be applied to all data types. For more information, see Nulls.

Boolean Operators

Operator   Symbol   Notes
AND        and
OR         or
NOT        not

Example

X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1));

Dereference Operators

Operator            Symbol     Example
tuple dereference   . (dot)    a.$0
map dereference     #          mymap#key

Note the following:

·         Tuple dereferencing can be done by name or position. For example, mytuple.myfield and mytuple.$0 are both valid dereferences.
If the dot operator is applied to a bytearray, the bytearray is assumed to be a tuple.

·         Tuple dereferencing can be done at the relation or field level. For example, if you have relation A with three fields (f1, f2, and f3) you
can state A.f1. Or, if you have a bag with three tuples B{(1),(2),(3)} you can state B.$0.

·         Map dereferencing must be done by key, for example mymap#mykey. If the pound operator is applied to a bytearray, the bytearray
is assumed to be a map. If the key being looked up does not exist, the empty string is returned (see the sketch below).
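
A minimal sketch of these dereferences (the relation, field, and key names are assumed, not part of the syntax above):

A = LOAD 'data' AS (t:tuple(f1:int, f2:int), m:map[]);
X = FOREACH A GENERATE t.f1, t.$1, m#'mykey';   -- tuple by name, tuple by position, map by key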

Sign Operators

Operator              Symbol    Notes
positive              +         Has no effect.
negative (negation)   -         Changes the sign of a positive or negative number.

Example

A = LOAD 'data' as (x, y, z);

B = FOREACH A GENERATE -x, y;

Types Table: negation ( - ) operator

bag         error
tuple       error
map         error
int         int
long        long
float       float
double      double
chararray   error
bytearray   double (bytearray cast as double)

Cast Operators

Pig Latin supports casts as shown in this table.

 

from \ to   bag     tuple   map     int     long    float   double   chararray   bytearray
bag                 error   error   error   error   error   error    error       error
tuple       error           error   error   error   error   error    error       error
map         error   error           error   error   error   error    error       error
int         error   error   error           yes     yes     yes      yes         error
long        error   error   error   yes             yes     yes      yes         error
float       error   error   error   yes     yes             yes      yes         error
double      error   error   error   yes     yes     yes              yes         error
chararray   error   error   error   yes     yes     yes     yes                  error
bytearray   yes     yes     yes     yes     yes     yes     yes      yes

Syntax 

(data_type) data_item

Terms

(data_type)

Any Pig Latin data type except bytearray (you cannot cast to type bytearray) enclosed in parentheses (  ).

data_item

A single data item (datum), referred to as a field. The data item can be represented in positional notation or as an alias. For example: (int)$0 or (int)f1

Usage

Please note the following:

·         A field can be explicitly cast. Once cast, the field remains that type (it is not automatically cast back). In this example $0 is explicitly cast to int.

B = FOREACH A GENERATE (int)$0 + 1;

 

·         Where possible implicit casts are performed. In this example $0 is cast to int (regardless of underlying data) and $1 is cast to double.

B = FOREACH A GENERATE $0 + 1, $1 + 1.0;

 

·         When two bytearrays are used in arithmetic expressions or with built-in aggregate functions (such as SUM) they are implicitly cast to double. If the underlying data is really int or long, you’ll get better performance by declaring the type or explicitly casting the data.

·         Downcasts may cause loss of data. For example casting from long to int may drop bits.
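
For example (a hedged sketch; 'data' and the field name are assumed), declaring the field as int keeps SUM from implicitly casting bytearrays to double:

A = LOAD 'data' AS (x:int);        -- declared type: no implicit cast to double
B = GROUP A ALL;
C = FOREACH B GENERATE SUM(A.x);   -- the sum is computed on ints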

Nulls

In Pig Latin, nulls are implemented using the SQL definition of null as unknown or non-existent. Nulls can occur naturally in data or can be the result of an operation.

Nulls and Operators

Pig Latin operators and functions interact with nulls as shown below.

Comparison operators (==, !=, >, <, >=, <=): If either sub-expression is null, the result is null.

Comparison operator (matches): If either the string being matched against or the string defining the match is null, the result is null.

Arithmetic operators (+, -, *, /, % modulo, ? bincond): If either sub-expression is null, the resulting expression is null.

Null operator (is null): If the tested value is null, returns true; otherwise, returns false.

Null operator (is not null): If the tested value is not null, returns true; otherwise, returns false.

Dereference operators (tuple (.) or map (#)): If the dereferenced tuple or map is null, returns null.

Cast operator: Casting a null from one type to another type results in a null.

Functions (AVG, MIN, MAX, SUM): These functions ignore nulls.

Function (COUNT): This function counts all values, including nulls.

Function (CONCAT): If either sub-expression is null, the resulting expression is null.

Function (SIZE): If the tested object is null, returns null.
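
For example, if a field in a grouped bag holds the values 1, null, and 3, then COUNT of that field returns 3 (nulls are counted), SUM returns 4 (the null is ignored), and AVG returns 2.0 (the average of 1 and 3).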

For Boolean sub-expressions, note the results when nulls are used with these operators:

·         FILTER operator – If a Boolean sub-expression results in a null value, the filter does not pass the tuple through (if X is null, !X is also null, and the filter will reject both; see the sketch below).

·         Bincond operator – If a Boolean sub-expression results in a null value, the resulting expression is null (see the interactions above for arithmetic operators).
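
For example (a hypothetical sketch), if some values of f1 are null, those tuples appear in neither filtered relation:

X = FILTER A BY f1 > 5;          -- tuples where f1 is null are rejected
Y = FILTER A BY NOT (f1 > 5);    -- ... and they are rejected here as well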

Example: COUNT function

As noted, the COUNT function counts all values, including nulls. If you don't want the function to count null values, you can use one of the methods shown here.

In this example the is not null operator is used to filter (remove) all null values before subsequent operations, including the COUNT function, are applied.

A = LOAD 'data';

B = FILTER A BY $1 is not null;

C = GROUP B BY $0;

D = FOREACH C GENERATE group, COUNT(B.$1);

Suppose you have written a function, RemoveNulls, to filter null values. In this example RemoveNulls is used to filter null values for the COUNT function only.

A = LOAD 'data';

B = GROUP A BY $0;

D = FOREACH B GENERATE group, COUNT(RemoveNulls($1));

Nulls and Constants

Nulls can be used as constant expressions in place of expressions of any type.

In this example a and null are projected.

A = LOAD 'data' AS (a, b, c);

B = FOREACH A GENERATE a, null;

In this example of an outer join, if the join key is missing from a table it is replaced by null.

A = LOAD '/studenttab10k' AS (name: chararray, age: int, gpa: float);

B = LOAD 'votertab10k' AS (name: chararray, age: int, registration: chararray, donation: float);

C = COGROUP A BY name, B BY name;

D = FOREACH C GENERATE FLATTEN((IsEmpty(A) ? null : A)), FLATTEN((IsEmpty(B) ? null : B));

Like any other expression, null constants can be implicitly or explicitly cast.

In this example both a and null will be implicitly cast to double.

A = LOAD 'data' AS (a, b, c);

B = FOREACH A GENERATE a + null;

In this example both a and null will be cast to int, a implicitly, and null explicitly.

A = LOAD 'data' AS (a, b, c);

B = FOREACH A GENERATE a + (int)null;

Operations That Produce Nulls

As noted, nulls can be the result of an operation. These operations can produce null values:

·         Division by zero

·         Returns from user defined functions (UDFs)

·         Dereferencing a map key that does not exist in a map. For example, given a map info containing [name#john, phone#5551212] if a user tries to use info#address a null is returned.

·         Accessing a field that does not exist in a tuple. As a further explanation, see the examples below.

Example: Accessing a field that does not exist in a tuple

In this example the load statement declares a schema with two fields; if some rows in 'data' contain only a single column, nulls are injected for the missing field.

A = LOAD 'data' AS (f1, f2);

B = FOREACH A GENERATE $1;

Similarly, in this example if a load statement does not include a declared schema, nulls are injected if some rows contain only a single column.

A = LOAD 'data';

B = FOREACH A GENERATE $1;

In this example an error is generated because the requested column ($2) is outside of the declared schema (remember, positional notation begins with $0).

A = LOAD 'data' AS (f1, f2);

B = FOREACH A GENERATE $2;

Nulls and Load Functions

As noted, nulls can occur naturally in the data. If nulls are part of the data, it is the responsibility of the load function to handle them correctly. Keep in mind that what is considered a null value is loader-specific; however, the load function should always communicate null values to Pig by producing Java nulls.

The Pig Latin load functions (for example, PigStorage and TextLoader) produce null values wherever data is missing. For example, empty strings (chararrays) are not loaded; instead, they are replaced by nulls.

PigStorage is the default load function for the LOAD operator. In this example the is not null operator is used to filter names with null values.

A = LOAD 'data' AS (name, age, gpa);

B = FILTER A BY name is not null;

Constants

Pig provides constant representations for all data types except bytearrays.

 

             Constant Example                    Notes

Simple Data Types

Scalars
int          19
long         19L
float        19.2F or 1.92e2f
double       19.2 or 1.92e2

Arrays
chararray    'hello world'
bytearray                                        Not applicable.

Complex Data Types
tuple        (19, 2, 1)                          A constant in this form creates a tuple.
bag          { (19, 2), (1, 2) }                 A constant in this form creates a bag.
map          [ 'name' # 'John', 'ext' # 5555 ]   A constant in this form creates a map.

Please note the following:

·         Chararray (string) constants can be specified as UTF-8 characters (for example, f1 matches 'abc' ).

·         Any numeric constant consisting of just digits (for example, 123) is assigned the type of int.

·         To specify a long constant, l or L must be appended to the number (for example, 12345678L). If the l or L is not specified, but the number is too large to fit into an int, the problem will be detected at parse time and the processing is terminated.

·         Any numeric constant with decimal point (for example, 1.5) and/or exponent (for example, 5e+1) is treated as double unless it ends with f or F in which case it is assigned type float (for example,  1.5f).

The data type definitions for tuples, bags, and maps apply to constants:

·         A tuple can be any data type

·         A bag is a collection of tuples

·         A map key must be a scalar; a map value can be any data type

Complex constants can be used in the same places scalar constants can be used, that is, in FILTER and GENERATE statements.

A = LOAD 'data' USING MyStorage() AS (T: tuple(name:chararray, age: int));

B = FILTER A BY T == ('john', 25);

D = FOREACH B GENERATE T.name, [25#5.6], {(1, 5, 18)};

Expressions

In Pig Latin, expressions are language constructs used with the FILTER, FOREACH, GROUP, and SPLIT operators as well as the eval functions.

Expressions are written in conventional mathematical infix notation and are adapted to the UTF-8 character set. Depending on the context, expressions can include:

·         Any Pig data type (simple data types, complex data types)

·         Any Pig operator (arithmetic, comparison, null, boolean, dereference, sign, and cast)

·         Any Pig built-in function.

·         Any user-defined function (UDF) written in Java.

 

In a Pig Latin statement, an arithmetic expression could look like this:

X = GROUP A by f2*f3;

 

A string expression could look like this, where a and b are both chararrays:

X = FOREACH A GENERATE CONCAT(a,b);

 

A boolean expression could look like this:

X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1));

Schemas

Schemas enable you to assign names to and declare types for fields. Schemas are defined using the AS keyword with the LOAD, STREAM, and FOREACH statements.

Schemas are optional but we encourage you to use them whenever possible; type declarations result in better parse-time error checking and more efficient code execution.

If you don't define a schema, the fields are not named and default to type bytearray. You can refer to un-named fields using positional notation (see Data Items). You can change the field type using the cast operators (see Cast Operators).
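
For instance (a minimal sketch; the file name is assumed), a load without a schema leaves both fields un-named bytearrays, which can still be referenced positionally and cast:

A = LOAD 'data';                       -- no schema: fields are un-named, type bytearray
B = FOREACH A GENERATE $0, (int)$1;    -- positional notation plus an explicit cast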

Schemas with LOAD and STREAM Statements

With LOAD and STREAM statements, the schema following the AS keyword must be enclosed in parentheses.

In this example the LOAD statement includes a schema definition for simple data types.

A = LOAD 'data' AS (f1:int, f2:int);

Schemas with FOREACH Statements

With FOREACH statements, the schema following the AS keyword must be enclosed in parentheses when the FLATTEN keyword is used. Otherwise, the schema should not be enclosed in parentheses.

In this example the FOREACH statement includes the FLATTEN keyword and a schema for simple data types.

X = FOREACH C GENERATE FLATTEN(A) AS (f1:int, f2:int, f3:int);

In this example the FOREACH statement includes a schema for simple data types.

X = FOREACH A GENERATE f1+f2 AS x1:int;

Schemas for Simple Data Types

Simple data types include int, long, float, double, chararray, and bytearray (see Data Types).

Syntax

( alias[:type] [, alias[:type] …] )

Terms

alias

The name assigned to the field.

type

(Optional) The simple data type assigned to the field.

The alias and type are separated by a colon ( : ).

If the type is omitted, the field defaults to type bytearray.

( , )

Multiple fields are enclosed in parentheses and separated by commas.

Examples

In this example the schema defines multiple fields.

A = LOAD 'data' AS (name:chararray, age:int, gpa:float);

In this example the schema defines multiple fields. Field "b" will default to bytearray because no type is declared.

A = LOAD 'data' AS (a:int, b, c:int);

Schemas for Complex Data Types

Complex data types include tuples, bags, and maps (see Data Types).

Tuple Schema

A tuple (row) is an ordered set of data items.

Syntax

alias[:tuple] ( alias[:type] [, alias[:type] …] )

Terms

alias

The name assigned to the tuple.

:tuple

(Optional) The data type, tuple (case insensitive).

( )

The designation for a tuple, a set of parentheses.

alias[:type]

The constituents of the tuple, where the schema definition rules for the corresponding type apply:

·         alias – the name assigned to the field

·         type (optional) – the simple or complex data type assigned to the field

Examples

In this example the schema defines one field as a tuple. Both statements are equivalent.

A = LOAD 'data' AS (T: tuple (f1:int, f2: int, f3: long));

A = LOAD 'data' AS (T: (f1:int, f2: int, f3: long));

In this example the schema defines two fields as tuples.

A = LOAD 'data' AS (X:TUPLE (x1:int, x2: int, x3: int), Y:TUPLE (y1:int, y2: int, y3:int));

Bag Schema

A bag is a collection of tuples.

Syntax

alias[:bag] {tuple}

Terms

alias

The name assigned to the bag.

:bag

(Optional) The data type, bag (case insensitive).

{ }

The designation for a bag, a set of curly brackets.

tuple

A tuple (see Tuple Schema).

Examples

In this example the schema defines a field as a bag. This bag contains one tuple. Both statements are equivalent.

A = LOAD 'data' AS (B: bag {T: tuple(f1:int, f2:int, f3:long)});

A = LOAD 'data' AS (B: {T: (f1:int, f2:int, f3:long)});

In this example the schema defines a field as a bag. The tuples in the bag must conform to tuple schemas (T1 and T2) described in the bag schema (B).

A = LOAD 'data' AS (B: bag {T1: tuple(a:int, b:int), T2:tuple(x:int, y:int)});

In this example the schema defines two fields as bags. The tuples in the bags must conform to the tuple schemas (T1 and T2) described in the bag schemas (B1 and B2).

A = LOAD 'data' AS (B1: bag {T1: tuple(a:int, b:int)}, B2: bag {T2:tuple(x:int, y:int)});

Map Schema

A map is a set of key value pairs.

Syntax (where <> means optional)

alias<:map> [ ]

Terms

alias

The name assigned to the map.

:map

(Optional) The data type, map (case insensitive).

[ ]  

The designation for a map, a set of straight brackets [ ].

Example

In this example the schema defines a field as a map. Both statements are equivalent.

A = LOAD 'data' AS (M:map []);

A = LOAD 'data' AS (M:[]);

Schemas for Multiple Types

You can define schemas for data that includes multiple types.

Example

In this example the schema defines three fields. The field names are T, B, and M. The field types are tuple, bag, and map. The two statements are equivalent.

A = LOAD 'mydata' USING MyStorage() AS (T:tuple(f1:int, f2:int, f3:long), B:bag{t:tuple(x:int)}, M: map[] );

A = LOAD 'mydata' USING MyStorage() AS (T:(f1:int, f2:int, f3:long), B:{t:(x:int)}, M: [] );

Keywords

A: all, and, as, asc

B: bag, by, bytearray

C: cache, cat, cd, chararray, cogroup, copyFromLocal, copyToLocal, cp, cross

D: define, desc, describe, distinct, double, du, dump

E: e, E, explain

F: f, F, filter, flatten, float, foreach

G: generate, group

H: help

I: if, illustrate, inner, input, int, into, is

J: join

K: kill

L: l, L, limit, load, long, ls

M: map, matches, mkdir, mv

N: not, null

O: or, order, outer, output

P: parallel, pwd

Q: quit

R: register, rm, rmf

S: set, ship, split, stderr, stdin, stdout, store, stream

T: through, tuple

U: union, using

Functions: AVG, BinaryDeserializer, BinarySerializer, BinStorage, CONCAT, COUNT, DIFF, MIN, MAX, PigDump, PigStorage, SIZE, SUM, TextLoader, TOKENIZE

Symbols: ==   !=   <   >   <=   >=   +   -   *   /   %   ?   $   .   #   ( )   [ ]   { }

Preprocessor Statements: %declare, %default

Relational Operators

COGROUP

Groups the data in two or more relations.

Syntax

alias = COGROUP alias BY field_alias [INNER | OUTER], alias BY field_alias [INNER | OUTER] [PARALLEL n];

Terms

alias

The name of a relation.

field_alias

The name of one or more fields in a relation.

If multiple fields are specified, separate with commas and enclose in parentheses. For example, X = COGROUP A BY (f1, f2);

The number of fields specified in each BY clause must match. For example,
X = COGROUP A BY (a1,a2,a3), B BY (b1,b2,b3);

BY

Keyword.

INNER

Keyword.

OUTER

Keyword.

PARALLEL n

Increase the parallelism of a job by specifying the number of reduce tasks, n. The optimal number of parallel tasks depends on the amount of memory on each node and the memory required by each of the tasks. To determine n, use the following as a general guideline:

 

   n = (nr_nodes - 1) * 0.45 * nr_GB

 

where nr_nodes is the number of nodes used and nr_GB is the amount of  physical memory on each node.
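
For example, with 10 nodes and 8 GB of physical memory per node, the guideline gives n = (10 - 1) * 0.45 * 8 = 32.4, or about 32 reduce tasks.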

Note the following:

·         Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.

·         If you don’t specify parallel, you still get the same map parallelism but only one reduce task.

Usage

The COGROUP operator groups the data in two or more relations based on common field values.

Note: The COGROUP and JOIN operators perform similar functions. COGROUP creates a nested set of output tuples while JOIN creates a flat set of output tuples.

Examples

Suppose we have two relations, A and B.

A: (owner:chararray, pet:chararray)

---------------

(Alice, turtle)

(Alice, goldfish)

(Alice, cat)

(Bob, dog)

(Bob, cat)

 

B: (friend1:chararray, friend2:chararray)

---------------------

(Cindy, Alice)

(Mark, Alice)

(Paul, Bob)

(Paul, Jane)

In this example tuples are co-grouped using field “owner” from relation A and field “friend2” from relation B as the key fields. The DESCRIBE operator shows the schema for relation X, which has two fields, "group" and "A" (for an explanation, see GROUP).

X = COGROUP A BY owner, B BY friend2;

DESCRIBE X;

X: {group: chararray,A: {owner: chararray,pet: chararray},B: {friend1: chararray,friend2: chararray}}

Relation X looks like this. A tuple is created for each unique key field. The tuple includes the key field and two bags. The first bag is the tuples from the first relation with the matching key field. The second bag is the tuples from the second relation with the matching key field. If no tuples match the key field, the bag is empty.

(Alice, {(Alice, turtle), (Alice, goldfish), (Alice, cat)}, {(Cindy, Alice), (Mark, Alice)})

(Bob, {(Bob, dog), (Bob, cat)}, {(Paul, Bob)})

(Jane, {}, {(Paul, Jane)})

In this example tuples are co-grouped and the INNER keyword is used to ensure that only bags with at least one tuple are returned.

X = COGROUP A BY owner INNER, B BY friend2 INNER;

Relation X looks like this.

(Alice, {(Alice, turtle), (Alice, goldfish), (Alice, cat)}, {(Cindy, Alice), (Mark, Alice)})

(Bob, {(Bob, dog), (Bob, cat)}, {(Paul, Bob)})

In this example tuples are co-grouped and the INNER keyword is used asymmetrically on only one of the relations.

X = COGROUP A BY owner, B BY friend2 INNER;

Relation X looks like this.

(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})

(Jane,{},{(Paul,Jane)})

(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})

CROSS

Computes the cross product of two or more relations.

Syntax

alias = CROSS alias, alias [, alias …] [PARALLEL n];

Terms

alias

The name of a relation.

PARALLEL n

Increase the parallelism of a job by specifying the number of reduce tasks, n. The optimal number of parallel tasks depends on the amount of memory on each node and the memory required by each of the tasks. To determine n, use the following as a general guideline:

 

   n = (nr_nodes - 1) * 0.45 * nr_GB

 

where nr_nodes is the number of nodes used and nr_GB is the amount of  physical memory on each node.

Note the following:

·         Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.

·         If you don’t specify parallel, you still get the same map parallelism but only one reduce task.

Usage

Use the CROSS operator to compute the cross product (Cartesian product) of two or more relations.

CROSS is an expensive operation and should be used sparingly.

Example

Suppose we have relations A and B.

(A)           (B)
---------     ---------
(1, 2, 3)     (2, 4)
(4, 2, 1)     (8, 9)
              (1, 3)

In this example the cross product of relation A and B is computed.

X = CROSS A, B;

Relation X looks like this.

(1, 2, 3, 2, 4)

(1, 2, 3, 8, 9)

(1, 2, 3, 1, 3)

(4, 2, 1, 2, 4)

(4, 2, 1, 8, 9)

(4, 2, 1, 1, 3)

DISTINCT

Removes duplicate tuples in a relation.

Syntax

alias = DISTINCT alias [PARALLEL n];             

Terms

alias

The name of the relation.

PARALLEL n

Increase the parallelism of a job by specifying the number of reduce tasks, n. The optimal number of parallel tasks depends on the amount of memory on each node and the memory required by each of the tasks. To determine n, use the following as a general guideline:

 

   n = (nr_nodes - 1) * 0.45 * nr_GB

 

where nr_nodes is the number of nodes used and nr_GB is the amount of  physical memory on each node.

Note the following:

·         Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.

·         If you don’t specify parallel, you still get the same map parallelism but only one reduce task.

Usage

Use the DISTINCT operator to remove duplicate tuples in a relation. DISTINCT does not preserve the original order of the contents (to eliminate duplicates, Pig must first sort the data). You cannot use DISTINCT on a subset of fields. To do this, use FOREACH … GENERATE to select the fields, and then use DISTINCT.
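
For instance (a hedged sketch; the field names are assumed), to remove duplicates over just the first two fields:

B = FOREACH A GENERATE f1, f2;   -- project the fields of interest first
X = DISTINCT B;                  -- then remove duplicate tuples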

Example

Suppose we have relation A.

(A)       

-----------------                  

(8, 3, 4)

(1, 2, 3)                        

(4, 3, 3)                                                                                    

(4, 3, 3)

(1, 2, 3)

In this example all duplicate tuples are removed.

X = DISTINCT A;

Relation X looks like this.

(1, 2, 3)

(4, 3, 3)

(8, 3, 4)

FILTER

Selects tuples (rows) from a relation based on some condition.

Syntax

alias = FILTER alias  BY expression;

Terms

alias

The name of the relation.

BY

Required keyword.

expression

An expression.

Usage

Use the FILTER operator to work with tuples (rows) of data. FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don’t want.

Note: If you want to work with specific fields (columns) of data, use the FOREACH …GENERATE operation.

Examples

Suppose we have relation A.

(A: f1:int, f2:int, f3:int)

----------------

(1, 2, 3)

(4, 2, 1)

(8, 3, 4)

(4, 3, 3)

(7, 2, 5)

(8, 4, 3)

In this example the condition states that if the third field equals 3, then add the tuple to relation X.

X = FILTER A BY f3 == 3;

Relation X looks like this.

(1, 2, 3)

(4, 3, 3)

(8, 4, 3)

In this example the condition states that if the first field equals 8, or if the sum of fields f2 and f3 is not greater than the first field, then add the tuple to relation X.

X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));

Relation X looks like this.

(4, 2, 1)

(8, 3, 4)

(7, 2, 5)

(8, 4, 3)

FOREACH … GENERATE

Generates data transformations based on fields (columns) of data.

Syntax

alias  = FOREACH { gen_blk | nested_gen_blk } [AS schema];

Terms

alias

The name of a relation.

gen_blk

FOREACH … GENERATE used with a non-nested relation. Use this syntax:

 

alias = FOREACH alias GENERATE expression [expression ….]

nested_gen_blk

FOREACH … GENERATE used with a nested relation. Use this syntax:

 

alias = FOREACH nested_alias {

   alias = nested_op; [alias = nested_op; …]

   GENERATE expression [expression ….]

};

 

Where:

The nested block is enclosed in opening and closing brackets { … }.

The GENERATE keyword must be the last statement within the nested block.

expression

An expression.

nested_alias

If one of the fields (columns) in a relation is a bag, the bag can be treated as an inner or a nested relation.

nested_op

Allowable operations include FILTER, ORDER, and DISTINCT.

The FOREACH … GENERATE operation itself is not allowed since this could lead to an arbitrary number of nesting levels.

AS

Keyword.

schema

A schema using the AS keyword (see Schemas).

·         If the FLATTEN keyword is used, enclose the schema in parentheses.

·         If the FLATTEN keyword is not used, don't enclose the schema in parentheses.

Usage

Use the FOREACH …GENERATE operation to work with individual fields (columns) of data. The FOREACH …GENERATE operation works with non-nested and nested relations.

A statement with a non-nested relation A could look like this.

X = FOREACH A GENERATE f1;

A statement with a nested relation A could look like this.

X = FOREACH B {

            S = FILTER A BY f1 == 'xyz';

            GENERATE COUNT (S.$0);

}

Note: FOREACH … GENERATE works with fields (columns) of data. If you want to work with entire tuples (rows) of data, use the FILTER operation.

Examples

Suppose we have relations A and B, and derived relation C (where C = COGROUP A BY a1 INNER, B BY b1 INNER;).

(A: a1:int, a2:int, a3:int) 

-----------------                  

(1, 2, 3)                        

(4, 2, 1)            

(8, 3, 4)            

(4, 3, 3)                        

(7, 2, 5)                        

(8, 4, 3)                        

                       

(B: b1:int, b2:int)

---------------        

(2, 4)                

(8, 9)                

(1, 3)                

(2, 7)

(2, 9)

(4, 6)

(4, 9)

 (C: c1, c2, c3)

---------------------

 (1, {(1, 2, 3)}, {(1, 3)})

 (4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4, 9)})

 (8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})

 

Example: Projection

In this example the asterisk (*) is used to project all fields from relation A to relation X (this is similar to SQL Select *). Relation A and X are identical.

X = FOREACH A GENERATE *;

In this example two fields from relation A are projected to form relation X.

X = FOREACH A GENERATE a1, a2;

Relation X looks like this.

(1, 2)

(4, 2)

(8, 3)

(4, 3)

(7, 2)

(8, 4)

Example: Nested Projection

Note: See GROUP for information about the "group" field in relation C.

If one of the fields in the input relation is a tuple, bag, or map, we can perform a projection on that field. In this example field B (a bag) is projected.

X = FOREACH C GENERATE group, B.b2;

Relation X looks like this.

(1, {(3)})

(4, {(6), (9)})

(8, {(9)})

In this example multiple nested columns are retained.

X = FOREACH C GENERATE group, A.(a1, a2);

Relation X looks like this.

(1, {(1, 2)})

(4, {(4, 2), (4, 3)})

(8, {(8, 3), (8, 4)})

Example: Schema

In this example two fields in relation A are summed to form relation X. A schema is defined for the projected field.

X = FOREACH A GENERATE a1+a2 AS f1:int;

Y = FILTER X by f1 > 10;

Relations X and Y look like this.

(X)        (Y)

-----       ------

(3)        (11)

(6)        (12)

(11)

(7)

(9)

(12)

Example: Applying Functions

Note: See GROUP for information about the "group" field in relation C.

In this example the built-in function SUM() is used to sum a set of numbers in a bag.

X = FOREACH C GENERATE group, SUM (A.a1);

Relation X looks like this.

(1, 1)

(4, 8)

(8, 16)

Example: Flattening

Note: See GROUP for information about the "group" field in relation C.

In this example the FLATTEN keyword is used to eliminate nesting.

X = FOREACH C GENERATE group, FLATTEN(A);

Relation X looks like this.

(1, 1, 2, 3)

(4, 4, 2, 1)

(4, 4, 3, 3)

(8, 8, 3, 4)

(8, 8, 4, 3)

Another FLATTEN example.

X = FOREACH C GENERATE group, FLATTEN(A.a3);

Relation X looks like this.

(1, 3)

(4, 1)

(4, 3)

(8, 4)

(8, 3)

Another FLATTEN example.

X = FOREACH C GENERATE FLATTEN(A.(a1, a2)), FLATTEN(B.$1);

Relation X looks like this. Note that for the group '4' in C, there are two tuples in each bag. Thus, when both bags are flattened, the cross product of these tuples is returned; that is, tuples (4, 2, 6), (4, 3, 6), (4, 2, 9), and (4, 3, 9).

(1, 2, 3)

(4, 2, 6)

(4, 3, 6)

(4, 2, 9)

(4, 3, 9)

(8, 3, 9)

(8, 4, 9)

Example: Nested Block

Suppose we have relation A and derived relation B (where B = GROUP A BY url;). Since relation B contains tuples with bags it can be treated as a nested relation.

A (url:chararray, outlink:chararray)          

---------------------------------------------  

(www.ccc.com,www.hjk.com)     

(www.ddd.com,www.xyz.org)     

(www.aaa.com,www.cvn.org)                     

(www.www.com,www.kpt.net)

(www.www.com,www.xyz.org)

(www.ddd.com,www.xyz.org)

 

B

---------------------------------------------  

 (www.aaa.com,{(www.aaa.com,www.cvn.org)})

 (www.ccc.com,{(www.ccc.com,www.hjk.com)})

 (www.ddd.com,{(www.ddd.com,www.xyz.org),(www.ddd.com,www.xyz.org)})

 (www.www.com,{(www.www.com,www.kpt.net),(www.www.com,www.xyz.org)})

In this example we perform two of the allowed Pig operations, FILTER (FA) and DISTINCT (DA), as well as projection (PA). Note that the last statement in the nested block must be GENERATE.

X = foreach B {

        FA= FILTER A BY outlink == 'www.xyz.org';

        PA = FA.outlink;

        DA = DISTINCT PA;

        GENERATE group, COUNT(DA);

}

Relation X looks like this.

(www.ddd.com,1L)

(www.www.com,1L)

GROUP

Groups the data in a single relation.

Syntax

alias = GROUP alias [BY { field_alias [, field_alias] | * | expression }] [ALL] [PARALLEL n];

Terms

alias

The name of a relation.

BY

Keyword. Use this clause to group the relation by fields or by expression.

field_alias

The name of a field in a relation. This is the group key or key field.

A relation can be grouped by a single field (f1) or by the composite value of multiple fields (f1,f2).

*

The asterisk. A designator for all fields in the relation.

expression

An expression.

ALL

Keyword. Use ALL if you want all tuples to go to a single group; for example, when doing aggregates across entire relations.

PARALLEL n

Increase the parallelism of a job by specifying the number of reduce tasks, n. The optimal number of parallel tasks depends on the amount of memory on each node and the memory required by each of the tasks. To determine n, use the following as a general guideline:

 

   n = (nr_nodes - 1) * 0.45 * nr_GB

 

where nr_nodes is the number of nodes used and nr_GB is the amount of  physical memory on each node.

Note the following:

·         Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.

·         If you don’t specify parallel, you still get the same map parallelism but only one reduce task.

Usage

The GROUP operator groups together tuples that have the same group key (key field). The result of a GROUP operation is a relation that includes one tuple per group. This tuple contains two fields:

·         The first field is named "group" (do not confuse this with the GROUP operator) and is the same type as the group key.

·         The second field takes the name of the original relation and is type bag.

 

Suppose we have the following data:

john      25      3.6

george  25      2.9

anne     27      3.9

julia      28      3.6

And, suppose we perform the LOAD and GROUP statements shown below. We can use the DESCRIBE operator to view the schema for relation Y. We can use DUMP to view the contents of Y.

Note that relation Y has two fields. The first field is named "group" and is type int (the same as age). The second field takes the name of the original relation "X" and is type bag (that can contain tuples with three elements of type chararray, int, and float).

Statements

X = LOAD 'data' AS (name:chararray, age:int, gpa:float);

Y = GROUP X BY age;

DESCRIBE Y;

Y: {group: int,X: {name: chararray,age: int,gpa: float}}

DUMP Y;

(25,{(john,25,3.6F),(george,25,2.9F)})

(27,{(anne,27,3.9F)})

(28,{(julia,28,3.6F)})

 

As shown in this FOREACH statement, we can refer to the fields in relation Y by their names "group" and "X".

Z = FOREACH Y GENERATE group, COUNT(X);

 

Relation Z looks like this.

(25,2L)

(27,1L)

(28,1L)

Examples

Suppose we have relation A.

A: (owner:chararray, pet:chararray)

-----------------

(Alice, turtle)

(Alice, goldfish)

(Alice, cat)

(Bob, dog)

(Bob, cat)

In this example tuples are grouped using the field "owner."

X = GROUP A BY owner;

Relation X looks like this. "group" is the name of the first field. "A" is the name of the second field.

(Alice, {(Alice, turtle), (Alice, goldfish), (Alice, cat)})

(Bob, {(Bob, dog), (Bob, cat)})

In this example tuples are grouped using the ALL keyword. Field "A" is then counted and projected to form relation Y.

X = GROUP A ALL;

Y = FOREACH X GENERATE COUNT(A);

Relation X looks like this. "group" is the name of the first field. "A" is the name of the second field.

(all,{(Alice,turtle),(Alice,goldfish),(Alice,cat),(Bob,dog),(Bob,cat)})

Relation Y looks like this.

(5L)

Suppose we have relation S.

S: (f1:chararray, f2:int, f3:int)

-----------------

(r1, 1, 2)

(r2, 2, 1)

(r3, 2, 8)

(r4, 4, 4)

In this example tuples are grouped using an expression, f2*f3.

X = GROUP S BY f2*f3;

Relation X looks like this. The first field is named "group". The second field is named "S".

(2, {(r1, 1, 2), (r2, 2, 1)})

(16, {(r3, 2, 8), (r4, 4, 4)})

JOIN

Joins two or more relations based on common field values.

Syntax

alias = JOIN alias BY field_alias, alias BY field_alias [, alias BY field_alias …] [PARALLEL n];                   

Terms

alias

The name of a relation.

BY

Keyword.

field_alias

The name of a field in a relation. The alias and field_alias specified in the BY clause must correspond.

Example: X = JOIN relationA BY fieldA, relationB by fieldB, relationC by fieldC;

PARALLEL n

Increase the parallelism of a job by specifying the number of reduce tasks, n. The optimal number of parallel tasks depends on the amount of memory on each node and the memory required by each of the tasks. To determine n, use the following as a general guideline:

 

   n = (nr_nodes - 1) * 0.45 * nr_GB

 

where nr_nodes is the number of nodes used and nr_GB is the amount of  physical memory on each node.

Note the following:

·         Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.

·         If you don’t specify parallel, you still get the same map parallelism but only one reduce task.

Usage

Use the JOIN operator to join two or more relations based on common field values. The JOIN operator always performs an inner join.

Note: The JOIN and COGROUP operators perform similar functions. JOIN creates a flat set of output records while COGROUP creates a nested set of output records.

Example

Suppose we have relations A and B.

(A: a1, a2, a3)    (B: b1, b2)
-----------------  ---------------
(1, 2, 3)          (2, 4)
(4, 2, 1)          (8, 9)
(8, 3, 4)          (1, 3)
(4, 3, 3)          (2, 7)
(7, 2, 5)          (2, 9)
(8, 4, 3)          (4, 6)
                   (4, 9)

In this example relations A and B are joined on their first fields.

X = JOIN A BY a1, B BY b1;

Relation X looks like this.

(1, 2, 3, 1, 3)

(4, 2, 1, 4, 6)

(4, 3, 3, 4, 6)

(4, 2, 1, 4, 9)

(4, 3, 3, 4, 9)

(8, 3, 4, 8, 9)

(8, 4, 3, 8, 9)

LIMIT

Limits the number of output tuples.

Syntax

alias = LIMIT alias  n;

Terms

alias

The name of a relation.

n

The number of tuples.

Usage

Use the LIMIT operator to limit the number of output tuples (rows). If the specified number of output tuples is equal to or exceeds the number of tuples in the relation, the output will include all tuples in the relation.

There is no guarantee which tuples will be returned, and the tuples that are returned can change from one run to the next. A particular set of tuples can be requested using the ORDER operator followed by LIMIT.

Note: The LIMIT operator allows Pig to avoid processing all tuples in a relation. In most cases a query that uses LIMIT will run more efficiently than an identical query that does not use LIMIT. It is always a good idea to use limit if you can.

Examples

Suppose we have relation A.

(A: f1:int, f2:int, f3:int)    

-----------------                  

(1, 2, 3)

(4, 2, 1)

(8, 3, 4)

(4, 3, 3)

(7, 2, 5)

(8, 4, 3)

In this example output is limited to 3 tuples.

X = LIMIT A 3;

Relation X could look like this (there is no guarantee which three tuples will be output).

(1, 2, 3)

(4, 3, 3)

(7, 2, 5)

In this example the ORDER operator is used to order the tuples and the LIMIT operator is used to output the first three tuples.

B = ORDER A BY f1 DESC, f2 ASC;

X = LIMIT B 3;

Relation B and relation X look like this.

(B)           (X)
-----------   -----------
(8, 3, 4)     (8, 3, 4)
(8, 4, 3)     (8, 4, 3)
(7, 2, 5)     (7, 2, 5)
(4, 2, 1)
(4, 3, 3)
(1, 2, 3)

LOAD

Loads data from the file system.

Syntax

LOAD 'data' [USING function] [AS schema];              

Terms

'data'

The name of the file or directory, in single quotes.

If you specify a directory name, all the files in the directory are loaded.

You can use Hadoop-supported globbing to specify files at the file system or directory levels (see the Hadoop glob documentation for details on globbing syntax).

USING

Keyword.

function

The load function.

PigStorage is the default load/store function and does not need to be specified. This function reads/writes simple newline-separated records with delimiter-separated fields. The function has one parameter, the field delimiter; tab ('\t') is the default delimiter.

If the data is stored in a special format that the Pig load functions cannot parse, you can write your own load function.

AS

Keyword.

schema

A schema using the AS keyword, enclosed in parentheses (see Schemas).

Usage

Use the LOAD operator to load data from the file system.

Examples

Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are newline-separated.

1          2          3

4          2          1

8          3          4

In this example the default load function, PigStorage, loads data from myfile.txt into relation A. Note that, because no schema is specified, the fields are not named and all fields default to type bytearray. The two statements are equivalent.

A = LOAD 'myfile.txt';

A = LOAD 'myfile.txt' USING PigStorage('\t');

Relation A looks like this.

(1, 2, 3)

(4, 2, 1)

(8, 3, 4)

In this example a schema is specified using the AS keyword. The two statements are equivalent.

A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);

A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);

ORDER

Sorts a relation based on one or more fields.

Syntax

alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] ] } [PARALLEL n];

Terms

alias

The name of a relation.

BY

Required keyword.

*

Represents all fields in the relation.

ASC

Sort in ascending order.

DESC

Sort in descending order.

field_alias

A field in the relation.

PARALLEL n

Increase the parallelism of a job by specifying the number of reduce tasks, n. The optimal number of parallel tasks depends on the amount of memory on each node and the memory required by each of the tasks. To determine n, use the following as a general guideline:

 

   n = (nr_nodes - 1) * 0.45 * nr_GB

 

where nr_nodes is the number of nodes used and nr_GB is the amount of physical memory on each node.

Note the following:

·         Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.

·         If you don’t specify parallel, you still get the same map parallelism but only one reduce task.
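As a worked example of the guideline above: on a hypothetical cluster of 10 nodes with 8 GB of physical memory each, n = (10 - 1) * 0.45 * 8 ≈ 32, so the statement might look like this:

X = ORDER A BY f1 PARALLEL 32;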

Usage

In Pig, relations are logically unordered.

·         If you order relation A to produce relation X (X = ORDER A BY * DESC;), relations A and X still contain the same data.

·         If you retrieve the contents of relation X, they are guaranteed to be in the order you specified (descending).

·         However, if you further process relation X, there is no guarantee that the contents will be processed in the order you specified.

Examples

Suppose we have relation A.

(A: f1, f2, f3)     

-----------------                  

(1, 2, 3)                        

(4, 2, 1)                                    

(8, 3, 4)                                    

(4, 3, 3)

(7, 2, 5)            

(8, 4, 3)

In this example relation A is sorted by the third field, f3, in descending order.

X = ORDER A BY f3 DESC;

Relation X could look like this (note that the order of the three tuples ending in 3 can vary).

(7, 2, 5)

(8, 3, 4)

(1, 2, 3)

(4, 3, 3)

(8, 4, 3)

(4, 2, 1)

SPLIT

Partitions a relation into two or more relations.

Syntax

SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression …];

Terms

alias

The name of a relation.

INTO

Required keyword.

IF

Required keyword.

expression

An expression.

Usage

Use the SPLIT operator to partition a relation into two or more relations based on some expression. Depending on the expression:

·         A tuple may be assigned to more than one relation.

·         A tuple may not be assigned to any relation.

Example

Suppose we have relation A.

(A: f1, f2, f3)     

-----------------                              

(1, 2, 3)

(4, 5, 6)

(7, 8, 9)

In this example relation A is split into three relations, X, Y, and Z.

SPLIT A INTO X IF f1< 7, Y IF f2==5, Z IF (f3<6 OR f3>6);

Relations X, Y, and Z look like this.

(X)                 (Y)                 (Z)

----------          ----------          ----------

(1, 2, 3)           (4, 5, 6)           (1, 2, 3)

(4, 5, 6)                               (7, 8, 9)

STORE

Stores data to the file system.

Syntax

STORE alias INTO 'directory' [USING function];

Terms

alias

The name of a relation.

INTO

Required keyword.

'directory'

The name of the storage directory, in quotes. If the directory already exists, the STORE operation will fail.

 

The output data files, named part-nnnnn, are written to this directory.

USING

Keyword. Use this clause to name the store function.

function

The store function.

PigStorage is the default load/store function and does not need to be specified. This function reads/writes simple newline-separated records with delimiter-separated fields. The function has one parameter, the field delimiter; tab ('\t') is the default delimiter.

If you want to store the data in a special format that the Pig Load/Store functions cannot handle, you can write your own store function.

Usage

Use the STORE operator to store data on the file system.
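Because the operation fails if the output directory already exists, a common pattern is to remove any previous output first. A minimal sketch, using the rmf command described later in this manual (the directory name myoutput is just an example):

grunt> rmf myoutput

grunt> STORE A INTO 'myoutput';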

Example

Suppose we have relation A.

(A)

----------------

(1, 2, 3)

(4, 2, 1)

(8, 3, 4)

(4, 3, 3)

(7, 2, 5)

(8, 4, 3)

In this example the contents of relation A are written to file part-00000 located in directory myoutput.

STORE A INTO 'myoutput' USING PigStorage('*');

The part-00000 file looks like this. Fields are delimited with the asterisk ( * ) character and records are separated by newlines.

1*2*3

4*2*1

8*3*4

4*3*3

7*2*5

8*4*3

STREAM

Sends data to an external script or program.

Syntax

alias = STREAM alias [, alias …] THROUGH {`command` | cmd_alias } [AS schema] ;

Terms

alias

The name of a relation.

THROUGH

Keyword.

`command`

A command, including the arguments, enclosed in backticks (where a command is anything that can be executed).

cmd_alias

The name of a command created using the DEFINE operator.

AS

Keyword.

schema

A schema using the AS keyword, enclosed in parentheses (see Schemas).

Usage

Use the STREAM operator to send data through an external script or program. Multiple stream operators can appear in the same Pig script. The stream operators can be adjacent to each other or have other operations in between.

When used with a command, a stream statement could look like this:

A = LOAD 'data';

B = STREAM A THROUGH `stream.pl -n 5`;

When used with a cmd_alias, a stream statement could look like this, where cmd is the defined alias.

A = LOAD 'data';

DEFINE cmd `stream.pl -n 5`;

B = STREAM A THROUGH cmd;
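As noted above, multiple stream operators can appear in the same script. A minimal sketch, assuming two hypothetical scripts, parse.pl and filter.pl:

A = LOAD 'data';

B = STREAM A THROUGH `parse.pl`;

C = STREAM B THROUGH `filter.pl`;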

About Data Guarantees

Data guarantees are determined based on the position of the streaming operator in the Pig script.

·         Unordered data – No guarantee for the order in which the data is delivered to the streaming application.

·         Grouped data – The data for the same grouped key is guaranteed to be provided to the streaming application contiguously.

·         Grouped and ordered data – The data for the same grouped key is guaranteed to be provided to the streaming application contiguously. Additionally, the data within the group is guaranteed to be sorted by the provided secondary key.

In addition to position, data grouping and ordering can be determined by the data itself. However, you need to know the property of the data to be able to take advantage of its structure.

Example: Data Guarantees

In this example the data is unordered.

A = LOAD 'data';

B = STREAM A THROUGH `stream.pl`;

In this example the data is grouped.

A = LOAD 'data';

B = GROUP A BY $1;

C = FOREACH B GENERATE FLATTEN(A);

D = STREAM C THROUGH `stream.pl`;

In this example the data is grouped and ordered.

A = LOAD 'data';

B = GROUP A BY $1;

C = FOREACH B {

      D = ORDER A BY ($3, $4);

      GENERATE D;

}

E = STREAM C THROUGH `stream.pl`;

Example: Schemas

In this example a schema is specified as part of the STREAM statement.

X = STREAM A THROUGH `stream.pl` AS (f1:int, f2:int, f3:int);

Additional Examples

See DEFINE for additional examples.

UNION

Computes the union of two or more relations.

Syntax

alias = UNION alias, alias [, alias …];

Terms

alias

The name of a relation.

Usage

Use the UNION operator to compute the union of two or more relations. The UNION operator:

·         Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.

·         Does not ensure (as databases do) that all tuples adhere to the same schema or that they have the same number of fields. In a typical scenario, however, this should be the case; therefore, it is the user's responsibility to either (1) ensure that the tuples in the input relations have the same schema or (2) be able to process varying tuples in the output relation.

·         Does not eliminate duplicate tuples.
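Because duplicates are not eliminated, you can follow UNION with the DISTINCT operator when unique tuples are required. A minimal sketch:

X = UNION A, B;

Y = DISTINCT X;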

Example

Suppose we have relations A and B.

(A)                 (B)

----------          ----------

(1, 2, 3)           (2, 4)

(4, 2, 1)           (8, 9)

                    (1, 3)

In this example the union of relation A and B is computed.

X = UNION A, B;

Relation X looks like this.

(1, 2, 3)

(4, 2, 1)

(2, 4)

(8, 9)

(1, 3)

Diagnostic Operators

DESCRIBE

Returns the schema of an alias.

Syntax

DESCRIBE alias;                                                                      

Terms

alias

The name of a relation.

Usage

Use the DESCRIBE operator to review the schema of a particular alias.

Example

In this example a schema is specified using the AS clause.

A = LOAD 'students' AS (name:chararray, age:int, gpa:float);

B = FILTER A BY name matches 'John%';

C = GROUP B BY name;

D = FOREACH C GENERATE COUNT(B.age);

DESCRIBE A;

A: {name: chararray,age: int,gpa: float}

DESCRIBE B;

B: {name: chararray,age: int,gpa: float}

DESCRIBE C;

C: {group: chararray,B: {name: chararray,age: int,gpa: float}}

DESCRIBE D;

D: {long}

In this example no schema is specified. All data items default to type bytearray.

grunt> a = LOAD '/data/students';

grunt> b = FILTER a BY $0 matches 'John%';

grunt> c = GROUP b BY $0;

grunt> d = FOREACH c GENERATE COUNT(b.$1);

grunt> DESCRIBE a;

Schema for a unknown.

grunt> DESCRIBE b;

2008-12-05 01:17:15,316 [main] WARN  org.apache.pig.PigServer - bytearray is implicitly cast to chararray under LORegexp Operator

Schema for b unknown.

grunt> DESCRIBE c;

2008-12-05 01:17:23,343 [main] WARN  org.apache.pig.PigServer - bytearray is implicitly cast to chararray under LORegexp Operator

c: {group: bytearray,b: {null}}

grunt> DESCRIBE d;

2008-12-05 03:04:30,076 [main] WARN  org.apache.pig.PigServer - bytearray is implicitly cast to chararray under LORegexp Operator

d: {long}

DUMP

Displays the contents of an alias.

Syntax

DUMP alias;                                                           

Terms

alias

The name of a relation.

Usage

Use the DUMP operator to display the contents of an alias. You can use DUMP as a debugging device to make sure the correct results are being generated.

Example

In this example a dump is performed after each statement.

A = LOAD 'students' AS (name:chararray, age:int, gpa:float);

DUMP A;

B = FILTER A BY name matches 'John%';

DUMP B;

C = GROUP B BY name;

DUMP C;

D = FOREACH C GENERATE COUNT(B.age);

DUMP D;

EXPLAIN

Displays execution plans.

Syntax

EXPLAIN alias;                            

Terms

alias

The name of a relation.

Usage

Use the EXPLAIN operator to review the logical, physical, and map reduce execution plans that are used to compute the specified relation.

·         The logical plan shows a pipeline of operators to be executed to build the relation. Type checking and backend-independent optimizations (such as applying filters early on) also apply.

·         The physical plan shows how the logical operators are translated to backend-specific physical operators. Some backend optimizations also apply.

·         The map reduce plan shows how the physical operators are grouped into map reduce jobs.

Example

In this example the EXPLAIN operator produces all three plans. (Note that only a portion of the output is shown in this example.)

A = LOAD 'students' AS (name:chararray, age:int, gpa:float);

B = GROUP A BY name;

C = FOREACH B GENERATE COUNT(A.age);

EXPLAIN C;

 

Logical Plan:

Store xxx-Fri Dec 05 19:42:29 UTC 2008-23 Schema: {long} Type: Unknown

|

|---ForEach xxx-Fri Dec 05 19:42:29 UTC 2008-15 Schema: {long} Type: bag

etc …

-----------------------------------------------

Physical Plan:

-----------------------------------------------

Store(fakefile:org.apache.pig.builtin.PigStorage) - xxx-Fri Dec 05 19:42:29 UTC 2008-40

|

|---New For Each(false)[bag] - xxx-Fri Dec 05 19:42:29 UTC 2008-39

    |   |

    |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - xxx-Fri Dec 05

etc …

 

--------------------------------------------------

| Map Reduce Plan                                |

--------------------------------------------------

MapReduce node xxx-Fri Dec 05 19:42:29 UTC 2008-41

Map Plan

Local Rearrange[tuple]{chararray}(false) - xxx-Fri Dec 05 19:42:29 UTC 2008-34

|   |

|   Project[chararray][0] - xxx-Fri Dec 05 19:42:29 UTC 2008-35

etc …  

ILLUSTRATE

Displays a step-by-step execution of a sequence of statements.

Syntax

ILLUSTRATE alias;                                                 

Terms

alias

The name of a relation.

Usage

Use the ILLUSTRATE operator to review how data items are transformed through a sequence of Pig Latin statements.

ILLUSTRATE accesses the ExampleGenerator algorithm, which can automatically select an appropriate and concise set of example data items. It does a better job than random sampling would; for example, random sampling suffers from the drawback that selective operations such as filters or joins can eliminate all the sampled data items, giving you empty results, which are of no help with debugging.

With the ILLUSTRATE operator you can test your programs on small datasets and get faster turnaround times. The ExampleGenerator algorithm uses Pig's Local mode (rather than Hadoop mode) which means that illustrative example data is generated in near real-time.

Example

Suppose we have a data file called 'visits.txt'.

Amy     cnn.com         20080218

Fred    harvard.edu     20081204

Amy     bbc.com         20081205

Fred    stanford.edu    20081206

In this example we count the number of sites a user has visited since 12/1/08. The ILLUSTRATE statement will show how the results for num_user_visits are derived.

visits = LOAD 'visits.txt' AS (user:chararray, url:chararray, timestamp:chararray);

recent_visits = FILTER visits BY timestamp >= '20081201';

user_visits = GROUP recent_visits BY user;

num_user_visits = FOREACH user_visits GENERATE COUNT(recent_visits);

ILLUSTRATE num_user_visits;

The output from the ILLUSTRATE statement looks like this.

------------------------------------------------------------------------

| visits     | user: bytearray | url: bytearray | timestamp: bytearray |

------------------------------------------------------------------------

|            | Amy             | cnn.com        | 20080218             |

|            | Fred            | harvard.edu    | 20081204             |

|            | Amy             | bbc.com        | 20081205             |

|            | Fred            | stanford.edu   | 20081206             |

------------------------------------------------------------------------

-------------------------------------------------------------------------------

| recent_visits     | user: chararray | url: chararray | timestamp: chararray |

-------------------------------------------------------------------------------

|                   | Fred            | harvard.edu    | 20081204             |

|                   | Amy             | bbc.com        | 20081205             |

|                   | Fred            | stanford.edu   | 20081206             |

-------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------

| user_visits     | group: chararray | recent_visits: bag({user: chararray,url: chararray,timestamp: chararray}) |

------------------------------------------------------------------------------------------------------------------

|                 | Amy              | {(Amy, bbc.com, 20081205)}                                                |

|                 | Fred             | {(Fred, harvard.edu, 20081204), (Fred, stanford.edu, 20081206)}           |

------------------------------------------------------------------------------------------------------------------

-------------------------------

| num_user_visits     | long  |

-------------------------------

|                     | 1     |

|                     | 2     |

-------------------------------

UDF Statements

DEFINE

Assigns an alias to a function or command.

Syntax

DEFINE alias {function | [`command` [input] [output] [ship] [cache]] };

Terms

alias

The name for the function or command.

function

The name of a function.

Use this option to define functions for use with the FOREACH and FILTER operators.

`command`

A command, including the arguments, enclosed in back tics (where a command is anything that can be executed).

Use this option to define commands for use with the STREAM operator.

input

INPUT ( {stdin | 'path'} [USING serializer] [, {stdin | 'path'} [USING serializer] …] )

Where:

·       INPUT – Keyword.

·       'path' – A file path, enclosed in single quotes.

·       USING – Keyword.

·       serializer – A function that converts data from tuples to stream format. PigStorage is the default serializer. You can also write your own UDF.

output

OUTPUT ( {stdout | stderr | 'path'} [USING deserializer] [, {stdout | stderr | 'path'} [USING deserializer] …] )

Where:

·       OUTPUT – Keyword.

·       'path' – A file path, enclosed in single quotes.

·       USING – Keyword.

·       deserializer – A function that converts data from stream format to tuples. PigStorage is the default deserializer. You can also write your own UDF.

ship

SHIP('path' [, 'path' …])

Where:

·       SHIP – Keyword.

·       'path' – A file path, enclosed in single quotes.

cache

CACHE('dfs_path#dfs_file' [, 'dfs_path#dfs_file' …])

Where:

·       CACHE – Keyword.

·       'dfs_path#dfs_file' – A file path/file name on the distributed file system, enclosed in single quotes. Example: '/mydir/mydata.txt#mydata.txt'


Usage

Use the DEFINE statement to assign a name (alias) to a function or to a command.

Use DEFINE to specify a function when:

·         The function has a long package name that you don't want to include in a script, especially if you call the function several times in that script.

·         The constructor for the function takes parameters (see the first example below). If you need to use different constructor parameters for different calls to the function, you will need to create multiple defines, one for each parameter set, as sketched below.
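A sketch of multiple defines for one function with different constructor parameters (the function name and argument values here are hypothetical):

DEFINE myFuncFoo myfunc.MyEvalfunc('foo');

DEFINE myFuncBar myfunc.MyEvalfunc('bar');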

Use DEFINE to specify a command when the streaming command specification is complex or requires additional parameters (input, output, and so on).

About Input and Output

Serialization is needed to convert data from tuples to a format that can be processed by the streaming application. Deserialization is needed to convert the output from the streaming application back into tuples.

PigStorage, the default serialization/deserialization function, converts tuples to tab-delimited lines. Pig's BinarySerializer and BinaryDeserializer functions treat the entire file as a byte stream (no formatting or interpretation takes place). You can also write your own serialization/deserialization functions.

About Ship

Use the ship option to send streaming binary and supporting files, if any, from the client node to the compute nodes. Pig does not automatically ship dependencies; it is your responsibility to explicitly specify all the dependencies and to make sure that the software the processing relies on (for instance, perl or python) is installed on the cluster. Supporting files are shipped to the task's current working directory and only relative paths should be specified. Any pre-installed binaries should be specified in the path.

Only files, not directories, can be specified with the ship option. One way to work around this limitation is to tar all the dependencies into a tar file that accurately reflects the structure needed on the compute nodes, then have a wrapper for your script that un-tars the dependencies prior to execution.
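A minimal sketch of this workaround, where the wrapper script mywrapper.sh and the archive mydeps.tar are hypothetical names (mywrapper.sh would un-tar mydeps.tar in the task's working directory and then invoke the actual binary):

DEFINE Y `mywrapper.sh` SHIP('/work/mywrapper.sh', '/work/mydeps.tar');

X = STREAM A THROUGH Y;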

Note that the ship option has two components: the source specification, provided in the ship clause, is the view of your machine; the command specification is the view of the cluster. The only guarantee is that the shipped files are available in the current working directory of the launched job and that your current working directory is also on the PATH environment variable.

Shipping files to relative paths or absolute paths is not supported since you might not have permission to read/write/execute from arbitrary paths on the cluster.

About Cache

The ship option works with binaries, jars, and small datasets. However, loading larger datasets at run time for every execution can severely impact performance. Instead, use the cache option to access large files already moved to and available on the compute nodes. Only files, not directories, can be specified with the cache option.

Example: Input/Output

In this example PigStorage is the default serialization/deserialization function. The tuples from relation A are converted to tab-delimited lines that are passed to the script.

X = STREAM A THROUGH `stream.pl`;

In this example PigStorage is used as the serialization/deserialization function, but a comma is used as the delimiter.

DEFINE Y `stream.pl` INPUT(stdin USING PigStorage(',')) OUTPUT (stdout USING PigStorage(','));

X = STREAM A THROUGH Y;

In this example user-defined serialization/deserialization functions are used with the script.

DEFINE Y `stream.pl` INPUT(stdin USING MySerializer) OUTPUT (stdout USING MyDeserializer);

X = STREAM A THROUGH Y;

Example: Ship/Cache

In this example ship is used to send the script to the cluster compute nodes.

DEFINE Y `stream.pl` SHIP('/work/stream.pl');

X = STREAM A THROUGH Y;

In this example cache is used to specify a file located on the cluster compute nodes.

DEFINE Y `stream.pl data.gz` SHIP('/work/stream.pl') CACHE('/input/data.gz#data.gz');

X = STREAM A THROUGH Y;

Example: Logging

In this example the streaming stderr is stored in the _logs/<dir> directory of the job's output directory. Because the job can have multiple streaming applications associated with it, you need to ensure that different directory names are used to avoid conflicts. Pig stores up to 100 tasks per streaming job.

DEFINE Y `stream.pl` stderr('<dir>' limit 100);

X = STREAM A THROUGH Y;

In this example a function is defined for use with the FOREACH … GENERATE operator.

grunt> REGISTER /src/myfunc.jar;

grunt> define myFunc myfunc.MyEvalfunc('foo');

grunt> A = LOAD 'students';

grunt> B = FOREACH A GENERATE myFunc($0);

In this example a command is defined for use with the STREAM operator.

grunt> A = LOAD 'data';

grunt> DEFINE cmd `stream_cmd -input file.dat`;

grunt> B = STREAM A THROUGH cmd;

REGISTER

Registers a JAR file so that the UDFs in the file can be used.

Syntax

REGISTER alias;

Terms

alias

The path of a Java JAR file. Do not place the name in quotes.

Usage

Use the REGISTER statement to specify the path of a Java JAR file containing UDFs.

For more information about UDFs, see the User Defined Function Guide. Note that Pig currently only supports functions written in Java.

Example

In this example REGISTER states that myfunc.jar is located in the /src directory.

/src $ java -jar pig.jar -

grunt> REGISTER /src/myfunc.jar;

grunt> A = LOAD 'students';

grunt> B = FOREACH A GENERATE myfunc.MyEvalFunc($0);

Built-in Functions: Eval Functions

AVG

Computes the average of the numeric values in a single-column bag.

Syntax

AVG(expression)

Terms

expression

Any expression whose result is a bag. The elements of the bag should be data type int, long, float, or double.

Usage

Use the AVG function to compute the average of the numeric values in a single-column bag.

Example

Suppose we have relation A.

A (name:chararray, session:chararray, gpa:float)

-----------------------------------------------------------------

(John,fl,3.9F)

(John,wt,3.7F)

(John,sp,4.0F)

(John,sm,3.8F)

(Mary,fl,3.8F)

(Mary,wt,3.9F)

(Mary,sp,4.0F)

(Mary,sm,4.0F)

In this example the average GPA is computed.

B = GROUP A by name;

C = FOREACH B GENERATE A.name, AVG(A.gpa);

Relation C looks like this.

({(John),(John),(John),(John)},3.850000023841858)

({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)

Types Tables

 

        int    long   float    double   chararray   bytearray

AVG     long   long   double   double   error       cast as double

CONCAT

Concatenates two fields of type chararray or two fields of type bytearray.

Syntax

CONCAT (expression, expression)

Terms

expression

An expression with data types chararray or bytearray.

Usage

Use the CONCAT function to concatenate two elements. The data type of the two elements must be the same, either chararray or bytearray.

Example

Suppose we have relation A.

A (f1:chararray, f2:chararray, f3:chararray)

---------------------------------------------------------

(apache,open,source)

(hadoop,map,reduce)

(pig,pig,latin)

In this example fields f2 and f3 are concatenated.

X = FOREACH A GENERATE CONCAT(f2,f3);

Relation X looks like this.

(opensource)

(mapreduce)

(piglatin)

Types Tables

 

             chararray    bytearray

chararray    chararray    cast as chararray

bytearray                 bytearray

COUNT

Computes the number of elements in a bag.

Syntax

COUNT(expression)                             

Terms

expression

An expression with data type bag.

Usage

Use the COUNT function to compute the number of elements in a bag.

Example

Suppose we have relation A.

A (f1:int, f2:bag)

----------------------------

(1,{(1,2,3)})

(4,{(4,2,1),(4,3,3)})

(7,{(7,2,5)})

(8,{(8,3,4),(8,4,3)})         

In this example the tuples in the bag are counted.

X = FOREACH A GENERATE COUNT(f2);

Relation X looks like this.

(1L)

(2L)

(1L)

(2L)

Types Tables

 

          int    long   float   double   chararray   bytearray

COUNT     long   long   long    long     long        long

DIFF

Compares the content of two bags.

Syntax

DIFF (expression, expression)

Terms

expression

An expression with any data type.

Usage

The DIFF function compares the content of two bags and returns a bag that contains the tuples that are not in both bags.

Example

Suppose we have relation A.

A (f1:chararray, f2:chararray, f3:chararray)

---------------------------------------------------------

(apache,open,source)

(hadoop,map,reduce)

(pig,pig,latin)

In this example fields f1 and f2 are compared.

X = FOREACH A GENERATE DIFF(f1,f2);

Relation X looks like this.

({(apache),(open)})

({(hadoop),(map)})

({})

MAX

Computes the maximum of the numeric values or chararrays in a single-column bag.

Syntax

MAX(expression)                                                                        

Terms

expression

An expression with data types int, long, float, double, or chararray.

Usage

Use the MAX function to compute the maximum of the numeric values or chararrays in a single-column bag.

Example

Suppose we have relation A.

A (name:chararray, session:chararray, gpa:float)

------------------------------------------------------------------

(John,fl,3.9F)

(John,wt,3.7F)

(John,sp,4.0F)

(John,sm,3.8F)

(Mary,fl,3.8F)

(Mary,wt,3.9F)

(Mary,sp,4.0F)

(Mary,sm,4.0F)

In this example the maximum GPA is computed.

B = GROUP A by name;

C = FOREACH B GENERATE A.name, MAX(A.gpa);

Types Tables

 

        int   long   float   double   chararray   bytearray

MAX     int   long   float   double   chararray   cast as double

MIN

Computes the minimum of the numeric values or chararrays in a single-column bag.

Syntax

MIN(expression)

Terms

expression

An expression with data types int, long, float, double, or chararray.

Usage

Use the MIN function to compute the minimum of a set of numeric values or chararrays in a single-column bag.

Example

Suppose we have relation A.

A (name:chararray, session:chararray, gpa:float)

------------------------------------------------------------------

(John,fl,3.9F)

(John,wt,3.7F)

(John,sp,4.0F)

(John,sm,3.8F)

(Mary,fl,3.8F)

(Mary,wt,3.9F)

(Mary,sp,4.0F)

(Mary,sm,4.0F)

In this example the minimum GPA is computed.

B = GROUP A by name;

C = FOREACH B GENERATE A.name, MIN(A.gpa);

Types Tables

 

        int   long   float   double   chararray   bytearray

MIN     int   long   float   double   chararray   cast as double

SIZE

Computes the number of data items based on the data type.

Syntax

SIZE(expression)

Terms

expression

An expression with any data type.

Usage

Use the SIZE function to compute the number of data items based on the data type (see the Types Tables below).

Example

Suppose we have relation A.

A (f1:chararray, f2:chararray, f3:chararray)

-----------------------------------------------------------

(apache,open,source)

(hadoop,map,reduce)

(pig,pig,latin)

In this example the number of characters in the first field of each tuple is counted.

X = FOREACH A GENERATE SIZE(f1);

Relation X looks like this.

(6L)

(6L)

(3L)

Types Tables

int          returns 1

long         returns 1

float        returns 1

double       returns 1

chararray    returns number of characters in the array

bytearray    returns number of bytes in the array

tuple        returns number of fields in the tuple

bag          returns number of tuples in bag

map          returns number of key/value pairs in map

SUM

Computes the sum of the numeric values in a single-column bag.

Syntax

SUM(expression)                        

Terms

expression

An expression with data type int, long, float, or double (bytearray is cast as double).

Usage

Use the SUM function to compute the sum of a set of numeric values in a single-column bag.

Example

Suppose we have relation A.

A (owner:chararray, pet_type:chararray, pet_num:int)

----------------------------------------------------------------

(Alice,turtle,1)

(Alice,goldfish,5)

(Alice,cat,2)

(Bob,dog,2)

(Bob,cat,2)

In this example the total number of pets for each owner is computed.

B = GROUP A by owner;

C = FOREACH B GENERATE A.owner, SUM(A.pet_num);

Types Tables

 

        int    long   float    double   chararray   bytearray

SUM     long   long   double   double   error       cast as double

TOKENIZE

Splits a string and outputs a bag of words.

Syntax

TOKENIZE(expression)                                                     

Terms

expression

An expression with data type chararray.

Usage

Use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in a single tuple). The following characters are considered to be word separators: space, double quote ("), comma (,), parenthesis (()), star (*).

Example

Suppose we have relation A.

A (f1:chararray)

-------------------------------------

(Here is the first string.)

(Here is the second string.)

(Here is the third string.)

In this example the strings in each row are split.

X = FOREACH A GENERATE TOKENIZE(f1);

Relation X looks like this.

({(Here),(is),(the),(first),(string.)})

({(Here),(is),(the),(second),(string.)})

({(Here),(is),(the),(third),(string.)})

Built-in Functions: Load/Store Functions

BinarySerializer

Converts a file to a byte stream.

Syntax

BinarySerializer()                                  

Terms

none

no parameters

Usage

Use the BinarySerializer with the DEFINE operator to convert a file to a byte stream. No formatting or interpretation takes place.

Example

In this example the BinarySerializer and BinaryDeserializer are used to convert data to and from streaming format.

DEFINE Y `stream.pl` INPUT(stdin USING BinarySerializer()) OUTPUT (stdout USING BinaryDeserializer());

X = STREAM A THROUGH Y;

BinaryDeserializer

Converts a byte stream into a file.

Syntax

BinaryDeserializer()

Terms

none

no parameters

Usage

Use the BinaryDeserializer with the DEFINE operator to convert a byte stream into a file. No formatting or interpretation takes place.

Example

In this example the BinarySerializer and BinaryDeserializer are used to convert data to and from streaming format.

DEFINE Y `stream.pl` INPUT(stdin USING BinarySerializer()) OUTPUT (stdout USING BinaryDeserializer());

X = STREAM A THROUGH Y;

BinStorage

Loads and stores data in machine-readable format.

Syntax

BinStorage()        

Terms

none

no parameters

Usage

BinStorage works with data that is represented on disk in machine-readable format.

BinStorage is used internally by Pig to store the temporary data that is created between multiple map/reduce jobs.

Example

In this example BinStorage is used with the LOAD and STORE functions.

A = LOAD 'data' USING BinStorage();

STORE X INTO 'output' USING BinStorage();

PigStorage

Loads and stores data in UTF-8 format.

Syntax

PigStorage(field_delimiter)                                       

Terms

field_delimiter

Parameter.

The default field delimiter is tab ('\t').

You can specify other characters as field delimiters.

Usage

PigStorage works with structured text files in human-readable UTF-8 format.

For load statements, PigStorage expects delimiter-separated fields and newline-separated records. If you use the default field delimiter, tab ('\t'), it does not need to be specified. For store statements, PigStorage outputs data as delimiter-separated fields and newline-separated records. If you use the default field delimiter, tab ('\t'), it does not need to be specified.

PigStorage works with both simple and complex data types and is the default function for the LOAD and STORE operators (which means you can choose to include or not include the function in the load statement).

Example

In this example PigStorage expects student_data to contain tab-separated fields and newline-separated records. The statements are equivalent.

A = LOAD 'student_data' USING PigStorage('\t') AS (name: chararray, age:int, gpa: float);

A = LOAD 'student_data' AS (name: chararray, age:int, gpa: float);

In this example PigStorage stores the contents of X into files with fields that are delimited with an asterisk ( * ). The STORE statement specifies that the files will be located in a directory named output and that the files will be named part-nnnnn (for example, part-00000).

STORE X INTO 'output' USING PigStorage('*');

PigDump

Stores data in UTF-8 format.

Syntax

PigDump()                                                                       

Terms

none

no parameters

Usage

PigDump stores data as tuples in human-readable UTF-8 format.

Example

In this example PigDump is used with the STORE function.

STORE X INTO 'output' USING PigDump();

TextLoader

Loads unstructured data in UTF-8 format.

Syntax

TextLoader()

Terms

none

no parameters

Usage

TextLoader works with unstructured data in UTF-8 format. Each resulting tuple contains a single field with one line of input text. TextLoader cannot be used to store data.

Example

In this example TextLoader is used with the LOAD function.

A = LOAD 'data' USING TextLoader();

File Commands

cat

Prints the content of one or more files to the screen.

Syntax

cat path [ path …]

Terms

path

The location of a file or directory.

Usage

The cat command is similar to the Unix cat command. If multiple files are specified, content from all files is concatenated together. If multiple directories are specified, content from all files in all directories is concatenated together.

Example

In this example the students file in the data directory is printed.

grunt> cat data/students

joe smith

john adams

anne white

grunt>
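Since multiple paths are accepted, several files can be printed with one command. A sketch, assuming a second file, data/faculty, also exists:

grunt> cat data/students data/faculty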

cd

Changes the current directory to another directory.

Syntax

cd [dir]

Terms

dir

The name of the directory you want to navigate to.

Usage

The cd command is similar to the Unix cd command and can be used to navigate the file system. If a directory is specified, this directory is made your current working directory and all other operations happen relative to this directory. If no directory is specified, your home directory (/user/NAME) becomes the current working directory.

Example

In this example we navigate to the /data directory.

grunt> cd /data

grunt>

copyFromLocal

Copies a file or directory from the local file system to HDFS.

Syntax

copyFromLocal src_path dst_path

Terms

src_path

The path on the local file system for a file or directory.

dst_path

The path on HDFS.

Usage

The copyFromLocal command enables you to copy a file or a directory from the local file system to the Hadoop Distributed File System (HDFS). If a directory is specified, it is recursively copied over. Dot "." can be used to specify that the new file/directory should be created in the current working directory and retain the name of the source file/directory.
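For instance, a minimal sketch of the dot notation (assuming the same local students file used in the example below):

grunt> copyFromLocal /data/students .

This copies /data/students into the current HDFS working directory under the name students.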

Example

In this example a file (students) and a directory (/data/tests) are copied from the local file system to HDFS.

grunt> copyFromLocal /data/students students

grunt> ls students

/data/students <r 3> 8270

grunt> copyFromLocal /data/tests new_test

grunt> ls new_test

/data/new_test/test1.data<r 3>   664

/data/new_test/test2.data<r 3>    344

/data/new_test/more_data     

copyToLocal

Copies a file or directory from HDFS to a local file system.

Syntax

copyToLocal src_path dst_path

Terms

src_path

The path on HDFS.

dst_path

The path on the local file system for a file or directory.

Usage

The copyToLocal command enables you to copy a file or a directory from the Hadoop Distributed File System (HDFS) to a local file system. If a directory is specified, it is recursively copied over. Dot "." can be used to specify that the new file/directory should be created in the current working directory (directory from which the script was executed or grunt shell started) and retain the name of the source file/directory.

Example

In this example two files are copied from HDFS to the local file system.

grunt> copyToLocal students /data

grunt> copyToLocal data /data/mydata

cp

Copies a file or directory within HDFS.

Syntax

cp src_path dst_path                 

Terms

src_path

The path on HDFS.

dst_path

The path on HDFS.

Usage

The cp command is similar to the Unix cp command and enables you to copy files or directories within HDFS. If a directory is specified, it is recursively copied over. Dot "." can be used to specify that the new file/directory should be created in the current working directory and retain the name of the source file/directory.

Example

In this example a file (students) is copied to another file (students_save).

grunt> cp students students_save

ls

Lists the contents of a directory.

Syntax

ls [path]              

Terms

path

The name of the path/directory.

Usage

The ls command is similar to the Unix ls command and enables you to list the contents of a directory. If a path is specified, the command lists the content of the specified directory. Otherwise, the content of the current working directory is listed.

Example

In this example the contents of the data directory are listed.

grunt> ls /data

/data/DDLs  <dir>

/data/count <dir>

/data/data  <dir>

/data/schema <dir>

grunt>

mkdir

Creates a new directory.

Syntax

mkdir path                                                      

Terms

path

The name of the path/directory.

Usage

The mkdir command is similar to the Unix mkdir command and enables you to create a new directory. If you specify a directory or path that does not exist, it will be created.

Example

In this example a directory and subdirectory are created.

grunt> mkdir data/20070905

mv

Moves a file or directory within the Hadoop Distributed File System (HDFS).

Syntax

mv src_path dst_path

Terms

src_path

The path on HDFS.

dst_path

The path on HDFS.

Usage

The mv command is similar to the Unix mv command and moves a file or directory within HDFS: the source file or directory is deleted as soon as it has been copied to the destination.

If a directory is specified, it is recursively moved. Dot "." can be used to specify that the new file/directory should be created in the current working directory and retain the name of the source file/directory.  

Example

In this example the output directory is moved to output2; the source directory, output, no longer exists.

grunt> mv output output2

grunt> ls output

File or directory output does not exist.

grunt> ls output2

/data/output2/map-000000<r 3>     508844

/data/output2/output3     <dir>

/data/output2/part-00000<r 3>     0

pwd

Prints the name of the current working directory.

Syntax

pwd

Terms

none

no parameters

Usage

The pwd command is identical to the Unix pwd command and prints the name of the current working directory.

Example

In this example the name of the current working directory is /data.

grunt> pwd

/data

grunt>

rm

Removes one or more files or directories.

Syntax

rm path [path…]                       

Terms

path

The name of the path/directory/file.

Usage

The rm command is similar to the Unix rm command and enables you to remove one or more files or directories.

Note: This command recursively removes a directory even if it is not empty. It does not ask for confirmation, and the removed data is not recoverable.

Example

In this example files are removed, first by absolute path and then by relative path.

grunt> rm /data/students

grunt> rm students students_sav

rmf

Forcibly removes one or more files or directories.

Syntax

rmf path [path …]             

Terms

path

The name of the path/directory/file.

Usage

The rmf command is similar to the Unix rm -f command and enables you to forcibly remove one or more files or directories.

Note: This command recursively removes a directory even if it is not empty. It does not ask for confirmation, and the removed data is not recoverable.

Example

In this example files are forcibly removed, first by absolute path and then by relative path.

grunt> rmf /data/students

grunt> rmf students students_sav

Utility Commands

help

Prints a list of Pig commands.

Syntax

help                                                          

Terms

none

no parameters

Usage

The help command prints a list of Pig commands.

Example

In this example the help command prints the list of Pig commands.

grunt> help

Commands:

<pig latin statement>;

store <alias> into <filename> [using <functionSpec>]

dump <alias>

etc ….

kill

Kills a job.

Syntax

kill jobid

Terms

jobid

The job id.

Usage

The kill command enables you to kill a job based on a job id.

Example

In this example the job with id job_0001 is killed.

grunt> kill job_0001

quit

Quits from the Pig grunt shell.

Syntax

quit

Terms

none

no parameters

Usage

The quit command enables you to quit or exit the Pig grunt shell.

Example

In this example the quit command exits the Pig grunt shell.

grunt> quit

set

Assigns values to keys used in Pig.

Syntax

set key 'value'

Terms

key

Key (see table). Case sensitive.

value

Value for key (see table). Case sensitive.

Usage

The set command enables you to assign values to keys, as shown here:

Key         Value                                          Description

debug       on/off                                         enables/disables debug-level logging

job.name    single-quoted string that contains the name    sets user-specified name for the job

Example

In this example debug is set on and the job is assigned a name.

grunt> set debug on

grunt> set job.name 'my job'