Pig Setup

Overview
- Requirements
Beginning Pig
Advanced Pig
Sample Code

Overview

Requirements

Unix and Windows users need the following:

Hadoop 0.20.2 - http://hadoop.apache.org/common/releases.html
Java 1.6 - http://java.sun.com/javase/downloads/index.jsp (set JAVA_HOME to the root of your Java installation)
Ant 1.7 - http://ant.apache.org/ (optional, for builds)
JUnit 4.5 - http://junit.sourceforge.net/ (optional, for unit tests)

Windows users need to install Cygwin and the Perl package: http://www.cygwin.com/
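
Before installing, it can help to confirm that JAVA_HOME is set as required above. The following is a minimal sketch, not part of the Pig distribution: the function name check_java_home is hypothetical, and it only checks that JAVA_HOME points to a directory containing an executable bin/java.

```shell
# Hypothetical helper: succeed only if JAVA_HOME is set and contains bin/java.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "No java executable under $JAVA_HOME/bin" >&2
        return 1
    fi
    echo "JAVA_HOME looks usable: $JAVA_HOME"
}
```

Calling check_java_home from the shell prints a short message and returns a non-zero status if the variable is missing or points to the wrong place.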

Beginning Pig

Download Pig

To get a Pig distribution, download a recent stable release from one of the Apache Download Mirrors (see Pig Releases).

Unpack the downloaded Pig distribution. The Pig script is located in the bin directory (/pig-n.n.n/bin/pig).

Add /pig-n.n.n/bin to your path. Use export (bash,sh,ksh) or setenv (tcsh,csh). For example:

$ export PATH=/<my-path-to-pig>/pig-n.n.n/bin:$PATH
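
If the export line is placed in a shell startup file, it may be run more than once and add the same entry repeatedly. One way to avoid that, sketched here for bash, sh, or ksh, is a small helper that prepends a directory only when it is not already listed. The name prepend_path is hypothetical and is not something Pig provides.

```shell
# Hypothetical helper: prepend a directory to PATH unless it is already listed.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;             # already present, leave PATH unchanged
        *) PATH="$1:$PATH" ;;    # otherwise put the directory first
    esac
    export PATH
}

# Usage (replace the argument with your own Pig bin directory):
#   prepend_path /my-path-to-pig/pig-n.n.n/bin
```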

Try the following command, to get a list of Pig commands:

$ pig -help

Try the following command, to start the Grunt shell:

$ pig

Run Modes

Pig has two run modes or exectypes:

Local Mode - To run Pig in local mode, you need access to a single machine.
Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Pig will automatically allocate and deallocate a 15-node cluster.

You can run the Grunt shell, Pig scripts, or embedded programs using either mode.

Grunt Shell

Use Pig's interactive shell, Grunt, to enter Pig commands manually. See the Sample Code for instructions about the passwd file used in the example.

You can also run or execute script files from the Grunt shell. See the run and exec commands.

Local Mode

$ pig -x local

Mapreduce Mode

$ pig
or
$ pig -x mapreduce

For either mode, the Grunt shell is invoked and you can enter commands at the prompt. The results are displayed to your terminal screen (if DUMP is used) or to a file (if STORE is used).

grunt> A = load 'passwd' using PigStorage(':'); 
grunt> B = foreach A generate $0 as id; 
grunt> dump B; 
grunt> store B into 'id.out';

Script Files

Use script files to run Pig commands as batch jobs. See the Sample Code for instructions about the passwd file and the script file (id.pig) used in the example.

Local Mode

$ pig -x local id.pig

Mapreduce Mode

$ pig id.pig
or
$ pig -x mapreduce id.pig

For either mode, the Pig Latin statements are executed and the results are displayed to your terminal screen (if DUMP is used) or to a file (if STORE is used).

Advanced Pig

Build Pig

To build Pig, do the following:

Check out the Pig code from SVN: svn co http://svn.apache.org/repos/asf/hadoop/pig/trunk
Build the code from the top directory: ant. If the build is successful, you should see pig.jar created in that directory.
Validate your pig.jar by running the unit tests: ant test

Environment Variables and Properties

See Download Pig.

The Pig environment variables are described in the Pig script file, located in the /pig-n.n.n/bin directory.

The Pig properties file, pig.properties, is located in the /pig-n.n.n/conf directory. You can specify an alternate location using the PIG_CONF_DIR environment variable.
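
As an illustration, an alternate configuration directory could be set like this before starting Pig. The directory $HOME/pig-conf is an assumed example path, not a location Pig defines, and the check only reports whether a pig.properties file is present there.

```shell
# Assumed example location; substitute the directory that holds your pig.properties.
PIG_CONF_DIR="$HOME/pig-conf"
export PIG_CONF_DIR

# Report whether the properties file exists in that directory.
if [ -f "$PIG_CONF_DIR/pig.properties" ]; then
    echo "Using $PIG_CONF_DIR/pig.properties"
else
    echo "No pig.properties found in $PIG_CONF_DIR" >&2
fi
```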

Run Modes

See Run Modes.

Embedded Programs

Use the embedded option to embed Pig commands in a host language and run the program. See the Sample Code for instructions about the passwd file and Java files (idlocal.java, idmapreduce.java) used in the examples.

Local Mode

From your current working directory, compile the program:

$ javac -cp pig.jar idlocal.java

Note: idlocal.class is written to your current working directory. Include “.” in the class path when you run the program.

From your current working directory, run the program:

Unix:   $ java -cp pig.jar:. idlocal
Cygwin: $ java -cp '.;pig.jar' idlocal

To view the results, check the output file, id.out.

Mapreduce Mode

Point $HADOOPDIR to the directory that contains the hadoop-site.xml file. Example:

$ export HADOOPDIR=/yourHADOOPsite/conf

From your current working directory, compile the program:

$ javac -cp pig.jar idmapreduce.java

Note: idmapreduce.class is written to your current working directory. Include “.” in the class path when you run the program.

From your current working directory, run the program:

Unix:   $ java -cp pig.jar:.:$HADOOPDIR idmapreduce
Cygwin: $ java -cp ".;pig.jar;$HADOOPDIR" idmapreduce

To view the results, check the idout directory on your Hadoop system.

Sample Code

The sample code is based on Pig Latin statements that extract all user IDs from the /etc/passwd file.

Copy the /etc/passwd file to your local working directory.
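
To sanity-check what Pig stores, the same extraction can be approximated in the shell. This is a sketch for comparison only, not how Pig works internally: the function name first_field is hypothetical, and it prints the first colon-delimited field of each line, which corresponds to $0 in the Pig Latin statements.

```shell
# Hypothetical helper: print the first colon-delimited field of each line,
# the shell counterpart of "foreach A generate $0".
first_field() {
    cut -d: -f1 "$1"
}
```

After copying the file with cp /etc/passwd passwd, running first_field passwd lists the user IDs that the stored Pig output is expected to contain.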

id.pig

For the Grunt Shell and script files.

A = load 'passwd' using PigStorage(':'); 
B = foreach A generate $0 as id;
dump B; 
store B into 'id.out';

idlocal.java

For embedded programs.

import java.io.IOException;
import org.apache.pig.PigServer;

public class idlocal {
    public static void main(String[] args) {
        try {
            // Run Pig in local mode (single machine).
            PigServer pigServer = new PigServer("local");
            runIdQuery(pigServer, "passwd");
        } catch (Exception e) {
            // Report the failure instead of silently ignoring it.
            e.printStackTrace();
        }
    }

    public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
        // Load the colon-delimited input file.
        pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage(':');");
        // Keep only the first field (the user ID).
        pigServer.registerQuery("B = foreach A generate $0 as id;");
        // Write the result to id.out.
        pigServer.store("B", "id.out");
    }
}

idmapreduce.java

For embedded programs.

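import java.io.IOException;
import org.apache.pig.PigServer;

public class idmapreduce {
    public static void main(String[] args) {
        try {
            // Run Pig in mapreduce mode (requires a Hadoop cluster and HDFS).
            PigServer pigServer = new PigServer("mapreduce");
            runIdQuery(pigServer, "passwd");
        } catch (Exception e) {
            // Report the failure instead of silently ignoring it.
            e.printStackTrace();
        }
    }

    public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
        // Load the colon-delimited input file.
        pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage(':');");
        // Keep only the first field (the user ID).
        pigServer.registerQuery("B = foreach A generate $0 as id;");
        // Write the result to the idout directory.
        pigServer.store("B", "idout");
    }
}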