Thursday, February 12, 2015

Ubuntu - Install Pig

Pig 
Pig is a framework that translates programs written in Pig Latin into jobs that are executed by the MapReduce framework. Pig does not provide any functionality that isn't provided by MapReduce, but it makes some types of data operations significantly easier to perform.

Pig can run in two modes:
a) Local mode
b) Distributed/MapReduce mode (cluster mode)

Pig converts every transformation into a MapReduce job, so the developer can focus on data scripting instead of writing a complex set of MR programs.

The '-x local' option starts Pig in local mode, whereas running the pig command without any options starts Pig in cluster mode. In local mode, Pig accesses files on the local file system. In cluster mode, Pig accesses files on HDFS.

Pig programs are executed as MapReduce jobs via the Pig interpreter.

1) Download Pig

2) Switch to the root user: sudo su

3) cd /usr/local; cp /home//Download/pig-0.14.0.tar.gz /usr/local/

4) tar -xvzf pig-0.14.0.tar.gz

5) chown -R hduser:hadoop pig-0.14.0 (this assumes you followed "Install Hadoop" and already have the hduser user)

6) Add the following lines to /etc/profile, or switch to hduser (su hduser) and add them to ~/.bashrc (vi ~/.bashrc):
export PIG_HOME=/usr/local/pig-0.14.0
export PATH=$PATH:$HADOOP_HOME/bin:$PIG_HOME/bin
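
To confirm that the exports took effect, a quick check like the one below can be run in a new shell. This is only a sketch: it assumes the install path used in this post, leaves out the Hadoop entry, and the variable name path_status is just for illustration.

```shell
# Same PIG_HOME as in the step above; adjust if Pig lives elsewhere.
export PIG_HOME=/usr/local/pig-0.14.0
export PATH="$PATH:$PIG_HOME/bin"

# Report whether $PIG_HOME/bin is one of the PATH entries.
case ":$PATH:" in
  *":$PIG_HOME/bin:"*) path_status="PIG_HOME/bin is on PATH" ;;
  *)                   path_status="PIG_HOME/bin is missing from PATH" ;;
esac
echo "$path_status"

# Report whether the shell can actually resolve the pig launcher.
if command -v pig >/dev/null 2>&1; then
  echo "pig found at: $(command -v pig)"
else
  echo "pig not found; check that $PIG_HOME/bin exists"
fi
```

If the second message says pig was not found, the archive was probably extracted to a different directory than PIG_HOME points to.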


7) Update your pig.properties file: cd $PIG_HOME/conf; vi pig.properties. Add the lines below.
   fs.default.name=hdfs://localhost:9090 (the host and port where HDFS is running)
   mapred.job.tracker=localhost:8021 (the host and port where the MapReduce JobTracker is running)

Note: you can find the value of fs.default.name in "$HADOOP_HOME/conf/core-site.xml" and the value of mapred.job.tracker in "$HADOOP_HOME/conf/mapred-site.xml".

8) Create /home/hduser/samples/user.txt on your local file system (vi /home/hduser/samples/user.txt) and add the content below:

1,John,Montgomery,Alabama,US
2,David,Phoenix,Arizona,US
3,Sarah,Sacramento,California,US
4,Anoop,Montgomery,Alabama,US
5,Gubs,Villupuram,TamilNadu,India
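
If you prefer not to type the records in vi, the file can also be created from the shell. The sketch below assumes you are logged in as hduser, so that $HOME/samples resolves to /home/hduser/samples; SAMPLE_DIR and bad_lines are names made up for this example.

```shell
# Target directory: for hduser, $HOME/samples is /home/hduser/samples.
SAMPLE_DIR="${SAMPLE_DIR:-${HOME:-/tmp}/samples}"
mkdir -p "$SAMPLE_DIR"

# Write the five sample records; the quoted 'EOF' prevents any shell expansion.
cat > "$SAMPLE_DIR/user.txt" <<'EOF'
1,John,Montgomery,Alabama,US
2,David,Phoenix,Arizona,US
3,Sarah,Sacramento,California,US
4,Anoop,Montgomery,Alabama,US
5,Gubs,Villupuram,TamilNadu,India
EOF

# Sanity check: count lines that do not have exactly 5 comma-separated fields.
bad_lines=$(awk -F',' 'NF != 5 { bad++ } END { print bad + 0 }' "$SAMPLE_DIR/user.txt")
echo "malformed lines: $bad_lines"
```

A malformed line would later show up in Pig as a record with missing (null) fields, so it is cheaper to catch it here.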


9) Start Pig

Pig can run in two modes; the mode can be configured in the pig.properties file in the conf directory of the Pig installation.

Start local mode with the following command:

$pig -x local

10) Load the file into a relation.
The part to the left of "=" in the statement below is called the relation or alias. It looks like a variable, but it is not one. When this statement is executed, no MapReduce job runs; Pig only runs a job when it reaches a DUMP or STORE. Since the records in our dataset have comma-separated fields, we use USING PigStorage(',').
The file must be available on your local disk.

The chararray type is equivalent to a Java String.

grunt>user_record = LOAD '/home/hduser/samples/user.txt' USING PigStorage(',') AS (id:INT,name:chararray,city:chararray,state:chararray,country:chararray);


11) Print the contents of the alias:
grunt>DUMP user_record;

12) Group the records by state. This is the first step toward answering the question below.

Problem : How many people belong to each state?

grunt>state_record = Group user_record BY state;

13) Count the records in each group:
grunt>output_record = FOREACH state_record GENERATE group, COUNT(user_record.state);

14) Print the result:
grunt>DUMP output_record;
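
With a file this small, the expected answer can be cross-checked outside Pig. The sketch below is not Pig; it uses awk and sort as a stand-in for the GROUP and COUNT steps, with the sample records inlined so that it runs on its own. The names data_file and counts are made up for this example.

```shell
# Inline copy of the sample records, written to a temporary file.
data_file=$(mktemp)
cat > "$data_file" <<'EOF'
1,John,Montgomery,Alabama,US
2,David,Phoenix,Arizona,US
3,Sarah,Sacramento,California,US
4,Anoop,Montgomery,Alabama,US
5,Gubs,Villupuram,TamilNadu,India
EOF

# Field 4 is the state: tally one count per state, then sort for stable output.
counts=$(awk -F',' '{ n[$4]++ } END { for (s in n) print s "," n[s] }' "$data_file" | LC_ALL=C sort)
echo "$counts"
# Alabama,2
# Arizona,1
# California,1
# TamilNadu,1

rm -f "$data_file"
```

Pig's DUMP prints the same pairs as tuples, for example (Alabama,2), although the order of the groups is not guaranteed.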

15) Store the output in a file:
grunt>store output_record into '/user/hduser/people_belong_to_state';


DESCRIBE :
Prints the schema of an alias:
grunt>DESCRIBE user_record;

Note: you can edit or uncomment entries in $PIG_HOME/conf/pig.properties to set the logger configuration path (log4jconf) and the error log path (pig.logfile).


ILLUSTRATE:
To view the step-by-step execution of a sequence of statements, use the ILLUSTRATE command (useful for debugging):
grunt>ILLUSTRATE output_record;


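LIMIT :
Returns at most the given number of records from an alias. To get the top records, apply ORDER ... DESC to the alias first.
grunt>top_5_records = LIMIT output_record 5;
grunt>DUMP top_5_records;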

DISTINCT :

grunt>distinct_records = DISTINCT output_record;
grunt>DUMP distinct_records;

SAMPLE:
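grunt>sample_record = SAMPLE user_record 0.1;
grunt>DUMP sample_record;

Here, 0.1 means that roughly 10% of the records in user_record are returned. Sampling is probabilistic, so the exact number of records varies from run to run.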

GROUP  :
grunt> group_by_state = GROUP user_record by state;
grunt>DUMP group_by_state;

Pig has three complex types: tuple, bag, and map.

The output above is printed using bags and tuples. A tuple looks like (record1). A bag is an unordered collection of tuples and looks like {(record1),(record2)}.
A map looks like ['key'#'value'].




ORDER :
grunt>state_order_by = ORDER user_record by state DESC;
grunt>DUMP state_order_by;
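
As a rough cross-check of what ORDER ... BY state DESC should produce, the same sample records can be sorted on the state column outside Pig. This is a sketch with sort, not Pig itself; data_file and sorted_states are names made up for this example, and the relative order of the two Alabama records is not something either tool promises.

```shell
# Inline copy of the sample records, written to a temporary file.
data_file=$(mktemp)
cat > "$data_file" <<'EOF'
1,John,Montgomery,Alabama,US
2,David,Phoenix,Arizona,US
3,Sarah,Sacramento,California,US
4,Anoop,Montgomery,Alabama,US
5,Gubs,Villupuram,TamilNadu,India
EOF

# Sort on field 4 (state) only, in reverse (descending) order, and print the records.
LC_ALL=C sort -t',' -k4,4r "$data_file"

# Keep just the state column to make the ordering easy to read.
sorted_states=$(LC_ALL=C sort -t',' -k4,4r "$data_file" | cut -d',' -f4)
echo "$sorted_states"
# TamilNadu
# California
# Arizona
# Alabama
# Alabama

rm -f "$data_file"
```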

Most of these statements resemble SQL statements in a relational database. Between the keyword (such as GROUP or ORDER) and BY, you pass the alias name.

Reverse history search (reverse-i-search, Ctrl + R) works at the grunt> prompt, so you can recall statements typed earlier in the Pig session.


 


