Pig is a framework that translates programs written in Pig Latin into jobs that are executed by the MapReduce framework. Pig does not provide any functionality that isn't provided by MapReduce, but it makes some types of data operations significantly easier to perform.
a) Local mode
b) Distributed/MapReduce mode (cluster mode)
Pig converts each transformation into a MapReduce job so that the developer can focus on data scripting instead of writing a complex set of MR programs.
Using the '-x local' option starts Pig in local mode, whereas executing the pig command without any options starts Pig in cluster mode. In local mode, Pig can access files on the local file system; in cluster mode, Pig can access files on HDFS.
Pig programs are executed as MapReduce jobs via the Pig interpreter.
1) Download Pig
2) Switch user to root: sudo su
3) cd /usr/local; cp /home/
4) tar -xvzf pig-0.14.0.tar.gz
5) chown -R hduser:hadoop pig-0.14.0 (this assumes you followed "Install Hadoop" and created the user hduser)
6) Update the PATH in /etc/profile, or: su hduser; vi ~/.bashrc
export PIG_HOME=/usr/local/pig-0.14.0
export PATH=$PATH:$HADOOP_HOME/bin:$PIG_HOME/bin
7) Update your pig.properties file: cd $PIG_HOME/conf; vi pig.properties. Add the lines below.
fs.default.name=hdfs://localhost:9090 (the port where HDFS is running)
mapred.job.tracker=localhost:8021 (the port where the MapReduce JobTracker is running)
Note: find your fs.default.name value in "$HADOOP_HOME/conf/core-site.xml" and mapred.job.tracker in "$HADOOP_HOME/conf/mapred-site.xml"
8) Create /home/hduser/samples/user.txt (vi /home/hduser/samples/user.txt) in your local file system and add the content below:
1,John,Montgomery,Alabama,US
2,David,Phoenix,Arizona,US
3,Sarah,Sacramento,California,US
4,Anoop,Montgomery,Alabama,US
5,Gubs,Villupuram,TamilNadu,India
9) Start Pig
There are two modes in which to run Pig; these can be configured in the pig.properties file in the conf directory of the Pig installation.
Start local mode using the following command:
$pig -x local
10) Load the data. Note that Pig's chararray type is equivalent to a Java String.
11) The final command groups the records by state, which gives the desired output.
12) Dump the relation to output the records.
13) To describe the schema:
grunt>DESCRIBE user_record;
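The load-and-group steps above can be sketched in the grunt shell as follows. The relation names user_record and group_by_state and the five-column schema are taken from this walkthrough; the exact file path assumes the user.txt created earlier:

```pig
-- Load the sample file from the local file system (local mode);
-- comma-delimited, matching the user.txt created in step 8.
user_record = LOAD '/home/hduser/samples/user.txt'
              USING PigStorage(',')
              AS (id:int, name:chararray, city:chararray,
                  state:chararray, country:chararray);

-- Group the records by state.
group_by_state = GROUP user_record BY state;

-- Write the grouped records to the console.
DUMP group_by_state;
```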
You can modify/uncomment $PIG_HOME/conf/pig.properties to update the logger configuration path (log4jconf) and the error log path (pig.logfile).
To view the step-by-step execution of a sequence of statements, you can use the ILLUSTRATE command (useful for debugging):
grunt>ILLUSTRATE output_record;
grunt>top_5_records = LIMIT output_record 5;
grunt>DUMP top_5_records;
grunt>distinct_records = DISTINCT output_record;
grunt>DUMP distinct_records;
SAMPLE:
grunt>sample_record = SAMPLE user_record 0.1;
grunt>DUMP sample_record;
Here, 0.1 gives roughly a 10% sample of your total user_record.
GROUP :
grunt>group_by_state = GROUP user_record BY state;
grunt>DUMP group_by_state;
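Assuming user_record was loaded with the (id, name, city, state, country) schema from the sample file above, the grouped dump would look roughly like the following (row order, and tuple order within each bag, may vary):

```pig
-- DUMP group_by_state; -- one (group, bag-of-tuples) pair per state:
-- (Alabama,{(1,John,Montgomery,Alabama,US),(4,Anoop,Montgomery,Alabama,US)})
-- (Arizona,{(2,David,Phoenix,Arizona,US)})
-- (California,{(3,Sarah,Sacramento,California,US)})
-- (TamilNadu,{(5,Gubs,Villupuram,TamilNadu,India)})
```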
Pig has 3 complex types: tuples, bags, and maps.
The output prints complex types accordingly: a tuple prints as (record1); a bag is an unordered collection of tuples and prints as {(record1),(record2)}; a map prints as ['key'#'value'].
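As an illustration of the three complex types, the FOREACH below builds one of each from user_record using Pig's built-in TOTUPLE, TOBAG, and TOMAP functions (this assumes user_record was loaded with the name/city/state/country columns shown earlier; the printed forms are indicative):

```pig
-- Build one value of each complex type per input record.
complex_types = FOREACH user_record GENERATE
    TOTUPLE(name, city)   AS t,   -- tuple, e.g. (John,Montgomery)
    TOBAG(state, country) AS b,   -- bag, e.g. {(Alabama),(US)}
    TOMAP('name', name)   AS m;   -- map, e.g. [name#John]
DUMP complex_types;
```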
ORDER :
grunt>state_order_by = ORDER user_record by state DESC;
grunt>DUMP state_order_by;
Most of these statements resemble SQL statements in a relational database. Between the keyword (GROUP, ORDER, etc.) and BY, you pass the relation's alias name.
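The same keyword-alias-BY pattern applies to other relational operators as well. For example, FILTER (not covered above; roughly SQL's WHERE) follows it too, assuming the user_record relation from earlier:

```pig
-- keyword (FILTER), then the relation alias, then BY and a condition
alabama_users = FILTER user_record BY state == 'Alabama';
DUMP alabama_users;
```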