Tuesday, January 27, 2015

Install Hadoop in a Single Node (Linux / Ubuntu)

What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

You can group many small commodity machines into a cluster and process / analyse your data across that cluster instead of processing it on a single system.

Prerequisites:
Java 1.6+ (Recommended : Oracle Java)

Update .bashrc or /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.6.0_25
export JRE_HOME=$JAVA_HOME/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
Note : Make sure JAVA_HOME is set in /etc/profile so that Java is available to all users on the machine.
To Check the java version :
$ java -version
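To confirm the variables are picked up in a new shell (a small optional check; adjust the path if your JDK lives elsewhere):
$ echo $JAVA_HOME
$ $JAVA_HOME/bin/java -version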
Create a dedicated group and user for Hadoop (a best practice)
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Install and configure ssh & rsync, tools used by the Hadoop Distributed File System (HDFS)
$ sudo apt-get install ssh
$ sudo apt-get install rsync
Note : Make sure sshd is running on your machine
$ ps -ef | grep sshd
$ /etc/init.d/ssh start

Create an SSH key for hduser
$ su - hduser
hduser@laptop: ssh-keygen -t rsa -P ""
Note : Create the RSA key with an empty passphrase (-P "") so that Hadoop can connect over SSH without prompting for a password

Authorize the SSH key so that the HDFS daemons can connect without prompting for a password each time
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
hduser$ ssh localhost
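If you prefer a non-interactive check that key-based login works (an optional sketch; BatchMode makes ssh fail instead of prompting for a password):
hduser$ ssh -o BatchMode=yes localhost echo "passwordless login OK"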

Hadoop Installation
Download the stable Apache Hadoop release (1.2.1 is used here)
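If you have not downloaded the tarball yet, one way to fetch the 1.2.1 release into /usr/local is sketched below; the archive.apache.org path is an assumption, so verify it against the current Apache mirror list:
$ sudo wget -P /usr/local https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz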
$ cd /usr/local
$ sudo tar -xvzf hadoop-1.2.1.tar.gz
$ sudo mv hadoop-1.2.1 hadoop
$ sudo chown -R hduser:hadoop hadoop

Configure Hadoop  
Export HADOOP_HOME and add the Hadoop bin directory to the PATH in /etc/profile or in hduser's .bashrc. Make sure JAVA_HOME is also configured.
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
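For example, to append these lines to hduser's .bashrc and confirm the hadoop command is on the PATH (a minimal sketch; hadoop version simply prints the build information):
hduser$ echo 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.bashrc
hduser$ echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.bashrc
hduser$ source ~/.bashrc
hduser$ hadoop version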

Update JAVA_HOME in hadoop-env.sh
$ vi /usr/local/hadoop/conf/hadoop-env.sh 
export JAVA_HOME=/usr/local/java/jdk1.6.0_25

Create the directory for hadoop.tmp.dir (the base directory where Hadoop stores its data files)
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
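To confirm the ownership and permissions were applied (a small optional check):
$ ls -ld /app/hadoop/tmp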

Add the properties below inside the <configuration> tags of $HADOOP_HOME/conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system.</description>
</property>

Add the property below inside the <configuration> tags of $HADOOP_HOME/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task
</description>
</property>

Format the NameNode before you start the daemons
$HADOOP_HOME/bin/hadoop namenode -format
Note : Run this only on the local environment where you are installing Hadoop, as this command formats / deletes all data in the Hadoop distributed file system. It formats and creates the HDFS directory based on the dfs.name.dir variable declared in $HADOOP_HOME/src/hdfs/hdfs-default.xml.

Starting your single-node cluster
hduser$ $HADOOP_HOME/bin/start-all.sh
Note : The above command will start the NameNode, DataNode, JobTracker and TaskTracker daemons (plus the SecondaryNameNode)

Check the Java processes to verify that the daemons have started, and check the listening ports
$ jps
$ netstat -plten | grep java
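For example, to check only the ports configured and listed in this guide (a small sketch using the HDFS, JobTracker and web UI ports from the steps above):
$ netstat -plten | grep -E '54310|54311|50070|50030|50060'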
Note : Hadoop writes its log files to the $HADOOP_HOME/logs/ directory; there is a separate log file for each daemon.

Hadoop Web UIs and ports
http://localhost:50070/ – NameNode UI
http://localhost:50030/ – JobTracker UI
http://localhost:50060/ – TaskTracker UI

MapReduce Job Examples
  • Make sure Hadoop is started and the above-mentioned ports are reachable
  • Download the sample input file for Hadoop from GitHub: user.txt
  • Right-click the file and choose 'Save Page as'
Create a directory in HDFS and copy the sample file into it
$ sudo su -
$ cp /home/user/Downloads/user.txt /home/hduser/
$ su hduser
$ hadoop fs -mkdir /samples/hadoop
$ hadoop fs -put /home/hduser/user.txt /samples/hadoop/
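To verify the file landed in HDFS before running the job (a small optional check):
hduser$ hadoop fs -ls /samples/hadoop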
Command to run the WordCount example shipped with Hadoop
hduser$ cd /usr/local/hadoop
hduser$ hadoop jar hadoop*examples*.jar wordcount /samples/hadoop/user.txt /samples/hadoop-output
Delete any existing output folder from HDFS
hduser$ hadoop fs -rmr /samples/hadoop-output
Note : Make sure the hadoop-output directory does not already exist in HDFS; the example will create it and write the output files into it. You can increase the number of reduce tasks by passing -D mapred.reduce.tasks, as shown below.
hduser$ cd /usr/local/hadoop
hduser$ hadoop jar hadoop*examples*.jar wordcount -D mapred.reduce.tasks=16 /samples/hadoop/user.txt /samples/hadoop-output
Note : A MapReduce job accepts the user-specified mapred.reduce.tasks value as-is and does not change it. The number of map tasks is decided by the framework based on the input splits and the available cluster capacity; it cannot be passed in as a parameter.
Verify the generated output file in HDFS
hduser$ /usr/local/hadoop/bin/hadoop dfs -cat /samples/hadoop-output/part-r-00000
Download the output file from HDFS to the local file system
hduser$ hadoop dfs -get /samples/hadoop-output/part-r-00000 /tmp/
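To spot-check the downloaded copy locally (an optional sketch; the file name assumes the default single-reducer output shown above):
hduser$ head /tmp/part-r-00000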
Command to stop your cluster
hduser$ /usr/local/hadoop/bin/stop-all.sh

Thanks to Michael. I modified and added instructions based on my experience while following his blog.
Learn more about Apache Hadoop and Developer.com BigData.
Install Cloudera VM, Counters, Partitioning, Combiners
Excellent Hortonworks tutorial


