Tuesday, January 27, 2015

Install Hadoop in a Single Node (Linux / Ubuntu)

What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

You can group many small commodity machines into a cluster and process / analyse your data across that cluster instead of processing it on a single system.

Prerequisites:
Java 1.6+ (Recommended : Oracle Java)

Update .bashrc or /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.6.0_25
export JRE_HOME=$JAVA_HOME/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
Note : Make sure JAVA_HOME is set in /etc/profile so that Java is available to all users on the machine.
To Check the java version :
$ java -version
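To confirm the variables are picked up in a new shell (a small optional check; adjust the path if your JDK lives elsewhere):
$ echo $JAVA_HOME
$ $JAVA_HOME/bin/java -version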
Create a dedicated group and user for Hadoop (a best practice)
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Install and configure ssh & rsync, tools used by the Hadoop Distributed File System (HDFS)
$ sudo apt-get install ssh
$ sudo apt-get install rsync
Note : Make sure sshd is running on your machine
$ ps -ef | grep sshd
$ /etc/init.d/ssh start

Create an SSH key for hduser
$ su - hduser
hduser@laptop: ssh-keygen -t rsa -P ""
Note : Create the RSA key with an empty passphrase (-P "") so that Hadoop can connect over SSH without prompting for a password

Authorize the SSH key so that the HDFS daemons can connect without prompting for a password each time
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
hduser$ ssh localhost
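If you prefer a non-interactive check that key-based login works (an optional sketch; BatchMode makes ssh fail instead of prompting for a password):
hduser$ ssh -o BatchMode=yes localhost echo "passwordless login OK"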

Hadoop Installation
Download the stable Apache Hadoop release (1.2.1 is used here)
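If you have not downloaded the tarball yet, one way to fetch the 1.2.1 release into /usr/local is sketched below; the archive.apache.org path is an assumption, so verify it against the current Apache mirror list:
$ sudo wget -P /usr/local https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz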
$ cd /usr/local
$ sudo tar -xvzf hadoop-1.2.1.tar.gz
$ sudo mv hadoop-1.2.1 hadoop
$ sudo chown -R hduser:hadoop hadoop

Configure Hadoop  
Export HADOOP_HOME and add the Hadoop bin directory to the PATH in /etc/profile or in hduser's .bashrc. Make sure JAVA_HOME is also configured.
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
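For example, to append these lines to hduser's .bashrc and confirm the hadoop command is on the PATH (a minimal sketch; hadoop version simply prints the build information):
hduser$ echo 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.bashrc
hduser$ echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.bashrc
hduser$ source ~/.bashrc
hduser$ hadoop version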

Update JAVA_HOME in hadoop-env.sh
$ vi /usr/local/hadoop/conf/hadoop-env.sh 
export JAVA_HOME=/usr/local/java/jdk1.6.0_25

Create the directory for hadoop.tmp.dir (the base directory where Hadoop stores its data files)
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
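To confirm the ownership and permissions were applied (a small optional check):
$ ls -ld /app/hadoop/tmp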

Add the properties below inside the <configuration> tags of $HADOOP_HOME/conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system.</description>
</property>

Add the property below inside the <configuration> tags of $HADOOP_HOME/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task
</description>
</property>

Format the NameNode before you start the daemons
$HADOOP_HOME/bin/hadoop namenode -format
Note : Run this only on the local environment where you are installing Hadoop, as this command formats / deletes all data in the Hadoop distributed file system. It formats and creates the HDFS directory based on the dfs.name.dir variable declared in $HADOOP_HOME/src/hdfs/hdfs-default.xml.

Starting your single-node cluster
hduser$ $HADOOP_HOME/bin/start-all.sh
Note : The above command will start the NameNode, DataNode, JobTracker and TaskTracker daemons (plus the SecondaryNameNode)

Check the Java processes to verify that the daemons have started, and check the listening ports
$ jps
$ netstat -plten | grep java
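For example, to check only the ports configured and listed in this guide (a small sketch using the HDFS, JobTracker and web UI ports from the steps above):
$ netstat -plten | grep -E '54310|54311|50070|50030|50060'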
Note : Hadoop writes its log files to the $HADOOP_HOME/logs/ directory; there is a separate log file for each daemon.

Hadoop Web UIs and ports
http://localhost:50070/ – NameNode UI
http://localhost:50030/ – JobTracker UI
http://localhost:50060/ – TaskTracker UI

MapReduce Job Examples
  • Make sure Hadoop is started and the above-mentioned ports are reachable
  • Download the sample input file for Hadoop from GitHub: user.txt
  • Right-click the file and choose 'Save Page as'
Create a directory in HDFS and copy the sample file into it
$ sudo su -
$ cp /home/user/Downloads/user.txt /home/hduser/
$ su hduser
$ hadoop fs -mkdir /samples/hadoop
$ hadoop fs -put /home/hduser/user.txt /samples/hadoop/
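To verify the file landed in HDFS before running the job (a small optional check):
hduser$ hadoop fs -ls /samples/hadoop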
Command to run the WordCount example shipped with Hadoop
hduser$ cd /usr/local/hadoop
hduser$ hadoop jar hadoop*examples*.jar wordcount /samples/hadoop/user.txt /samples/hadoop-output
Delete any existing output folder from HDFS
hduser$ hadoop fs -rmr /samples/hadoop-output
Note : Make sure the hadoop-output directory does not already exist in HDFS; the example will create it and write the output files into it. You can increase the number of reduce tasks by passing -D mapred.reduce.tasks, as shown below.
hduser$ cd /usr/local/hadoop
hduser$ hadoop jar hadoop*examples*.jar wordcount -D mapred.reduce.tasks=16 /samples/hadoop/user.txt /samples/hadoop-output
Note : A MapReduce job accepts the user-specified mapred.reduce.tasks value as-is and does not change it. The number of map tasks is decided by the framework based on the input splits and the available cluster capacity; it cannot be passed in as a parameter.
Verify the generated output file in HDFS
hduser$ /usr/local/hadoop/bin/hadoop dfs -cat /samples/hadoop-output/part-r-00000
Download the output file from HDFS to the local file system
hduser$ hadoop dfs -get /samples/hadoop-output/part-r-00000 /tmp/
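To spot-check the downloaded copy locally (an optional sketch; the file name assumes the default single-reducer output shown above):
hduser$ head /tmp/part-r-00000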
Command to stop your cluster
hduser$ /usr/local/hadoop/bin/stop-all.sh

Thanks to Michael. I modified and added instructions based on my experience while following his blog.
Learn more about Apache Hadoop and Developer.com BigData.
Install Cloudera VM, Counters, Partitioning, Combiners
Excellent Hortonworks tutorial


