What is Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
You can group many small commodity machines into a cluster and process/analyse your data across that cluster, instead of processing everything on a single system.
Prerequisites:
Java 1.6+ (Recommended: Oracle Java)
Update .bashrc or /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.6.0_25
export JRE_HOME=$JAVA_HOME/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
Note : Make sure JAVA_HOME is set in /etc/profile so that Java is available to all users on the machine.
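Reload the profile so the variables take effect in the current shell:
$ source /etc/profile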
To check the Java version:
$ java -version
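You should see something like the following (the exact build strings will differ):
java version "1.6.0_25"
Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)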
Create a dedicated group and user for Hadoop as a best practice
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Install and configure ssh and rsync, tools used by the Hadoop distributed file system (HDFS)
$ sudo apt-get install ssh
$ sudo apt-get install rsync
Note : Make sure sshd is running on your machine
$ ps -ef | grep sshd
$ /etc/init.d/ssh start
Create an SSH key for hduser
$ su - hduser
hduser@laptop: ssh-keygen -t rsa -P ""
Note : Generate the RSA key with an empty passphrase, since Hadoop's scripts must log in without prompting.
Authorize the SSH key so HDFS does not ask for a password each time
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
hduser$ ssh localhost
Hadoop Installation
Download a stable Apache Hadoop release
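For example, assuming the 1.2.1 tarball from the Apache archive (the mirror and version may differ for you):
$ sudo wget -P /usr/local https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz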
$ cd /usr/local
$ sudo tar -xvzf hadoop-1.2.1.tar.gz
$ sudo mv hadoop-1.2.1 hadoop
$ sudo chown -R hduser:hadoop hadoop
Configure Hadoop
Export HADOOP_HOME and add the Hadoop bin directory to the PATH in /etc/profile or hduser's .bashrc. Make sure JAVA_HOME is also configured.
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
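Reload the file and verify Hadoop is on the PATH (assuming the exports went into hduser's .bashrc):
$ source ~/.bashrc
$ hadoop version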
Update JAVA_HOME in hadoop-env.sh
$ vi /usr/local/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/local/java/jdk1.6.0_25
Create a directory for hadoop.tmp.dir (the base directory Hadoop uses to store its data files)
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
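Verify the ownership and permissions:
$ ls -ld /app/hadoop/tmp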
Add the below properties between the configuration tags of $HADOOP_HOME/conf/core-site.xml
<code>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.</description>
</property>
</code>
Add the below property between the configuration tags of $HADOOP_HOME/conf/mapred-site.xml
<code>
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
</code>
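Optionally, for a single-node setup you can also set the replication factor to 1 in $HADOOP_HOME/conf/hdfs-site.xml. This is a suggested addition beyond the steps above; the default of 3 also works on one node, but HDFS will report blocks as under-replicated.
<code>
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication; 1 is sufficient for a single node.</description>
</property>
</code>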
Format the namenode before you start your daemons
$HADOOP_HOME/bin/hadoop namenode -format
Note : Execute this only in the local environment where you are installing Hadoop. This command formats and deletes all data in the Hadoop distributed file system. It formats and creates the HDFS directory based on the dfs.name.dir variable declared in $HADOOP_HOME/src/hdfs/hdfs-default.xml.
Starting your single-node cluster
hduser$ $HADOOP_HOME/bin/start-all.sh
Note : The above command will start the NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker daemons
Check the Java processes to see that the daemons started, and check the listening ports
$ jps
$ netstat -plten | grep java
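On a healthy single-node cluster, jps should list all five daemons; the output below is a sketch of what to expect (PIDs will differ):
12305 NameNode
12460 DataNode
12621 SecondaryNameNode
12710 JobTracker
12871 TaskTracker
13014 Jps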
Note : Hadoop writes its log files to the $HADOOP_HOME/logs/ directory; there is a separate log file for each daemon.
Hadoop Web UIs and ports
http://localhost:50070/ – NameNode UI
http://localhost:50030/ – JobTracker UI
http://localhost:50060/ – TaskTracker UI
MapReduce Job Examples
- Make sure Hadoop is started and the above-mentioned ports are reachable
- Download the sample input file for Hadoop from GitHub: user.txt
- Right-click the file and choose 'Save Page As'
Create a directory in HDFS and copy the sample file into it
$ sudo su -
$ cp /home/user/Downloads/user.txt /home/hduser/
$ su hduser
$ hadoop fs -mkdir /samples/hadoop
$ hadoop fs -put /home/hduser/user.txt /samples/hadoop/
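Verify that the file landed in HDFS:
$ hadoop fs -ls /samples/hadoop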
Command to run the wordcount example shipped with Hadoop
hduser$ cd /usr/local/hadoop
hduser$ hadoop jar hadoop*examples*.jar wordcount /samples/hadoop/user.txt /samples/hadoop-output
Delete an existing output folder and its files from HDFS
hduser$ hadoop fs -rmr /samples/hadoop-output
Note : Make sure the hadoop-output directory does not already exist in HDFS; the example will create it and write the output files into it. You can increase the number of reduce tasks by passing -D mapred.reduce.tasks, as shown below.
hduser$ cd /usr/local/hadoop
hduser$ hadoop jar hadoop*examples*.jar wordcount -D mapred.reduce.tasks=16 /samples/hadoop/user.txt /samples/hadoop-output
Note : A MapReduce job accepts the user-specified mapred.reduce.tasks value as-is and does not adjust it. The number of map tasks, by contrast, is decided by the framework based on the input splits and the available cluster capacity; it cannot be passed in as a parameter.
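With 16 reduce tasks, the output directory should contain 16 part files (part-r-00000 through part-r-00015), which you can confirm with a listing:
hduser$ hadoop fs -ls /samples/hadoop-output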
Verify the generated output file in HDFS
hduser$ /usr/local/hadoop/bin/hadoop dfs -cat /samples/hadoop-output/part-r-00000
Download the output file from HDFS to the local filesystem
hduser$ hadoop dfs -get /samples/hadoop-output/part-r-00000 /tmp/
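Inspect the downloaded file locally:
hduser$ head /tmp/part-r-00000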
Command to stop your cluster
hduser$ /usr/local/hadoop/bin/stop-all.sh
Thanks
Michael. I modified and added instructions based on my experience while following his blog.
Know more about
Apache Hadoop and
Developer.com BigData.
Install Cloudera VM, Counters, Partitioning, Combiners
Excellent HortonWorks tutorial