What is Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
You can group many small commodity machines into a cluster and process or analyse your data across that cluster, instead of processing it on a single system.
Prerequisites:
Java 1.6+ (recommended: Oracle Java)
Update .bashrc or /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.6.0_25
export JRE_HOME=$JAVA_HOME/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
Note: Make sure JAVA_HOME is set in /etc/profile so that Java is available to all users on the machine.
To check the Java version:
$ java -version
Create a group and user for Hadoop as a best practice
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Install and configure ssh and rsync, tools used by the Hadoop Distributed File System (HDFS)
$ sudo apt-get install ssh
$ sudo apt-get install rsync
Note: Make sure sshd is running on your machine
$ ps -ef | grep sshd
$ /etc/init.d/ssh start
Create an SSH key for hduser
$ su - hduser
hduser@laptop:~$ ssh-keygen -t rsa -P ""
Note: Create the RSA key with an empty passphrase (no password)
Authorize the SSH key so HDFS does not prompt for a password each time
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
hduser$ ssh localhost
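If ssh localhost still prompts for a password after this, directory permissions are a common cause; a minimal fix sketch, assuming the default ~/.ssh layout for hduser:
hduser$ chmod 700 $HOME/.ssh
hduser$ chmod 600 $HOME/.ssh/authorized_keys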
Hadoop Installation
Download the Apache Hadoop stable version (hadoop-1.2.1 is used below)
$ cd /usr/local
$ sudo tar -xvzf hadoop-1.2.1.tar.gz
$ sudo mv hadoop-1.2.1 hadoop
$ sudo chown -R hduser:hadoop hadoop
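The commands above assume hadoop-1.2.1.tar.gz is already present in /usr/local; a hedged way to fetch it first (the mirror URL is an assumption, verify it against the Apache archive):
$ cd /usr/local
$ sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz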
Configure Hadoop
Export HADOOP_HOME and add the Hadoop bin directory to the PATH in /etc/profile or hduser's .bashrc. Make sure JAVA_HOME is also configured.
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
Update $HADOOP_HOME/conf/hadoop-env.sh with JAVA_HOME
$ vi /usr/local/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/local/java/jdk1.6.0_25
Create the directory for hadoop.tmp.dir (the directory where Hadoop stores its data files)
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
Update the configuration tags in $HADOOP_HOME/conf/core-site.xml with the configuration below
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system.</description>
</property>
</configuration>
Update the configuration tags in $HADOOP_HOME/conf/mapred-site.xml with the configuration below
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
</configuration>
Format the NameNode before you start your daemons
$HADOOP_HOME/bin/hadoop namenode -format
Note: Execute this only in the local environment where you are installing Hadoop. This command formats and deletes all existing data in the Hadoop distributed file system. It formats and creates the HDFS directory based on the dfs.name.dir property declared in $HADOOP_HOME/src/hdfs/hdfs-default.xml.
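After formatting, you can sanity-check that the NameNode metadata directory was created; the dfs/name path below assumes the Hadoop 1.x default of ${hadoop.tmp.dir}/dfs/name:
hduser$ ls /app/hadoop/tmp/dfs/name/current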
Starting your single node cluster
hduser$ $HADOOP_HOME/bin/start-all.sh
Note: The above command starts the NameNode, DataNode, JobTracker and TaskTracker daemons
Check the Java processes to confirm the daemons started, and check the listening ports
$ jps
$ netstat -plten | grep java
Note: Hadoop writes its log files to the $HADOOP_HOME/logs/ directory. There is a separate log file for each daemon.
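If a daemon is missing from the jps output, its log usually explains why. The file-name pattern below (hadoop-&lt;user&gt;-&lt;daemon&gt;-&lt;hostname&gt;.log) is typical for Hadoop 1.x but may differ on your machine:
hduser$ tail -n 50 $HADOOP_HOME/logs/hadoop-hduser-namenode-*.log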
Hadoop web UIs and ports
http://localhost:50070/ – NameNode UI
http://localhost:50030/ – JobTracker UI
http://localhost:50060/ – TaskTracker UI
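A quick way to confirm the web UIs are actually listening is to request each port with curl; this is just a convenience check added here, not part of the original steps:
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50030/
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50060/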
MapReduce Job Examples
- Make sure Hadoop is started and the above-mentioned ports are listening
- Download the sample input file user.txt for Hadoop from GitHub
- Right-click and choose 'Save Page As'
$ sudo su -
$ cp /home/user/Downloads/user.txt /home/hduser/
$ su hduser
$ hadoop fs -mkdir /samples/hadoop
$ hadoop fs -put /home/hduser/user.txt /samples/hadoop/
Command to run the wordcount example from Hadoop
hduser$ cd /usr/local/hadoop
hduser$ hadoop jar hadoop*examples*.jar wordcount /samples/hadoop/user.txt /samples/hadoop-output
Delete an existing output folder from HDFS
hduser$ hadoop fs -rmr /samples/hadoop-output
Note: Make sure the hadoop-output directory does not already exist in HDFS; the example job will create the hadoop-output directory with the output files. You can increase the number of reduce tasks by passing -D mapred.reduce.tasks.
hduser$ cd /usr/local/hadoop
hduser$ hadoop jar hadoop*examples*.jar wordcount -D mapred.reduce.tasks=16 /samples/hadoop/user.txt /samples/hadoop-output
Note: A MapReduce job accepts the user-specified mapred.reduce.tasks value as-is and does not adjust it. The number of map tasks, however, is decided by the framework based on the input splits and the available cluster capacity; it cannot be passed in the same way.
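With 16 reduce tasks the job should produce one part file per reducer; a quick check added here is to list the output directory:
hduser$ hadoop fs -ls /samples/hadoop-output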
Verify the generated output file in HDFS
hduser$ /usr/local/hadoop/bin/hadoop dfs -cat /samples/hadoop-output/part-r-00000
Download the output file from HDFS to the local file system
hduser$ hadoop dfs -get /samples/hadoop-output/part-r-00000 /tmp/
Command to stop your cluster
hduser$ /usr/local/hadoop/bin/stop-all.sh
Thanks Michael. I modified and added instructions based on my experience while following his blog.
Know more about Apache Hadoop and Developer.com BigData.
Install Cloudera VM, Counters, Partitioning, Combiners
Excellent Hortonworks tutorial