Sunday, December 18, 2011

Hadoop

Hadoop Overview : (http://www.cloudera.com/what-is-hadoop/hadoop-overview/)
Apache Hadoop is a scalable, fault-tolerant system for data storage and processing. Hadoop is economical and reliable, which makes it perfect to run data-intensive applications on commodity hardware.

Important Note : 
Data Localization - A typical application fetches data from an RDBMS and processes it locally, inside the application. In Hadoop, the program is instead copied to the servers that hold the data and processed there.

Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce.

Note : Data appears to be available on all the data nodes, but only virtually. In reality, each data node holds part of the data as blocks; each block gets a block ID, and those block IDs (the metadata) are stored in the name node.
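To see this block layout for a real file, fsck can print the block IDs and the DataNodes holding each block (the path below is just a placeholder):
hadoop fsck /export/some_file -files -blocks -locations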

HDFS (Fault-tolerant Hadoop Distributed File System)
The Hadoop software package includes the robust, reliable Hadoop Distributed File System (HDFS), which splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.

HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
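Each block is replicated 3 times by default (dfs.replication in hdfs-site.xml); replication can also be changed per path, for example (the path is a placeholder and -w waits until the new factor is reached):
hadoop fs -setrep -w 3 /export/some_dir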

MapReduce Software Framework

  • Processes large jobs in parallel across many nodes and combines results.
  • Eliminates the bottlenecks imposed by monolithic storage systems
  • Results are collated and digested into a single output after each piece has been analyzed.

If a machine fails, Hadoop continues to operate the cluster by shifting work to the remaining machines. It automatically creates an additional copy of the data from one of the replicas it manages. As a result, clusters are self-healing for both storage and computation without requiring intervention by systems administrators.
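
A quick end-to-end check of the MapReduce framework is the bundled wordcount example. The jar path below assumes a CDH3 package install and the input/output paths are placeholders; adjust them for your cluster:
hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar wordcount /user/gubs/wordcount_input /user/gubs/wordcount_output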

Installation : 
Single node installation or single system installation follow : https://ccp.cloudera.com/display/CDHDOC/Installing+CDH3+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode

If you get a snappy library error : sudo yum install hadoop-0.20-native

Start Hadoop Daemons
for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
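To confirm the daemons actually came up, jps (part of the JDK) should list them; run as root it should also see the JVMs owned by the hdfs and mapred users:
sudo jps
Expected in pseudo-distributed mode : NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker.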

Logs
tail -f /var/log/hadoop-0.20/hadoop-hadoop-*.log

Hadoop Default Ports:
http://localhost:50030/jobtracker.jsp
Default port quick reference for all the Hadoop daemons : http://www.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/
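A few common web UI ports in 0.20, from memory (double-check against the Cloudera reference above):
http://localhost:50070/dfshealth.jsp - NameNode
http://localhost:50030/jobtracker.jsp - JobTracker
http://localhost:50060/ - TaskTracker
http://localhost:50075/ - DataNode
http://localhost:50090/ - SecondaryNameNode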

NameNode:
  • Tracks all metadata and co-ordinates access.
  • Holds filenames, block locations on slave nodes, as well as ownership info.
  • The whole filesystem is stored in RAM for fast look-up.
  • Thus, filesystem size and metadata capacity is limited to the amount of available RAM on the NN.
  • Back this guy up and don’t use commodity hardware here. If you lose the metadata, absolutely nothing can be done, you’ve just got a lot of useless 1's & 0's on your datanodes.
  • Works like a journaling file system, fsimage and edits files.
  • Backup is as easy as curl’ing a URL to download the fsimage and edits files (see the example after this list).
  • Some people use DRBD and Heartbeat for a manual fail-over setup.
  • We have the NN write the metadata to two different drives locally as well as to an NFS mount off of the machine.
  • There are a few projects on making a federated namenode, but nothing ready for primetime yet.
  • HDFS doesn’t like lots of small files because it wastes A LOT of RAM on the NN.
  • Default blocksize is 64M. Files smaller than 64M are not padded on disk (they only occupy their actual size), but each block still takes up a full metadata entry in the NN’s RAM regardless of how small it is.
  • Example: a single 64M file uses one block and one ‘slot’ in the NameNode’s fsimage, whereas 64 * 1M files use 64 slots. Extrapolate that, and lots of small files can hose a NameNode.
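
A rough sketch of the curl backup mentioned above, assuming the NN web UI is on the default port 50070 (the hostname is a placeholder; verify the servlet URL against your Hadoop version):
curl -o fsimage 'http://namenode_host:50070/getimage?getimage=1'
curl -o edits 'http://namenode_host:50070/getimage?getedit=1'
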
DataNodes (DN):
  • Holds data in ‘blocks’ stored on the local file system.
  • DNs send heartbeats to the NN; after a period without any, the DN is assumed lost. In that case:
  • The NN determines what blocks were lost on that node.
  • The NN finds other DNs with copies of those blocks and tells those DNs to copy the blocks to other DNs.
  • None of the transferring is done through the NN, but direct DN to DN.
  • This is all automatic. With a default replication level of 3, you can lose 2 nodes simultaneously without even worrying about losing any data.
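
A quick way to check which DataNodes are alive or dead and how capacity and blocks are spread across them:
hadoop dfsadmin -report
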
The Secondary NameNode (SNN):
  • Is a very unfortunately named process!
  • It is NOT a failover node for the NN.
  • HDFS works like any other journaling filesystem.
  • The NN keeps the filesystem image in RAM (based off of the fsimage file it started with) and appends changes to a journal (the edits file).
  • The SNN periodically grabs the fsimage and edits files from the NN, replays the edits in RAM and combines it with the fsimage to make a new fsimage, then ships it back to the NN.
  • The SNN should have its own machine because it requires as much RAM as the NN to replay and merge the file system image (the checkpoint interval and location are configurable; see the note below).
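
The SNN checkpoint behaviour is configurable; from memory (0.20 era, verify against your version's defaults) the relevant properties go in core-site.xml:
fs.checkpoint.period - seconds between checkpoints (default 3600)
fs.checkpoint.dir - local directory where the SNN stores its merged copy of the fsimage
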
Job Workflow:
  • A user runs a client on a client computer.
  • The client submits a Job (a mapper, a reducer and a list of inputs).
  • The Job is sent to the JobTracker (JT), which is the master process that coordinates all the action. Typically there is only one JT and usually (except in big clusters) is run on the same machine as the NN process.
  • The process that does the actual work is the TaskTracker (TT), which usually resides on the same machine as the DN process, so…
  • Usually a DN/TT pair is a slave and a JT/NN pair is a master.
  • The JT instructs the TT’s to run a task… a part of a Job.
  • The TT will spawn child processes to run a task attempt, an instance of a task running on a slave. If it fails it can be restarted.
  • Task child-processes send heartbeats to the TT, which in turn sends heartbeats to the JT. If a task doesn’t send a heartbeat in 10 minutes it is assumed dead and the TT kills its JVM (this timeout is configurable; see the note below).
  • Additionally, any task that throws an exception is considered to have failed.
  • The TT then reports failures to the JT, which in turn tries to reschedule the task (preferably on a different TT).
  • Similarly, any TT that fails to send a heartbeat in 10 minutes is assumed to be dead and the JT schedules its tasks elsewhere.
  • Any TT that fails enough times is blacklisted from that specific Job and furthermore any TT that fails across multiple Jobs is added to a global blacklist.
Above points from blog : http://blog.milford.io/2011/01/slides-and-notes-from-my-recent-hadoop-talk-in-israel/
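
A note on the 10 minute task timeout above: from memory it is mapred.task.timeout in mapred-site.xml, expressed in milliseconds (default 600000); treat the property name as a best-effort recollection and verify it against your version.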

Copy Data from hdfs server (qa / prod / dev) to local hdfs server

Syntax : hadoop distcp <source_url> <destination_url>
example : hadoop distcp hdfs://hadoopdbnode1/export/qamove/SEMMaster hdfs://localhost/export/gubs_export

The above is a distributed copy. If it's not working, use a filesystem copy instead :
hadoop fs -cp /export/backup_energy/0001_ConsolidatedPerformanceSummary hdfs://destination_ip/export/backup_energy/

http://hadoop.apache.org/common/docs/current/distcp.html
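If only some of the files have changed since a previous copy, distcp's -update flag copies just the files that differ (the paths below are placeholders):
hadoop distcp -update hdfs://source_namenode/export/qamove hdfs://destination_namenode/export/qamove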

How to turn off safe mode on the NameNode ?
In HDFS, run the command : hadoop dfsadmin -safemode leave
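To check whether safe mode is currently on : hadoop dfsadmin -safemode get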
To delete corrupted files : hadoop fsck / -delete
To check the filesystem health : hadoop fsck /

Kill the job :
hadoop job -kill job_201207241936_0656

List the running job : 
hadoop job -list

Remove the directory in hadoop file system
hadoop fs -rmr /export/

List the files in hadoop file system
hadoop fs -ls /

Use the rowcounter job in the HBase jar to find a table's row count
hadoop jar hbase-0.90.6-cdh3u4.jar rowcounter <tablename>

If the hadoop command shows the local filesystem, then the Hadoop configuration is not on the classpath. Download the client configuration from Cloudera Manager and reference it in .bashrc (see the sketch below).
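
A minimal sketch of the .bashrc entry, assuming the client configuration was unpacked to the usual CDH3 location (adjust the path to wherever yours lives):
export HADOOP_CONF_DIR=/etc/hadoop-0.20/conf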

Copy between two hdfs file systems
hadoop fs -cp  hdfsSourcePath/* hdfs://destinationIP/export/backup_20120824/
