Monday, February 16, 2015

Install Apache Sqoop - SQL to Hadoop (HDFS) and back again

Sqoop - SQL to Hadoop

Sqoop is an efficient tool for transferring bulk data from structured stores (relational databases) into Hadoop (HDFS / Hive / HBase). You can also export data from HDFS back out for import into relational databases or other data warehouses.

There are multiple ways to install Sqoop.
To install Sqoop from packages on Debian-based distributions (Ubuntu / Debian):

$ sudo apt-get install sqoop

Advantages of installing from Debian packages rather than the tarball (if you want full control over the installation, use the tarball instead):
  •     They handle dependencies
  •     They provide for easy upgrades
  •     They automatically install resources to conventional locations
Download Sqoop - Download the Sqoop build that matches the Hadoop version installed on your box. If you run Hadoop 1.x, download the Sqoop build for Hadoop 1.x to avoid incompatibility errors during import / export.
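
If you are not sure which Hadoop line you are on, check before downloading (assuming hadoop is already on your PATH):

$ hadoop version        # prints the installed Hadoop version, e.g. 1.x or 2.x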

$ sudo su
$ cp ~/Downloads/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz /usr/local && cd /usr/local
$ tar -zxvf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
$ mv sqoop-1.4.5.bin__hadoop-1.0.0 sqoop-1.4.5
$ chown -R hduser:hadoop sqoop-1.4.5
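
A quick sanity check that the extraction and ownership change worked:

$ ls -ld /usr/local/sqoop-1.4.5          # should now be owned by hduser:hadoop
$ ls /usr/local/sqoop-1.4.5/bin          # contains sqoop, sqoop-import, sqoop-export, ...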

Configure the Sqoop wrapper to work with Hadoop
$cd /usr/local/sqoop-1.4.5/conf/
$mv sqoop-env-template.sh sqoop-env.sh

Enable HADOOP_COMMON_HOME and HADOOP_MAPRED_HOME by pointing them to your Hadoop installation path. If you want to Sqoop data into HBase or Hive, enable those variables as well and provide their paths in sqoop-env.sh.

$vi sqoop-env.sh
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
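
If you also plan to move data into Hive or HBase, enable the corresponding entries in sqoop-env.sh as well. The paths below are assumptions; point them at wherever you installed Hive / HBase:

export HIVE_HOME=/usr/local/hive        # only needed for Hive imports (assumed path)
export HBASE_HOME=/usr/local/hbase      # only needed for HBase imports (assumed path)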

In /home/hduser/.bashrc or /etc/profile, add:

export SQOOP_HOME=/usr/local/sqoop-1.4.5
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin:$PIG_HOME/bin:$SQOOP_HOME/bin
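
Reload the shell configuration and confirm that the sqoop binary is now on your PATH:

$ source ~/.bashrc
$ which sqoop           # should print /usr/local/sqoop-1.4.5/bin/sqoop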

Installing JDBC Driver for Sqoop
sudo apt-get install libmysql-java

(This installs the MySQL Connector/J jar; cd /usr/share/java/ to see mysql-connector-java.jar. The package normally also provides a mysql.jar symlink pointing to it.) Sqoop needs the driver on its classpath, so link it into Sqoop's lib directory:

sudo ln -s /usr/share/java/mysql.jar /usr/local/sqoop-1.4.5/lib/
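
Verify that the driver is now visible inside Sqoop's lib directory:

$ ls -l /usr/local/sqoop-1.4.5/lib/mysql.jar    # symlink pointing to /usr/share/java/mysql.jar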

Check your sqoop version
$sqoop-version

Create the movies table in your MySQL database
Refer : Download sqoopSample.sql from github for mysql schema.
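
If you do not want to pull the file, here is a minimal sketch of a schema that works for the examples below (the columns are assumptions for illustration, not the exact schema from sqoopSample.sql; the sqoop_test database and root/root credentials match the later examples):

$ mysql -u root -proot <<'SQL'
CREATE DATABASE IF NOT EXISTS sqoop_test;
USE sqoop_test;
-- hypothetical columns; adjust to match sqoopSample.sql
CREATE TABLE movies (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255),
  release_year INT
);
SQL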

Import data (table or all tables) from MySQL into Hadoop (HDFS)
Make sure you have started Hadoop. (cd $HADOOP_HOME/bin; ./start-all.sh)
-m <n> : use n map tasks to import in parallel
$cd /home/hduser
$sqoop import --connect jdbc:mysql://hostName/dbName --username userName --password password --table tableName --target-dir /samples/movies -m 1
 
$sqoop import-all-tables --connect jdbc:mysql://hostName/dbName --username userName --password password 
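
For example, with the movies table and the same localhost / sqoop_test / root / root values used in the export examples further down, followed by a quick check that the files landed in HDFS:

$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table movies --target-dir /samples/movies -m 1
$ hadoop fs -ls /samples/movies                      # part-m-00000 files plus a _SUCCESS marker
$ hadoop fs -cat /samples/movies/part-m-00000 | head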

Important :
  • If you see the error "Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.JobContext, but interface was expected", the Hadoop version you installed and the Hadoop version Sqoop was built against do not match. Hadoop 1 and Hadoop 2 have major API changes, so use 1.x (or 2.x) consistently for both Hadoop and the Sqoop build.
  • Make sure the table has a primary key; otherwise you must either specify a split column or drop down to a single map task, as shown in the sketch below.
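
A minimal sketch of both options for a table without a primary key (some_column and the target directories are placeholders for illustration):

# option 1: fall back to a single mapper, no split column needed
$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table movies --target-dir /samples/movies_single -m 1

# option 2: keep parallelism by naming a split column explicitly
$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table movies --target-dir /samples/movies_split --split-by some_column -m 4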

Export data from hadoop (hdfs) into MySQL
$sqoop export --connect jdbc:mysql://localhost/sqoop_test --table movies_export --username root --password root --export-dir '/samples/movies/' -m 1;
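
Note that sqoop export does not create the target table for you; movies_export must already exist in MySQL with columns matching the exported files. A minimal sketch, assuming the same hypothetical columns as the movies table above:

$ mysql -u root -proot sqoop_test <<'SQL'
-- hypothetical columns; they must line up with the fields in /samples/movies
CREATE TABLE IF NOT EXISTS movies_export (
  id INT NOT NULL PRIMARY KEY,
  title VARCHAR(255),
  release_year INT
);
SQL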

Sqoop command to create a Hive table matching the database table. If the table already exists in Hive, it will throw an error.
hduser$ sqoop create-hive-table --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table movies
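
To confirm the table definition landed in the Hive metastore (assuming the hive CLI is on your PATH):

hduser$ hive -e 'DESCRIBE movies;'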

Note : Impala uses the same metastore as Hive, so you can use create-hive-table and then query the imported data from Impala as well.

Sqoop command to import the entire content of the database table into Hive, using commas (,) to separate the fields in the data files:
hduser$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table movies --fields-terminated-by ',' --hive-import
 
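A quick check that the rows made it into Hive:

hduser$ hive -e 'SELECT COUNT(*) FROM movies;'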

--hive-overwrite -> Overwrite any existing content in the Hive table with the imported data.
--hive-import -> Required to import the table into Hive; it creates the Hive table if it does not already exist.
--hive-table -> Target Hive table name; if not specified, the database table name is used for the Hive table.

The default field delimiter when Sqoop imports into Hive is ^A (Ctrl-A). You can pass --fields-terminated-by ',' (or any other character) to use a specific separator.
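
Putting the flags above together - a sketch of a re-import that overwrites existing data and writes to a differently named Hive table (movies_hive is a hypothetical name for illustration):

hduser$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table movies --fields-terminated-by ',' --hive-import --hive-overwrite --hive-table movies_hive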



