Tuesday, February 24, 2015

Java Quiz

1) What will be the output when you execute the code below?

for (int i = 0; i >= (-1) * 5; i--) {
          System.out.println(i);
}

Output : 0 -1 -2 -3 -4 -5 (each value printed on its own line)


for loop repetitive control structure flow :
for (initialize; boolean_expression; update) {
}

  • initialize is executed only once, when the loop starts for the first time. You can leave it blank by providing just the semicolon (;) placeholder.
  • boolean_expression is evaluated next; if it is true the body of the for loop executes, if it is false the loop exits.
  • update is executed after the body of the for loop completes.
  • Once update completes, boolean_expression is evaluated again to decide whether the body runs another time, and the cycle continues.

Note : There is no difference between i-- and --i (or i++ and ++i) when used as the for loop update, because the value of the expression is discarded.
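A quick sketch to convince yourself (both loops print 0 through -5; the value of the update expression is simply discarded):

for (int i = 0; i >= -5; i--) {
    System.out.println(i);
}
for (int i = 0; i >= -5; --i) {
    System.out.println(i);
}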


2) What will be the output when you execute the code below?
int i = 1;
int j = ++i;
System.out.println("i: " + i + " j: " + j);

and

int i = 1;
int j = i++;
System.out.println("i: " + i + " j: " + j);

Output : 
i: 2 j: 2
i: 2 j: 1


Why the difference?
    ++i increments the value first and then returns it
    i++ returns the value first and then increments it

When clarity matters, prefer writing i = i + 1 as its own statement rather than embedding ++ inside a larger expression.
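If it helps, here is the longhand each form expands to (illustrative only):

int i = 1;
int j = ++i;   // same as: i = i + 1; j = i;   -> i: 2, j: 2

int a = 1;
int b = a++;   // same as: b = a; a = a + 1;   -> a: 2, b: 1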

3) What will be the output when you execute the code below?
String testSplit = "hello,,world";
String[] testSplits = testSplit.split(",");
System.out.println(testSplits[2]);

Output : world

String testSplit = "hello,world,";
String[] testSplits = testSplit.split(",");
System.out.println(testSplits[2]);

Output : java.lang.ArrayIndexOutOfBoundsException: 2

String testSplit = "hello,world,";
String[] testSplits = testSplit.split(",", -2);
System.out.println(testSplits[2]);

Output :      (empty)

Note : By default, split drops all trailing empty strings, so any attempt to access the final (empty) column results in an ArrayIndexOutOfBoundsException. Passing a negative number as the limit (second argument) to split causes it to retain the trailing empty strings.
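A small sketch of how the limit argument changes the length of the result (the values follow from the behavior described above):

String s = "hello,world,";
System.out.println(s.split(",").length);      // 2 -> trailing empty string dropped
System.out.println(s.split(",", -1).length);  // 3 -> trailing empty string kept
System.out.println(s.split(",", 2).length);   // 2 -> ["hello", "world,"], split applied at most once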

4) What will be the output when you execute the code below?
Pattern p = Pattern.compile("^([\"']?)\\d\\d:\\d\\d\\1,([\"']?)[A-Z]\\w+\\2,.*$");
  String regexpInput = "1:23,Logout Now";
  if (p.matcher(regexpInput.toString()).matches()) {
   System.out.println("Good");
  } else {
   System.out.println("Bad");
  }
Output : Bad

Why Bad ?
1. The hour has only 1 digit (1) instead of the 2 digits the pattern requires.
2. The 2nd column ("Logout Now") contains a space, which [A-Z]\w+ does not allow.
3. The regexp expects a 3rd column after another comma (,), which is missing from the input.

\\1 and \\2 are backreferences: they match the same text that group 1 and group 2 captured (here, the optional quote character).
Groups are numbered 1, 2, 3, ... in the order their opening parentheses appear, and a backreference refers to a group by that number.
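For contrast, here are a few inputs tried against the same pattern (a small sketch; the first two satisfy all three columns, the third fails because the backreference \1 forces the closing quote to match the opening one):

import java.util.regex.Pattern;

Pattern p = Pattern.compile("^([\"']?)\\d\\d:\\d\\d\\1,([\"']?)[A-Z]\\w+\\2,.*$");
System.out.println(p.matcher("12:34,Logout,now").matches());       // true
System.out.println(p.matcher("'12:34','Logout',now").matches());   // true
System.out.println(p.matcher("'12:34\",Logout,now").matches());    // false (quotes don't match)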

Difference between iBatis and Hibernate ? (Both are persistence frameworks)
iBatis / MyBatis is a SQL-driven model: you control the application through the SQL you write (SqlMap XML files).
Hibernate is an object-driven model: you design your objects and Hibernate maps their fields to the database (hbm - Hibernate mapping XML files).
Pros and Cons :
iBatis is database dependent because you write raw SQL, but development is faster and the framework is lighter, with cache support.
Hibernate is database independent thanks to its HQL-based approach. It is heavier than iBatis, but highly scalable with advanced cache support.

iBATOR - Code generator for iBatis.

Castor - Castor is an open source Java data binding framework for moving data from XML to Java objects and from Java to a database. Similar to JAXB.

Difference between JPA and Hibernate ? 
JPA is a specification (a set of interfaces) defined through the JSR process.
Hibernate is an implementation of the JPA specification (iBatis / MyBatis is a separate persistence framework, not a JPA implementation).
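A minimal sketch of what "programming against the JPA specification" looks like, using only standard javax.persistence types (the persistence unit name "demo" and the Movie entity are made-up examples; Hibernate is simply one provider that can run this):

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;

@Entity
public class Movie {
    @Id
    private Long id;       // primary key column
    private String title;  // mapped to a column named "title" by default
    // getters/setters omitted for brevity
}

// Usage (inside a method): the same code runs unchanged on any JPA provider (Hibernate, EclipseLink, ...)
EntityManagerFactory emf = Persistence.createEntityManagerFactory("demo");
EntityManager em = emf.createEntityManager();
em.getTransaction().begin();
em.persist(new Movie());
em.getTransaction().commit();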

Saturday, February 21, 2015

Oozie Examples

Setting Up the Examples
The examples/ directory must be extracted from the Oozie distribution and copied to the user's HOME directory in HDFS:
cd /usr/local/oozie
tar -xvzf oozie-examples.tar.gz
chown -R hduser:hadoop examples/
hadoop fs -put /usr/local/oozie/examples/ /user/hduser/examples/
NOTE: If an examples directory already exists in HDFS, it must be deleted before copying it again. Otherwise files may not be copied.

Running the Examples
Add Oozie bin/ to the environment PATH in .bashrc or /etc/profile
export OOZIE_HOME=/usr/local/oozie
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin:$PIG_HOME/bin:$SQOOP_HOME/bin:$OOZIE_HOME/bin

Execute job from Terminal (hduser)
oozie job -oozie http://localhost:11000/oozie -config /usr/local/oozie/examples/apps/map-reduce/job.properties -run

NOTE: The job.properties file needs to be a local file at submission time, not an HDFS path. Modify job.properties to set the NameNode and JobTracker URLs.

Check Oozie URL to track the oozie workflow job status : http://localhost:11000/oozie

Note : The example applications are under the $OOZIE_HOME/examples/apps directory, one directory per example. Each directory contains the application XML file (workflow, or workflow and coordinator), the job.properties file used to submit the job, and any JAR files the example may need. Go through each workflow to run the hive, java, sqoop and streaming examples.
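Besides the CLI, a workflow can also be submitted from Java through the Oozie client API. A hedged sketch only: it assumes the oozie-client jar is on the classpath, that the example app was uploaded to HDFS as described above, and the property values mirror what you would normally put in job.properties (adjust them to your setup):

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;

public class SubmitMapReduceExample {
    public static void main(String[] args) throws OozieClientException {
        OozieClient client = new OozieClient("http://localhost:11000/oozie");
        Properties conf = client.createConfiguration();
        // Same keys as job.properties; the values below are illustrative
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:54310/user/hduser/examples/apps/map-reduce");
        conf.setProperty("nameNode", "hdfs://localhost:54310");
        conf.setProperty("jobTracker", "localhost:54311");
        conf.setProperty("queueName", "default");
        String jobId = client.run(conf);   // submits and starts the workflow job
        System.out.println("Workflow job submitted: " + jobId);
    }
}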

Reference : Oozie Example
Oozie workflow with Pig and Hive

Friday, February 20, 2015

Big Data Growth Trends and Events

Hadoop is growing faster than expected.

Click to Zoom the image : Hadoop Growth


The US President speaks about the importance of the Big Data trend.


Apache Oozie - Schedule your job using powerful workflow engine

Apache Oozie
Oozie is an extensible, scalable and reliable system to define, manage, schedule, and execute complex Hadoop workloads via web services. More specifically, this includes:

  * XML-based declarative framework to specify a job or a complex workflow of dependent jobs.
  * Support for different job types such as Hadoop Map-Reduce, Pipes, Streaming, Pig, Hive and custom Java applications.
  * Workflow scheduling based on frequency and/or data availability.
  * Monitoring capability, automatic retry and failure handling of jobs.
  * Extensible and pluggable architecture to allow arbitrary grid programming paradigms.
  * Authentication, authorization, and capacity-aware load throttling to allow multi-tenant software as a service.

Oozie Engines :
Oozie has Workflow Engine,  Coordinator Engine and Bundle Engine

Prerequisites :
JAVA_HOME (Java)
M2_HOME (Maven)
HADOOP_HOME (Hadoop)
PIG_HOME (Pig)

Download Oozie

Make sure below command works
$ java -version
$ javac -version
$ mvn -version


Extract the Oozie archive
$ sudo cp ~/Downloads/oozie-*.tar.gz /usr/local/
$ sudo su -
$ cd /usr/local
$ tar -xzf oozie-3.3.2.tar.gz


Building Oozie
The simplest way to build Oozie is to run the mkdistro.sh script:
$ cd oozie-3.3.2
$ ./bin/mkdistro.sh -DskipTests


Oozie Server Setup
Copy the built binaries to the home directory as ‘oozie’
$ cd ..
$ cp -R oozie-3.3.2/distro/target/oozie-3.3.2-distro/oozie-3.3.2/ oozie

Create the required libext directory
$ cd oozie
$ mkdir libext


Copy all the required jars from hadooplibs to the libext directory using the following command:
$ cp ../oozie-3.3.2/hadooplibs/target/oozie-3.3.2-hadooplibs.tar.gz .
$ tar xzvf oozie-3.3.2-hadooplibs.tar.gz
$ cp oozie-3.3.2/hadooplibs/hadooplib-1.1.1.oozie-3.3.2/* libext/


Get ExtJS – This library is not bundled with Oozie and needs to be downloaded separately. This library is used for the Oozie Web Console:
$ cd libext
$ wget http://extjs.com/deploy/ext-2.2.zip
$ cd ..


Update $HADOOP_HOME/conf/core-site.xml as follows (Hadoop version 1.2.x):

<property>
<name>hadoop.proxyuser.hduser.hosts</name>
<value>localhost</value>
</property>
<property>
<name>hadoop.proxyuser.hduser.groups</name>
<value>hadoop</value>
</property>


Note : Here, ‘hduser’ is the username and it belongs to ‘hadoop’ group.


Prepare the WAR file
$ ./bin/oozie-setup.sh prepare-war

INFO: Oozie is ready to be started


Provide permission to oozie directory
$ chown -R hduser:hadoop oozie


Create sharelib on HDFS
$ su hduser
$ cd /usr/local/oozie
$ ./bin/oozie-setup.sh sharelib create -fs hdfs://localhost:54310


Create the Oozie DB
$ ./bin/ooziedb.sh create -sqlfile oozie.sql -run

The SQL commands have been written to: oozie.sql


Start Oozie as a daemon process with oozie-start.sh, run it as a foreground process with oozie-run.sh, and stop it with oozie-stop.sh:
$ ./bin/oozie-start.sh
or
$ ./bin/oozie-run.sh
or
$ ./bin/oozie-stop.sh


Note : oozie log will be in /usr/local/oozie/logs/oozie.log


URL for Oozie Web Console is http://localhost:11000/oozie

Check Oozie status, should be NORMAL.
$ bin/oozie admin -oozie http://localhost:11000/oozie -status

Try the Oozie examples : see the Oozie Examples post on this blog

Oozie client setup may be required on a remote machine
$ cd ..
$ cp oozie/oozie-client-3.3.2.tar.gz .
$ tar xvzf oozie-client-3.3.2.tar.gz
$ mv oozie-client-3.3.2 oozie-client
$ cd bin

Add /home/hduser/oozie-client/bin to PATH in .bashrc or /etc/profile and restart your terminal.

References : Oozie Installation : (Apache Oozie, Rohit Blog, CloudBlog)

Tuesday, February 17, 2015

Install Maven from archive

Why Maven ?

Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information.
  •     Quick project set-up, no complicated build.xml files, just a POM and go
  •     All developers in a project use the same jar dependencies due to centralized POM.
  •     Getting a number of reports and metrics for a project "for free"
  •     Reduce the size of source distributions, because jars can be pulled from a central location
Make sure you installed Java and exported JAVA_HOME in your PATH variable.

Verify java installation on your machine
$java -version 
$javac -version
Download maven
Extract Maven archive
$sudo cp ~/Downloads/apache-maven-*.tar.gz /usr/local/
$sudo su -
$cd /usr/local
$tar -xzf apache-maven-3.1.1-bin.tar.gz

Set Maven environment variables (vi ~/.bashrc or /etc/profile)
export M2_HOME=/usr/local/apache-maven-3.1.1
export MAVEN_OPTS="-Xms256m -Xmx512m"

Add maven bin directory to system path
export PATH=$PATH:$M2_HOME/bin
Verify Maven is installed
$mvn -version

References : Maven

Monday, February 16, 2015

Install Apache Sqoop - SQL to Hadoop (HDFS) and back again

Sqoop - SQL to Hadoop

An efficient tool to transfer bulk data from structured stores (relational databases) to Hadoop (HDFS / Hive / HBase). You can also export data from HDFS into other data warehouses.

Multiple ways to install sqoop :
To install sqoop in Debian distributions (Ubuntu / Debian)

$ sudo apt-get install sqoop

Advantages of installing from Debian packages rather than the tarball (use the tarball if you want to manage the installation yourself):
  •     Handles dependencies
  •     Provides easy upgrades
  •     Automatically installs resources to conventional locations
Download Sqoop  - Download the Sqoop build that matches the Hadoop version installed on your box. If you have Hadoop 1.x then download the Sqoop build for Hadoop 1.x to avoid incompatibility errors on import / export.

$ sudo su
$ (cp ~/Downloads/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz /usr/local && cd /usr/local/
 && tar -zxvf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz)
$ mv sqoop-1.4.5.bin__hadoop-1.0.0 sqoop-1.4.5
$ chown -R hduser:hadoop sqoop-1.4.5

Configure sqoop wrapper with hadoop
$cd /usr/local/sqoop-1.4.5/conf/
$mv sqoop-env-template.sh sqoop-env.sh

Enable HADOOP_COMMON_HOME and HADOOP_MAPRED_HOME by providing the Hadoop installation path. If you want to Sqoop data into HBase or Hive, enable those variables as well and provide their paths in sqoop-env.sh

$vi sqoop-env.sh
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop

In /home/hduser/.bashrc or /etc/profile

export SQOOP_HOME=/usr/local/sqoop-1.4.5
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin:$PIG_HOME/bin:$SQOOP_HOME/bin

Installing JDBC Driver for Sqoop
sudo apt-get install libmysql-java

(This installs the MySQL Connector/J jar; cd /usr/share/java/ to see mysql-connector-java.jar, which is symlinked as mysql.jar.) Link it into Sqoop's lib directory:

sudo ln -s /usr/share/java/mysql.jar /usr/local/sqoop-1.4.5/lib/

Check your sqoop version
$sqoop-version

Create table movies in your mysql
Refer : Download sqoopSample.sql from github for mysql schema.
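Optional: before running the imports below, a quick JDBC sanity check from Java (a sketch only; it assumes mysql-connector-java.jar is on the classpath, and the database name and credentials match the ones used in the export example further down, so adjust them to your setup):

import java.sql.Connection;
import java.sql.DriverManager;

public class JdbcCheck {
    public static void main(String[] args) throws Exception {
        // Class.forName("com.mysql.jdbc.Driver"); // only needed for very old driver versions
        String url = "jdbc:mysql://localhost/sqoop_test";   // same value you pass to sqoop --connect
        try (Connection c = DriverManager.getConnection(url, "root", "root")) {
            System.out.println("Connected to: " + c.getMetaData().getDatabaseProductVersion());
        }
    }
}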

Import data (a table or all tables) from MySQL into Hadoop (HDFS)
Make sure you have started Hadoop (cd $HADOOP_HOME/bin; ./start-all.sh).
-m <n> : use n map tasks to import in parallel
$cd /home/hduser
$sqoop import --connect jdbc:mysql://hostName/dbName --username userName --password password --table tableName --target-dir /samples/movies -m 1
 
$sqoop import-all-tables --connect jdbc:mysql://hostName/dbName --username userName --password password 

Important :
  • If you see the error "Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.JobContext, but interface was expected", the Hadoop version you installed and the Hadoop version your Sqoop build targets do not match. Hadoop 1 and Hadoop 2 have major changes, so keep Hadoop and the Sqoop-for-Hadoop build on the same major version (1.x or 2.x).
  • Make sure the table has a primary key; otherwise you need to specify the number of map tasks yourself (e.g. -m 1).

Export data from hadoop (hdfs) into MySQL
$sqoop export --connect jdbc:mysql://localhost/sqoop_test --table movies_export --username root --password root --export-dir '/samples/movies/' -m 1;

Sqoop command to create a Hive table matching the database table. If the table already exists it will throw an error.
hduser$ sqoop create-hive-table --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table movies

Note : Impala uses the same metastore as Hive, so you can use create-hive-table to import and then query the data in Impala

Sqoop command to import the entire content of the database table into Hive, using a comma (,) to separate fields in the data files
hduser$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table movies --fields-terminated-by ',' --hive-import
 

--hive-overwrite -> overwrites the existing content and writes the data into the Hive table
--hive-import -> imports the table into Hive; needed when the table does not yet exist in Hive
--hive-table -> if not specified, the database table name is used for the table created in Hive

The default field delimiter Sqoop uses when importing into Hive is ^A (Ctrl-A). You can specify --fields-terminated-by ',' for a specific separator.


References : Tutorial Reference,
GoodPlaceToStart 

Format your blog

How to add CSS to blogger to be available in all pages
Blogger -> Posts -> Layout -> Template Designer -> Advanced -> Add CSS

Add below CSS
.preTagCodeStyle {
font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;
background-image:URL(https://github.com/gubs4u/DownloadSamples/blob/master/codebg.gif);
padding:0px;color:#000000;text-align:left;line-height:20px;
}
.codeTagStyle {
 color:#000000;word-wrap:normal;
}
div.note {
    background-color: #fefaee;
    padding: 10pt;
    margin-top: 0.6em;
    margin-bottom: 0.6em;
    margin-right: 0.6em;
    border: 1px dashed;
    border-radius: 3px;
    border-color: #CCCCCC;
}
div.configuration {
    padding: 1em;
    border: 1px dashed #2f6fab;
    color: black;
    background-color: #f9f9f9;
    line-height: 1.1em;
    font-family: Courier New, Courier, mono;
    font-size: 12px;
    font-style: italic;
    display: block;
} 
To format your code add below:
<pre class="preTagCodeStyle"><code class="codeTagStyle">
Your code here
</code></pre>
To add a note use below:

<div class="note">
Your note here
</div>
To add a configuration use below:

<div class="configuration">
Your configuration here
</div>
Note : Go to the HTML tab and add these tags. In 'Compose' mode there is an option 'Show HTML Literally'.

Saturday, February 14, 2015

CSS - Cascading Style Sheets - Make your content Powerful

Quickly read and understand CSS. Only the important topics are covered to get started with CSS.

What is CSS ?
  • CSS stands for Cascading Style Sheets
  • CSS defines how HTML elements are to be displayed
  • Styles were added to HTML 4.0 to solve a problem
  • CSS saves a lot of work
  • External Style Sheets are stored in CSS files
In HTML 4.0, all formatting could (and should!) be removed from the HTML document, and stored in a separate CSS file.

CSS Saves a Lot of Work!

The style definitions are normally saved in external .css files.

With an external style sheet file, you can change the look of an entire Web site by changing just one file!

CSS Syntax :
A CSS rule consists of a selector and a declaration block.
Selector : h1
Declaration : {color:blue;font-size:12px}
Each declaration is a property:value pair; separate multiple declarations with a semicolon.
Ex : (Comments : /* */)
 p {color:red;text-align:center;}

CSS Selectors

Element Selector
The element selector selects elements based on the element name.
You can select all <p> elements on a page like this (all <p> elements will be center-aligned, with a red text color):
p {
    text-align: center;
    color: red;
}

Id Selector
An id should be unique within a page, so the id selector is used if you want to select a single, unique element.
To select an element with a specific id, write a hash character, followed by the id of the element.

#para1 {
    text-align: center;
    color: red;
}
HTML ex : an element with id="para1", e.g. <p id="para1">...</p>

Class Selector
The class selector selects elements with a specific class attribute.
To select elements with a specific class, write a period character, followed by the name of the class:
.center {
    text-align: center;
    color: red;
}
HTML ex : any element with class="center", e.g. <p class="center">...</p>

Note : Do NOT start an ID or Class name with a number!

Grouping Selectors
If you have elements with the same style definitions, like this:
h1 {
    text-align: center;
    color: red;
}

h2 {
    text-align: center;
    color: red;
}
you can group the selectors, to minimize the code. To group selectors, separate each selector with a comma.
h1, h2 {
    text-align: center;
    color: red;
}

CSS How To..
  • External style sheet
  • Internal style sheet
  • Inline style
External style sheet
An external style sheet is ideal when the style is applied to many pages. With one file you can change the look of your entire web site.

In each page you should include a <link> tag inside the <head> section, for example: <link rel="stylesheet" type="text/css" href="mystyle.css">

Internal Style Sheet
An internal style sheet should be used when a single document has a unique style. You define internal styles in the head section of an HTML page, inside a <style> tag.

Inline Styles
An inline style loses many of the advantages of a style sheet (by mixing content with presentation). Use this method sparingly! To use inline styles, add the style attribute to the relevant tag. The style attribute can contain any CSS property.
<h1 style="color:blue;margin-left:30px;">This is a heading.</h1>

Multiple Style Sheets
If some properties have been set for the same selector in different style sheets, the values will be inherited from the more specific style sheet.
External sheet:
h1 {
    color: navy;
    margin-left: 20px;
}
Internal sheet:
h1 {
    color: orange;
}
The style applied to the element is: color: orange; margin-left: 20px;

Multiple Styles Will Cascade into One
Styles can be specified:
  •     inside an HTML element
  •     inside the <head> section of an HTML page
  •     in an external CSS file
Cascading order (Style loading Priority)
  • Inline style (inside an HTML element)
  • Internal style sheet (in the head section)
  • External style sheet
  • Browser default
Note: If the link to the external style sheet is placed after the internal style sheet in the HTML <head>, the external style sheet will override the internal style sheet!

Important HTML tag to know to play around with CSS: the <div> tag
The <div> tag defines a division or a section in an HTML document. It is used to group block elements so they can be formatted with CSS.

<div style="color:#0000FF"> <h3>This is a heading</h3> <p>This is a paragraph.</p> </div>

Other tags you will mostly use with CSS are <table>, <tr>, <th>, etc.
References : CSS from W3Schools, explained neatly with more examples.

Thursday, February 12, 2015

Ubuntu - Install Pig

Pig
Pig is a framework that translates programs written in Pig Latin into jobs that are executed by the MapReduce framework. Pig does not provide any functionality that isn't provided by MapReduce, but it makes some types of data operations significantly easier to perform.

a) local mode
b) distributed/Map Reduce mode. (Cluster mode)

Pig converts all transformations into MapReduce jobs so that the developer can focus mainly on data scripting instead of putting effort into writing a complex set of MR programs.

Using the '-x local' option starts Pig in local mode, whereas executing the pig command without any options starts Pig in cluster mode. In local mode, Pig can access files on the local file system. In cluster mode, Pig accesses files on HDFS.

Pig provides no additional capabilities beyond MapReduce; Pig programs are executed as MapReduce jobs via the Pig interpreter.

1) Download Pig

2) Switch user to root : sudo su

3) cd /usr/local; cp /home/<username>/Downloads/pig-0.14.0.tar.gz /usr/local/

4) tar -xvzf pig-0.14.0.tar.gz

5) chown -R hduser:hadoop pig-0.14.0 (this assumes you followed "Install Hadoop" and have the user hduser)

6) Update /etc/profile PATH or su hduser; vi ~/.bashrc
export PIG_HOME=/usr/local/pig-0.14.0
export PATH=$PATH:$HADOOP_HOME/bin:$PIG_HOME/bin


7) Update your pig.properties file. cd $PIG_HOME/conf; vi pig.properties. Add below lines.
   fs.default.name=hdfs://localhost:9090 (value of port where hdfs is running)
   mapred.job.tracker=localhost:8021 (value of port where MR job is running)

Note : find your above fs.default.name in "$HADOOP_HOME/conf/core-site.xml" and mapred.job.tracker in "$HADOOP_HOME/conf/mapred-site.xml"

8) Create /home/hduser/samples/user.txt (vi /home/hduser/samples/user.txt) in your local file system and add the content below

1,John,Montgomery,Alabama,US
2,David,Phoenix,Arizona,US
3,Sarah,Sacramento,California,US
4,Anoop,Montgomery,Alabama,US
5,Gubs,Villupuram,TamilNadu,India


9) Start Pig

There are two modes to run Pig; these can be updated in the pig.properties file available in the conf directory of the Pig installed location.

    Local mode using the following command:

$pig -x local

10) Load the data file into a relation.
The part to the left of "=" is called the relation or alias. It looks like a variable, but note that it is not a variable: when this statement is executed, no MapReduce task is run. Since our dataset has records with fields separated by a comma, we use the keyword USING PigStorage(','). The file should be available on your local disk.

chararray is the Pig equivalent of String

grunt>user_record = LOAD '/home/hduser/samples/user.txt' USING PigStorage(',') AS (id:INT,name:chararray,city:chararray,state:chararray,country:chararray);


11) Statement to print the content of an alias:
grunt>DUMP user_record;

12) Group the records by state; the remaining commands then give the desired output.

Problem : How many people belong to each state?

grunt>state_record = Group user_record BY state;

13) Generate the output records (count per state):
grunt>output_record = FOREACH state_record GENERATE group, COUNT(user_record.state);

14)
grunt>DUMP output_record;

15) Store the output into a file
grunt>store output_record into '/user/hduser/people_belong_to_state';


DESCRIBE :
To describe the schema
grunt>DESCRIBE user_record;



ILLUSTRATE:
To view the step-by-step execution of a sequence of statements you can use the ILLUSTRATE command (Useful for debugging..):
grunt>ILLUSTRATE output_record;


LIMIT : 
grunt>top_5_records = LIMIT output_record 5;
grunt>DUMP top_5_records;

DISTINCT :

grunt>distinct_records = DISTINCT output_record;
grunt>DUMP distinct_records;

SAMPLE:
grunt>sample_record = sample user_record 0.1; 
grunt>DUMP sample_record;

Here, 0.1 = 10% of your total user_record.

GROUP  :
grunt> group_by_state = GROUP user_record by state;
grunt>DUMP group_by_state;

3 Types of Complex Types

Pig has three complex types: tuples, bags and maps. A tuple prints as (record1). A bag is an unordered collection of tuples and prints as {(record1),(record2)}. A map prints as ['key'#'value'].




ORDER :
grunt>state_order_by = ORDER user_record by state DESC;
grunt>DUMP state_order_by;

Most of these operators read like SQL statements in a relational database. Between the keyword (GROUP, ORDER, ...) and BY you pass the alias (relation) name.

Note :  You can modify / uncomment $PIG_HOME/conf/pig.properties to update logger path (log4jconf), error log path (pig.logfile) and provide necessary path for details 

In the Pig (Grunt) session, reverse-i-search (Ctrl + R) works at the grunt> prompt to recall previously typed scripts.


 

References : Pig Reference,
Tutorial Reference  

Ubuntu - Install Hive and configure Hive Metastore with MySQL

 Install Hive and configure Hive Metastore with MySQL

Hive is designed for data summarization, ad-hoc querying, and analysis of large volumes of data. Hive is a Data Warehousing package built on top of Hadoop.

HiveQL -> Hive Query Language creates MapReduce jobs for your task, and you can tune the number of reducers, the bytes consumed per reducer task, etc.

Hive Vs MapReduce
Hive provides no additional capabilities beyond MapReduce. Hive programs are executed as MapReduce jobs via the Hive interpreter.

Hive is powerful in that you can join table data with log files and query the result using regexp patterns.

1) Download hive

2) Extract Hive (Hope you followed "Install Hadoop" with user hduser)
sudo su
cd /usr/local; cp /home/<username>/Downloads/apache-hive-0.14.0-bin.tar.gz /usr/local/
tar -xvzf apache-hive-0.14.0-bin.tar.gz
mv apache-hive-0.14.0-bin hive-0.14
chown -R hduser:hadoop hive-0.14 


6) Update /etc/profile PATH or su hduser; vi ~/.bashrc
export HIVE_HOME=/usr/local/hive-0.14
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin

7) Running Hive

Note : Make sure HADOOP_HOME is installed and configured.

In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before you can create a table in Hive.


$HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse


8) Remove the .template extension from the files in the conf directory (e.g. mv hive-env.sh.template hive-env.sh)
cd $HIVE_HOME/conf

9) Enter into hive console
$hive (Press Enter to hive shell)

Note : If you just want to experiment with the default Hive metastore (embedded Derby DB) on your local machine, run $ hive, start creating your tables and try Hive Query Language (HQL), then go to Step 18.

10) Configure the Hive metastore to point at MySQL (recommended)
(The libmysql-java package installs the MySQL Connector/J jar; cd /usr/share/java/ to see mysql-connector-java.jar.)
sudo apt-get install mysql-server-5.5
sudo apt-get install libmysql-java
sudo ln -s /usr/share/java/mysql.jar /usr/local/hive-0.14/lib/libmysql-java.jar


11) sudo sysv-rc-conf mysql on (start MySQL on machine start; if sysv-rc-conf is missing on your Ubuntu, run "sudo apt-get install sysv-rc-conf")

 Create the initial database schema using the hive-schema-<version>.sql file located in the $HIVE_HOME/scripts/metastore/upgrade/mysql directory.
 

$ mysql -uroot -proot
mysql>CREATE DATABASE metastore;
mysql>USE metastore;
mysql> SOURCE /usr/local/hive-0.14/scripts/metastore/upgrade/mysql/hive-schema-<version>.sql


 You also need a MySQL user account for Hive to use to access the metastore. It is very important to prevent this user account from creating or altering tables in the metastore database schema. Replace metastorehost (your remote metastore server) with localhost if your metastore server is on the same machine.

mysql> CREATE USER 'hive'@'metastorehost' IDENTIFIED BY 'mypassword';
...
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'metastorehost';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'metastorehost';
mysql> FLUSH PRIVILEGES;
mysql> quit;

 This step shows the configuration properties you need to set in hive-site.xml to configure the metastore service to communicate with the MySQL database, and provides sample settings. Though you can use the same $HIVE_HOME/conf/hive-site.xml or $HIVE_HOME/conf/hive-default.xml on all hosts (client, metastore, HiveServer), hive.metastore.uris is the only property that must be configured on all of them; the others are used only on the metastore host.

12) Update hive-site.xml or hive-default.xml under $HIVE_HOME/conf with below
   <property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore</value>
  <description>the URL of the MySQL database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>

<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>

<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>

13) Start Hive Metastore Service
hive --service metastore &

14) Create vi /home/hduser/user.txt in your local file system and add below content or download user.txt
userid,username,city,state,country
1,John,Montgomery,Alabama,US
2,David,Phoenix,Arizona,US
3,Sarah,Sacramento,California,US
4,Anoop,Montgomery,Alabama,US
5,Gubs,Villupuram,TamilNadu,India

Note : In hive, table names are all case insensitive


15) Go to the Hive prompt and create the table user in the Hive metastore to map the data from user.txt

$ hive (Enter)
hive>CREATE TABLE user(id INT, name STRING, City STRING, State STRING, Country STRING) 
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
 LINES TERMINATED BY '\n' STORED AS TEXTFILE;
hive>show tables;


16) The following command maps user.txt data to the user table by loading data from user.txt.
  (The LOCAL keyword is needed to load a file from the local file system into Hive; it is not necessary if your file is in HDFS.)
  You can add OVERWRITE before 'INTO TABLE' if you want to overwrite the existing user table content.
  To copy a local file into HDFS first, you can use:
hadoop fs -put <localfilesystempath> <hdfsfiledirectory>

Load data into Hive from your local directory. Remove the LOCAL keyword if your file is in HDFS.
$hive>LOAD DATA LOCAL INPATH '/home/hduser/user.txt' INTO TABLE user;

Note : The LOAD DATA LOCAL command copies the file from the local location to /user/hive/warehouse/userdb/user/user.txt. When you execute "DROP TABLE user", the file is also removed from that Hive warehouse location.

17) Query: how many people belong to each state?
$hive>select state, count(state) from user group by state;

18) You can use most relational database (SQL) queries in Hive
hive>describe user;
hive>show create table user;

19) Dropping a Hive table drops both the table and its data file from the warehouse (/user/hive/warehouse/db/table/user.txt). What if a MapReduce program references this data file? See the external table creation in the next step.
hive>DROP TABLE user;

20) Create an external table in Hive, so that multiple tables can refer to the same data file and the data file remains in place if you drop the table.
hive>CREATE EXTERNAL TABLE user(id INT, name STRING, city STRING, state STRING, country STRING) 
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
 LINES TERMINATED BY '\n' STORED AS TEXTFILE;

hive>LOAD DATA INPATH '/samples/user.txt' INTO TABLE user;
 
Note : The LOAD DATA INPATH command moves the file from its HDFS location to /user/hive/warehouse/userdb/user/user.txt, so the file is no longer available in the original HDFS location. Upon DROP TABLE user (an externally created table), the file remains at /user/hive/warehouse/userdb/user/user.txt even after the Hive table is dropped.

21) Exit from hive
$hive>quit;

22) Hive Log file location :
Note : The hive.log path can be found in $HIVE_HOME/conf/hive-log4j.properties: ${java.io.tmpdir}/${user.name}/hive.log (/tmp/hduser/hive.log).

You can run all the DDL (Data Definition Language) commands you use in SQL in Hive Query Language. Refer : Hive DDL Manual


References :  Developer.com About Pig and Hive,
Cloudera - Hive Install

Feel free to add your inputs or point out anything wrong.