Monday, December 19, 2011

NoSQL - Not Only SQL

http://nosql-database.org/
  • NoSQL - Next Generation Database
  • non-relational
  • distributed
  • open-source
  • horizontally scalable
  • schema-free
  • easy replication support
  • Simple API 
  • handles huge amounts of data
  • eventually consistent / BASE (Basically Available, Soft state, Eventually consistent) rather than ACID (atomicity, consistency, isolation, durability) - not transactional like an RDBMS
In 2011, work began on UnQL (Unstructured Query Language), a specification for a query language for NoSQL databases.[8] It is built to query collections (versus tables) of documents (versus rows) with loosely defined fields (versus columns). UnQL is a superset of SQL, within which SQL is a very constrained type of UnQL whose queries always return the same fields (same number, names and types). However, UnQL does not cover data definition language (DDL) SQL statements like CREATE TABLE or CREATE INDEX.

Some NoSQL advocates promote very simple interfaces such as associative arrays or key-value pairs. Other systems, such as native XML databases, promote support of the XQuery standard. Newer systems such as CloudTPS also support join queries.[18]


Sunday, December 18, 2011

HBase - Hadoop Database

What is HBase? (Hadoop Database) An abstraction on top of Hadoop for storing data.
HBase is an open-source, distributed (clustered), sparse (any row, no strict schema), column-oriented store modeled after Google's BigTable. It is a hierarchical data structure.
  • If you have a small amount of data you can go with an RDBMS (relational database management system like MySQL or Oracle). (Max data storage is in TBs - terabytes)
  • If you have a large amount of data, like 100 gigabytes or petabytes of data, you need HBase for faster performance and higher throughput. (Max data storage is in PBs - petabytes)
HBase is developed on top of Hadoop and HDFS.

Data is stored based on RowID -> Family -> Qualifier.
Within the same family (which belongs to a table) you can have different qualifiers (column names). The RowID is the identifier used to fetch data from a row.
A timestamp is added and the data is stored for every version.
By default, get and scan fetch only the latest version (the top record) of each cell.
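
To make the RowID -> Family -> Qualifier model concrete, here is a minimal Java sketch against the old HTable client API (the class name is just for illustration; it assumes the 'test' table with family 'cf' that is created in the shell session further down):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RowFamilyQualifierExample {

    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test"); // table 'test' with family 'cf'

        // RowID -> Family -> Qualifier -> timestamped value
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes("sai-ram"));
        table.put(put);

        // Writing the same row/family/qualifier again creates a new version
        Put newer = new Put(Bytes.toBytes("row1"));
        newer.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes("sai sai"));
        table.put(newer);

        // get returns only the latest version of each cell by default
        Get get = new Get(Bytes.toBytes("row1"));
        Result latest = table.get(get);
        System.out.println(Bytes.toString(latest.getValue(Bytes.toBytes("cf"), Bytes.toBytes("a"))));

        // Ask for older versions explicitly
        get.setMaxVersions(3);
        System.out.println(table.get(get).size() + " versions returned");

        table.close();
    }
}

The shell session below performs the same kind of puts and gets interactively.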


Java Hbase Code for hbase operations example : http://autofei.wordpress.com/2012/04/02/java-example-code-using-hbase-data-model-operations/

Installation based on cloudera : https://ccp.cloudera.com/display/CDHDOC/HBase+Installation#HBaseInstallation-InstallingtheHBaseMasterforStandaloneOperation

Start : sudo /etc/init.d/hadoop-hbase-master start

Master web UI : http://localhost:60010/master-status (default port 60010)

HBase : http://hbase.apache.org/book.html (refer to this link to install a standalone node on your box and start playing with HBase). Good link for commands : http://wiki.apache.org/hadoop/Hbase/Shell

COMMAND GROUPS:
  Group name: general
  Commands: status, version

  Group name: ddl
  Commands: alter, create, describe, disable, drop, enable, exists, is_disabled, is_enabled, list

  Group name: dml
  Commands: count, delete, deleteall, get, get_counter, incr, put, scan, truncate

  Group name: tools
  Commands: assign, balance_switch, balancer, close_region, compact, flush, major_compact, move, split, unassign, zk_dump

  Group name: replication
  Commands: add_peer, disable_peer, enable_peer, remove_peer, start_replication, stop_replication

If you are using binary keys or values and need to enter them in the shell, use
double-quote'd hexadecimal representation. For example:

  hbase> get 't1', "key\x03\x3f\xcd"
  hbase> get 't1', "key\003\023\011"
  hbase> put 't1', "test\xef\xff", 'f1:', "\x01\x33\x40"

For more on the HBase Shell, see http://hbase.apache.org/docs/current/book.html

Playing in the HBase shell :

There is a good tool called h-rider available on GitHub.
What is h-rider? h-rider is a UI application created to provide an easier way to view or manipulate the data saved in the distributed database HBase™, which supports structured data storage for large tables.

root@glakshmanan-laptop:/usr/lib/hbase/bin# hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.90.4-cdh3u2, r, Thu Oct 13 20:32:26 PDT 2011

hbase(main):001:0> create 'test', 'cf'
0 row(s) in 0.5080 seconds

hbase(main):002:0> list
TABLE                                                                                                                                                                                                                        
test                                                                                                                                                                                                                          
1 row(s) in 0.0200 seconds

hbase(main):003:0> scan 'test'
ROW                                                       COLUMN+CELL                                                                                                                                                        
0 row(s) in 0.1320 seconds

hbase(main):004:0> put 'test', 'row1', 'cf:a', 'sai-ram'
0 row(s) in 0.0280 seconds

hbase(main):005:0> put 'test', 'row2', 'cf:b', 'gubs'
0 row(s) in 0.0060 seconds

hbase(main):006:0> put 'test', 'row3', 'cf:c', 'kavitha'
0 row(s) in 0.0070 seconds

hbase(main):007:0> scan 'test'
ROW                                                       COLUMN+CELL                                                                                                                                                        
 row1                                                     column=cf:a, timestamp=1325613705467, value=sai-ram                                                                                                                
 row2                                                     column=cf:b, timestamp=1325613715431, value=gubs                                                                                                                    
 row3                                                     column=cf:c, timestamp=1325613724596, value=kavitha                                                                                                                
3 row(s) in 0.0310 seconds

hbase(main):008:0> get 'test', 'row1'
COLUMN                                                    CELL                                                                                                                                                                
 cf:a                                                     timestamp=1325613705467, value=sai-ram                                                                                                                              
1 row(s) in 0.0320 seconds

hbase(main):009:0> put 'test', 'row1', 'cf:a', 'sai sai'
0 row(s) in 0.0110 seconds

hbase(main):010:0> scan 'test'                        
ROW                                                       COLUMN+CELL                                                                                                                                                        
 row1                                                     column=cf:a, timestamp=1325613788305, value=sai sai                                                                                                                
 row2                                                     column=cf:b, timestamp=1325613715431, value=gubs                                                                                                                    
 row3                                                     column=cf:c, timestamp=1325613724596, value=kavitha                                                                                                                
3 row(s) in 0.0250 seconds

hbase(main):011:0> major_compact 'test'

hbase(main):011:0> drop 'test'

ERROR: Table test is enabled. Disable it first.'

Here is some help for this command:
Drop the named table. Table must first be disabled. If table has
more than one region, run a major compaction on .META.:

  hbase> major_compact ".META."


hbase(main):012:0> disable 'test'
0 row(s) in 2.0440 seconds

hbase(main):013:0> drop 'test'
0 row(s) in 1.0670 seconds


hbase(main):014:0>  exit

Note : For every DDL operation (modify / delete / drop) you need to disable the table first. HBase is case-sensitive.

Import data from hadoop hdfs into hbase
syntax : hadoop jar /usr/lib/hbase/hbase-0.90.4-cdh3u3.jar import

example : hadoop jar /usr/lib/hbase/hbase-0.90.4-cdh3u3.jar import 0005_AlertTemplate hdfs://localhost/export/gubs_export/0005_AlertTemplate

Execute a script of HBase shell commands against HBase
hbase shell < shell_script.sh


Export HBase data to Hadoop (run this command as the hdfs user)
Ex : hadoop jar /usr/lib/hbase/hbase-0.90.4-cdh3u3.jar export 0005_SCMaster /export/gubs_export/0005_SCMaster 2000000

Export Hadoop To LocalSystem
Ex :  hadoop fs -copyToLocal /export/gubs_export/0005_SCMaster /tmp/gubs_export/0005_SCMaster

SCP : scp the file from source box to destination box

Export LocalSystem To Hadoop
Ex : hadoop fs -copyFromLocal /tmp/gubs_export/0005_SCMaster /export/0005_SCMaster

Import Hadoop data into HBase (run this command as the hdfs user)
Ex : hadoop jar /usr/lib/hbase/hbase-0.90.4-cdh3u3.jar import 0005_SCMaster /export/gubs_export/0005_SCMaster


To describe hbase table :
describe '0001_SMSReportData'

Compress a big table in HBase
alter '0001_SMSReportData', {NAME => 'TELEMETRY', COMPRESSION => 'SNAPPY'}

Define TTL - TimeToLive for table data
alter '0030_ConnectionSummary',{NAME=>'WSC',TTL=>'604800'}

Run Major compact for a table
major_compact '0001_ReportEvent'

Keep an HBase table's data in memory when you retrieve it frequently
alter '0031_DeviceMaster' , {NAME=>'D',IN_MEMORY => 'true'}

Java HBase code to scan a table and rename a given qualifier

private static void updateTelemetryTempQualifier() throws IOException {

    Configuration configuration = HBaseConfiguration.create();
    HTablePool hTablePool = new HTablePool(configuration, 10);

    HTableInterface connection = (HTable) hTablePool.getTable(Tables.DEVICE_TELEMETRYAVERAGES
        .getTenantTableName("0001"));
    // The old cells were written with the family name used as the qualifier,
    // so scan for that family:familyName column
    Scan scan = new Scan();
    scan.addColumn(DeviceTelemetryAveragesFamilyAndColumns.AVERAGE_DAILY_TEMP.getBytesForFamilyName(),
        DeviceTelemetryAveragesFamilyAndColumns.AVERAGE_DAILY_TEMP.getBytesForFamilyName());
    scan.setMaxVersions();
    scan.setCacheBlocks(false);
    ResultScanner rsc = connection.getScanner(scan);
    for (Result result : rsc) {
      // Use a typed list so get(i) returns a KeyValue (the raw List would not compile)
      List<KeyValue> kvList1 = result.getColumn(
          DeviceTelemetryAveragesFamilyAndColumns.AVERAGE_DAILY_TEMP.getBytesForFamilyName(),
          DeviceTelemetryAveragesFamilyAndColumns.AVERAGE_DAILY_TEMP.getBytesForFamilyName());
      if (kvList1 != null && !kvList1.isEmpty()) {
        for (int i = 0; i < kvList1.size(); i++) {
          // Re-write each version under the proper qualifier, keeping its timestamp
          Put put = new Put(kvList1.get(i).getRow());
          put.add(DeviceTelemetryAveragesFamilyAndColumns.AVERAGE_DAILY_TEMP.getBytesForFamilyName(),
              DeviceTelemetryAveragesFamilyAndColumns.AVERAGE_DAILY_TEMP.getBytesForQualifier(), kvList1.get(i)
                  .getTimestamp(), kvList1.get(i).getValue());

          connection.put(put);
          // Remove the old family-name-as-qualifier column
          Delete delete = new Delete(kvList1.get(i).getRow());
          delete.deleteColumn(DeviceTelemetryAveragesFamilyAndColumns.AVERAGE_DAILY_TEMP.getBytesForFamilyName(),
              DeviceTelemetryAveragesFamilyAndColumns.AVERAGE_DAILY_TEMP.getBytesForFamilyName());
          connection.delete(delete);
        }
      }
    }
    rsc.close();
  }

Hadoop

Hadoop Overview : (http://www.cloudera.com/what-is-hadoop/hadoop-overview/)
Apache Hadoop is a scalable, fault-tolerant system for data storage and processing. Hadoop is economical and reliable, which makes it perfect to run data-intensive applications on commodity hardware.

Important Note : 
Data locality - A normal application fetches data from an RDBMS and processes it locally within the application. With HDFS, the program is copied to the data servers and processed where the data lives.

Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce.

Note : Data appears to be available on all the data nodes virtually. In reality, each data node holds part of the data as blocks; each block gets a block ID, and the block IDs are stored in the name node.

HDFS (Fault-tolerant Hadoop Distributed File System (HDFS))
The Hadoop software package includes the robust, reliable Hadoop Distributed File System (HDFS), which splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.

HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
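
As a small illustration of how an application talks to HDFS programmatically (a sketch only; the path /tmp/hdfs_example.txt and the class name are illustrative), the FileSystem API can write a file, inspect its replication and block size, and delete it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {

    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits it into blocks and replicates them
        Path path = new Path("/tmp/hdfs_example.txt");
        FSDataOutputStream out = fs.create(path, true);
        out.writeUTF("hello hdfs");
        out.close();

        // Read back the file status (replication factor and block size)
        FileStatus status = fs.getFileStatus(path);
        System.out.println("replication=" + status.getReplication()
            + " blockSize=" + status.getBlockSize());

        fs.delete(path, true);
        fs.close();
    }
}

The hadoop fs commands later in this post do the same kind of copying from the command line.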

MapReduce Software Framework

  • Processes large jobs in parallel across many nodes and combines results.
  • Eliminates the bottlenecks imposed by monolithic storage systems
  • Results are collated and digested into a single output after each piece has been analyzed.

If a machine fails, Hadoop continues to operate the cluster by shifting work to the remaining machines. It automatically creates an additional copy of the data from one of the replicas it manages. As a result, clusters are self-healing for both storage and computation without requiring intervention by systems administrators.
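
To make the MapReduce model above concrete, here is the canonical WordCount job written against the org.apache.hadoop.mapreduce API (a minimal sketch; input and output paths are passed as command-line arguments):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It could be run with something like: hadoop jar wordcount.jar WordCount /input /output (the jar name and paths are just examples).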

Installation : 
Single node installation or single system installation follow : https://ccp.cloudera.com/display/CDHDOC/Installing+CDH3+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode

sudo yum install hadoop-0.20-native - run this if you get a Snappy library error.

Start Hadoop Daemons
for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done

Logs
tail -f /var/log/hadoop-0.20/hadoop-hadoop-*.log

Hadoop Default Port:
http://localhost:50030/jobtracker.jsp
Default port quick reference for all the hadoop jobs : http://www.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/

NameNode:
  • Tracks all metadata and co-ordinates access.
  • Holds filenames, block locations on slave nodes, as well as ownership info.
  • The whole filesystem is stored in RAM for fast look-up.
  • Thus, filesystem size and metadata capacity is limited to the amount of available RAM on the NN.
  • Back this guy up and don’t use commodity hardware here. If you lose the metadata, absolutely nothing can be done, you’ve just got a lot of useless 1′s & 0′s on your datanodes.
  • Works like a journaling file system, fsimage and edits files.
  • Backup is as easy as curl’ing a url to download the fsimage and edits files.
  • Some people use DRDB and heartbeat for a manual fail-over situation.
  • We have the NN write the metadata to two different drives locally as well as to an NFS mount off of the machine.
  • There are a few projects on making a federated namenode, but nothing ready for primetime yet.
  • HDFS doesn’t like lots of small files because it wastes A LOT of RAM on the NN.
  • Default blocksize is 64M, but if you have lots of files smaller than 64M the files may be physically smaller than 64M on the disk (they won’t be padded), the block, as stored in the NN’s RAM will be the same size regardless.
  • Example: A single 64M file will use one block and one ‘slot’ in the NameNodes fsimage, where 64 * 1M files will use 64 slots. Further extrapolated, it can hose a name node.
DataNodes (DN):
  • Holds data in ‘blocks’ stored on the local file system.
  • DNs send heartbeats to the NN and after a period without any the DN is assumed lost. In that case:
  • The NN determines what blocks were lost on that node.
  • The NN finds other DNs with copies of those blocks and tells those DNs to copy the blocks to other DNs.
  • None of the transferring is done through the NN, but direct DN to DN.
  • This is all automatic. With a default replication level of 3, you can lose 2 nodes simultaneously without even worrying about losing any data.
The Secondary NameNode (SNN):
  • Is a very unfortunately named process!
  • It is NOT a failover node for the NN.
  • HDFS works like any other journaling filesystem.
  • The NN keeps the filesystem image in RAM (based off of the fsimage file it started with) and appends changes to a journal (the edits file).
  • The SNN periodically grabs the fsimage and edits files from the NN, replays the edits in RAM and combines it with the fsimage to make a new fsimage, then ships it back to the NN.
  • The SNN should have its own machine because it requires as much RAM as the NN to replay and merge the file system image.
Job Workflow:
  • A user runs a client on a client computer.
  • The client submits a Job (a mapper, a reducer and a list of inputs).
  • The Job is sent to the JobTracker (JT), which is the master process that coordinates all the action. Typically there is only one JT and usually (except in big clusters) is run on the same machine as the NN process.
  • The process that does the actual work is the TaskTracker (TT), which usually resides on the same machine as the DN process, so…
  • Usually a DN/TT pair is a slave and a JT/NN pair is a master.
  • The JT instructs the TT’s to run a task… a part of a Job.
  • The TT will spawn child processes to run a task attempt, an instance of a task running on a slave. If it fails it can be restarted.
  • Task child-processes send heartbeats to the TT which in turn sends heartbeats to the JT. If a task doesn’t send a heartbeat in 10 minutes it is assumed dead and the TT kills its JVM.
  • Additionally, any task that throws an exception is considered to have failed.
  • The TT then reports failures to the JT which in turn tries to reschedule the task (preferably on the same TT).
  • Similarly, any TT that fails to send a heartbeat in 10 minutes is assumed to be dead and the JT schedules its tasks elsewhere.
  • Any TT that fails enough times is blacklisted from that specific Job and furthermore any TT that fails across multiple Jobs is added to a global blacklist.
Above points from blog : http://blog.milford.io/2011/01/slides-and-notes-from-my-recent-hadoop-talk-in-israel/

Copy Data from hdfs server (qa / prod / dev) to local hdfs server

Syntax : hadoop distcp
example : hadoop distcp hdfs://hadoopdbnode1/export/qamove/SEMMaster hdfs://localhost/export/gubs_export

The above is a distributed copy. If it is not working, use a filesystem copy:
hadoop fs -cp /export/backup_energy/0001_ConsolidatedPerformanceSummary hdfs://destination_ip/export/backup_energy/

http://hadoop.apache.org/common/docs/current/distcp.html

How to turn off safe mode in namenode ?
Run this command as the hdfs user : hadoop dfsadmin -safemode leave
To delete the corrupted files : hadoop fsck / -delete
To check the health do  : hadoop fsck /

Kill the job :
hadoop job -kill job_201207241936_0656

List the running job : 
hadoop job -list

Remove the directory in hadoop file system
hadoop fs -rmr /export/

List the files in hadoop file system
hadoop fs -ls /

Use rowcounter function on hbase jar to find row count
hadoop jar hbase-0.90.6-cdh3u4.jar rowcounter

If the hadoop command shows the local filesystem, then the Hadoop configuration is not on the classpath. Download it from Cloudera Manager and reference it in .bashrc.

Copy between two hdfs file systems
hadoop fs -cp  hdfsSourcePath/* hdfs://destinationIP/export/backup_20120824/

Thursday, December 15, 2011

Good People and their talks


  1. Ku Gnanasambandam

MySQL - Convert MySQL date, datetime, timestamp to unix time and Vice versa

To convert Unix time to regular MySQL DATETIME
select from_unixtime('1213907115');
It prints '2008-06-19 16:25:15'

To Convert DATETIME to unixtime
select unix_timestamp('2008-06-19 16:25:15');
It prints 1213907115

To Convert current date & time to unix time.
select unix_timestamp();

To Convert a column of DATETIME type to unixtime.
select unix_timestamp(c8013_date_created) from RealTimeStatus.T8013_PROGRAM;

Wednesday, December 14, 2011

JavaMail using SSL and Gmail


Make sure you have mail.jar. Refer to the example with TLS (Transport Layer Security) / SSL (Secure Sockets Layer): http://www.mkyong.com/java/javamail-api-sending-email-via-gmail-smtp-example/

/**
 *
 */
package sendEmail;

import java.util.Properties;

import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.PasswordAuthentication;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

import org.apache.log4j.Logger;

/**
 * @author glakshmanan
 * Download mail.jar for this program
 */
public class SendMailUsingGmailSSL {

    private static final Logger logger = Logger
            .getLogger(SendMailUsingGmailSSL.class);

    /**
     *
     */
    public SendMailUsingGmailSSL() {
        // TODO Auto-generated constructor stub
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("mail.smtp.host", "smtp.gmail.com");
        props.put("mail.smtp.socketFactory.port", "465");
        props.put("mail.smtp.socketFactory.class",
                "javax.net.ssl.SSLSocketFactory");
        props.put("mail.smtp.auth", "true");
        props.put("mail.smtp.port", "465");

        Session session = Session.getDefaultInstance(props,
                new javax.mail.Authenticator() {
                    protected PasswordAuthentication getPasswordAuthentication() {
                        return new PasswordAuthentication("username", "pwd");
                    }
                });

        try {

            Message message = new MimeMessage(session);
            message.setFrom(new InternetAddress("gubs4u@gmail.com"));
            message.setRecipients(Message.RecipientType.TO,
                    InternetAddress.parse("mailaddress@hostname.com"));
            message.setSubject("Testing Again");
            message.setText("SSL Worked Atleast");

            Transport.send(message);

            logger.info("Mail Successfully Sent");

        } catch (MessagingException e) {
            throw new RuntimeException(e);
        }
    }
}

Tuesday, December 13, 2011

File Operations Example in JAVA


File Operations
package fileOperations;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.log4j.Logger;

/**
 * @author glakshmanan Nov 21st 2011
 */
public class fileOperations {

    static Logger log = Logger.getLogger(fileOperations.class);
    static final String fileName = "/home/glakshmanan/fileOperations.csv";

    /**
     *
     */
    public fileOperations() {
        // TODO Auto-generated constructor stub
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        createFileIfNoExists();
        writeDataInFile();
        readDataFromFile();
        deleteFileIfExist();
    }

    private static void deleteFileIfExist() {
        new File(fileName).delete();
        log.info("File Deleted Successfully");
    }

    private static void readDataFromFile() {
        if (new File(fileName).exists()) {
            FileReader fileReader = null;
            try {
                fileReader = new FileReader(fileName);
            } catch (FileNotFoundException e1) {
                // TODO Auto-generated catch block
                e1.printStackTrace();
            }
            BufferedReader br = new BufferedReader(fileReader);
            try {
                String text;
                while ((text = br.readLine()) != null) {
                    log.info("Data from File : " + text);
                }
                // Close the reader so the file handle is released
                br.close();
            } catch (IOException e) {
                log.info("File Reader error" + e.getMessage());
            }
        }
    }

    private static void writeDataInFile() {
        try {
            if (new File(fileName).exists()) {
                FileWriter fw = new FileWriter(fileName);
                BufferedWriter bw = new BufferedWriter(fw);
                String[] testArr = { "Gubs", "Kavitha", "Sai" };
                for (String test : testArr) {
                    bw.write("Testing FileWritter : " + test + "\n");
                }
                bw.flush();
                bw.close();
                log.info("Written data into the file");
            } else {
                createFileIfNoExists();
            }

        } catch (IOException e) {
            log.error("Failed to write content " + e.getMessage());
        }
    }

    private static void createFileIfNoExists() {
        File fileCreate = new File(fileName);
        try {
            if (fileCreate.exists()) {
                boolean isFileExecutable = fileCreate.canRead()
                        && fileCreate.canWrite();
                if (isFileExecutable) {
                    log.info("File Exist and you can write in file "
                            + fileCreate.getAbsolutePath());
                }
            } else {
                fileCreate.createNewFile();
                // fileCreate.setWritable(false);
                log.info("File Creation Completed, Absolute path name "
                        + fileCreate.getAbsolutePath());
            }

        } catch (IOException e) {
            log.error("File Creation Operation Failed : " + e.getMessage());
        }
    }

}

Sunday, December 11, 2011

Apache Lucene POC with code and explanation


Lucene is an open source, highly scalable text search-engine library available from the Apache Software Foundation. You can use Lucene in commercial and open source applications. Lucene's powerful APIs focus mainly on text indexing and searching. It can be used to build search capabilities for applications such as e-mail clients, mailing lists, Web searches, database search, etc. Web sites like Wikipedia, TheServerSide, jGuru, and LinkedIn have been powered by Lucene.
Lucene has many features. It:
  • Has powerful, accurate, and efficient search algorithms.
  • Calculates a score for each document that matches a given query and returns the most relevant documents ranked by the scores.
  • Supports many powerful query types, such as PhraseQuery, WildcardQuery, RangeQuery, FuzzyQuery, BooleanQuery, and more.
  • Supports parsing of human-entered rich query expressions.
  • Allows users to extend the searching behavior using custom sorting, filtering, and query expression parsing.
  • Uses a file-based locking mechanism to prevent concurrent index modifications.
  • Allows searching and indexing simultaneously.

LuceneIndexWriter Class
package pocUsingLucene;

import java.io.File;
import java.io.IOException;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.LimitTokenCountAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogMergePolicy;
import org.apache.lucene.index.SegmentInfo;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.NIOFSDirectory;
import org.apache.lucene.util.Version;

/**
 * @author glakshmanan
 *
 */
public class LuceneIndexWriter {

    // Logger
    public static final Logger log = Logger.getLogger(LuceneIndexWriter.class);

    // Index writing directory
    private static final String INDEX_DIR = "/home/glakshmanan/index";

    // Constructor
    public LuceneIndexWriter() {
    }

    /**
     * @param args
     * @throws IOException
     * @throws LockObtainFailedException
     * @throws CorruptIndexException
     * @throws ParseException
     */
    public static void main(String[] args) throws CorruptIndexException,
            LockObtainFailedException, IOException, ParseException {

        LuceneIndexWriter luceneIndexWriter = new LuceneIndexWriter();

        // Create IndexWriter object using config
        IndexWriter iw = luceneIndexWriter.createIndexWriter();

        // Delete all documents
        luceneIndexWriter.deleteAllDocuments(iw);

        // Create documents and index them
        luceneIndexWriter.createDocument(iw);

        // Close the indexWriter
        luceneIndexWriter.closeIndexWriter(iw);

        // Delete the document
        // luceneIndexWriter.deleteDocument();

        log.info("Index writer wrote the data successfully");
    }

    private void deleteAllDocuments(IndexWriter iw) throws IOException {
        // Delete existing docs, segments and terms before we add new data
        iw.deleteAll();
        iw.commit();
    }

    /**
     * This method will delete the document from the index directory using the
     * Term
     *
     * @throws IOException
     */
    @SuppressWarnings("unused")
    private void deleteDocument() throws IOException {
        Directory dir = NIOFSDirectory.open(new File(INDEX_DIR));
        // Open the reader read-write so deletes are allowed
        IndexReader ir = IndexReader.open(dir, false);
        ir.deleteDocuments(new Term("name"));
        // This will flush the commit and close
        ir.close();
    }

    /**
     * This method will commit changes and close the IndexWriter object
     *
     * @param iw
     * @throws IOException
     */
    private void closeIndexWriter(IndexWriter iw) throws IOException {
        // Optimize, commit and close the IndexWriter
        iw.optimize();
        iw.commit();
        iw.close();
    }

    /**
     * This method will create a document with fields and write it out using
     * the indexWriter
     *
     * @param iw
     * @throws CorruptIndexException
     * @throws IOException
     */
    private void createDocument(IndexWriter iw) throws CorruptIndexException,
            IOException {
        // Creating a document to add using indexWriter
        Document doc = new Document();

        // Value is just stored, not indexed
        Field idField = new Field("id", "1", Field.Store.YES, Field.Index.NO);
        // Value is stored, indexed, analyzed and tokenized
        Field nameField = new Field("name", "Gubendran", Field.Store.YES,
                Field.Index.ANALYZED);
        // Field is not stored and not analyzed
        Field addressField = new Field("address", "Jersey city",
                Field.Store.NO, Field.Index.NOT_ANALYZED);

        // Combine the full text and store it in one field so it is easy to
        // search when we need to show the full content
        String searchFullText = "Gubendran" + "Jersey City";
        Field searchFullTextField = new Field("content", searchFullText,
                Field.Store.YES, Field.Index.ANALYZED);

        // Add fields into the document
        doc.add(idField);
        doc.add(nameField);
        doc.add(addressField);
        doc.add(searchFullTextField);

        // Adding a NumericField example
        /*
         * NumericField numericField = new NumericField("title",
         * Field.Store.YES, true);
         * numericField.setIntValue(Integer.parseInt(value));
         * doc.add(numericField);
         */

        iw.addDocument(doc);
    }

    /**
     * This method will create the indexWriter object and return it using the
     * index configuration
     *
     * @return
     * @throws CorruptIndexException
     * @throws LockObtainFailedException
     * @throws IOException
     */
    private IndexWriter createIndexWriter() throws CorruptIndexException,
            LockObtainFailedException, IOException {
        // 1. Create the index
        File file = new File(INDEX_DIR);
        // On a Linux system you can use NIOFSDirectory to make it faster
        Directory index = NIOFSDirectory.open(file);

        // 0. Specify the analyzer for tokenizing text.
        // The same analyzer should be used for indexing and searching. For
        // English use StandardAnalyzer
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);

        // If the file size is huge, this is the alternative to MaxFieldLength
        LimitTokenCountAnalyzer limitTokenCountAnalyzer = new LimitTokenCountAnalyzer(
                analyzer, Integer.MAX_VALUE);

        // Documents are stored in segments. If there are more documents we
        // need to keep them in different segments to make indexing faster
        final int segmentSize = 10;
        int mergeFactorSize = 100;
        int RAMBufferSizeMB = 32;
        int maxBufferedDocs = 500;
        LogMergePolicy logMergePolicy = new LogMergePolicy() {

            @Override
            protected long size(SegmentInfo info) throws IOException {
                return segmentSize;
            }
        };
        // This parameter determines how many documents you can store in the
        // original segment index and how often the segment indexes on disk
        // are merged together
        logMergePolicy.setMergeFactor(mergeFactorSize);

        logMergePolicy.setUseCompoundFile(true);
        // This parameter determines the maximum number of documents per
        // segment index. The default value is Integer.MAX_VALUE. Large values
        // are better for batched indexing and speedier searches.
        logMergePolicy.setMaxMergeDocs(Integer.MAX_VALUE);

        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_31,
                limitTokenCountAnalyzer);
        config.setRAMBufferSizeMB(RAMBufferSizeMB);
        config.setMergeScheduler(new SerialMergeScheduler());
        config.setMaxBufferedDocs(maxBufferedDocs);
        // Attach the merge policy configured above (it was otherwise unused)
        config.setMergePolicy(logMergePolicy);

        IndexWriter iw = new IndexWriter(index, config);
        return iw;
    }
}

Lucene IndexSearcher

package pocUsingLucene;

import java.io.File;
import java.io.IOException;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.NIOFSDirectory;
import org.apache.lucene.util.Version;

/**
 * @author glakshmanan
 *
 */
public class LuceneIndexSearcher {

    // Logger
    public static final Logger log = Logger
            .getLogger(LuceneIndexSearcher.class);

    // IndexSearcher directory
    private static final String INDEX_DIR = "/home/glakshmanan/index";

    // Constructor
    public LuceneIndexSearcher() {
        super();
    }

    /**
     * @param args
     * @throws IOException
     * @throws ParseException
     */
    public static void main(String[] args) throws IOException, ParseException {

        // Get the argument for search
        String searchText = args.length > 0 ? args[0] : "dummies";

        LuceneIndexSearcher luceneIndexSearcher = new LuceneIndexSearcher();
        // Create indexSearcher object for search
        IndexSearcher searcher = luceneIndexSearcher.createIndexSearcher();

        // Create QueryParser object and collector object for search
        Query queryParser = luceneIndexSearcher.createQueryParser(searchText);

        // Using TopScoreDocCollector, collect the documents for the
        // corresponding search value. hitsPerPage is important for usage of
        // paging and filtering. Paging and filtering are needed to make
        // search faster
        int hitsPerPage = 10;
        TopScoreDocCollector collector = TopScoreDocCollector.create(
                hitsPerPage, true);
        searcher.search(queryParser, collector);

        ScoreDoc[] scoreDocs = collector.topDocs().scoreDocs;

        // Display the search values using top scored docs
        luceneIndexSearcher.displaySearchResult(searcher, scoreDocs);

        // The searcher can only be closed when there is no need to access the
        // documents any more.
        searcher.close();
    }

    /**
     * Display the search value result from ScoreDocs
     *
     * @param searcher
     * @param scoreDocs
     * @throws CorruptIndexException
     * @throws IOException
     */
    private void displaySearchResult(IndexSearcher searcher,
            ScoreDoc[] scoreDocs) throws CorruptIndexException, IOException {
        log.info("List of docs found : " + scoreDocs.length);
        for (ScoreDoc scoreDoc : scoreDocs) {
            Document doc = searcher.doc(scoreDoc.doc);
            log.info("Name : " + doc.get("name"));
        }
    }

    /**
     * Create QueryParser object for indexSearcher using search text
     *
     * @param searchText
     * @return
     * @throws ParseException
     */
    private Query createQueryParser(String searchText) throws ParseException {

        // Create a QueryParser object for search using the search field and
        // search value. Use the same analyzer you used for the indexWriter
        // for the indexSearcher as well
        Query queryParser = new QueryParser(Version.LUCENE_31, "name",
                new StandardAnalyzer(Version.LUCENE_31)).parse(searchText);

        // NumericRangeQuery for a numeric search
        // Query queryParser = NumericRangeQuery.newIntRange("title", 4, 40,
        // 500, true, true);
        return queryParser;
    }

    /**
     * Create IndexSearcher object for the searcher
     *
     * @return
     * @throws IOException
     */
    private IndexSearcher createIndexSearcher() throws IOException {
        // 1. Open the index
        File file = new File(INDEX_DIR);
        Directory index = NIOFSDirectory.open(file);

        // Create indexSearcher object using the index directory
        IndexSearcher indexSearcher = new IndexSearcher(index, true);

        return indexSearcher;
    }

}


