A STUDY ON BIG DATA HADOOP Nandha Kumar
E-Mail: nkumarram@gmail.com
used in the big data analytics tools leads to more efficient, faster and better decisions and performance, which are massively preferred by analysts, business users and researchers [1].

Fig 2. Big Data Architecture

Here's a closer look at what's in the image and the relationship between the components:

• Interfaces and feeds: On either side of the diagram are specifications of the interfaces and feeds into and out of both internally managed data and data feeds from external sources. To understand how big data works in the real world, start by understanding this necessity of the data.
• Redundant physical infrastructure: The supporting physical infrastructure is a fundamental necessity for the operation and scalability of a big data architecture. Without the availability of robust physical infrastructures, big data would not have emerged as an important trend.
• Security infrastructure: The more important big data analysis becomes to companies, the more important it is to secure the data. For example, a healthcare company will probably want to use big data applications to determine changes in demographics or shifts in patient needs and treatments.
• Operational data sources: When we think about big data, we have to incorporate all the data sources that will give us a complete picture of the business and show how the data impacts the way we operate the business.

II. BIG DATA DATABASE TOOLS

A. Hadoop

The name Hadoop has become synonymous with big data. It is an open-source software framework for distributed storage of very large datasets on clusters of computers. This means that we can scale our data up and down without having to worry about hardware and network failures. Hadoop provides massive amounts of storage for any kind of data, massive processing power and the ability to handle practically limitless simultaneous tasks or jobs. Hadoop is not for the beginner: to truly tap its power, we really need to know the basics of Java. Learning it is an investment, but Hadoop is certainly worth the effort, since many other companies and technologies run on it or integrate with it.

Hadoop involves a cluster of storage/computing nodes (or machines), of which one node is assigned as the master and the others as slave nodes. HDFS [18] maintains each file as a set of same-size blocks (except the last block). Replicas of these blocks are maintained on various nodes in the cluster for the sake of reliability and fault tolerance. The Map-Reduce computing technique divides the whole processing task into smaller pieces and assigns each to a slave machine where the required data is available, executing the computation right at that node. In this way it saves the significant time and cost involved in transferring data from the data server to the computing machine. Following are the advantages, disadvantages and latest version of Hadoop.

i. Advantages of Hadoop

• Open source: Being open source, Hadoop is freely available online [3].
• Cost effective: Hadoop saves cost as it utilizes a cheaper, lower-end cluster of commodity machines instead of a costlier high-end server. Also, the distributed storage of data and the transfer of computing code rather than data saves high transfer costs for large datasets [3].
• Scalable: To handle larger data and maintain performance, Hadoop is capable of scaling linearly by adding nodes to clusters [3].
• Fault tolerant and robust: Hadoop replicates data blocks on multiple nodes, which facilitates recovery from a single node or machine failure. Also, Hadoop's architecture deals with frequent malfunctions in hardware: if a node fails, the task of that node is reassigned to some other node in the cluster [4].
• High throughput: Due to batch processing, high throughput is achieved [4].
• Portability: The Hadoop architecture can be effectively ported [5] while working with several commodity operating systems and kinds of hardware that may be assorted [6].
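As a rough illustration of the Map-Reduce technique described above — splitting a job into smaller map tasks over data blocks, then aggregating the intermediate results in a reduce step — here is a minimal single-process word-count sketch in plain Python (a conceptual simulation, not the actual Hadoop API; the function names and data are made up for illustration):

```python
from collections import defaultdict

def map_phase(block):
    # Each "slave node" maps its local block of text to (word, 1) pairs.
    return [(word, 1) for word in block.split()]

def reduce_phase(pairs):
    # The reduce step aggregates the intermediate pairs by key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The input is split into blocks, as HDFS would split a file.
blocks = ["big data big", "data hadoop big"]
intermediate = []
for block in blocks:  # in Hadoop these map tasks run in parallel on slave nodes
    intermediate.extend(map_phase(block))
print(reduce_phase(intermediate))  # {'big': 3, 'data': 2, 'hadoop': 1}
```

In a real cluster the map tasks run on the nodes that already hold the blocks, which is exactly the "move computation to the data" saving described above.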
Fig 3. Hadoop Architecture

B. Cassandra

Cassandra is a free and open-source distributed database management system from Apache, designed to handle large amounts of data across many commodity servers while providing high availability with no single point of failure. It offers robust support for clusters spanning multiple datacenters [1], with asynchronous masterless replication allowing low-latency operations for all clients. Cassandra also places a high value on performance. In 2012, University of Toronto researchers studying NoSQL

C. HBase

HBase is a data model, similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. This section provides an introduction to HBase, the procedure to set up HBase on the Hadoop File System, and ways to interact with the HBase shell. It also describes how to connect to HBase using Java and how to perform basic operations on HBase using Java.
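To make the Bigtable-style model concrete, here is a toy sketch in Python of how HBase organizes data — rows addressed by a row key, with cells grouped under "family:qualifier" columns — using only plain dictionaries (a conceptual illustration under assumed names, not the real HBase client API):

```python
# A toy HBase-style table: row key -> {"family:qualifier": value}.
table = {}

def put(row_key, column, value):
    # 'column' is written as "family:qualifier", as in the HBase shell.
    table.setdefault(row_key, {})[column] = value

def get(row_key):
    # Quick random access by row key is the primary access pattern.
    return table.get(row_key, {})

put("user1", "info:name", "Alice")
put("user1", "stats:logins", 42)
put("user2", "info:name", "Bob")
print(get("user1"))  # {'info:name': 'Alice', 'stats:logins': 42}
```

Rows need not share the same columns, which is how HBase stores sparse structured data efficiently.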
D. CouchDB

• Built for offline: CouchDB can replicate to devices (like smartphones) that can go offline, and it handles data sync for us when the device is back online.
• Distributed architecture with replication: CouchDB was designed with bi-directional replication (or synchronization) and offline operation in mind. This means multiple replicas can have their own copies of the same data, modify it, and then sync those changes later.
• Document storage: CouchDB stores data as documents, i.e. key/value pairs expressed as JSON. Field values can be simple things like characters, numbers, or dates, but ordered lists and associative arrays can also be used. Every document in a CouchDB database has a unique id and there is no required document schema definition.
• Eventual consistency: CouchDB guarantees eventual consistency to be able to provide both availability and partition tolerance.
• Map/Reduce views and indexes: The stored data is structured using views. In CouchDB, each view is constructed by a JavaScript function that acts as the Map half of a map/reduce operation.
• HTTP API: All items have a unique URI that gets exposed via HTTP. CouchDB uses the HTTP methods POST, GET, PUT and DELETE for the four basic Create, Read, Update and Delete operations on all resources.

E. MongoDB

MongoDB is a free and open-source cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schemas. The database is developed by MongoDB Inc. and is published in combination with the GNU Affero General Public License and the Apache License. Any relational database has a typical schema design that shows a number of tables and the relationships between these tables, while in MongoDB there is no concept of a relationship.

Main features

• Ad hoc queries: MongoDB supports field, join and range queries, and regular expression searches [12]. Queries can return specific fields of documents and can also include user-defined JavaScript functions. Queries can also be configured to return a random sample of results of a given size.
• Indexing: Fields in a MongoDB document can be indexed with primary and secondary indices.
• Replication: MongoDB provides high availability with replica sets [12]. A replica set consists of two or more copies of the data, and each replica set member may act in the role of primary or secondary replica at any time. All writes and reads are done on the primary replica by default. Secondary replicas maintain a copy of the data of the primary using built-in replication. When a primary replica fails, the replica set conducts an election process to determine which secondary should become the primary. Secondaries can optionally serve read operations, but that data is only eventually consistent by default.
• Load balancing: MongoDB scales horizontally using sharding [12]. The user chooses a shard key, which determines how the data in a collection will be distributed. The data is split into ranges and distributed across multiple shards. Alternatively, the shard key can be hashed to map to a shard, enabling an even data distribution. MongoDB can run over multiple commodity servers, balancing the load or duplicating data to keep the system up and running in case of hardware or network failure.

Advantages of MongoDB over RDBMS

• Structure of a single object is clear.
• Schema-less: MongoDB is a document database in which one collection holds different documents. The number of fields, content and size of the documents can differ from one document to another.
• No complex joins.
• Deep query-ability: MongoDB supports dynamic queries on documents using a document-based query language that is nearly as powerful as SQL.
• Tuning.
• Ease of scale-out: MongoDB is easy to scale.
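The schema-less, ad hoc query style described above can be sketched in plain Python as follows (a list of dictionaries stands in for a MongoDB collection, and `find` for an equality query — a conceptual simulation, not the real MongoDB driver API):

```python
# A "collection" of JSON-like documents; note the fields differ per document.
collection = [
    {"_id": 1, "name": "Alice", "age": 30, "city": "Oslo"},
    {"_id": 2, "name": "Bob", "age": 25},
    {"_id": 3, "name": "Carol", "city": "Lima"},
]

def find(query):
    # Match documents whose fields equal every key/value in the query,
    # the way an ad hoc equality query works in a document database.
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

print(find({"age": 25}))       # [{'_id': 2, 'name': 'Bob', 'age': 25}]
print(find({"city": "Oslo"}))  # matches only documents that have that field
```

Because no schema is enforced, documents missing a queried field are simply not matched, rather than being a schema error.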
Tool | License | Languages | Operating systems | Data types
HBase | … | … | … | … (row continues beyond this excerpt)
CouchDB | Commercial and Open Source | JavaScript, PHP, Erlang | Windows, Ubuntu | Structured, Semi-Structured and Unstructured data
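The hashed shard-key load balancing mentioned in the MongoDB section above can be sketched as follows (a simplified Python illustration with a made-up shard count; real MongoDB uses its own hash function and chunk ranges):

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size for illustration

def shard_for(shard_key):
    # Hash the shard key and map it onto one of the shards, so that
    # documents spread evenly regardless of key ordering.
    digest = hashlib.md5(str(shard_key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in range(1000):
    shards[shard_for(user_id)].append(user_id)

# Each shard receives roughly 1000 / 4 = 250 documents.
print([len(docs) for docs in shards.values()])
```

Hashing breaks up monotonically increasing keys (like auto-incremented ids) that would otherwise all land on one range-based shard.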