
Journal of Analysis and Computation (JAC)

(An International Peer Reviewed Journal), www.ijaconline.com, ISSN 0973-2861


National Conference on: Research Challenges in Information Systems of Computer Science
(NCRCISCS)

A STUDY ON BIG DATA HADOOP AND ITS DATABASE TOOLS


R. NANDHAKUMAR1 AND ANTONY SELVADOSS THANAMANI2

1 Academician, Department of Computer Science, NGM College, Pollachi-642001, India
2 Associate Professor & Head, Department of Computer Science, NGM College, Pollachi-642001, India

E-mail: nkumarram@gmail.com

ABSTRACT

In the information era, a wide variety of data has become available to organizational decision makers. Big data refers to datasets that are not only big, but also high in variety, volume and velocity, which makes them difficult to handle using traditional database tools and techniques. Due to the rapid growth of such data, solutions need to be analyzed and presented in order to handle and extract information and knowledge from these datasets. Furthermore, organizational decision makers need to be able to gain valuable insight from such varied and rapidly changing data, ranging from daily transactional processes to customer interactions and social media data. Such insight can be provided through big data analytics, which is the application of advanced analytics techniques to big data. This paper aims to analyze and compare some of the different database tools which can be applied to big data, as well as the opportunities offered by the application of big data analytics in various organizational domains.

Keywords—Big data analytics, Big data databases, Big data database tools.
I. INTRODUCTION

The term "Big Data" was first introduced to the computing world by Roger Magoulas of O'Reilly Media in 2005, to describe a huge amount of data that traditional database management techniques cannot manage, access and process due to the scalability, flexibility and size of the data. A study on the evolution of Big Data as a research and scientific topic shows that the term "Big Data" was present in research starting from the 1970s, but began to appear widely in publications in 2008. Nowadays the Big Data concept is treated from different points of view, covering its implications in many fields. According to MIKE2.0, the open-source standard for information management, Big Data is defined based on size and consists of large, complex and independent collections of data sets, each with the potential for interaction. In addition, an important aspect of Big Data is the fact that it cannot be handled with standard database management techniques, due to the versatility, inconsistency and unpredictability of the possible combinations.

Big Data has four aspects, as shown in Fig 1:

Volume: refers to the quantity of data captured by the company. This data must be used further to obtain important knowledge about the organization.
Velocity: refers to the time in which Big Data can be processed to produce results.
Variety: refers to the types of data that Big Data can integrate; this data can be structured as well as unstructured.
Veracity: refers to the degree to which a decision maker trusts the information used to take a decision. Getting the right correspondence in Big Data is therefore very important for future business performance.

Fig 1. V's of Big Data

The amount of data stored in various sectors, and the way it is created (images, audio, text, etc.), can vary from one organization to another.

From a practical point of view, the graphical interfaces used in big data analytics tools lead to more efficient, faster and better decisions and performance, which is why such tools are strongly preferred by analysts, business users and researchers [1].

Fig 2. Big Data Architecture

Here's a closer look at what's in the figure and the relationship between the components:

• Interfaces and feeds: On either side of the diagram are the interfaces and feeds into and out of both internally managed data and data feeds from external sources. To understand how big data works in the real world, start by understanding this necessity of the data.
• Redundant physical infrastructure: The supporting physical infrastructure is a fundamental necessity for the operation and scalability of a big data architecture. Without the availability of robust physical infrastructure, big data would not have emerged as an important trend.
• Security infrastructure: The more important big data analysis becomes to companies, the more important it is to secure the data. For example, a healthcare company will probably want to use big data applications to determine changes in demographics or shifts in patient needs and treatments.
• Operational data sources: When thinking about big data, all the data sources that give a complete picture of the business must be incorporated, in order to see how the data affects the way the business operates.

II. BIG DATA DATABASE TOOLS

A. Hadoop

The name Hadoop has become synonymous with big data. It is an open-source software framework for distributed storage of very large datasets on clusters of computers. This means that data can be scaled up and down without having to worry about hardware and network failures. Hadoop provides massive amounts of storage for any kind of data, massive processing power and the ability to handle practically limitless concurrent tasks or jobs. Hadoop is not for the beginner: to truly exploit its power, a basic knowledge of Java is needed. It is a commitment, but Hadoop is certainly worth the effort, since many other companies and technologies run on it or integrate with it. Hadoop involves a cluster of storage/computing nodes (machines), of which one node is assigned as the master and the others as slave nodes. HDFS [18] stores each file as a sequence of equal-sized blocks (except the last block), and several replicas of these blocks are maintained on different nodes in the cluster for the sake of reliability and fault tolerance. The MapReduce computing technique divides the whole processing task into smaller pieces and assigns them to the slave machines where the required data is available, executing the computation right at those nodes. In this way it saves the significant time and cost involved in transferring data from a data server to the computing machine.
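To make the data-locality idea concrete, here is a minimal sketch of a MapReduce job, the classic word count, written against the standard org.apache.hadoop.mapreduce Java API. It is an illustrative sketch rather than a tuned production job; the input and output paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs on the slave node that holds each HDFS block,
  // emitting (word, 1) pairs without moving the block over the network.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compiled into a jar, such a job would typically be launched with something like hadoop jar wordcount.jar WordCount <input> <output>, and the framework takes care of scheduling the map tasks onto the nodes holding the data.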
The advantages and disadvantages of Hadoop are as follows.

i. Advantages of Hadoop

• Open source: Being open source, Hadoop is freely available online [3].
• Cost effective: Hadoop saves cost as it utilizes cheaper, lower-end clusters of commodity machines instead of costlier high-end servers. Also, distributed storage of data and the transfer of computing code rather than data saves high transfer costs for large data sets [3].
• Scalable: Hadoop can handle larger data while maintaining performance, and is capable of scaling linearly by putting additional nodes in clusters [3].
• Fault tolerant and robust: Hadoop replicates data blocks on multiple nodes, which facilitates recovery from a single node or machine failure. Hadoop's architecture also deals with frequent malfunctions in hardware: if a node fails, the task of that node is reassigned to some other node in the cluster [4].
• High throughput: Due to batch processing, high throughput is achieved [4].
• Portability: The Hadoop architecture can be effectively ported [5] while working with assorted operating systems and hardware [6].


ii. Disadvantages of Hadoop

• Single point of failure: Hadoop's HDFS as well as its MapReduce function (in versions up to 2.x) suffer from single points of failure [7].
• Lower efficiency / poorer performance than a DBMS [7]: Hadoop shows lower efficiency due to its inability to switch to the next stage before completing the tasks of the previous stage, which makes it unsuitable for pipelined parallel processing, and its runtime scheduling degrades efficiency per node. Unlike an RDBMS, it has no specific optimization of execution plans that could minimize the transfer of data among nodes.
• Inefficient dealing with small files: As HDFS is optimized for high throughput [8], it does not suit random reads on small files [9].
• Not suitable for real-time access: MapReduce and HDFS employ a batch processing architecture, which does not fit real-time access [8].
Fig 3. Hadoop Architecture

B. Cassandra

Cassandra is a free, open-source, distributed database management system from Apache, designed to handle large amounts of data across many commodity servers while providing high availability with no single point of failure. It offers robust support for clusters spanning multiple datacenters [1], with asynchronous masterless replication allowing low-latency operations for all clients.

Cassandra also places a high value on performance. In 2012, University of Toronto researchers studying NoSQL systems concluded that "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments", although "this comes at the price of high write and read latencies".

Main features
• Decentralized: Every node in the cluster has the same role, so there is no single point of failure. Data is distributed across the cluster (so that each node contains different data), but there is no master, as every node can service any request. Replication is supported and multi-datacenter replication strategies are configurable [17]. Cassandra is designed as a distributed system, for deployment of large numbers of nodes across multiple data centers.
• Scalability: Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
• Fault-tolerant: Data is automatically replicated to multiple nodes for fault tolerance. Replication across multiple data centers is supported, and failed nodes can be replaced with no downtime.
• Tunable consistency: Writes and reads offer a tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable", with the quorum level in the middle [10].
• MapReduce support: Cassandra has Hadoop integration, with MapReduce support. There is also support for Apache Pig and Apache Hive.
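As one way to see the tunable consistency described above in practice, the following sketch uses the DataStax Java driver (3.x-style API) to write at QUORUM and read at ONE. The contact point, keyspace and table (demo.users) are hypothetical, and the exact driver API differs between driver generations.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class CassandraExample {
  public static void main(String[] args) {
    // Any node can be the contact point: there is no master.
    try (Cluster cluster = Cluster.builder()
             .addContactPoint("127.0.0.1").build();
         Session session = cluster.connect()) {

      session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
          + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
      session.execute("CREATE TABLE IF NOT EXISTS demo.users "
          + "(id int PRIMARY KEY, name text)");

      // Write at QUORUM: a majority of the three replicas must acknowledge.
      SimpleStatement insert = new SimpleStatement(
          "INSERT INTO demo.users (id, name) VALUES (1, 'alice')");
      insert.setConsistencyLevel(ConsistencyLevel.QUORUM);
      session.execute(insert);

      // Read at ONE: fastest, but may lag behind the latest write.
      SimpleStatement select = new SimpleStatement(
          "SELECT name FROM demo.users WHERE id = 1");
      select.setConsistencyLevel(ConsistencyLevel.ONE);
      ResultSet rs = session.execute(select);
      Row row = rs.one();
      System.out.println(row == null ? "not found" : row.getString("name"));
    }
  }
}

Dropping the read level from QUORUM to ONE trades consistency for latency, which is exactly the knob the "tunable consistency" feature refers to.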

C. HBase

HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It runs on top of the Hadoop File System and can be used interactively through the HBase shell or programmatically through its Java API, which supports basic read and write operations.


Fig 4. HBase Architecture

HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable. It leverages the fault tolerance provided by the Hadoop File System (HDFS) and is the part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System. One can store data in HDFS either directly or through HBase: data consumers read and access the data in HDFS randomly using HBase, which sits on top of the Hadoop File System and provides read and write access.

HBase is a column-oriented database and its tables are sorted by row. The table schema defines only column families, which are the key-value pairs. A table can have multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on disk, and each cell value of the table has a timestamp. In short, in HBase:
• A table is a collection of rows.
• A row is a collection of column families.
• A column family is a collection of columns.
• A column is a collection of key-value pairs.

Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.

Applications of HBase
• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever fast random access to available data is needed.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
• It hosts very large tables on top of clusters of commodity hardware.
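To illustrate the Java client API mentioned in the features above, here is a minimal sketch using the standard hbase-client API (HBase 1.x/2.x style). The table name users and the column family cf are hypothetical, and the table is assumed to already exist with that family.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row key "row1", column family "cf", column "name".
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"),
          Bytes.toBytes("alice"));
      table.put(put);

      // Random read of the same row by key.
      Get get = new Get(Bytes.toBytes("row1"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(value));
    }
  }
}

Because rows are addressed directly by key, this kind of point read stays fast even as the table grows across many region servers.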
TABLE I: COMPARISON OF CASSANDRA AND HBASE
                 Cassandra                           HBase
Data Model       Columnar database                   Columnar database
Interface        HTTP/REST                           HTTP/REST
Object Storage   Database contains data in           Database contains data in
                 columns (key-value pairs)           columns (key-value pairs)
Query Method     MapReduce + CQL                     MapReduce + Drill
Replication      Peer-to-peer with multiple          Cluster replication
                 data centers
Concurrency      Atomicity, Isolation                MVCC
Written in       Java                                Java

D. CouchDB

Apache CouchDB is open-source database software that focuses on ease of use and on having an architecture that completely embraces the Web [11]. It has a document-oriented NoSQL database architecture and is implemented in the concurrency-oriented Erlang language; it uses JSON to store data, JavaScript as its query language with MapReduce, and HTTP for an API [11]. It was first released in 2005 and became an Apache Software Foundation project in 2008. Unlike a relational database, CouchDB does not store data and relationships in a table format. Instead, each database is a collection of independent documents, and each document maintains its own data and self-contained structure definition. An application may access and manipulate multiple databases, such as one stored on a user's mobile phone and another on a server. Document metadata contains revision information, making it possible to merge any differences that may have occurred while the databases were disconnected. CouchDB implements a form of multi-version concurrency control (MVCC), so it does not lock the database file during writes. Conflicts are left to the application to resolve; resolving a conflict usually involves first merging data into one of the documents and then deleting the old one [11].

Main features
• ACID Semantics: CouchDB provides ACID semantics [11]. It does this by implementing a form of multi-version concurrency control, meaning that it can handle a large volume of concurrent readers and writers without conflict.


• Built for Offline: CouchDB can replicate to devices (like smartphones) that can go offline, and it handles data synchronization when the device is back online.
• Distributed Architecture with Replication: CouchDB was designed with bi-directional replication (or synchronization) and offline operation in mind. This means multiple replicas can have their own copies of the same data, modify it, and then sync those changes later.
• Document Storage: CouchDB stores data as documents, i.e., key/value pairs expressed as JSON. Field values can be simple things like characters, numbers, or dates, but ordered lists and associative arrays can also be used. Every document in a CouchDB database has a unique id, and there is no required document schema definition.
• Eventual Consistency: CouchDB guarantees eventual consistency in order to provide both availability and partition tolerance.
• Map/Reduce Views and Indexes: The stored data is structured using views. In CouchDB, each view is constructed by a JavaScript function that acts as the Map half of a map/reduce operation.
• HTTP API: All items have a unique URI that is exposed via HTTP. CouchDB uses the HTTP methods POST, GET, PUT and DELETE for the four basic Create, Read, Update, Delete operations on all resources.
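Because every document is addressed by a URI, plain HTTP is enough to talk to CouchDB. The following minimal sketch creates and then reads back a document using only the JDK's HttpURLConnection; the database name demo, the document id user1 and the default port 5984 are illustrative, the demo database is assumed to already exist, and a real deployment would normally also send authentication credentials.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CouchDbExample {
  public static void main(String[] args) throws Exception {
    // PUT creates (or updates) the document living at its URI.
    URL docUrl = new URL("http://localhost:5984/demo/user1");
    HttpURLConnection put = (HttpURLConnection) docUrl.openConnection();
    put.setRequestMethod("PUT");
    put.setDoOutput(true);
    put.setRequestProperty("Content-Type", "application/json");
    String json = "{\"name\": \"alice\", \"age\": 30}";
    try (OutputStream out = put.getOutputStream()) {
      out.write(json.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("PUT status: " + put.getResponseCode()); // 201 on create

    // GET reads the document back, including its _id and _rev metadata.
    HttpURLConnection get = (HttpURLConnection) docUrl.openConnection();
    get.setRequestMethod("GET");
    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(get.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}

The _rev field returned by the GET is the revision token that CouchDB's MVCC scheme uses to detect and surface conflicting concurrent updates.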
E. MongoDB

MongoDB is a free and open-source cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schemas. The database is developed by MongoDB Inc. and is published in combination with the GNU Affero General Public License and the Apache License. Any relational database has a typical schema design that shows the number of tables and the relationships between them, while in MongoDB there is no such concept of relationship.

Main features
• Ad hoc queries: MongoDB supports field, join and range queries and regular expression searches [12]. Queries can return specific fields of documents and can also include user-defined JavaScript functions. Queries can also be configured to return a random sample of results of a given size.
• Indexing: Fields in a MongoDB document can be indexed with primary and secondary indices.
• Replication: MongoDB provides high availability with replica sets [12]. A replica set consists of two or more copies of the data, and each replica set member may act in the role of primary or secondary replica at any time. All writes and reads are done on the primary replica by default. Secondary replicas maintain a copy of the primary's data using built-in replication. When a primary replica fails, the replica set conducts an election process to determine which secondary should become the primary. Secondaries can optionally serve read operations, but that data is only eventually consistent by default.
• Load balancing: MongoDB scales horizontally using sharding [12]. The user chooses a shard key, which determines how the data in a collection will be distributed. The data is split into ranges and distributed across multiple shards. Alternatively, the shard key can be hashed to map to a shard, enabling an even data distribution. MongoDB can run over multiple commodity servers, balancing the load or duplicating data to keep the system up and running in case of hardware or network failure.

Advantages of MongoDB over RDBMS
• The structure of a single object is clear.
• Schema-less: MongoDB is a document database in which one collection holds different documents. The number of fields, the content and the size of documents can differ from one document to another.
• No complex joins.
• Deep query-ability: MongoDB supports dynamic queries on documents using a document-based query language that is nearly as powerful as SQL.
• Tuning.
• Ease of scale-out: MongoDB is easy to scale.
• Conversion/mapping of application objects to database objects is not needed [12].
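As a small sketch of the document model and dynamic queries described above, the following uses the official MongoDB Java driver (the synchronous MongoClients API). The database, collection and field names are hypothetical, and a mongod instance is assumed to be listening on the default port 27017.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import static com.mongodb.client.model.Filters.gte;

public class MongoExample {
  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoDatabase db = client.getDatabase("demo");
      MongoCollection<Document> users = db.getCollection("users");

      // Schema-less: documents in the same collection may differ in fields.
      users.insertOne(new Document("name", "alice").append("age", 30));
      users.insertOne(new Document("name", "bob")
          .append("age", 25).append("city", "Pollachi"));

      // Dynamic range query on a field, with no join or fixed schema needed.
      for (Document doc : users.find(gte("age", 26))) {
        System.out.println(doc.toJson());
      }
    }
  }
}

The same find() call works whether or not age is indexed; adding a secondary index on the field simply makes the range scan faster.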


TABLE II: COMPARISON OF COUCHDB AND MONGODB

III. SUMMARIZATION OF BIG DATA TOOLS

The following Table III demonstrates the comparative aspects of the diverse big data tools, based on their data sources and operating systems. It lists each tool's mode of software, types of data, language used and operating system. The main objective of this comparison of database tools is not to determine which is the best tool for big data, but to demonstrate their usage, flexibility, scalability and performance, so as to create awareness of big data in various fields.

TABLE III: COMPARISON OF BIG DATA TOOLS


Name of Big data tools | Mode of Software | Types of Data | Language Used | Operating System
HBase | Commercial and Open Source | Structured and Unstructured data | SQL | Windows XP, Vista, 7 and 8
Cassandra | Commercial and Open Source | Structured and Unstructured data | SQL | Windows XP, Vista, 7 and 8
CouchDB | Commercial and Open Source | Structured, Semi-Structured and Unstructured data | JavaScript, PHP, Erlang | Windows, Ubuntu
MongoDB | Open Source | Structured, Semi-Structured and Unstructured data | C++ | Amazon Linux, Windows Server 2012 & 2012 R2, Debian 7.1

IV. CONCLUSION

In this paper, several big data tools were described along with their features. Big data provides highly effective support for handling collections of data sets that are too complex and large for traditional processing, and this requirement has paved the way for the development of many tools in big data research. These database tools handle data generated both in real time and in non-real time, at very large scale, coming from sensors, the web, networks, audio/video, and other sources. The aim of this survey is thus to enhance knowledge of big data tools and their applications in various companies. It also provides a helpful resource for readers, researchers, business users and analysts to make better and quicker decisions using data, which will promote development and innovation in the future.

REFERENCES

[1] http://www.slideshare.net/HarshMishra3/harsh-big-data-seminar-report
[2] http://www.infoworld.com/d/business-intelligence/7-top-tools-tamingbig-data-191131
[3] J. Venner, "Pro Hadoop", Apress, (2016).
[4] T. White, "Hadoop: The Definitive Guide", third ed., O'Reilly Media, Yahoo Press, (2017).
[5] W. Tantisiriroj, S. Patil and G. Gibson, "Data-intensive File Systems for Internet Services: A Rose by Any Other Name", Technical Report CMU-PDL-08-114, Carnegie Mellon University.
[6] M. K. McKusick and S. Quinlan, "GFS: Evolution on Fast-forward", ACM Queue, New York, vol. 7, no. 7, (2013).
[7] K. Shvachko, H. Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System", Proceedings of IEEE Conference, 978-1-4244-7153-9/10, (2015).
[8] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters", Commun. ACM, vol. 51, no. 1, (2007), pp. 107-113.
[9] J. Dean and S. Ghemawat, "MapReduce: A flexible data processing tool", Commun. ACM, vol. 53, no. 1, (2016), pp. 72-77.
[10] E. Hewitt, "Cassandra: The Definitive Guide" (1st ed.), O'Reilly Media, (December 15, 2015).
[11] M. C. Brown, "Getting Started with CouchDB" (1st ed.), O'Reilly Media, (October 31, 2016).
[12] M. Pirtle, "MongoDB for Web Development" (1st ed.), Addison-Wesley Professional, (March 3, 2017).
[13] B. Holt, "Scaling CouchDB" (1st ed.), O'Reilly Media, (April 11, 2017).
[14] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein and C. Welton, "MAD skills: new analysis practices for big data", Proceedings of the VLDB Endowment, 2 (2), (2018).

