Transition from Relational Database to Big Data and Analytics
Santoshi Kumari and C. Narendra Babu
M. S. Ramaiah University of Applied Sciences
Contents
6.1 Introduction
6.1.1 Background, Motivation, and Aim
6.1.2 Chapter Organization
6.2 Transition from Relational Database to Big Data
6.2.1 Relational Database
6.2.2 Introduction to Big Data
6.2.3 Relational Data vs. Big Data
6.3 Evolution of Big Data
6.3.1 Facts and Predictions about the Data Generated
6.3.2 Applications of Big Data
6.3.3 Fundamental Principle and Properties of Big Data
6.3.3.1 Issues with Traditional Architecture for Big Data Processing
6.3.3.2 Fundamental Principle for Scalable Database System
6.3.3.3 Properties of Big Data System
6.3.4 Generalized Framework for Big Data Processing
6.3.4.1 Storage and Precomputation Layer
6.1 Introduction
The term big data was invented to represent the large data generated continuously with the advancement in digital technology, smart devices, and cheap hardware.
6.1.2 Chapter Organization
This chapter is organized as follows: Section 6.2 provides a review on the transition
from relational database to big data and the difference between relational data and
big data. Section 6.3 elaborates on the evolution of big data, the basic principles and
properties of big data, and the generalized framework for big data systems. Section
6.4 describes big data analytics, the necessity of big data analytics, and the challenges
in big data analytics. Lastly, tools and technologies for big data processing are dis-
cussed in Section 6.5. Conclusion and future work are presented in Section 6.6.
6.2.1 Relational Database
In 1970, E. F. Codd [14] introduced a new approach for calculating and manip-
ulating data in relational database format. Since then, RDBMS has been imple-
mented and used for more than four decades, satisfying the business needs. It is
a traditional method for managing structured data in which rows and columns
are used to store the data in the form of tables and each table has a unique pri-
mary key.
The key characteristic of a relational database is that it supports the ACID (atomicity, consistency, isolation, and durability) property [15], which guarantees consistency in handling all transactions. RDBMS includes a relational database and a schema to manage, store, query, and retrieve the dataset. The RDBMS maintains data integrity through the following characteristics:
Data warehouses and data marts are the key methods for managing structured datasets. A data warehouse integrates data from multiple data sources and is used for storing, analyzing, and reporting, whereas a data mart is used to access the data stored in a data warehouse, where data is filtered and processed. Preprocessed data is given as input for data mining and online analytical processing to find new values to solve various business problems. The two ways to store data in a data warehouse are as follows:
There are several challenges faced by today’s enterprises, organizations, and govern-
ment sectors due to some limitations of the traditional database system which are
as follows:
Hence, big data analytics, tools, and technologies were introduced to overcome the limitations of traditional database systems for managing complex data generated rapidly in various formats, characterized by [16] volume, velocity, variety, and veracity.
◾◾ Variety: Data is generated in different formats such as audio, videos, log files,
and transaction history from various sources, which no longer fits traditional
structures.
◾◾ Veracity: Large amount of unorganized data is generated, such as tweets,
comments with hash tags, abbreviations, and conversational text and speeches.
In order to address the above characteristics of big data, several new technologies such as NoSQL, Hadoop, and Spark were developed.
◾◾ NoSQL database: NoSQL [18,19] database is read as Not Only SQL. It is a
schema-less database used to store and manage unstructured data, where the
management layer is separated from the storage layer. The management layer
provides assurance of data integrity.
NoSQL provides high-performance, scalable data storage with low-level access to a data management layer, so that data management tasks are handled at the application layer. An advantage of NoSQL is that the structure of the data can be modified at the application layer without making any changes to the original data in the tables.
◾◾ Parallel processing: Many processors (300 or more in number) work in a loosely coupled or shared-nothing architecture. Independent processors, each with its own operating system and memory, work in parallel on different parts of the program to improve the processing speed of the tasks and memory utilization. Communication between tasks takes place through a messaging interface.
◾◾ Distributed file system (DFS): DFS allows multiple users working on different machines to share files, memories, and other resources. Based on access lists on both servers and clients, the client nodes get restricted access to the file systems, but not to the whole block of storage. However, it is again dependent on the protocol.
◾◾ Hadoop: It is a fundamental framework for managing big data on which
many analytical tasks and stream computing are carried out. Apache Hadoop
[20] allows distributed processing of huge datasets over multiple clusters of
commodity hardware. It provides a high degree of fault tolerance with hori-
zontal scaling from a single to thousands of machines.
◾◾ Data-intensive computing: A data parallel approach is used in parallel computing applications to process big data [21]. This is based on the principle of association of data and programs to perform computation.
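The association of data and programs behind data-parallel processing can be sketched in miniature as a word count expressed in map and reduce phases. This is an illustrative, single-process simulation, not a distributed Hadoop job; the function names are invented for the sketch.

```python
# Word count as map -> shuffle -> reduce, simulated sequentially.
from collections import defaultdict

def map_phase(chunk):
    # Each "node" maps its local chunk of text to (word, 1) pairs.
    return [(w, 1) for w in chunk.split()]

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework would between phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Each reducer sums the counts for the keys assigned to it.
    return {k: sum(vs) for k, vs in groups.items()}

chunks = ["big data big", "data analytics"]   # pieces held on different nodes
pairs = [p for c in chunks for p in map_phase(c)]
counts = reduce_phase(shuffle(pairs))
assert counts == {"big": 2, "data": 2, "analytics": 1}
```

In a real framework the map and reduce calls run on the nodes that hold the data, which is the point of moving the program to the data rather than the reverse.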
Table 6.1 (Continued) Difference between Relational Data and Big Data
1. Data generated in every 2 days is equal to data generated from the beginning
of time until 2003.
2. Over 90% of all the data in the world was produced in the past 2 years.
3. By 2020, the amount of digital information generated is expected to grow
from 3.2 ZB today to 40 ZB. Data generated by industries gets doubled in
every 1.2 years.
4. Every minute, 200,000 photos are uploaded on Facebook, 204 million emails are sent, 1.8 million likes are shared on Facebook, and 278,000 tweets are posted; around 40,000 search queries are served by Google every second.
5. On YouTube, around 100 hours of video are uploaded every minute, and it would take 15 years to watch all the videos uploaded in a single day.
6. Thirty billion pieces of information are exchanged between Facebook users
every day.
7. Every minute, around 570 new websites are hosted into existence.
8. Around 12 TB of tweets is analyzed every day to measure “sentiment.”
9. An 81% increase in data over the mobile network was observed per month from 2012 to 2014.
1.
Understanding and targeting customers: Understanding the necessities and
the requirements of customers is one of the important factors for many busi-
ness entities to improve their business. Big data applications play an important
role in understanding the customers, their behaviors, and their inclinations
by analyzing the behaviors and sentiments of the customers from previously
collected large data. In order to get a more comprehensive picture of their
customers, many companies are keen to increase their datasets with social
media data, browser logs, as well as text analytics and sensor data. Nowadays,
using big data, e-business giants such as Flipkart and Amazon are able to
predict what products are on demand and can be sold, and they can suggest
similar products to the customers. Similarly, telecom companies are now able
to better understand the customer expectations and can make better deci-
sions. Exit polls of elections are more predictive using big data analytics [22].
2.
Understanding and improving business practices: Big data helps in boosting business processes [23]. Analytics on social media data, search histories on web sites, and weather forecasts help retailers in adjusting their stocks. GPS systems and sensor-equipped vehicles are used to track the delivery of goods, and analysis of live traffic data helps in finding the shortest path. Most importantly, customer feedback on social media helps to improve business processes.
3.
Health care: A large amount of patient medical history helps in understanding the symptoms and in predicting possible diseases and solutions [24,25]. Better prediction of a disease pattern enables better treatments. For example, the success factor of in vitro fertilization treatment can be predicted by the analysis of multiple attributes of N number of patient records. The risk factors in a patient’s treatment can be identified by analyzing conditions such as blood pressure levels, asthma, diabetes, genetics, and previous records. Future judgments in medicine would not be limited to small samples; instead, they would include a huge set of records, possibly covering everyone.
4.
Sports: The IBM SlamTracker tool, used for video analytics in tennis tournaments, works on big data analytics. The performance of players in football and baseball games can be tracked and predicted based on the analysis of historical data and sensor technology in sports equipment. Big data analysis and visualization can greatly help players improve their performance; for example, a cricket player can improve his/her bowling skills by understanding what kind of shots are played by the opponent.
5.
Science and research: Current potentials of big data analytics are transforming science and research technology. For example, CERN, the Swiss nuclear physics lab, has a data center with 65,000 processors. It generates huge amounts of data from experiments probing the secrets of the universe and analyzes 30 PB of data on many distributed computers across 150 data centers. This data is used in many other research areas to compute new insights.
6.
Optimizing machine and device performance: Large data used to train machines and devices, with the help of artificial intelligence and machine learning, makes devices better and smarter without human involvement. The more the training data, the more accurate and smarter the device. For example, Google’s self-driving car uses data captured from sensors, cameras, and GPS systems to analyze traffic movement and to drive safely without human intervention.
7.
Security and law enforcement: Big data analytics is used by many developed countries such as the United States, the United Kingdom, Japan, and Singapore to advance security and law enforcement. For example, the National Security Agency in the United States and the NCS in Singapore use big data analytics to track and prevent terrorist activities and to thwart attacks on web applications.
8.
Smart cities: Big data analytics [26] helps optimize traffic flow using weather predictions and real-time traffic data collected from sensors, GPS, and social media. Analysis of data from tweets, comments, blogs, and end-user feedback helps in building better transportation systems and hence in making better decisions for building smarter cities and a smarter planet.
9.
Financial operation: Big data analytics plays an important role in making financial decisions and improving financial operations. Understanding customer requirements and inclinations helps to provide the best services, such as insurance, credit, and loan facilities. Further financial operation process improvements can be made based on the analysis of feedback. Nowadays, most financial trading, such as investments in the stock market and the purchase and sale of shares, depends on big data analytics.
1.
Architecture was not completely fault tolerant: As the number of machines increases, it is more likely that a machine will go down, and the architecture is not horizontally scalable. Manual interventions are required, such as managing queue failures and setting replicas to keep the applications running.
2.
Distributed nature of data: Data is scattered in pieces on many clusters, and
the complexity is increased at the application layer to select the appropriate
data and process it. Applications must be aware of the data to be modified
or must inspect the scattered pieces over the clusters and process it and then
merge the result to present the final result.
3.
Insufficient backup and unavoidable mistakes in software: Complexities
are pushed to the application layer with the introduction of big data technol-
ogy. As the complexity of system increases, the possibility of making mistakes
will also increase. Systems must be built robust enough to avoid or handle
human mistakes and limit damages. In addition, a database should be aware
of its distributed nature. It is more time consuming to manage distributed
processes.
The scalability and complexity issues of traditional systems are addressed and resolved in big data systems through a systematic approach.
To implement the above arbitrary function on a random dataset with small latency,
Lambda Architecture provides some general steps for developing scalable big data
systems. Hence, it becomes essential to understand the elementary properties of big
data systems to develop scalable systems.
6.
Random queries on a large dataset: Executing random queries on a large
dataset is very important to discover and learn interesting insights from the
data. To find new business insights, applications require random mining and
querying on datasets.
7.
Scalable system with minimal maintenance: The number of machines added to scale should not increase the maintenance burden. Choosing a module with small implementation complexity is key to reduced maintenance. The more complex the system, the more likely something will go wrong, and hence the more debugging and tuning it will require. Minimum maintenance is obtained by keeping a system simple. Keeping processes up, fixing errors, and running efficiently when machines are scaled are the important factors to be considered for developing systems.
8.
Easy to restore: Systems must be able to provide the basic necessary informa-
tion to restore the data when something goes wrong. It should have enough
information replicas saved on distributed nodes to easily compute and restore
the original data by utilizing saved replicas.
(Figure: Generalized framework for big data processing, showing a knowledge discovery layer (MapReduce, Spark) for query, reduction, and view processing, built above a storage and precomputation layer (HDFS, cloud data storage and management).)
and serving layers, except low latency. Hence, the next real-time data processing layer resolves the problem of low-latency updates.
The problem addressed at present is real-time data processing and computation [29]. Instead of precomputing the data from the storage layer, real-time views are updated with new data as it is generated, eventually achieving the lowest possible latency.
This layer provides solutions for datasets generated in real time to improve
latency, whereas the storage layer produces precomputed views on the entire data-
set. Datasets that are no longer essential in real-time processing are removed, the
results are temporarily saved, and the complexity is pushed to the application layer.
Hence, the real-time data processing layer is more complex than the storage and
serving layers.
Finally, valuable results are obtained by joining the results from the precom-
puted views and real-time views. Hence, future research work would be focused
on bringing together the batch and real-time views to produce new and valuable
insights and to make better decisions. It requires advanced machine learning and
analytical techniques to improve the computational speed with maximum accuracy
for continuously changing random datasets.
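The joining of precomputed and real-time views can be sketched very simply. The view contents and key names below are invented for illustration; a real system would serve the batch view from a store such as HDFS-backed indexes and the real-time view from a low-latency store.

```python
# Merging a batch (precomputed) view with a real-time view, as in the
# Lambda Architecture: the batch view covers data up to the last
# precomputation run, the real-time view covers events since then.
batch_view = {"page_a": 100, "page_b": 40}     # precomputed page-view counts
realtime_view = {"page_a": 3, "page_c": 1}     # counts since the last batch run

def merged(key):
    # A query consults both views and combines them.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

assert merged("page_a") == 103   # batch count plus recent events
assert merged("page_c") == 1     # key seen only in real time so far
```

When the next batch run completes, its results absorb the recent events and the corresponding real-time entries can be discarded, which keeps the complex real-time layer small.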
A generalized flexible architecture with distinct components focused on specific purposes leads to more acceptable performance. Applications must be robust enough to recompute corrupted values and results by re-executing the computation on whole datasets to locate or fix problems.
6.4.1.1 Volume
A large amount of unstructured data is generated and archived compared to tra-
ditional data. This data is generated continuously from various sources, such as
system logs, sensor data, click streams, transaction-based data, email communica-
tions, housekeeping data, and social media. The amount of data is increasing [10]
to a level that the traditional database management and computation systems are
incapable of handling it. The solution based on data warehouse may not be capa-
ble of analyzing and processing huge data due to the lack of a parallel processing
design. So, increasing volume is one of the biggest challenges as it requires a new
level of scalability for storage and analysis.
6.4.1.2 Velocity
The amount of data has increased exponentially with the growth of IoT, sensor-equipped devices, e-business, and social media. The data generated continuously at high speed makes it challenging to process and analyze. It is essential to devise algorithms that get quick results from streaming data. For example, online interactions and real-time applications require a high rate of analysis.
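Quick results on streaming data usually come from one-pass, constant-memory computations rather than from storing and rescanning the stream. A running mean is the simplest sketch of this style; the function name is invented for illustration.

```python
# One-pass streaming computation: emit the mean after every element,
# keeping only a running total and a count in memory.
def running_mean(stream):
    total, n = 0.0, 0
    for x in stream:
        total, n = total + x, n + 1
        yield total / n          # result available immediately, per element

means = list(running_mean([2, 4, 6]))
assert means == [2.0, 3.0, 4.0]
```

Real stream processors such as Storm or Spark Streaming generalize this pattern to windowed and keyed aggregations over unbounded inputs.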
6.4.1.3 Variety
A variety of data is generated from various sources in different structures, such as
text, video, audio, images, and log files. A combination of structured, unstructured,
and semistructured data is not supported by the relational database system. Hence,
this requires modified storage and processing methods for manipulation of hetero-
geneous data.
Many large-scale organizations and enterprises require big data tools and technologies to analyze their past history and customer information to understand customer needs, in order to improve their business and finally to make better decisions and predictions to survive in the competitive era. They also need [16] robust big data analytics tools and technologies for new innovations, process improvements, monitoring, security, and many other functions. The analytical methods used to handle big data in different areas are listed in Sections 6.4.2.1–6.4.2.5.
6.4.2.1 Text Analytics
The process of mining valuable and meaningful information from text data is called
text analytics. Some of the examples of text data are mail threads, chat conversa-
tions, reviews, comments, tweets, feedbacks, financial statements, and log records.
Text analytics comprises natural language processing, statistical analysis [29,30], machine learning [32], and computational linguistics. Meaningful abstracts are extracted from large-scale text data by applying many text mining and analytical methods. Text analytics is basically used for question answering, information extraction, sentiment analysis, and text summarization. It is essential in analyzing social media data such as tweets and Facebook comments to understand people’s sentiments and events happening in real time.
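A minimal word-list scorer sketches the sentiment-analysis task mentioned above. Real systems use the NLP and statistical methods described; the word lists and threshold here are purely illustrative assumptions.

```python
# Toy lexicon-based sentiment scoring for short texts such as tweets.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "hate"}

def sentiment(text):
    words = text.lower().split()
    # Score = positive hits minus negative hits.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

assert sentiment("great service love it") == "positive"
assert sentiment("bad experience") == "negative"
```

Lexicon methods break down on negation, sarcasm, and hashtags or abbreviations, which is why production text analytics leans on machine-learned models instead.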
6.4.2.2 Audio Analytics
Audio analytics refers to the extraction of valuable and meaningful information from audio data. It is most commonly used in call centers and health-care services, for improving the skills and services provided by call centers and for improving patient treatments and health-care services. It is also essential in customer relationship management (CRM) to understand customers and improve the quality of products and facilities to satisfy their needs and maintain their relationships.
6.4.2.3 Video Analytics
Video analytics refers to extracting valuable data by tracking and analyzing the
video streams. It is used in key application areas, such as marketing and operations
management. Analyzing video streams of sports such as tennis and cricket helps to
improve the performance of the sports person.
people’s interest in particular products, and so on. Many organizations are trying to improve their markets by analyzing social media data, such as people’s behavior, sentiments, and opinions. According to [33], social media are categorized as follows:
6.4.2.5 Predictive Analytics
Predictive analytics is used to make predictions on future events based on the anal-
ysis of current and historical data. According to [33,30], statistical methods form
the base for predictive analysis.
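Since statistical methods form the base of predictive analytics, a least-squares line fitted to historical points is the simplest concrete example: fit on the past, extrapolate the future. The data and function name below are invented for illustration.

```python
# Ordinary least-squares fit of y = a + b*x to historical points,
# then extrapolation to a future x.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance over variance.
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

years = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]          # perfectly linear toy history
a, b = fit_line(years, sales)
assert abs((a + b * 5) - 18.0) < 1e-9     # predicted sales for year 5
```

Real predictive analytics layers richer models (regression with many variables, time-series methods, machine learning) on the same idea of learning parameters from historical data.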
Making use of in-memory data analytics, big data tools have improved the per-
formance of data query notably. Big data analytics is not just about making better
decisions but also about real-time processing that motivates businesses to derive
new values and improve performance and profit rates from insights gained.
Big data beats RDBMS in several ways including robust backups, recovery,
faster search algorithms, overall scalability, and low-latency reads and writes.
6.4.3.2 Data Management
The traditional methods of managing structured data includes two important
parts. One is a schema to store the dataset and another is a relational database for
data retrieval. Data warehouse and data marts are the two standard approaches for
managing large-scale structured datasets. SQL is used to perform operations on
relational structured data. Data warehouse is used to store, analyze, and report the
outcomes to users. Access and analysis of the data obtained from a warehouse is
enabled by a data mart.
6.4.3.3 Data Analysis
According to Moore’s law, to cope with increasing data size, researchers gave more
attention to speeding up the analysis algorithms. As the data size increases sig-
nificantly faster than the CPU speed, there is a remarkable change in processor
technology, even though processors are doubling the clock cycle frequency. It is
essential to develop on-line, sampling, and multi-resolution analysis means. On the
other hand, development of parallel computing is required with increasing num-
bers of cores in processors. Large clusters of processors, distributed computing, and
cloud computing are developed fast to aggregate several different workloads.
In real-time applications, such as navigation, social networks, finance, biomedicine, astronomy, intelligent transport systems, and IoT, speed is the top priority. It is still a big challenge to address stream processing by giving quick and appropriate replies when large amounts of data need to be processed in a short span of time.
1.
Clustering algorithms: In data clustering, many challenges are emerging
in addressing the characteristics of big data. One of the important issues
that need to be addressed in big data clustering is how to reduce the data
complexity. Big data clustering is divided into two groups [36]: (i) single-
machine clustering using sampling and dimension reduction solutions and
(ii) multiple-machine clustering using parallel and Map Reduce solutions
[37]. Using sampling and dimension reduction methods, complexity and
memory space required for data analytical processes will be reduced.
Irrelevant data and dimensions are discarded before the data analysis process starts. To reduce the data size for data analysis, data sampling is used, and to reduce the dimensionality of the whole dataset, dimension reduction is used.
To perform the clustering process in parallel, CloudVista [38] uses cloud computing. It is a common solution for clustering big data. To handle large-scale data, CloudVista uses BIRCH (balanced iterative reducing and clustering using hierarchies) and sampling methods.
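The two single-machine reductions above, sampling and dimension reduction, can be sketched as a preprocessing step before clustering. The helper names and the variance-based column selection below are illustrative assumptions, not the CloudVista or BIRCH algorithms.

```python
# Shrink a dataset before clustering: sampling reduces the number of
# records, and dropping low-variance dimensions shrinks each record.
import random

def sample(points, k, seed=0):
    # Uniform random sample of k records (fixed seed for reproducibility).
    rng = random.Random(seed)
    return rng.sample(points, k)

def drop_low_variance(points, keep):
    # Keep only the `keep` dimensions with the highest variance.
    dims = len(points[0])
    def var(d):
        vals = [p[d] for p in points]
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    best = sorted(range(dims), key=var, reverse=True)[:keep]
    return [[p[d] for d in sorted(best)] for p in points]

data = [[1.0, 5.0, 0.1], [2.0, 5.0, 0.1], [9.0, 5.0, 0.1], [8.0, 5.0, 0.1]]
reduced = drop_low_variance(sample(data, 3), keep=1)
assert all(len(p) == 1 for p in reduced)   # only the informative dimension remains
```

Any clustering algorithm then runs on `reduced`, with far less memory and computation than on the raw dataset.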
2.
Classification algorithms: Many researchers are working toward develop-
ing new classification algorithms for big data mining and transforming tra-
ditional classification algorithms for parallel computing. Classification [39]
algorithms are designed in such a way that they take input data from dis-
tributed data sources and use various sets of learners to process them. Tekin
et al. presented “classify or send for classification” as a novel classification
algorithm.
In the distributed data classification method, the input data is processed in two different ways by each learner. One performs classification functions, whereas the other forwards the input data to another labeled learner. These kinds of solutions improve the accuracy of big data classification.
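The "classify or send for classification" idea can be sketched as follows: a learner handles inputs it is confident about and forwards the rest to a more specialized learner. The function names, thresholds, and decision rule are invented for illustration and are not the algorithm from the cited work.

```python
# A local learner either classifies an input or sends it on.
def local_learner(x):
    # Confident only near the extremes of the input range.
    if x < 0.2:
        return ("classified", 0)
    if x > 0.8:
        return ("classified", 1)
    return ("sent", None)            # forward the hard cases

def expert_learner(x):
    # A (toy) specialized learner that handles the forwarded inputs.
    return 1 if x >= 0.5 else 0

def classify(x):
    action, label = local_learner(x)
    return label if action == "classified" else expert_learner(x)

assert classify(0.1) == 0
assert classify(0.9) == 1
assert classify(0.6) == 1            # forwarded to the expert learner
```

The accuracy gain comes from routing ambiguous inputs to learners better suited to them instead of forcing every learner to decide everything.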
For example, to perform big data classification, Rebentrost et al. [40]
defined a quantum-based support vector machine and showed that with
O(log NM) time complexity the proposed classification algorithm can be
6.4.3.5 Visualization of Data
Information hidden in large and complex datasets can easily be conveyed in both
functionality and visual forms. The challenges in data visualization [46] are to
represent facts more instinctively and effectively by using distinct patterns, graphs,
and visualization techniques. For valuable data analysis, information should be
abstracted in some schematic form from complex datasets, and it should include
variables or attributes for the units of information.
To extract and understand the hidden insights from data, e-commerce companies, such as eBay and Amazon, use big data visualization tools, such as Tableau [47]. This tool helps to convert large complex datasets into interactive results and intuitive pictures, for example, data about thousands of customers, goods sold, feedback, and their inclinations. However, there are many challenges in the current visualization tools, such as scalability, functionality, and response time, that can be addressed in future work.
Map Reduce system and Google File System [53]. The initial version of Hadoop lacked the capability to access and process huge volumes of data with commodity hardware in a distributed environment. To make the computation layer
more robust, it is separated from storage layers. The storage layer, named Hadoop
distributed file system (HDFS), is capable of storing huge amounts of unstructured
data in large clusters of commodity hardware, and the Map Reduce computation
structure is built on top of the HDFS for data parallel applications. A complete
stack of big data tools was built on Hadoop by Apache to support different applications, as shown in Figure 6.3. The later version of Hadoop is called Apache YARN, in which a new layer, called the resource management layer, is added for the efficient utilization of resources in big data clusters.
(Figure 6.3: The Apache big data tool stack: file system (HDFS); data processing with batch (MapReduce), stream (S4, Storm), high-level language (Hive/Pig), graph (Pregel/Giraph), and Spark engines; data analytics; visualization (Tableau); plus cross-cutting tools such as Thrift and ZooKeeper.)
There are five layers in big data systems (Figure 6.3). Distributed file storage is the bottom layer for storing large distributed data, above which there is a cluster resource management layer. The purpose of this layer is to manage large clusters of hardware resources and to allow the upper layers to utilize the resources efficiently. The data stored in distributed file systems is processed by the data processing layer as batch, stream, or graph processing. Preprocessed data is fed to the data analytics layer to analyze and extract more valuable information. To represent valuable results, high-level abstractions are built in the visualization layer.
6.5.1 Tools
6.5.1.1 Thrift
Thrift is a scalable cross-language services library and code generation tool set to support scalable back-end services. Its major goal is to provide efficient and reliable communication across different programming languages by abstracting the portions of each language that require the most customization into a common library implemented in each language. Thrift [54] supports many languages such as Haskell, Java, C++, Perl, C#, Ruby, Cocoa, Python, D, Delphi, Erlang, OCaml, PHP, and Smalltalk.
6.5.1.2 ZooKeeper
Yahoo developed a distributed coordination system called ZooKeeper [8], and later,
it was taken over by the Apache Software Foundation. To coordinate the distrib-
uted applications, it offers an integrated service. ZooKeeper provides the following
support for distributed coordination:
6.5.1.3 Hadoop DFS
HDFS is a hierarchical file system consisting [55] of directories and files similar to
a UNIX file system. Users can perform all the administrative and manipulation
operations such as create, delete, copy, save, and move files to the HDFS as in a
normal UNIX system.
HDFS architecture: The HDFS architecture consists of two units: a single name node and multiple data nodes. The name node is responsible for managing the file system namespace. An edit log keeps track of the logs and updates whenever changes are made to the file system. Additionally, the name node keeps track of all the blocks in the file system assigned to data nodes and the mapping of blocks to data nodes.
Name nodes are replicated for fault tolerance. To manage a large number of
files, file systems are divided into a number of blocks and are saved on data nodes.
The operations on files and directories such as opening, closing, and renaming are
performed by the name node. Mapping of blocks to data nodes is tracked by the
name node. The client directly communicates to the data nodes by obtaining a list
of files from the name node to read a file.
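The read path just described, metadata from the name node and bytes directly from the data nodes, can be sketched with toy structures. These dictionaries and the `read_file` helper are illustrative only, not the HDFS API.

```python
# Toy model of the HDFS read path: the name node maps a file to an
# ordered list of (block, data node) pairs; the client then fetches
# each block directly from the data node that holds it.
namenode = {"/logs/a.txt": [("blk_1", "dn1"), ("blk_2", "dn3")]}
datanodes = {"dn1": {"blk_1": b"hello "}, "dn3": {"blk_2": b"world"}}

def read_file(path):
    data = b""
    for block_id, node in namenode[path]:   # metadata lookup on the name node
        data += datanodes[node][block_id]   # block contents from the data node
    return data

assert read_file("/logs/a.txt") == b"hello world"
```

Keeping the name node out of the data path is what lets a single name node serve a cluster of thousands of data nodes.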
Block report is used to manage the copies of files. A separate file is created for
each block stored in the local file system by the data node. It also creates directories
for dividing the files belonging to different applications.
Fault tolerance in HDFS: HDFS replicates the files into blocks and keeps them in different data nodes for fault tolerance. The name node uses the replicated copy of blocks to process the requested data chunk if any data node goes down.
6.5.2 Resource Management
Clusters of commodity servers are cost-efficient solutions for intensive scientific computations and are used for running large Internet services. The issues of traditional resource management for Hadoop and Storm are described in two aspects: First, a system that runs Hadoop or Storm is commodity hardware, not customized hardware. Second, Hadoop requires a lot of configuration and scheduling to support fine-grained tasks.
To address these issues of traditional resource management system, new tools
such as YARN [56] and Mesos [57] are introduced.
6.5.3.1 Apache HBase
HBase is inspired by Google's Bigtable [4]. It is a multidimensional, distributed, column-oriented data store built on HDFS. It provides fast access to records and fast updates for data tables. It uses HDFS to
store the data and ZooKeeper framework for distributed coordination. Row key,
column key, and a timestamp are used for distributed index mapping in the HBase
multidimensional data model. The row keys are used for organizing the mapping.
Each row has its unique key, and a set of columns and number of columns are
added dynamically to the column families.
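The multidimensional map described above can be modeled in a few lines; `ToyHTable` is an illustrative class, not the HBase client API:

```python
# A sketch of HBase's logical data model: a sparse map from
# (row key, column family:qualifier, timestamp) to an uninterpreted value.
class ToyHTable:
    def __init__(self):
        self.cells = {}   # (row key, column key) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None
        if timestamp is None:            # default to the newest version
            timestamp = max(versions)
        return versions.get(timestamp)

t = ToyHTable()
t.put("row1", "info:name", 1, "alice")
t.put("row1", "info:name", 2, "alicia")   # a newer version of the same cell
print(t.get("row1", "info:name"))          # 'alicia' -- latest timestamp wins
print(t.get("row1", "info:name", 1))       # 'alice'
```

New columns can be "added dynamically" simply because any column key can appear under any row, which is what makes the map sparse.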
6.5.3.2 Apache Cassandra
Apache Cassandra [7] was developed by Facebook and is based on a peer-to-peer distributed key-value store design. Cassandra uses a row-oriented data model, and all nodes in the cluster are treated as equal. Cassandra is suitable for real-time applications where large volumes of data must be handled with fast random access.
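The peer-to-peer, all-nodes-equal design can be sketched with a consistent-hashing ring, the mechanism Cassandra uses to decide which node owns a key. This is a simplified illustration; real Cassandra adds virtual nodes and replication:

```python
# Sketch of key placement in a peer-to-peer store: each key and each node
# hashes onto a ring, and the first node clockwise from the key owns it.
import bisect
import hashlib

def ring_position(s):
    # Map an arbitrary string onto a 32-bit ring position.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def owner(self, key):
        pos = ring_position(key)
        tokens = [t for t, _ in self.ring]
        # First token at or after the key's position, wrapping around.
        i = bisect.bisect_right(tokens, pos) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))   # one of the three nodes, chosen by hash
```

Because ownership depends only on the hash, any node can route any request, which is why no node is a master.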
Three orthogonal properties, consistency, availability, and partition tolerance (CAP) [58], are considered when developing distributed applications. According to the CAP theorem, it is difficult to satisfy all three properties together while keeping operation latency tolerable. By relaxing strong consistency to eventual consistency, Cassandra achieves high availability and partition tolerance.
6.5.4 Data Processing
Basic data processing frameworks are divided into batch mode, stream mode, graph processing, and interactive analysis mode based on their processing methods and speed. The resource managers Mesos and YARN manage these runtime systems in clusters at the lower layers. Unstructured data from HDFS as well as structured data from NoSQL stores are given as input to these systems. Their output is redirected back to the storage layers or cached for analytics and visualization.
6.5.4.1 Batch Processing
Batch processing is suitable for processing large amounts of data stored in batches.
Hadoop Map Reduce is a basic model introduced to process huge amounts of data
in batch. However, it is not suitable for all kinds of batch processing tasks such
as iterative processing. To overcome some of these disadvantages, new processing
models, Spark and Apache Hama, are presented.
1. Hadoop
Hadoop is a distributed processing framework that processes big datasets over clusters of computers [16] using simple programming models. It is designed to scale up from single servers to thousands of machines, each machine providing local computation and storage. Instead of depending on hardware to provide high availability, the framework itself is designed to detect and handle failures at the application layer, hence providing a highly available service on top of the cluster.
The underlying Map Reduce framework was introduced and published by Google, illustrating its method of managing and processing large data. Hadoop has since become the standard framework for storing, processing, and analyzing terabytes to exabytes of data. As Doug Cutting started developing the framework, it got its name "Hadoop" from his son's toy elephant.
Yahoo is a main contributor to Hadoop's advancement; by 2008, Yahoo's web search engine index was generated using a 10,000-core Hadoop cluster. Hadoop was developed to run on commodity hardware, and it can scale up and down without system intervention. The three important functions of the Hadoop framework are storage, resource management, and processing.
– Hadoop Map Reduce (prior to version 2.0): The Hadoop Map Reduce structure consists of three main components: HDFS, the job tracker (master), and the task tracker (slave). HDFS is used to store and share the data among the computational tasks of Map Reduce jobs. First, the job tracker reads the input data from HDFS and splits it into partitions, running a map task on each partition; the intermediate results are stored in the local file system. Second, reduce tasks read the intermediate results from the map tasks and run the reduce code on them. The results of the reduce phase are saved in HDFS.
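The two-phase flow above can be illustrated with the classic word count example, simulated here in plain Python; Hadoop would run the same map and reduce functions as distributed tasks over HDFS partitions:

```python
from collections import defaultdict

# Map phase: each map task emits (word, 1) pairs from its input partition.
def map_task(partition):
    for line in partition:
        for word in line.split():
            yield (word, 1)

# Shuffle: intermediate pairs are grouped by key before the reduce phase.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each reduce task sums the counts for one word.
def reduce_task(key, values):
    return (key, sum(values))

partitions = [["big data", "big clusters"], ["data data"]]
intermediate = [kv for p in partitions for kv in map_task(p)]
result = dict(reduce_task(k, vs) for k, vs in shuffle(intermediate).items())
print(result)   # {'big': 2, 'data': 3, 'clusters': 1}
```

In a real job, the shuffle moves intermediate pairs over the network so that all values for a key land at the same reduce task.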
– Hadoop Map Reduce Version 2.0: The new version of Map Reduce introduces a resource allocation and scheduling tool, Apache YARN [59]. The task tracker is replaced by YARN node managers. A job history server, which keeps track of finished jobs, is a newly added feature of the architecture. Initially, clients request from YARN the resources required for their jobs, and the resource manager assigns a place to execute the master task of each job.
– Hadoop characteristics:
1. Fault tolerant: A Hadoop cluster is highly prone to failures, as thousands of nodes run on commodity hardware. However, data redundancy and replication are employed to achieve fault tolerance.
2. Redundancy of data: Hadoop divides data into many blocks and
stores them across two or more data nodes to improve the redun-
dancy. The master node preserves information of these nodes and
data mapping.
3. Scaling up and down: The distributed nature of the Hadoop file system allows the cluster to scale up and down by adding or removing nodes as required.
4. Computations moved to data: Queries are computed locally on the data nodes and the results are combined in parallel, avoiding the overhead of bringing the data to the computational environment.
2. Spark
Spark was built on top of HDFS as an open-source project [60] by the University of California, Berkeley, to address the issues of Hadoop Map Reduce. The objective of the system is to support iterative computation and to increase the speed of distributed parallel processing beyond the limits of Hadoop Map Reduce.
Resilient distributed datasets (RDDs), an in-memory fault-tolerant data structure, were introduced at Berkeley [61,62] for efficient data sharing across parallel computations. Spark supports batch, iterative, interactive, and streaming workloads in the same runtime with significantly higher performance. It also allows applications to scale up and down with efficient sharing of data. Spark differs from Hadoop Map Reduce in natively supporting operations beyond map and reduce, such as join and group-by.
With RDDs, Spark runs applications up to 100 times faster in memory and 10 times faster on disk compared to Hadoop Map Reduce.
Spark overcomes some of the limitations of Hadoop as follows:
1. Iterative algorithms: Spark allows applications to explicitly cache data by calling the cache() operation, so that subsequent queries can reuse intermediate results stored in the cache, providing dramatic improvements in time and memory utilization.
2. Streaming data: Spark offers an application programming interface to process streaming data. It also makes it possible to design methods that process real-time streaming data with minimum latency.
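The effect of cache() can be sketched without a cluster. An RDD is conceptually a lazy lineage of transformations; caching pins the computed result so later actions skip recomputation. The `ToyRDD` class below is an illustration, not the Spark API:

```python
# Toy model of an RDD's lazy lineage: compute() re-runs the whole
# transformation chain unless the result has been cached.
class ToyRDD:
    def __init__(self, compute_fn):
        self._compute = compute_fn
        self._cached = None
        self.evaluations = 0          # how many times the lineage actually ran

    def compute(self):                # stands in for a Spark action
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        return self._compute()

    def cache(self):
        self._cached = self.compute()   # run the lineage once, pin the result
        return self

base = list(range(10))

uncached = ToyRDD(lambda: [x * x for x in base])
uncached.compute(); uncached.compute()
print(uncached.evaluations)   # 2 -- the lineage re-ran for every action

cached = ToyRDD(lambda: [x * x for x in base]).cache()
cached.compute(); cached.compute()
print(cached.evaluations)     # 1 -- computed once, then served from the cache
```

This is why iterative algorithms, which touch the same dataset in every pass, benefit so much from caching.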
6.5.4.2 Stream Processing
1. Storm
Storm is an open-source distributed real-time computation framework dedicated to stream processing. It offers a fault-tolerant mechanism to execute computation on an event as it enters the system.
Using Apache Storm, it is easy to process real-time streaming [63] data.
It has many useful applications, such as real-time analytics, online machine
learning, continuous computation, distributed RPC (remote procedure call),
and ETL (extract, transform, and load). Storm is easy to set up and operate.
It is also scalable and fault tolerant. Storm typically does not run on top of
Hadoop clusters, and it uses Apache ZooKeeper and its own master worker
processes to manage topologies.
2. S4
S4 offers a simple programming model [51] and provides easy, efficient, automated distributed execution, for example, automatic load balancing. In Storm, by contrast, the programmer must take care of load balancing, buffer sizing, and the level of parallelism to obtain optimum performance.
6.5.4.3 Graph Processing
The earlier method of graph processing on top of Map Reduce was inefficient, as it took an entire graph as input, processed it, and then wrote the complete updated graph back to disk. Pregel [64], Giraph [65], and many other systems were developed for efficient graph processing and to overcome these limitations of Map Reduce.
1. Pregel
Pregel is a parallel graph processing system built on the bulk synchronous parallel (BSP) model [64]. In the BSP model, a set of processors interconnected by a communication network execute independent computation threads, with each processor equipped with fast local memory. A platform based on the BSP model consists of the following three important components:
– Processing units with fast local memory (i.e., processors)
– An efficient network for communication between these units
– Hardware support for synchronization between the units
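The BSP style of computation can be sketched with the classic Pregel example of propagating the maximum vertex value: in each superstep every active vertex processes its incoming messages, and messages are delivered only at superstep boundaries. This is a single-machine sketch, not the Pregel API:

```python
# Pregel-style max propagation over an undirected graph.
# edges: vertex -> list of neighbours; values: vertex -> initial value.
def pregel_max(edges, values):
    values = dict(values)
    # Superstep 0: every vertex sends its own value along its edges.
    messages = {v: [] for v in values}
    for v in values:
        for u in edges[v]:
            messages[u].append(values[v])
    supersteps = 1
    # Later supersteps: a vertex wakes up only when it has messages, adopts
    # the maximum it has seen, and forwards it only if its value changed.
    while any(messages.values()):
        next_messages = {v: [] for v in values}
        for v, inbox in messages.items():
            if inbox and max(inbox) > values[v]:
                values[v] = max(inbox)
                for u in edges[v]:
                    next_messages[u].append(values[v])
        messages = next_messages
        supersteps += 1
    return values, supersteps

edges = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result, steps = pregel_max(edges, {"a": 3, "b": 6, "c": 2})
print(result)   # {'a': 6, 'b': 6, 'c': 6}
```

The loop terminating when no messages remain corresponds to all vertices voting to halt in Pregel.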
2. Apache Giraph
Giraph is a graph processing system developed on top of Apache Hadoop and built on the Pregel model for distributed graph processing [65]. It runs inside the map tasks of Map Reduce. Giraph uses Apache ZooKeeper to coordinate between its tasks and Netty for internode communications. A graph in Giraph is represented by a set of vertices and edges; the vertices perform the computational tasks and communicate with one another along the edges.
1. Mahout
Mahout is a machine learning algorithm library [4] built on Hadoop Map Reduce to support various analytical tasks. It also aims to include machine learning algorithms for different distributed systems. The Mahout library includes algorithms for various tasks, as shown in Table 6.2.
2. MLlib
MLlib is a machine learning library developed on Spark [9,68] that consists of a set of machine learning algorithms, as shown in Table 6.3, for classification, clustering, regression analysis, and collaborative filtering.
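As a flavor of what these libraries compute, here is a minimal single-machine k-means (Lloyd's algorithm); MLlib and Mahout distribute the same assign-and-average steps over partitions of an RDD or HDFS blocks:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal Lloyd's k-means on 2-D points: assign, then re-average."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                  + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

points = [(0.0, 0.1), (0.2, 0.0), (9.8, 10.0), (10.0, 9.9)]
centers = kmeans(points, k=2)
print(sorted(centers))   # two centers, near (0.1, 0.05) and (9.9, 9.95)
```

In the distributed versions, the assignment step is a map over data partitions and the update step is a reduce that averages partial sums.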
A summary of the chapter on relational data and big data analytical methods and technologies is presented in Table 6.4 and Figure 6.4. An overview of the characteristics, tools, and technologies for managing structured and unstructured data is represented in the form of a flowchart for an easy overview of big data tools and technologies. It also shows the areas of application for which big data technology and relational databases are suitable.
Table 6.4 Relational and Big Data Tools and Technologies

Relational data (structured data):
– Characteristics: atomicity, consistency, isolation, and durability [14]
– Storage and management tools: DBMS, data warehouse, data mart
– Data processing: SQL, OLTP (online transaction processing), OLAP (online analytical processing)
– Analytics: data mining, clustering, classification
– Visualization: graphs, charts
– Applications: employee details management, hospital management, insurance companies

Big data (unstructured data):
– Characteristics: volume, velocity, veracity, variety [17]
– Storage and management tools: HDFS (DFS), HBase [4], Cassandra [7], NoSQL [19]
– Data processing: batch (Hadoop Map Reduce [2], Spark [60]); graph (Giraph, Pregel); resource management and job scheduling (YARN, Mesos); high-level languages (Pig, Hive)
– Analytics: statistics, fundamental mathematics, R, Python
– Visualization: Tableau (Jason Brooks, 2016)
– Applications: social media computing, health care, government, finance, business, and enterprise
[Figure 6.4 shows a flowchart of big data analytical tools and technologies: big data vs. relational data (NoSQL vs. SQL); storage (HDFS, NoSQL); coordination (ZooKeeper); batch processing (Hadoop Map Reduce, Spark); stream processing (S4, Storm); graph processing (Pregel, Giraph); high-level languages (Hive, Pig); analytics (data mining, statistics, machine learning with MLlib and Mahout); and visualization.]
Figure 6.4 Flow chart of big data analytical tools and technologies.
6.6.2 Conclusion
Enormous amounts of data are generated with increasing speed in different formats
due to digitization around the world. Big data technology will definitely enter every
domain, enterprise, and organization. Traditional database management systems fail to scale to growing data needs because they lack partitioning and parallelizing abilities. They are also incapable of storing, managing, and analyzing unstructured data generated from sources such as sensors, smart applications, wearable technologies, smartphones, and social networking websites. The evolution of big data analytics, tools, and technologies has made it possible to efficiently handle huge volumes of growing unstructured data. One of the most popular open-source frameworks, Hadoop, is a generally recognized system for large-scale data analytics. It is widely accepted for supporting large-scale distributed parallel computing on clusters; it is cost-effective, fault tolerant, and reliable, and it provides highly scalable support for processing and managing terabytes to petabytes of data. However, it is
not suitable for real-time data analytics. To overcome this limitation of the earlier Hadoop system, a new framework known as Spark was introduced. Supporting real-time analytics, Spark with RDDs returns results in fractions of a second. Several areas such as business, social media, government, health care, and security are implementing big data technologies to gain knowledge from previously unused data and to make better decisions and predictions. In the future, it will be worthwhile to overcome the drawbacks of the Spark and Hadoop systems and to work toward real-time analytics. The challenges in batch processing and stream processing analytical systems also need to be addressed in future work.
References
1. McKinsey & Company. “Big Data: The Next Frontier for Innovation, Competition, and
Productivity.” McKinsey Global Institute, p. 156, June 2011.
2. J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM, vol. 51, no. 1, pp. 107–113, January 2008.
3. “Apache SparkTM—Lightning-Fast Cluster Computing.” http://spark.apache.org/ (Accessed
January 25, 2017).
4. A. Duque Barrachina and A. O’Driscoll. “A Big Data Methodology for Categorising
Technical Support Requests using Hadoop and Mahout.” Journal of Big Data, vol. 1, p. 1,
2014.
5. F. Aronsson. “Large Scale Cluster Analysis with Hadoop and Mahout.” February 2015.
6. N. Marz and J. Warren. “Big Data—Principles and Best Practices of Scalable Realtime
Data Systems.” Harvard Business Review, vol. 37, pp. 1–303, 2013.
7. L. Avinash and P. Malik. “Cassandra: A Decentralized Structured Storage System.” ACM
SIGOPS Operating Systems Review, pp. 1–6, 2010.
8. F. Junqueira and B. Reed. ZooKeeper: Distributed Process Coordination. Sebastopol, CA: O'Reilly Media, 2013.
9. X. Meng, J. Bradley, E. Sparks, et al. "MLlib: Machine Learning in Apache Spark."
13. V. Mayer-Schönberger and K. Cukier. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Boston, MA: Houghton Mifflin Harcourt, 2013.
14. E. F. Codd. "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM, vol. 26, no. 6, pp. 64–69, 1983.
15. T. Haerder and A. Reuter. “Principles of Transaction-Oriented Database Recovery.” ACM
Computing Surveys, vol. 15, no. 4, pp. 287–317, 1983.
16. P. Zikopoulos and C. Eaton. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. New York: McGraw-Hill, 2011.
17. D. Laney. "3D Data Management: Controlling Data Volume, Velocity and Variety." META Group, Application Delivery Strategies, vol. 949, p. 4, 2001.
18. R. Cattell. “Scalable SQL and NoSQL Data Stores.” ACM SIGMOD Record, vol. 39, no.
4, p. 12, May 2011.
19. J. Han, E. Haihong, G. Le, and J. Du. “Survey on NoSQL Database.” In 2011 6th
International Conference on Pervasive Computing and Applications (ICPCA), Port
Elizabeth, South Africa, 2011.
20. D. Borthakur. “The Hadoop Distributed File System: Architecture and Design.” Hadoop
Project Website, 2007.
21. G. Bell, T. Hey, and A. Szalay. “Beyond the Data Deluge.” Science, 2009.
22. E. Al Nuaimi, H. Al Neyadi, N. Mohamed, and J. Al-Jaroodi. "Applications of Big Data to Smart Cities." Journal of Internet Services and Applications, vol. 6, no. 1, 2015.
23. H. Chen, R. Chiang, and V. Storey. "Business Intelligence and Analytics: From Big Data to Big Impact." MIS Quarterly, vol. 36, no. 4, pp. 1165–1188, 2012.
24. T. Huang, L. Lan, X. Fang, P. An, J. Min, and F. Wang. “Promises and Challenges of Big
Data Computing in Health Sciences.” Big Data Research, vol. 2, no. 1, pp. 2–11, 2015.
25. W. Raghupathi and V. Raghupathi. “Big Data Analytics in Healthcare: Promise and
Potential.” Health Information Science and Systems, vol. 2, no. 1, p. 3, 2014.
26. S. Kumar and A. Prakash. “Role of Big Data and Analytics in Smart Cities.” International
Journal of Science and Research, vol. 5, no. 2, pp. 12–23, 2016.
27. L. Garber. "Using In-Memory Analytics to Quickly Crunch Big Data." Computer, vol. 45, no. 10, pp. 16–18, October 2012.
28. M. Özsu and P. Valduriez. Principles of Distributed Database Systems, 3rd ed. New York: Springer, 2011.
29. S. Shahrivari. "Beyond Batch Processing: Towards Real-Time and Streaming Big Data." Computers, vol. 3, no. 4, pp. 117–129, 2014.
30. J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. New York: Springer, 2001.
31. "A Survey on Text Mining in Social Networks." The Knowledge Engineering Review, pp. 1–15, 2004.
32. R. Bekkerman, M. Bilenko, and J. Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge: Cambridge University Press, 2011.
33. A. Gandomi and M. Haider. "Beyond the Hype: Big Data Concepts, Methods, and Analytics." International Journal of Information Management, vol. 35, no. 2, pp. 137–144, April 2015.
34. J. Ahrens, B. Hendrickson, G. Long, and S. Miller. "Data-Intensive Science in the US DOE: Case Studies and Future Challenges." Computing in Science & Engineering, 2011.
35. W. Fan and A. Bifet. “Mining Big Data : Current Status, and Forecast to the Future.”
ACM SIGKDD Explorations Newsletter, vol. 14, no. 2, pp. 1–5, 2013.
36. A. Shirkhorshidi, S. Aghabozorgi, T. Wah, and T. Herawan. “Big Data Clustering: A
Review.” In International Conference on Computational Science and Its Applications, LNCS,
vol. 8583. Cham: Springer, 2014.
37. W. Kim. “Parallel Clustering Algorithms: Survey.” Parallel Algorithms, Spring 2009.
38. H. Xu, Z. Li, S. Guo, and K. Chen. “Cloudvista: Interactive and Economical Visual
Cluster Analysis for Big Data in the Cloud.” Journal Proceedings of the VLDB Endowment,
vol. 5, no. 12, pp. 1886–1889, 2012.
39. C. Tekin and M. van der Schaar. “Distributed Online Big Data Classification using
Context Information.” In 2013 51st Annual Allerton Conference on Communication,
Control, and Computing, Allerton, IL, 2013.
40. P. Rebentrost, M. Mohseni, and S. Lloyd. “Quantum Support Vector Machine for Big
Data Classification.” Physical Review Letters, 2014.
41. J. Ayres, J. Flannick, J. Gehrke, and T. Yiu. “Sequential Pattern Mining Using a Bitmap
Representation.” In Proceedings of the Eighth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, Edmonton, Canada, 2002.
42. M. Lin, P. Lee, and S. Hsueh. “Apriori-based Frequent Itemset Mining Algorithms on
MapReduce.” In Proceedings of the 6th International Conference on Ubiquitous Information
Management and Communication, Kuala Lumpur, Malaysia, 2012.
43. C. Leung and R. MacKinnon. "Reducing the Search Space for Big Data Mining for Interesting Patterns from Uncertain Data." In 2014 IEEE International Congress on Big Data, Anchorage, AK, 2014.
44. T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J. Franklin, and M. I. Jordan. "MLbase: A Distributed Machine-Learning System." In 6th Biennial Conference on Innovative Data Systems Research (CIDR'13), Asilomar, CA, 2013.
45. M. Mehta, R. Agrawal, and J. Rissanen. "SLIQ: A Fast Scalable Classifier for Data Mining." In International Conference on Extending Database Technology, Avignon, France, 1996.
46. D. Keim, C. Panse, and M. Sips. "Visual Data Mining in Large Geospatial Point Sets." IEEE Computer Graphics and Applications, vol. 24, no. 5, pp. 36–44, 2004.
47. "Data Visualization | Tableau Software." www.tableau.com/stories/topic/data-visualization (Accessed January 25, 2017).
48. A. Di Ciaccio, M. Coli, and J. M. Angulo Ibanez, eds. Advanced Statistical Methods for the Analysis of Large Data-Sets. Berlin: Springer, 2012.
49. X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A.
Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg. “Top 10
Algorithms in Data Mining.” Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37,
January 2008.
50. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks." In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007—EuroSys'07, Lisbon, Portugal, p. 59, 2007.
51. L. Neumeyer, B. Robbins, and A. Nair. “S4: Distributed Stream Computing Platform.”
Data Mining Workshops, 2010.
52. S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis.
“Dremel: Interactive Analysis of Web-Scale Datasets.” In 36th International Conference on
Very Large Data Bases, pp. 330–339, 2010.
53. S. Ghemawat, H. Gobioff, and S.-T. Leung. "The Google File System." In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP'03), Bolton Landing, NY, 2003.
54. M. Slee, A. Agarwal, and M. Kwiatkowski. "Thrift: Scalable Cross-Language Services Implementation." Facebook White Paper, 2007.
55. K. Shvachko, H. Kuang, S. Radia, and R. Chansler. “The Hadoop Distributed File
System.” In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST
2010)(MSST), Incline Village, NV, pp. 1–10, 2010.
56. V. Vavilapalli, A. Murthy, and C. Douglas. “Apache Hadoop Yarn: Yet Another Resource
Negotiator.” In Proceedings of the 4th annual Symposium on Cloud Computing, SoCC’13,
Santa Clara, CA, 2013.
57. B. Hindman, A. Konwinski, M. Zaharia, and A. Ghodsi. “Mesos: A Platform for Fine-
Grained Resource Sharing in the Data Center.” In NSDI’11 Proceedings of the 8th USENIX
Conference on Networked Systems Design and Implementation, Boston, MA, 2011.
58. S. Gilbert and N. Lynch. “Perspectives on the CAP Theorem.” Computer, 2012.
59. V. K. Vavilapalli, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, E. Baldeschwieler, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, and H. Shah. "Apache Hadoop YARN." In Proceedings of the 4th Annual Symposium on Cloud Computing—SOCC'13, Santa Clara, CA, pp. 1–16, 2013.
60. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. "Spark: Cluster Computing with Working Sets." In HotCloud'10 Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA, p. 10, 2010.
61. M. Zaharia, M. Chowdhury, T. Das, and A. Dave. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." In NSDI'12 Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, San Jose, CA, p. 2, 2012.
62. M. Zaharia. "An Architecture for Fast and General Data Processing on Large Clusters." UC Berkeley Technical Report, p. 128, 2014.
63. T. da Silva Morais. "Survey on Frameworks for Distributed Computing: Hadoop, Spark and Storm." In Proceedings of the 10th Doctoral Symposium in Informatics Engineering—DSIE'15, Porto, Portugal, pp. 95–105, 2015.
location. 64. G. Malewicz, M. Austern, A. Bik, and J. Dehnert. “Pregel: A System for Large-scale Graph
Processing.” In Conference: SPAA 2009: Proceedings of the 21st Annual ACM Symposium
on Parallelism in Algorithms and Architectures, Calgary, Alberta, Canada, August 11–13,
2009.
65. O. Batarfi, R. El Shawi, A. G. Fayoumi, R. Nouri, S.-M.-R. Beheshti, A. Barnawi, and S.
Sakr. “Large Scale Graph Processing Systems: Survey and an Experimental Evaluation.”
Cluster Computing, vol. 18, no. 3, pp. 1189–1213, September 2015.
66. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. "Pig Latin: A Not-So-Foreign Language for Data Processing." In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data—SIGMOD'08, Vancouver, Canada, p. 1099, 2008.
67. A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. "Hive: A Warehousing Solution over a Map-Reduce Framework." Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1626–1629, August 2009.
68. X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai,
M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A.
Talwalkar. “MLlib: Machine Learning in Apache Spark.” Journal of Machine Learning
Research, vol. 17, pp. 1–7, 2015.