
Chapter 6

Transition from Relational Database to Big Data and Analytics

Santoshi Kumari and C. Narendra Babu
M. S. Ramaiah University of Applied Sciences

Contents
6.1 Introduction..............................................................................................132
6.1.1 Background, Motivation, and Aim................................................133
6.1.2 Chapter Organization....................................................................133
6.2 Transition from Relational Database to Big Data......................................134
6.2.1 Relational Database.......................................................................134
6.2.2 Introduction to Big Data...............................................................135
6.2.3 Relational Data vs. Big Data..........................................................136
6.3 Evolution of Big Data................................................................................138
6.3.1 Facts and Predictions about the Data Generated............................139
6.3.2 Applications of Big Data................................................................139
6.3.3 Fundamental Principle and Properties of Big Data........................141
6.3.3.1 Issues with Traditional Architecture for Big Data Processing.............141
6.3.3.2 Fundamental Principle for Scalable Database System......142
6.3.3.3 Properties of Big Data System..........................................143
6.3.4 Generalized Framework for Big Data Processing...........................144
6.3.4.1 Storage and Precomputation Layer..................................144

6.3.4.2 Knowledge Discovery Layer (Serving Layer)....................144
6.3.4.3 Real-Time Data Processing Layer (Speed Layer)..............145
6.4 Big Data Analytics....................................................................................146
6.4.1 Big Data Characteristics and Related Challenges...........................146
6.4.1.1 Volume............................................................................146
6.4.1.2 Velocity............................................................................147
6.4.1.3 Variety.............................................................................147
6.4.2 Why Big Data Analytics?...............................................147
6.4.2.1 Text Analytics..................................................................148
6.4.2.2 Audio Analytics...............................................................148
6.4.2.3 Video Analytics...............................................................148
6.4.2.4 Social Media Analytics....................................................148
6.4.2.5 Predictive Analytics.........................................................149
6.4.3 Challenges in Big Data Analytics..................................................149
6.4.3.1 Collect and Store Data.....................................................149
6.4.3.2 Data Management...........................................................150
6.4.3.3 Data Analysis...................................................................150
6.4.3.4 Security for Big Data....................................................... 152
6.4.3.5 Visualization of Data....................................................... 152
6.5 Tools and Technologies for Big Data Processing....................................... 153
6.5.1 Tools..............................................................................................154
6.5.1.1 Thrift...............................................................................154
6.5.1.2 ZooKeeper....................................................................... 155
6.5.1.3 Hadoop DFS................................................................... 155
6.5.2 Resource Management...................................................................156
6.5.3 NoSQL Database: Unstructured Data Management.....................156
6.5.3.1 Apache HBase................................................................. 157
6.5.3.2 Apache Cassandra............................................................ 157
6.5.4 Data Processing............................................................................. 157
6.5.4.1 Batch Processing.............................................................. 158
6.5.4.2 Distributed Stream Processing.........................................160
6.5.4.3 Graph Processing.............................................................160
6.5.4.4 High-Level Languages for Data Processing..................... 161
6.5.5 Data Analytics at the Speed Layer.................................................162
6.6 Future Work and Conclusion....................................................................163
6.6.1 Future Work on Real-Time Data Analytics....................................163
6.6.2 Conclusion.....................................................................................166
References..........................................................................................................167

6.1  Introduction

The term big data was invented to represent the large data generated continuously with the advancement in digital technology, smart devices, and cheap hardware resources. Enterprises, businesses, and government sectors produce large amounts of valuable information that needs to be processed to make better decisions and improve
their performance [1]. Processing a huge amount of unstructured data is difficult
using relational data management technologies. Efficient tools and technologies
are required to process such large-scale and complex data. Several tools such as
Google’s Map Reduce [2], Spark [3], and Mahout [4,5] were developed to overcome
the limitations of traditional methods.
Understanding the fundamental principles and properties [6] of big data systems is essential for building robust and scalable tools and technologies. Various frameworks and tools such as Hadoop, Spark, Cassandra [7], ZooKeeper [8],
HBase, and MLlib [9] are introduced for batch processing, stream processing, and
graph processing of large-scale data.

6.1.1 Background, Motivation, and Aim


The data generated from social media, sensor-equipped smart devices, the Internet of Things (IoT), e-business, and research centers is growing every day with an increase in complexity [10]. This huge amount of data requires new technology to store, process, and analyze it and to discover hidden insights for better decisions. Parallel processing is one such technique of big data systems that achieves significant efficiency in processing terabytes and petabytes of complex data.
The key features of emerging database systems for advanced analytics motivate a closer look into big data processing frameworks such as Hadoop [11,12] and Spark. This chapter reviews the development of new systems that overcome the drawbacks of traditional analytical methods for large data. For developing a generalized system for processing large datasets, the Lambda Architecture [6] provides a three-layer structure, which in turn helps to understand the basic requirements for processing and analyzing large data in batch, stream, and real time.
This chapter aims to present interpretations on the drawbacks of traditional
methods for processing big data, the characteristics and challenges of big data, the
properties of big data analytics, and the tools and technologies for processing big
data. There are several benefits of big data analytics [13] in various areas, such as
finance and marketing, government, and health care.

6.1.2 Chapter Organization
This chapter is organized as follows: Section 6.2 provides a review on the transition
from relational database to big data and the difference between relational data and
big data. Section 6.3 elaborates on the evolution of big data, the basic principles and
properties of big data, and the generalized framework for big data systems. Section
6.4 describes big data analytics, the necessity of big data analytics, and the challenges
in big data analytics. Lastly, tools and technologies for big data processing are dis-
cussed in Section 6.5. Conclusion and future work are presented in Section 6.6.

6.2 Transition from Relational Database to Big Data


This topic provides an overview of the transition from the relational database sys-
tem to the big data system. Starting with the evolution of relational database man-
agement system (RDBMS) and its introduction, it outlines the characteristics of
relational database systems, traditional tools, and the drawbacks of the tools in
handling unstructured large data.
A later part provides a brief introduction to big data and an understanding on
different types of data formats. Finally, it concludes with comparisons between
RDBMS and big data in terms of performance evaluation parameters, such as scal-
ability, efficiency, storage and management, and size and schema.

6.2.1 Relational Database
In 1970, E. F. Codd [14] introduced a new approach for calculating and manip-
ulating data in relational database format. Since then, RDBMS has been imple-
mented and used for more than four decades, satisfying the business needs. It is
a traditional method for managing structured data in which rows and columns
are used to store the data in the form of tables and each table has a unique pri-
mary key.
The key characteristic of a relational database is that it supports the ACID (atomicity, consistency, isolation, and durability) properties [15], which guarantee consistency in handling all transactions. RDBMS includes a relational data-
base and a schema to manage, store, query, and retrieve the dataset. The RDBMS
maintains data integrity through the following characteristics:

1. Tuple and attribute orders are not important.


2. Each entity is unique.
3. Each cell in the table contains a single value.
4. An attribute should contain only values of the same field.
5. In a database, table names and attribute names should be unique, but two different tables can have similar attribute names.
6. SQL (Structured Query Language) is used for administrative operations, which include data definition language (DDL), data manipulation language (DML), and data access language (DAL) statements; a minimal example follows this list.
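As a minimal, hedged illustration of the schema and ACID behavior described above, the following sketch uses Python's built-in sqlite3 module; the employee table and its data are invented for illustration and stand in for any relational engine:

    import sqlite3

    conn = sqlite3.connect(":memory:")          # throwaway in-memory database
    conn.execute("""
        CREATE TABLE employee (
            emp_id     INTEGER PRIMARY KEY,     -- unique key for each tuple
            name       TEXT NOT NULL,
            department TEXT NOT NULL,
            salary     REAL NOT NULL            -- all values in a column share one domain
        )
    """)

    try:
        with conn:                              # one atomic transaction (ACID)
            conn.execute("INSERT INTO employee VALUES (1, 'Asha', 'Finance', 52000)")
            conn.execute("INSERT INTO employee VALUES (2, 'Ravi', 'Sales', 48000)")
            # Reusing primary key 1 violates the schema, so the whole
            # transaction is rolled back and the table stays consistent.
            conn.execute("INSERT INTO employee VALUES (1, 'Dup', 'Sales', 10)")
    except sqlite3.IntegrityError:
        pass

    print(conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0])   # prints 0

Because the duplicate key aborts the transaction, none of the three inserts survive, which is exactly the consistency guarantee the ACID properties describe.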

Data warehouses and data marts are the key methods for managing the structured
datasets. A data warehouse integrates data from multiple data sources that are used
for storing, analyzing, and reporting, whereas a data mart is used to access the
data stored in a data warehouse so that data is filtered and processed. Preprocessed
data is given as an input for data mining and online analytical processing to find
new values to solve various business problems. The two ways to store data in a data
warehouse are as follows:

◾◾ Dimension table—To uniquely identify each dimension record (row), it


maintains a primary key, and it is associated with a fact table. Using primary
key data from the fact table, the associated data is filtered and grouped into
slices and dices to get various combinations of attributes and results.
◾◾ Normalization—Normalization splits the data into entities to create several
tables in a relational database. It removes duplicated data within the database and often results in the creation of additional tables. The tables are
classified based on data categories, such as data on employee, finance, busi-
ness, and department.

There are several challenges faced by today’s enterprises, organizations, and govern-
ment sectors due to some limitations of the traditional database system which are
as follows:

◾◾ Real-time business decisions are difficult using the traditional database system due to the complexity of processing and analyzing real-time unstructured data.
◾◾ It is difficult to extract, manage, and analyze data islands, such as data from personal digital assistants (PDAs; mobiles and tablets) and other computing devices.
◾◾ Traditional data models are nonscalable for large amounts of complex data.
◾◾ Data management cost increases exponentially with the increase of data and
complexity.

Hence, big data analytics, tools, and technologies were introduced to overcome the
limitations of traditional database systems for managing complex data generated
rapidly in various formats characterized by [16] volume, velocity, variety, and veracity.

6.2.2 Introduction to Big Data


The term big data emerged to describe complex unstructured data generated from
click streams, transaction histories, sensors, and PDAs. Big data technologies find new
opportunities to make quick and better decisions from large complex data at less cost.
Big data is characterized mainly by four V’s [16,17], which are described by
identifying the drawbacks of traditional methods as follows:

◾◾ Volume: Large data is generated continuously so that traditional methods fail to manage it.
◾◾ Velocity: Data is generated at a very high frequency, which traditional methods are incapable of processing.

◾◾ Variety: Data is generated in different formats such as audio, videos, log files,
and transaction history from various sources, which no longer fits traditional
structures.
◾◾ Veracity: Large amount of unorganized data is generated, such as tweets,
comments with hash tags, abbreviations, and conversational text and speeches.

In order to address the above characteristics of big data, several new technologies
such as NoSQL (Not Only SQL), Hadoop, and Spark were developed.
◾◾ NoSQL database: NoSQL [18,19] database is read as Not Only SQL. It is a
schema-less database used to store and manage unstructured data, where the
management layer is separated from the storage layer. The management layer
provides assurance of data integrity.
NoSQL provides high-performance, scalable data storage with low-level access to a data management layer, so that data management tasks are handled at the application layer. An advantage of NoSQL is that the structure of the data can be modified at the application layer without making any changes to the original data in the tables; a minimal schema-less example appears after this list.
◾◾ Parallel processing: Many processors (300 or more in number) work in a loosely coupled or shared-nothing architecture. Independent processors, each with its own operating system and memory, work in parallel on different parts of a program to improve processing speed and memory utilization. Communication between tasks takes place through a messaging interface.
◾◾ Distributed file system (DFS): DFS allows multiple users working on different machines to share files, memory, and other resources. Based on access lists on both servers and clients, the client nodes get restricted access to the file systems, but not to the whole block of storage. However, this again depends on the protocol.
◾◾ Hadoop: It is a fundamental framework for managing big data on which
many analytical tasks and stream computing are carried out. Apache Hadoop
[20] allows distributed processing of huge datasets over multiple clusters of
commodity hardware. It provides a high degree of fault tolerance with hori-
zontal scaling from a single to thousands of machines.
◾◾ Data-intensive computing: Data parallel approach is used in parallel com-
puting application to process big data [21]. This is based on the principle of
association of data and programs to perform computation.
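To make the schema-less storage mentioned in the NoSQL item above concrete, the sketch below inserts two records with different fields into MongoDB through the pymongo driver. A locally running MongoDB server is assumed, and the database name, collection name, and documents are illustrative, not part of the chapter:

    from pymongo import MongoClient    # pip install pymongo

    client = MongoClient("mongodb://localhost:27017/")   # local server (assumption)
    events = client["demo_db"]["events"]

    # Documents in the same collection may carry different fields; no table
    # schema has to be altered when the application adds a new attribute.
    events.insert_one({"type": "click", "user": "alice", "page": "/home"})
    events.insert_one({"type": "purchase", "user": "bob",
                       "items": ["book", "pen"], "total": 12.5})

    for doc in events.find({"user": "alice"}):
        print(doc)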

6.2.3 Relational Data vs. Big Data


Table 6.1 highlights some of the key differences between RDBMS and big data
systems.

Table 6.1  Difference between Relational Data and Big Data

Description
  RDBMS: Structured data is split into many tables containing rows and columns. Tables are interrelated; a foreign key stored in one of the columns is used to refer to the interrelated tables.
  Big data: Unstructured data files are spread across many clusters, where a distributed file system handles data redundancy. Hadoop on DFS supports many application interfaces for processing large-scale data.

Properties
  RDBMS: Characterized by the ACID properties.
  Big data: Characterized by the four V's: volume, velocity, variety, and veracity.

Data supported
  RDBMS: Supports only structured data.
  Big data: Supports structured, semistructured, and unstructured data.

Data size
  RDBMS: Terabytes.
  Big data: Petabytes.

Database management
  RDBMS: SQL database; it deals with structured data tables, so data is stored at a single node and fixed table schemas are required.
  Big data: NoSQL database; it deals with unstructured data records, so data is stored across multiple nodes and no fixed schema is required.

Scalability
  RDBMS: Vertical scaling; the database is scaled by increasing server and hardware power, and building bigger servers to scale up is expensive.
  Big data: Horizontal scaling; more machines are added to a pool of resources to scale up the database.

Maintenance
  RDBMS: Maintaining RDBMS systems is expensive and requires skilled resources.
  Big data: Requires less maintenance; many features such as auto-repair, data redundancy, and easy data distribution are supported.

Schema
  RDBMS: Fixed schema; the SQL database is schema oriented, data is stored in tables containing rows and columns with a unique primary key, and the format of the data cannot be changed at any time.
  Big data: Dynamic schema; the NoSQL database is schema-less, and the data format or data model can be changed dynamically without making any changes to the application.

Management and storage
  RDBMS: The relational database combines the data storage and management layers.
  Big data: The NoSQL database separates data management from data storage.

6.3 Evolution of Big Data


In 1989, British computer scientist Tim Berners-Lee introduced the hypertext sys-
tem to share information between computers across the world, and eventually the
World Wide Web (WWW) was invented. As more and more systems are connected
to the Internet, huge amounts of data are generated.
In 2001, Doug Laney [17], an analyst at Gartner, presented a paper “3D Data
Management: Controlling Data Volume, Velocity and Variety.” It defines com-
monly accepted characteristics of increasing data. With the introduction of Web
2.0 in 2004, numerous devices were connected to the Internet and also many web-
based applications were hosted, which led to the explosion of data. Later, Roger
Mougalas invented the term “big data” in 2005 to define a huge volume of complex
data generated at high speed.
To manage this large-scale complex data, a new tool called Hadoop was introduced by Yahoo in the same year. To perform large-scale distributed computing, the Map Reduce software concept was introduced by Google in 2004. Initially, Hadoop implemented Google's Map Reduce model to index the WWW. Later, Hadoop was made available as open source, and it is in use by many organizations to manage their increasing data.
Currently, the importance of big data technology is increasing in various fields, such as health care, business, government, sports, finance, and security, for making better decisions, reducing costs, and improving processes. For example,
financial trading like investments in stock market and purchase and sale of shares
is dependent on big data analytics. Big data analytics using real-time traffic data
collected from sensors, Global Positioning System (GPS), and social media help in
optimizing traffic flow and predicting weather condition.

6.3.1 Facts and Predictions about the Data Generated


According to “A Comprehensive List of Big Data Statistics” (Big data statistics,
2012), August 1, 2012, a list of big data generated from various sectors, such as
business, market, education, and social media, is given as follows:

1. Data generated in every 2 days is equal to data generated from the beginning
of time until 2003.
2. Over 90% of all the data in the world was produced in the past 2 years.
3. By 2020, the amount of digital information generated is expected to grow
from 3.2 ZB today to 40 ZB. Data generated by industries gets doubled in
every 1.2 years.
4. Every minute, 200,000 photos are uploaded on Facebook, 204 million emails
are sent, 1.8 million likes are shared on Facebook, and 278,000 tweets are
posted, and around 40,000 search queries are served by Google every second.
5. On YouTube, around 100 hours of video are uploaded every minute, and it
would take 15 years to watch all the videos uploaded in a single day.
6. Thirty billion pieces of information are exchanged between Facebook users
every day.
7. Every minute, around 570 new websites are hosted into existence.
8. Around 12 TB of tweets is analyzed every day to measure “sentiment.”
9. An 81% increase in data over the mobile network had been observed per
month from 2012 to 2014.

6.3.2 Applications of Big Data


Applications of big data technology making significant differences in a wide range
of areas are discussed below.

1. Understanding and targeting customers: Understanding the necessities and
the requirements of customers is one of the important factors for many busi-
ness entities to improve their business. Big data applications play an important
role in understanding the customers, their behaviors, and their inclinations
by analyzing the behaviors and sentiments of the customers from previously
collected large data. In order to get a more comprehensive picture of their
customers, many companies are keen to increase their datasets with social
media data, browser logs, as well as text analytics and sensor data. Nowadays,
using big data, e-business giants such as Flipkart and Amazon are able to
predict what products are on demand and can be sold, and they can suggest
similar products to the customers. Similarly, telecom companies are now able
to better understand the customer expectations and can make better deci-
sions. Exit polls of elections are more predictive using big data analytics [22].

2. Understanding and improving business practices: Big data helps in boosting business processes [23]. Analytics on social media data, search histories on websites, and weather forecasts help retailers in adjusting their stocks. GPS systems and sensor-equipped vehicles are used to track the delivery of
goods, and analysis of live traffic data helps in finding the shortest path. Most
importantly, customer feedback on social media helps to improve the busi-
ness processes.
3. Health care: A large amount of patient medical history helps in under-
standing the symptoms and in predicting possible diseases and solutions
[24,25]. Better prediction on a disease pattern enables to provide better
treatments. For example, a success factor of in vitro fertilization treatment
can be predicted by the analysis of multiple attributes of N number of
patient records. The risk factors in a patient’s treatment can be identified
by analyzing conditions such as blood pressure levels, asthma, diabetes,
genetics, and previous records. Future judgments in medicine would not
be limited to small samples; instead, they would include a huge set of records, possibly covering everyone.
4. Sports: The IBM SlamTracker tool used for video analytics in tennis tournaments works on big data analytics. The performance of players in football and base-
ball games can be tracked and predicted based on the analysis of historical
data and sensor technology in sports equipment. Big data analysis and visu-
alization can help players a lot to improve their performance, for example, a
cricket player can improve his/her bowling skills by understanding what kind
of shots are played by the opponent.
5. Science and research: The current potential of big data analytics is transforming science and research. For example, CERN, the Swiss nuclear physics laboratory, has a data center with 65,000 processors. It generates huge amounts of data from experiments probing the secrets of the universe and analyzes
30 PB of data on many distributed computers across 150 data centers. This
data is used in many other research areas to compute new insights.
6. Optimizing machine and device performance: Large data used to train
machines and devices with the help of artificial intelligence and machine
learning makes better and smarter devices without human involvement. The
more the training data, the more accurate and smarter the device. For exam-
ple, Google’s self-driving car uses data captured from sensors, cameras, and
GPS systems to analyze the traffic movement and for safe driving without
human intervention.
7. Security and law enforcement: Big data analytics is used by many developed countries such as the United States, the United Kingdom, Japan, and Singapore to advance security and law enforcement. For example, the National Security Agency in the United States and NCS in Singapore use big data analytics to track and thwart terrorist activities.

8. Smart cities: Big data analytics [26] helps optimize traffic flow and predict
weather using real-time traffic data collected from sensors, GPS, and social
media. Analysis of data from tweets, comments, blogs and end-user feedback
helps in building better transportation systems and hence in making better
decisions for building smarter cities and a smarter planet.
9. Financial operations: Big data analytics plays an important role in making financial decisions and improving financial operations. Understanding customer requirements and inclinations helps to provide the best services, such as insurance, credit, and loan facilities. Further financial operation process improvements can be made based on the analysis of feedback. Nowadays,
most of the financial trading such as investments in stock market and pur-
chase and sale of shares is dependent on big data analytics.

6.3.3 Fundamental Principle and Properties of Big Data


Before we dive into big data tools and technologies, it is necessary to understand
the basic principle and properties of big data. It is also essential to know the
complexity and scalability issues of traditional data systems in managing large
data as discussed previously. To overcome the issues of traditional methods, sev-
eral open-source tools are developed, such as Hadoop, MongoDB, Cassandra,
HBase, and MLlib. These systems scale to huge data processing and management workloads. On the other hand, to design and understand a robust and scalable
system for big data processing, it is essential to know the basic principle and
properties of big data.

6.3.3.1 Issues with Traditional Architecture for Big Data Processing
There are many issues with the traditional systems to deal with increasing complex
data. Some major challenges are identified and discussed as follows:

1.
Architecture was not completely fault tolerant: As the number of machines
increases, it is more likely that a machine would go down as it is not hori-
zontally scalable. Manual interventions are required such as managing queue
failures and setting replicas to keep the applications running.
2.
Distributed nature of data: Data is scattered in pieces on many clusters, and
the complexity is increased at the application layer to select the appropriate
data and process it. Applications must be aware of the data to be modified
or must inspect the scattered pieces over the clusters and process it and then
merge the result to present the final result.
3.
Insufficient backup and unavoidable mistakes in software: Complexities
are pushed to the application layer with the introduction of big data technol-
ogy. As the complexity of system increases, the possibility of making mistakes
will also increase. Systems must be built robust enough to avoid or handle
human mistakes and limit damages. In addition, a database should be aware
of its distributed nature. It is more time consuming to manage distributed
processes.

The big data system scalability and complexity issues of traditional systems are
addressed and resolved in a systematic approach.

◾◾ First, replication and fragmentation are managed by the distributed nature of the database and distributed computation methods. Furthermore, systems are scaled up by adding more machines to the existing systems to cope with increasing data.
◾◾ Second, database systems should keep data immutable so that systems can be designed in different ways to manage and process large-scale data, and making changes does not destroy the original, valuable data.

To manage large amounts of data, many large-scale computation systems, such as Hadoop and Spark, and database systems, such as HBase and Cassandra, were introduced. Hadoop, which processes big data in parallel batches, has high computation latency, whereas a database such as Cassandra offers a much more limited data model to achieve its scalability.
The database systems are not human fault tolerant as they are alterable.
Consequently, every system has its own pros and cons. To address these arbitrary
issues, systems must be developed in combination with one another with least pos-
sible complexity.

6.3.3.2 Fundamental Principle for Scalable Database System


To build scalable database systems [6], primarily we need to understand “what does
a data system do?” Basically, database systems are used to store and retrieve infor-
mation. Instead of limiting it to storage, new generation database systems must be
able to process large amounts of complex data and extract meaningful information
to take better decisions in less time.
In general, a data system must answer queries by executing a function that takes
the entire dataset as an input. It is defined as

Query = function ( on all data )

To implement the above arbitrary function on a random dataset with small latency, the
Lambda Architecture provides some general steps for developing scalable big data
systems. Hence, it becomes essential to understand the elementary properties of big
data systems to develop scalable systems.
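To ground the equation, the following sketch contrasts answering a query by scanning all data with answering it from a precomputed view, using a toy in-memory dataset; all names and values are invented for illustration:

    from collections import Counter

    all_data = [
        {"user": "alice", "page": "/home"},
        {"user": "bob",   "page": "/home"},
        {"user": "alice", "page": "/cart"},
    ]

    def page_views(dataset, page):
        # query = function(on all data): scan everything for every query
        return sum(1 for event in dataset if event["page"] == page)

    # Precomputed (batch) view: the same function evaluated once for all pages,
    # so later queries become cheap lookups instead of full scans.
    batch_view = Counter(event["page"] for event in all_data)

    assert page_views(all_data, "/home") == batch_view["/home"] == 2

The batch view trades extra storage and periodic recomputation for fast reads, which is the trade-off the Lambda Architecture formalizes in the layers described next.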

6.3.3.3 Properties of Big Data System


1. Fault tolerant and reliable: Systems must be robust enough to tolerate faults and continue their work when one or two machines are down. The main challenge of distributed systems is to "do the right thing." Database systems should be able to handle complications such as randomly changing data in a distributed database, replication of data, and concurrency, so that failures can be worked around and the original data can be recomputed from immutable copies. The system must also be tolerant of human errors [27].
2.
Minimal latency for reads and update: Several applications read and update
the database. Some applications need updates to be transmitted immedi-
ately. Without compromising on the speed and robustness of the systems,
database systems should be able to satisfy low latency reads and updates for
applications.
3.
High scalability and performance: Increasing the data size would increase
the load; however, this should not affect the performance of the system.
Scalability and performance of the system are achieved by adding more
machines with an increased processing capacity. To handle increasing data
and load, systems are horizontally scaled over all the layers of the system
stack.
4.
Support for a wide range of applications: A system should support a AU: Please
confirm the
diverse range of applications. To achieve this, systems are built with many edits.

combinations and are generalized to support a wide range of applications.


Applications such as financial management systems, hospital management,
social media analytics, scientific applications, and social networking require
big data systems to manage their data and values [28].
5.
Compatible system with low cost: Systems should be extensible by adding
new functionalities at low cost to support a wide range of applications. In
such cases, old data may need to be relocated to new formats. Systems must be easily compatible, with minimal upgrading cost, while supporting old data and a wide range of applications.

6.
Random queries on a large dataset: Executing random queries on a large
dataset is very important to discover and learn interesting insights from the
data. To find new business insights, applications require random mining and
querying on datasets.
7.
Scalable system with minimal maintenance: The number of machines
added to scale should not increase the maintenance. Choosing a mod-
ule with small implementation complexity is key to reduced maintenance.
The more complex a system is, the more likely something will go wrong, and hence the more debugging and tuning it requires. Minimum maintenance is obtained by keeping a system simple. Keeping processes up, fixing errors,
and running efficiently when machines are scaled are the important factors to
be considered for developing systems.
8.
Easy to restore: Systems must be able to provide the basic necessary informa-
tion to restore the data when something goes wrong. It should have enough
information replicas saved on distributed nodes to easily compute and restore
the original data by utilizing saved replicas.

6.3.4 Generalized Framework for Big Data Processing


Based on the abovementioned basic principle and properties of big data, Lambda
Architecture provides [6] some general steps for implementing an arbitrary func-
tion on a random dataset in real time. The challenges of big data processing are
framed into a three generalized layer architecture to produce a result with small
latency. The three layers of the framework are storage layer, knowledge discovery
layer, and speed layer, as shown in Figure 6.1.
It is redundant to run a query on the whole dataset to get the result, consider-
ing the general equation “query = function (on all data).” It takes a large amount
of resources and hence is expensive. However, the data can be processed efficiently
using the layered architecture shown in Figure 6.1.

6.3.4.1 Storage and Precomputation Layer


According to the basic principle of data processing, the storage layer precomputes
the function on the whole data and stores the results in a number of batches and
indexes them. Subsequent queries use the precomputed data, instead of preprocess-
ing whole dataset again. Hence, this layer helps to get the results quickly by giving
a precomputed view of data.
The precomputed view gets outdated whenever new data is collected in the data store (Hadoop DFS [HDFS]), and hence so do the query results. To resolve this problem, the
batch layer (MAP REDUCE) [2] precomputes its view on the main dataset after a
particular time period. It is a high latency operation as it executes a function on the
entire dataset at periodic intervals.
This layer has two tasks: to store an absolute, persistently increasing master dataset and to execute random queries on that dataset. Parallel computation across the clusters makes it simple to manage varying sizes of datasets at the precomputation layer.
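As a hedged sketch of what this layer precomputes, the following pure-Python word count mimics the map, shuffle, and reduce phases on an in-memory dataset; in a real deployment the same mapper and reducer logic would run over HDFS blocks through Hadoop (for example, via Hadoop Streaming), and the sample lines are invented:

    from itertools import groupby

    lines = ["big data needs new tools", "big data is big"]   # stand-in for HDFS blocks

    # Map phase: emit (word, 1) pairs from every input line.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle phase: bring equal keys together, as the framework does between phases.
    mapped.sort(key=lambda pair: pair[0])

    # Reduce phase: sum the counts per word to build the precomputed (batch) view.
    batch_view = {word: sum(count for _, count in group)
                  for word, group in groupby(mapped, key=lambda pair: pair[0])}

    print(batch_view)   # {'big': 3, 'data': 2, 'is': 1, 'needs': 1, 'new': 1, 'tools': 1}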

6.3.4.2 Knowledge Discovery Layer (Serving Layer)


Figure 6.1  Generalized framework for big data processing. (The figure shows three layers, each answering queries through real-time and batch views: a real-time data processing layer for data analytics and visualization of new data, using MLlib, Mahout, and Tableau; a knowledge discovery layer for data reduction and processing of Map Reduce views, using Map Reduce and Spark; and a storage and precomputation layer for data storage and management, using HDFS and the cloud.)

It saves and stores precomputed views for extracting valuable information by efficiently querying the precomputed view. The basic function of the serving layer is to update the precomputed views from the distributed database and make them available for knowledge discovery. Finally, it periodically refreshes the precomputed views to update new datasets.
It requires storage layer updates and random reads. It does not support arbitrary
writes as it increases the complexity in the database. Hence, it makes the database
systems simple, robust, easy to configure, and easy to operate, for example, HBase and Cassandra.
All the desired properties of big data systems are accomplished at the storage
and serving layers, except low latency. Hence, the next real-time data processing
layer resolves the problem of low-latency updates.

6.3.4.3 Real-Time Data Processing Layer (Speed Layer)


The purpose of this layer is to increase the computation speed so that arbitrary data
in real time is computed by arbitrary functions. Finally, the issue that needs to be
addressed at present is real-time data processing and computation [29]. Instead of precomputing the data from the storage layer, real-time views are updated with new data as it is generated, eventually achieving the lowest possible latency.
This layer provides solutions for datasets generated in real time to improve
latency, whereas the storage layer produces precomputed views on the entire data-
set. Datasets that are no longer essential in real-time processing are removed, the
results are temporarily saved, and the complexity is pushed to the application layer.
Hence, the real-time data processing layer is more complex than the storage and
serving layers.
Finally, valuable results are obtained by joining the results from the precom-
puted views and real-time views. Hence, future research work would be focused
on bringing together the batch and real-time views to produce new and valuable
insights and to make better decisions. It requires advanced machine learning and
analytical techniques to improve the computational speed with maximum accuracy
for continuously changing random datasets.
A generalized flexible architecture with distinct components focused on spe-
cific purposes leads to more acceptable performance. Applications must be robust enough to recompute corrupted values and results by eventually re-executing the computation on whole datasets to fix problems.
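To make the joining of batch and real-time views concrete, the sketch below shows a query-time merge in plain Python under the simplifying assumption that both views are key-to-count dictionaries; the view contents and key names are illustrative:

    batch_view    = {"sensor_a": 1200, "sensor_b": 845}   # rebuilt periodically by the batch layer
    realtime_view = {"sensor_a": 37, "sensor_c": 5}       # updated incrementally by the speed layer

    def query(key):
        # The answer joins the (slightly stale) batch view with the recent
        # increments held in the real-time view.
        return batch_view.get(key, 0) + realtime_view.get(key, 0)

    print(query("sensor_a"))   # 1237
    print(query("sensor_c"))   # 5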

6.4 Big Data Analytics


Advancement in electronics and communication technology leads to increased
digitization of the world through IoT, social media, and sensor networks, which results in massive amounts of data erupting every day. The capability to analyze such large
volumes of data brings in a new age of invention [16]. “Big data is a collection
of large, dynamic, and complex datasets. New innovative and scalable tools are
required to deal with it. Analytics refers to processing and extracting meaningful
data, which could help in making better predictions and decisions.” There are many
advantages of big data analytics and many challenges as well.

6.4.1 Big Data Characteristics and Related Challenges


The three basic characteristics—volume, velocity, and variety—as shown in Figure
6.2 are used to define the nature of big data [17].

Figure 6.2  Big data characteristics and related challenges. (Volume: terabytes and petabytes of data stored on the cloud and hard disks; the challenge is processing large amounts of data. Velocity: streams of data generated at high speed from Twitter, Facebook, and other social media; the challenge is processing data in minimum time. Variety: unstructured and semistructured data such as audio, video, log messages, and transactions; the challenge is efficiently processing different varieties of data.)

6.4.1.1 Volume
A large amount of unstructured data is generated and archived compared to traditional data. This data is generated continuously from various sources, such as system logs, sensor data, click streams, transaction-based data, email communications, housekeeping data, and social media. The amount of data is increasing [10]
to a level that the traditional database management and computation systems are
incapable of handling it. The solution based on data warehouse may not be capa-
ble of analyzing and processing huge data due to the lack of a parallel processing
design. So, increasing volume is one of the biggest challenges as it requires a new
level of scalability for storage and analysis.

6.4.1.2 Velocity
The amount of data grows exponentially with the increase of IoT, sensor-equipped devices, e-business, and social media. The data generated continuously at high speed makes it challenging to process and analyze. It is essential to devise algorithms
to get quick results from streaming data. For example, online interactions and real-
time applications require high rate of analysis.
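As a toy illustration of answering queries on a stream with minimum delay, the sketch below keeps a running count over a sliding window of the most recent events in plain Python; the simulated sensor readings and the window size are illustrative assumptions:

    from collections import deque

    class SlidingWindowCounter:
        """Counts matching events among the most recent `window` items of a stream."""

        def __init__(self, window=1000):
            self.events = deque(maxlen=window)   # old events fall out automatically

        def add(self, event):
            self.events.append(event)

        def count(self, predicate):
            # Queries are answered from the small in-memory window instead of
            # rescanning the entire, unbounded stream.
            return sum(1 for e in self.events if predicate(e))

    counter = SlidingWindowCounter(window=3)
    for reading in [5, 12, 7, 20]:               # simulated sensor stream
        counter.add(reading)
    print(counter.count(lambda r: r > 10))       # 2: only 12, 7, 20 remain in the window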

6.4.1.3 Variety
A variety of data is generated from various sources in different structures, such as
text, video, audio, images, and log files. A combination of structured, unstructured,
and semistructured data is not supported by the relational database system. Hence,
this requires modified storage and processing methods for manipulation of hetero-
geneous data.

6.4.2 Why Big Data Analytics?


Big data analytics refers to the analysis of large unstructured data to extract the
knowledge hidden in every single bit of data for making better decisions and pre-
dictions. It is essential to discover new values from past unused data and predict a
better future.

Many large-scale organizations and enterprises require big data tools and tech-
nologies to analyze their past history and customer information to understand cus-
tomer needs, in order to improve their business and finally to make better decisions
and predictions to survive in the competitive era. They also need [16] robust big
data analytics tools and technologies to make new innovations, process improve-
ments, monitoring, security, and many other functions. The analytical methods used to handle big data in different areas are listed in Sections 6.4.2.1–6.4.2.5.

6.4.2.1 Text Analytics
The process of mining valuable and meaningful information from text data is called
text analytics. Some of the examples of text data are mail threads, chat conversa-
tions, reviews, comments, tweets, feedbacks, financial statements, and log records.
Text analytics comprise natural language processing, statistical analysis [29,30],
machine learning [32], and computational linguistics. Meaningful abstracts are
extracted from large-scale text data by applying many text mining and analyti-
cal methods. Text analytics is basically used for question answering, information
extraction, sentiment analysis, and text summarization. Text analytics is very
essential in analyzing social media data such as tweets and Facebook comments to
understand people's sentiments and events happening in real time.
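As a minimal illustration of text analytics for sentiment, the following sketch scores short texts by counting words from small positive and negative word lists; the word lists and tweets are made-up examples, and real systems would rely on the natural language processing and machine learning methods cited above:

    POSITIVE = {"good", "great", "love", "happy", "excellent"}
    NEGATIVE = {"bad", "poor", "hate", "sad", "terrible"}

    def sentiment_score(text):
        # +1 for every positive word, -1 for every negative word
        words = text.lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    tweets = [
        "love the new phone and the excellent battery",
        "terrible service and a sad experience",
    ]
    for tweet in tweets:
        label = "positive" if sentiment_score(tweet) > 0 else "negative"
        print(label, "->", tweet)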

6.4.2.2 Audio Analytics
Audio analytics refers to the extraction of valuable and meaningful information
from audio data. It is most commonly used in call centers and health care services
for improving skills and services provided by call centers and also for improving
patient treatments and health-care services. It is also essential in customer relationship management (CRM) to under-
stand the customers and improve the quality of products and facilities to satisfy
their needs and to maintain their relationships.

6.4.2.3 Video Analytics
Video analytics refers to extracting valuable data by tracking and analyzing the
video streams. It is used in key application areas, such as marketing and operations
management. Analyzing video streams of sports such as tennis and cricket helps to
improve the performance of the sports person.

6.4.2.4 Social Media Analytics


Social media channels contain huge volume of structured and unstructured data,
which help to identify recent trends and changes in the market by analytics. Real-
time analysis of social media is essential to identify events happening around
the world, to understand the sentiments of the people toward particular issues,
people's interest in particular products, and so on. Many organizations are trying
to improve their markets by analyzing social media data, such as people behav-
ior, sentiments, and opinions. According to [33], social media are categorized as
follows:

1. Social news: Reddit, Digg


2. Wiki: Wikihow, Wikipedia
3. Social networks: LinkedIn, Facebook
4. Microblogs: Tumblr, Twitter
5. Sharing Media: YouTube, Instagram

6.4.2.5 Predictive Analytics
Predictive analytics is used to make predictions on future events based on the anal-
ysis of current and historical data. According to [33,30], statistical methods form
the base for predictive analysis.
Making use of in-memory data analytics, big data tools have improved the per-
formance of data query notably. Big data analytics is not just about making better
decisions but also about real-time processing that motivates businesses to derive
new values and improve performance and profit rates from insights gained.
Big data beats RDBMS in several ways including robust backups, recovery,
faster search algorithms, overall scalability, and low-latency reads and writes.
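As a small, hedged example of predictive analytics, the sketch below fits a linear regression on historical data and predicts a future value with scikit-learn; the figures are invented purely for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression   # pip install scikit-learn

    # Invented historical data: monthly ad spend vs. sales, both in thousands.
    X = np.array([[10.0], [15.0], [20.0], [25.0], [30.0]])
    y = np.array([110.0, 135.0, 162.0, 185.0, 214.0])

    model = LinearRegression().fit(X, y)

    # Predict sales for a planned spend of 35 (thousand).
    forecast = model.predict(np.array([[35.0]]))
    print(round(float(forecast[0]), 1))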

6.4.3 Challenges in Big Data Analytics


Capturing the data generated at high speed from various sources, storing huge
data, querying, distributing, analyzing, and visualization are the major challenges
of a big data system. Data incompleteness and inconsistency, scalability, timeliness,
and data security are the challenges [34] in analyzing the large data of big data
systems. The primary step in big data analysis is to clean the raw data into a well-constructed form. However, efficient access, analysis, and visualization still remain big challenges for future research work. Some of the challenges in each phase of big data analytics are discussed in Sections 6.4.3.1–6.4.3.5.

6.4.3.1 Collect and Store Data


The enterprise storage designs, such as direct-attached storage (DAS), storage area
network (SAN), and network-attached storage (NAS), were usually used for col-
lecting and storing data. In large-scale distributed systems, some drawbacks and
limitations of all these existing storage structures are observed.
On highly scalable computing clusters, concurrency and throughput for
each server are essential for the applications, but current systems lack these fea-
tures. Improving data access is a way to improve the data intensive computing
performance [21]. Data access needs to be improved by including the replication of


data, distribution of data, relocation of data, and parallel access.

6.4.3.2 Data Management
The traditional methods of managing structured data includes two important
parts. One is a schema to store the dataset and another is a relational database for
data retrieval. Data warehouse and data marts are the two standard approaches for
managing large-scale structured datasets. SQL is used to perform operations on
relational structured data. Data warehouse is used to store, analyze, and report the
outcomes to users. Access and analysis of the data obtained from a warehouse is
enabled by a data mart.

To overcome the rigidity of normalized RDBMS schemas, big data systems adopt NoSQL. NoSQL ("Not Only SQL") [19] is a method to manage and store unstructured and non-relational data, for example, the HBase database.
Since SQL is a simple and reliable query language, many big data analytical platforms, such as SQL stream, Impala, and Cloudera, still use SQL in their database systems.
NoSQL employs many approaches to store and manage unstructured data. Data storage and management are controlled independently of each other to improve the scalability of data storage and the low-level access mechanism in data management. The schema-free structure of a NoSQL database allows applications to dynamically change the structure of tables without rewriting existing data. Apache Cassandra [7] is the most popular NoSQL database and is used by many businesses, such as Twitter, LinkedIn, and Netflix. NoSQL provides very flexible methods for updating the development and deployment of applications and is also used for data modeling. A minimal Cassandra sketch follows.
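The following sketch shows the kind of keyspace and table setup plus an insert described above, assuming a locally reachable Cassandra node and the DataStax cassandra-driver package; the keyspace, table, and data are illustrative:

    from datetime import datetime, timezone
    from cassandra.cluster import Cluster        # pip install cassandra-driver

    cluster = Cluster(["127.0.0.1"])              # contact point of a local node (assumption)
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.set_keyspace("demo")

    # New columns can be added later with ALTER TABLE without rewriting old rows.
    session.execute("""
        CREATE TABLE IF NOT EXISTS tweets (
            user_id   text,
            posted_at timestamp,
            body      text,
            PRIMARY KEY (user_id, posted_at)
        )
    """)

    session.execute(
        "INSERT INTO tweets (user_id, posted_at, body) VALUES (%s, %s, %s)",
        ("alice", datetime.now(timezone.utc), "trying out a NoSQL column store"),
    )

    for row in session.execute("SELECT body FROM tweets WHERE user_id = %s", ("alice",)):
        print(row.body)

    cluster.shutdown()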

6.4.3.3 Data Analysis
To cope with data sizes that increase faster than the CPU speeds predicted by Moore's law, researchers have given more attention to speeding up the analysis algorithms. Even though processors keep doubling in clock cycle frequency, the data size increases significantly faster than the CPU speed, so it is essential to develop online, sampling, and multiresolution analysis methods. On the
other hand, development of parallel computing is required with increasing num-
bers of cores in processors. Large clusters of processors, distributed computing, and
cloud computing are developed fast to aggregate several different workloads.
In real-time applications, such as navigation, social networks, finance, biomedicine, astronomy, intelligent transport systems, and IoT, speed is the top priority. It is still a big challenge for stream processing to give quick
and appropriate replies when large amounts of data need to be processed in a short
span of time.

6.4.3.3.1 Algorithms for Big Data Analysis


In big data analysis, data mining algorithms play a dynamic role in determining
the cost of computation, requirement of memory, and accuracy of final results.
Problems associated with large data generation have been appearing since the last
decade. Fan and Bifet defined [35] the terms big data and big data mining for rep-
resenting large datasets and knowledge extraction methods from large data, respec-
tively. Many machine learning algorithms play a major role in solving big data
analysis tasks. Data mining, machine learning algorithms, and their importance in
big data analytics are described as follows:

1.
Clustering algorithms: In data clustering, many challenges are emerging
in addressing the characteristics of big data. One of the important issues
that need to be addressed in big data clustering is how to reduce the data
complexity. Big data clustering is divided into two groups [36]: (i) single-
machine clustering using sampling and dimension reduction solutions and
(ii) multiple-machine clustering using parallel and Map Reduce solutions
[37]. Using sampling and dimension reduction methods, complexity and
memory space required for data analytical processes will be reduced.
Inappropriate data and dimensions are discarded before data analysis pro-
cess starts. To reduce the data size for data analysis processes, data sampling is
used, and for reducing the whole dataset, dimension reduction is used.
To perform the clustering process in parallel, CloudVista [38] uses cloud
computing. It is a common solution for clustering big data. To handle large-
scale data, CloudVista uses BIRCH (balanced iterative reducing and clustering using hierarchies) and sampling methods; a minimal single-machine clustering sketch appears after this list.
2.
Classification algorithms: Many researchers are working toward develop-
ing new classification algorithms for big data mining and transforming tra-
ditional classification algorithms for parallel computing. Classification [39]
algorithms are designed in such a way that they take input data from dis-
tributed data sources and use various sets of learners to process them. Tekin
et al. presented “classify or send for classification” as a novel classification
algorithm.
In the distributed data classification method, the input data should be
processed in two different ways by each learner. One performs classifica-
tion functions, whereas the other forwards the input data to another labeled
learner. These kinds of solutions improve the accuracy of big data classification.
For example, to perform big data classification, Rebentrost et al. [40]
defined a quantum-based support vector machine and showed that with
O(log NM) time complexity the proposed classification algorithm can be
implemented, where M represents the amount of training dataset and N is


the number of dimensions.
3. Association rules and sequential pattern mining algorithms: The early
methods of pattern mining were used to analyze the transaction data of large shopping malls. In the beginning, many researchers tried to use frequent pat-
tern mining methods for processing big datasets. FP-tree (frequent-pattern
tree) [41] uses the tree structure to reduce the computation time of association
rule mining. Further, Map Reduce method was used in the frequent pat-
tern mining algorithms to improve its performance [42,43]. Big data analysis
using the Map Reduce model significantly improves the performance of these
methods compared to old-style frequent pattern mining algorithms running
on a single machine.
4. Machine learning algorithms for big data: Machine learning algorithms
[44,45] typically work as the “search” algorithms for required solutions and
are used for different mining and analysis problems compared to data min-
ing methods. To find a fairly accurate solution for the optimization problem,
machine learning algorithms are used. For example, machine learning algo-
rithms and genetic algorithms can also be used to resolve the frequent pattern
mining problems, just as they are used to solve clustering problems. The potential of machine learning is also used to improve the performance of other parts of the KDD (knowledge discovery in databases) process, for example as feature-reduction operators on the input.
The consequences indicate that machine learning algorithms have become essential parts of big data analytics. Subsequently, many statistical methods, old mining algorithms, processing solutions, and graphical user interfaces are also used in several descriptive tools and big data platforms.
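As a hedged illustration of single-machine big data clustering via sampling, the sketch below uses scikit-learn's MiniBatchKMeans, which fits the model on small random batches instead of the full dataset at once; the synthetic data and parameter values are illustrative:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans   # pip install scikit-learn

    rng = np.random.default_rng(42)

    # Synthetic "large" dataset: 300,000 points scattered around three centers.
    centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
    X = np.vstack([c + rng.normal(scale=0.5, size=(100_000, 2)) for c in centers])

    # Mini-batches keep memory use and per-step cost small, which is one of the
    # sampling-style strategies for single-machine clustering mentioned above.
    model = MiniBatchKMeans(n_clusters=3, batch_size=1024, n_init=3, random_state=0)
    model.fit(X)

    print(np.round(model.cluster_centers_, 1))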

6.4.3.4 Security for Big Data


To improve data security, data protection laws have been implemented by several
developed and developing countries. Intellectual property protection, financial
information protection, personal privacy protection, and commercial secrets are
major security issues. Data security is difficult because large amounts of data are gen-
erated due to digitization in various sectors. Hence, the security challenges raised
by the increasingly distributed nature of big data in many applications need to be
addressed in future research work. It is even more complex to identify threats,
which can intensify the problems from anywhere in the big data network.

6.4.3.5 Visualization of Data
Information hidden in large and complex datasets can be conveyed effectively
in visual form. The challenges in data visualization [46] are to
represent facts more intuitively and effectively by using distinct patterns, graphs,
and visualization techniques. For valuable data analysis, information should be
abstracted in some schematic form from complex datasets, and it should include
variables or attributes for the units of information.
To extract and understand the hidden insights from the data, e-commerce com-
panies, such as eBay and Amazon, use big data visualization tools, such as Tableau
[47]. This tool helps to convert large complex datasets into interactive results and
intuitive pictures, for example, data about thousands of customers, goods sold,
feedback, and customer inclinations. However, the current visualization tools still
face challenges, such as scalability, functionality, and response time, that can
be addressed in future work.

6.5 Tools and Technologies for Big Data Processing


The extensive changes in big data technologies come with remarkable challenges
and the need for innovative methods and techniques of big data analytics. Many
new methods and techniques have been developed by data scientists to capture, store,
manage, process, analyze, and visualize big data. Multidisciplinary methods
such as statistics, mathematics, machine learning, and data mining [48,49] are applied
to unearth the most valuable insights from big data. The rest of this section briefly
discusses some of the important tools and technologies developed for processing
big data.
There are three classifications on which the current technologies are built:
batch processing, stream processing, and interactive analysis tools and technolo-
gies. Most batch processing technologies, such as Map Reduce [2] and Dryad [50],
are based on Hadoop. Storm and S4 [51] are examples of stream processing tools
usually used for real-time analytics on streaming data. The third category is inter-
active analysis, where the data analytics is done interactively with the user inputs.
The user can interact in real time and can review, compare, and analyze the data
in graphic or tabular form, or both at a time. Examples of interactive analysis tools
are Google's Dremel [52] and Apache Drill. Tools of each category are discussed
in the following subsections.
Hadoop was developed by Yahoo for large-scale computation, and later it was
taken over by the Apache Foundation. The framework was based on Google's
Map Reduce system and Google File System [53]. The initial version of Hadoop
lacked the capability to access and process huge volumes of data on commodity
hardware in a distributed environment. To make the computation layer more
robust, it was separated from the storage layer. The storage layer, named the Hadoop
distributed file system (HDFS), is capable of storing huge amounts of unstructured
data in large clusters of commodity hardware, and the Map Reduce computation
structure is built on top of the HDFS for data-parallel applications.

Figure 6.3  Hadoop stack. (Layers, from bottom to top: 1. file system—HDFS, NoSQL, HBase, Cassandra; 2. resource management—YARN/Mesos; 3. data processing—batch (Map Reduce), stream (Spark, S4, Storm), high-level languages (Hive/Pig), graph (Pregel/Giraph), with Thrift and ZooKeeper providing message and distributed coordination protocols; 4. data analytics—MLlib/Mahout/MLbase, R/Python; 5. visualization—Tableau.)

A complete stack of big data tools was built on Hadoop by Apache to support different applica-
tions, as shown in Figure 6.3. The later version of Hadoop is called Apache YARN,
in which a new layer, called the resource management layer, is added for efficient utili-
zation of resources in clusters of big data.
There are five layers in big data systems (Figure 6.3). Distributed file stor-
age is the bottom layer for storing large distributed data, above which there is
a cluster resource management layer. The purpose of this layer is to manage
large clusters of hardware resources and to allow the upper layers to utilize the
resources efficiently. The data stored in distributed file systems is processed by
the data processing layer as batch, stream, or graph processing. Preprocessed
data is fed to the data analytics layer to analyze and extract more valuable infor-
mation. To represent valuable results, high-level abstractions are built in the
visualization layer.

6.5.1 Tools
6.5.1.1 Thrift
Thrift is a scalable cross-language services library and code generation tool set
for building scalable back-end services. Its major goal is to provide efficient and reli-
able communication across different programming languages by abstracting the portions
of each language that require the most modification into a common library and
then implementing that library in each language. Thrift [54] supports many languages
such as Haskell, Java, C++, Perl, C#, Ruby, Cocoa, Python, D, Delphi, Erlang,
OCaml, PHP, and Smalltalk.


6.5.1.2 ZooKeeper
Yahoo developed a distributed coordination system called ZooKeeper [8], and later,
it was taken over by the Apache Software Foundation. To coordinate the distrib-
uted applications, it offers an integrated service. ZooKeeper provides the following
support for distributed coordination:

◾◾ Sequential consistency: Client updates are applied in the order in which
they are sent.
◾◾ Atomicity: All updates either complete fully or not at all; no partial updates are
allowed.
◾◾ Single system image: The distributed ZooKeeper ensemble presents a single system
image to clients.
◾◾ Reliability: Once an update has been applied, it persists until it is overwritten.

Applications can use these capabilities of ZooKeeper to build higher level
coordination functions, such as read/write locks, queues, barriers, and
leader election.
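As a rough illustration of how an application can build on these guarantees, the sketch below uses the kazoo Python client (an assumed choice of client library) to register an ephemeral group-membership node and take a distributed lock; the host address and znode paths are placeholders.

# Sketch: basic ZooKeeper coordination from Python with kazoo (assumed library).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # ZooKeeper ensemble address is a placeholder
zk.start()

# Ephemeral, sequential znode: it disappears automatically if this worker dies,
# which is the usual way group membership is tracked.
zk.ensure_path("/app/workers")
zk.create("/app/workers/worker-", b"host-a", ephemeral=True, sequence=True)

# Distributed lock recipe built on ZooKeeper's ephemeral sequential nodes.
lock = zk.Lock("/app/locks/resource-1", "worker-a")
with lock:
    # Only one client in the whole cluster executes this section at a time.
    print("holding the lock")

zk.stop()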

6.5.1.3 Hadoop DFS
HDFS [55] is a hierarchical file system consisting of directories and files, similar to
a UNIX file system. Users can perform all the usual administrative and manipulation
operations, such as creating, deleting, copying, saving, and moving files in HDFS, as in
a normal UNIX system (a small sketch of these operations follows).
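A minimal sketch of these UNIX-like operations, assuming a working Hadoop installation and driving the standard hdfs dfs command line from Python; all paths and file names are placeholders.

# Sketch: basic HDFS file operations via the "hdfs dfs" CLI, driven from Python.
import subprocess

def hdfs(*args):
    # Run one "hdfs dfs" command and return its standard output.
    return subprocess.run(["hdfs", "dfs", *args], check=True,
                          capture_output=True, text=True).stdout

hdfs("-mkdir", "-p", "/user/demo/input")           # create a directory
hdfs("-put", "local_log.txt", "/user/demo/input")  # copy a local file into HDFS
print(hdfs("-ls", "/user/demo/input"))             # list directory contents
hdfs("-mv", "/user/demo/input/local_log.txt",      # move/rename within HDFS
     "/user/demo/input/log_2017.txt")
print(hdfs("-cat", "/user/demo/input/log_2017.txt")[:200])  # read back a sample
hdfs("-rm", "/user/demo/input/log_2017.txt")       # delete the file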
HDFS architecture: The HDFS architecture consists of two types of units: a single name
node and multiple data nodes. The name node manages the file system namespace,
which it keeps in its local file system, and an edit log records updates whenever
changes are made to the file system. Additionally, the name node keeps track of all
the blocks in the file system, their assignment to data nodes, and the mapping of
blocks to data nodes.
The name node is replicated for fault tolerance. To manage a large number of
files, each file is divided into a number of blocks that are saved on data nodes.
Operations on files and directories, such as opening, closing, and renaming, are
performed by the name node, which also tracks the mapping of blocks to data nodes.
To read a file, a client obtains the list of block locations from the name node and then
communicates directly with the data nodes.
A block report is used to manage the copies of files. The data node creates a separate
file for each block stored in its local file system and creates directories to separate
the files belonging to different applications.
Fault tolerance in HDFS: HDFS splits files into blocks and replicates the blocks
across different data nodes for fault tolerance. If any data node goes down, the
name node uses a replicated copy of its blocks to serve the requested data chunk.


6.5.2 Resource Management
Clusters of commodity servers are cost-efficient solutions for intensive scientific
computations and are used for running large Internet services. The issues of tradi-
tional resource management for Hadoop and Storm can be described in two aspects:
First, a system that runs Hadoop or Storm is commodity hardware, not cus-
tomized hardware. Second, Hadoop requires a lot of configuration and scheduling
to support fine-grained tasks.
To address these issues of traditional resource management system, new tools
such as YARN [56] and Mesos [57] are introduced.

◾◾ YARN: It offers a resource management structure within a cluster for better
resource utilization. Before YARN was introduced into the Hadoop framework,
the cluster was partitioned to share the resources, and different frameworks were
run on these partitions; this was not an efficient way to utilize resources.
Hence, YARN was introduced in Hadoop v2.0 to handle diverse computational
frameworks on the same cluster.
◾◾ Mesos: It is another cluster resource manager that allows processing frame-
works such as Hadoop, Spark, Storm, and Hypertable to run on a shared cluster
environment.

The abovementioned two resource management approaches are compared in terms


of their design and scheduling work.
Mesos first finds free and available resources and then calls the application
scheduler. This model includes a two-level scheduler with pluggable scheduling
algorithms and is called a non-monolithic model. Mesos supports an unlimited
number of scheduling algorithms, thus allowing thousands of schedulers to run as
multi-tenants on the same cluster. Each framework is free to decide which
algorithms it uses to schedule its tasks.
Apache YARN, in contrast, follows a monolithic model: when a job request
arrives, it estimates the available resources and schedules the job accordingly.
YARN is optimized for scheduling Hadoop jobs, typically long-running batch jobs,
and it does not handle DFSs or database services. By combining new algorithms
into its scheduler, YARN can be made to handle different types of workloads.

6.5.3 NoSQL Database: Unstructured Data Management


NoSQL databases are used to manage unstructured, complex, nested, and hier-
archical data structures [18,19]. As relational database systems are not capable of
handling unstructured data without extensive manipulation, they are not a good
solution for big data analytics. The requirements of database management have
changed dramatically with the evolution of interactive web applications and
smartphone applications. The need for high-performance data computing,
availability, and scalability is fulfilled by the NoSQL database. It has a schema-less
design and does not need to follow a fixed format like the tables in a relational
database.

6.5.3.1 Apache HBase
HBase is inspired by the Google Bigtable application [4]. It is built on HDFS
and is a multidimensional, distributed, column-oriented data store. It
provides fast access to records and updates for the data tables. It uses HDFS to
store the data and the ZooKeeper framework for distributed coordination. A row key,
column key, and timestamp are used for distributed index mapping in the HBase
multidimensional data model, and the row keys are used to organize the mapping.
Each row has its own unique key and a set of column families; columns can be
added to a column family dynamically.
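A small sketch of this row-key/column-family model, assuming the HBase Thrift gateway is running and using the happybase Python client; the table, row key, and column names are invented for illustration.

# Sketch: storing and scanning rows in HBase through its Thrift gateway (happybase assumed).
import happybase

connection = happybase.Connection("localhost")       # gateway host is a placeholder
connection.create_table("clicks", {"cf": dict()})    # one column family, "cf"

table = connection.table("clicks")

# Row key plus columns added dynamically inside the column family.
table.put(b"user42#2017-01-25", {b"cf:page": b"/home", b"cf:ms": b"120"})

print(table.row(b"user42#2017-01-25"))               # point lookup by row key

# Rows are stored sorted by key, so key-prefix scans are cheap.
for key, data in table.scan(row_prefix=b"user42#"):
    print(key, data)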

6.5.3.2 Apache Cassandra
Apache Cassandra [7] was developed by Facebook and is based on a peer-to-peer
distributed key-value store design. Cassandra has a row-oriented data model in which
all nodes are treated as equal. Cassandra is suitable for real-time applications where
large volumes of data need to be handled with fast random access.
Three orthogonal properties, called consistency, availability, and partition toler-
ance (CAP) [58], are considered when developing distributed applications. According
to the CAP theorem, it is difficult to satisfy all three properties together while keeping
the latency of operations tolerable. By relaxing strong consistency to eventual
consistency, Cassandra satisfies the other two properties, high availability and
partition tolerance. Cassandra uses partitioning and replication similar to the
Amazon Dynamo database model.
Architecture: The Cassandra data model is based on the design of Google Bigtable.
The data model is partitioned into a number of rows, and each row contains a
number of columns. Similar to SQL, Cassandra offers a query language, CQL
(Cassandra Query Language). Tables are designed to hold duplicates of the data
because CQL does not support join operations. For faster access to data values,
Cassandra provides indexes on columns.
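A brief sketch of this model using the DataStax Python driver and CQL, assuming a single local Cassandra node; the keyspace, table, and sample data are placeholders.

# Sketch: creating a table and reading/writing rows with CQL from Python.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])       # node address is a placeholder
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")

session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        sensor_id text, ts timestamp, value double,
        PRIMARY KEY (sensor_id, ts))
""")

# The partition key (sensor_id) determines which node owns the row.
session.execute(
    "INSERT INTO sensor_readings (sensor_id, ts, value) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("s-17", 23.5))

for row in session.execute(
        "SELECT ts, value FROM sensor_readings WHERE sensor_id = %s", ("s-17",)):
    print(row.ts, row.value)

cluster.shutdown()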

6.5.4 Data Processing
Basic data processing frameworks are divided into batch mode, stream mode or
graph processing, and interactive analysis mode based on their processing meth-
ods and speed. The resource managers, Mesos and YARN, manage these runtime
systems in clusters at the lower layers. Unstructured data from HDFS as well as
structured data from NoSQL stores are given as input to these systems. The output
of these systems is redirected to the storage layers or cached for performing analytics
and visualization on it.

6.5.4.1 Batch Processing
Batch processing is suitable for processing large amounts of data stored in batches.
Hadoop Map Reduce is a basic model introduced to process huge amounts of data
in batch. However, it is not suitable for all kinds of batch processing tasks such
as iterative processing. To overcome some of these disadvantages, new processing
models, Spark and Apache Hama, are presented.

1. Hadoop
Hadoop is a distributed processing framework for processing big datasets over
clusters of computers [16] using simple programming models. It is planned
and designed to scale up from single servers to thousands of machines. The
role of each computer is to provide storage and local computation; instead
of depending on hardware for high availability, the framework itself is designed
to detect and handle failures at the application layer, hence providing a highly
available service on top of the cluster.
The underlying Map Reduce framework was introduced and published by Google,
illustrating its method to manage and process large data. Since then, Hadoop has
become the typical structure for storing, processing, and analyzing terabytes to
exabytes of data. Doug Cutting started developing Hadoop, and the framework
got its name "Hadoop" from his son's toy elephant.
Yahoo is a main contributor to Hadoop's advancement; by 2008, Yahoo's web
search engine index was generated using 10,000-core Hadoop clusters. Hadoop
was developed to run on commodity hardware, and it can scale up and down
without manual intervention. Three important functions of the Hadoop frame-
work are storage, resource management, and processing.
– Hadoop Map Reduce (prior to version 2.0): The Hadoop Map Reduce
structure consists of three main components: HDFS, the job tracker (mas-
ter), and the task trackers (slaves). HDFS is used to store and share the data
among the computational tasks of Map Reduce jobs. First, the job tracker
reads the input data from HDFS and splits it into partitions, running a map
task on each partition; the intermediate results are stored in the local file sys-
tem. Second, reduce tasks read the intermediate results from the map tasks
and run the reduce code on them. The results of the reduce phase are saved
in HDFS (a minimal word-count sketch of this model appears after this list).
– Hadoop Map Reduce Version 2.0: The new version of Map Reduce intro-
duces the resource allocation and scheduling tool Apache YARN [59].
The task tracker is replaced by YARN node managers, and a job history
server is a newly added feature of the architecture that keeps track of
finished jobs.


Initially, clients request from YARN the resources required for their
jobs, and the resource manager assigns a place in which the master task
for each job is executed.
– Hadoop characteristics:
1. Fault tolerant: A Hadoop cluster is highly prone to failures as thou-
sands of nodes run on commodity hardware. However,
data redundancy and data replication are employed to achieve fault
tolerance.
2. Redundancy of data: Hadoop divides data into many blocks and
stores them across two or more data nodes to improve the redun-
dancy. The master node preserves information of these nodes and
data mapping.
3. To scale up and scale down: The distributed nature of Hadoop file
system allows Hadoop to scale up and scale down by adding or delet-
ing the number of nodes required in the cluster.
4. Computations moved to data: Queries are computed locally on the data
nodes, and the results are obtained by combining them in parallel,
avoiding the overhead of bringing the data to the computational
environment.
2. Spark
Spark was built on top of HDFS as an open-source project [60] by the
University of California, Berkeley, to address the issues of Hadoop Map Reduce.
The objective of the system is to support iterative computation and increase
the speed of distributed parallel processing to overcome the limitations of
Hadoop Map Reduce.
Resilient distributed datasets (RDDs), an in-memory fault-tolerant data
structure, were introduced at Berkeley [61,62] for efficient data sharing across
parallel computations. Spark supports batch, iterative, interactive, and streaming
workloads in the same runtime with significantly high performance. It also
allows applications to scale up and scale down with efficient sharing of data.
Spark differs from Hadoop by directly supporting basic operations such as join
and group-by.
Spark with RDDs runs applications up to 100 times faster in memory and
10 times faster on disk compared to Hadoop Map Reduce (a short caching
sketch appears after this list).
Spark overcomes some of the limitations of Hadoop as follows:
1. Iterative algorithms: Spark allows applications and users to explicitly
cache data by calling the cache() operation, so that subsequent queries
can use the intermediate results stored in the cache, providing dramatic
improvements in time and memory utilization.
2. Streaming data: Spark offers an application programming interface to
process the streaming data. It also gives an opportunity to design meth-
ods to process real-time streaming data with minimum latency.


3. Reuse of intermediate results: Instead of saving output to disk every
time, it is cached for reuse in other computations, thus reducing the time.
RDDs are fault tolerant because they record the transformations used to
build a dataset rather than the actual data, and they use this lineage to
rebuild a lost copy. If any slice of an RDD is lost, the RDD holds sufficient
information for recomputation and recovery without demanding costly
replication.
4. Unlike Hadoop, Spark is not limited to iterative map and reduce
tasks that need an implicit group-by, in which the map phase requires
serialization and a disk I/O call in each iteration. RDDs essentially act
as an in-memory cache that avoids frequent serialization and I/O
overhead.
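The following sketch illustrates the map and reduce phases described in the Hadoop item above with the classic word count, written for Hadoop Streaming (an assumption; the same logic could equally be expressed with the native Java API). The one script acts as the mapper or the reducer depending on its first argument.

# Sketch: word count for Hadoop Streaming; run as "wordcount.py map" or "wordcount.py reduce".
import sys

def map_stdin():
    # Map phase: emit "word<TAB>1" for every word in this input split.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reduce_stdin():
    # Reduce phase: input arrives sorted by key, so counts can be summed per word.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    map_stdin() if sys.argv[1:2] == ["map"] else reduce_stdin()

Under these assumptions the script would be submitted with the Hadoop Streaming jar, passing it as both the -mapper and the -reducer; the exact jar location varies by installation.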
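For comparison, the short sketch below shows the RDD caching idea from the Spark item in PySpark: an RDD built from a hypothetical HDFS file is cached once and then reused by several queries instead of being recomputed from disk each time.

# Sketch: reusing a cached RDD across several queries in PySpark.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-cache-sketch")

lines = sc.textFile("hdfs:///user/demo/input/log_2017.txt")   # path is a placeholder
errors = lines.filter(lambda l: "ERROR" in l).cache()         # keep the filtered RDD in memory

# Subsequent actions reuse the cached RDD instead of rereading the file.
print("total errors:", errors.count())
print("timeouts:", errors.filter(lambda l: "timeout" in l).count())
top_sources = (errors.map(lambda l: (l.split()[0], 1))
                     .reduceByKey(lambda a, b: a + b)
                     .takeOrdered(5, key=lambda kv: -kv[1]))
print(top_sources)

sc.stop()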

6.5.4.2 Distributed Stream Processing


Apache Storm [63] and Apache S4 [51] are the two main distributed stream pro-
cessing tools. Twitter built Apache Storm and Yahoo developed S4 “Simple Scalable
Streaming System” for real-time stream processing.

1. Storm
Storm is an open-source distributed real-time computation framework dedi-
cated to stream processing. It offers a fault-tolerant mechanism to execute
computation on an event as it flows into the system.
Using Apache Storm, it is easy to process real-time streaming [63] data.
It has many useful applications, such as real-time analytics, online machine
learning, continuous computation, distributed RPC (remote procedure call),
and ETL (extract, transform, and load). Storm is easy to set up and operate.
It is also scalable and fault tolerant. Storm typically does not run on top of
Hadoop clusters; it uses Apache ZooKeeper and its own master and worker
processes to manage topologies.
2. S4
S4 offers a simple programming model [51] and provides easy and
efficient automated distributed execution, for example, automatic load
balancing. In Storm, on the other hand, the programmer must take care
of load balancing, adjusting buffer sizes, and the level of parallelism to
obtain optimum performance (a minimal streaming word-count sketch follows).
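Since Storm and S4 are JVM-based, the following stand-in sketch uses Spark Streaming (whose streaming API was introduced earlier) to show the same streaming word-count pattern from Python; the socket source and batch interval are illustrative choices, not part of Storm or S4 themselves.

# Sketch: continuous word count over a text stream with Spark Streaming.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stream-sketch")
ssc = StreamingContext(sc, batchDuration=2)   # 2-second micro-batches

# Read a text stream from a socket; "nc -lk 9999" can feed it in a local test.
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                               # print each batch's word counts

ssc.start()
ssc.awaitTermination()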

6.5.4.3 Graph Processing
The earlier method of graph processing on top of Map Reduce was inefficient as
it took entire graphs as input and processed them and then wrote the complete
updated graph into the disk. Pregel [64], Giraph [65], and many other systems are
developed for efficient graph processing and to overcome the limitations of Map
Reduce.


1. Pregel
Pregel is a parallel graph processing system constructed on the bulk synchronous
parallel (BSP) model [64]. In the BSP model, a set of processors interconnected
by a communication network each run their own computation threads, and
individual processors are equipped with fast local memory. A platform based on
the BSP model consists of the following three important mechanisms (a small
vertex-centric sketch of this model appears after this list):
– Components for processing local memory transactions (i.e., processors)
– An efficient network for communication between these components
– Hardware support for synchronization between the components
2. Apache Giraph
Giraph is developed on top of Apache Hadoop for distributed graph processing
[65] and is built on the Pregel model. Giraph runs inside the map tasks of
Map Reduce; it uses Apache ZooKeeper to coordinate its tasks and Netty for inter-
node communication. A graph in Giraph is represented by a set of vertices and
edges: vertices perform the computational tasks, and edges represent the con-
nections between them. Compared to Giraph, Pregel is used for
very large-scale graph processing.
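The following single-machine Python sketch mimics the vertex-centric supersteps of this model on a toy graph: each vertex keeps a value, processes incoming messages, and messages its neighbours only while it is still learning larger values. Real Pregel/Giraph deployments distribute the vertices across workers and exchange these messages over the network; the graph and values below are made up.

# Conceptual sketch of BSP supersteps: propagate the maximum vertex value.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
value = {"a": 3, "b": 6, "c": 2, "d": 1}

# Superstep 0: every vertex sends its current value to all its neighbours.
inbox = {v: [] for v in graph}
for v in graph:
    for nbr in graph[v]:
        inbox[nbr].append(value[v])

while any(inbox.values()):                       # run supersteps until no messages remain
    next_inbox = {v: [] for v in graph}
    for v, msgs in inbox.items():
        best = max(msgs, default=value[v])
        if best > value[v]:                      # learned a larger value:
            value[v] = best                      # update and stay active
            for nbr in graph[v]:
                next_inbox[nbr].append(best)     # notify the neighbours
        # otherwise the vertex "votes to halt" and sends nothing this superstep
    inbox = next_inbox

print(value)    # every vertex converges to the global maximum value, 6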

6.5.4.4 High-Level Languages for Data Processing


To retrieve information from the huge amounts of data stored in HDFS and
NoSQL databases, programmers used to write small batch jobs. High-level lan-
guages enable programmers to easily access the data stored in HDFS or a NoSQL
database. Pig and Hive are examples of high-level languages for data processing.

1. Pig
Pig is a procedure-oriented high-level programming language developed by
Yahoo for scripting Hadoop Map Reduce jobs; its language is also called Pig
Latin. Because it is procedure oriented, data pipelines are represented in a
natural way. Due to its procedural nature, Pig is suitable for iterative
data processing, data pipeline applications, and transformation jobs, and is
best suited for processing unstructured data.
A major advantage of Pig [66] compared to a declarative language is that
the programmer retains control over, and can check, the operations performed
on the data. In declarative languages, by contrast, a query programmer must
have a good knowledge of the algorithms and data operations and has to rely
on the query optimizer to make the right choices.
2. Hive
Hive [67] is developed on Hadoop as a high-level language to perform
analytics on big data. It supports processing of large amounts of structured
data. It offers a method to map a structure onto the data stored in HDFS and
query it by using a query language called HiveQL. Like SQL in an RDBMS,
HiveQL is a query language for Hadoop systems to query large-scale struc-
tured data. Complex queries are compiled into Map Reduce jobs to read the
data, whereas simple queries read the data from HDFS directly without
Map Reduce. To maintain the metadata about tables, Hive uses RDBMS
tables. Hive JDBC drivers are used to access the tables created by Hive from
JDBC, making them available to the rich set of operations supported by Java
(a small sketch of querying Hive from Python follows this list).
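A small sketch of querying Hive from Python, assuming a reachable HiveServer2 instance and the PyHive client library; the host, port, table layout, and HDFS location are placeholders.

# Sketch: mapping a structure onto HDFS data and querying it with HiveQL via PyHive.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, database="default")
cursor = conn.cursor()

# Project a table structure onto files already sitting in HDFS.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
        ip STRING, ts STRING, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/user/demo/weblogs'
""")

# HiveQL query; behind the scenes it may be compiled into Map Reduce jobs.
cursor.execute("""
    SELECT url, COUNT(*) AS hits
    FROM weblogs
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
for url, hits in cursor.fetchall():
    print(url, hits)

conn.close()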

6.5.5 Data Analytics at the Speed Layer


Standard data analytical methods are developed using machine learning and data
mining algorithms for specific applications [4,44,68]. Advancing these methods is
necessary for optimizing the performance of analytical tasks on large-scale data.
Machine learning libraries collect different machine learning algorithms so that
they can be applied to different applications at the speed layer.

1. Mahout
Mahout is a library of machine learning algorithms [4] built on Hadoop
Map Reduce to support various analytical tasks. It also aims to include vari-
ous machine learning algorithms for different distributed systems. The Mahout
library includes several algorithms for various tasks, as shown in Table 6.2.
2. MLlib
MLlib is a machine learning library developed on Spark [9,68] that consists of a
set of machine learning algorithms, as shown in Table 6.3, for classification,
clustering, regression analysis, and collaborative filtering (a brief usage sketch
follows).
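A brief usage sketch with Spark's ML library, clustering a tiny made-up dataset with k-means; in practice the input would be loaded from HDFS or another distributed store.

# Sketch: k-means clustering at the speed layer with Spark's ML library.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 0.2), (0.1, 0.1), (9.0, 9.1), (9.2, 8.9)], ["x", "y"])
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=42, featuresCol="features").fit(features)
print(model.clusterCenters())       # one centre per cluster
model.transform(features).show()    # each point with its assigned cluster id

spark.stop()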
A summary of the relational data and big data analytical methods and tech-
nologies covered in this chapter is presented in Table 6.4 and Figure 6.4. An over-
view of the characteristics, tools, and technologies for managing structured
and unstructured data is represented in the form of a flowchart to make the
landscape of big data tools and technologies easy to grasp. It also shows the areas
of application for which big data technology and relational databases are suitable.

Table 6.2  Algorithms in Mahout

Algorithms | Task
Naive Bayes, neural networks, boosting, logistic regression | Classification
Hierarchical clustering, canopy clustering, k-means, fuzzy k-means, spectral clustering, minimum hash clustering, top-down clustering, mean shift clustering | Clustering
Frequent item mining | Pattern mining


Table 6.3  Algorithms of MLlib

Algorithms | Task
Logistic regression, linear support vector machines | Binary classification
Linear regression, L1 (lasso) regression, L2 (ridge) regularized regression | Regression
k-means | Clustering
Alternating least squares | Collaborative filtering


6.6 Future Work and Conclusion


6.6.1 Future Work on Real-Time Data Analytics
In earlier days, producing a result by processing a petabyte of data within an hour was a
challenging task. Nowadays, technological progress has made it possible to see the
results in a minute. Advances in data analytics and computational intelligence
have made it possible to pose questions and get answers in a fraction of a second.
The introduction of big data technology improves and supports data-driven
decision-making from large-scale data. However, applications of big data analytics
are currently bound by significant latency, and many of the above big data technolo-
gies need to advance toward real-time analytics. The challenging task is to obtain
results on time, or in a fraction of a second; the challenge further intensifies when
data is related to other data.
So far, big data analytics has focused on settings where the data being analyzed
has previously been collected and stored in a database. The main objective of
real-time big data analytics [29], however, is to process and analyze continuously
generated, randomly changing, streaming data. It will be challenging to store all
the data and events and still obtain answers within a fraction of a second. Therefore,
real-time big data analytics systems should process data by sampling events without
losing any valuable information from the data.
Fast processing and analysis for making quick decisions is important for real-
time big data analytics, so future research work in this direction will give better
predictions and actionable decisions in real time. Real-time analytics of big data
finds application in numerous areas including finance, health care, fraud detec-
tion, and social media.

Table 6.4  Relational and Big Data Tools and Technologies

Relational data (structured data):
– Characteristics: atomicity, consistency, isolation, durability [14]
– Storage and management: DBMS, data warehouse, data mart
– Tools: SQL
– Data processing: OLTP (online transaction processing), OLAP (online analytical processing)
– Analytics: data mining, clustering, classification
– Visualization: graphs, charts
– Applications: employee details management, hospital management, insurance companies

Big data (unstructured data):
– Characteristics: volume, velocity, veracity, variety [17]
– Storage and management: HDFS (DFS), HBase [4], Cassandra [7]
– Tools: NoSQL [19]
– Data processing: batch (Hadoop Map Reduce [2], Spark [60]); stream (S4, Storm); graph (Giraph, Pregel); coordination (ZooKeeper, Thrift); resource management/job scheduling (YARN, Mesos); high-level languages (Pig, Hive)
– Analytics: statistics and fundamental mathematics (R, Python); data analysis (data mining); machine learning (MLlib, Mahout, neural networks)
– Visualization: Tableau (Jason Brooks, 2016)
– Applications: social media computing, health care, government, finance, business, and enterprise; text analytics, web analytics, stream analytics, predictive analytics
Figure 6.4  Flow chart of big data analytical tools and technologies. (The flowchart contrasts big data with relational data (NoSQL vs. SQL) and groups the big data tools and technologies into: principles and properties; storage (HDFS, NoSQL); resource management (YARN, Mesos); coordination (ZooKeeper); batch processing (Hadoop Map Reduce, Spark, Pregel); stream processing (S4, Storm); graph processing (Giraph); high-level languages (Hive, Pig); analytics (data mining, statistics, machine learning with MLlib and Mahout); and visualization.)

6.6.2 Conclusion
Enormous amounts of data are generated with increasing speed in different formats
due to digitization around the world. Big data technology will definitely enter every
domain, enterprise, and organization. Traditional database management systems
fail to scale to growing data needs, lacking abilities such as multiple partitioning
and parallelization.
data generated from different sources such as sensors, smart applications, wearable
technologies, smartphones, and social networking websites. Evolution of big data
analytics and tools and technologies made it possible to efficiently handle huge
unstructured growing data. One of the most popular open source frameworks,
Hadoop, is a generally recognized system for large-scale data analytics. It is mainly
known and accepted for support in large-scale distributed parallel computing of
clusters, and is cost-effective, fault tolerant, reliable, and provides highly scalable
support for processing and managing terabyte to petabyte of data. However, it is
not suitable for real-time data analytics. To overcome the incapability of this earlier
version of Hadoop system for real-time analytics, a new framework was introduced,
known as Spark. To support real-time analytics, Spark with RDDs gives results in a
fraction of seconds. Several areas such as business, social media, government, health
care, and security are implementing big data technologies to gain knowledge from
previously unused data to make better decisions and predictions. In the future, it
will be worthwhile to overcome the drawbacks of the Spark and Hadoop systems
and work toward real-time analytics. The challenges in batch processing and stream
processing analytical systems also need to be addressed in the future work.

References
1. McKinsey & Company. “Big Data: The Next Frontier for Innovation, Competition, and
Productivity.” McKinsey Global Institute, p. 156, June 2011.
2. J. Dean and S. Ghemawat. “MapReduce.” Communications of the ACM, vol. 51, no. 1,
p. 107, January 2008.
3. "Apache Spark™—Lightning-Fast Cluster Computing." http://spark.apache.org/ (Accessed
January 25, 2017).
4. A. Duque Barrachina and A. O’Driscoll. “A Big Data Methodology for Categorising
Technical Support Requests using Hadoop and Mahout.” Journal of Big Data, vol. 1, p. 1,
2014.
5. F. Aronsson. “Large Scale Cluster Analysis with Hadoop and Mahout.” February 2015.
6. N. Marz and J. Warren. “Big Data—Principles and Best Practices of Scalable Realtime
Data Systems.” Harvard Business Review, vol. 37, pp. 1–303, 2013.
7. L. Avinash and P. Malik. “Cassandra: A Decentralized Structured Storage System.” ACM
SIGOPS Operating Systems Review, pp. 1–6, 2010.
8. F. Junqueira and B. Reed. ZooKeeper: Distributed Process Coordination. 2013.
9. X. Meng, J. Bradley, E. Sparks, D. Xin, R. Xin, and M. J. Franklin. "MLlib: Machine
Learning in Apache Spark." Journal of Machine Learning Research, vol. 17, pp. 1–7, 2016.
10. C. Lynch. “Big Data: How Do Your Data Grow?” Nature, 2008.
11. "Welcome to Apache™ Hadoop®!" http://hadoop.apache.org/. (Accessed: January 25,
2017).
12. H. Architecture. "Hadoop Fundamentals." 2012.


13. K. Mayer-Schönberger, V. and Cukier. Big Data: A Revolution that Will Transform How
We Live, Work, and Think. Boston, MA: Houghton Mifflin Harcourt, 2013.
14. E. F. Codd. “A relational Model of Data for Large Shared Data Banks.” Communications
of the ACM, vol. 26, no. 6, pp. 64–69, 1983.
15. T. Haerder and A. Reuter. “Principles of Transaction-Oriented Database Recovery.” ACM
Computing Surveys, vol. 15, no. 4, pp. 287–317, 1983.
16. P. Zikopoulos and C. Eaton. Understanding Big Data: Analytics for Enterprise Class
Hadoop and Streaming Data. 2011.
details. 17. D. Laney. “META Delta.” Application Delivery Strategies, vol. 949, p. 4, 2001.
18. R. Cattell. “Scalable SQL and NoSQL Data Stores.” ACM SIGMOD Record, vol. 39, no.
4, p. 12, May 2011.
19. J. Han, E. Haihong, G. Le, and J. Du. “Survey on NoSQL Database.” In 2011 6th
International Conference on Pervasive Computing and Applications (ICPCA), Port
Elizabeth, South Africa, 2011.
20. D. Borthakur. “The Hadoop Distributed File System: Architecture and Design.” Hadoop
Project Website, 2007.
21. G. Bell, T. Hey, and A. Szalay. “Beyond the Data Deluge.” Science, 2009.
22. E. Al Nuaimi, H. Al Neyadi, N. Mohamed, and J. Al-jaroodi. "Applications of Big Data
to Smart Cities." Journal of Internet Services and Applications, 2015.
23. H. Chen, R. Chiang, and V. Storey. "Business Intelligence and Analytics: From Big Data
to Big Impact." MIS Quarterly, 2012.
24. T. Huang, L. Lan, X. Fang, P. An, J. Min, and F. Wang. “Promises and Challenges of Big
Data Computing in Health Sciences.” Big Data Research, vol. 2, no. 1, pp. 2–11, 2015.
25. W. Raghupathi and V. Raghupathi. “Big Data Analytics in Healthcare: Promise and
Potential.” Health Information Science and Systems, vol. 2, no. 1, p. 3, 2014.
26. S. Kumar and A. Prakash. “Role of Big Data and Analytics in Smart Cities.” International
Journal of Science and Research, vol. 5, no. 2, pp. 12–23, 2016.
27. L. Garber. "Using In-Memory Analytics to Quickly Crunch Big Data." Computer, vol. 45,
no. 10, pp. 16–18, October 2012.
28. M. Özsu and P. Valduriez. Principles of Distributed Database Systems. 2011.
29. S. Shahrivari. "Beyond Batch Processing: Towards Real-Time and Streaming Big Data."
Computers, vol. 3, no. 4, pp. 117–129, 2014.
30. J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. 2001.
31. "A Survey on Text Mining in Social Networks." The Knowledge Engineering Review, pp. 1–15, 2004.
32. R. Bekkerman, M. Bilenko, and J. Langford. Scaling Up Machine Learning: Parallel and
Distributed Approaches. 2011.
33. A. Gandomi and M. Haider. "Beyond the Hype: Big Data Concepts, Methods, and
Analytics." International Journal of Information Management, vol. 35, no. 2, pp. 137–144,
April 2015.
34. J. Ahrens, B. Hendrickson, G. Long, and S. Miller. "Data-intensive Science in the US
DOE: Case Studies and Future Challenges." Computing in Science, 2011.
35. W. Fan and A. Bifet. “Mining Big Data : Current Status, and Forecast to the Future.”
ACM SIGKDD Explorations Newsletter, vol. 14, no. 2, pp. 1–5, 2013.
36. A. Shirkhorshidi, S. Aghabozorgi, T. Wah, and T. Herawan. “Big Data Clustering: A
Review.” In International Conference on Computational Science and Its Applications, LNCS,
vol. 8583. Cham: Springer, 2014.
37. W. Kim. “Parallel Clustering Algorithms: Survey.” Parallel Algorithms, Spring 2009.


38. H. Xu, Z. Li, S. Guo, and K. Chen. “Cloudvista: Interactive and Economical Visual
Cluster Analysis for Big Data in the Cloud.” Journal Proceedings of the VLDB Endowment,
vol. 5, no. 12, pp. 1886–1889, 2012.
39. C. Tekin and M. van der Schaar. “Distributed Online Big Data Classification using
Context Information.” In 2013 51st Annual Allerton Conference on Communication,
Control, and Computing, Allerton, IL, 2013.
40. P. Rebentrost, M. Mohseni, and S. Lloyd. “Quantum Support Vector Machine for Big
Data Classification.” Physical Review Letters, 2014.
41. J. Ayres, J. Flannick, J. Gehrke, and T. Yiu. “Sequential Pattern Mining Using a Bitmap
Representation.” In Proceedings of the Eighth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, Edmonton, Canada, 2002.
42. M. Lin, P. Lee, and S. Hsueh. “Apriori-based Frequent Itemset Mining Algorithms on
MapReduce.” In Proceedings of the 6th International Conference on Ubiquitous Information
Management and Communication, Kuala Lumpur, Malaysia, 2012.
43. C. Leung and R. MacKinnon. “Reducing the Search Space for Big Data Mining for
Interesting Patterns from Uncertain Data.” big data (BigData), 2014.
44. T. Kraska, R. Griffith, and M. J. Franklin. "MLbase: A Distributed Machine-Learning
System." 2013.
45. M. Mehta, R. Agrawal, and J. Rissanen. "SLIQ: A Fast Scalable Classifier for Data
Mining." In International Conference on Extending Database Technology, 1996.
46. D. Keim, C. Panse, and M. Sips. "Visual Data Mining in Large Geospatial Point Sets."
IEEE Computer Graphics, 2004.
47. "Data Visualization | Tableau Software." www.tableau.com/stories/topic/data-
visualization (Accessed January 25, 2017).
48. A. Di Ciaccio, M. Coli, and J. Ibanez. Advanced Statistical Methods for the Analysis of
Large Data-sets. 2012.
49. X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A.
Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg. “Top 10
Algorithms in Data Mining.” Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37,
January 2008.
50. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. “Dryad.” In Proceedings of the
2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007—EuroSys’07,
p. 59, 2007.
51. L. Neumeyer, B. Robbins, and A. Nair. “S4: Distributed Stream Computing Platform.”
Data Mining Workshops, 2010.
52. S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis.
“Dremel: Interactive Analysis of Web-Scale Datasets.” In 36th International Conference on
Very Large Data Bases, pp. 330–339, 2010.
53. S. Ghemawat, H. Gobioff, and S. Leung. "The Google File System." 2003.
54. M. Slee, A. Agarwal, and M. Kwiatkowski. "Thrift: Scalable Cross-Language Services
Implementation."
55. K. Shvachko, H. Kuang, S. Radia, and R. Chansler. “The Hadoop Distributed File
System.” In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST
2010), Incline Village, NV, pp. 1–10, 2010.
56. V. Vavilapalli, A. Murthy, and C. Douglas. “Apache Hadoop Yarn: Yet Another Resource
Negotiator.” In Proceedings of the 4th annual Symposium on Cloud Computing, SoCC’13,
Santa Clara, CA, 2013.


57. B. Hindman, A. Konwinski, M. Zaharia, and A. Ghodsi. “Mesos: A Platform for Fine-
Grained Resource Sharing in the Data Center.” In NSDI’11 Proceedings of the 8th USENIX
Conference on Networked Systems Design and Implementation, Boston, MA, 2011.
58. S. Gilbert and N. Lynch. “Perspectives on the CAP Theorem.” Computer, 2012.
59. V. K. Vavilapalli, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, E.
Baldeschwieler, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J.
Lowe, and H. Shah. "Apache Hadoop YARN." In Proceedings of the 4th Annual Symposium
on Cloud Computing—SOCC'13, pp. 1–16, 2013.
60. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. "Spark: Cluster
Computing with Working Sets." In HotCloud'10 Proceedings of the 2nd USENIX
Conference on Hot Topics in Cloud Computing, p. 10, 2010.
61. M. Zaharia, M. Chowdhury, T. Das, and A. Dave. "Resilient Distributed Datasets: A
Fault-tolerant Abstraction for In-memory Cluster Computing." In NSDI'12 Proceedings of
the 9th USENIX Conference on Networked Systems Design and Implementation, p. 2, 2012.
62. M. Zaharia. "An Architecture for Fast and General Data Processing on Large Clusters."
Berkeley Technical Report, p. 128, 2014.
63. T. da Silva Morais. “Survey on Frameworks for Distributed Computing: Hadoop, Spark
and Storm." In Proceedings of the 10th Doctoral Symposium in Informatics Engineering—
DSIE'15, pp. 95–105, 2015.
64. G. Malewicz, M. Austern, A. Bik, and J. Dehnert. "Pregel: A System for Large-scale Graph
Processing.” In Conference: SPAA 2009: Proceedings of the 21st Annual ACM Symposium
on Parallelism in Algorithms and Architectures, Calgary, Alberta, Canada, August 11–13,
2009.
65. O. Batarfi, R. El Shawi, A. G. Fayoumi, R. Nouri, S.-M.-R. Beheshti, A. Barnawi, and S.
Sakr. “Large Scale Graph Processing Systems: Survey and an Experimental Evaluation.”
Cluster Computing, vol. 18, no. 3, pp. 1189–1213, September 2015.
66. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. “Pig latin.” In Proceedings
of the 2008 ACM SIGMOD International Conference on Management of Data—
SIGMOD'08, p. 1099, 2008.
67. A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff,
and R. Murthy. “Hive.” Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1626–
1629, August 2009.
68. X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai,
M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A.
Talwalkar. “MLlib: Machine Learning in Apache Spark.” Journal of Machine Learning
Research, vol. 17, pp. 1–7, 2015.
