Intelligent Data Analysis
Edited by
Deepak Gupta
Maharaja Agrasen Institute of Technology
Delhi, India
Siddhartha Bhattacharyya
CHRIST (Deemed to be University)
Bengaluru, India
Ashish Khanna
Maharaja Agrasen Institute of Technology
Delhi, India
Kalpna Sagar
KIET Group of Institutions
Uttar Pradesh, India
This edition first published 2020
© 2020 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by
law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/
permissions.
The right of Deepak Gupta, Siddhartha Bhattacharyya, Ashish Khanna, and Kalpna Sagar to be identified as the
authors of the editorial material in this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at
www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that
appears in standard print versions of this book may not be available in other formats.
While the publisher and authors have used their best efforts in preparing this work, they make no representations
or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim
all warranties, including without limitation any implied warranties of merchantability or fitness for a particular
purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional
statements for this work. The fact that an organization, website, or product is referred to in this work as a citation
and/or potential source of further information does not mean that the publisher and authors endorse the
information or services the organization, website, or product may provide or recommendations it may make. This
work is sold with the understanding that the publisher is not engaged in rendering professional services. The
advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist
where appropriate. Further, readers should be aware that websites listed in this work may have changed or
disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be
liable for any loss of profit or any other commercial damages, including but not limited to special, incidental,
consequential, or other damages.
10 9 8 7 6 5 4 3 2 1
Deepak Gupta would like to dedicate this book to his father, Sh. R.K. Gupta, his mother,
Smt. Geeta Gupta, his mentors for their constant encouragement, and his family members,
including his wife, brothers, sisters, kids and the students.
Siddhartha Bhattacharyya would like to dedicate this book to his parents, the late Ajit Kumar
Bhattacharyya and the late Hashi Bhattacharyya, his beloved wife, Rashni, and his research
scholars, Sourav, Sandip, Hrishikesh, Pankaj, Debanjan, Alokananda, Koyel, and Tulika.
Ashish Khanna would like to dedicate this book to his parents, the late R.C. Khanna and
Smt. Surekha Khanna, for their constant encouragement and support, and to his wife,
Sheenu, and children, Master Bhavya and Master Sanyukt.
Kalpna Sagar would like to dedicate this book to her father, Mr. Lekh Ram Sagar, and her
mother, Smt. Gomti Sagar, the strongest persons of her life.
Contents
2.2.5.2 ZIP 22
2.2.5.3 Plain Text (txt) 23
2.2.5.4 JSON 23
2.2.5.5 XML 23
2.2.5.6 Image Files 24
2.2.5.7 HTML 24
2.3 Overview of Big Data 25
2.3.1 Sources of Big Data 27
2.3.1.1 Media 27
2.3.1.2 The Web 27
2.3.1.3 Cloud 27
2.3.1.4 Internet of Things 27
2.3.1.5 Databases 27
2.3.1.6 Archives 28
2.3.2 Big Data Analytics 28
2.3.2.1 Descriptive Analytics 28
2.3.2.2 Predictive Analytics 28
2.3.2.3 Prescriptive Analytics 29
2.4 Data Analytics Phases 29
2.5 Data Analytical Tools 30
2.5.1 Microsoft Excel 30
2.5.2 Apache Spark 33
2.5.3 Open Refine 34
2.5.4 R Programming 35
2.5.4.1 Advantages of R 36
2.5.4.2 Disadvantages of R 36
2.5.5 Tableau 36
2.5.5.1 How TableauWorks 36
2.5.5.2 Tableau Feature 37
2.5.5.3 Advantages 37
2.5.5.4 Disadvantages 37
2.5.6 Hadoop 37
2.5.6.1 Basic Components of Hadoop 38
2.5.6.2 Benefits 38
2.6 Database Management System for Big Data Analytics 38
2.6.1 Hadoop Distributed File System 38
2.6.2 NoSql 38
2.6.2.1 Categories of NoSql 39
2.7 Challenges in Big Data Analytics 39
2.7.1 Storage of Data 40
2.7.2 Synchronization of Data 40
2.7.3 Security of Data 40
2.7.4 Fewer Professionals 40
2.8 Conclusion 40
References 41
Index 387
List of Contributors
Aniruddha Sadhukhan
RCC Institute of Information Technology
West Bengal, India

Anisha Roy
RCC Institute of Information Technology
West Bengal, India

Arvinder Kaur
Guru Gobind Singh Indraprastha University
India

Ayush Ahuja
Jaypee Institute of Information Technology
Noida, India

Biswajit Modak
Nabadwip State General Hospital
Nabadwip, India

Dakun Lai
University of Electronic Science and Technology of China
Chengdu, China

Deepak Kumar Sharma
Netaji Subhas University of Technology
New Delhi, India

Dhanushka Abeyratne
Yellowfin (HQ), The University of Melbourne
Australia

Faijan Akhtar
Jamia Hamdard
New Delhi, India
Series Preface
The Intelligent Signal and Data Processing (ISDP) book series is aimed at fostering
the field of signal and data processing, which encompasses the theory and practice of
algorithms and hardware that convert signals produced by artificial or natural means into
a form useful for a specific purpose. The signals might be speech, audio, images, video,
sensor data, telemetry, electrocardiograms, or seismic data, among others. The possible
application areas include transmission, display, storage, interpretation, classification,
segmentation, or diagnosis. The primary objective of the ISDP book series is to evolve
future-generation scalable intelligent systems for faithful analysis of signals and data.
ISDP is mainly intended to enrich the scholarly discourse on intelligent signal and image
processing in different incarnations. ISDP will benefit a wide range of learners, including
students, researchers, and practitioners. The student community can use the volumes in
the series as reference texts to advance their knowledge base. In addition, the monographs
will also come in handy to the aspiring researcher because of the valuable contributions
they make to this field. Moreover, both faculty members and data practitioners are
likely to grasp the depth of the relevant knowledge base from these volumes.
The series coverage will contain, not exclusively, the following:
h) Pattern recognition
i) Remote sensing imagery
j) Underwater image analysis
k) Gesture analysis
l) Human mind analysis
m) Multidimensional image analysis
3. Speech processing
a) Modeling
b) Compression
c) Speech recognition and analysis
4. Video processing
a) Video compression
b) Analysis and processing
c) 3D video compression
d) Target tracking
e) Video surveillance
f) Automated and distributed crowd analytics
g) Stereo-to-auto stereoscopic 3D video conversion
h) Virtual and augmented reality
5. Data analysis
a) Intelligent data acquisition
b) Data mining
c) Exploratory data analysis
d) Modeling and algorithms
e) Big data analytics
f) Business intelligence
g) Smart cities and smart buildings
h) Multiway data analysis
i) Predictive analytics
j) Intelligent systems
Preface
Intelligent data analysis (IDA), knowledge discovery, and decision support have recently
become more challenging research fields and have gained much attention among a large
number of researchers and practitioners. In our view, the awareness of these challenging
research fields and emerging technologies among the research community will increase
the applications in biomedical science. This book aims to present the various approaches,
techniques, and methods that are available for IDA, and to present case studies of their
application.
This volume comprises 18 chapters focusing on the latest advances in IDA tools and techniques.
Machine learning models are broadly categorized into two types: white box and black box.
Due to the difficulty in interpreting their inner workings, some machine learning models
are considered black box models. Chapter 1 focuses on the different machine learning mod-
els, along with their advantages and limitations as far as the analysis of data is concerned.
With the advancement of technology, the amount of data generated is very large. The
data generated has useful information that needs to be gathered by data analytics tools in
order to make better decisions. In Chapter 2, the definition of data and its classification
based on different factors are given. The reader will learn what data is and how it is broken
up into categories. After a description of what data is, the chapter focuses on defining and
explaining big data and the various challenges faced when dealing with big data.
The authors also describe various types of analytics that can be performed on large data
and six data analytics tools (Microsoft Excel, Apache Spark, OpenRefine, R, Hadoop, and
Tableau).
In recent years, the widespread use of computers and the internet has led to the generation
of data on an unprecedented scale. To make effective use of this data, it must be collected
and analyzed so that inferences can be made to improve various products and services.
Statistics deals with the collection, organization, and analysis of data. The organization and
description of data is studied under descriptive statistics in Chapter 3, while the analysis of
data and how to make predictions based on it is dealt with under inferential statistics.
After having an idea about various aspects of IDA in the previous chapters, Chapter 4
deals with an overview of data mining. It also discusses the process of knowledge discovery
in data along with a detailed analysis of various mining methods including classification,
clustering, and decision trees. In addition to that, the chapter concludes with a view of data
visualization and probability concepts for IDA.
In Chapter 5, the authors demonstrate one of the most crucial and challenging areas in
computer vision and the IDA field, based on manipulating their convergence. This subject is
divided into a deep learning paradigm for object segmentation in computer vision and a
visualization paradigm for efficient incremental interpretation when manipulating datasets
for supervised and unsupervised learning, and online or offline training in reinforcement
learning. This topic has recently had a large impact in robotics and autonomous systems,
food detection, recommendation systems, and medical applications.
Dental caries is a painful bacterial disease of the teeth caused mainly by Streptococcus
mutans, acid, and carbohydrates, and it destroys the enamel or dentine layer of
the tooth. As per the World Health Organization report, worldwide, 60–90% of school
children and almost 100% of adults have dental caries. Dental caries and periodontal
disease left without treatment for long periods cause tooth loss. There is not a single method
to detect caries in its earliest stages. Determining the size of carious lesions and detecting caries early
are very challenging tasks for dental practitioners. The methods related to dental caries
detection are the radiograph, QLF or quantitative light-induced fluorescence, ECM,
FOTI, DIFOTI, etc. In a radiograph-based technique, dentists analyze the image data.
In Chapter 6, the authors present a method to detect caries by analyzing the secondary
emission data.
With the growth of data in the education field in recent years, there is a need for intelligent
data analytics, in order that academic data should be used effectively to improve learning.
Educational data mining and learning analytics are the fields of IDA that play important
roles in intelligent analysis of educational data. One of the real challenges faced by students
and institutions alike is the quality of education. An equally important factor related to
the quality of education is the performance of students in the higher education system.
The decisions that students make while selecting their area of specialization are of grave
concern here. In the absence of support systems, the students and the teachers/mentors
fall short when making the right decisions for the furthering of their chosen career paths.
Therefore, in Chapter 7, the authors attempt to address the issue by proposing a system that
can guide the student to choose and to focus on the right course(s) based on their personal
preferences. For this purpose, a system has been envisaged by blending data mining and
classification with big data. A methodology using MapReduce Framework and association
rule mining is proposed in order to derive the right blend of courses for students to pursue
to enhance their career prospects.
Atmospheric air pollution is creating significant health problems that affect millions of
people around the world. Chapter 8 analyzes the hypothesis about whether or not global
green space variation is changing the global air quality. The authors perform a big data
analysis with a data set that contains more than 1M (1 048 000) green space data and air
quality data points by considering 190 countries during the years 1990 to 2015. Air quality
is measured by considering the particulate matter (PM) value. The analysis is carried out using
multivariate graphs and a k-means clustering algorithm. The relative geographical changes
of the tree areas, as well as the level of the air quality, were identified and the results indi-
cated encouraging news.
of the system. The presence of a large number of bugs of this kind can put systems into
vulnerable positions and reduce their risk-aversion capability. Thus, it is essential to
resolve these issues on high priority. The test lead can assign these issues to the most con-
tributing developers of a project for quick closure of opened critical bugs. The comments
are mined, which help us identify the developers resolving the majority of bugs, which is
beneficial for test leads of distinct projects. From the collated data, the areas more prone
to system failures, such as input/output type errors and logical code errors, are determined.
Sentiments are the standard way by which people express their feelings. Sentiments are
broadly classified as positive and negative. A problem occurs when the user expresses
themselves with words that differ from their actual feelings. This phenomenon is generally known
to us as sarcasm, where people say something opposite to their actual sentiments. Sarcasm
detection is of great importance for the correct analysis of sentiments. Chapter 14 attempts
to give an algorithm for successful detection of hyperbolic sarcasm and general sarcasm in
a data set of sarcastic posts that are collected from pages dedicated for sarcasm on social
media sites such as Facebook, Pinterest, and Instagram. This chapter also shows the initial
results of the algorithm and its evaluation.
Predictive analytics refers to forecasting the future probabilities by extracting information
from existing data sets and determining patterns from predicted outcomes. Predictive ana-
lytics also includes what-if scenarios and risk assessment. In Chapter 15, an effort has been
made to use principles of predictive modeling to analyze the authentic social network data
set, and the results have been encouraging. The post-analysis of the results has focused on
exhibiting contact details, mobility patterns, and the number and degree of connections per minute,
leading to identification of the linkage/bonding between the nodes in the social network.
Modern medicine has been confronted by the major challenge of realizing the promise held
by the tremendous expansion of medical data sets of all kinds. Medical databases
accumulate a huge bulk of knowledge and data, which mandates specialized tools to store
and analyze the data and, as a result, effectively use the saved knowledge and data.
In the process of IDA, information is extracted from data by using a domain's background
knowledge. Various matters dealt with regard the use, definition, and impact of these
processes, and they are tested for their optimization in the application domains of medicine.
The primary focus of Chapter 16 is on the methods and tools of IDA, with an aim to
minimize the growing gap between data gathering and data comprehension.
Snoozing, or sleeping, is a physical phenomenon of human life. When sleep
is disturbed, it generates many problems, such as mental disease, heart disease, etc. Total
sleep is characterized by two stages, viz., rapid eye movement and nonrapid eye movement.
Bruxism is a type of sleep disorder. The traditional method of prognosis takes
time and the result is in analog form. Chapter 17 proposes a method for easy prognosis of
sleep bruxism.
Neurodegenerative diseases like Alzheimer’s and Parkinson’s impair the cognitive and
motor abilities of the patient, along with memory loss and confusion. As handwriting
involves proper functioning of the brain and motor control, it is affected. Alteration in
handwriting is one of the first signs of Alzheimer's disease. The handwriting gets shaky
due to loss of muscle control, confusion, and forgetfulness. The symptoms get progressively
worse; the handwriting becomes illegible and phonological spelling mistakes become inevitable. In
Chapter 18, the authors use a feature extraction technique to be used as a parameter for
diagnosis. A variational auto encoder (VAE), a deep unsupervised learning technique, has
been applied, which is used to compress the input data and then reconstruct it keeping the
targeted output the same as the targeted input.
This edited volume on IDA gathers researchers, scientists, and practitioners interested
in computational data analysis methods, aimed at narrowing the gap between extensive
amounts of data stored in medical databases and the interpretation, understanding, and
effective use of the stored data. The expected readers of this book are researchers, scien-
tists, and practitioners interested in IDA, knowledge discovery, and decision support in
databases, particularly those who are interested in using these technologies. This publica-
tion provides useful references for educational institutions, industry, academic researchers,
professionals, developers, and practitioners to apply, evaluate, and reproduce the contribu-
tions to this book.
1.1 Introduction
In the midst of all of the societal challenges of today’s world, digital transformation is rapidly
becoming a necessity. The number of internet users is growing at an unprecedented rate.
New devices, sensors, and technologies are emerging every day. These factors have led
to an exponential increase in the volume of data being generated. According to recent
research [1], users of the internet generate 2.5 quintillion bytes of data per day.
1. Data collection and preparation: This step involves acquiring data and converting it into
a format suitable for further analysis. This may involve storing the data as a table, taking
care of empty or null values, etc. (a brief sketch of steps 1 and 2 follows this list).
2. Exploration: Before a thorough analysis can be performed on the data, certain character-
istics are examined like number of data points, included variables, statistical features, etc.
Data exploration allows analysts to get familiar with the dataset, and create prospective
hypotheses. Visualization is extensively used in this step. Various visualization tech-
niques will be discussed in depth later in this chapter.
3. Analysis: Various machine learning and deep learning algorithms are applied at this step.
Data analysts build models that try to find the best possible fit to the data points. These
models can be classified as white box or black box models.
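A minimal sketch of steps 1 and 2 is given below, on a tiny, made-up table; with a real data source, pd.read_csv("data.csv") would replace the DataFrame construction. The column names are illustrative only.

import pandas as pd
import numpy as np

# Step 1: data collection and preparation - put the data into a table and
# take care of empty or null values.
df = pd.DataFrame({"sensor": ["a", "b", "c", None],
                   "temperature": [21.5, np.nan, 23.1, 22.0]})
df = df.dropna(subset=["sensor"])                       # drop rows missing a key field
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())

# Step 2: exploration - number of data points, included variables, and
# basic statistical features.
print(df.shape)       # (number of data points, number of variables)
print(df.dtypes)      # included variables and their types
print(df.describe())  # statistical features (mean, std, quartiles, ...)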
[Figure: the three phases of data analysis: data collection and preparation, exploration, and analysis.]
1. White box models: The models whose predictions are easily explainable are called white
box models. These models are extremely simple, and hence, not very effective. The accu-
racy of white box models is usually quite low. For example – simple decision trees, linear
regression, logistic regression, etc.
2. Black box models: The models whose predictions are difficult to interpret or explain are
called black box models. They are difficult to interpret because of their complexity. Since
they are complex models, their accuracy is usually high. For example – large decision
trees, random forests, neural networks, etc.
So, IDA and machine learning models suffer from the accuracy-explainability trade-off.
However, with advances in IDA, the explainability gap in black box models is reducing.
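The trade-off can be illustrated with a minimal sketch, assuming scikit-learn is available: a very shallow decision tree (white box) against a random forest (black box) on the bundled breast cancer data set. The exact scores will vary; this is only an illustration, not a benchmark.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

white_box = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("small tree accuracy  :", white_box.score(X_te, y_te))
print("random forest accuracy:", black_box.score(X_te, y_te))

# The whole white box model can be printed and read directly...
print(export_text(white_box))
# ...while the forest's 200 trees cannot be summarized this way.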
1.2 Interpretation of White Box Models
White box models are extremely easy to interpret, since interpretability is inherent in their
nature. Let us discuss a few white box models and how to interpret them.
In linear regression, for the ith data point, $y_i$ denotes the predicted value, $t_i$ denotes the target value, and the residual is

$$e_i = y_i - t_i \quad (1.3)$$

The $R^2$ value measures how well the model explains the data and ranges from 0 to 1; the higher the $R^2$ value, the better the model explains the data. $R^2$ is calculated as

$$R^2 = 1 - \frac{\sum_i e_i^2}{\sum_i (t_i - \bar{t})^2}$$

where $\bar{t}$ is the mean of the target values. But there is a problem with the $R^2$ value: it increases with the number of features, even if they carry no information about the target values. Hence, the adjusted $R^2$ value ($R^2_{adj}$) is used, which takes into account the number of input features:

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1} \quad (1.6)$$

where $n$ is the number of data points and $p$ is the number of input features.
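A minimal sketch of these quantities, computed for a linear regression on synthetic data (the data-generating coefficients are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 3                                   # n data points, p input features
X = rng.normal(size=(n, p))
t = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)  # target values

model = LinearRegression().fit(X, t)
y = model.predict(X)                            # predicted values y_i
e = y - t                                       # residuals e_i = y_i - t_i

r2 = 1.0 - np.sum(e ** 2) / np.sum((t - t.mean()) ** 2)
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")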
1. Pick the best attribute/feature. The best feature is the one that separates the data in the best
possible way. The optimal split would be when all data points belonging to different
classes are in separate subsets after the split.
2. For each value of the attribute, create a new child node of the current node.
3. Divide data into the new child nodes.
4. For each new child node:
a. If all the data points in that node belong to the same class, then stop.
b. Else, go to step 1 and repeat the process with current node as decision node.
[Figure: an example decision tree in which a condition on feature 1 branches (yes/no) into further conditions and into leaf nodes such as CLASS 2 and CLASS 3.]
Figure 1.4 Distribution of points in case of high and low information gain.
Many algorithms like ID3 [20] and CART [21] can be used to find the best feature.
In ID3, information gain is calculated for each attribute, and the attribute with the highest
information gain is chosen as the best attribute (see Figure 1.4).
For calculating information gain, we first calculate the entropy ($H$) for each feature ($F$) over the set of data points in that node ($S$):

$$H_F(S) = -\sum_{v \in \mathrm{values}(F)} p_v \log(p_v) \quad (1.7)$$
Minimum value of the Gini Index is 0 when all data points in S belong to the same class.
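A minimal sketch of the entropy and information gain computations, on a toy set of class labels (using log base 2; a different base only rescales the values):

import numpy as np

def entropy(labels):
    """H(S) = -sum_v p_v * log2(p_v) over the class labels in S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, feature_values):
    """Parent entropy minus the weighted entropy of the children obtained
    by splitting on each distinct value of the feature."""
    total = len(labels)
    children = 0.0
    for v in np.unique(feature_values):
        mask = feature_values == v
        children += mask.sum() / total * entropy(labels[mask])
    return entropy(labels) - children

labels = np.array(["yes", "yes", "no", "no", "yes", "no"])
feature = np.array(["a", "a", "b", "b", "a", "b"])
print(entropy(labels))                     # parent entropy (1.0 here)
print(information_gain(labels, feature))   # gain from splitting on the feature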
Interpreting decision trees is extremely simple. To explain any decision made by the deci-
sion tree, just start from the root node, and move downward according to the input features.
Eventually, a leaf node will be reached, at which the decision tree makes its prediction.
Hence, it is clear that if a decision tree is small, then it may be considered white box. But
for large decision trees, it is not possible to say how much effect each factor has on the final
outcome. So, large decision trees are considered black box models.
There are a lot of other white box IDA algorithms like naive Bayes, decision rules,
k-nearest neighbors, etc. whose predictions are easy to explain.
1.3 Interpretation of Black Box Models
In case of white box models, being interpretable is a property of the model itself. Hence,
we studied a few white box models and learned how to interpret them. However, in case
of black box models, special algorithms are needed for their interpretability. These algo-
rithms are model-agnostic, i.e., they do not depend on a particular model. This has a huge
advantage – flexibility.
We will discuss a few of these algorithms in this section.
$$f_s(x_s) = \frac{1}{n} \sum_{i=1}^{n} f\!\left(x_s, x_c^{(i)}\right) \quad (1.11)$$

where $x_s$ is the set of selected features, $x_c^{(i)}$ denotes the observed values of the remaining (complement) features for the ith data point, and $n$ is the number of data points.
If f s is evaluated at all xs observed in data, then we’ll have n pairs of the type (xs , f s ), which
can be plotted to see how the machine learning model f varies with the set of features xs .
Figure 1.5 [23] shows the partial dependence of house value on various features – median income (MedInc), average occupants per household (AveOccup), and median house age (HouseAge).
Figure 1.5 Partial dependence plots from a gradient boosting regressor trained on California
housing dataset [23].
Figure 1.6 Partial dependence plot from a gradient boosting regressor trained on California
housing dataset [23].
PDP can also be plotted for two features in xs instead of one, as shown in Figure 1.6 [23]
(partial dependence of house value on house age and average occupancy).
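A minimal sketch of producing plots like Figures 1.5 and 1.6, assuming scikit-learn (0.24 or later) and matplotlib are available; fetching the California housing data requires an internet connection on first use, and this is not the exact code behind [23]:

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target          # X is a DataFrame with named columns
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# One-dimensional PDPs for three features, plus a two-dimensional PDP for
# the (HouseAge, AveOccup) pair, as in Figure 1.6.
PartialDependenceDisplay.from_estimator(
    model, X,
    features=["MedInc", "AveOccup", "HouseAge", ("HouseAge", "AveOccup")],
)
plt.show()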
However, PDP sometimes gives incorrect results too. As an example [24], consider a synthetic
data set in which the response Y depends on a feature X2 through a known data generation process.
1,000 observations were generated from this process, and a stochastic gradient boosting
model was fit to the generated data. Figure 1.7a [24] shows the scatter plot between X 2 and
Y , and Figure 1.7b [24] shows the partial dependence of Y on X 2 . The PDP shows that there
is no meaningful relationship between the two variables, but we can clearly see from the
scatter plot that this interpretation is wrong.
1.3 Interpretation of Black Box Models 9
6
4
4
2
2
partial yhat
0
0
y
–2
–2
–4
–4
–6
–1.0 –0.5 0.0 0.5 1.0 –1.0 –0.5 0.0 0.5 1.0
x_2 x_2
(a) Scatterplot of Y versus X2 (b) PDP
These plots sometimes produce wrong results because of averaging, which is an inherent
property of PDPs. Moreover, PDPs provide a useful summary only when the dependence of
selected features on the remaining features is not too strong.
To overcome some of the issues with PDP, individual conditional expectation (ICE) plots
are used.
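A minimal sketch of ICE curves, reusing the model and X fitted in the previous sketch; kind="both" overlays the average curve (the PDP) on the individual conditional expectation lines, and subsample limits the number of plotted lines:

from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# `model` and `X` are the gradient boosting regressor and DataFrame from
# the partial dependence sketch above.
PartialDependenceDisplay.from_estimator(
    model, X, features=["MedInc"], kind="both", subsample=100, random_state=0
)
plt.show()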
[Figure: individual conditional expectation (ICE) curves (partial yhat) for the example above.]
[Figure: scatter of x2 versus x1 showing (a) the marginal distribution p2(x2) and (b) the conditional distribution p2|1(x2 | x1 = 0.3).]
variables, conditional probability is more or less concentrated around the data as shown in
the figure. So, the formula for calculating M-plot is:
$$f_{s,M}(x_s) = E_{X_c \mid X_s}\!\left[f(X_s, X_c) \mid X_s = x_s\right] = \int_{x_c} f(x_s, x_c)\, P(x_c \mid x_s)\, dx_c \quad (1.14)$$
Integrating the conditional distribution has the advantage that excessive extrapolation is
not required. However, averaging the local predictions leads to mixing the effects of both
features, which is undesirable.
ALE addresses this problem by taking differences, instead of averages, over the conditional probability distribution. ALE is calculated as follows:

$$f_{s,ALE}(x_s) = \int_{z_{0,1}}^{x_s} E_{X_c \mid X_s}\!\left[f^s(X_s, X_c) \mid X_s = z_s\right] dz_s - \text{constant} \quad (1.15)$$

$$f_{s,ALE}(x_s) = \int_{z_{0,1}}^{x_s} \int_{x_c} f^s(z_s, x_c)\, P(x_c \mid z_s)\, dx_c\, dz_s - \text{constant} \quad (1.16)$$

where

$$f^s(x_s, x_c) = \frac{\partial f(x_s, x_c)}{\partial x_s} \quad (1.17)$$
There are quite a few changes as compared to M-plots. First, we average over the changes of predictions, not the predictions themselves ($f^s$ instead of $f$). Second, there is an additional integration over $z$. Here $z_{0,1}$ is some value chosen just below the effective support of the probability distribution over the selected features $x_s$. The choice of $z_{0,1}$ is not important, since it only affects the vertical translation of the ALE plot, and the constant is chosen to vertically center the plot.
ALE calculates prediction differences, and then integrates the differential (see Figure 1.10
[6]). The derivative effectively isolates the effect of the features of interest, and blocks the
effect of correlated features, thereby overcoming the correlation problem of PDPs and
ICE plots.
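A rough, from-scratch sketch of first-order ALE for a single feature is given below, using finite differences over quantile bins in place of the derivative in Eq. (1.17). It is not the reference algorithm from [25]; libraries such as alibi or PyALE provide fuller implementations.

import numpy as np

def ale_1d(predict, X, feature, n_bins=20):
    """Accumulated local effects of X[:, feature] on predict(X)."""
    x = X[:, feature]
    # Bin edges from quantiles, so each bin holds roughly the same number
    # of observations; the lowest edge plays the role of z_{0,1}.
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    effects = np.zeros(n_bins)
    for k in range(n_bins):
        in_bin = (x > edges[k]) & (x <= edges[k + 1]) if k else (x <= edges[1])
        if not in_bin.any():
            continue
        lower, upper = X[in_bin].copy(), X[in_bin].copy()
        lower[:, feature] = edges[k]
        upper[:, feature] = edges[k + 1]
        # Local effect: prediction difference across the bin, averaged only
        # over observations that actually fall in the bin (conditional, not
        # marginal, so correlated features are not extrapolated).
        effects[k] = np.mean(predict(upper) - predict(lower))
    ale = np.cumsum(effects)             # accumulate the local effects
    return edges[1:], ale - ale.mean()   # center the curve vertically

# Usage sketch: edges, ale = ale_1d(model.predict, X_array, feature=0)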
1.5 Summary
In this chapter, the types of IDA models have been discussed, namely white box models
and black box models. These models are caught in the explainability-accuracy trade-off.
White box models have low accuracy, but are able to produce high-quality explanations of
the decisions made by the model. Black box models, on the other hand, are more accurate
models, but suffer from low explainability. To highlight the differences between the two
models, various interpretability techniques have been reviewed.
White box models (like linear regression, decision trees, naive Bayes, etc.) are inherently
interpretable. A few of these models have been discussed, along with ways to interpret their
predictions.
[Figure: a spurious correlation from [29]: the age of Miss America (1999-2009) correlates with murders by steam, hot vapours, and hot objects (r = 0.87).]
However, to explain the decisions of black box models, special algorithms have been
developed. A few of these “model interpretability algorithms” have been discussed in
this chapter, namely, PDPs, ICE, ALE, global and local surrogate models, and feature
importance.
Although researchers have made tremendous progress, a lot of challenges still remain
untackled. A few shortcomings of machine learning models have also been presented.
References
16 Seber, G.A. and Lee, A.J. (2012). Linear Regression Analysis, vol. 329. Hoboken, NJ:
Wiley.
17 Montgomery, D.C., Peck, E.A., and Vining, G.G. (2012). Introduction to Linear Regres-
sion Analysis, vol. 821. Hoboken, NJ: Wiley.
18 Cameron, A.C. and Windmeijer, F.A. (1997). An R-squared measure of goodness of fit
for some common nonlinear regression models. Journal of Econometrics 77 (2): 329–342.
19 Safavian, S.R. and Landgrebe, D. (1991). A survey of decision tree classifier methodol-
ogy. IEEE Transactions on Systems, Man, and Cybernetics 21 (3): 660–674.
20 Quinlan, J.R. (1986). Induction of decision trees. Machine Learning 1 (1): 81–106.
21 Breiman, L. (2017). Classification and Regression Trees. New York: Routledge.
22 Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine.
Annals of Statistics 29 (5): 1189–1232.
23 Partial Dependence Plots. http://scikit-learn.org/stable/auto_examples/ensemble/plot_
partial_dependence.html (accessed 02 November 2018).
24 Goldstein, A., Kapelner, A., Bleich, J., and Pitkin, E. (2015). Peeking inside the black
box: visualizing statistical learning with plots of individual conditional expectation.
Journal of Computational and Graphical Statistics 24 (1): 44–65.
25 Apley, D. W. (2016). Visualizing the Effects of Predictor Variables in Black Box Super-
vised Learning Models. arXiv:1612.08468.
26 Cook, R.D. and Weisberg, S. (1997). Graphics for assessing the adequacy of regression
models. Journal of the American Statistical Association 92 (438): 490–499.
27 Ribeiro, M.T., Singh, S., and Guestrin, C. (2016). Why should i trust you?: explaining
the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 1135–1144. ACM.
28 Breiman, L. (2001). Random forests. Machine Learning 45 (1): 5–32.
29 Spurious Correlations. http://tylervigen.com/spurious-correlations (accessed
02 November 2018).
2.1 Introduction
[Figure 2.2: digital data classified into structured, semi-structured, and unstructured data.]
and variables, data can be divided into a lot of categories, but for the sake of this chapter,
we would be focusing on the digital data and we will loosely use data and digital data inter-
changeably. Digital data, as shown in Figure 2.2, can be broadly classified into structured,
semi-structured, and unstructured data. We will explore each of these categories one by one
in the upcoming sections and understand their definitions, sources, and the ease or unease
of working with and processing them.
The rest of the chapter is organized as follows: Section 2.2 contains various data types
and different file formats, Section 2.3 contains an overview of big data, Section 2.4 contains
various phases of data analytics, Section 2.5 contains various data analytics tools, Section 2.6
contains database management tools for big data, Section 2.7 contains challenges in big data
analytics, and finally, Section 2.8 contains the conclusion.
There are different data types and different file formats, which are described as follows:
store, access, analyze, and modify data. Let us discuss this in the context of a relational
database management system (RDBMS) due to their vast use, availability, robustness, and
efficiency attributes. Most of the structured data is generally stored in RDBMS, which is
also very cost-effective. An RDBMS conforms to the relational data model wherein the data
is stored in rows and columns and where storage, retrieval, and management of data has
been tremendously simplified.
The first step in an RDBMS is the design of a relation (a relation is generally implemented
using a table): the fields and columns to store the data and the type of data that we would
like to store in these fields (ranging from numbers and characters to dates and
Booleans). The number of records/tuples in a relation is called the cardinality of the relation,
and the number of columns is called the degree of the relation. Often there is a need to
implement constraints to which our data conforms. These constraints include UNIQUE,
which makes sure no two tuples have the same value in a column; NOT NULL, which makes sure there is
no null or empty value in a column; and PRIMARY KEY, a combination of the UNIQUE and NOT
NULL constraints, of which there exists only one in a relation. To better understand,
let us design our own relation to store the details of an employee who’s working in a stock
brokering enterprise. Table 2.1 portrays a good example of a well-structured relation (in the
form of a table), complete with all the required fields and constraints, which adheres to the
relational data model.
The relation shown above is an ideal data model, and the values inserted into such a data model
would be a prime specimen of structured data. The data would follow an organized structure
and adhere to a data model.
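A minimal sketch of creating such a relation with the constraints described above, through Python's built-in sqlite3 module; the table and column names are illustrative and not taken verbatim from Table 2.1.

import sqlite3

conn = sqlite3.connect(":memory:")       # throwaway in-memory database
conn.execute("""
    CREATE TABLE employee (
        emp_id      INTEGER PRIMARY KEY,    -- UNIQUE and NOT NULL combined
        name        TEXT    NOT NULL,
        email       TEXT    UNIQUE,
        date_joined TEXT,
        is_active   INTEGER                 -- Boolean stored as 0/1
    )
""")

# DML operations (insert, update, select) on the structured relation.
conn.execute("INSERT INTO employee VALUES (1, 'A. Trader', 'a@corp.com', '2020-01-15', 1)")
conn.execute("UPDATE employee SET is_active = 0 WHERE emp_id = 1")
print(conn.execute("SELECT * FROM employee").fetchall())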
The various reasons [3] why working with structured data is so much easier and more
efficient are as follows:
1. Data manipulation: It is very easy to manipulate data when the data is structured data.
Operations like inserting, updating, and deleting, which come under data manipulation
language (DML), provide the required ease in operations pertaining to storing, accessing,
processing, etc.
2. Scalability: Structured data possesses great potential to be scalable due to the ease of storage
and processing attributes.
3. Security: Structured data can be better secured, as applying encryption and decryption
techniques to a structured schema is easier for the organization, and through
this it can control access to information and data.
4. Transaction processing: Usually a structured data set has better transaction processing
due to the adherence to a defined structure. Taking our previous example of RDBMS,
we have support for atomicity, consistency, isolation, and durability (ACID) properties,
which makes the transitioning of data and its processing easier and uniform.
5. Retrieval and recovery: Structured data is easier to retrieve from a database and, due to
its conformity to a data model, it is also easier to create backups and provide back-up
security to structured data.
1. Data mining: Data mining is the action of sorting through very large data sets to recognize
and categorize patterns and establish relationships in order to solve and overcome various
difficulties through data analysis [4]. Data mining tools allow us to make use of even unstructured
data and utilize it so as to provide information, which would have been impossible
to extract directly from unstructured data. The previously said patterns and relationship
between variables are identified using cutting-edge methods, which have their roots in
artificial intelligence, machine learning, deep learning, statistics, and database systems.
Some popular data mining algorithms include association rule mining, regression analysis,
collaborative filtering, etc.
2. Text mining: Text data is largely unstructured, vague, and difficult to deal with
algorithmically. Text mining is the art of extracting high-quality and meaningful information
from a given text by methods that analyze the text thoroughly by means of statistical
pattern learning [5]. It includes sentiment analysis, text clustering, etc.
3. Natural language processing: Commonly referred to simply as NLP, natural language processing
is concerned with bridging the gap in human–computer communication. It
is the application of various computational techniques to the analysis and synthesis
of natural language and speech [6]. In simpler terms, NLP is largely concerned with how
to analyze large amounts of natural language data and make it better understandable by
a computer, and it is a component of artificial intelligence.
4. Part of speech tagging (POST): The process of reading and analyzing some text and then
tagging each word in a sentence as belonging to a particular part of speech, such as
“adjective,” “pronoun,” etc. [7] (a minimal tagging sketch follows this list).
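A minimal sketch of part of speech tagging with the NLTK library; the resource downloads are needed only once, and their names can differ across NLTK versions (newer releases use "punkt_tab" and "averaged_perceptron_tagger_eng"). The tags follow the Penn Treebank conventions, e.g. "JJ" for adjective and "PRP" for pronoun.

import nltk

nltk.download("punkt")                        # tokenizer model
nltk.download("averaged_perceptron_tagger")   # POS tagging model

text = "Unstructured text is difficult to analyze directly."
tokens = nltk.word_tokenize(text)
print(nltk.pos_tag(tokens))
# e.g. [('Unstructured', 'JJ'), ('text', 'NN'), ('is', 'VBZ'), ...]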
There are many other tools and methods available too, however, the main gist remains
the same for them all and that is to use various mathematical and statistical tools to ana-
lyze and find patterns present in unstructured data in order to make it usable for data
analytics.
to the raw data end but is also prominent for the end product. To better understand, let us
take the scenario of data related to weather forecasts. Consider the situation where we had
a set of natural phenomenon happening in a particular region and after resource-extensive
methods, we had successfully processed and analyzed this data. Our successfully processed
data yielded the insight that these phenomena were precursors to a cyclone. Afterward, appro-
priate steps were taken and that particular region was evacuated and many lives were saved.
However, after deriving our insights, all the calculations, data, and insights were not stored
and simply lost over time or got overwritten in the cache. After a few days, a similar set of
phenomena appeared in another region. As we can see, if we had our previous calculations
and the derived insights, we could have benefited greatly on the resource part and also had
gained a time advantage as the insights were already ready. What if this, in turn, had facil-
itated us to better implement precautions and broadcast warnings to the particular region?
Thus, let us look into some commonly used file formats. Also, from a real-time perspective,
the industry today is utilizing various file formats for different purposes and one hardly
ever gets neat tabular data. As a data scientist or data analyst, it is essential to be aware
and up to date with each of these file formats and be armed with information on how to
use them.
2.2.5.2 ZIP
ZIP format is an archive file format. In simpler terms, a file is said to be an archive file
if it contains multiple files along with some metadata. Archive file formats are used for
compressing files so that they occupy less space. ZIP file format is a lossless compression
format, which means that the original file could be fully recovered after decompressing an
already compressed ZIP file. Along with ZIP, there are many more commonly used archive
file formats that include RAR, Tar, etc.
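A minimal sketch of lossless ZIP compression and recovery, using Python's standard zipfile module; the file names are placeholders created on the fly for illustration.

import zipfile

open("report.csv", "w").write("id,value\n1,10\n")   # a small placeholder file

# Compress: the archive stores the file plus metadata (name, size, CRC, ...).
with zipfile.ZipFile("archive.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("report.csv")
    print(zf.infolist())            # metadata for each member of the archive

# Decompress: the original file is recovered byte for byte (lossless).
with zipfile.ZipFile("archive.zip") as zf:
    zf.extractall("restored/")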
2.2.5.4 JSON
JSON file format [9] is one of the front-runners in the data file format used for
semi-structured data. JSON, which is an acronym for JavaScript object notation is a
text-based open standard that has been designed with the idea of exchanging data over
the web in mind. This format is commonly used for transmitting semi-structured and
structured data over the web and is language-independent. Figure 2.5 shows a JSON
file having the details of two meteorologists.
{
  "Meteorologists": [
    {
      "id": "1",
      "Name": "Shikhar",
      "Salary": "230000"
    },
    {
      "id": "2",
      "Name": "Manu",
      "Salary": "230000"
    }
  ]
}
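A minimal sketch of parsing the kind of document shown in Figure 2.5 with Python's built-in json module; here the text is embedded as a string, but json.load on an open file works the same way.

import json

text = ('{"Meteorologists": [{"id": "1", "Name": "Shikhar", "Salary": "230000"},'
        ' {"id": "2", "Name": "Manu", "Salary": "230000"}]}')

doc = json.loads(text)                 # text -> nested dicts and lists
for person in doc["Meteorologists"]:
    print(person["id"], person["Name"], person["Salary"])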
2.2.5.5 XML
Extensible markup language [10] also serves as a major file format in the context of
semi-structured data and is a part of the markup languages. This file format has certain
rules that it must follow while encoding the data. This type of file format is readable by both
humans and machines. As discussed earlier, it plays a vital role in sending information
over the internet and is a self-descriptive language, which is a bit similar to hypertext
markup language (HTML). Figure 2.6 shows a sample XML document.
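A minimal sketch of an XML document equivalent to the Figure 2.5 data, parsed with Python's xml.etree.ElementTree; the element names are illustrative and not taken from Figure 2.6, which is not reproduced here.

import xml.etree.ElementTree as ET

xml_text = """
<Meteorologists>
  <Meteorologist id="1"><Name>Shikhar</Name><Salary>230000</Salary></Meteorologist>
  <Meteorologist id="2"><Name>Manu</Name><Salary>230000</Salary></Meteorologist>
</Meteorologists>
"""

root = ET.fromstring(xml_text)            # parse the markup into a tree
for m in root.findall("Meteorologist"):
    print(m.get("id"), m.findtext("Name"), m.findtext("Salary"))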
2.2.5.7 HTML
HTML is another member of the family of markup languages to which XML belongs. It is the standard
and the most widely used markup language for webpage creation. HTML uses predefined
tags and hence can easily be identified. Some example tags that are used in HTML are the
<p> tag, which represents the tag used for a paragraph, and the <br> tag, which is used
for a line break. Many tags in HTML need to be closed, while some do not.
Figure 2.7 shows a sample code for an HTML file.
We have discussed many of the most commonly used file formats (and data types), how-
ever, there exist even more such file formats, which one might see, including but not limited
to PDF, DOCX, MP3, MP4, etc. These are fairly easy to understand, since we use them in our
day-to-day life. As
we can see, each file format has a different style of encoding the data and thus in a way,
each file type represents a different type of data. Based on the method of storing data, each
file format has its own advantages and disadvantages. Every organization understands and
analyzes which file format will give them the best results and then takes that specific file
format into use. From a data analytics point of view, each file format and data type must be
viable for processing and analysis.
Units Value
Byte 8 Bits
1 KB 1024 bytes
1 MB 1024 KB
1 GB 1024 MB
1 TB 1024 GB
1 PB 1024 TB
1 EB 1024 PB
2. Velocity: Velocity here refers to the ability to store the vast amounts of data and pre-
pare it for processing and then finally analyzing and deriving information and enhanced
insights, which would facilitate better decision making. We can measure the velocity by
the amount of time it takes to do the abovementioned tasks. Earlier, batch processing
was the way to go but with advancements in technology, we’ve upgraded to periodic,
and then came near real-time, and currently, we are in the days of real-time processing.
3. Variety: As discussed earlier, the variety aspect of big data deals with how wide the
range of the data types included in the big data sets are and what are their sources. The
sources can range anywhere from traditional transaction processing systems and relational
database management systems (RDBMS) (which produce structured data) to HTML
pages (which produce semi-structured data) and unstructured emails or text documents
(which produce unstructured data).
4. Validity: Although not a decisive characteristic in determining big data, validity serves as
a must-have characteristic for the big data to be of any use. Validity relates to the accu-
racy and correctness of the data in question. There are many ways in which the validity
of the data can be affected. For example, due to the introduction of biases, abnormality,
and noise into the data, the data can be rendered corrupted or invalid.
5. Eligibility lifespan: While analyzing big data, it is very important to know whether the
data is actually eligible or not. Otherwise, the insights obtained due to such vast mag-
nitudes of ineligible data can be disastrous in terms of decision making based on the
insights. A data’s eligible lifespan pertains to the time aspect of how long any data will
be valid. Sometimes, data can be required for long-term decision making but if the eligi-
bility lifespan of the data is a lot less, we can end up making the wrong decision.
6. Variance: When dealing with such a high volume of data, the collection of data must
be kept under check for strict variance standards as its effect can snowball into wrong
insights and decision making.
7. Visualization: After data has been processed, it should be represented in a readable format.
There may be parameters in big data that cannot be represented by the usually available
formats; thus, visualization is also a big problem for big data. Nanocubes, developed by
AT&T Laboratories, are nowadays used for such visualization.
2.3.1.1 Media
Media is the most popular source of big data, which is available with much ease and pro-
vides the fastest way for organizations to get an in-depth overview of a targeted group of
people according to their preference. Media data not only provides valuable data on con-
sumer preferences but also encapsulates in itself the essence of the changing trends. Media
includes social platforms like Instagram, Facebook, YouTube, as well as generic media like
images, PDF, DOCX, etc., which provide comprehensive insights into every aspect of the
target population.
2.3.1.3 Cloud
With the boom in cloud computing, companies have started integrating and shifting their
data onto the cloud. Cloud storage serves as a storehouse of structured and unstructured
data and can provide access to them in real time. The main aspect of taking the cloud as a
source is the flexibility, availability, and scalability it offers. Not only that, but the cloud also
provides an efficient and economical big data source.
2.3.1.5 Databases
Databases are one of the oldest sources of big data. However, with time, organizations have started uti-
lizing a hybrid approach. Databases are used along with another big data source so as to
utilize the advantages of both the sources. This strategy lays down the groundwork for a
hybrid data model and requires low investment and IT infrastructural costs. Most popular
databases include SQL, Oracle, DB2, Amazon Simple, etc.
2.3.1.6 Archives
Archives of scanned documents, paper archives, customer correspondence records,
patients’ health records, students’ admission records, and so on, can also serve as a viable
but crude source of big data.
[Figure: types of big data analytics: descriptive (e.g., numeric visualization), predictive (e.g., forecasting, predictive modelling, scoring, regression), and prescriptive (e.g., optimization, simulation).]
Organizations are able to predict their sales trends [17]. Predictive analytics is also used in
academics [18]; Purdue University applied this technique to predict a student's success
in a course and to assign risk labels (green means the success probability is high, yellow
means some problems, and red specifies a high risk of failure).
There are a number of data analytics tools available in the market. In this chapter, the authors
will discuss Microsoft Excel, Apache Spark, Open Refine, R programming, Tableau, and
Hadoop. The comparison among these tools is given in Table 2.3 and their descriptions are
as follows:
1. Sort: MS Excel has the ability to sort our collected data on one column or multiple
columns and offers a varying flexibility on the type of sort applied. The most common
sorts include ascending sort (in which the data is sorted from small to large manner)
and the descending sort (in which the data is sorted from large to small manner). Sort-
ing data is often one of the basic steps in data analysis as it often increases the efficiency
of other analytical algorithms.
2. Filter: In many cases, often we have a very large data source in which we have to select
and analyze only that data, which is relevant or which satisfies certain conditions and
constraints. MS Excel can help filter and will only display data records, which meet the
input constraints.
3. Conditional formatting: Depending on the value of certain data, one might often need to
format it differently. This not only helps understand vast amounts of data easily but also
helps in better readability and also sometimes makes applications of computer-based
data analysis algorithms easier. A very common example can be coloring all the
profit-based final sales as green in color and the loss based on final sales as red.
4. Charts and graphs: Diagrammatic representation of any data helps increase its under-
standing, handling, and presentation properties among various other benefits. Even a
simple MS Excel chart can say more than a page full of numbers. MS Excel offers various
Table 2.3 Comparison of different data analytic tools.

Flexibility
MS Excel: Offers great flexibility for structured and semi-structured data.
Apache Spark: A great tool for managing any sort of data, irrespective of its type.
Open Refine: Operates on data resembling relational database tables.
Hadoop: Manages data whether structured or unstructured, encoded or formatted, or of any other type.
NoSQL: There are all sorts of information out there, and at some point tokenizing, parsing, and natural-language processing are required.
Tableau: Does not provide automatic updates; manual work is always required whenever the user changes the backend.
R Programming: Can incorporate all of the standard statistical tests, models, and analyses, and provides an effective language for managing and manipulating data.

Scalability
MS Excel: Decent scalability, both horizontally and vertically.
Apache Spark: Boasts a scalable model and is a preferred tool, as it offers all the features of Hadoop plus an added speed advantage.
Open Refine: Scalability is not one of its strong points; however, it accommodates this drawback with the ability to transform data from one format to another.
Hadoop: Scalability is a huge feature of Hadoop; it is an open-source platform and runs on industry-standard hardware.
NoSQL: Horizontally scalable; increasing load can be managed by adding servers.
Tableau: Very easy to learn compared to other visualization software; users can incorporate Python and R.
R Programming: Has some limitations in terms of available functions to handle big data efficiently, and scaling R scripts from single-node to elastic and distributed cloud services requires knowledge of the appropriate computing environments.

Cost effectiveness
MS Excel: The basic version is free for all; however, professional packs are costly.
Apache Spark: Open-source software, hence free and easily available.
Open Refine: Open-source software with easy availability; offers a nice solution to messy data.
Hadoop: Apache Hadoop, being open source, is easily available and runs on a low-cost hardware platform.
NoSQL: Being open source, it is easily available and runs on a low-cost hardware platform.
Tableau: Much costlier when compared to other business intelligence tools providing approximately the same functionality.
R Programming: An open-source language; it can be used anywhere in any organization without investing money in purchasing a license.

Data visualization
MS Excel: Pivot tables, graphs, and pie charts are among the many tools available for visualizing data.
Apache Spark: Supports many APIs through which data visualization can be done, including open-source visualization tools such as D3, Matplotlib, and ggplot, even for very large data.
Open Refine: Since it allows linking to websites and other online APIs, it supports external visualization tools; it also has some built-in reconciliation tools.
Hadoop: Stores data in data nodes; client-server architecture.
NoSQL: Visualizes data in the form of document databases, key-value pair databases, wide-column databases, and node-based databases.
Tableau: Purely dedicated to data visualization.
R Programming: Visualizes data in the form of graphical representations; this attribute of R is exemplary and is the reason it surpasses most other statistical and graphical packages with ease.
charts and graphs, which include pie chart, bar graph, histogram, etc. MS Excel offers a
very user-friendly way to make these charts and graphs.
5. Pivot tables: One of the most powerful and widely used features of MS Excel, pivot tables
allow us to extract the significance from a large, detailed data set. Excel pivots are able
to summarize data in flexible ways, enabling quick exploration of data and producing
valuable insights from the accumulated data.
6. Scenario analysis: In data analysis, we are often faced with a lot of scenarios for analy-
sis and are constantly dealing with the question of “what if.” This what-if analysis can
be done in MS Excel and it allows us to try out different values (scenarios) for various
formulas.
7. Solver: Data analysts have to deal with a lot of decision problems while analyzing large
sets of data. MS Excel includes a tool called Solver that uses techniques from operations
research to find optimal solutions for all kinds of decision problems.
8. Analysis ToolPak: This add-in program provides data analysis tools for various applica-
tions like statistical, engineering, and financial data analysis.
As we can see, MS Excel as a tool for data analysis has a lot of functionalities. One can see
many advantages in using this tool. Some of these advantages include easy and effective
comparisons, powerful analysis of large amounts of data, ability to segregate and arrange
data, and many more. However, like everything in life, there are many disadvantages asso-
ciated with MS Excel as well. These disadvantages range from a relatively high learning
curve for full utilization of MS Excel and costly professional services to time-consuming data entry,
problems with data that is large in volume, and, many times, simple calculation errors. Thus,
MS Excel offers a holistic tool that serves a variety of data analysis needs, but at the same
time there is still a lot of room for improvement, and it falls short for very specific analyses on
scaling data.
[Figure: features of Apache Spark, including swift processing, dynamic nature, and in-memory computation.]
data set with ease and also a reconcile and match feature, which extends the data with
the web.
Open Refine is available in multiple languages including English, Japanese, Russian,
Spanish, German, French, Portuguese, Italian, Hungarian, Hebrew, Filipino, and Cebuano,
etc., and it is also supported by the Google News Initiative.
are either marking or deleting the rows. By combining filters and facets the rows can
be selected and the operations are performed.
● Column-based operations: Unlike rows, we can add new columns. There are multiple
GREL (Google Refine Expression Language) statements for modifying the data in the
selected cell.
3. Exporting Results
Once the data has been transformed, it can be exported to different formats like CSV,
JSON, TSV, HTML, and Excel spreadsheets, which are supported by OpenRefine. By
default, the format selected is JSON, which can be easily changed to other formats.
Uses of OpenRefine
1. Cleaning of the data set that is huge and messy.
2. Transforming the data set.
3. Parsing data from the web.
Limitations of OpenRefine
1. To perform simple operations, complicated steps need to be followed.
2. Sometimes the tool degrades and returns false results; the only solution is restarting the
software, or sometimes even the project itself.
2.5.4 R Programming
R is a programming language for statistical computation, representation of graphics, and
reporting [29]. R was created by Robert Gentleman and Ross Ihaka in the 1990s and is based on the
language S, designed at Bell Labs for statistical analysis. It is a free, open-source
language available at [30] and maintained by the R project. R can be used in command
line mode as well as through many graphical user interfaces (GUIs), like RStudio, R Commander, etc.
Some of the features of R are:
2.5.4.1 Advantages of R
2.5.4.2 Disadvantages of R
1. R is relatively slower than its competitive languages such as Python and Julia.
2. R is not scalable as compared to its competitive languages.
3. R can only handle data sets that fit into the memory of the machine. Even the most expensive machines may not have enough memory to accommodate a large enterprise data set.
2.5.5 Tableau
Tableau [29] is business visualization software used to visualize large volumes of data in the form of graphs, charts, figures, etc. Tableau is mostly used to visualize spreadsheets (Excel files). Since it is difficult to analyze numbers, text, and their interdependencies directly, Tableau helps to analyze and understand how the business is going. Hence, Tableau is also known as business intelligence software.
1. Easy and user-friendly: Tableau provides easy installation and does not use any high-level
programming language. Its drag-and-drop functionality provides a user-friendly
interface.
2. Variety: Tableau can visualize data in numerous forms such as graphs, pie charts, bar
graph, line graph, etc., and in numerous colors and trends.
3. Platform independent: Tableau is platform independent, i.e., it can work on any hardware
(Mac, PC, etc.) or software (MacOS, Windows, etc.).
2.5.5.3 Disadvantages
1. Flexibility: Tableau does not provide automatic updates. Manual work is always required whenever the user changes the backend.
2. Cost: It is much costlier than other business intelligence tools that provide approximately the same functionality.
3. Screen resolution: If the developer screen resolution is different from the user screen
resolution, then the resolution of the dashboard might get disturbed.
2.5.6 Hadoop
Hadoop [31] is an open-source Apache software framework used for big data that allows storing and processing large data sets in a parallel and distributed fashion. The core Apache Hadoop framework is basically divided into two parts: the storage part, well known as HDFS (Hadoop distributed file system), and the processing part, MapReduce, as shown in Figure 2.12.
A. Common: This module of Apache Hadoop includes libraries and utilities needed by other Hadoop modules.
B. Distributed file system (HDFS): A distributed file system that provides very high aggregate bandwidth across the cluster.
C. YARN: This module was introduced in 2012 as a platform responsible for the management and computing of resources as clusters that are utilized for scheduling users' applications.
D. MapReduce: This is the processing part of Hadoop, used for large-scale data processing.
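To make the MapReduce idea concrete, the following minimal Python sketch simulates the map and reduce phases of a word count on a toy list of lines. It only illustrates the programming model; it is not Hadoop's actual API, and the input lines are invented.

from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.lower().split()]

def reduce_phase(pairs):
    # Sum the counts emitted for each word across all mappers.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data needs distributed processing",
         "hadoop stores big data in hdfs"]
intermediate = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(intermediate))   # e.g. {'big': 2, 'data': 2, ...}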
2.5.6.2 Benefits
A. Scalability and performance: Operates a very large data set over many inexpensive par-
allel servers making it highly scalable. Hadoop can store thousands of terabytes of data.
B. Reliable: Hadoop provides fault tolerance. It creates a copy of original data in the cluster,
which it uses in the event of failure.
C. Flexibility: Structured format is not required before storing data. Data can be stored in
any format (semi-structured or unstructured).
D. Cost effective: Apache Hadoop is open-source software that is easily available and runs
on a low-cost hardware platform.
2.6.2 NoSQL
NoSQL, unlike the traditional database, is a new family of file-based database management systems independent of the RDBMS model. Big companies like Amazon and Google use NoSQL to manage their databases. It was designed to overcome two main drawbacks of traditional database systems: limited operational speed and inflexibility in storing data. The basic difference between SQL and NoSQL is that SQL works on a predefined schema and all the data are entered accordingly; a field defined as an integer can store only integer values, whereas in NoSQL there is no predefined structure and data is stored in different formats, such as JSON objects.
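As a small illustration of the schema-less storage just described, the sketch below shows how a record might look as a JSON document, in contrast to a fixed-schema SQL row. The field names and values are made up, and only Python's standard json module is used.

import json

# A document-style record: nested fields and optional attributes are allowed,
# unlike a fixed relational schema where every column must be declared up front.
order = {
    "order_id": "A1001",
    "customer": {"name": "Asha", "city": "Delhi"},
    "items": [
        {"sku": "BOOK-42", "qty": 1},
        {"sku": "PEN-07", "qty": 3},
    ],
}
print(json.dumps(order, indent=2))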
A. Document database: Data is stored in the form of document in an encoded format such
as a JSON object, or XML. Documents are identified via a unique key. Each key is paired
with a complex data structure called a document, e.g., Mongo DB, Apache Couch DB,
and Arango DB.
B. Key-value pair database: Data is stored in the form of key and value pairs. For every data value, there must exist a key for reference. The key is a string, and the value can be of any data type or size. The schema of the database is dynamic and can easily be changed by just adding new pairs. It works like an associative array, e.g., Aerospike, Apache Ignite, and Amazon DynamoDB.
C. Wide column database: In this, the data is stored in form of rows and columns. But unlike
a traditional database, the data type of each row can vary, e.g., HBase and Cassandra.
D. Node-based database: This type of database is also known as a graph-based database.
Here the data is stored in nodes that are connected to each other. This type of data set is
used where there is high connectivity among the nodes, e.g., Allegro Graph and Arango
DB (Table 2.4).
It is particularly helpful to classify the problems that analysts face when sorting big data into a few broad categories, since such hierarchical classifications turn out to be useful and versatile. These problems should not be confused with the problems faced by IT, which concern storing data, accessing data, and putting it somewhere.
The four categories are:
2.8 Conclusion
The data generated on the internet roughly doubles every year, and this data is mainly unstructured in nature. Its volume is too large to be processed with traditional programming languages, visualization tools, and analytical tools. In this book chapter, the authors explained the various types of data, i.e., structured, unstructured, and semi-structured, as well as their various sources and the various file formats in which data is stored. Further, the analytical process for big data was categorized into descriptive, prescriptive, and predictive analytics, along with real-world applications. The authors then described various analytical tools, such as R, Microsoft Excel, Apache Spark, OpenRefine, and Tableau, and two storage management tools, HDFS and NoSQL. In the last section, various challenges and applications of big data were presented.
References
1 https://en.wikipedia.org/wiki/Raw_data
2 www.ngdata.com/what-is-data-analysis
3 Castellanos, A., Castillo, A., Lukyanenko, R., and Tremblay, M.C. (2017). Understanding
benefits and limitations of unstructured data collection for repurposing organizational
data. In: Euro Symposium on Systems Analysis and Design. New York, Cham: Springer.
4 Liao, S.-H., Chu, P.-H., and Hsiao, P.-Y. (2012). Data mining techniques and
applications–a decade review from 2000 to 2011. Expert Systems with Applications 39
(12): 11303–11311.
5 Tan, A.-H. (1999). Text mining: the state of the art and the challenges. In: Proceedings of
the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, vol. 8. sn,
65–70.
6 Gelbukh, A. (2005). Natural language processing. In: Fifth International Conference on
Hybrid Intelligent Systems (HIS’05), 1. Rio de Janeiro, Brazil: IEEE https://doi.org/10
.1109/ICHIS.2005.79.
7 Jahangiri, N., Kahani, M., Ahamdi, R., and Sazvar, M. (2011). A study on part of speech
tagging. Review Literature and Arts of the Americas.
8 www.analyticsvidhya.com/blog/2017/03/read-commonly-used-formats-using-python/
9 www.json.org
10 www.w3.org/standards/xml/core
11 Assunção, M.D., Calheiros, R.N., Bianchi, S. et al. (2015). Big data computing and
clouds: trends and future directions. Journal of Parallel and Distributed Computing 79:
3–15.
12 Gudivada, V.N., Baeza-Yates, R., and Raghavan, V.V. (2015). Big data: promises and
problems. Computer 48 (3): 20–23.
13 Zaharia, M., Chowdhury, M., Das, T. et al. (2012). Resilient distributed datasets: a
fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th
USENIX Conference on Networked Systems Design and Implementation, 2–2. USENIX
Association.
14 Russom, P. (2011). Big data analytics. TDWI Best Practices Report, Fourth Quarter 19 (4):
1–34.
15 Wolpin, S. (2006). An exploratory study of an intranet dashboard in a multi-state health-
care system. Studies in Health Technology and Informatics 122: 75.
16 Delen, D. and Demirkan, H. (2013). Data, information and analytics as services. Deci-
sion Support Systems 55: 359–363.
17 IBM netfinity predictive failure analysis. http://ps-2.kev009.com/pccbbs/pc_servers/pfaf
.pdf (accessed 20 November 2018).
18 Lohr, S. (2012). The age of big data. New York Times 11 (2012).
19 Purdue university achieves remarkable results with big data. https://datafloq.com/read/
purdueuniversity-achieves-remarkable-results-with/489 (accessed 28 February 2015).
20 Gartner taps predictive analytics as next big business intelligence trend. www.enterpriseappstoday.com/business-intelligence/gartner-taps-predictive-analytics-as-nextbig-business-intelligence-trend.html (accessed 28 February 2015).
21 The future of big data? Three use cases of prescriptive analytics. https://datafloq.com/read/future-big-data-use-cases-prescriptive-analytics/668 (accessed 02 March 2015).
22 Farris, A. (2012). How big data is changing the oil & gas industry. Analytics Magazine.
23 The oil & gas industry looks to prescriptive analytics to improve exploration and pro-
duction. www.exelisvis.com/Home/NewsUpdates/TabId/170/ArtMID/735/ArticleID/
14254/The-Oil--Gas-Industry-Looks-to-Prescriptive-Analytics-To-Improve-Exploration-
and-Production.aspx (accessed 28 February 2015).
24 Kantere, V. and Filatov, M. (2015). A framework for big data analytics. In: Proceedings
of the Eighth International C* Conference on Computer Science & Software Engineering
(eds. M. Toyama, C. Bipin and B.C. Desai), 125–132. New York: ACM.
25 Levine, D.M., Berenson, M.L., Stephan, D., and Lysell, D. (1999). Statistics for Managers
Using Microsoft Excel, vol. 660. Upper Saddle River, NJ: Prentice Hall.
26 Ham, K. (2013). OpenRefine (version 2.5). http://openrefine.org. Free, open-source tool
for cleaning and transforming data. Journal of the Medical Library Association 101 (3):
233.
27 Zaharia, M., Chowdhury, M., Franklin, M.J. et al. (2010). Spark: cluster computing with
working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud
Computing, HotCloud’10, 10–15. New York: ACM.
28 Ihaka, R. and Gentleman, R. (1996). R: a language for data analysis and graphics.
Journal of Computational and Graphical Statistics 5 (3): 299–314.
29 www.tableau.com/trial/data-visualization
30 www.r-project.org/about.html
31 http://hadoop.apache.org
3
Statistical Methods for Intelligent Data Analysis: Introduction and Various Concepts
3.1 Introduction
In recent years, the widespread use of computers and the internet has led to the generation of data on an unprecedented scale [1–4]. To make effective use of this data, it is necessary that the data be collected and analyzed so that inferences can be made to improve
various products and services. Statistics deals with collection, organization, and analysis of
data. Organization and description of data is studied under descriptive statistics, whereas
analysis of data, and making predictions based on it is dealt with in inferential statistics.
3.2 Probability
3.2.1 Definitions
Before we delve deeper into the understanding of statistical methods and its applications,
it is imperative that we review some definitions and go through concepts that will be used
all throughout the chapter [5].
3.2.1.2 Probability
In any experiment, probability is defined with reference to a particular event. It is measured
as the chances of occurrence of that event among all the possible events. In other words, it
can be defined as the number of desired outcomes divided by the total number of possible
outcomes of an experiment.
1. Probability(Outcome) ≥ 0
2. If outcomes A and B are mutually exclusive, i.e., A ∩ B = Ø, then Probability(A ∪ B) = Probability(A) + Probability(B)
3. Probability(Outcome space) = 1
3.2.1.5 Independence
Two events may not always be interrelated or influence each other. Outcome A is said to be independent of outcome B when the occurrence of B does not affect the probability of outcome A.
While the reverse is true, the following expression is also true and is derivable
X(H, T) = X(T, H) = 1, X(H, H) = 2, X(T, T) = 0    (3.4)
3.2.1.8 Expectation
Expectation is similar to probability distribution in the sense that it has different definitions
for different nature of variables, discrete and continuous. In the case of discrete variables,
the expected value or mean 𝜇 is calculated as:
E(X) = \sum_{x} x\, p(x), \quad \text{and} \quad E(h(X)) = \sum_{x} h(x)\, p(x) \qquad (3.7)
Whereas for continuous variables, the summation is simply replaced with an integration:
E(X) = \int_{-\infty}^{\infty} x f(x)\, dx, \quad \text{and} \quad E(h(X)) = \int_{-\infty}^{\infty} h(x) f(x)\, dx \qquad (3.8)
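The discrete form of Eq. (3.7) can be checked with a few lines of Python; the probability mass function below (a loaded six-sided die) is purely illustrative.

# Expected value of a discrete random variable: E(X) = sum of x * p(x).
pmf = {1: 0.1, 2: 0.1, 3: 0.2, 4: 0.2, 5: 0.2, 6: 0.2}   # assumed probabilities

expectation = sum(x * p for x, p in pmf.items())
expectation_h = sum((x ** 2) * p for x, p in pmf.items())  # E(h(X)) with h(x) = x^2
print(expectation, expectation_h)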
Here Bi is an event whose probability has to be found, given that we have evidence of the outcome A. Bi must be influenced by A for the evidence to have an effect on its value. An extremely famous example to aid the understanding of Bayes' rule is that of a false- or
true-positive test for a clinical trial. Let us consider a trial to detect cancer in patients. Now the
trial could give us a true-positive result in the case where the patient has cancer and the trial
indicates so. However, there are cases where a false-positive is obtained when the patient
does not have cancer but the trial reports otherwise. Similarly, there are cases possible for
negative test results and for people who do not have cancer. Bayes’ rule allows us to answer
questions of the kind: given a positive test result, what is the probability of a patient actually
having cancer?
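A minimal numerical sketch of the cancer-test example follows, with made-up prevalence, sensitivity, and false-positive rates, showing how Bayes' rule converts a positive test result into the probability of actually having the disease.

# Hypothetical numbers: 1% prevalence, 90% sensitivity, 5% false-positive rate.
p_cancer = 0.01
p_pos_given_cancer = 0.90
p_pos_given_healthy = 0.05

# Total probability of a positive test.
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

# Bayes' rule: P(cancer | positive) = P(positive | cancer) * P(cancer) / P(positive).
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))   # roughly 0.154 with these assumed numbers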
Descriptive statistics, as its name implies, refers to the branch of statistics that deals with
collection, presentation, and interpretation of data. Raw data, by itself, is of little use.
Descriptive statistical methods can be used to transform raw data in a form such that its gist
can be easily found out and it can be interpreted using established methods. Descriptive
statistical methods are of the following types:
1. Picture representation
2. Measure of central tendency
3. Measure of variability
4. Distributions
which also preserves intragroup information. Here, class intervals are shown on the left of
a vertical line. Each class interval is represented by a stem, which shows the lowest value in the interval. The column to the right of the stem is the leaves column, which contains a series of
numbers, corresponding to the last digit of values for each value in the interval. The value
of the dependent variable for any entry is given by:
Final Value = Stem value × Stem width + leaf (3.11)
where “stem width” is the number of values a stem can represent.
3.3.2.1 Mean
Mean of a distribution is the sum of all measurements divided by the number of
observations.
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (3.12)
where x_1, x_2, ..., x_n are the values of n observations. In case the distribution is continuous,
the mean is calculated by:
\bar{f} = \frac{1}{b - a} \int_{a}^{b} f(x)\, dx \qquad (3.13)
where f(x) is the distribution function, \bar{f} is the mean of the function, [a, b] is the domain of the function, and x ∈ R.
3.3.2.2 Median
Median of a data set is the value that separates the higher half from the lower half.
3.3.2.3 Mode
Mode is defined as the most frequent value in a data set. There can be one or multiple modes
for a given set of values.
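The three measures of central tendency can be computed directly with Python's standard statistics module, as in this small sketch on an invented sample.

import statistics

data = [2, 3, 3, 5, 7, 8, 9]          # an invented sample

print(statistics.mean(data))           # arithmetic mean, Eq. (3.12)
print(statistics.median(data))         # value separating the higher half from the lower half
print(statistics.mode(data))           # most frequent value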
3.3.3.1 Range
Range is defined as the difference between the highest and the lowest value in a data set.
While it’s a pretty simple measure, it informs us about the distribution of data in the data
set. Simple range, though, is affected by outliers that can expand it abnormally, hence it has
been supplanted by other measures such as interquartile range and semi-interquartile range.
Interquartile range is the difference between the highest and the lowest values when only
values in Q2 and Q3 are considered. Semi-interquartile range is the interquartile range
divided by two. These measures mitigate the effects of outliers and provide a better view
of the data than the simple range.
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (3.14)
where σ² is the variance, n is the number of data instances, x_i are the observed values, and x̄ is the mean. A low variance implies that the values in the data set lie close to the mean, while a high variance means that the data is spread out. Standard deviation is another measure of dispersion in a data set; it is calculated as the square root of the variance.
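Range, interquartile range, variance, and standard deviation can all be obtained with NumPy, as the following sketch on an invented sample shows; note that np.var uses the population form of Eq. (3.14) by default.

import numpy as np

data = np.array([4, 7, 9, 10, 12, 15, 40])   # invented sample; 40 acts as an outlier

data_range = data.max() - data.min()
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                  # interquartile range
semi_iqr = iqr / 2                             # semi-interquartile range
variance = np.var(data)                        # population variance, Eq. (3.14)
std_dev = np.std(data)                         # square root of the variance

print(data_range, iqr, semi_iqr, variance, std_dev)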
median in the left skew, but this assertion has been shown to be unreliable [6]. Formally,
skewness is defined as:
\gamma = E\!\left[\left(\frac{X - \bar{x}}{\sigma}\right)^{3}\right] \qquad (3.15)
where γ is Pearson's moment coefficient of skewness, E is the expectation, X is the random variable, x̄ is the mean, and σ is the standard deviation.
Kurtosis is the measure of the thickness of the tails of a distribution. With regard to the normal distribution, a distribution whose tail width is similar to that of the standard normal distribution is called mesokurtic, one whose tails are heavier is known as leptokurtic, and distributions with thinner tails are called platykurtic. Kurtosis is defined as:
\mathrm{Kurt}[X] = E\!\left[\left(\frac{X - \bar{x}}{\sigma}\right)^{4}\right] \qquad (3.16)
where the variables mean the same as in Eq. (3.15).
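Both moments can be computed with scipy.stats, as in the sketch below on invented data; scipy reports excess kurtosis by default, so fisher=False is passed to obtain the definition in Eq. (3.16).

import numpy as np
from scipy.stats import skew, kurtosis

data = np.random.default_rng(0).lognormal(size=1000)   # right-skewed, heavy-tailed sample

print(skew(data))                      # Pearson's moment coefficient of skewness, Eq. (3.15)
print(kurtosis(data, fisher=False))    # kurtosis as in Eq. (3.16); > 3 means heavier tails than normal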
approaches that will be discussed are testing and estimation. Testing is based on decid-
ing whether a particular hypothesis regarding the population stands accepted or not with
respect to the patterns observed in the sample data. The other approach of estimation can
further be divided into two types depending upon whether the estimation is of a particular
value or a range of values, i.e., point estimation or interval estimation.
B𝜃 = E𝜃 (G) − 𝜃 (3.17)
In fact, mean squared error, M 𝜃 (G) is an umbrella term for the two values calculated
above. It is defined as the sum of the squared bias and variance.
The expected width of the interval,
E_θ(G_U − G_L), \qquad (3.20)
and the probability of the population value actually lying in that interval are important quality measures. There is an obvious trade-off between a higher probability of containing the value and the width of the interval. Thus, according to convention, a value of (1 − α) is used as the confidence level for the estimator
for all possible 𝜃. An individual instance (gU , gL ) of this interval is called a 100(1 − 𝛼)%
confidence interval.
Comprehensive study of both point estimation and interval estimation is beyond the
scope of this book.
Type I errors are considered graver than type II errors. It is more serious to distort the predictions made about populations through a false rejection, since the population would then be rejected based only on the sample data. G is usually accepted as a point
estimator for 𝜃. For example, if a hypothesis is based on the mean 𝜇, then X would be the
appropriate choice for G.
H0 ∶ 𝜃 ≥ 𝜃0 , Ha ∶ 𝜃 < 𝜃0 . (3.23)
The critical region mentioned earlier is bounded by the maximum value that still leads to rejection under H0. All values from −∞ to c_u, the critical value bounding the region, are rejected. Hence, this is also known as left rejection.
The measure of quality of a test is the test’s power 𝛽
Since H 0 is the domain in which we want most of our g to fall as it falls under type II errors,
we wish to have a higher value of 𝛽(𝜃) for 𝜃 ∈ H a . Type I errors are avoided by having a small
value of 𝛽(𝜃) for 𝜃 ∈ H 0 . Additionally, we try to restrict the type I errors to a maximum value
called significance level 𝛼 of the test
When statistical significance [9] is discussed in the context of correlation, we come across
two values, r and 𝜌. r refers to the degree of correlation between pair of instances in the sam-
ple. 𝜌 on the other hand refers to the degree of correlation that exists in the actual entirety
of the population. We seek to eliminate the possibilities, through the use of statistical sig-
nificance, of having a positive or negative value of r (for positive or negative correlation)
while the value of 𝜌 equals zero. This is a bigger concern when the size of the sample is
small, since such inferences can easily arise by chance. It has been conventionally accepted that a 5% significance level is the benchmark for accepting that an inference drawn on a sample is not arrived at by chance or luck and may hold true for the entire population.
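The distinction between the sample correlation r and the population correlation ρ is exactly what a significance test addresses. The sketch below uses scipy.stats.pearsonr, which returns r together with the p-value for the null hypothesis ρ = 0; the data are invented.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.normal(size=30)                          # a small sample, where chance correlations are a real concern
y = 0.5 * x + rng.normal(scale=1.0, size=30)

r, p_value = pearsonr(x, y)
print(r, p_value)
# If p_value < 0.05, the conventional 5% benchmark, the observed correlation is
# unlikely to be due to chance alone.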
3.5.1 Regression
Regression defines the relationship between the response and the regressors. Here, the inde-
pendent variables are called “regressors,” whereas the dependent variable, also known as
the output, is called the “response.” Regression can be linear as well as nonlinear.
Here, y is the response, E(y) is the expected value of y, n is the number of data instances,
𝛽 i are the weights and xi are the regressors.
While the linear model is not accurate for several classes of data, it can provide a close
approximation of the underlying function. Moreover, a large class of nonlinear functions,
such as polynomials, can be easily converted into the linear model [10]. Linear models can
be easily solved using simple numerical methods, and hence it is the most widely used
model for data analysis. Moreover, linear models can be used to perform statistical infer-
ence, especially hypothesis testing and interval estimations for weights and predictions.
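A minimal sketch of fitting a linear model by ordinary least squares with NumPy follows; the data are synthetic, and in practice one would also examine the inference quantities discussed later in this section.

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)                        # single regressor (synthetic)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=50)     # response with noise

# Design matrix with an intercept column, solved by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

print("estimated weights (beta_0, beta_1):", beta_hat)
fitted = X @ beta_hat
residuals = y - fitted                                 # r_j = y_j - y_hat_j
print("residual sum of squares:", float(residuals @ residuals))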
It is used to describe the relationship between the weight of a part of a plant and the
weight of the whole plant.
The Mitscherlich model used to predict crop yield [13] with respect to amount of fertilizer
applied is expressed as:
While nonlinear methods can model real-world phenomena more accurately, there exists
no algebraic method to estimate the least square estimators for 𝛽s. Moreover, the statistical
properties of the responses and regressors are unknown, hence no statistical inference can
be performed. Due to these shortcomings, nonlinear models are employed only in cases
where the relationships between the regressors and the responses are known, and the goal
is to find the weights. If output is to be predicted, an approximate linear model should
be used.
1. The distribution of the data can be any of the distributions in the exponential family.
Hence, such models can be used to analyze data following distributions such as binomial,
Poisson, or gamma, as well as the normal.
2. The expected value of the response is not a linear function of the regressors, but is
given by:
g(E(y_j)) = \beta_0 + \sum_{i=1}^{n} \beta_i x_{ij}. \qquad (3.29)
Here, the variables are the same as in Eq. (3.26). g(.) represents a differentiable monotone
function, known as link function.
A general algorithm has been proposed in [14], which though iterative, has natural start-
ing values and uses repeated least square estimations. It also gives us a common method
for statistical inference.
Estimated values for weights and responses are found using least square estimators. The
estimated weights are written as 𝛽̂i , and hence the fitted values are calculated as:
\hat{y}_j = \hat{\beta}_0 + \sum_{i=1}^{n} \hat{\beta}_i x_{ij}. \qquad (3.31)
The fitted values are the predictions made by the regression according to the estimated weights. A quantity called the residual is defined as the difference between the observed value and the fitted value for a given data point:
r_j = y_j - \hat{y}_j. \qquad (3.32)
The residuals are related to the variance 𝜎 2 , which can be estimated from the residuals
by:
S^2 = \frac{\sum_{j=1}^{m} (y_j - \hat{y}_j)^2}{m - (n + 1)}, \qquad (3.33)
where the numerator is called the residual sum of squares (RSS) and the denominator is called the residual degrees of freedom (ν). If the fitted model is adequate or redundant, i.e., it contains at least the required nonzero weights, S² is a good estimate of σ²; whereas if the fitted model is deficient, S² is larger than σ².
We can use the concepts defined above to test whether some of the inputs in a model are
extraneous. We can fit a base model Ω1 , which is known to be either adequate or redundant.
For simplicity, Ω1 can contain all the possible inputs. Another model, Ω0 , is fitted for the
data in which the weights and inputs under investigation are absent.
The residual sum of squares (RSS1 ) and degrees of freedom(𝜈 1 ) are calculated for Ω1 . The
estimate of variance according to the base model is given by
S_1^2 = \frac{RSS_1}{\nu_1}. \qquad (3.34)
The residual sum of squares (RSS0 ) and degrees of freedom(𝜈 0 ) are calculated for Ω0 . The
estimate of variance according to the test model is given by
S_0^2 = \frac{RSS_0}{\nu_0} \qquad (3.35)
If Ω0 is also an adequate model, the values of S_1^2 and S_0^2 should be fairly close. For better accuracy, the extra sum of squares is calculated as
𝜈E = 𝜈0 − 𝜈1 (3.37)
The residual degrees of freedom (𝜈) is same as that in ANOVA for the same set of inputs. Σ
is then given by R/𝜈. The extra sum of squares and products matrix can be calculated in the
same manner by which we calculated the extra sum of products in the univariate model,
that is, by taking a difference for the matrices for the two models under comparison.
For the test statistic, there are four commonly used methods – Wilks’ lambda,
Lawley-Hotelling trace, Pillai trace, and Roy’s greatest root. If the dimensionality of the
output is 1, all these test statistics are the same as F-statistic. Once the superfluous 𝛽s have
been eliminated, the relationship between the elements of the output and the 𝛽s is found
by canonical variate analysis.
being that ANOVA only works with numerical variables, whereas these models deal with
categorical inputs.
Log-linear model is a type of generalized linear model, where the output Y i follows a
Poisson distribution, with an expected value 𝜇 i . The log of 𝜇 i is taken to be linear, hence the
name. Such models are represented by:
Since the variables are categorical, they can either be present, represented by 1, or absent, represented by 0. Hence, the input variables can only be binary.
Log-linear model is used to find the associations between different variables. These asso-
ciations, also called interactions, are represented by 𝛽 in Eq. (3.44). Hence, the problem
boils down to finding which of the 𝛽 s are zero, same as that in ANOVA.
For log-linear analysis, we first need to calculate deviance of the fitted model. Deviance in
the log-linear model is analogous to Residual Sum of Squares in ANOVA. The deviance can
be calculated for a data set using any of the popular statistical software packages available, to
get a deviance table for the data set. The change in deviance after including the interaction
between a pair of variables is similar to Extra Sum of Squares in ANOVA. If the interaction
is redundant, the change in deviance has a distribution that is chi-squared with one degree
of freedom. If the change is much higher than expected, the interaction has a significant
effect on the output, and must be included in the model. This analysis is performed for
each effect and their interactions, and the superfluous ones are weeded out. One thing to remember is that the change in deviance depends on the order in which the terms are included in the deviance table, hence a careful analysis of the data is required.
After the deviance analysis, the model is represented by the significant effects and inter-
actions. A conditional independence graph is then prepared. In this graph, each effect is
represented as nodes, and the interactions in the model are represented as edges between
them. If, for a given node N, we know the values of its neighbor nodes, then any information
about any other nodes does not tell us anything new about N.
3.5.1.9 Overdispersion
The data is said to be overdispersed when the observed variability in output is much greater
than what is predicted from the fitted model. It can happen if there’s a failure in identify-
ing a hidden variable which should have been included. Another reason may be that the
output is not entirely decided by the input, but also on certain intrinsic variations between
individuals. Hence, to model such effects, an extra variable U j is added to the linear part
of the model, which is distributed over N(0,𝜎 U 2 ). A logistic regression with random effects
can be expressed as:
logit(pj ) = 𝛽0 + 𝛽1 x1j + … + 𝛽n xnj + Uj . (3.46)
Overdispersion has been studied in [16].
where F(t) is the cumulative distribution function of the output. The hazard function is the
probability density of the output at time t with respect to survival till time t:
h(t) = \frac{f(t)}{S(t)}, \qquad (3.48)
where f (t) is the probability density function of the output. In most cases, the hazard
increases with time. This is because though the probability distribution of the output
decreases, the number of survivors declines at a greater rate.
For most models, a proportional hazard function is used, which is defined as:
h(t) = \lambda(t) \exp\{G(x, \beta)\} \qquad (3.49)
where 𝜆(t) is baseline hazard function, a hazard function on its own, and G(x,𝛽) is any arbi-
trary known function. Conventionally, G(x,𝛽) is taken as a linear function, similar to that
in the linear model. A linear G implies that if the baseline hazard is known, the logarithm
of likelihood is a generalized linear model with Poisson distribution output and a log link
function, which is the log-linear model.
In industrial studies, the baseline hazard can be known by performing controlled exper-
iments, but the same cannot be done for human patients due to ethical, moral, and legal
concerns. Hence, Cox proportional hazard model [17] is used in medical studies that can
work with arbitrary baseline hazard. It loses information related to the baseline, and the
resulting likelihood is independent of it. It uses the information about the order in which
the failures take place, instead of the survival time. This model, too, can be fitted using a
log-linear model for further analysis.
To calculate the coefficients for the principal components, the eigenvalues of the variance-covariance matrix are found and arranged in decreasing order. The correspond-
ing eigenvectors for these eigenvalues are also found. The coefficients of the principal
components are same as the values of the elements of the eigenvectors, while the eigen-
values give the variances of the components. We should find k principal components such
that
\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{p} \lambda_j} \approx 1, \qquad (3.59)
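The eigenvalue procedure just described can be written in a few lines of NumPy; the data matrix here is random and purely illustrative, and k is chosen so that the retained eigenvalues account for most of the total variance, in the spirit of Eq. (3.59).

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # 200 observations, 5 variables (synthetic)
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)             # variance-covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)    # eigh handles symmetric matrices

# Arrange in decreasing order of eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = np.cumsum(eigenvalues) / eigenvalues.sum()
k = int(np.searchsorted(explained, 0.95) + 1)      # smallest k whose ratio in Eq. (3.59) is near 1
components = X_centered @ eigenvectors[:, :k]      # coefficients come from the eigenvectors
print(k, components.shape)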
3.6 Errors
We had read in Section 3.4 that the aim of inferential statistics was to correctly or at least
reasonably make predictions about the population from the given sample without making
an error of generalization. That is, we tend to make predictions about values based on the
training sample. If these predictions result in a numeric value, then they are categorized as
regression, otherwise if the values fall into discrete unordered sets, they are categorized as
classification.
The basis for prediction of values is the assumption that a random variable Y , is depen-
dent in some way on the values of X i , which are the outcomes of the given sample. This
dependency can be any function such as a simple linear regression model. However, since
the prediction is based on assumption, if the true relationship between the random vari-
able and the sampled data is not linear, it can lead to prediction error [19]. If on the other
extreme end, no assumptions are made then the random variable might overfit the data.
Overfitting refers to a phenomenon in which the parameters drawn from a sample of data
to define a relationship between the random variable and the sampled data become too sen-
sitive to a particular sample. In such a case, the predictions made about the relationship can
be found out to be wrong even on a new sample of data. Thus, overfitting more often than
not, contributes to prediction error.
bias that is almost negligible for the two models and the gradual decrease in variance as size
of sample increases.
3.7 Conclusion
An introductory view of the basics of statistics and the statistical methods used for intel-
ligent data analysis was presented in this chapter. The focus was on building an intuitive
understanding of the concepts. The methods presented in this chapter are not a comprehen-
sive list, and the reader is encouraged to audit the literature to discover how these methods
are used in real world scenarios. The concepts presented in the chapter form the basics of
data analysis, and it is expected that the reader now has a firm footing in the statistical
concepts that can then be used to develop novel state-of-the-art methods for specific use
cases.
References
1 Berthold, M.R., Borgelt, C., Höppner, F., and Klawonn, F. (2010). Guide to Intelligent
Data Analysis: How to Intelligently Make Sense of Real Data. Berlin: Springer Science &
Business Media.
2 Lavrac, N., Keravnou, E., and Zupan, B. (2000). Intelligent data analysis in medicine.
Encyclopedia of Computer Science and Technology 42 (9): 113–157.
3 Liu, X. (2005). Intelligent data analysis. In: Encyclopedia of Data Warehousing and Min-
ing, 634–638. IGI Global.
4 Seem, J.E. (2004). Method of intelligent data analysis to detect abnormal use of utilities in buildings. New York, NY: US Patent 6,816,811, 9 November 2004.
5 DeGroot, M.H. and Schervish, M.J. (2012). Probability and Statistics. Boston, MA: Pear-
son Education.
6 Von Hippel, P.T. (2005). Mean, median, and skew: correcting a textbook rule. Journal of
Statistics Education 13 (2), 1–13.
7 Lowry, R. (2014). Concepts and Applications of Inferential Statistics.
8 Berthold, M.R. and Hand, D.J. (2007). Intelligent Data Analysis: An Introduction.
Springer.
9 Huck, S.W., Cormier, W.H., and Bounds, W.G. (1974). Reading Statistics and Research.
New York, NY: Harper & Row.
10 Draper, N.R. and Smith, H. (2014). Applied Regression Analysis, vol. 326. Hoboken, NJ:
Wiley.
11 Niklas, K.J. (1994). Plant Allometry: The Scaling of Form and Process. Chicago, IL:
University of Chicago Press.
12 Mitscherlich, E.A. (1909). Des gesetz des minimums und das gesetz des abnehmended
bodenertrages. Landwirsch Jahrb 3: 537–552.
13 Ware, G.O., Ohki, K., and Moon, L.C. (1982). The mitscherlich plant growth model for
determining critical nutrient deficiency levels 1. Agronomy Journal 74 (1): 88–91.
14 McCullagh, P. and Nelder, J.A. (1989). Generalized linear models, vol. 37 of monographs
on statistics and applied probability. London, UK: Chapman and Hall, Second edition.
15 Tabachnick, B.G. and Fidell, L.S. (2007). Using Multivariate Statistics. New York: Allyn
& Bacon/Pearson Education.
16 Collett, D. (1991). Modelling binary data, vol. 380. London, UK: Chapman & Hall.
17 Cox D.R. (1992). Regression models and life-tables. In: Kotz, S. and Johnson, N.L. (eds).
Breakthroughs in Statistics. New York, NY: Springer.
18 Jolliffe, I. (2011). Principal component analysis. In: International Encyclopedia of Statisti-
cal Science, 1094–1096. Boston, MA: Springer.
19 Salkever, D.S. (1976). The use of dummy variables to compute predictions, prediction
errors, and confidence intervals. Journal of Econometrics 4 (4): 393–397.
20 Friedman, J.H. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery 1 (1): 55–77.
4
Intelligent Data Analysis with Data Mining: Theory and Applications
Objective
Having explained various aspects of intelligent data analysis (IDA) in previous chapters, we
are now going to see it along with the essence of data mining. In this chapter we are going
to base our discussion on the following major points:
● What is data mining? What are the similarities and differences between data and knowledge?
● Process of knowledge discovery in data. Along with it, various mining methods, including
classification, clustering, and decision tree will be discussed.
● Issues related to data mining and its evaluation.
● In addition to that, the chapter will be concluded with a view of data visualization and
probability concepts for intelligent data analysis (IDA).
Initially, the term “data fishing” was used, which meant analyzing data without a prior
hypothesis. The term was used in a negative way and continued to be used until Michael
Lovell published an article in the Review of Economic Studies in 1983. He used the
term “data mining” but the semantics of the term was still negative. Later on, it was
in the 1990s when "data mining" came to be seen as a positive and beneficial practice. The First International Conference on Knowledge Discovery and Data Mining (KDD-95) was held in 1995 in Montreal, under AAAI sponsorship, for research purposes.
Usama Fayyad and Ramasamy Uthurusamy chaired it. The journal Data Mining and Knowledge Discovery followed soon after.
All of these things aim to reduce risks and increase profitability in the business. Companies
like Amazon analyze trends of customers buying specific items together, so that in the future, when a customer buys an item, they can recommend the items that previous customers have bought together. Amazon has reported a 29% sales increase as a result of this recommen-
dation system. The stock market is another field in which data mining is extensively used
in order to predict stock prices and get better returns on investment.
Table 4.1 Differences between data and knowledge.
Data: It is a collection of raw and unorganized facts. Knowledge: It is basically the possession of information and using it for the profit of the business.
Data: It is not specific or context based. Knowledge: It is very specific and context based.
Data: It is based on observations and records. Knowledge: It is obtained from information.
Data: It may be organized or unorganized. Knowledge: It is always organized.
Data: It is sometimes not useful. Knowledge: It is always useful.
Data: It does not depend on knowledge. Knowledge: Without data, knowledge cannot be generated.
of data mining, but it had some crucial differences. The analysts using OLAP needed some
prior knowledge about the things they were looking for, whereas in data mining, the ana-
lysts have no prior knowledge of the expected results. Moreover, OLAP gives the answers
to questions on the basis of past performance, but it cannot uncover patterns and relation-
ships that predict the future. Table 4.1 represents some of the peculiar differences between
data and knowledge.
Not only is the data huge in volume, it also comes from a variety of sources, some of which are shown in Figure 4.2 below.
[Figure 4.2 Sources of data, including relational data, transactional data, time-related data, and text data.]
[Figure 4.3 Knowledge tree for intelligent data mining: data → information → knowledge.]
requirements and business expectations from a particular data mining project and ends
with applying the gained knowledge in business in order to benefit it. Data mining is used
to extract this information from any given data as shown by knowledge tree in Figure 4.3.
[Figure: the data mining process – understanding business expectations, preparing data, data mining and evaluating results, and presenting findings.]
[Figure: data mining issues – methodology-based (mining new kinds of knowledge, interactive mining, query languages, improved efficiency) and data-based (handling noisy data, handling complex data).]
2. Handling complex data: Complexity of data can be seen either as complexity in the variety of data being generated, or as the complexity of large amounts of data, stored in different places, that have to be analyzed by the mining algorithm simultaneously and in real time. The variety of data includes web data, multimedia data, time-stamp data, hypertext data, sensor data, etc. A single mining system cannot produce optimal results for all kinds of data; thus, specific systems should be made to handle each kind efficiently. As for distributed data, mining algorithms that can run efficiently on a distributed network can be the solution.
● Database technology
● Statistics
● Machine learning
● Information science
● Visualization
● Other disciplines
[Figure: data mining systems at the confluence of database technology, statistics, machine learning, information science, visualization, and other disciplines.]
Based on mining of databases: Database systems can be categorized based on data types,
models, etc.
Based on mining of knowledge: Knowledge mining can be classified based on functionalities
such as characterization, discrimination, association, classification, prediction, cluster-
ing, etc.
Based on utilization of techniques: The techniques can be classified based on the degree of
interactivity of users involved in analysis of methods employed.
Based on adaptation of application: They can be classified based on applications in areas of
finance, telecommunications, stock analysis, etc.
The following considerations play an important role in designing the data mining
language:
● Specification of data set in data mining request applicable to a data mining task.
● Specification of knowledge kind in data mining request to be revealed during the process.
● Availability of background knowledge available for data mining process.
● Expressing data mining results as generalized or multi-level concepts.
● Ability to filter out less interesting knowledge based on threshold values.
⟨DMQL⟩ ::=
use database ⟨database_name⟩
{use hierarchy ⟨hierarchy_name⟩ for ⟨attribute⟩}
⟨rule_spec⟩
related to ⟨attr_or_agg_list⟩
from ⟨relation(s)⟩
[where ⟨condition⟩]
[order by ⟨order_list⟩]
{with [⟨kinds_of⟩] threshold = ⟨threshold_value⟩
[for ⟨attribute(s)⟩]}
[Source: http://hanj.cs.illinois.edu/pdf/dmql96.pdf]
● Model evaluation involves the steps to fit the model and to estimate how well a particular pattern (a model and its parameters) meets the criteria of the knowledge discovery in data (KDD) process. It should meet both logical and statistical benchmarks.
● Search methods comprise parameter search and model search, which optimize the model calculation and iterate over the parameters within the search method.
1. Classification
2. Cluster analysis
3. Association
4. Decision tree induction
4.7.1 Classification
Classification is a data mining technique used to categorize each item in a set of data into
one of an already defined set of classes or collections. It is used to reveal important informa-
tion about data and metadata. It involves building the classification model and then using
the classifier for classification as shown in Figure 4.8.
The other classification methods include:
● Genetic algorithm: Centered on the idea of the survival of the fittest, the formation of the new population depends on rules derived from the current population as well as on the offspring values. Crossover and mutation are the genetic operators used to form offspring. The fitness rules are decided by the training data set and evaluated by their accuracy.
● Rough-set approach: This method works on discrete-valued attributes and sometimes requires conversion of continuous-valued attributes. It is established by creating equivalence classes from the training set. The tuples forming an equivalence class are indiscernible, which indicates that the samples are identical with respect to the attributes describing the data and thus cannot be distinguished easily.
● Fuzzy-set approach: Also called possibility theory, it works at a high level of generalization. This method is advantageous when dealing with inexact data. Fuzzy theory is a solution for vague facts, where finding relations in the data set is complex and there is no crisp measure for classification.
The most common algorithms include logistic regression, naïve Bayes, and support vector
machine, etc.
A typical example is the classification of e-mail messages as spam or authentic based upon the content present in them, using a large data set of e-mails (a minimal sketch follows).
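A hedged sketch of this spam-classification idea using scikit-learn's naïve Bayes classifier is given below; the tiny e-mail corpus and labels are invented, and a real system would need a far larger training set.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented miniature training corpus: 1 = spam, 0 = authentic.
emails = ["win a free prize now", "lowest price offer click here",
          "meeting agenda for monday", "please review the attached report"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)        # bag-of-words features

classifier = MultinomialNB()
classifier.fit(X, labels)                   # build the classification model

new_email = ["free offer, click to win"]
print(classifier.predict(vectorizer.transform(new_email)))   # expected to predict 1 (spam)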
4.7.3 Association
It is used to discover relations between two or more items in the same transaction, as shown in Figure 4.10. It is often used to find hidden patterns in data and is also called the relation technique. It is frequently used to find rules associated with frequently occurring elements, and it is the basis for root cause analysis and market basket analysis.
The Apriori algorithm is widely used in the retail industry for association rule mining. It is used in product bundling to research customer buying habits based on historical customer transactions, and in defect analysis as well (a simple co-occurrence sketch follows).
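The following small pure-Python sketch illustrates the market basket idea behind association mining by counting how often pairs of items occur together in invented transactions; it is a simplified stand-in for a full Apriori implementation, not the algorithm itself.

from itertools import combinations
from collections import Counter

# Invented historical transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "chips"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of transactions containing both items.
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(transactions))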
[Figure: a simple decision tree that splits on the condition "Weight == heavy?" with yes/no branches.]
Data exploration is defined as the process to understand the characteristics present in the
data set better. It includes listing down the characteristics using an analytics tool or a more
innovative statistical software. It mainly focuses on summarizing the contents present in
the data set knowing the variables present, any missing values, and a general hypothesis.
The various steps involved in data exploration are:
1. Variable identification: This involves deciding the role of the variables in the data set
mainly as predictor and target variables. Also, identification of their data type and their
category.
2. Univariate analysis: This involves analyzing variables one by one and is used to find missing values and outliers. It consists of two categories:
a. Continuous variables: It gives understanding about the central tendency and spread of
the variables. Common visualizations include building histograms and boxplots for
individual variable. They focus on the statistical parameters.
b. Categorical variables: Count and count% are used to understand the distribution via
frequency table. Bar chart is used for visualization purpose for categorical values.
3. Bivariate analysis: It determines the relation between two variables. It can be performed
on the combination of continuous and categorical variables to search for association or
disassociation. The combination can be:
a. Categorical and categorical: The relation between two categorical variables can be dis-
covered via following methods:
– Two-way table: It is a table showing count and count% for each combination of
observations in rows and columns.
– Stacked column chart: It is a visual representation of two-way table showing
columns as stacks shown in Figure 4.12.
[Figure 4.12 Example of a stacked column chart showing counts for Type A, Type B, and Type C across Weeks 1–4.]
Figure 4.13 Different relationships shown by scatter plots for bivariate analysis.
– Chi-square test: It derives the statistical relation between two or more categories via the difference between expected and observed frequencies (a small scipy sketch follows this list). The formula is:
\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}
b. Categorical and continuous: Box plots can be drawn to display the relation between a categorical and a continuous variable, and we can assess statistical significance using:
– Z-test/T-test: This test is used to assess whether the means of two categories are statistically different. If the probability of Z is small, then the difference between the two means is more substantial. The T-test is used when the number of observations in either group is less than 30.
– ANOVA: Also called analysis of variance, it is used to analyze the difference between the means of more than two categories in a sample.
c. Continuous and continuous: In bivariate analysis, a scatter plot is a handy way to look for a relation between two continuous variables, as shown in Figure 4.13. The relation can be linear or nonlinear. To determine the strength of the relation, correlation is used. Its value lies between −1 and 1, with 0 showing no correlation.
The formula for correlation is as follows:
\text{Correlation} = \frac{\text{Cov}(x, y)}{\sigma_x \, \sigma_y}
where,
Cov (x, y) is the covariance between the two variables
𝜎x = Standard deviation of variable x
𝜎y = Standard deviation of variable y
4. Missing values treatment: Missing values in a data set can reduce its fit in the model and
can make the model biased, and can wrongly predict the relation among the variables.
Missing values can occur at:
● Data extraction: There may be errors with data extraction, which may lead to miss-
ing values. Sometimes the data source may also have inconsistencies or typographical
errors. They can be corrected easily.
● Data collection: These errors are harder to find, as they occur during data collection itself.
A common treatment for missing values is k-nearest neighbor (k-NN) imputation: it takes the correlation structure of the data into account and the model is easy to build, but the selection of the k-value is a critical task, as it shows deviation at both higher and lower values of k.
5. Outlier treatment: Outliers can arise from several sources, including:
● Measurement error
● Experimental error
● Sampling error
Boxplots, histograms, and scatter plots are an easy way to detect outliers. We can remove
outliers by the following methods:
● Deleting observations
● Assigning values via mean, mode, and median based on the data set
6. Feature engineering: It is the science of extracting more new information from the data
set without adding any additional data. It involves two steps of variable creation (via
derived variables and dummy variables) and transformation to make it more systematic.
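As promised under the chi-square item above, here is a minimal scipy sketch applying the test to an invented two-way table of counts; the categories and numbers are hypothetical.

import numpy as np
from scipy.stats import chi2_contingency

# Invented two-way table: rows are customer segments, columns are product types.
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
# A small p-value (below the usual 5% level) suggests the two categorical
# variables are statistically related rather than independent.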
Data visualization is defined as producing graphs, charts, plots, or any graphical repre-
sentation of data for better understanding and to extract meaningful information. It is a
comprehensive way to analyze data and get statistics from the complete data set without
reading the data. Patterns or trends are easily detectable using a visualization software that
might get missed in text-based data. The more sophisticated software allow more complex
visualizations or even with options such as demographics, geographic distribution, heat
maps, dials, and gauges, etc.
Some of its advantages are:
● It can deal with very large databases containing nonhomogeneous and noisy data too.
● It requires no knowledge of statistics or algorithms for understanding.
● It provides a qualitative view for quantitative analysis.
● It can identify factors and critical areas for improvement.
● It is a comprehensive way to show results, recent trends, and to communicate to others
too.
Some of the data visualization techniques are shown in Figure 4.14 and different visual-
ization samples in Figure 4.15.
Data visualization is also used in business intelligence to get insights of the business
model and other analytics.
[Figures 4.14 and 4.15 Examples of data visualization techniques: proportional area chart, waterfall chart, phase diagram, cycle diagram, population pyramid, boxplot, three-dimensional stream graph, semicircle donut chart, topographic map, and radar diagram.]
[Figure 4.16 Classification of different probability distribution functions and their approaches: a decision flowchart that, based on whether the data is discrete or continuous, symmetric or asymmetric, and on where the outliers lie, points to distributions such as the binomial, geometric, negative binomial, hypergeometric, discrete uniform, uniform, triangular, normal, logistic, exponential, lognormal, gamma, Weibull, Cauchy, and extreme value distributions.]
A data visualization can be in the form of a dashboard and can track live data from a
source, showing variations using a metric like progress rate, performance monitoring, alert
systems, etc. Tableau and Qlik are the major market vendors today in data visualization
domain.
It is a useful environment for data engineers and scientists in the data exploratory process
for detailed analysis.
5
Intelligent Data Analysis: Deep Learning and Visualization
5.1 Introduction
Deep learning [1] and deep reinforcement learning [2] are currently the driving force in applied artificial intelligence for many applications, delivering excellent performance. Basically, deep learning [1] is a powerful approach to learning data representations, and reinforcement learning [2] is a modern approach to solving decision-making problems. Together they form the essential building blocks of intelligent autonomous systems [3], which can solve basic tasks such as simultaneous localization and mapping (SLAM) by interacting with an unknown environment. Deep learning is sometimes called hierarchical learning or deep structured learning. The history of deep learning and neural networks is not new. Indeed, the first mathematical model of neural networks was introduced by Walter Pitts and Warren McCulloch in 1943 [4]. However, the field took off only in recent years with the adoption of graphics processing units (GPUs), which have opened up more opportunities for many applications in artificial intelligence. Figure 5.1 shows the wide scope of deep learning.
Let’s look at the progression of fields of deep learning. In machine learning, there
commonly exist three types of learning: supervised learning, unsupervised learning,
and reinforcement learning. Most of the recent successes in applications are computer
vision and image processing, language and speech processing, which includes natural
language processing and speech recognition, and robotics. In past decades, simple machine learning algorithms focused mainly on supervised learning. Supervised learning algorithms [5] learn a mapping function from example data sets, in which features are labeled with targets and large numbers of examples are given to classify, i.e., the learning task learns to predict the output. Unlike in previous decades, when feature mappings were extracted by hand, it is now necessary to use visualization when preprocessing data, especially big data. Another challenge is how to visualize the network model itself, which calls for tools to understand vectorization and deep learning frameworks.
In this chapter, the first section covers deep learning and visualization in context. Next, information on data processing and visualization is addressed to understand specific
[Figure 5.1 Left: overview of neural networks and deep learning at the intersection of computer science, physics, mathematics, engineering, biology, psychology, neuroscience, medicine, and business; right: deep learning as a branch of machine learning (ML) within artificial intelligence (AI) [1].]
methods. Finally, several experiments are fully conducted with relevant applications in the
real world, including data visualizations.
[Figure 5.2 (a) Overview of visualization: score function, data loss, and regularization; (b) the data-driven approach in a linear model.]
[Figure 5.3 Linear model and sample data visualization: left, a simple linear module ŷ = f(x); right, visualization of sample data based on the linear module.]
And based on the scores vector, s = f (xi , W), we can write the support vector machine (SVM)
loss as formulation:
L_i = \sum_{j \neq y_i} \begin{cases} 0 & \text{if } s_{y_i} \geq s_j + 1 \\ s_j - s_{y_i} + 1 & \text{otherwise} \end{cases} = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1), \qquad (5.3)
From (5.3),
L = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \neq y_i} \max\bigl(0, f(x_i; W)_j - f(x_i; W)_{y_i} + 1\bigr). \qquad (5.6)
Regularization takes the form:
L(W) = \frac{1}{N} \sum_{i=1}^{N} L_i\bigl(f(x_i, W), y_i\bigr) + \lambda R(W) \qquad (5.7)
where λ is the regularization strength (a hyperparameter). The first part of Eq. (5.7) is the data loss, which makes the model's predictions match the training data. The second part is the regularization term, which does not fit the training data but is very useful in practice.
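The multiclass hinge loss of Eq. (5.6) combined with the L2 regularization of Eq. (5.7) can be written directly in NumPy; the scores, labels, and regularization strength below are invented, and the sketch is vectorized over a small batch only for illustration.

import numpy as np

def svm_loss(W, X, y, lam=0.1):
    # X: (N, D) data, y: (N,) integer labels, W: (D, C) weight matrix.
    scores = X @ W                                       # s = f(x_i, W)
    N = X.shape[0]
    correct = scores[np.arange(N), y][:, None]           # s_{y_i} for each sample
    margins = np.maximum(0.0, scores - correct + 1.0)    # max(0, s_j - s_{y_i} + 1)
    margins[np.arange(N), y] = 0.0                       # do not count j == y_i
    data_loss = margins.sum() / N                        # Eq. (5.6)
    reg_loss = lam * np.sum(W * W)                       # lambda * R(W), Eq. (5.7)
    return data_loss + reg_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))        # 5 samples, 4 features (invented)
y = np.array([0, 2, 1, 2, 0])      # class labels for 3 classes
W = rng.normal(scale=0.01, size=(4, 3))
print(svm_loss(W, X, y))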
According to f = f(g) and g = g(x), the chain rule gives:
\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}. \qquad (5.8)
In the visualization context, Eq. (5.8) is used to define the feed-forward and back-propagation passes in neural networks. Optimization, in particular, is one of the key research topics in deep learning. For instance, Figure 5.4 illustrates the loss function:
[Figure 5.4 Gradient descent visualization in deep learning: left, the gradient descent concept on a loss surface; center and right, examples of selected initial and final weights during gradient descent.]
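A minimal gradient descent loop on a one-dimensional quadratic loss illustrates what Figure 5.4 depicts: starting from an initial weight and stepping against the gradient until the loss flattens out. The loss surface, starting point, and learning rate are invented for illustration.

# Gradient descent on loss(w) = (w - 3)^2, whose minimum is at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)

w = -4.0               # selected initial weight (assumed)
learning_rate = 0.1
for step in range(50):
    w -= learning_rate * gradient(w)   # move against the gradient

print(w, loss(w))       # w ends up close to 3, the final weight with near-zero loss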
Figure 5.5 Left: model design with simplified blocks for dog detection; right: example showing the visualization of feature layers [Fei-Fei Li].
\text{loss} = -\frac{1}{N} \sum_{n=1}^{N} \big[ y_n \log \hat{y}_n + (1 - y_n) \log(1 - \hat{y}_n) \big].   (5.11)
Figure 5.7 (a) Matrix multiplication for deep learning using a linear model. (b) Visualization of the linear function extended with a sigmoid nonlinearity.
The common activation functions and their derivatives are listed below.

Piecewise Linear:
f(x) = 0 for x < x_min; wx + b for x_min ≤ x ≤ x_max; 1 for x > x_max
f'(x) = 0 for x < x_min; w for x_min ≤ x ≤ x_max; 0 for x > x_max

Sigmoid:
f(x) = 1 / (1 + e^{-x})
f'(x) = f(x)(1 - f(x))

Hyperbolic Tangent (TanH):
f(x) = 2 / (1 + e^{-2x}) - 1
f'(x) = 1 - f(x)^2

Arctangent (ArcTan):
f(x) = tan^{-1}(x)
f'(x) = 1 / (1 + x^2)

Rectified Linear Units (ReLU):
f(x) = 0 for x ≤ 0; x for x > 0
f'(x) = 0 for x ≤ 0; 1 for x > 0

Leaky Rectified Linear Units (Leaky ReLU):
f(x) = ax for x ≤ 0 (with a small slope a); x for x > 0
f'(x) = a for x ≤ 0; 1 for x > 0

Exponential Linear Units (ELU):
f(x) = a(e^x - 1) for x ≤ 0; x for x > 0
f'(x) = f(x) + a for x ≤ 0; 1 for x > 0

Softplus:
f(x) = ln(1 + e^x)
f'(x) = 1 / (1 + e^{-x})
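A short NumPy sketch of several of the activation functions above and their derivatives; the input range is arbitrary and purely illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # f(x) = 1 / (1 + e^-x)

def tanh_act(x):
    return 2.0 / (1.0 + np.exp(-2 * x)) - 1  # equivalent to np.tanh(x)

def relu(x):
    return np.where(x > 0, x, 0.0)

def softplus(x):
    return np.log1p(np.exp(x))               # ln(1 + e^x)

# derivatives, matching the entries in the list above
d_sigmoid  = lambda x: sigmoid(x) * (1 - sigmoid(x))
d_tanh     = lambda x: 1 - tanh_act(x) ** 2
d_relu     = lambda x: np.where(x > 0, 1.0, 0.0)
d_softplus = lambda x: sigmoid(x)

x = np.linspace(-5, 5, 11)
print(relu(x), d_relu(x))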
Optimizer: Figure 5.8 shows that Adam (adaptive learning rate optimization) [7] achieves one of the best performances.
(Figure 5.8: training loss versus optimization steps for SGD, Momentum, RMSprop, and Adam.)
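The comparison in Figure 5.8 can be reproduced in spirit with the sketch below, which minimizes a small quadratic with each PyTorch optimizer and records the loss curve; the objective, learning rates, and step count are illustrative assumptions, not the experiment behind the figure.

import torch

def run(opt_name, steps=400):
    """Minimize a small quadratic and record the loss curve, as in Figure 5.8."""
    w = torch.zeros(10, requires_grad=True)
    target = torch.linspace(-1, 1, 10)
    opts = {
        "SGD": torch.optim.SGD([w], lr=0.01),
        "Momentum": torch.optim.SGD([w], lr=0.01, momentum=0.9),
        "RMSprop": torch.optim.RMSprop([w], lr=0.01),
        "Adam": torch.optim.Adam([w], lr=0.01),
    }
    opt, losses = opts[opt_name], []
    for _ in range(steps):
        opt.zero_grad()
        loss = ((w - target) ** 2).mean()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

for name in ["SGD", "Momentum", "RMSprop", "Adam"]:
    print(name, run(name)[-1])    # plotting each list reproduces a Figure 5.8-style chart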
Figure 5.9 Left: example of a block diagram commonly used to visualize complex networks; a LeNet-style model with a 32 × 32 input, convolution and pooling layers producing 28 × 28, 14 × 14, 10 × 10, and 5 × 5 feature maps, followed by fully connected output layers.
where R_{t+1} is the reward an agent receives after taking action A_t in state S_t. Two further parameters are also defined: the discount rate \gamma and the learning rate \alpha (0 < \alpha < 1). Solving for the optimal policy with Q-learning uses the update

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \big],

which, if both \gamma and \alpha are set to 1, reduces to Q(S_t, A_t) \leftarrow R_{t+1} + \max_a Q(S_{t+1}, a).
Figure 5.10 Overview of reinforcement learning model [9]: an agent is visualized with the
real-time environment based on avoiding the obstacles experience.
(Figure 5.11: the agent, visualized with a policy π(s, a) represented by a deep neural network, interacting with the real-time environment through states St, actions at, and rewards Rt.)
In this formulation, the Markov decision process is used to formalize reinforcement learning problems. Our approach here is to visualize the meaning of Figures 5.11 and 5.12 and how they work in visualization contexts.
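As an illustration of the Q-learning update above, here is a minimal sketch of tabular Q-learning on a toy chain environment; the environment, reward, and hyperparameter values are invented for illustration and are not the authors' experiment.

import random

# tiny chain environment: states 0..4, action 0 = left, 1 = right, reward 1 on reaching state 4
N_STATES, ACTIONS = 5, (0, 1)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.1       # learning rate, discount rate, exploration probability

def choose_action(s):
    if random.random() < eps or Q[(s, 0)] == Q[(s, 1)]:
        return random.choice(ACTIONS)   # explore, or break ties randomly
    return max(ACTIONS, key=lambda act: Q[(s, act)])

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        a = choose_action(s)
        s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # Q(S_t, A_t) += alpha * (R_{t+1} + gamma * max_a Q(S_{t+1}, a) - Q(S_t, A_t))
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        s = s_next

print({s: max(Q[(s, 0)], Q[(s, 1)]) for s in range(N_STATES)})   # V(s), cf. Figure 5.12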
(Figure 5.12: sum of V over all actions at each episode, plotted over 100 and 1000 training episodes.)
For transfer learning, the idea is basically to retrain only the last layer so that the learned feature representation is reused for new object classes, as sketched below.
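A minimal PyTorch sketch of this last-layer retraining idea; the torchvision ResNet-18 backbone and the 20-class output size are illustrative assumptions.

import torch.nn as nn
import torchvision

# pretrained backbone used as a frozen feature extractor
model = torchvision.models.resnet18(weights="DEFAULT")   # older torchvision uses pretrained=True
for param in model.parameters():
    param.requires_grad = False                           # freeze the learned feature layers

# replace only the last layer; just model.fc.parameters() are trained on the new data set
model.fc = nn.Linear(model.fc.in_features, 20)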
5.2.5 Softmax
The softmax function is formalized as

\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}},   (5.16)

where j = 1, …, K.
The cost function can be defined using cross-entropy (Figure 5.15). First, let L denote the loss and (x_i, y_i) the training set. The loss can then be defined as

L = \frac{1}{N} \sum_{i} D\big(s(W x_i + b), y_i\big),   (5.17)

and

D(\hat{Y}, Y) = -Y \log \hat{Y}.   (5.18)
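A minimal NumPy sketch of Eqs. (5.16)-(5.18); the weight shapes and the toy batch are illustrative assumptions.

import numpy as np

def softmax(z):
    """Eq. (5.16): sigma(z)_j = exp(z_j) / sum_k exp(z_k), computed stably."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, y_onehot):
    """Eq. (5.18) averaged over the batch as in Eq. (5.17)."""
    return -np.mean(np.sum(y_onehot * np.log(probs + 1e-12), axis=-1))

W = np.random.randn(3, 4)           # 3 classes, 4 features (illustrative shapes)
b = np.zeros(3)
x = np.random.randn(2, 4)           # batch of 2 inputs
y = np.eye(3)[[0, 2]]               # one-hot labels
print(cross_entropy(softmax(x @ W.T + b), y))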
Weight histograms are very useful for debugging and visualization. For instance, interpreting the distributions helps us understand training parameters such as weights, biases, and loss matrices or other statistics; to do so, we need to visualize the weights layer by layer. Figure 5.16 is an example of this.
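A small sketch of such a per-layer weight histogram; the two-layer network below is an arbitrary stand-in for a trained model.

import torch.nn as nn
import matplotlib.pyplot as plt

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

for name, param in model.named_parameters():
    # one histogram per weight or bias tensor, flattened to a 1D array
    plt.hist(param.detach().numpy().ravel(), bins=50, alpha=0.5, label=name)
plt.legend()
plt.title("Per-layer weight and bias distributions")
plt.show()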
Figure 5.13 Inception v3 module: a powerful example for visualizing a deep network model.
Figure 5.14 GoogLeNet architecture [12].
These are some of the difficulties of understanding how deep learning works. Activation visualization currently offers exciting methods for understanding why deep networks perform well, as shown in Figure 5.17.
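One common way to collect layer activations for this kind of visualization is a forward hook; the following PyTorch sketch (with an arbitrary small convolutional network and a random input standing in for a real image) stores each ReLU output so that its channels can later be shown as heat maps.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()   # keep this layer's output for plotting
    return hook

for name, layer in model.named_modules():
    if isinstance(layer, nn.ReLU):
        layer.register_forward_hook(save_activation(name))

_ = model(torch.randn(1, 3, 64, 64))          # one forward pass on a dummy image
for name, act in activations.items():
    print(name, act.shape)                    # each channel can be rendered as a heat map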
Figure 5.19 Comparison methodology: overview of the chart types most commonly used in data visualization.
Figure 5.20 Composition methodology: overview of the chart types most commonly used in data visualization.
Figure 5.21 Example of visualization applied to the MNIST data set using an unsupervised deep learning algorithm: the autoencoder.
and

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \frac{\lambda}{2m} \sum_{j=1}^{m} \theta_j^2.   (5.20)

Usually, for training,

J_{\text{train}}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2.   (5.21)

For cross-validation,

J_{\text{cross-validation}}(\theta) = \frac{1}{2m_{\text{cross-validation}}} \sum_{i=1}^{m_{\text{cross-validation}}} \big(h_\theta(x_{\text{cross-validation}}^{(i)}) - y_{\text{cross-validation}}^{(i)}\big)^2.   (5.22)
(Plots of IMDB rating (0-10) versus gross sales revenue ($ millions): left, a linear fit (degree 1); right, degree-5 polynomial fits with and without L2 regularization.)
For testing,

J_{\text{test}}(\theta) = \frac{1}{2m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} \big(h_\theta(x_{\text{test}}^{(i)}) - y_{\text{test}}^{(i)}\big)^2.   (5.23)
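A minimal NumPy sketch of Eqs. (5.20)-(5.23) on a synthetic regression problem; the data, split sizes, and regularization strength are invented for illustration.

import numpy as np

def cost(theta, X, y, lam=0.0):
    """J = 1/(2m) * sum (h_theta(x) - y)^2, optionally plus lambda/(2m) * sum theta_j^2."""
    m = len(y)
    residual = X @ theta - y
    return (residual @ residual) / (2 * m) + lam / (2 * m) * (theta @ theta)

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=120)
X_train, X_cv, X_test = X[:80], X[80:100], X[100:]
y_train, y_cv, y_test = y[:80], y[80:100], y[100:]

theta = np.linalg.lstsq(X_train, y_train, rcond=None)[0]   # fit on the training split only
print(cost(theta, X_train, y_train, lam=0.1),   # J_train with regularization, Eq. (5.20)
      cost(theta, X_cv, y_cv),                  # J_cross-validation, Eq. (5.22)
      cost(theta, X_test, y_test))              # J_test, Eq. (5.23)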
The simplified model is easy to understand using the visualization techniques pictured in Figures 5.24 and 5.25 for dropout. Batch normalization in deep networks is a more complex method.
(Figures 5.24 and 5.25: train/test fits of an overfitting model versus the same model with 50% dropout; overfitting loss = 0.1627, dropout loss = 0.0834.)
Figure 5.25 Dropout processing and visualization: sampled dropout loss compared with the overfitting problem.
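A minimal PyTorch sketch of dropout behavior; the network and input are illustrative, not the model behind Figure 5.25.

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(1, 128), nn.ReLU(),
    nn.Dropout(p=0.5),               # randomly zeroes 50% of activations during training
    nn.Linear(128, 1),
)

x = torch.linspace(-1, 1, 20).unsqueeze(1)
net.train()                          # dropout active: two forward passes give different outputs
print((net(x) - net(x)).abs().sum().item() > 0)
net.eval()                           # dropout disabled for testing, as in Figure 5.25
print((net(x) - net(x)).abs().sum().item() == 0)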
where the object lies. These areas are further analyzed by the Fast R-CNN detector to determine the object type (classification) and to adjust rectangular bounding boxes (regression) to better fit the object shape. The system loss function \mathcal{L} is a combined loss of the classification loss \mathcal{L}_{\text{class}} and the regression loss \mathcal{L}_{\text{bbox}}:

\mathcal{L} = \mathcal{L}_{\text{class}} + \mathcal{L}_{\text{bbox}}.   (5.24)
Thanks to sharing the convolutional feature map across the classification, regression, and RPN stages, Faster R-CNN [4] is faster than Fast R-CNN and therefore requires less computational effort.
In this section we consider the Mask-RCNN [10] framework for object detection and seg-
mentation (Figure 5.26).
Mask R-CNN [10] is extended from Faster R-CNN. Besides the class label and the
bounding box offset, the Mask RCNN is able to detect the shape of objects, called object
masks.
This information is useful for designing high-precision robotic systems, especially autonomous robotic grasping and manipulation applications. The general loss function \mathcal{L} additionally considers the mask loss \mathcal{L}_{\text{mask}}:

\mathcal{L} = \mathcal{L}_{\text{class}} + \mathcal{L}_{\text{bbox}} + \mathcal{L}_{\text{mask}}.
Additionally, the Mask R-CNN can achieve a high pixel level accuracy by replacing RoIPool
[14–16] with RoIAlign. The RoIAlign is an operation for extracting a small feature map
while aligning the extracted features with the input by using bilinear interpolation. Readers
may refer to the paper [6] for further details. To train the detector, we reuse a Mask R-CNN
implementation available at [14–16].
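For readers who want to experiment without the implementation cited above, the following is a minimal sketch using torchvision's pretrained Mask R-CNN; the random image tensor stands in for a real photograph and the score threshold is an arbitrary choice.

import torch
import torchvision

# pretrained Mask R-CNN from torchvision (older versions use pretrained=True instead of weights)
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)                  # stand-in for an RGB image scaled to [0, 1]
with torch.no_grad():
    outputs = model([image])[0]                  # dict with boxes, labels, scores, masks

keep = outputs["scores"] > 0.9                   # keep confident detections only
print(outputs["boxes"][keep].shape,              # bounding boxes (regression branch)
      outputs["labels"][keep],                   # class predictions (classification branch)
      outputs["masks"][keep].shape)              # per-instance masks (mask branch)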
Let's look at the chart in Figure 5.27 carefully and consider what a chart can say explicitly. Basically, there are three concepts behind data visualization, drawing on both art and science: relationship, comparison, and composition.
Relationship: it shows the connections or correlations between two or more variables through the data presented.
One useful way to understand and visualize how well our networks perform during training is graph modeling. Figure 5.27 shows the training process; it is useful for understanding which model is best and when we need to stop training.
Figure 5.27 Mask-RCNN training progress: training with Mask-RCNN showing a variety of loss values.
Figure 5.28 Deep learning and object visualization based on sampling during training set using
Mask-RCNN.
Figure 5.27 presents many parameters that make it easy to understand what is happening in our model. Each color shows a different loss value decreasing or increasing as training progresses, which is particularly useful for the following losses: total loss, rpn_class_loss, rpn_bbox_loss, mrcnn_class_loss, value_loss, val_rpn_class_loss, val_mrcnn_class_loss, val_mrcnn_bbox_loss, and val_mrcnn_mask_loss.
One technique we use to monitor convergence is the set of graphs shown in Figures 5.28-5.30: samples for human detection using two classes, faces and human bodies.
We now discuss the visualization of activation functions. In this case, we rely on our experience with our own data sets: food detection and segmentation using more than 30 000 objects and 7000 images organized into 20 classes. Figures 5.31-5.36 visualize the Mask-RCNN segmentation and detection results, including a correlation heat map that shows the relationships of many variables in an all-in-one chart.
Figure 5.30 Human detection using Mask RCNN: noisy data during human detection and segmentation.
Figure 5.31 Showing the activation function of layers based on food recognition and
segmentation using the Mask-RCNN.
(Distributions of the bounding-box corner coordinates (y1, x1) and (y2, x2), and histograms of the box refinement values dy, dx, dw, dh.)
(Training error versus generalization error: the overfitting zone, local minimum and maximum, and the generalization gap.)
(Weight and bias histograms for the layers res2a_branch2a_4 and res2a_branch2b_4.)
(Predictions with confidence scores plotted against ground-truth labels for the food data set; the classes include browl, Hanlkura, Shouga, Tekka maki, Ika, Maguro, Tamago, Negitoro, Ebi, Samon, Anago, and Tubukai.)
To compare the training loss with the validation loss, we can also easily understand our model by visualizing the plotted curves. For deeper visualization, Figures 5.38 and 5.39 illustrate the loss of the model and the rating density of the data set [17].
Figure 5.38 Visualization and loss function in deep learning for recommendation system.
Figure 5.39 Data visualization of the MovieLens 1M data set for a recommendation system based on a deep matrix factorization module.
5.5 Conclusion
Figure 5.40 Line charts, modeling, and visualization for reinforcement learning [19].
References
1 Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
2 Mnih, V., Kavukcuoglu, K., Silver, D. et al. (2015). Human-level control through deep
reinforcement learning. Nature https://doi.org/10.1038/nature14236.
3 Ross Girshick, ‘Fast R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks’, International Conference on Computer Vision (ICCV), IEEE, 2015.
4 Waleed Abdulla, ‘Mask R-CNN for object detection and instance segmentation on Keras
and TensorFlow’, Github, GitHub repository, 2017.
5 Le, A.T., Bui, M.Q., Le, T.D., and Peter, N. (2017). D* Lite with reset: improved version of
D* Lite for complex environment. In: First IEEE International Conference on Robotic Comput-
ing (IRC); Taichung, Taiwan, 160–163. IEEE https://doi.org/10.1109/IRC.2017.52.
6 Srivastava, N., Hinton, G., Krizhevsky, A. et al. (2014). Dropout: a simple way to prevent
neural networks from overfitting. The Journal of Machine Learning Research: 1929–1958.
7 Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep
learning. 2016. arXiv:1603.07285.
8 Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg
and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
9 Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht,
The Marginal Value of Adaptive Gradient Methods in Machine Learning, arXiv, 2018.
10 Le, T.D., Huynh, D.T., and Pham, H.V. (2018). Efficient human-robot interaction using
deep learning with mask r-cnn: detection, recognition, tracking and segmentation.
In: 2018 15th International Conference on Control, Automation, Robotics and Vision
(ICARCV), 162–167.
11 Chollet, F. (2017). Xception: deep learning with depthwise separable convolutions. In:
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1800–1807.
12 Doan, K.N., Le, A.T., Le, T.D., and Peter, N. (2016). Swarm robots communication and
cooperation in motion planning. In: Mechatronics and Robotics Engineering for Advanced
and Intelligent Manufacturing (eds. D. Zhang and B. Wei), 191–205. Cham: Springer
https://doi.org/10.1007/978-3-319-33581-0_15.
13 Than D. Le, Duy T. Bui Pham, VanHuy Pham, Encoded Communication Based on
Sonar and Ultrasonic Sensor in Motion Planning, IEEE Sensors, 2018. DOI: https://doi
.org/10.1109/ICSENS.2018.8589706
14 Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton, Imagenet classification with deep
convolutional neural networks 2012.
15 Alex Krizhevsky, ‘One weird trick for parallelizing convolutional neural networks’, arxiv,
2014.
16 Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization.
2014. arXiv: 1412.6980v9
17 Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, ‘Faster R-CNN: Towards Real-Time
Object Detection with Region Proposal Networks,’ Advances in Neural Information Pro-
cessing Systems, 2015.
18 Duc Minh Nguyen, Evaggelia Tsiligianni, Nikos Deligiannis, Matrix Factorization via
Deep Learning, 2018. arXiv:1812.01478.
19 Than D. Le, An T. Le, and Duy T. Nguyen, Model-based Q-learning for humanoid robots,
18th International Conference on Advanced Robotics (ICAR), IEEE Xplore, 2017. DOI:
https://doi.org/10.1109/ICAR.2017.8023674
20 Pytorch, www.pytorch.org
21 He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask-RCNN. In: 2017 IEEE
International Conference on Computer Vision (ICCV), 2980–2988.
6.1 Introduction
Dental caries is the most familiar oral disease. It is a painful bacterial disease caused
mainly by Streptococcus mutans, acid, and carbohydrates. If caries remains untreated
then it affects the root of the teeth. The World Health Organization (WHO) report [1]
reveals that 98% of adults and 60–90% [2] of school children suffer from dental caries.
It is an infectious and chronic disease that can affect us at any age. In the near future it will
be an epidemic.
our teeth are damaged. Proper hygiene of teeth reduces the risk of caries. Early caries
creates a small hole at the enamel layer.
Most of the people in the world suffer from painful dental caries due to the
above-mentioned reasons. Dental caries are of three types: enamel caries, dentine
caries, and root caries. Enamel is the first layer of the teeth and it is made with calcium.
In its early stage of caries, the enamel layer starts to decay. This is called enamel caries.
Fillings and good oral hygiene prevent this type of caries spreading. This loss can progress
through the dentin layer to the pulp region and ultimately, people will lose their tooth
[4, 9]. This condition can be avoided if it is detected in its early stages. In that case, surgical
procedures could be skipped. If dental caries is detected at its early stage, the dentist can
start the initial treatments to control the progress of caries lesions. In the next phase,
infection goes to the dentine layer where the layer starts to decay. This is called dentine
caries. If it remains untreated then the root of the teeth becomes affected. The cemento
enamel junction is affected mainly by root caries [10]. As a result, the patient feels lots of
pain. The solution for this is a root canal or an uprooted tooth. Figure 6.1 shows dental
caries at different phases. Figure 6.1a shows the enamel caries, Figure 6.1b shows dentine
caries, and Figure 6.1c shows the root caries.
Figure 6.2 shows the worldwide caries affected rate along with its severity in the middle-aged
group of people [11]. This graph is prepared according to the WHO report [2]. According to
this report, 11% of the total population is highly affected (meaning all the risk factors are present),
12% of the total population suffers moderately (meaning they have pain and have gone through
filling procedures), and 9% of the total population has caries at its early stages, whereas another
7% of people have very low-risk dental caries and for the remaining 61%, no data are found.
Figure 6.3 shows the percentage affected rate for smokers. Tobacco smoking increases
the risk factor of dental caries. Tobacco increases the growth of Streptococcus mutans
Figure 6.2 Worldwide dental caries severity regions: dental caries affected rate in the world as per the WHO report (High 11%, Moderate 12%, Low 9%, Very Low 7%, No Data 61%).
(Figure 6.3: percentage of dental caries affection for smokers and non-smokers by gender and by age group below and above 50.)
on the dental surface and increases the severity of dental caries. Nicotine is one of the toxins
found in tobacco, and it forms acid when it dissolves in the water base of saliva. This acid
attacks the calcium layer of the teeth. The report in Figure 6.3 shows that the percentage of
affected males is considerably higher than that of the female population, as the number of
female smokers is lower. The rate of newly affected cases is lower than that for people above
the age of fifty. Caries detection at an early stage prevents a great deal of tooth decay, but if it
remains untreated, the outcome can be severe. In 2010, 2.4 billion people and 621 million
children had caries in their permanent teeth due to a lack of, or ignorance of, proper treatment.
Figure 6.4 shows the recent data from CAPP (the WHO Country/Area Profile Project) in 2014.
It represents the global caries map for children younger than 12 years by the average number
of affected teeth using the decayed, missing, and filled teeth (DMFT) index of severity, referred
to below as the DMFT metric. This world map clearly shows the severity rate for all countries.
For example, in North America, Australia, China, and some countries in Africa, children have
a very low affected rate,
Figure 6.4 Worldwide dental caries affected level according to DMFT among 12-year-old children.
Table 6.1 Global DMFT trends for 12-year-old children.
Year    Weighted mean DMFT    Countries reporting    Countries with DMFT ≤ 3    Countries with DMFT ≤ 3 (%)
1980    2.43                  —                      —                          —
1985    2.78                  —                      —                          —
2001    1.74                  183                    128                        70%
2004    1.61                  188                    139                        74%
2011    1.67                  189                    148                        78%
2015    1.86                  209                    153                        73%
whereas Brazil and Russia have a higher affected rate among children, and Saudi Arabia and
Peru have a very high affected rate among children. Table 6.1 shows the global DMFT trends
for 12-year-old children. The data are taken from the year 1980 to 2015. Column 2 refers to
the weighted mean DMFT, which equals 2.015. The third column refers to the number of
countries worldwide. The fourth column refers to the number of countries having DMFT
less than or equal to three. The fifth column represents the percentage of countries having
DMFT less than or equal to three.
The major risk behind dental caries is that it increases the probability of chronic disease
[12]. For example, dental infections increase the risk of pneumonia, premature birth, com-
plications in diabetes, etc. Early caries detection and diagnosis reduces the risk factors of
dental caries. This also reduces the time of patients and doctors, along with the treatment
cost. Most of the time, dental caries starts at the occlusal surface of the teeth. The current
caries detection techniques like radiographs, as well as tactile and visual inspection, cannot
help in detecting caries at its early stage. Sometimes discoloration of the pit region could be
misinterpreted as carious lesion [13]. The purpose of new detection methods is to remove
the deficiencies of detection. The methods should satisfy the following criteria. These are
as follows.
1. It should precisely and accurately detect the caries lesion.
2. It should include ideal methods that monitor the whole progress rate of the caries lesion.
3. It should be easy to access as well as cheap.
4. It should be able to detect the caries lesion at every layer, including caries restoration.
Caries detection methods have been broadly classified [4, 9, 14–18] as point methods, based
on visible property of light methods, light emitting methods, OCT, software tools, and radio-
graph methods. Figure 6.5 shows the subcategorization of these methods. The next subsec-
tions explain the details of these methods. This classification involves light and transducer.
It is classified into different groups on the basis of implementation technique and data pat-
tern. In the point method, light absorption and secondary emission technique is used to
determine the mineral concentration in the tooth. This mineral concentration is different
for both caries and the healthy teeth region. This method is suitable for early caries lesion
detection. The methods based on the visible property of light are a kind of imaging technique that determines
the caries lesion according to the quantity of visible light scattered or absorbed. They are capable
of distinguishing different phases of dental caries evolution.
Radiographs are also used as an imaging technique to detect caries lesion. In this tech-
nique, very high frequency light is used for imaging. This high frequency light penetrates
throughout the entire teeth. Caries lesion allows for more amounts of penetration of the
high frequency light. OCT is a high-resolution imaging technique that uses white light for
imaging. White light contains most of the different frequencies of light, so it is capable of
capturing the true color of this region. This white light is noninvasive in nature, not like the
x-ray image. These images are helpful for reconstructing 3D images. Light-emitting devices
are used to measure the depth and area of the caries lesion. This emitting technique can
be used at periodic intervals to determine whether a caries lesion is active or inactive.
Software tools are a kind of hybrid approach, combining sound data, visual change data,
localized enamel breakdown change data, etc., in order to decide the level of the caries
lesion. They are a kind of automation of the techniques used by experienced dentists to
detect the caries lesion.
as a number between 1 and 13; higher numbers refer to deeper lesions. The ECM [27, 36]
method monitors a suspected carious lesion through its electrical resistance behavior. It also
computes the bulk resistance of tooth tissue. Sound enamel is a poor conductor of electricity
and its pore size is small; demineralization from a carious lesion implies a larger pore size.
Measurements can be taken from both enamel and exposed dentin surfaces. Based on the
pore size, dentists make their decisions. The method works either in site mode or in
surface-specific mode and can find the depth of root caries. ECM provides quantitative
data that help monitor the lesion growth rate [24, 37, 38].
6.2.3 Radiographs
Dentists prefer radiographs for caries detection. Accurate radiographs are very important
diagnostic tools for detecting dental caries and other diseases. They are useful for detecting
caries because caries causes tooth demineralization, which can be captured by radiographs very
easily [44–46]. In radiographs, the affected carious lesion appears as darker than the unaf-
fected portion of the tooth. However, early detection of carious lesion using radiographs
is very difficult, especially when they are small in size. Three main types of dental x-rays
[47, 48] are used to detect the dental caries. These are panoramic dental x-ray that shows
a broad view of the jaws, teeth, sinuses, nasal area, infectious lesion, fractures, and dental
caries. The bitewing x-rays show the teeth of upper and lower back where they touch each
other, bone loss if any, and dental infection. Periapical x-rays show the entire tooth. These
x-rays [49, 50] are very helpful to find impacted teeth, tumors, cysts, and bone changes.
Digital subtraction radiographs [51] are a more advanced imaging tool. This method
has been used in the assessment of the progression, arrest, or regression of caries lesion
[52, 53]. This method provides vital information on any changes happening over time.
Therefore, it is suitable for monitoring lesion behavior. However, dental radiographs fail to
measure the carious lesion depth in a tooth
Figure 6.9 (a) (35–40) mm teeth image, (b) QLF teeth image.
well-suited for kids [56]. QLF can easily detect early demineralization and can quantify
the total mineral loss. However, it takes a long time to capture and analyze images. QLF cannot
differentiate between tooth decay and hypoplasia [57, 58]. It is a sensitive, reproducible
method for quantification of lesions up to 400 μm deep. As this method uses secondary
emission, i.e., fluorescence, to detect the health of the teeth, a CCD module is part of the
device. The data pattern of this method is similar to that of the visible light property method.
Figure 6.9 shows the teeth enamel layer under QLF light. Vista Proof and LED technology
[59–61] are also categorized as light-emitting devices. The Vista Proof device is mainly used
to detect occlusal lesions and is sometimes used for enamel caries. The sensitivity of
Vista Proof devices is 86%, which is better than that of the diagnodent pen device. The
design of the Vista Proof and QLF machines is almost the same, and both emit at a wavelength
peak of 405 nm [62, 63]. The changes in the enamel layer can be detected when the
tooth is illuminated by violet blue light. Vista proof is a device based on 6 blue GaN LED
emitting a 405 nm light [64]. With this camera, it is possible to digitize the video signal
from the dental surface during fluorescence emission using a CCD sensor. These images
show different areas of the dental surface that fluoresce in green, and in red for carious
lesions. The software highlights the carious lesions and classifies them on a scale ranging
from 0 to 5 giving a treatment orientation in the first evaluation, monitoring, and invasive
treatment [25, 26, 65]. LED technology is the newest technology to detect the carious
lesion. It consists of one camera with LED technology. This system illuminates the tooth
enamel firstly, and then it records the fluorescence of the dental tissue and enhances the
image. Clinical studies are ongoing to verify its values in application domain [66]. Vista
Proof and LED technology are not much used in dental clinics. Different types of caries
detection technologies are shown in Figure 6.10.
Figure 6.10 (a) FOTI device, (b) diagnodent device, (c) QLF machine, (d) caries detection using
Radiograph.
Code Description
0 Sound
1 First visual change in enamel
2 Distinct visual change in enamel
3 Localized enamel breakdown
4 Underlying dark shadow from dentin
5 Distinct cavity with visible dentin
6 Extensive distinct cavity with visible dentin
Figure 6.11 A caries affected lesion, a 3D view of the same lesion, and its spread into the dentine layer.
nature, and sometimes injurious to our health. There is a lack of a single method that can
detect early caries as well as acute caries at any stage [80]. There is a huge demand for a
low-cost device that can provide a 360∘ tooth-wise view that can help dental practitioners
measure the caries depth in a tooth. Figure 6.11 shows the overview of the required system.
This also minimizes unwanted oral intervention.
Kodak provides a customized dental radiography setup. In this setup, technicians use the RVG
format to store dental radiographic images. As per Versteeg et al., the retake rate of
CCD-based dental radiography such as RVG is much higher than that of traditional film-based
dental radiography. Hence we can conclude that a considerable amount of data error is
present in the RVG system. Another article, by Stavropoulos et al., reaches a similar
conclusion on the accuracy of cone beam dental CT, intraoral digital, and conventional film
radiography for the detection of periapical lesions. Radiography uses x-rays that penetrate
the dental tissues. So, it is very difficult to detect interproximal caries specifically between
overlapped teeth. Dental radiography imaging is a kind of shadow imaging. Hence proper
shape of the region may differ from the actual one when the film or CCD transducer is not
aligned properly. Hence, a retake of the image is very common to optimize the data error
in dental radiography. QLF is also an imaging-based diagnostic method. So image-related
data error challenges are also present in it. In QLF the lesion size is monitored at regular
intervals and the decision is made on the basis of a lesion contour boundary change. Hence
image data registration is one of the challenges to proper diagnosis.
current may affect the result. A band-pass filter is used to clean the data obtained from
the transducer. Next-generation ECM and diagnodent devices will use a camera along with the
method to reduce the overload of monitoring the caries lesion in periodic intervals. Image
scaling is one of the most important parts at the time of monitoring a region. Hence
image registration will be introduced when a camera module will be used with the point
detection methods. This method uses CCD for imaging so that the data volume is huge. An
image compression mechanism may be used, but it may degrade the quality of the region
of interest in the image. It will be advantageous if the ROI is extracted from background
and saved in high resolution. The segmentation of ROI by means of a manual method is
laborious. Hence an automatic method is preferred. Automatic techniques will analyze
the data of the image in terms of features like texture, color, intensity, etc., and then
segment accordingly. Earlier ROI segmentation method segments the image on the basis
of some thresholds that are fixed and given manually. But modern systems will set these
thresholds (parameters) automatically on the basis of training data sets. Figure 6.7 shows
the data features along with its distribution of a teeth region obtained from visible light
property method. As already discussed, this technique is purely based on image. The data
pattern of radiograph is of two types; the older one is an analog type and the newer one
is a digital type. The analog type radiograph contains more detailed information than the
digital type. However, the analog type radiograph is very prone to error and recovery and
error detection is difficult. Due to this reason, digital radiography is preferred. Digital
radiography supports operations used in digital image processing and the operations are
similar to the information the previous sections explained. For monitoring the caries lesion
using radiography, radiographic images are needed to be compared in regular intervals.
Hence image scaling and registration is essential. This method uses different type of
transducers. Some of them provide single dimensional data like sound, two-dimensional
data like images, and some of them use multidimensional data. It is difficult to synchronize
them by hand, thus, an automatic system is preferred for synchronization. Advanced
soft computing techniques are also used to synchronize the data and assist to make a
decision. Diagnodent senses the intensity level of secondary emissions from the teeth. This
secondary emission varies from person to person so calibration is essential for each patient.
It is a time-consuming process. This method senses the intensity of light, so that ambient
light from the environment may change the value of the data. This is another challenge
in preserving the property of data. The ECM method measures the cumulative impedance
change in the teeth region and from the variation of cumulative impedance it determines
the health of the affected teeth. This method is just like the diagnodent, where the external
ambient light will not affect the system, but the moisture within the mouth may affect the
reading. The saliva layer on the teeth may change the cumulative impedance, which causes
an error in data. At the time of this test, maintaining standard dryness within the mouth
is essential to reduce the data error. FOTI uses ambient light to determine the amount of
light scattered or absorption rate from the teeth surface. After that the data is analyzed
to determine the teeth health. The saliva layer on the teeth as well as the structure of the
dental tissue may vary from person to person. Hence calibration is also essential for each
subject in order to set the reference value. At the time of calibration, a healthy tooth is
required. The selection of a healthy tooth depends on the experience of the operators.
Wrong selection leads to wrong data acquisition and an erroneous data analysis report.
The effectiveness of such existing methods depends on metrics like specificity and
sensitivity. Specificity is the true negative rate, whereas sensitivity is the true positive
rate. Figure 6.12 shows the lesion prevalence, sensitivity, and specificity for occlusal
surfaces. These data are based on an excellent systematic review by Bader et al. [3], in which
the included studies are restricted in terms of diagnostic methods, especially for primary
teeth, anterior teeth, root surfaces, visual/tactile, and FOTI methods, and the measurements
are limited to histological validation. Figure 6.13 shows the lesion prevalence, sensitivity,
and specificity [53] for proximal surfaces. These graphs show that while specificity is
sufficient, the sensitivity of the traditional methods is much lower.
6.5 Conclusion
Dental caries is a very common infectious disease among adults, as well as children. Hence
caries detection at its early stage is very crucial. Several methods are available to detect
caries. Among them, as discussed in this review, some are highly expensive, injurious to
health, fail to detect caries in its early stage, and are partially invasive in nature. Detection
and treatment of caries-affected lesions are not confined to the boundary of normal disease
detection and treatment methods; they also include the desire for beauty, overall satisfaction
with one's appearance, and the ability to spend the same amount of money on a less painful
treatment. Nowadays, carious lesion detection and treatment have become cosmetic services
with different pricing packages. Hence the selection of diagnosis and treatment packages
differs, and many factors and constraints need to be considered here. To address these issues,
a mathematical model has been proposed below. The
budget is the maximum amount of money that can be spent for the carious lesion treatment.
A mathematical model on the selection procedure of caries treatment has been constructed
based on the experiences of dentists and by use of a survey in urban and semi-urban areas
near Kolkata. The variable names are self-explanatory.
\text{DiagnosticCost} + \text{TreatmentCost} \geq \frac{\text{Budget}}{\text{QualityCompromizeFactor}}   (6.3)
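A tiny Python check of constraint (6.3); all cost, budget, and quality values below are invented for illustration.

# illustrative values only; the variable names follow Eq. (6.3)
diagnostic_cost = 40.0
treatment_cost = 160.0
budget = 300.0
quality_compromise_factor = 1.5

feasible = diagnostic_cost + treatment_cost >= budget / quality_compromise_factor
print(feasible)   # True: this treatment package satisfies constraint (6.3)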
This mathematical model helps dentists and patients to choose better treatment with lower
cost. To prevent caries in its early stage, not only is primary awareness sufficient, but also
it is more important to detect actual caries-affected lesions. It is also important to establish
a system that can identify and measure various advanced and early stages of caries lesion
detection. Though there are many new and good caries detection methods available, there
is a huge gap. Hence there is need for a complete system that can identify the caries lesion
in its early and advanced stages. In the next phase of our research work we are planning to
develop such a system that can detect the affected carious lesion and extract the features of
it to assist dentists. The system will be primarily based on image processing techniques. It
will be safe for humans and will be a low-cost device. The system can also be used in remote
areas by the use of telemedicine technology.
Acknowledgment
Funding was provided by the Visvesvaraya fellowship for Ph.D. scheme under Ministry of
Electronics and Information Technology of the Government of India.
References
11 Smejkalova, J., Jacob, V., Hodacova, L. et al. (2012). The influence of smoking on den-
tal and periodontal status. In: Oral Health Care-Pediatric, Research, Epidemiology and
Clinical Practices (ed. M. Virdi). InTech.
12 Petersen, P.E., Bourgeois, D., Ogawa, H. et al. (2005). The global burden of oral diseases
and risks to oral health. Bulletin of the World Health Organization 83: 661–669.
13 Prabhakar, N., Kiran, K., and Kala, M. (2011). A review of modern noninvasive methods
for caries diagnosis. Archives of Oral Sciences & Research 1: 168–177.
14 Cortes, D., Ellwood, R., and Ekstrand, K. (2003). An in vitro comparison of a combined
foti/visual examination of occlusal caries with other caries diagnostic methods and the
effect of stain on their diagnostic performance. Caries Research 37: 8–16.
15 de Gonzalez, A.B. and Darby, S. (2004). Risk of cancer from diagnostic x-rays: estimates
for the UK and 14 other countries. The Lancet 363: 345–351.
16 de Jong, E.D.J. and Stöber, L. (2003). Quantitative light-induced fluorescence (QLF) – a
potential method for the dental practitioner. Quintessence International 34 (3).
17 Ismail, A., Sohn, W., Tellez, M. et al. (2007). The international caries detection and
assessment system (ICDAS): an integrated system for measuring dental caries. Commu-
nity Dentistry and Oral Epidemiology 35: 170–178.
18 Selwitz, R.H., Ismail, A.I., and Pitts, N.B. (2007). Dental caries. The Lancet 369: 51–59.
19 Drexler, W. and Fujimoto, J.G. (2008). Optical Coherence Tomography: Technology and
Applications. New York: Springer Science & Business Media.
20 Ely, E.W., Stump, T.E., Hudspeth, A.S., and Haponik, E.F. (1993). Thoracic complica-
tions of dental surgical procedures: hazards of the dental drill. The American Journal of
Medicine 95: 456–465.
21 Haruyama, O., Kimura, H., Nishiyama, N. et al. (2001). The anomalous behavior of elec-
trical resistance for some metallic glasses examined in several gas atmospheres or in a
vacuum. In: Amorphous and Nanocrystalline Materials (ed. M.E. McHenry), 69–86. New
York: Springer.
22 H. Nathel, J. H. Kinney, & L. L. Otis, (1996) Method for detection of dental caries and
periodontal disease using optical imaging, Oct. 29 1996. US Patent 5,570,182.
23 Carounanidy, U. and Sathyanarayanan, R. (2009). Dental caries: a complete changeover
(part ii)-changeover in the diagnosis and prognosis. Journal of Conservative Dentistry:
JCD 12: 87.
24 Ghosh, P., Bhattacharjee, D., and Nasipuri, M. (2017). Automatic system for plasmod-
ium species identification from microscopic images of blood-smear samples. Journal of
Healthcare Informatics Research 1: 231–259.
25 Gimenez, T., Braga, M.M., Raggio, D.P. et al. (2013). Fluorescence-based methods for
detecting caries lesions: systematic review, meta-analysis and sources of heterogeneity.
PLoS One 8: e60421.
26 Guerrieri, A., Gaucher, C., Bonte, E., and Lasfargues, J. (2012). Minimal intervention
dentistry: part 4. Detection and diagnosis of initial caries lesions. British Dental Journal
213: 551.
27 Zandoná, A.F. and Zero, D.T. (2006). Diagnostic tools for early caries detection. The
Journal of the American Dental Association 137: 1675–1684.
28 Lussi, A., Hibst, R., and Paulus, R. (2004). Diagnodent: an optical method for caries
detection. Journal of Dental Research 83: 80–83.
29 Zeitouny, M., Feghali, M., Nasr, A. et al. (2014). Soprolife system: an accurate diagnostic
enhancer. The Scientific World Journal 2014: 651–667.
30 Harris, R., Nicoll, A.D., Adair, P.M., and Pine, C.M. (2004). Risk factors for dental caries
in young children: a systematic review of the literature. Community Dental Health 21:
71–85.
31 Whitters, C., Strang, R., Brown, D. et al. (1999). Dental materials: 1997 literature review.
Journal of Dentistry 27: 401–435.
32 Anttonen, V., Seppä, L., and Hausen, H. (2003). Clinical study of the use of the laser
fluorescence device diagnodent for detection of occlusal caries in children. Caries
Research 37: 17–23.
33 Bamzahim, M., Shi, X.-Q., and Angmar-Mansson, B. (2002). Occlusal caries detection
and quantification by diagnodent and electronic caries monitor: in vitro comparison.
Acta Odontologica Scandinavica 60: 360–364.
34 Sheehy, E., Brailsford, S., Kidd, E. et al. (2001). Comparison between visual examination
and a laser fluorescence system for in vivo diagnosis of occlusal caries. Caries Research
35: 421–426.
35 Shi, X., Tranaeus, S., and Angmar-Mansson, B. (2001). Comparison of qlf and diagn-
odent for quantification of smooth surface caries. Caries Research 35: 21.
36 Pretty, I. and Ellwood, R. (2013). The caries continuum: opportunities to detect, treat
and monitor the re-mineralization of early caries lesions. Journal of Dentistry 41:
S12–S21.
37 Ashley, P., Blinkhorn, A., and Davies, R. (1998). Occlusal caries diagnosis: an in vitro
histological validation of the electronic caries monitor (ECM) and other methods. Jour-
nal of Dentistry 26: 83–88.
38 Stookey, G.K. and González-Cabezas, C. (2001). Emerging methods of caries diagnosis.
Journal of Dental Education 65: 1001–1006.
39 Pitts, N.B. (2001). Clinical diagnosis of dental caries: a european perspective. Journal of
Dental Education 65: 972–978.
40 Tranaeus, S., Shi, X.-Q., and Angmar-Mansson, B. (2005). Caries risk assessment:
methods available to clinicians for caries detection. Community Dentistry and Oral
Epidemiology 33: 265–273.
41 Yu, J., Tang, R., Feng, L., and Dong, Y. (2017). Digital imaging fiber optic transillumi-
nation (DIFOTI) method for determining the depth of cavity. Beijing da xue xue bao. Yi
xue ban Journal of Peking University. Health sciences 49: 81–85.
42 C. Gutierrez, DIFOTI (Digital Fiberoptic Transillumination): Validity at In Vitro, PhD
thesis, Dissertation Zum Erwerb des Doktorgrades der Zahnheilkunde an der Medizinis-
chen Fakuly at der Ludwig- Maximilians-University zu Munchen, 2008.
43 Attrill, D. and Ashley, P. (2001). Diagnostics: Occlusal caries detection in primary
teeth: a comparison of diagnodent with conventional methods. British Dental Journal
190: 440.
44 Pitts, N. (1996). The use of bitewing radiographs in the management of dental caries:
scientific and practical considerations. Dentomaxillo-facial Radiology 25: 5–16.
45 Wenzel, A. (1998). Digital radiography and caries diagnosis. Dentomaxillofacial
Radiology 27: 3–11.
46 Wenzel, A., Larsen, M., and Feierskov, O. (1991). Detection of occlusal caries with-
out cavitation by visual inspection, film radiographs, xeroradiographs, and digitized
radiographs. Caries Research 25: 365–371.
47 E. Hausmann, D. Wobschall, L. Ortman, E. Kutlubay, K. Allen, & D. Odrobina, (1997)
Intraoral radiograph alignment device, May 13 1997. US Patent 5,629,972.28
48 Wenzel, A., Hintze, H., Mikkelsen, L., and Mouyen, F. (1991). Radiographic detection
of occlusal caries in noncavitated teeth: a comparison of conventional film radiographs,
digitized film radiographs, and radiovisiography. Oral Surgery, Oral Medicine, Oral
Pathology 72: 621–626.
49 Patel, S., Dawood, A., Wilson, R. et al. (2009). The detection and management of root
resorption lesions using intraoral radiography and cone beam computed tomography–an
in vivo investigation. International Endodontic Journal 42: 831–838.
50 Preston-Martin, S., Thomas, D.C., White, S.C., and Cohen, D. (1988). Prior exposure to
medical and dental x-rays related to tumors of the parotid gland1. JNCI: Journal of the
National Cancer Institute 80: 943–949.
51 Gröndahl, H.-G. and Huumonen, S. (2004). Radiographic manifestations of periapi-
cal inflammatory lesions: how new radiological techniques may improve endodontic
diagnosis and treatment planning. Endodontic Topics 8: 55–67.
52 Datta, S., Chaki, N., and Modak, B. (2019). A novel technique to detect caries lesion
using Isophote concepts. IRBM 40 (3): 174–182.
53 Wenzel, A. (2004). Bitewing and digital bitewing radiography for detection of caries
lesions. Journal of Dental Research 83: 72–75.
54 Coulthwaite, L., Pretty, I.A., Smith, P.W. et al. (2006). The microbiological origin of
fluorescence observed in plaque on dentures during QLF analysis. Caries Research 40:
112–116.
55 Pretty, I.A., Edgar, W.M., and Higham, S.M. (2002). Detection of in vitro demineraliza-
tion of primary teeth using quantitative light-induced fluorescence (QLF). International
Journal of Paediatric Dentistry 12: 158–167.
56 Pretty, I.A., Edgar, W.M., and Higham, S.M. (2002). The effect of ambient light on qlf
analyses. Journal of Oral Rehabilitation 29: 369–373.
57 Pretty, I., Edgar, W., and Higham, S. (2001). Aesthetic dentistry: the use of qlf to quan-
tify in vitro whitening in a product testing model. British Dental Journal 191: 566.
58 Yin, W., Hu, D., Li, X. et al. (2013). The anti-caries efficacy of a dentifrice containing
1.5% arginine and 1450 ppm fluoride as sodium monofluorophosphate assessed using
quantitative light-induced fluorescence (QLF). Journal of Dentistry 41: S22–S28.
59 Curzon, P. (1994). The Verified Compilation of Vista Programs, in 1st ProCoS Working
Group Meeting. Denmark, Citeseer: Gentofte.
60 Raggio, D.P., Braga, M.M., Rodrigues, J.A. et al. (2010). Reliability and discriminatory
power of methods for dental plaque quantification. Journal of Applied Oral Science 18:
186–193.
61 Ruckhofer, E. and Städtler, P. (2010). Ist vista proof eine hilfe bei der kariesdiagnostik?
Stomatologie 107: 13–16.
62 Presoto, C.D., Trevisan, T.C., Andrade, M.C.D. et al. (2017). Clinical effectiveness of
uorescence, digital images and icdas for detecting occlusal caries. Revista de Odontologia
da UNESP 46: 109–115.
63 Rodrigues, J., Hug, I., Diniz, M., and Lussi, A. (2008). Performance of fluorescence
methods, radiographic examination and ICDAS II on occlusal surfaces in vitro. Caries
Research 42: 297–304.
64 Jablonski-Momeni, A., Heinzel-Gutenbrunner, M., and Klein, S.M.C. (2014). In vivo per-
formance of the vistaproof fluorescence based camera for detection of occlusal lesions.
Clinical Oral Investigations 18: 1757–1762.
65 Jablonski-Momeni, A., Heinzel-Gutenbrunner, M., and Vill, G. (2016). Use of a
fluorescence-based camera for monitoring occlusal surfaces of primary and permanent
teeth. International Journal of Paediatric Dentistry 26: 448–456.
66 Guerra, F., Corridore, D., Mazur, M. et al. (2016). Early caries detection: comparison of
two procedures. A pilot study. Senses and Sciences 3: 317–322.
67 Hsieh, Y.-S., Ho, Y.-C., Lee, S.-Y. et al. (2013). Dental optical coherence tomography.
Sensors 13: 8928–8949.
68 Otis, L.L., Everett, M.J., Sathyam, U.S., and Colston, B.W. Jr., (2000). Optical coher-
ence tomography: a new imaging: technology for dentistry. The Journal of the American
Dental Association 131: 511–514.
69 Sun, C.-W., Ho, Y.-C., and Lee, S.-Y. (2015). Sensing of tooth microleakage based on
dental optical coherence tomography. Journal of Sensors 2015: 120–132.
70 Wang, X.-J., Milner, T.E., De Boer, J.F. et al. (1999). Characterization of
dentin and enamel by use of optical coherence tomography. Applied Optics 38:
2092–2096.
71 Alammari, M., Smith, P., De Jong, E.D.J., and Higham, S. (2013). Quantitative
light-induced fluorescence (QLF): a tool for early occlusal dental caries detection and
supporting decision making in vivo. Journal of Dentistry 41: 127–132.
72 Pitts, N., Ekstrand, K., and I. Foundation (2013). International caries detection and
assessment system (ICDAS) and its international caries classification and management
system (ICCMS)–methods for staging of the caries process and enabling dentists to
manage caries. Community Dentistry and Oral Epidemiology 41: e41–e52.
73 Nardini, M., Ridder, I.S., Rozeboom, H.J. et al. (1999). The x-ray structure of epoxide
hydrolase from agrobacterium radiobacter ad1 an enzyme to detoxify harmful epoxides.
Journal of Biological Chemistry 274: 14579–14586.
74 Mah, J.K., Huang, J.C., and Choo, H. (2010). Practical applications of cone-beam com-
puted tomography in orthodontics. The Journal of the American Dental Association 141:
7S–13S.
75 Sherrard, J.F., Rossouw, P.E., Benson, B.W. et al. (2010). Accuracy and reliability of
tooth and root lengths measured on cone-beam computed tomographs. American Jour-
nal of Orthodontics and Dentofacial Orthopedics 137: S100–S108.
76 Lascala, C., Panella, J., and Marques, M. (2004). Analysis of the accuracy of linear
measurements obtained by cone beam computed tomography (CBCT-newtom). Den-
tomaxillofacial Radiology 33: 291–294.
77 Sukovic, P. (2003). Cone beam computed tomography in craniofacial imaging. Orthodon-
tics & Craniofacial Research 6: 31–36.
78 Misch, K.A., Yi, E.S., and Sarment, D.P. (2006). Accuracy of cone beam computed
tomography for periodontal defect measurements. Journal of Periodontology 77:
1261–1266.
79 Cogswell, W.W. (1942). Surgical problems involving the mandibular nerve. The Journal
of the American Dental Association 29: 964–969.
80 Hollander, F. and Dunning, J.M. (1939). A study by age and sex of the incidence of den-
tal caries in over 12,000 persons. Journal of Dental Research 18: 43–60.
81 Al-Ansari, A.A. et al. (2014). Prevalence, severity, and secular trends of dental caries
among various saudi populations: a literature review. Saudi Journal of Medicine and
Medical Sciences 2: 142.
82 da Silveira Moreira, R. (2012). Epidemiology of dental caries in the world. In: Oral
Health Care-Pediatric, Research, Epidemiology and Clinical Practices. InTech.
83 Petersen, E.P. (2009). Oral health in the developing world, World Health Organization
global oral health program chronic disease and health promotion. Geneva:. Community
Dentistry and Oral Epidemiology 58 (3): 115–121.
7.1 Introduction
Intelligent data analysis (IDA) is a field of artificial intelligence (AI). With the help of IDA,
meaningful information is discovered from a large amount of data. This field includes pre-
processing of data, data analysis, mining techniques, machine learning, neural networks,
etc. Statistical techniques [1] are implemented for data analysis. The types of statistics are
as follows: (i) descriptive statistics and (ii) inferential statistics. In descriptive statistics,
assumptions are not used for summarizing data, whereas, in inferential statistics, conclu-
sions are drawn based on certain assumptions.
IDA has emerged as a combination of many fields: (i) statistics, (ii) AI, (iii) pattern recog-
nition, and (iv) machine learning [2]. IDA in the educational field can be achieved with the
help of educational data mining and learning analytics [3], which will become a milestone
in the field of education and a teacher-taught paradigm.
With the help of these fields, students data can be analyzed in an intelligent manner
to achieve the following: (i) academic analytics, (ii) course redesigning, (iii) implement-
ing learning management systems, (iv) interactive sessions between instructors and stu-
dents, (v) implementing new patterns of students assessments, (vi) improving pedagogies,
(vii) predicting students dropouts, and (viii) statistical reports of the learners.
The author in [4] has quoted the definition of Learning Analytics by George Siemens.
According to George Siemens, “Learning analytics is the use of intelligent data,
learner-produced data, and analysis models to discover information and social connections
for predicting and advising people’s learning.”
Data mining is the most striking area coming to the fore in modern technologies for fetch-
ing meaningful information from a large volume of data that is unstructured with lots of
uncertainty and inconsistencies. There is stupendous leverage to the educational sector in
using data mining techniques to inspect data from students, assessments, latest scholastic
patterns, etc. This proves to provide a quality education as well as decision-making advice
for students in order to escalate their career prospects and select the right course for train-
ing to fulfill the skill gap that exists between primary education and the industry that will
hire them. Data mining has a great impact on scholastic systems where education is mea-
sured as the primary input for informative evolution. In such a scenario where data is being
generated at an alarming rate from many sources like media files, data from social media
websites, e-mails, Google searches, instant messaging, mobile users, the internet of
things (IoT), etc., data grows enormously and may not fit on a single node, which is
why the term "big data" was coined. It is a term for data sets that are so large in volume
or so complex that it becomes arduous to manage them with customary data processing
application software.
Using big data techniques, unstructured data are gathered and analyzed to reveal informative
data for operations. This includes gathering data for storage and analysis and gaining
control over operations such as experimentation, fact-finding, allocation, data visualization,
updating, and maintaining the confidentiality of information.
The problem of enormously large data sets can be solved by Hadoop using the MapReduce
model, which works on a Hadoop layer and allows parallel processing of the data stored in
the Hadoop distributed file system, into which any kind of data can be dumped across the
cluster. MapReduce tasks run over Hadoop clusters by splitting the big data, i.e., the input
file, into small pieces and processing the data on parallel distributed clusters. It is
an open-source programming prototype that performs parallel processing of applications
on clusters; with the distribution of data, computation becomes faster. A MapReduce
programming framework executes its operations in three stages, i.e., the map phase, the
shuffle phase, and the reduce phase.
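To make the three stages concrete, the following minimal sketch simulates the map, shuffle, and reduce phases in plain Python on a few invented student course-preference records; it illustrates the programming model only and is not an actual Hadoop job, and all course names and counts are hypothetical.

from collections import defaultdict

# Hypothetical input splits: each record lists the courses one student prefers.
splits = [
    ["Web Technology", "Application Programming"],
    ["Mobile Computing", "Web Technology"],
    ["Machine Learning", "Artificial Intelligence", "Application Programming"],
]

def map_phase(record):
    # Emit a <course, 1> key-value pair for every course in the record.
    return [(course, 1) for course in record]

def shuffle_phase(pairs):
    # Group values by key, as the framework does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the values for each key to obtain the final course counts.
    return {key: sum(values) for key, values in grouped.items()}

mapped = [pair for record in splits for pair in map_phase(record)]
print(reduce_phase(shuffle_phase(mapped)))
# e.g. {'Web Technology': 2, 'Application Programming': 2, 'Mobile Computing': 1, ...}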
In data mining, association rule learning is a method for discovering interesting
relations among variables in large databases [5]. A big data approach using associa-
tion rule mining can help colleges, institutions, and universities get a comprehensive
perspective of their students. It gives institutions compelling reasons to
put in concentrated efforts to enhance the quality of education by improving
the skill sets of potential students, keeping them intensely focused, matching their
skill sets with the demands of the market and/or society, and making them ready for
future challenges. It helps in answering questions related to (i) learning behaviors,
(ii) understanding patterns of students, (iii) curriculum trends, and (iv) selection of
courses to help create captivating learning experiences for students. In this chapter,
the authors have, for experimental exploration, synthesized a data set based upon the
preferences of students so that MapReduce and Apriori algorithms can be applied
in order to derive the appropriate rules for helping students make well-informed
decisions.
Figure 7.1 Artificial intelligence and its subsets using intelligent data analytics. Artificial intelligence (AI) includes techniques that enable a computer to mimic human behavior, i.e., the development of intelligent machines using human intelligence; machine learning (ML) is a subset of AI techniques that uses statistical methods to enable machines to improve with experience through self-learning.
There are two types of analytics: learning analytics and academic analytics. Learning analytics
focuses on consistent improvement in learning efficiency from primary to higher
studies. Learning analytics facilitates academicians, learners, and management authorities
in improving the classroom by using course-level activities from the intelligent analysis
of the data generated, whereas in academic analytics there is a comparative analysis of the
quality and standard norms followed while implementing educational policies in universities
and colleges at national and international levels.
Apart from this, there is an analysis of learner abilities with respect to traditional and modern
educational systems, which results in effective decision-making by policymakers when
funding organizations [6]. The sample techniques for the analytics engine are shown in
Figure 7.3.
7.2 Learning Analytics in Education
Figure 7.3 depicts the analytics engine, which links parents/guardians, educators/institutes, children/wards, and publishers through mobile and web-enabled communication (assignments, homework, holiday schedules, and other exchanges) and combines semantic content, an intelligent curriculum, a recommender, a learner profile, an adaptation and personalization engine, and an intervention engine supporting both automated and human intervention.
7.3 Motivation
In the present scenario, our educational system faces threats related to quality-driven
education, a lack of skills in students as per industry demand, and problems in effective
decision-making due to the complexity of educational patterns.
The contribution of the Hadoop MapReduce framework and association rule mining to
educational data mining has not been fully explored and remains underutilized for mining
meaningful information from the huge volumes of unstructured data accumulated in the form
of complex data sets, educational entities, and patterns.
With big data solutions, comprehensive aspects of the student are obtained, which
help colleges, institutes, and organizations to improve education quality and graduate
skilled workers. Many questions remain unanswered that pertain to (i) learning observance,
(ii) perceptive learning, (iii) curriculum trends, (iv) syllabi, and (v) future courses for
students, and these are the focus of this chapter.
7.4 Literature Review
There are web-enabled tools [19] that help in evaluating students' activities online.
The activities are as follows: (i) the time devoted by students to reading online, (ii) the
utilization of electronic resources, and (iii) how fast students understand online study material.
In [20], there are some pedagogical approaches that are effective with students. The
approaches are as follows:
● Prediction of students dropout rate
● Students who require extra help
● Students who require assignments, mock tests, etc.
● Immediate feedback to teachers about the academic performance of students
● Tailor-made assignments for individual students
● Course design
● Grading of students
● Visualizations through dashboard for tracking student performance
Data mining and learning analytics fields have the potential for improved research, evalu-
ation, and accountability through data mining, data analytics, and dashboards [21].
The authors have stressed the role of big data and analytics in the shaping of higher edu-
cation, which involves ubiquitous computing, smart classrooms, and innovation in smart
devices for providing education [22]. Big data and learning analytics pave the path for guid-
ing reform activities in higher education and guide educators for continuous improvements
in teaching. In [23], the authors have proposed efficient implementations
of the Apriori algorithm in the MapReduce framework. The MapReduce framework performs
mining using map and reduce functions on large data sets, terabytes in size.
Association rule mining is a mining technique that finds frequent itemsets in a database
with minimum support and confidence constraints [24].
Using the Apriori algorithm and MapReduce technique of Hadoop, frequently occurring
itemsets in a data set can be identified [25].
Katrina Sin and Loganathan Muthu have stressed the need for educational data mining in
the present setting of education, where a large amount of data has accumulated from massive
open online courses (MOOCs) [26].
There are some open-source tools of data mining like MongoDB and Apache
Hadoop [27].
Patil and Praveen Kumar have performed the classification of data using MapReduce pro-
gramming and have proposed a mining model for effective data analysis related to students
pursuing higher education [28].
In a MapReduce framework, there is parallel processing of clusters for managing the huge
volume of data [29].
MapReduce scales to a large array of machines to solve large computational
problems [30].
Provost and Fawcett [31] have discussed the relationship of data science to big
data and their significance in decision-making. A single-node computer is inadequate when
the storage of a large volume of data and the processing demand increase beyond the capacity
of the node. In such a situation, distributing the data and parallelizing the computations over
a network of machines is the preferable solution for faster computational processing and for
saving storage space [32].
7.5 Intelligent Data Analytical Tools
Enrique Garcia et al. have discussed an educational data mining tool based on association
rule mining. This tool helps improve e-learning courses and allows teachers to analyze
and discover hidden information from the interaction between the students and the
e-learning courses [33]. Other authors have presented an association rule mining technique for
assessing student data; a student's performance can be analyzed using association rule
mining [34].
Figure 7.6 Decision tree generated for the data set [37], with internal nodes on GPA, attendance, technical skills, and quantitative reasoning (average/good), and leaf nodes labeled FIRST, SECOND, THIRD, and FAIL.
The attributes considered are technical skills, quantitative reasoning, attendance, placement
skills, and GPA, and students are assessed on their performance in these attributes. The
synthesized educational data set of students is shown in Table 7.2. Using the K-means
clustering technique on this data set in MATLAB has generated the following two clusters:
(i) students who are short of attendance and (ii) students who have performed poorly in tests.
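As an illustration of this clustering step, the following sketch runs two-cluster K-means in Python with scikit-learn (the experiment described above used MATLAB); the attribute columns and values are invented for illustration only.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical student records: attendance (%), sessional marks, class performance.
students = np.array([
    [89, 78, 79],
    [88, 79, 80],
    [76, 71, 70],
    [55, 52, 79],
    [45, 52, 50],
    [62, 48, 51],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(students)
print(kmeans.labels_)           # cluster assignment for each student
print(kmeans.cluster_centers_)  # centroids, e.g. a low-attendance group and a low-marks group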
Figure 7.8 (a) Visualization of student attributes (K = 2) [38]. (b) Visualization of student attributes (K = 3) [38]. The axes show attendance, sessional, and class performance values with the cluster centroids.
After applying the preprocessing and data mining models to the data set, Figure 7.8a
shows the clustering of students for K = 2, and Figure 7.8b shows the clustering of students
for K = 3.
KEEL (http://www.keel.es/) stands for knowledge extraction based on evolutionary learning.
It is Java-based software that supports the following:
a) Knowledge extraction algorithms
b) Preprocessing of data
c) Intelligence-based learning
d) Statistical methodologies
7.6 Intelligent Data Analytics Using MapReduce Framework in an Educational Domain
Class id   Sessional   ASSGN   ATTD   CLR   CLP   LW
1          78          79      89     9     23    79
2          79          80      89     10    22    81
3          71          70      76     10    23    71
4          52          79      55     4     10    80
5          52          50      45     3     12    51
⋮          ⋮           ⋮       ⋮      ⋮     ⋮     ⋮
100        78          80      78     9     20    79
SPSS stands for Statistical Package for the Social Sciences and is developed by IBM. SPSS is
mainly used for statistical analysis. In SPSS, we can import data from Excel spreadsheets
for generating regression models, predictive analytics, and data mining on educational
data sets.
Currently, the proposed work is limited to the computer sciences only. In real life, the
proposed system shall include all areas of study at the secondary and tertiary levels of
education, in the formal as well as the informal education sectors. The output obtained using
the proposed methodology can then be tested on such practical applications, which may run
into terabytes of data.
7.6.2 Objective
The objective here is to predict the interest of a student in training course(s) from various
available combinations for training from the data set.
Figure 7.9 shows the MapReduce workflow for the synthesized data set: the input file is split and mapped on the data nodes into <course, count> key-value pairs (e.g., WT, MC, ML, AI, AP), the name node and job tracker coordinate the parallel tasks, the shuffling stage consolidates values by key, and the reducing stage produces the output file (Mobile Computing: 2, Web Technology: 3, Machine Learning: 1, Artificial Intelligence: 1, Application Programming: 3).
The equations for support and confidence of the rules are shown in Equations 7.1 and 7.2.
Here s and c denote the support and confidence, whereas p denotes the probability of
occurrence of items in the database.
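For a rule X ⇒ Y over a transaction database, the standard definitions of support and confidence that match this description, and that are assumed here to correspond to Equations 7.1 and 7.2, can be written in LaTeX notation as:

s(X \Rightarrow Y) = p(X \cup Y)                             % (7.1) fraction of records containing both X and Y
c(X \Rightarrow Y) = \frac{p(X \cup Y)}{p(X)} = p(Y \mid X)  % (7.2) conditional probability of Y given X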
7.7 Results
The results shown in Figure 7.10 depict that the maximum number of students have shown a
key interest in machine learning and AI in comparison to the other programs, i.e., application
programming, web technology, and mobile computing. This helps management as well as faculty
members to induct such specialized courses into the course curriculum and to conduct in-house
training programs and workshops for the benefit of students.
Table 7.4 shows that the maximum number of students have opted for specialization in AI;
therefore, these students may opt for {Python, C++}-like technologies, whereas in application
programming, students may opt for Java, DotNet, and PHP-like languages. In mobile
computing, 26 students have shown interest; therefore, Android is the right program for
training.

Specialized courses <N,N>            Count <N>
Artificial Intelligence <AI,39>      39
Application Programming <AP,28>      28
Mobile Computing <MC,26>             26
Machine Learning <ML,33>             33
Web Technology <WT,28>               28
⋮                                    ⋮
From Figure 7.9, it can also be seen that, after mapping, the shuffling task consolidated
the relevant records obtained from the mapping phase, and the summarized output was obtained
through the reducing phase. After applying the Apriori algorithm to the output obtained
in Figure 7.10, the best rules are displayed in Table 7.5. This table shows the preferable
technologies for students to opt for and thus enhance their skills.
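The sketch below shows, in plain self-contained Python, how Apriori-style frequent itemsets and rules with minimum support and confidence can be derived from such output; the transactions, thresholds, and technology names are hypothetical and do not reproduce the authors' data set or the rules of Table 7.5.

from itertools import combinations

transactions = [
    {"AI", "Python", "C++"},
    {"AI", "Python"},
    {"AP", "Java", "PHP"},
    {"ML", "Python"},
    {"MC", "Android"},
    {"AI", "Python", "C++"},
]
min_support, min_confidence = 0.3, 0.6
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / n

# Frequent 1-itemsets, then candidate 2-itemsets built only from them (Apriori pruning).
items = {i for t in transactions for i in t}
frequent_1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
frequent_2 = [a | b for a, b in combinations(frequent_1, 2) if support(a | b) >= min_support]

# Rules X -> Y from each frequent pair, kept if confidence = s(X U Y) / s(X) is high enough.
for itemset in frequent_2:
    for antecedent in itemset:
        X, Y = frozenset([antecedent]), itemset - {antecedent}
        confidence = support(itemset) / support(X)
        if confidence >= min_confidence:
            print(set(X), "->", set(Y),
                  "support=%.2f confidence=%.2f" % (support(itemset), confidence))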
In this chapter, the authors have discussed in detail IDA and its role in the educational
environment. IDA tools have also been discussed in the chapter.
The authors have synthesized an experimental data set that consists of courses related to
the field of ICT and their attributes. The data set is processed through the proposed method-
ology of the MapReduce algorithm in the first stage and the Apriori algorithm in the second
stage. The results and their analysis show that MapReduce and association rule mining can
provide students the career counseling support that strengthens their decision-making to
opt for the right course(s) for training activities as per industry requirements. Here, the
experimentation has been limited only to the internship/training requirements of computer
science engineering and information technology fields. In future, the authors intend to
involve all the branches of engineering and technology in the first phase, other professional
courses in the second phase, and lastly, a generic career counseling system with necessary
appropriate enhancements in the third phase. Using not only Hadoop and Apriori algo-
rithms, but with the inclusion of some machine learning/AI techniques, more meaningful
information and academic patterns can be retrieved from relevant data sets on a larger scale
in the future. The results so obtained are surely going to help educational institutions to find
answers to some of the yet unanswered questions.
All these propositions have at the center of their focus the improvement of the quality of
education and of the employment prospects of the students. The proposed methodology is going to
be the pivotal point in designing and implementing such support systems that will facilitate
intelligent decision-making by parents, teachers, and mentors related to the careers of their
children/wards/students and the strengthening of in-house training programs.
References
1 Berthold, M.R., Borgelt, C., Höppner, F., and Klawonn, F. (2010). Guide to intelligent
data analysis: how to intelligently make sense of real data. Texts in Computer Science 42
https://doi.org/10.1007/978-1-84882-260-3_1, Berlin, Heidelberg: Springer-Verlag.
2 Hand, D.J. (1997). Intelligent data analysis: issues and opportunities. In: Advances in
Intelligent Data Analysis Reasoning about Data, IDA 1997. Lecture Notes in Computer
Science, vol. 1280. Berlin, Heidelberg: Springer.
3 Baepler, P. and Murdoch, C.J. (2010). Academic analytics and data mining in higher
education. International Journal for the Scholarship of Teaching and Learning 4 (2)
Article 17.
4 Scapin, R. (2015). "Learning Analytics in Education: Using Student’s Big Data to Improve
Teaching", IT Rep Meeting – April 23rd.
5 Liu, B. and Wong, C.K. (2000). Improving an association rule based classifier. Journal In
Principles of Data Mining and Knowledge Discovery: 504–509.
6 Siemens, G., Gasevic, D., Haythornthwaite, C., et al. Open Learning Analytics: an inte-
grated & modularized platform. Doctoral dissertation, Open University Press. https://
solaresearch.org/wp-content/uploads/2011/12/OpenLearningAnalytics.pdf
7 Gašević, D., Kovanović, V., Joksimović, S., and Siemens, G. (2014). Where is research on
massive open online courses headed? A data analysis of the MOOC research initiative.
The International Review of Research in Open and Distributed Learning 15 (5).
8 Gasevic, D., Rose, C., Siemens, G. et al. (2014). Learning analytics and machine learn-
ing. In: Proceedings of the Fourth International Conference on Learning Analytics and
Knowledge, 287–288. New York: ACM https://doi.org/10.1145/2567574.2567633.
9 Clow, D. (2013). An overview of learning analytics. Teaching in Higher Education 18 (6).
10 Chatti, M.A., Lukarov, V., Thus, H. et al. (2014). Learning analytics: challenges and
future research directions. eleed (10).
11 Gasevic, D., Dawson, S., and Siemens, G. (2015). Let’s not forget: learning analytics are
about learning. Tech Trends 59 (1): 64–71. https://doi.org/10.1007/s11528-014-0822-x.
12 Cooper, A. (2012). A Brief History of Analytics: A Briefing Paper, CETIS Analytics Series.
JISC CETIS http://publications.cetis.org.uk/wp-content/uploads/2012/12/Analytics-Brief-
History-Vol-1-No9.pdf.
13 Siemens, G. (2011). What are learning analytics? Retrieved March 10.
14 Baker, R.S.J.D. and Yacef, K. (2009). The state of educational data mining in 2009: a
review and future visions. Journal of EDM 1 (1): 3–17.
15 Baker, R.S.J.D. (2010). Data mining for education. International Encyclopedia of Educa-
tion 7: 112–118.
16 Castro, F., Vellido, A., Nebot, A., and Mugica, F. (2007). Applying data mining tech-
niques to e-learning problems. In: Evolution of Teaching and Learning Paradigms in an
Intelligent Environment, 183–221. Berlin, Heidelberg: Springer.
17 Algarni, A. (2016). Data mining in education. International Journal of Advanced Com-
puter Science and Applications 7 (6).
18 Manyika, J., Chui, M., Brown, B. et al. (2011). Big Data: The Next Frontier for Innova-
tion, Competition, and Productivity. New York: Mckinsey Global Institute.
19 Castro, F., Vellido, A., Nebot, A., and Mugica, F. (2007). Applying data min-
ing techniques to e-learning problems. Studies in Computational Intelligence 62:
183–221.
20 U.S. Department of Education Office of Educational Technology (2012). Enhancing
Teaching and Learning through Educational Data Mining and Learning Analytics: An
issue brief. In Proceedings of conference on advanced technology for education. https://
tech.ed.gov/wp-content/uploads/2014/03/edm-la-brief.pdf
21 West, D.M. (2012). Big Data for Education: Data Mining, Data Analytics, and Web
Dashboards. Washington, DC: Governance Studies at Brookings Institute https://pdfs
.semanticscholar.org/5a63/35fa6a09f3651280effc93459f1278639cc4.pdf?_ga=2.36321056
.1417896260.1577346636-557630246.1577346636.
22 Siemens, G. and Long, P. (2011). Penetrating the fog: Analytics in learning and educa-
tion. EDUCAUSE Review 46 (5): 30.
23 Lin, M.-Y., Lee, P.-Y., and Hsueh, S.-C. (2012). Apriori-based frequent itemset min-
ing algorithms on MapReduce. In: Proceedings of the 6th International Conference on
Ubiquitous Information Management and Communication, 76. New York: ACM.
24 Ma, B.L.W.H.Y. and Liu, B. (1998). Integrating classification and association rule min-
ing. In: Proceedings of the 4th International conference on knowledge discovery and data
mining.
25 Woo, J. (2012). Apriori-map/reduce algorithm. In: Proceedings of the International Con-
ference on Parallel and Distributed Processing Techniques and Applications (PDPTA). The
Steering Committee of The World Congress in Computer Science. Computer Engineer-
ing and Applied Computing (WorldComp).
26 Sin, K. and Muthu, L. (2015). Application of big data in education data mining and
learning analytics – a literature review. ICTACT Journal on Soft Computing, ISSN:
2229-6956 (online) 5 (4).
27 Manjulatha, B., Venna, A., and Soumya, K. (2016). Implementation of Hadoop oper-
ations for big data processing in educational institutions. International Journal of
Innovative Research in Computer and Communication Engineering, ISSN (Online) 4
(4): 2320–9801.
28 Patil, S.M. and Kumar, P. (2017). Data mining model for effective data analysis of higher
education students using MapReduce. International Journal of Engineering Research &
Management Technology, ISSN: 2278-9359 6 (4).
29 Vaidya, M. (2012). Parallel processing of cluster by MapReduce. International Journal of
Distributed and Parallel Systems (IJDPS) 3 (1).
30 Dean, J. and Ghemawat, S. (2010). MapReduce: simplified data processing on large
clusters. Google, Inc. In: Proceedings of the 6th Conference on Symposium on Operating
Systems Design & Implementation.
31 Provost, F. and Fawcett, T. (2013). Data science and its relationship to big data and
data-driven decision making. Big Data, Mary Ann Liebert, Inc. 1: 1.
32 Steele, B., Chandler, J., and Reddy, S. (2016). Hadoop and MapReduce. In: Algorithms
for Data Science. Cham: Springer.
33 García, E., Romero, C., Ventura, S., and de Castro, C. (2011). A collaborative educa-
tional association rule mining tool. Internet and Higher Education 14 (2011): 77–88.
https://doi.org/10.1016/j.iheduc.2010.07.006.
34 Kumar, V. and Chadha, A. (2012). Mining association rules in Student’s assessment
data. IJCSI International Journal of Computer Science Issues, ISSN (Online): 1694-0814 9
(5) No. 3.
35 Slater, S., Joksimovic, S., Kovanovic, V. et al. (2017). Tools for educational data mining:
a review. Journal of Educational and Behavioral Statistics 42 (1): 85–106. https://doi.org/
10.3102/1076998616666808.
36 Guleria, P., Thakur, N., and Sood, M. (2014). Predicting student performance
using decision tree classifiers and information gain. In: International Conference on Par-
allel, Distributed and Grid Computing, Solan, Himachal Pradesh, India, 126–129. IEEE
https://doi.org/10.1109/PDGC.2014.7030728.
37 Guleria, P. and Sood, M. (2015). Predicting student placements using Bayesian classifi-
cation. In: 2015 Third International Conference on Image Information Processing (ICIIP),
109–112. Solan, Himachal Pradesh, India: IEEE https://doi.org/10.1109/ICIIP.2015
.7414749.
38 Guleria, P. and Sood, M. (2014). Mining educational data using K-means cluster-
ing. International Journal of Innovations & Advancement in Computer Science 3 (8):
2347–8616.
8.1 Introduction
In this century, industrialization has contributed to climate variation and has adversely
impacted the environment, which has raised severe health problems. Even though the
world is trying to find ways to heal the planet and to preserve, protect, and enhance global
nature, deforestation, air pollution, and greenhouse gas emissions from human activities
are now a greater threat than ever before.
Air pollution can be defined as the presence of pollutants such as carbon monoxide (CO),
sulfur dioxide (SO2), particulate matter (PM), nitrogen dioxide (NO2), and ozone (O3) at levels
that have a negative effect on the environment and on human health [1]. Currently, air pollution
kills around 7 million people worldwide each year, which is a horrendous figure and needs
immediate resolution [2]. Health risks to the pulmonary, neurological, cardiac, and vascular
systems are among the effects that have emerged as a result of polluted air. In 2005, 4 700
O3-related deaths were attributed to air pollution [3]. Approximately 130 000 PM2.5-related
deaths were reported in the United States due to the increase of ecological footprints. Unlike
any other pollutant, PM creates health issues such as cardiovascular and respiratory diseases
and specific types of serious lung cancers [4, 5]. Basically, PM consists of organic and
inorganic components such as sulfate (SO4 2−), nitrates (NO3), ammonia (NH3), sodium chloride
(NaCl), black carbon, mineral dust, liquid particles, and physical solids. Particles smaller than
10 μm (PM10) can be identified as some of the most health-damaging particles, because the
human breathing system is unable to filter them [4, 5].
Uncontrolled population growth, urbanization, the extermination of green spaces, and the
burning of fossil fuels that emit exhaust gases are all major causes of pollution and
significantly impact global air quality.
Due to urbanization growth and city populations, global green surfaces are rapidly being
substituted by the continuous construction of massive concrete surfaces. Forests and human-made
gardens (without cultivations) cover 31% of the land area, just over 4 billion hectares.
Nonetheless, this is down from the roughly 5.9 billion hectares of the preindustrial era.
According to the research data of the United Nations Food and Agriculture Organization, the results
of deforestation are at their highest rate since the 1990s. Annually, the biosphere has lost
an average of 16 million hectares of green space [6]. Green spaces affect air quality through
the direct elimination of pollutants, by controlling the spread of air pollutants, by reducing
industrial pollutants in local microclimates, and by limiting the emission of volatile organic
compounds (VOCs), which can add to O3 and PM2.5 formation [7]. Studies have shown that trees,
especially low-VOC-releasing species, can be used as a viable approach to reduce urban O3
levels. According to previous studies, forests absorb impurities in the air, a process that is
common to all vegetation. In an exchange of gases, plants draw in carbon dioxide, alter it to
make food, and discharge oxygen.
This shows that forests have a role in enhancing environmental air quality. Green spaces
fight pollution by sticking particles and aerosols to leaf surfaces and by reducing the air
movements that would otherwise carry them into the groundwater [1, 3]. Furthermore, forests
play a major role in decreasing greenhouse gas effects by eliminating CO2. Previous
explorations have focused on urbanization and on the capability of urban forestry to minimize
airborne PM and NO2. Several tree arrangements are able to adjust wind profiles. Also, the
ability to generate wind inversions through forest structures assists in pulling pollutants
from the air, and forest structures can be used as physical resistance factors to prevent the
diffusion of pollutants into the atmosphere [1, 3].
As a result of deforestation, air pollution is becoming a major environmental problem
that affects human health and the global climate in significant ways. Additionally, some
studies do not really focus on experimental samples of analyzed data related to the air
pollutant elimination capacity of green space, but instead typically evaluate the Urban Forest
Effects Model (UFORE) by Nowak [8]. Additionally, few studies show empirical evidence
of the relation between overall ambient PM densities and urban forestry densities [9].
Studies have put a reasonable amount of effort into proving or confirming model evaluations,
which have either been established to contribute considerably to reducing pollution [9] or
have shown no positive effects [10].
According to the studies of Sanderson et al., if we ignore the current deforestation, it will
result in an estimated growth of 6% in overall isoprene releases and of 5–30 ppb in surface
ozone (O3) levels because of climate modification over the years 1990–2090 [11].
Ganzeveld and Lelieveld initiated a study of an important outcome of Amazonian deforestation
on atmospheric chemistry; they did this by including solid reductions in ozone (O3) dry
sedimentation and isoprene emissions [12]. Lathiere et al. showed that tropical deforestation
might yield a 29% decrease in global isoprene emissions [11]. Ganzeveld et al. showed decreases
in global isoprene emissions and growths in boundary layer ozone (O3) mixing ratios of up to
9 ppb in response to 2000–2050 variations in land usage and land cover [11].
Most of the previous studies discussed the consequences of anthropogenic land-usage
alteration on the global climate and ignored the potential effects on atmospheric chemistry
and the air quality index. There are some studies that address the green space effects on
global air pollution, but only by considering specific countries and cities. Nonetheless, more
research needs to be done showing the connection between green space variation and a country's
air pollution, as this is considered critical and essential for this constantly growing problem.
This study is expected to evaluate whether or not a visible relationship exists between global
air quality, considering PM, and global forest density.
8.2 Material and Methods
Table 8.1 Air quality categories (annual mean ambient defined by WHO).
The analysis process takes as inputs the levels of air quality (PM value) and the level of difference of the green space from 1990 to 2015, clusters the information, checks whether the clustering is correct for more than 75% of the data, and then processes the results graphically and statistically to produce the results of the analysis process.
other optical technologies, including satellite retrievals of aerosol optical depth and chemical
transport models, have been used in this study. This technique covers estimates of annual
exposure to PM2.5 levels at high spatial resolution. Air quality data are reported in terms of
annual mean concentrations of PM2.5 fine particles per cubic meter of air volume (m3).
Routine air quality measurements typically describe such PM concentrations in terms of
micrograms per cubic meter (μg m–3) [4, 5]. Each data point covers approximately 100 km2 of
land area (0.1∘ × 0.1∘, which equates to approximately 11 × 11 km at the equator) [4, 5] on a
global scale. According to the definition of air quality published by the WHO, it can be
divided into several categories, as follows.
According to WHO guidelines, an annual average PM concentration of 10 μg m–3 should be
selected as the long-term guide value for PM2.5, and this study has followed this guide [4, 5].
This study also characterizes the minor edge range over which important effects on survival
were detected in the American Cancer Society's (ACS) study [12]. Finally, air quality data
were combined with green space data that have been collected by the United Nations FAO [13].
When collecting these records, green spaces relevant to urban parks, forest cultivations below
5 m, and agricultural plantations were identified and excluded. After normalizing the data
records, we calculated the green space difference between the years and categorized the
differences into levels. To recognize the relationship between green space and air quality,
we performed a statistical calculation process. For this purpose, we used MS Excel and IBM
SPSS software, selected purposefully for accuracy; selecting accurate and powerful analysis
tools was prioritized. A Waikato Environment for Knowledge Analysis (WEKA) graph tool was
used to identify the approximate geographical changes of the tree areas as well as the level
of air quality.
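A minimal sketch of this preprocessing step is given below in Python with pandas; the column names, country labels, and values are hypothetical, and the binning simply mirrors the idea of turning the 1990–2015 green-space difference into levels before joining it with the PM2.5 values.

import pandas as pd

green = pd.DataFrame({
    "country": ["A", "B", "C"],
    "forest_pct_1990": [45.0, 30.0, 12.0],
    "forest_pct_2015": [40.0, 33.0, 10.0],
})
air = pd.DataFrame({
    "country": ["A", "B", "C"],
    "pm25_2014": [18.0, 9.0, 41.0],   # annual mean PM2.5 in ug/m3
})

# Change in green-space percentage per country between the two years.
green["difference"] = green["forest_pct_2015"] - green["forest_pct_1990"]

# Categorize the difference into levels (illustrative ranges).
bins = [-100, -32, -16, 0, 16, 32]
labels = ["< -32", "-32 to -16", "-16 to 0", "0 to 16", "16 to 32"]
green["level"] = pd.cut(green["difference"], bins=bins, labels=labels)

print(green.merge(air, on="country"))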
8.3 Results
Annual air quality values for each country, with longitude and latitude coordinates, were used.
Figure 8.2a shows the air quality ranges on the ground, and Figure 8.2b is a filtered image
of Figure 8.2a that considers the intensity of the brightness.
Figure 8.2 (a) Air quality with land areas in 2014 (using 1 048 576 instances). (b) Air quality values exceeding PM2.5 (using 1 048 576 instances).
The dark areas show the risk areas that exceed the PM2.5 level, with ranges varying from risk
level (above 25 μg m–3) to extremely high-risk level (70 μg m–3 or more). By analyzing this, we
can identify Latin America, Europe, and Asia as the low air quality areas. China, Korea,
African countries, Arabia, Germany, Italy, and Mexico belong to the high-risk areas.
In South Asia, China, India, Korea, Pakistan, and Malaysia have more forest area compared with
Africa. Thus, if the air quality is not at a satisfactory level, it may affect and promote bad
health conditions, in conjunction with the adverse effects of urbanization and the rapid
development of the manufacturing field. In addition to that, the state of Alaska (northwest of
Canada, the largest and most sparsely populated US state) and Latin America (Florida in the
United States and Mexico) also pose a risk to the air quality level.

Air Quality   248 950 205.386   3   105.143   1 046 112   2 367 721.879   .000
@2014         87 762 175.753    3   87.536    1 046 112   1 002 588.270   .000
Difference    33 494.097        3   11.403    1 046 112   2 937.270       .000
Figure 8.3a shows the tree area percentages of the ground in 1990, and Figure 8.3b is relevant
to 2014, with the longitude and relevant latitude. In these figures, intensity is directly
proportional to the tree area percentage of the countries. Some significant changes happened
during these 24 years, and we can identify some slight differences by comparing the two
figures. Figure 8.3c represents the differences between Figure 8.3a and b. Europe, South Asia,
and North America have the highest difference in tree space according to Figure 8.3c.
Figure 8.3 (a) Tree area in 1990. (b) Tree area in 2014. (c) Difference of tree area during 1990–2014.
8.4 Quantitative Analysis
The clusters are compared against the difference of green space area percentage during 1990 to 2015 (df2015 − df1990), grouped into the ranges 16 to 32, 0 to 16, (−16) to 0, (−32) to (−16), and below (−32), across Clusters 1 to 5.
When the green area is increasing, it has a positive effect: it reduces the value of PM in the
air and results in good air quality (clusters 2 to 3 and 4 to 5). Furthermore, decreasing the
green space area increases the PM value and results in bad and risky air quality
(clusters 3 to 4).
By analyzing Figure 8.5, we can identify an inversely proportional relationship between
air quality and tree area. Figure 8.5 shows a quantitative representation of Figure 8.4.
Through Figure 8.5, it can be identified that PM and green space do not have a 1 : 1
proportional relation, but the relation can be identified as proportional, and a green area has
a hidden ability to filter air and remove a considerable portion of particulates. In cluster 2
and cluster 4 there are no considerable tree percentages compared to the PM values, and the
PM value increases in an uncontrolled manner. Nevertheless, even though the PM value is high
in cluster 3 and cluster 5, the tree area becomes resistant to increased PM values. Green
spaces act as a cofactor regarding air quality.
Figure 8.6 shows that cluster 1 (PM2.5) represents a smaller number of data points, which
represent good air quality, while clusters 2–6 represent air quality that exceeds PM2.5. The
missing values can be clearly identified through Figure 8.6, and their amount is not
considerable compared to the number of valid instances. Thus, they do not affect the research
outcomes in a considerable manner.
Figure 8.5 plots the values of the variables Air_Quality, @2014, and Difference for Clusters 1 to 5.
Figure 8.6 shows the number of instances in each range of air quality for Clusters 1 to 6, together with the valid and missing counts, for the risk levels (PM2.5) < 10 µg/m3, (PM) 16 to 25 µg/m3, (PM) 26 to 35 µg/m3, and (PM) 36 to 69 µg/m3.
8.5 Discussion
This study has observed a possible relationship between atmospheric air quality and the
changes in green space recorded from 190 countries between 1990 and 2015. The green space
density has been analyzed, and the data set contains more than 1000 K (1 048 000) records.
Figure 8.7 Tree area percentage/relation of raw data (difference) and ranges (level of difference).
A further plot relates the mean air quality in 2014 (µg/m3) to the green space percentage in 2014, with the PM2.5 guideline (10 µg/m3), the high-risk level (35.00 µg/m3), and the extremely high-risk level (70.00 µg/m3) marked.
The effects of deforestation, changing atmospheric chemistry, and air quality can highly
impact the nature of the global atmosphere, including changing the life quality of living
creatures. Previous studies have shown the important effects of natural emissions, deposition,
and atmospheric chemistry directly driven by changes in meteorology (such as temperature,
humidity, and solar radiation) linked with climate modification.
In order to differentiate the effects due to changes in land use/land cover, this chapter
has identified green spaces of urban parks, forest cultivations below 5 m, and agricultural
plantations, which have been seen to be considerably affected by air quality change [14].
Furthermore, variations in tropospheric ozone (O3), nitrogen dioxide (NO2), and sulfur
dioxide (SO2) in response to green space can also contribute to and affect air quality;
however, this aspect is not considered in this chapter. Earlier studies also pointed out
that variation in the chemical composition of the atmosphere, such as increasing ozone
(O3) concentrations, can affect vegetation [14]; these effects were also not considered
in this chapter. According to some previous studies, in Southeast Asia, smoke from fires
related to deforestation, mainly of peat forests, significantly increases existing urban air
pollution, especially in El Niño years [15]. Complete tree removal causes an increase in
wind speed and air temperature and a decrease in water vapor, which leads to a weakening
of the monsoon flow over east China. G. Lee et al. report that observed air surface
temperatures are lower in open land than in boreal forested areas in Canada and the United
States [16]. However, all the above studies were limited either to specific areas of the world
or to a specific year. Additionally, some studies have examined the relation between PM,
global air quality, and climate change. Therefore, in order to extend these previous
research areas, this study considered statistical algorithms with a big data set that consists
of country-wise air quality values considering PM and country-wise green space values
between 1960 and 2015. Further, we used a data set of big data proportions to get more accurate
results for this chapter, and in order to reduce anomalies, we used machine learning algorithms.
IBM SPSS Modeler is used because it is foremost data mining software, which can
apply several algorithms for data preparation, data rectification, statistics,
data visualization, and predictive analytics. The SPSS K-means clustering algorithm is
used to cluster and classify the data into manageable groups. K-means clustering
analysis is an excellent approach for knowledge discovery and effective decision-making
[17] in huge data sets, and it has also shown good results in our previous research [18–20].
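The following sketch shows the same kind of K-means step in Python with scikit-learn rather than SPSS Modeler; the country values are invented and serve only to show how cluster-wise means of PM and green-space change can be read off the result.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical rows per country: mean PM in 2014 (ug/m3), green-space difference 1990-2015 (points).
data = np.array([
    [9.0, 1.2],
    [12.0, 0.5],
    [35.0, -8.0],
    [41.0, -12.5],
    [68.0, -20.0],
    [15.0, 3.4],
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
for cluster in range(3):
    members = data[model.labels_ == cluster]
    print("Cluster", cluster + 1,
          "mean PM = %.1f" % members[:, 0].mean(),
          "mean green-space change = %.1f" % members[:, 1].mean())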
8.6 Conclusion
By analyzing the air quality data sets of countries and the country green space, this chapter
identified that there is a relation between green space density and atmospheric air quality,
and for a sustainable prediction, a machine learning algorithm needs to be utilized with the
collected results. Since some pollutants, such as carbon monoxide (CO), sulfur dioxide (SO2),
nitrogen dioxide (NO2), and ozone (O3), are not able to be absorbed by tree surfaces, they are
not considered in this chapter. PM is considered the significant pollutant that affects human
health. Furthermore, different multivariate tables and graphs have been analyzed in this
chapter to gain more accurate results. The outcomes point out that K-means clustering can be
used to classify the air quality in each country. Additionally, the same research methodology
can be used to carry out air quality, forest area, health effect, and climate-related studies.
The gained results can be used to identify the relevant risk level of air pollution for
individual countries, and they will help us take urgent actions to prevent future risks.
Further, the results with the difference of green space value between each year will lead the
relevant governments, the World Wildlife Fund (WWF), the WHO, etc., to take better actions to
protect current green spaces and carry out forestation to prevent further environmental changes
and to neutralize the effects of air pollutants.
Author Contribution
G.P. and M.N.H. conceived the study idea and developed the analysis plan. G.P. analyzed
the data and wrote the initial paper. M.N.H. helped to prepare the figures and tables and
finalized the manuscript. All authors read the manuscript.
References
1 Chen, T., Kuschner, W., Gokhale, J. & Shofer, S. (2007). Outdoor Air Pollution: Nitro-
gen Dioxide, Sulfur Dioxide, and Carbon Monoxide Health Effects. [online] Available at:
https://www.sciencedirect.com/science/article/abs/pii/S0002962915325933 (accessed 15
May 2017).
2 Kuehn, B. (2014). WHO: More Than 7 Million Air Pollution Deaths Each Year. [online]
jamanetwork. Available at: https://jamanetwork.com/journals/jama/article-abstract/
1860459 (accessed 22 Apr. 2017).
3 Irga, P., Burchett, M., and Torpy, F. (2015). Does Urban Forestry have a Quantitative
Effect on Ambient Air Quality in an Urban Environment. [ebook], 170–175. NSW: Uni-
versity of Technology Sydney. Available at: https://www.researchgate.net/publication/
281411725_Does_urban_forestry_have_a_quantitative_effect_on_ambient_air_quality_in_
an_urban_environment (accessed 16 May 2015).
4 World Health Organization. (2017). Ambient and household air pollution and health.
[online] Available at: https://www.who.int/phe/health_topics/outdoorair/databases/en/
(accessed 22 April 2017).
5 World Health Organization. (2017). Ambient (outdoor) air quality and health. [online]
Available at: http://www.who.int/mediacentre/factsheets/fs313/en/. (accessed 23 May
2017).
6 Adams, E. (2012). Eco-Economy Indicators - Forest Cover| EPI. [online] Earth-policy.org.
Available at: http://www.earth-policy.org/indicators/C56/ (accessed 15 April 2017).
7 Nowak, D., Hirabayashi, N., Bodine, A., and Greenfield, E. (2014). Tree and Forest
Effects on Air Quality and Human Health in the United States, 1e [ebook], 119–129.
Syracuse, NY: Elsevier Ltd.. Available at: https://www.fs.fed.us/nrs/pubs/jrnl/2014/nrs_
2014_nowak_001.pdf (accessed 19 April 2017).
8 Nowak, D., Crane, D., and Stevens, J. (2006). Air Pollution Removal by Urban Trees and
Shrubs in the United States, 4e [ebook] Syracuse, NY: Elsevier, 115–123. Available at:
https://www.fs.fed.us/ne/newtown_square/publications/other_publishers/OCR/ne_2006_
nowak001.pdf (accessed 19 April 2017).
9 Pataki, D., Carreiro, M., Cherrier, J. et al. (2011). Coupling biogeochemical cycles in
urban environments: ecosystem services, green solutions, and misconceptions. Frontiers
in Ecology and the Environment, [online] 9 (1): 27–36. Available at: http://doi.wiley.com/
10.1890/090220 (Accessed 21 April 2017).
10 Setälä, H., Viippola, V., Rantalainen, L. et al. (2013). Does Urban Vegetation Mitigate Air
Pollution in Northern Conditions? 4e [ebook], 104–112. Syracuse, NY: Elsevier. Available
at: http://www.sciencedirect.com/science/article/pii/S0269749112004885 (accessed 23
April 2017).
11 Wu, S., Mickley, L., Kaplan, J., and Jacob, D. (2012). Impacts of Changes in Land Use
and Land Cover on Atmospheric Chemistry and Air Quality over the 21st Century, 12e
[ebook], 2–165. Cambridge, MA: Harvard University’s DASH repository. Available
at: https://dash.harvard.edu/bitstream/handle/1/11891555/40348235.pdf?sequence=1
(accessed 24 April 2017).
12 Ganzeveld, L. and Lelieveld, J. (2004). Impact of Amazonian deforestation on atmo-
spheric chemistry. Geophysical Research Letters, [online] 31 (6), p.n/a-n/a. Available at:
https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2003GL019205 (accessed 27
May 2017).
13 Data.worldbank.org. (2017). Forest area (% of land area) | Data. [online] Available at:
https://data.worldbank.org/indicator/AG.LND.FRST.Zs (accessed 26 April 2017).
14 Flannigan, M., Stocks, B., and Wotton, B. (2000). Climate Change and Forest Fires, 1e
[ebook], 221–229. Syracuse, NY: Elsevier. Available at: http://www.environmentportal
.in/files/cc-SciTotEnvi-2000.pdf (accessed 30 April 2017).
15 Marlier, M., DeFries, R., Kim, P. et al. (2015). Regional Air Quality Impacts of Future
Fire Emissions in Sumatra and Kalimantan, 1e [ebook], 2–12. New York: I0P Publish-
ing. Available at: http://iopscience.iop.org/article/10.1088/1748-9326/10/5/054010/meta
(accessed 28 April 2017).
16 Varotsos, K., Giannakopoulos, C., and Tombrou, M. (2013). Assessment of the impacts
of climate change on European ozone levels. Water, Air, and Soil Pollution 224 (6).
17 Wanigasooriya, C.S., Halgamuge, M.N., and Mohammad, A. (2005). The analysis of anti-
cancer drug sensitivity of lung cancer cell lines by using machine learning clustering
techniques. International Journal of Advanced Computer Science and Applications,
[online] 8 (9). Available at: http://thesai.org/Publications/ViewPaper?Volume=8&
Issue=9&Code=IJACSA&SerialNo=1 (Accessed 14 May 2007).
18 Halgamuge, M.N., Guru, S.M., and Jennings, A. (2005). Centralised strategies for cluster
formation in sensor networks. In: Classification and Clustering for Knowledge Discovery,
315–334. New York: Springer-Verlag.
19 Halgamuge, M.N., Guru, S.M., and Jennings, A. (2003). Energy efficient cluster forma-
tion in wireless sensor networks. In: Proceedings of IEEE International Conference on
Telecommunication (ICT’03), vol. 2, 1571–1576. IEEE, Papeete, Tahiti, French Polynesia,
23 Feb–1 March 2003.
20 Wanigasooriya, C., Halgamuge, M.N., and Mohamad, A. (2017). The analysis of anti-
cancer drug sensitivity of lung cancer cell lines by using machine learning clustering
techniques. International Journal of Advanced Computer Science and Applications
(IJACSA) 8 (9).
9.1 Introduction
Data analytics (DA) is a technique in which a data set is examined to extract a decision,
depending upon the information it contains, using specialized systems, software, and tools.
Similarly, in space technology, a huge amount of terrestrial data has to be collected using
various techniques and technologies. Primarily, the data consist of a huge amount of earth
data and space observation data that are basically collected using various spaceborne
sensors. These collected data are combined with reference data from other sources.
More likely, a new broad area is coming up with a sharp scope of challenges and breakthroughs:
computational sensors vis-a-vis space sensor networks that can cover the whole
electromagnetic spectrum, from radio to gamma waves, and the gravitational quantum
principle. This data analysis can largely contribute to our understanding of the universe and
can also enhance life on earth. Figure 9.1 shows a generalized data collection scheme from
various sources in space. The implementation of big data in space technology can provide a
boost to this area by providing a common platform to various scientific communities and users.
In addition to this, a broad effort is also awakening the public by offering new opportunities
to uplift the individuality of a person in society [1].
In space exploration, space communication and networking have emerged as a new
research arena. In earlier days, communication in space was done using radio signals
that were blasted toward the antenna of a spacecraft within range. Moreover, the software
platform used in different missions was not versatile enough and was absolutely different
for every mission, which forced each mission to be mutually exclusive. A standardized,
futuristic technology and an interconnected novel network that can support space
communication are a solution to this problem. The interplanetary internet (IPN) is a solution
to this kind of problem.
The IPN is capable of supporting deep space exploration. It has a wireless backbone, even if
the links are error prone and there are delays ranging from minutes to hours whenever there
is a connection. The basic principle on which the IPN works is called a store-and-forward
network of internets. There are phenomena, such as dust devils, magnetic interference,
and solar storms, that can disrupt communication among spacecraft. In addition to this,
the farther spacecraft are from earth, the more likely they are to have obsolete technologies
compared to spacecraft launched very recently. Hence, the existing terrestrial internet
protocol suite will not be effective in overcoming the constrictions compounded by such
extremities. Only a deep space network will be able to handle such extreme conditions.
Protocols used in earth-based internets and planetary networks can connect to the IPN backbone
through mobile satellite gateways and can seamlessly switch among the protocols used [2], for
example, a delay tolerant network (DTN), which is capable of integrating a bundle layer over
heterogeneous lower layers. NASA has its own open-source, customizable overlay network
called the interplanetary overlay network (ION) that implements DTN and can be used with
heterogeneous routing protocols.
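The store-and-forward principle can be illustrated with the following minimal Python sketch of a relay node that keeps custody of bundles while no link is available and forwards them when a contact window opens; the node names and bundle fields are illustrative assumptions and are not part of NASA's ION software.

from collections import deque

class RelayNode:
    def __init__(self, name):
        self.name = name
        self.storage = deque()      # bundles held in local storage

    def receive(self, bundle):
        # Take custody of a bundle even when the next hop is unreachable.
        self.storage.append(bundle)

    def forward(self, next_hop, link_up):
        # Forward stored bundles only while the intermittent link is up.
        while link_up and self.storage:
            next_hop.receive(self.storage.popleft())

orbiter, ground = RelayNode("orbiter"), RelayNode("ground-station")
orbiter.receive({"src": "rover", "dst": "earth", "payload": "dust-devil images"})
orbiter.forward(ground, link_up=False)   # no contact: the bundle stays stored
orbiter.forward(ground, link_up=True)    # contact window: the bundle is delivered
print(len(ground.storage))               # -> 1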
The term big data from space has its own meaning. It primarily refers to the amalgamation
of data from spaceborne sensors and ground-based space observation sensors. Big data
has three defining attributes, called the three Vs, namely volume, velocity, and variety,
together with veracity. The volume of data generated by these sensors is enormous; for
example, the archived data are on an exabyte scale. The velocity (the rate at which new data
are collected) is very high, and the variety (data delivered from heterogeneous sources with
various parameters such as frequency and spectrum) is extraordinary. The velocity and variety
must have veracity (the accuracy validated from the collected data). Hence, big data in space
has two dimensions: the capability to collect data and the ability to extract information from
the collected data [3].
Landsat was the very first revolution in earth observation (EO): the largest historical
collection of geodata, the largest collection of earth imagery, was made public in the United
States in 2008. Similarly, the European Space Agency (ESA) of the European Union (EU) has
created its own mark in EO through its Sentinel twin-satellite constellation, the Copernicus
program. ESA is indeed making EO truly an era of big data by producing up to 10 TB of data
while gearing up to the full capacity of its program [3]. Additionally, the inclusion of the
European data relay satellite (EDRS-A) in orbit is a significant step toward creating a space
data highway. EDRS is on its way to creating Europe's own space communication network that
can transmit data at a significantly higher rate (more than 1 Gbps) and, in turn, will generate
a staggering amount of data on the ground [3].
In this context, the United States has its own way of providing open access to satellites
and data in association with the National Oceanic and Atmospheric Administration (NOAA) and
various industries. Data, specifically earth imagery, generated since April 2016 by the
advanced spaceborne thermal emission and reflection radiometer (ASTER), a Japanese remote
sensing instrument operating aboard NASA's Terra spacecraft, are free to use. Similarly, the
Japanese Space Agency (JSA) provides a similar model for data generated by the advanced
land-observing satellite "DAICHI" (ALOS-1) at no charge, which allows various countries to use
unlimited open-access data. The public and private sectors are the major investors in achieving
the goal of building and launching collections of small satellites into orbit. Apart from
satellites, other techniques used in geospatial studies also add volumes of data, which adds
value to all three Vs of big data; furthermore, these data can be used by anyone.
Other countries around the globe are also taking the path of revolutionizing EO in line with
space observation. This, however, has brought various communities together under the common
umbrella of big data. The main aim of big data and analytics is to nurture innovation,
research, science and technology, and other such related developments that are capable of
handling challenges in EO as well as promoting the extraction of useful data in space science.
Similarly, a geographic information system (GIS) is an information technology-enabled tool
that can analyze, manipulate, and visualize geographic information and store data, usually in
a map. In past years, advancements can be seen in the development and implementation of tools
that can attain, analyze, share, and store geotopological data. Technically, GIS illustrates a
primary two-tuple <x,y>, where x is a location, including a temporal dimension, on the earth's
surface and y is an attribute of that location. These technologies have widespread applications
and are designed for data processing, data analysis, data modeling, and data storage. Using
GIS, information handling, such as with systems for acquiring imagery from aircraft or space,
otherwise known as remote-sensing systems, can also be accomplished. GIS started to show its
presence in the mid-twentieth century and is of greater significance in today's industrial
sector.
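The <x, y> pairing can be sketched as a simple data structure, as in the following Python example; the record fields, coordinates, and bounding-box query are illustrative assumptions rather than the interface of any particular GIS product.

from dataclasses import dataclass

@dataclass
class GISRecord:
    lat: float        # location (x): latitude
    lon: float        # location (x): longitude
    year: int         # temporal dimension of the location
    attribute: str    # attribute (y), e.g., a land-cover measure
    value: float

records = [
    GISRecord(28.6, 77.2, 2014, "forest_pct", 21.5),
    GISRecord(35.7, 139.7, 2014, "forest_pct", 68.4),
    GISRecord(52.5, 13.4, 2014, "forest_pct", 32.7),
]

def within(r, lat_range, lon_range):
    # True if the record's location falls inside the bounding box.
    return lat_range[0] <= r.lat <= lat_range[1] and lon_range[0] <= r.lon <= lon_range[1]

# Query: attribute values at all locations inside a bounding box.
print([r for r in records if within(r, (5.0, 35.0), (60.0, 100.0))])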
GIS primarily involves a collection of voluminous geographic data and documents, such as
images, graphic charts, tables, multimedia, and text, along with the related exploration data
and developments. The acute phenomenon in GIS is the analysis of spatial geographic data,
along with storing and analyzing the data in line with the operation being performed on the
data. GIS, however, is capable enough to interconnect geographic data with databases, manage
data effectively, and overcome the data-storage issues involved.
Ever since its inception in the mid-twentieth century, GIS has been applied to problems
related to spatiotemporal analysis and decision-making [4]. With the increasing advancement of
information technology, GIS as a software platform has undergone a steady evolution. The
evolution of GIS over the last several years is shown in Figure 9.2, which basically signifies
several representative architectures of GIS [5–7].
Desktop GIS. Here, all information and programs are handled by central software, basically
GIS application software installed on a stand-alone machine. All the geospatial data are
stored in a stand-alone system, and the computations are also managed in the same system.
Figure 9.2 presents a timeline of GIS evolution, with milestones including Howard Fisher's GIS demonstrations in 1966, the founding of ESRI in 1969, the Odyssey project of the 1970s, the development of MapInfo and the launch of the SPOT satellite in 1986, the period 1993 to 2010 in the history of GIS when the field really took off, and the open-source GIS explosion from 2010 onward.
Detection is possible by applying a change detection mechanism on the spacecraft [10]. This
mechanism allows a spacecraft to pause observation tasks, focus on the meteoroid shower, and
capture the motion sequence of the dust devil. The captured motion sequence of the dust devil
will alert scientists to the complete sequence back on earth. The required autonomy in a
spacecraft is achieved by deploying more computational and analytic capability on the
spacecraft or rover and authorizing the system to have an autonomous response system.
In the era of exponential advancement in information and communication technology
(ICT), big data analytics has its own place as a cutting-edge technology. Big data analytics is strongly supported by real-time earth observation systems (EOS). The data generated by a single satellite in an EOS may not seem voluminous, but the combined data generated by various satellites can amount to a sheer volume of data, perhaps even exabytes. The major challenge in EOS is to extract useful information from such a high volume of data; challenges such as aggregating, storing, analyzing, and managing remotely collected data are pressing. This requires scientists to build a universal, integrated EOS information system with associated services such as real-time data processing. In addition, there is a need for an ecosystem consisting of data acquisition, data processing, data storage, and data analysis and decision-making (explained in Section 9.3), which provides a holistic approach that can accommodate the data flow from a satellite to a service using a big data architecture. Such a system can provide a real-time mechanism to analyze and process remotely captured EOS data.
There is a great deal of interest in ICT and its advances, because information technology has driven exponential growth in data volume and data analysis [11, 12]. According to a report from IBM, 90% of all existing data was generated in the last two years [13]. This has established the big data concept as a trendy, cutting-edge technology, and it has also raised a huge number of research challenges in areas such as modeling, processing, data mining, and distributed data repositories. As discussed earlier, big data encompasses three Vs, namely volume, velocity, and veracity, which make it difficult to collect, process, analyze, and store data with currently used methodologies.
In a real-time scenario, huge amounts of data must be collected for processing and analysis; this represents the volume. High-speed processing and analysis of real-time data, such as data collected from online streaming (real-time remote sensing is another example), represents the velocity. Finally, the validity of data collected from vastly different sources, such as the internet of things (IoT), machine-to-machine communication, wireless sensor networks, and many more, is a vital aspect of any real-time scenario; this represents the veracity.
However, existing services, such as web services and network devices, are also part of
generating extensive data and it can also be expected that in a real-time environment the
majority of data will be generated from different sensors. According to Oracle, sensors will
soon generate data on the scale of petabytes [14].
In a way, progress in big data and IT has transformed the fashion in which remote data are collected, processed, analyzed, and managed [15–17]. Continuous data streams are generated by integrating satellites into various EOS and other platforms. These continuous streams of data are generally termed "big data" and pose a day-to-day challenge in the real-time environment [18]. Handling them is a critical task, as it involves a scientific understanding of the remotely sensed data [19]. Moreover, because of the rate of increase in the volume of real-time data, there will shortly be a demand for a mechanism that can efficiently aggregate, process, and store the data and its sources in real time.
Data acquisition is the first step in remote sensing, where a satellite generates a huge amount of raw data by monitoring the earth continuously. The data captured in this phase are raw data, much of it of no interest, that must be cleaned down by orders of magnitude. The filters used in the data cleaning process must not discard useful information. Equally important is the generation of precise metadata that describes the data composition and the data collection and analysis mechanism. Analyzing such metadata is of course a challenging task, as it requires us to understand the source of each piece of data in remote sensing.
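As a rough illustration of the metadata generation step just described, and not part of the referenced architecture, the sketch below records how a raw scene was acquired and cleaned so that later analysis can trace the source of each piece of data; all field names are illustrative assumptions rather than a catalogue standard.

```python
import json
from datetime import datetime, timezone

def build_scene_metadata(scene_id, satellite, sensor, cleaning_steps):
    """Describe the composition and collection of one raw remote-sensing scene.

    The structure is an ad hoc example; a real EOS catalogue would follow a
    metadata standard such as ISO 19115 rather than this layout.
    """
    return {
        "scene_id": scene_id,
        "source": {"satellite": satellite, "sensor": sensor},
        "acquired_at": datetime.now(timezone.utc).isoformat(),
        # Filters applied during cleaning; they must not discard useful information.
        "cleaning_steps": cleaning_steps,
    }

metadata = build_scene_metadata(
    scene_id="ENVISAT-ASAR-000123",
    satellite="ENVISAT",
    sensor="ASAR",
    cleaning_steps=["radiometric calibration", "speckle filtering", "cloud masking"],
)
print(json.dumps(metadata, indent=2))
```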
(Figure: big data architecture for real-time EOS analytics — environmental and geographical information delivered over a communication infrastructure (802.11, 802.16) to a decision support system server providing pre-processing, entity resolution and identification, distributed storage, parallel processing, extraction and transformation, data exploration, real-time analytics, and visualization, alongside a traditional data warehouse with staging, transformation, and consumption (analytics) stages.)
and the results for each chunk of data are generated in real time. The results from each processing server are forwarded to an aggregation server for compilation, organization, and storage for further processing.
9.1.7 Analysis
The primary objective of remotely sensed image analysis is to develop robust algorithms that support the discussed architecture, monitor various geographical areas, such as land and sea, and additionally evaluate the system. The satellite data are captured by various satellite missions that monitor the earth from almost 1000 km above the earth's surface. For example, the ENVISAT data set, captured by the ENVISAT mission of the ESA using the advanced synthetic aperture radar (ASAR) sensor, covers various geographical features, such as deserts, forests, cities, rivers, roads, and houses, across Europe and Africa.
Figure 9.4 Common machine learning workflow: gathering data from various sources, cleaning the data to obtain homogeneity, model building (selecting ML models), gaining answers from model results, and data visualization to obtain a brief idea of the results.
from prior experience. The common process for the machine learning model is shown in
Figure 9.4.
So, ML can be seen as an essential component of this process. Examples of events are identified in the training phase. To make this possible for the computing system, a generalized pattern is determined through various techniques. Data sets of similar events are then found and used as additional training examples. This overall process is referred to as supervised learning, since at every stage the model is trained on labeled examples.
Depending on how noisy the system is and how it responds, the training sets needed to achieve success vary. In general, a system may need more than 100 example training sets, but the answer depends entirely on the context of the problem and the behavior of the data set. Training a model to the required standard is a tedious job and calls for expertise.
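As a hedged illustration of the supervised learning workflow just described (training on labeled example events, then checking how well the learned pattern generalizes), the following scikit-learn sketch uses a synthetic event data set; the data, features, and model choice are assumptions for demonstration only, not the chapter's own experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for labeled example events (e.g., "event" vs. "no event").
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Training phase: the model learns a generalized pattern from labeled examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Held-out events check how well the learned pattern generalizes.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```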
(Figure: satellite remote sensing — optical remote sensing, microwave remote sensing, and data warehousing — from missions such as LANDSAT, ENVISAT, RADARSAT, and SPOT, combined with big data and machine learning tools such as Shogun, Apache Mahout, Scikit-Learn, Apache Spark MLlib, and TensorFlow.)
Hence, there is a need for physical interfaces that are futuristic, satisfy the regulations of both space and terrestrial networks, and are compatible with current and future wireless communication systems. An implementation that is to remain homogeneous with current and future terrestrial wireless communication systems requires a unified physical link interface; its development should be based on international regulations for space and terrestrial networks and should be compatible with future trends.
Various geospatial techniques are in use, and they can be grouped into three broad classes: position metrics calibrate a geographical position to a location, data collection mechanisms collect GIS data, and data analysis models analyze and use the collected data [9]. A pictorial view of the various geospatial techniques is shown in Figure 9.7.
Figure 9.7 Geospatial techniques: georeferencing of pictorial and numerical data, data products, interpretation, and users.
(Figure: integrated data representation and storage — theme, object, state, and data dimensions; external and internal memory; temporal, spatial, and cluster-based division; indexing on NoSQL stores and distributed file systems.)
overcome submeter resolution imagery by remote sensing from satellites [8]. These optical-wavelength, infrared sensor–based images are useful in many applications, a temperature map of the world, for example. Active sensors that capture transmitted signals, radar-based remote sensing, and laser techniques such as LiDAR can also provide data of very high precision [45].
Figure 9.9 Distributed spatiotemporal storage (spatiotemporal characteristics, parallel strategy), spatiotemporal indexing (data preparation, correlation analysis, computation analysis), and analysis and mining (process metamodel, executable script, algorithm mapping, planning and control), together with knowledge presentation and web service layers.
From Figure 9.9, it is evident that data mining, process modeling, and service technologies must be implemented to achieve efficient knowledge discovery and geospatial analysis. In general, a data mining algorithm is designed for small applications, where the scope of data mining is very limited and implemented on a stand-alone computer. By contrast, big data mining requires the aggregation of a wide range of varied data and the processing of that data using parallel computing; for dynamic streaming, the traditional knowledge discovery process is not effective enough. Similarly, process modeling technologies need to provide services such as script mapping, process planning and control, their evaluation and sources, and a spatiotemporal meta-model over big data [49].
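To make the contrast with stand-alone data mining concrete, the following is a minimal sketch of parallel aggregation over spatiotemporal records using Apache Spark (one of the tools mentioned earlier). The grid-cell data and column names are invented for illustration; a real Big-GIS deployment would read from a distributed file system rather than an in-memory list.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local Spark session standing in for a cluster.
spark = SparkSession.builder.master("local[*]").appName("spatiotemporal-mining").getOrCreate()

# Toy spatiotemporal records: (spatial grid cell, observation hour, measured value).
rows = [("cell_17", 9, 0.42), ("cell_17", 9, 0.55), ("cell_03", 10, 0.12), ("cell_03", 11, 0.20)]
df = spark.createDataFrame(rows, ["cell", "hour", "value"])

# The aggregation is executed in parallel across partitions by the workers.
summary = df.groupBy("cell", "hour").agg(
    F.count("*").alias("n_obs"),
    F.avg("value").alias("mean_value"),
)
summary.show()
spark.stop()
```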
Figure 9.10 Conceptual diagram of the proposed fogGIS framework for power-efficient, low
latency, and high throughput analysis of the geospatial big data.
translated to a corresponding fog layer [55]. In a similar context, any geospatial data can be compressed on the fog node and transmitted to the cloud layer later. In the cloud layer, the data are compressed or decompressed as needed before being processed, and finally analyzed and visualized.
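The compress-at-the-fog, decompress-at-the-cloud idea can be sketched in a few lines. The payload fields below are illustrative assumptions, and gzip merely stands in for whichever compression scheme a fogGIS deployment would actually use.

```python
import gzip
import json

# Toy geospatial payload gathered at the fog node (field names are illustrative).
payload = [{"lat": 41.88, "lon": -87.63, "timestamp": "2020-01-15T10:30:00Z", "pm25": 12.4}
           for _ in range(1000)]
raw = json.dumps(payload).encode("utf-8")

# Fog layer: compress before sending upstream to save bandwidth and power.
compressed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")

# Cloud layer: decompress, then analyze and visualize.
restored = json.loads(gzip.decompress(compressed))
assert restored == payload
```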
Various concepts, challenges, requirements, and scopes are discussed in this chapter. A primary problem, the meteoroid shower, also known as the dust devil, is discussed, and possible solutions are presented. Table 9.2 depicts various aspects of integrated data representation, storage models, computation, and visual analysis. Table 9.3 compares various data models, with their supported data types and scalability, and gives examples of products that use the corresponding data model.
9.4 Conclusion
Various concepts, challenges, requirements, and scopes are discussed in this chapter. A primary problem, the meteoroid shower, also known as the dust devil, is discussed, and possible solutions to overcome it are proposed. In addition, the amalgamation of AI and ML with space studies and GIS is discussed, which will certainly help take space technology to the next level. Moreover, the application of big data to remote sensing and GIS data collection techniques, such as Big-GIS and fog GIS, can make data collection easier and prediction more accurate.
References
1 Maliene, V. (2011). Geographic information system: old principles with new capabilities.
Urban Design International 16 (1): 1–6.
2 Mukherjee, J. and Ramamurthy, B. (2013). Communication technologies and architec-
tures for space network and interplanetary internet. IEEE Communication Surveys and
Tutorials 15 (2): 881–897.
3 Marchetti, P.G., Soille, P., and Bruzzone, L. (2016). A special issue on big data from
space for geoscience and remote sensing. IEEE Geoscience and Remote Sensing Magazine
https://doi.org/10.1109/MGRS.2016.2586852.
4 Coppock, J.T. and Rhind, D.W. (1991). The history of GIS. Geographic Information
System: Principles and Applications 1 (1): 21–43.
5 Abel, D.J., Taylor, K., Ackland, R., and Hungerford, S. (1998). An exploration of GIS
architectures for internet environments, computer. Environment and Urban Systems 22
(1): 7–23.
6 Yue, P., Gong, J., Di, L. et al. (2010). GeoPW: laying blocks for geospatial processing
web. Transactions in GIS 14 (6): 755–772.
7 Zhao, P., Foerster, T., and Yue, P. (2012). The geoprocessing web. Computers & Geo-
sciences 47 (10): 3–12.
8 Cheng, T. and Teizer, J. (2013). Real-time resource location data collection and visu-
alization technology for construction safety and activity monitoring applications.
Automation in Construction 34: 3–15.
9 Cheng, E.W., Ryan, N., and Kelly, S. (2012). Exploring the perceived influence of safety
management practices on project performance in the construction industry. Safety
Science 50 (2): 363–369.
10 Pradhananga, N. and Teizer, J. (2013). Automatic spatio-temporal analysis of construc-
tion site equipment operations using GPS data. Automation in Construction 29: 107–122.
11 Agrawal, D., Das, S., and Abbadi, A.E. (2011). Big data and cloud computing: current
state and future opportunities. In: EDBT, 530–533. New York: ACM.
34 Kul Bhasin, Jeffrey Hayden, Developing Architectures and Technologies for an Evolv-
able NASA Space Communication Infrastructure, National Aeronautics and Space
Administration, Washington, DC, 2004.
35 AOS Space Data Link Protocol. Recommendation for Space Data System Standards,
CCSDS 732.0-B-2, Blue Book, Issue 2, Washington D.C., July 2006.
36 Taleb, T., Kato, N., and Nemoto, Y. (2006). An efficient and fair congestion con-
trol scheme for LEO satellite networks. IEEE/ACM Transactions on Networking 14:
1031–1044.
37 Castro, M.A.V. and Granodos, G.S. (2007). Cross-layer packet scheduler design of a
multibeam broadband satellite system with adaptive coding and modulation. IEEE
Transactions on Wireless Communications 6 (1): 248–258.
38 Lohr, S. (2012). The age of big data. New York Times 11: 11–23.
39 Buyya, R., Yeo, C.S., Venugopal, S. et al. (2009). Cloud computing and emerging IT
platforms: vision, hype, and reality for delivering computing as the 5th utility. Future
Generation Computer Systems 25 (6): 599–616.
40 Peter, M. and Grance, T. (2009). The NIST Definition of Cloud Computing (Draft), vol.
53, pp. 50, pp. 1216-1217. National Institute of Standards and Technology.
41 Lehner, W. and Sattler, K.U. (2010). Database as a service (DBaaS). In: 2010 IEEE 26th
International Conference on Data Engineering (ICDE), 1216–1217. IEEE.
42 Costa, P., Migliavacca, M., Pietzuch, P., and Wolf, A.L. (2012). NaaS: network-as-a-service
in the cloud. In: Proceedings of the 2nd USENIX conference on Hot Topics in Management
of Internet, Cloud, and Enterprise Networks and Services, Hot-ICE, vol. 12, 1–13. New
York: ACM.
43 Neumeyer, L., Robbins, B., Nair, A., and Kesari, A. (2010). S4: distributed stream
computing platform. In: IEEE International Conference on Distributed Systems and
Modelling, 170–177. IEEE.
44 P. Russom, Big data analytics, TDWI Best Practices Report, Fourth Quarter, IEEE, 2011.
45 Baertlein, H. (2000). A high-performance, high-accuracy RTK GPS machine guidance
system. GPS Solutions 3 (3): 4–11.
46 Chang, F., Dean, J., Ghemawat, S. et al. (2008). Bigtable: a distributed storage system for
structured data. ACM Transactions on Computer Systems (TOCS) 26 (2): 4–15.
47 Schadt, E.E., Linderman, M.D., Sorenson, J. et al. (2010). Computational solutions to
large-scale data management and analysis. Nature Reviews Genetics 11 (9): 647–657.
48 Dean, J. and Ghemawat, S. (2010). MapReduce: a flexible data processing tool. Commu-
nications of the ACM 53 (1): 72–77.
49 Di, L., Yue, P., Ramapriyan, H.K., and King, R. (2013). Geoscience data provenance: an
overview. IEEE Transactions on Geoscience and Remote Sensing 51 (11): 5065–5072.
50 Bonomi, F., Milito, R., Zhu, J., and Addepalli, S. (2012). Fog computing and its role
in the internet of things. In: Proceedings of the First Edition of the MCC Workshop on
Mobile Cloud Computing, 13–16. ACM.
51 Hancke, G.P. and Hancke, G.P. Jr., (2012). The role of advanced sensing in smart cities.
Sensors 13 (11): 393–425.
52 Dubey, H., Yang, J., Constant, N. et al. (2015). Fog data: enhancing telehealth big data
through fog computing. In: SE BigData & Social Informatics, vol. 14. ACM.
53 Yi, S., Li, C., and Li, Q. (2015). A survey of fog computing: concepts, applications and
issues. In: Proceedings of the Workshop on Mobile Big Data, 37–42. ACM.
54 Monteiro, A., Dubey, H., Mahler, L. et al. (2016). FIT: a fog computing device for
speech tele-treatments. In: IEEE International Conference on Smart Computing (SMART-
COMP), St. Louis, MO, 1–3. https://doi.org/10.1109/SMARTCOMP.2016.7501692.
55 F. Chen, H. Ren, Comparison of vector data compression algorithms in mobile GIS,
3rd IEEE International Conference on Computer Science and Information Technology
(ICCSIT), 2010. DOI: https://doi.org/10.1109/ICCSIT.2010.5564118
10
Application of Intelligent Data Analysis in Intelligent Transportation System Using IoT
According to a UN survey in 2014, more than half of the world's population now lives in urban areas, and this share is growing daily, which is a clear warning to city planners. Connected cities develop when internet of things (IoT) technologies and socially aware network frameworks are deployed as aggregate services across an entire connected metropolitan area. When thinking of connected urban areas, one may picture innovative cities with visible cutting-edge technologies for their residents, but small private networks have also been profiting from connecting people, organizations, city infrastructure, and services. This chapter details city transportation issues and some of the difficulties involved in building wide-area IoT systems. World-class IoT initiatives foresee working with these smart urban networks to make the use of technology more sensible, adaptable, and manageable for residents. Many cities and towns around the world are turning to socially intelligent devices to address urban problems, for instance traffic, environmental pollution, healthcare, and security surveillance, in order to improve the standard of living of their populations. Smart sensors have been installed throughout cities: in vehicles, buildings, and roadways, in control and monitoring frameworks, in security surveillance, and in the applications and gadgets used by people living or working there. Delivering information to the public through these cutting-edge services creates opportunities for smart urban communities. This large volume of information can be used to decide how public spaces are planned, how to make the best use of assets, and how to deliver administrative notices more capably, suitably, and appropriately.
With the evolution of smart cities transforming towns into digital societies and making the lives of their residents hassle free in every facet, the intelligent transport system (ITS) becomes the quintessential component. In any urban area, mobility is a key difficulty; whether going to the office, school, or college, or for any other purpose, residents use transport facilities to travel within the city. Providing residents with an ITS can save them time and make the city even smarter. ITS aims to optimize traffic performance by reducing traffic problems. It alerts drivers in advance with data about traffic, current statistics on local services, seat availability, and so on. This reduces commuters' travel time and improves their security and comfort.
ITS [1] is broadly acknowledged and used in numerous nations today. Its use is not restricted to traffic control and traffic information; it is also used for road safety and efficient infrastructure use. In view of its enormous potential, ITS has become a multidisciplinary, conjunctive field of work, and many organizations around the globe have therefore developed solutions providing ITS applications to address these issues [2].
The entire use of ITS depends on data collection, analysis, and applying the results of that analysis to the operations, control, and research concepts of traffic management, where location plays an imperative role. Here, sensors, the global positioning system (GPS), communication protocols, messaging, and data analysis play an integrated role in the implementation of:
● Data collection through smart devices or sensors. This requires exact, broad, and specific information gathering with real-time observation for intelligent planning. The information here is gathered through devices that lay the base for further ITS capabilities, such as vehicle identification and location devices, sensors, and cameras. The equipment chiefly records information such as traffic counts, surveillance, travel speed and travel time, location, vehicle weight, delays, and so forth. These hardware devices are connected to servers that store large amounts of information for further analysis.
● Data sharing with the vehicle, traffic control center (TCC), and authorities. Data should be communicated to the vehicle, the TCC, and the authorities in real time. The efficiency of data transmission to the stakeholders of the ITS is a key factor for the system, and the communication protocol plays an important role in real-time data transmission to all stakeholders. Through data sharing, many problems can be easily resolved, and it also helps to prevent critical situations.
● Analysis of the gathered data. The information gathered and received at the Traffic Management Center (TMC) is processed in numerous steps: error rectification, data cleansing, data synthesis, and adaptive logical analysis. Inconsistencies in the information are diagnosed with specialized software and rectified. The record is then further altered and pooled for evaluation, and this mended, collective data is analyzed to anticipate the traffic situation and deliver appropriate information to users. Many tools are available for this analysis; a minimal sketch of the overall collection, analysis, and notification flow is given after this list.
● Send real-time information to the passengers. A passenger information system (PIS) is used to communicate transportation updates to passengers. The framework conveys real-time data such as travel time, travel speed, delays, accidents on roads, changes in route, diversions, work-zone conditions, and so on. This data is conveyed through a wide variety of electronic devices such as variable message signs, highway advisory radio, websites, SMS, and customer care (Figure 10.1).
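The sketch below, referenced in the analysis item above, strings the four steps together in miniature: collecting sensor records, analyzing them at the management center, and turning the result into a passenger notice. All field names, thresholds, and messages are illustrative assumptions, not part of any deployed ITS.

```python
from statistics import mean

def collect(sensor_readings):
    """Data collection: raw records from roadside sensors (illustrative fields)."""
    return [r for r in sensor_readings if r["speed_kmh"] is not None]  # drop bad records

def analyze(records):
    """Analysis at the TMC: pool the records and estimate the current travel speed."""
    return {"mean_speed_kmh": mean(r["speed_kmh"] for r in records),
            "vehicle_count": len(records)}

def notify_passengers(summary, free_flow_kmh=60):
    """Passenger information: turn the analysis into a human-readable update."""
    delayed = summary["mean_speed_kmh"] < 0.5 * free_flow_kmh
    return "Heavy congestion - expect delays" if delayed else "Traffic is flowing normally"

readings = [{"vehicle_id": i, "speed_kmh": s} for i, s in enumerate([22, 25, None, 19, 28])]
summary = analyze(collect(readings))
print(summary, "->", notify_passengers(summary))
```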
snarls, and collisions. ITS on buses and trains is used to manage and adapt fleet operations
and to provide automated ticketing and real-time traffic information to passengers. ITS on
the roadside is used to coordinate traffic signals, detect and manage events, and display
information for passengers, drivers, and pedestrians.
There are various types of services that can be given to users (Figure 10.2). The number of services can change (services can be added or removed) from time to time according to the situation and public demand; some of the important services for the user are shown in Figure 10.2 [6].
In the current scenario, most countries have a partial ITS, that is, not all the services are implemented properly. For instance, India has services such as automatic toll payment plazas (not yet implemented in all parts of the country), electronic payment, and a traffic information system (again, not yet implemented in all parts of the country). But there are a lot of developments still required, such as an accident management system, live information on public transport, intelligent parking, etc. Parking is a common problem in all major metros and big cities, and people have no information about the vacancy of parking available in nearby areas. Most of the traffic increase is due to the
parking problem, so intelligent parking plays a vital role in resolving the traffic problem,
which is essentially congestion from lack of parking space.
as they are continuously updated by a real-time traffic monitoring system. ITS traffic control diverts activity away from busy or hazardous zones, counteracting congested roads in addition to decreasing the danger of crashes.
● Benefits of public transport. The use of electric and hybrid buses and other public transport vehicles reduces pollution. Providing online data to buses, trains, and their travelers makes for better-informed travelers and administrators. E-ticketing enables faster, easier travel by public transport and gives administrations data with which to upgrade the transport system. Smart cards for public transport save passengers time and money.
● Reduced damage to infrastructure. Heavy vehicles can put too much stress on the road network, especially when they are overloaded. Weigh stations and other older forms of weight control catch overweight vehicles, but at the cost of time and delayed traffic. A weigh-in-motion system measures the size and weight of a vehicle while transmitting the collected data back to a central server. Overloaded vehicles can be recognized and suitable measures taken, which results in higher compliance from operators and drivers and fewer lengthy detours. Not only do these systems simplify enforcement, they can also reduce the expense of road repairs, so that funds can be allocated elsewhere.
● Traffic and driver management benefits. Online and roadside data are made available to drivers, and vehicles are equipped with driver assistance systems to enhance the efficiency and safety of road transport. Management of vehicle fleets, both freight and public transport, through online data and two-way communication between manager and driver helps to limit confusion [7, 8].
● Reduced parking problems. Illegal parking contributes to busy streets, makes a city hazardous, and causes problems for other drivers; city vehicles and others need spaces in which to park. Stopped vehicles cause time-consuming traffic to build up in crowded areas, because visitors find these places very difficult to park in. Manual parking management systems can be expensive and ineffective, and they can also add more people to the crowded areas. Cities should explore smart parking systems, which scan parked vehicles and communicate with parking meters to find and record unauthorized parked vehicles. Rather than requiring a human parking enforcement officer, smart parking systems let drivers know that they will be automatically identified for illegal or extended parking. These automated systems help improve traffic flow by enhancing conformity to parking rules and keeping ample parking spaces available for use.
transport system is integrated into the infrastructure and the vehicle itself, these techniques can relieve crowded areas, improve safety, and increase productivity.
Congestion in urban rush-hour traffic is an extraordinary danger to any economy and to the natural environment, and it adversely affects the quality of individual life. Traffic jams call for dedicated systems to reduce travel time, queues, and harmful gas emissions, particularly the unused time spent sitting at long traffic lights. Traditional traffic management arrangements face numerous restrictions, primarily because of the lack of sufficient communication between vehicles themselves and of additional vehicle-to-infrastructure connections. In the past couple of years, there have been efforts to build up an assortment of vehicle automation and communication systems (VACS) [9] for enhancing traffic capacity and safety. Additionally, there have been significant advancements in intelligent automated vehicles for this purpose.
● RFID is used for vehicle-to-infrastructure communication and has proved useful in identifying vehicles, facilitating parking and automatic toll collection. Although RFID's communication range is limited, ITS can still be supported by this technique in different ways. One RFID application that has not been widely implemented is to fix RFID tags to traffic signals, which can then be accessed and read by vehicles passing nearby.
● Bluetooth [11] is used primarily to collect the MAC addresses of mobile devices in passing vehicles in order to build a source-destination chart (see the sketch after this list). This information can be used in ITS applications to predict future traffic flows. Dynamically, Bluetooth can be used to portray the traffic mass in specific areas of the city. Compared to RFID, however, the communication distance of Bluetooth is very restricted.
● Wi-Fi is broadly used in ITS experiments as a communication technology, particularly the 802.11a variant that works at 5.8 GHz. The reason for this is the availability of the ad hoc mode and the communication range, which is roughly 50 m. It is capable of duplex communication and natively supports TCP/IP.
● 2G, 3G, 4G, and 5G mobile communications have been widely used in vehicles for decades to provide location-based services such as real-time maps and internet access. These mobile communications can be used to connect various devices to vehicles, as well as vehicles to infrastructure.
● GPS is used in vehicles to obtain the accurate position of the vehicle. Location-based services are not possible without the help of GPS. GLONASS and Galileo are other options, and these services are the main competitors of GPS.
● WiMAX was initially expected to be included in VANETs due to its long range of 1 km. However, it had a slow start, and several communication organizations stopped using it. This has led to a shortage of hardware supporting it, and experimentation has generally stayed at the simulation level.
● ISM RF is license free. It offers duplex wireless communication that can cover from 20 m to 2 km, depending on the equipment fitted, such as antennas and RF power. Researchers can use any wireless protocol with it for their research work. It is difficult, however, to mix and match components from different device manufacturers, since there is no common protocol. The frequencies used are 868 MHz, 915 MHz, and 2.4 GHz.
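The sketch below, referenced in the Bluetooth item above, shows one way a source-destination chart could be assembled from MAC-address sightings at two roadside scanners. The scanner names, MAC addresses, and timestamps are fabricated for illustration; a real deployment would also anonymize the addresses.

```python
from collections import Counter

# Time-stamped MAC sightings at two roadside Bluetooth scanners (all values fabricated).
sightings = {
    "scanner_A": {("aa:bb:cc:00:00:01", 100), ("aa:bb:cc:00:00:02", 105), ("aa:bb:cc:00:00:03", 110)},
    "scanner_B": {("aa:bb:cc:00:00:01", 400), ("aa:bb:cc:00:00:03", 520)},
}

def od_counts(origin, destination):
    """Count devices seen first at `origin` and later at `destination`."""
    seen_at_origin = {mac: t for mac, t in sightings[origin]}
    trips = Counter()
    for mac, t in sightings[destination]:
        if mac in seen_at_origin and t > seen_at_origin[mac]:
            trips[(origin, destination)] += 1
    return trips

print(od_counts("scanner_A", "scanner_B"))  # two devices travelled from A to B
```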
● Management of communication between two devices over long distances. The distance between any two communicating devices plays a vital role in whether the communication is reliable, whether it stays connected without repeatedly connecting and disconnecting, and how much data can be sent without noise. In one-hop communication, a message can travel only when direct communication from source to destination is established. In multi-hop communication protocols, the long distance is covered by intermediate nodes. However, the complexity of these protocols means that more research and testing is required to confirm that a message can travel the desired distance from source to destination. Some multi-hop protocol implementations use the infrastructure as intermediate devices.
● Effect of bandwidth and MAC (medium access control). Bandwidth concerns the amount of information, measured in bits per second, that can be carried for any message on a communication network from source to destination. Because many wireless devices transmit on the same frequency, a MAC protocol is used to avoid collisions and to identify dropped information. Different services require different bandwidths: emergency or urgent messages are generally small and require guaranteed bandwidth, whereas standard VANET information requires high bandwidth and large messages.
● Internet on the vehicle. Current traffic operation strategies are based mostly on the TCC or another centralized control room. When the scenario is complex, this consumes a significant amount of computing resources. With the help of internet-on-vehicle technology, part of the decision-making work can be distributed to the vehicles. By increasing information exchange between vehicles, many decentralized solutions can be proposed, which provide efficient traffic operations and ensure system scalability.
● Handling of emergency situations. How often a vehicle interacts with smart traffic lights matters for the efficiency of optimization algorithms and for bandwidth usage. With too few messages, the traffic lights become outdated or inoperative; with too many messages, bandwidth conflicts multiply and traffic light controllers may not have sufficient time to process the information needed by adaptive algorithms. In case of emergency, real-time information is important, such as an accident that has already happened, an unavoidable imminent collision, a red-light violation, or the involvement of pedestrians (via mobile devices) and bicycle traffic at a light. Apart from this, emergency vehicles, heavy vehicles, and transit buses are also given priority in traffic light schedules.
● Efficient use of data received from multiple sources. Compared to a single source, multiple sources provide complementary data, and fusing data from multiple stores can reduce the uncertainty associated with individual sources, creating a better understanding of the observed condition. It is also quite expensive to install and maintain sensor suites; with fusion, if a failure causes one or more sensors to deteriorate, this only reduces the input rate and has less effect on the performance of the system. With the help of ITS, we can obtain real-time information from many sources of traffic data, which allows us to manage the transport system more efficiently.
● Verifying models in ITS. Although countless models have been proposed, there is no appropriate method to demonstrate their worth. On one hand, the transportation system is sophisticated and simulation cannot give a comparable forecast of the real world; on the other hand, a real-world check on a large scale is impractical. In the future, with the improvement of ITS, we may be able to use more of the gathered information to close the gap between simulation and the real world.
● Security and privacy of gathered data or information. Wireless communication is notoriously easy to eavesdrop on. With cheap storage and big data analysis, the information collected from a passing car and its recorded communications can disclose a great deal about a person, for example, their social media data, their schedule, and their shopping habits. This information can be sold for advertising or used to hijack their normal routes. A more serious concern is an actively malicious communication node that spreads wrong information or blocks communication entirely. This is extremely risky because it casts doubt on a system used for emergencies, and doubt in such a system will make it obsolete.
Any new communication mechanism such as VANET could be exploited by malicious users, either for fun or with the intention of harming a particular user of the technology. In addition, modifying hardware to send wrong information exposes local government, manufacturers, and other intermediaries to the risk of attacks on their central computers. Vehicle performance can also be degraded through attacks that use the VANET as a gateway (Figure 10.3).
Figure 10.3 Challenges and opportunities in the implementation of ITS: management of communication between two devices over long distances, effect of bandwidth and MAC, handling of emergency situations, and internet on the vehicle.
compared to traditional data. Instead of placing the statistical emphasis on the model, machine learning places the emphasis on the learning algorithms.
Data always play an important role in scientific research. Data and information are important in modeling, analyzing, and tackling scientific problems, and data mining is the best-known tool for processing information appropriately. In the early days of data mining, the data sets to be handled were generally clean and of moderately small scale. With the improvement of data collection techniques, the sources of data have become richer and more pervasive, which directly leads to an increase in the amount of data. At the same time, the gathered data have shown a strongly continuous and heterogeneous character, with a certain proportion of messy data mixed in as well. Traditional data mining is therefore confronting extraordinary difficulties. To overcome these difficulties, IDA strategies with higher processing speed, higher precision, and higher effectiveness have been created. During the development of IDA, different researchers have contributed revised or updated algorithms from time to time to solve particular problems in different fields. Past trends in the development of IDA include the following:
● The concept of IDA has become smarter.
● Data sets are getting bigger day by day.
● The organization of data sets has been transformed from structured to unstructured/heterogeneous data sets.
Initially, the coverage area of IDA was climate, geography, medical science, physics, image processing, systems engineering, and so on, but now the application of IDA has spread to e-commerce, the public and government sectors, social networks, and more. Some IDA algorithms have made great progress, but there are still challenges in data analysis. These challenges include the following:
● Feature selection in imbalanced data sets. Although data preprocessing can extract the most important features of a multisource, asymmetric data set, the result can still be an imbalanced data set. Conventional IDA algorithms pursue generalization, so the minority classes in an imbalanced data set tend to be neglected. However, as in fault diagnosis, the minority classes often contain valuable information. Therefore, special feature selection algorithms for imbalanced data need to be developed.
● Distributed data analysis. In distributed computing, high-performance computing technology can upgrade the capacity of each node, so that the capability and processing speed of the whole distributed system can be further reinforced. One of the principal approaches in high-performance computing is to exploit the capacity of graphics processing units (GPUs) to perform massive repetitive computations, freeing central processing unit (CPU) resources for more complex computational work, in other words, an arrangement achieved without any changes to the hardware.
● Big data analytics. Although distributed data analysis platforms can provide an answer for coping with data of huge size, modeling in a big data environment remains a challenge. There are two effective approaches to the big data analytics problem: (a) using machine learning, in particular deep learning algorithms, to process huge multidimensional data, and (b) dividing the whole data set into subsets, developing a submodel on each subset, and then obtaining the whole model by logically integrating the submodels (a minimal sketch of this second approach follows this list).
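The sketch referenced in the last item above divides a data set into subsets, trains one submodel per subset, and integrates them by majority vote. The synthetic data and the choice of logistic regression are assumptions for illustration; any combination rule could replace the vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

# (b) Split the whole data set into subsets and develop one submodel per subset.
subsets = np.array_split(np.arange(len(X)), 3)
submodels = [LogisticRegression(max_iter=1000).fit(X[idx], y[idx]) for idx in subsets]

# Integrate the submodels logically: here, by majority vote over their predictions.
votes = np.stack([m.predict(X) for m in submodels])
combined = (votes.sum(axis=0) >= 2).astype(int)
print("agreement with labels:", (combined == y).mean())
```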
realistic simulations and mathematical models. The aim of all these methods is to help
decision-makers with reliable and scientific data.
● Addressing traffic congestion and parking. Fifty percent of the world's population lives in cities, and the population of cities is growing at roughly 2% each year. While incremental growth is good for a city's economic health, this increase usually causes more problems.
Data analysis is used to assist town planners in establishing the causes of this growth. A planner can now inspect the origin of visitors, the experiences encountered during their visit, and the final purpose of the travel, and can make sure that there is parking available downtown. Town planners can use data analytics to find the parking slots preferred by most drivers.
● Long-term traffic. American census data show that the average American worker spends 20% more time commuting today than in the 1980s. Attracting visitors and businesses is a positive step for cities; however, it is never easy to scale road capacity to handle the additional traffic. Big data and IoT can be used to determine journey lengths, identify the long journeys, establish where each journey begins and ends, and who is making it. The planner can then assess whether those passengers have access to practical driving alternatives. Analytics can show the locations with the longest travel times and the final miles traveled, along with the time intervals. These data help to identify alternative routes and encourage drivers to use them.
Traffic has a significant impact on people living in cities and on the efficiency of cities during busy times. Despite population growth, efficient use of data and sensors will help manage traffic efficiently. The goal of smart traffic management is to make urban driving more intuitive and efficient. As smart cities develop, services and infrastructure will be further integrated, and over time issues such as traffic, waste management, and energy conservation will benefit greatly from the internet of things and big data.
The entire process of analysis is depicted in Figure 10.4. When a device is used, it registers on the platform and its basic information becomes available for a meaningful model search. Its examples are then stored in the local database. The system analyzes these examples and performs meaningful annotation and association building according to the knowledge model. When service requirements are detected, meaningful reasoning and analysis are invoked to find related services. Finally, users' models are analyzed and the proper services are generated.
IDA transforms data into working knowledge, allowing transport users to make informed
decisions to ensure the safe and efficient use of facilities. For example, in this type of system,
each passenger has access to the most reliable and up-to-date position of almost all trans-
port modes from any point on the transport network. Passenger devices use information
about traffic from smartphones, tablet computers, and roadside information. Then they can
choose the mode and route that will provide them with minimum travel time and dynamic
adjustments with real-time information.
Figure 10.4 Analysis process: sensor/device authentication and registration of basic information, searching local databases, analysis of properties, and analysis of the model.
vehicle tracking research, planners must exploit any additional prior information. As vehicles drive on roads, on-road constraints or road-map information can be considered preexisting knowledge. Information about on-road constraints has been applied directly to form a soft constraint by filtering the directional process noise. By modeling the road as a sequence of linear segments, the tracking problem becomes state estimation with linear equality constraints, and it has been demonstrated that the optimal solution is a projection of the Kalman filter estimate. Other constraint-based filters have also been applied to the on-road target tracking problem. More recently, road-map information has been incorporated by tracking the vehicle in road coordinates, so that the constrained state estimation problem becomes a standard unconstrained one. In [14], a single vehicle is tracked in a one-dimensional road coordinate system in which only the longitudinal speed of the vehicle is considered; thus, this approach is only applicable in single-lane cases. This method is extended to multiple dimensions by adding an extra dimension in order to model several lanes [15]. The longitudinal and lateral motions of the vehicle are then estimated individually.
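To make the one-dimensional road-coordinate idea of [14] concrete, the following is a minimal constant-velocity Kalman filter tracking a vehicle's longitudinal position along a road. The noise levels, initial state, and measurements are invented for illustration; the references' own constrained formulations are more elaborate than this sketch.

```python
import numpy as np

dt = 1.0
F = np.array([[1, dt], [0, 1]])   # constant-velocity motion along the road
H = np.array([[1, 0]])            # only the longitudinal position is measured
Q = np.diag([0.05, 0.05])         # process noise covariance
R = np.array([[4.0]])             # measurement noise covariance (metres**2)

x = np.array([[0.0], [10.0]])     # initial state: position 0 m, speed 10 m/s
P = np.eye(2)

measurements = [9.8, 20.5, 29.7, 41.2, 50.3]  # noisy positions along the road

for z in measurements:
    # Predict the next state along the road coordinate.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the new position measurement.
    residual = np.array([[z]]) - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ residual
    P = (np.eye(2) - K @ H) @ P

print("estimated position and speed:", x.ravel())
```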
With the IoT, modern vehicles and transport systems form the backbone of the future ITS. Connected autonomous vehicles are becoming reality; Google's public commitment to self-driving cars, and similar efforts by other companies, are examples of industry attempts in this direction. Developers of automated driving platforms are working to support the use of their technologies by various types of automobile companies.
Figure 10.5 Security in ITS as a multidimensional model: security at the different layers, together with message integrity and authenticity.
Security is the foundation of ITS, since ITS systems must be secure before they can improve the effectiveness of the surface transport infrastructure. ITS, as an information system, should be protected so that ITS applications are trustworthy, reliable, and available. In summary, we describe security in terms of a three-dimensional model with multiple layers, as shown in Figure 10.5.
● ITS ensures the security at all layers of the networking system, such as physical layer secu-
rity, data link layer security, network layer security, transport layer security, and applica-
tion layer security.
● Information security and operational security are the other dimensions of security in ITS. Whatever information is transmitted from vehicle to vehicle, from vehicle to control center, or from vehicle to devices/sensors during operations must be secure and not accessible to unauthorized persons. Ensuring security during information sharing and operations is therefore a crucial part of security in ITS.
● Privacy: the contents of the transit information transmitted in the ITS must remain confidential, so that only the sender and the intended receiver understand the transmitted message; because it is a content-based network, the truthfulness and confidentiality of the information are key. Message integrity requires that the data transmitted between sender and receiver remain unchanged in transit for any reason. Availability means ensuring that network communication keeps working without interruption. Non-repudiation means that the sender and the receiver are both capable of identifying the other party involved in the communication. Authentication ensures message integrity and attaches the sender's digital signature to the message.
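The integrity and authenticity requirements above can be illustrated with a small sketch. Real ITS deployments use certificate-based digital signatures (for example, ECDSA under IEEE 1609.2) rather than a shared key; the HMAC below is only a stand-in to show how a tampered message is detected, and all keys and field names are fabricated.

```python
import hmac, hashlib, json

shared_key = b"demo-key-not-for-production"   # a real ITS would use certificates, not a shared key

def sign(message: dict) -> dict:
    """Attach an authentication tag to a message before transmission."""
    body = json.dumps(message, sort_keys=True).encode()
    tag = hmac.new(shared_key, body, hashlib.sha256).hexdigest()
    return {"body": message, "tag": tag}

def verify(packet: dict) -> bool:
    """Check that the received message is unchanged and comes from a key holder."""
    body = json.dumps(packet["body"], sort_keys=True).encode()
    expected = hmac.new(shared_key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, packet["tag"])

packet = sign({"vehicle_id": "V-42", "speed_kmh": 57, "event": "brake_light_warning"})
print(verify(packet))                      # True: message unchanged in transit
packet["body"]["speed_kmh"] = 120
print(verify(packet))                      # False: integrity violation detected
```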
While software is the main area of innovation and value in modern vehicles, the additional complexity comes at a cost. Experts estimate that software and electronics already account for 35–40% of the cost of some vehicles, and in the future this share may reach 80% for some vehicle types. In addition, more than 50% of car warranty costs can be attributed to electronics and their embedded software. In February 2014, after the discovery of faulty software in the hybrid control system, the carmaker Toyota recalled 1.9 million hybrid cars worldwide. The software fault could stop the hybrid control system while driving, resulting in a loss of electrical power and the vehicle suddenly coming to a stop.
Moreover, security-related incidents are a dangerous threat. Exploiting vulnerabilities within the vehicle can permit remote control of car components, where an attacker can turn off the lights or even control the brakes while the vehicle is moving. Quite recently, attacks on vehicles exploited vulnerabilities enabling hackers to use the brakes, kill the engine, and steer the vehicle over the internet; the affected company urged car owners to update their cars' software packages to patch the known vulnerabilities. There is currently no standard policy on security for transportation. In addition, methods and tools for estimating combined physical and cyber risks are rare and provide only limited guidance for the transport sector on how to assess these risks. Nevertheless, such combined risks are expected to increase. This has highlighted the need for transport-specific tooling to help in the analysis of joint physical attacks and cyberattacks, especially those posing extremely high risks, on interdependent and dependent land transport systems. Steps should be taken to provide a method for multi-risk analysis of the mutually dependent ITS. This analysis should be based on a detailed and reliable vulnerability assessment, pointing out how joint attacks can produce cascading effects and how ITS can propagate awareness of these threats during deployment. Apart from this, the effectiveness of these efforts should be enhanced by introducing a method for analyzing the necessary level of detail and problem range, along with a scalable framework that can be extended to include additional systems and threat scenarios.
IDA in ITS enables intelligent decisions for better transport security and traffic effectiveness through sensors fitted in vehicles, supported by cooperation and wireless communication. Like other control systems, an ITS should work properly to limit losses due to system failure, such as intrusion or communication breakdown.
In the vehicular communication system, network nodes broadcast beacon messages to vehicles and roadside unit (RSU) receivers to share information related to current conditions, the environment, traffic situations, and incidents. Beacons can also be transmitted sporadically to one or several vehicles. Generally, one beacon message is sent by a vehicle every hundred to one thousand milliseconds. Such communication is generally referred to as vehicle-to-vehicle (V2V) or vehicle-to-infrastructure (V2I). For information shared between participating entities, messages can also be sent through unicast, multicast, or geocast. Roadside-aided routing completes the IoT infrastructure through different communication channels. All vehicles share data about themselves using these signals, for example, speed, location, road-condition events, accident location, lane-change/merge warnings, brake-light warnings, emergency-vehicle warnings, and so forth. RSUs send warning messages principally about road and weather conditions. Again, cloud-based services given to the vehicle, for example, vehicle maintenance reminders or e-tolling, may require a session to be set up between the vehicle and an RSU using unicast messages.
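A beacon of the kind described above can be sketched as a small periodic message. The field names, values, and 500 ms interval are illustrative assumptions within the hundred-to-one-thousand-millisecond window mentioned; an actual stack would emit such messages over DSRC or cellular V2X rather than printing them.

```python
import json, time
from datetime import datetime, timezone

def make_beacon(vehicle_id, speed_kmh, lat, lon, event=None):
    """One V2V/V2I beacon; the field names are illustrative, not a standard format."""
    return {
        "vehicle_id": vehicle_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "speed_kmh": speed_kmh,
        "position": {"lat": lat, "lon": lon},
        "event": event,          # e.g., "accident_location", "lane_change_warning"
    }

# Emit a beacon every 500 ms (within the 100-1000 ms window mentioned above).
for _ in range(3):
    beacon = make_beacon("V-42", 48, 28.6139, 77.2090)
    print(json.dumps(beacon))    # in a real vehicle or RSU, this would be broadcast
    time.sleep(0.5)
```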
IDA plays an important role in the security of ITS. Through smart data analysis, the system can easily identify congestion, traffic, accidents, and driver intentions, and take corrective action. For example, suppose a car is moving abnormally on a road; cameras can identify the abnormal or unwanted activity and report the vehicle to the control center or server so that corrective action can be initiated. Such abnormal activity may be due to speeding, drinking and driving, criminal activity, or a health problem. The following steps should be taken to secure the transportation system with the help of IDA:
● Identify the real usage of ITS. First, list the realistic ITS uses and applications required, reflecting a series of real processes. Most of these applications can be encouraged or
Companies that do not take advantage of data analytics tools and techniques will be left behind. Since data analytics tools automatically ingest and analyze the data, as well as provide information and predictions, they can improve prediction accuracy and refine the models. This section discusses data analytics tools for success. Organizations can analyze the data and extract actionable, commercially relevant information to boost performance. There are many extraordinary analytical tools available, and you can take advantage of them to enhance your business and develop skills. Good IDA tools share a number of properties; some of the tools that support IDA and are best suited for the analysis of ITS are:
● See5. See5/C5.0 is designed to analyze millions of records and hundreds of numeric, time, date, or nominal fields. To accelerate analysis, See5/C5.0 can take advantage of one or more CPUs (including Intel hyper-threading) on computers with up to eight cores. To maximize interpretability, See5/C5.0 classifiers are expressed as sets of decision trees or as rules, which are generally easier to understand than neural networks. The tool is available for Windows as well as for the Linux operating system, and RuleQuest provides C source code so that See5 classifiers can be embedded in an organization's own systems.
● Cubist. This tool shares some features with See5; for example, it can also analyze millions of records and hundreds of fields at a time. To maximize interpretability, Cubist models are expressed as collections of rules, where each rule has an associated multivariate linear model. Whenever a case matches a rule's conditions, the associated model is used to calculate the predicted value. The tool is also available for Windows as well as the Linux operating system. Cubist is easy to learn, and advanced knowledge of statistics or machine learning is not necessary.
● Inductive learning by logic minimization (ILLM). This tool induces classification models in the form of rules, which reveal hidden relationships in the data.
The following programming languages can support IDA for the ITS:
(Figure: Python's built-in data types relevant to IDA, including numbers, strings, lists, tuples, and dictionaries.)
can be easily produced with R programming, such as histograms, bar charts, pie charts, box plots, line graphs, and scatter plots. R programming can be especially useful for IDA because ITS produces large amounts of homogeneous and heterogeneous data daily. For instance, deep analysis is required to identify dangerous driving in real time, so that information about the offending vehicle can be sent to the concerned authorities immediately for quick action (Table 10.1).
References
1 Rodrigue, J.P., Comtois, C., and Slack, B. (2013). The Geography of Transport Systems.
New York: Routledge.
2 Sussman, J. (2005). Perspectives on Intelligent Transportation Systems (ITS). Boston, MA:
Springer Science + Business Media.
3 Cobo, M.J., Chiclana, F., Collop, A. et al. (2014). A bibliometric analysis of the intelli-
gent transportation systems research based on science mapping. IEEE Transactions on
Intelligent Transportation Systems 15 (2): 901–908.
4 Diakaki, C., Papageorgiou, M., Dinopoulou, V. et al. (2015). State-of-the-art and -practice
review of public transport priority strategies. IET Intelligent Transport Systems 9:
391–406.
5 Eriksson, O. (2002). Intelligent transport systems and services (ITS). In: Information
Systems Development (eds. M. Kirikova, J. Grundspenkis, W. Wojtkowski, et al.). Boston,
MA: Springer.
6 Y. Lin, P. Wang, and M. Ma, Intelligent Transportation System (ITS): Concept, Challenge
and Opportunity. IEEE Conference on Big Data Security on Cloud (2017).
7 Chen, B. and Cheng, H. (2010). A review of the applications of agent technology in traf-
fic and transportation systems. IEEE Transactions on Intelligent Transportation Systems
11 (2): 485–497.
8 Barmpounakis, E.N., Vlahogianni, E.I., and Golias, J.C. (2015). A game theoretic
approach to powered two wheelers overtaking phenomena. In: Proceedings of the Trans-
portation Research Board 94th Annual Meeting, Washington, DC, USA, 1–4.
9 A. Maimaris & G. Papageorgiou, “A Review of Intelligent Transportation Systems from
a Communication Technology Perspective,” IEEE International Conference on ITSC,
Brazil 2016.
218 10 Application of Intelligent Data Analysis in Intelligent Transportation System Using IoT
10 Cheng, W., Cheng, X., Song, M. et al. (2012). On the design and deployment of RFID
assisted navigation systems for VANETs. IEEE Transactions on Parallel and Distributed
Systems 23 (7): 1267–1274.
11 Liu, Y., Dion, F., and Biswas, S. (2005). Dedicated short-range wireless Communications
for intelligent transportation system spplications:state of the art. Transportation Research
Record: Journal of the Transportation Research Board 1910: 29–37.
12 F. Victor & Z. Michael, “Intelligent Data Analysis and Machine Learning:are they Really
Equivalent Concepts?” IEEE conference RPC, Russia, (2017).
13 Lu, H., Zhiyuan, S., and Wencong, Q. (2015). Summary of big data and its application in
urban intelligent transportation system. International Journal on Transportation Systems
Engineering and Information 15 (05): 45–52.
14 Ulmke, M. and Koch, W. (2006). Road-map assisted ground moving target tracking.
IEEE Transaction Aerospace Electronic system 42 (4): 1264–1274.
15 Chen, Y., Jilkov, V.P., and Li, X.R. (2015). Multilane-road target tracking using radar and
image sensors. IEEE Transaction Aerospace Electronic System 51 (1): 65–80.
11
Applying Big Data Analytics on Motor Vehicle Collision Predictions in New York City
11.1 Introduction
11.1.1 Overview of Big Data Analytics on Motor Vehicle Collision Predictions
Due to population growth, there are more traffic accidents, which have become a global concern. In the past, researchers have conducted studies to find the common causes of motor vehicle collisions. Log-linear models, data mining, logical formulation, and fuzzy ART are some of the methods widely used in this research [1]. Even with the use of these methods, data analysis is a complicated process. However, with the improvements in technology, data mining has proven to be a highly accurate method for the analysis of big data.
With the use of big data applications, a few researchers have focused mainly on understanding the significance of vehicle collisions. Research performed by Shi et al. [2] explains the significance of identifying traffic flow movements on highways to minimize the impact of vehicle collisions. That research used a time series data approach with clustering analysis to comprehend traffic flow movements; the data were converted using the cell transformation method.
A similar study by Yu et al. [3] considers traffic data mining to be a big data approach. The authors carried out the study using past traffic data with the application of common rules via data mining, based on a cloud computing technique. Traffic trend prediction and accident detection within a MapReduce framework were used. However, the data sets had many missing pieces and were redundant, with values that could not be normalized [1].
Data mining is a widely used computing technique adopted to determine unfamiliar patterns in a large data set [4]. For prospective use, data mining is a comprehensive field that presents meaningful patterns. These techniques are classified into three categories, classification, prediction, and data clustering, in order to trace seasonal trends and patterns [4]. Classification is a commonly used method that forecasts unknown values according to a generated model [5].
This chapter proposes a framework that uses a classification technique for collision predictions, implemented in the Python programming language. Python is an effective
programming language that is used in theoretical and mathematical analysis of large data sets [6]. The Python programming language supports various data mining techniques and algorithms, mainly for clustering and classification. As it has many beneficial features, it is one of the most suitable tools for building scalable applications. Thus, it can be utilized in the framework of big data analysis on large motor vehicle collision data sets to obtain reliable results.
A specific data set about vehicle collisions in a large city in the United States (New York City) has been obtained for this chapter. A data mining technique is exercised to perform further analysis on the data set. Based on recent news reports, New York City (NYC) roads are believed to have seen an increase in motor vehicle collisions. The National Safety Council has conducted a preliminary survey confirming that 2016 was the deadliest year on NYC roads in several decades [7]. Thus, there is a need to predict and assess the association between vehicles involved in a collision and their attributes found in the data set. This chapter uses an analytical approach to data mining, which forecasts relevant attributes corresponding to the source of other related attributes. An analysis of variance (ANOVA) table, k-means clustering, and the k-nearest neighbor (kNN), naïve Bayes, and random forest classification algorithms are used in this chapter to understand the association of the statistical data collected.
[Table: Attributes of the motor vehicle collision data set and their descriptions.]
raw data. Subsequently, data integration links the data into a reliable structure. In the last
part of preprocessing, data is converted into acceptable forms for data mining [8].
During data preprocessing, the vehicle type code is further categorized into four groups, depending on vehicle passenger capacity and the size of the vehicle. This derived attribute is then added to the data set for analysis. Table 11.2 illustrates the categorized groups.
[Figure 11.1: Overall data analysis process - the raw motor vehicle collision data set (26 attributes, 1,048,575 records) is filtered and grouped into Large, Medium, Small, and Very Small vehicle categories, and the results are used for predictions of motor vehicle collisions to reduce road risks.]
The Python language supports data mining, combined with categorized data inputs, for measuring statistical learning schemes. Relevant statistical data is generated rapidly and precisely. Further, the visualization of the input data and the learning outcome for a large data set becomes clearer [6].
Figure 11.1 illustrates the overall data analysis process of this chapter.
11.3 Classification Algorithms and K-Fold Validation Using Data Set Obtained from NYPD (2012–2017)
[Table: Classifiers used in this chapter and their descriptions.]
Step 1:
LOAD NYPD_collision_data
STRING [] VTC = SELECT Vehicle_Type_Code
READ VTC
IF VTC = " ", "Unknown", "Other"
    DELETE
ELSE
    ADD
END IF
SAVE Filtered_Collision_Data [n = 998,193]
Step 2:
LOAD Filtered_Collision_Data
STRING [] Date = SELECT Date
SEPARATE Date = "Day_of_the_week", "Day", "Month", "Year" manually
SAVE Separated_Filtered_Collision_Data [n = 998,193]
LOAD Separated_Filtered_Collision_Data
STRING [] VTC = SELECT Vehicle_Type_Code
READ VTC
FOR n = 1 to 998,193
    IF VTC = "Bus, Fire Truck, Large Commercial Vehicle" THEN
        SAVE VTC as "Large Vehicle"
    END IF
    IF VTC = "Pick-up Truck, Small Commercial Vehicle, Van, Livery Vehicle" THEN
        SAVE VTC as "Medium Vehicle"
    END IF
    IF VTC = "Passenger Vehicle, Sport-Utility/Wagon, Taxi, Pedicab" THEN
        SAVE VTC as "Small Vehicle"
    END IF
    IF VTC = "Motorcycle, Scooter, Bicycle" THEN
        SAVE VTC as "Very Small Vehicle"
    END IF
END FOR
SAVE Grouped_Filtered_Collision_Data [n = 998,193]
Step 3:
SELECT NAIVEBAYES
FOR k = 1 to 10 DO
    Learn NaiveBayes based on predictions
END FOR
RETURN k = Accuracy% (k1), Accuracy% (k2), Accuracy% (k3), ..., Accuracy% (k10)
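The pseudocode above can also be expressed as a short Python sketch. The snippet below is a hedged illustration only: the file name, column names, and feature list are assumptions rather than the chapter's exact attributes, and scikit-learn's GaussianNB with cross_val_score stands in for the Naive Bayes k-fold procedure.

import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Assumed (hypothetical) file and column names; adjust to the actual NYPD export.
df = pd.read_csv("NYPD_collision_data.csv")
vtc = df["VEHICLE TYPE CODE 1"].fillna("").str.upper()
df = df[~vtc.isin(["", "UNKNOWN", "OTHER"])].copy()            # Step 1: drop unusable records

group_map = {                                                   # Step 2: four vehicle groups
    "BUS": "Large Vehicle", "FIRE TRUCK": "Large Vehicle",
    "LARGE COMMERCIAL VEHICLE": "Large Vehicle",
    "PICK-UP TRUCK": "Medium Vehicle", "SMALL COMMERCIAL VEHICLE": "Medium Vehicle",
    "VAN": "Medium Vehicle", "LIVERY VEHICLE": "Medium Vehicle",
    "PASSENGER VEHICLE": "Small Vehicle", "SPORT-UTILITY/WAGON": "Small Vehicle",
    "TAXI": "Small Vehicle", "PEDICAB": "Small Vehicle",
    "MOTORCYCLE": "Very Small Vehicle", "SCOOTER": "Very Small Vehicle",
    "BICYCLE": "Very Small Vehicle",
}
df["VEHICLE_GROUP"] = df["VEHICLE TYPE CODE 1"].str.upper().map(group_map)
df = df.dropna(subset=["VEHICLE_GROUP"])

# Step 3: 10-fold Naive Bayes accuracy on two assumed numeric features.
X = df[["NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED"]].fillna(0)
y = df["VEHICLE_GROUP"]
print(cross_val_score(GaussianNB(), X, y, cv=10))               # accuracy for k1 ... k10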
11.4 Results
[Figure 11.2: Prediction accuracy (%) of the random forest (RF), kNN, and Naive Bayes classifiers against the k-fold value (2 to 10).]
[Figure 11.3: Total processing time (seconds) of the random forest classifier against the number of nodes (5 to 25).]
Figure 11.2 compares the random forest, kNN, and Naive Bayes accuracy. The highest accuracy was recorded at k = 6 (97%) for kNN, while the lowest was recorded at k = 7 (85.66%) for Naive Bayes. Further, Figure 11.2 shows the k = 6 instance as the highest accuracy for all three classifiers. Nonetheless, comparing all of the above results, it is evident that random forest and kNN predictions related to vehicle groups will be highly accurate.
Figure 11.3 shows that the total processing time grows roughly linearly with the total number of nodes, which is evidence of the accurate handling of the data set by the random forest classifier. However, between 10 and 15 nodes the processing time decreased.
Figure 11.4 illustrates a near-constant accuracy of 95.033% across the total number of nodes. It is further evident that the constant, high-accuracy prediction of random forest is spread among the nodes.
[Figure 11.4: Node accuracy - prediction accuracy (%) of the random forest classifier against the number of nodes (5 to 25).]
LOAD Grouped_Filtered_Collision_Data
STRING [] VG = SELECT Vehicle_Group
READ VG
FOR n = 1 to 998,193
    IF VG = "Large Vehicle" THEN
        SAVE VG as "Test 1"
    END IF
    IF VG = "Medium Vehicle" THEN
        SAVE VG as "Test 2"
    END IF
    IF VG = "Small Vehicle" THEN
        SAVE VG as "Test 3"
    END IF
    IF VG = "Very Small Vehicle" THEN
        SAVE VG as "Test 4"
    END IF
END FOR
SAVE Pvaluetest_Collision_Data [n = 998,193]
LOAD Pvaluetest_Collision_Data
STRING [] Period = SELECT Year
SEPARATE Period = "2016-2017", "2014-2015", "2012-2013" manually
SAVE Pvaluegrouptest_Collision_Data [n = 998,193]
LOAD Pvaluegrouptest_Collision_Data
INT [] VG = SELECT Vehicle_Group
STRING [] Period = SELECT Year_Group
Step 1:
SELECT ONE-WAY ANOVA
FOR VG = "Test 1" DO
    Learn pvalue based on mean
END FOR
Step 2:
SELECT ONE-WAY ANOVA
FOR VG = "Test 2" DO
    Learn pvalue based on mean
END FOR
Step 3:
SELECT ONE-WAY ANOVA
FOR VG = "Test 3" DO
    Learn pvalue based on mean
END FOR
Step 4:
SELECT ONE-WAY ANOVA
FOR VG = "Test 4" DO
    Learn pvalue based on mean
END FOR
RETURN INT [] p-value = (p1, p2, p3, p4)
SAVE p1, p2, p3, p4
According to the above analysis, different vehicle groups are compared against periods of years. In the following four tests, a p-value < 0.05 is considered statistically significant (Table 11.5).
The tests output a p-value of p < 0.001 for every test. The p-value is therefore significant at the 99.9% confidence level, indicating a high significance between the different means of each vehicle group in the given periods.
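As a hedged illustration of the same one-way ANOVA, the SciPy sketch below compares a single vehicle group's collision counts across the three year-periods; the numbers are placeholders, not the chapter's data.

from scipy import stats

# Hypothetical yearly collision counts for one vehicle group in each period.
period_2012_2013 = [410, 395, 402, 388]
period_2014_2015 = [455, 470, 448, 462]
period_2016_2017 = [512, 530, 498, 520]

f_stat, p_value = stats.f_oneway(period_2012_2013, period_2014_2015, period_2016_2017)
print(f"F = {f_stat:.2f}, p = {p_value:.5f}")   # p < 0.05 indicates significantly different means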
11.4.4 Measured Different Criteria for Further Analysis of NYPD Data Set
(2012–2017)
The trends in Figure 11.7 indicate that between the years 2013 and 2016 the number of collisions and the number of persons injured increased year by year, while the number of persons killed remained stable.
Figure 11.7 Comparison of number of collisions, persons injured, and persons killed year-wise.
Further, this is evident from the numbers recorded for 2017 up to the month of April. Within the first four months of 2017, the numbers of collisions, injured persons, and killed persons recorded were 122 245, 31 057, and 105, respectively, approximately 50% growth compared to the previous year. Therefore, the year 2017 was on track to be more fatal than 2016 and could be recorded as the deadliest year for NYC roads in decades.
[Figure: Number of persons injured per year (2012–2017) by vehicle group (Small, Very Small, Medium, Large).]
Figures 11.8 and 11.9 show that most of the fatal collisions are recorded for medium-sized vehicles. The reason for this could be the increasing number of collisions involving passenger vehicles in NYC; in 2017, this group recorded rapid growth compared to other vehicle types. Further, Figure 11.8 shows that the number of persons killed in very small vehicles is considerably higher than in other groups of vehicles. This could be due to the lower safety of very small vehicles.
[Figure: Number of persons killed per year (2012–2017) by vehicle group (Small, Medium, Very Small, Large).]
[Figure 11.10: Number of persons injured per year (2012–2017) by borough (Queens, Manhattan, Bronx, Brooklyn, Staten Island).]
Figures 11.10 and 11.11 show the number of persons injured and killed in each borough. Queens and Brooklyn can be identified as the areas where many fatal crashes occur. This could be due to decreased road safety in both of these areas as well as heavy traffic conditions.
[Figure 11.11: Number of persons killed per year (2012–2017) by borough (Queens, Brooklyn, Bronx, Manhattan, Staten Island).]
After analyzing 998 193 motor vehicle collisions (reported during 2012–2017) in NYC, there is a probability of extreme cases. Therefore, the distribution of collision severity of
each vehicle group requires testing the normality of their distributions. The full data set (n = 998 193) was used. Mean values were calculated for the "injured persons" and "killed persons" attributes in the data set, and outliers were identified.
The outlier is a critical representation of the spread of the data, as values changing by 25% at the upper and lower boundaries do not affect the predictions. However, if the data come from a normal distribution, the outlier is considered inefficient compared to the standard deviation [14]. For the purpose of estimating outliers, thresholds of one person injured and one person killed were used; therefore, any value above one is defined as an outlier. Figures 11.12–11.19 show the outliers in the injured and killed numbers recorded in each year, based on the vehicle group.
In Figures 11.12–11.19, significant numbers of outliers for the severity of collisions in each vehicle group can be observed. Most of the outliers are visible in the injured numbers. This could be due to the growth of injured numbers over the reported killed. However, the forecasts were carried out by excluding the extremely severe cases found during the outlier analysis process. In these predictions, taking the outliers of severe collision cases into consideration is significant and very critical.
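A minimal pandas sketch of this outlier screening, under the assumption that the grouped data set is available as a CSV with the column names shown, is given below; the one-person threshold follows the description above.

import pandas as pd

# Assumed (hypothetical) file and column names.
df = pd.read_csv("Grouped_Filtered_Collision_Data.csv")
for col in ["NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED"]:
    df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0)
    print(col, "mean:", round(df[col].mean(), 3))

# Flag records above the one-person threshold and exclude them from the forecasts.
extreme = (df["NUMBER OF PERSONS INJURED"] > 1) | (df["NUMBER OF PERSONS KILLED"] > 1)
print("extreme cases:", int(extreme.sum()))
df_for_forecasts = df[~extreme]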
11.5 Discussion
In recent years, the growth of motor vehicle collisions has become critical. The number of road incidents has increased, leading to more injuries, disabilities, and deaths at a global level. On a daily basis, more people experience collisions as a result of traffic congestion, which causes delays for vehicles passing through areas with lane closures.
The outcome of this chapter is the prediction of emerging patterns and trends in motor vehicle collisions that may reduce road risks. In fact, it will help to predict patterns of collision
and severity associated with each type of vehicle. This chapter has used a large genuine data set of 998 193 records from the NYPD as the source for data analysis. Therefore, the analyzed patterns are very reliable for overcoming road risks. The results of this chapter can even be used by the NYPD to identify and prevent road risk on NYC roads. This chapter has used the machine learning classification algorithms kNN, random forest, and Naive Bayes, as these produced good results in our previous research [15, 16]. These three classifiers can accurately predict the different vehicle groups and identify particular risk groups.
Among the three classifiers, random forest generated the highest data prediction accuracy. Random forest is a tree-structured algorithm that is used for pattern recognition [11]. The main reason for using random forest as a classification technique is its nature of producing accurate and consistent predictions [9]. The random forest algorithm has previously been used in several studies to demonstrate predictions. For instance, random forest has been used in a data-driven model for crossing safety by predicting collisions at railway and roadway crossings [17].
On the other hand, studies carried out by several researchers [18, 19] suggested Naive Bayes classification as an accurate classification technique for collision analysis. Those studies used only numerical inputs for the relevant prediction analysis, and therefore Naive Bayes as a statistical pattern recognizer produced higher accuracy than other classifiers. Nevertheless, according to this chapter, random forest provides the most accurate prediction over the selected data set.
Additionally, the data set was statistically analyzed using one-way ANOVA, which shows high significance between the different means of each vehicle group in the given periods. This makes it
evident that vehicle groups are highly significant for collision patterns. Therefore, collision severity has been analyzed in comparison with the vehicle group, and it is evident that small vehicles were the reason for the high collision severity. This chapter reveals that between the years 2012 and 2017 motor vehicle collisions, and their severity, increased in NYC. However, it is observed that some extreme values are present in the data; hence, this chapter excludes extreme values using outlier analysis. Further, graphical representation of location using latitude and longitude confirmed the pattern of collision for each vehicle group. The Brooklyn and Queens boroughs are identified as the locations with the most severe collisions.
The main limitation of this chapter occurred in the data analysis phase. During the classification carried out using Python programming, Naïve Bayes did not capture the data in the k = 4 and k = 7 k-fold instances. However, this did not impact the overall result, since Naïve Bayes generated lower values for the other k-fold instances compared with kNN and random forest.
The results obtained from this chapter confirmed the significance of vehicle groups in motor vehicle collisions and road risks. The identified vehicle groups could accurately predict the location and severity of the collisions. Therefore, further studies can consider these identified vehicle groups for future road accident-related research. This chapter identified patterns through vehicle collision analysis and conveys this knowledge to relevant road safety authorities and law enforcement officers to help minimize motor vehicle collisions.
11.6 Conclusion
Data mining classification techniques are widely used in data analysis to obtain valuable findings. In this chapter, three main classifiers have been used to generate
accurate predictions over the motor vehicle collision data. In total, 998 193 collision records were tested with these three classifiers. The analysis in this chapter shows that the random forest classifier has the highest data prediction accuracy at 95.03%, followed by kNN at 94.93% and naïve Bayes at 70.13%. Additionally, the analysis of random forest node processing time and accuracy further confirms that it is the most suitable prediction classifier for this collision data. The findings of this chapter show that there was an increase in motor vehicle collisions on NYC roads during 2012–2017. Considering only the recent years 2016–2017, there was growth of approximately 50% in vehicle collisions. Among these collisions, the small vehicle group recorded the highest number. It is evident that the identification of motor vehicle groups has a significant impact on the severity of collisions on NYC roads. The Brooklyn and Queens boroughs are identified as the locations with the highest and most severe collision rates. However, this chapter generates results for collisions in NYC only; they cannot be universally applied to collisions occurring in other parts of the world. Further, these results can be used to improve road safety and minimize potential road risks. Finally, the chapter highlights valuable information for road authorities and law enforcement bodies to manage inner-city traffic in an efficient manner.
Author Contribution
D.A. and M.N.H. conceived the study idea and developed the analysis plan. D.A. analyzed
the data and wrote the initial paper. M.N.H. helped in preparing the figures and tables and in
finalizing the manuscript. All authors read the manuscript.
References
1 Abdullah, E. and Emam, A. (2015). Traffic accidents analyzer using big data. In: 2015
International Conference on Computational Science and Computational Intelligence
(CSCI), 392. Las Vegas: IEEE https://doi.org/10.1109/CSCI.2015.187.
2 Shi, A., Tao, Z., Xinming, Z., and Jian, W. (2014). Unrecorded accidents detection on
highways based on temporal data mining. Mathematical Problems in Engineering 2014:
1–7. https://doi.org/10.1155/2014/852495.
3 Yu, J., Jiang, F., and Zhu, T. (2013). RTIC-C: a big data system for massive traffic infor-
mation mining. In: 2013 International Conference on Cloud Computing and Big Data,
395–402. IEEE https://doi.org/10.1109/CLOUDCOM-ASIA.2013.91.
4 Sharma, S. and Sabitha, A.S. (2016). Flight crash investigation using data mining tech-
niques. In: 2016 1st India International Conference on Information Processing (IICIP),
1–7. IEEE https://doi.org/10.1109/IICIP.2016.7975390.
5 Chauhan, D. and Jaiswal, V. (2016). An efficient data mining classification approach for
detecting lung cancer disease. In: 2016 International Conference on Communication and
Electronics Systems (ICCES), 1–8. IEEE https://doi.org/10.1109/CESYS.2016.7889872.
6 Ince, R.A.A., Petersen, R., Swan, D., and Panzeri, S. (2009). Python for information the-
oretic analysis of neural data. Frontiers in Neuroinformatics 3 https://doi.org/10.3389/
neuro.11.004.2009.
7 Korosec, K., 2017. 2016 Was the Deadliest Year on American Roads in Nearly a Decade
[WWW Document]. Fortune. http://fortune.com/2017/02/15/traffic-deadliest-year
(accessed 10.8.17).
8 Sharma, S. and Bhagat, A. (2016). Data preprocessing algorithm for web structure
mining. In: 2016 Fifth International Conference on Eco-Friendly Computing and Com-
munication Systems (ICECCS), 94–98. IEEE https://doi.org/10.1109/Eco-friendly.2016
.7893249.
9 Witten, I., Frank, E., and Hall, M. (2011). Data Mining: Practical Machine Learning Tools
and Techniques, 3e. Boston: Morgan Kaufmann.
10 Shi, A., Tao, Z., Xinming, Z., and Jian, W. (2014). Evolution of traffic flow analysis
under accidents on highways using temporal data mining. In: 2014 Fifth International
Conference on Intelligent Systems Design and Engineering Applications, 454–457. IEEE
https://doi.org/10.1109/ISDEA.2014.109.
11 Chen, T., Cao, Y., Zhang, Y. et al. (2013). Random Forest in clinical metabolomics for
phenotypic discrimination and biomarker selection. Evidence-based Complementary and
Alternative Medicine 2013: 1–11. https://doi.org/10.1155/2013/298183.
12 Salvithal, N. and Kulkarni, R. (2013). Evaluating performance of data mining clas-
sification algorithm in Weka. International Journal of Application or Innovation in
Engineering and Management 2: 273–281.
13 Rao, J., Xu, J., Wu, L., and Liu, Y. (2017). Empirical chapter on the difference of Teach-
ers’ ICT usage in subjects, grades and ICT training. In: 2017 International Symposium
on Educational Technology (ISET), 58–61. IEEE https://doi.org/10.1109/ISET.2017.21.
14 Halgamuge, M.N. and Nirmalathas, A. (2017). Analysis of large flood events: based on
flood data during 1985–2016 in Australia and India. International Journal of Disaster
Risk Reduction 24: 1–11. https://doi.org/10.1016/j.ijdrr.2017.05.011.
15 Halgamuge, M.N., Guru, S.M., and Jennings, A. (2005). Centralised Strategies for Cluster
Formation in Sensor Networks, in Classification and Clustering for Knowledge Discovery,
315–334. Cambridge, UK: Springer-Verlag.
16 Wanigasooriya, C., Halgamuge, M.N., and Mohamad, A. (2017). The analysis of anti-
cancer drug sensitivity of lung cancer cell lines by using machine learning clustering
techniques. International Journal of Advanced Computer Science and Applications
(IJACSA) 8 (9): 1–12.
17 Trudel, E., Yang, C., and Liu, Y. (2016). Data-driven modeling method for analyzing
grade crossing safety. In: 2016 IEEE 20th International Conference on Computer Sup-
ported Cooperative Work in Design (CSCWD), 145–151. IEEE https://doi.org/10.1109/
CSCWD.2016.7565979.
18 Al-Turaiki, I., Aloumi, M., Aloumi, N., and Alghamdi, K. (2016). Modeling traffic
accidents in Saudi Arabia using classification techniques. In: 4th Saudi International
Conference on Information Technology (Big Data Analysis) (KACSTIT). Riyadh: IEEE
https://doi.org/10.1109/KACSTIT.2016.7756072.
19 Li, L., Shrestha, S., and Hu, G. (2017). Analysis of Road Traffic Fatal Accidents Using
Data Mining Techniques, 363–370. IEEE https://doi.org/10.1109/SERA.2017.7965753.
12
A Smart and Promising Neurological Disorder Diagnostic System
12.1 Introduction
Neurological disorders are critical human disorders that are generally related to problems
of the spinal cord, brain, and nervous system. In general, the structural disturbance
of neurons in the human body leads to these disorders. Lifestyle, heredity, infection, nutrition, environment, and major physical injuries are some of the important causes of neurological disorders. There are different signs and symptoms for each neurological disorder; therefore, it is difficult to give a complete list. However, in general, changes in behavior, emotions, and physical appearance are some of the signs of these critical disorders. In general, neurological disorders lead to the impairment of the brain, spinal
cord, nerves, and several neuromuscular functions in the body [1–3]. These disorders are
the deformities that ensue in the central nervous system of the human body. The central
nervous system consists of nerves inside the brain and spinal cord, whereas the nerves
outside the brain and spinal cord are part of the peripheral nervous system [4]. The brain
controls different body functions such as breathing, hormone release, heart rate, body
temperature, movements, sensations, desires, and emotions in the body, etc. The spinal
cord carries the brain signals to peripheral nerves. And the peripheral nerves connect the
central nervous system with the other body organs [5]. The diseases related to the spinal
cord and brain are considered central nervous system disorders. Meningitis, brain tumor,
stroke, seizures, epilepsy, Alzheimer’s disease, Parkinson’s disease, and multiple sclerosis
are some of the major central nervous system disorders, whereas the diseases related to
nerves outside the spinal cord and brain are known as peripheral nervous system disorders.
Some of the examples of peripheral nervous system disorders are Guillain-Barre syndrome,
carpal tunnel syndrome, thoracic outlet syndrome, complex regional pain syndrome,
brachial plexus injuries, sciatica, neuritis, and dysautonomia. A brief list of central and peripheral nervous system disorders is depicted in Figure 12.1 [6, 7].
[Figure 12.1: Central nervous system disorders (meningitis, brain tumor, stroke, seizures and epilepsy, Alzheimer's disease, Parkinson's disease, multiple sclerosis) and peripheral nervous system disorders (Guillain-Barre syndrome, carpal tunnel syndrome, thoracic outlet syndrome, complex regional pain syndrome).]
neurons in the nervous system of the human body [9]. Neurological disorders in human beings generally affect the functioning of the brain and behavior. As stated earlier, the disorders related to the brain, nervous system, and spinal cord in human beings are generally categorized as neurological disorders [3]. On the other hand, disorders related to an ongoing dysfunctional pattern of thought, emotion, and behavior that causes significant distress are known as psychological disorders [10]. Some of the major differences between neurological and psychological disorders are presented in Table 12.1.
According to the University of California [11], there exist about 600 known neurological
disorders that can be diagnosed by a neurologist [12]. Some neurological disorders
can be treated, whereas some require a serious prognosis before starting treatment.
Diagnostic tests include regular screening tests such as blood or urine tests, genetic
tests such as chorionic villus sampling, uterine ultrasound, etc., and other neurological
examinations, such as x-ray, fluoroscopy, biopsy, brain scans, computed tomography
(CT scan), electroencephalography (EEG), electronystagmography, electromyography,
magnetic resonance imaging (MRI), thermography, polysomnogram, positron emission
tomography (PET), etc. [13]. All these methods are time-consuming and costly. Moreover,
several computational models have also been designed to efficiently diagnose neurological
disorders. The objective of this research work is to devise a smart and promising neuro-
logical diagnostic framework using the amalgamation of different emerging computing
techniques.
Figure 12.2 Prevalence and death rate due to neurological disorders in the year 2015 [15].
[Figure 12.3: Share of the global prevalence of neurological disorders by country (India 30%, USA 6%, Indonesia 6%, among the top 10 countries).]
It is observed from Figure 12.2 that in 2015 the most prevalent neurological disorders were dementia, stroke, and epilepsy, which were highly prevalent in human beings. The death rates due to stroke and dementia are also very high. It is further observed from Figure 12.2 that the death rate relative to prevalence is high in patients suffering from tetanus, brain cancer, motor neuron disease, and stroke, i.e., 27%, 19%, 17%, and 15%, respectively. The risk of neurological disorders has spread widely all over the world. Figure 12.3 shows the top 10 countries with the highest prevalence of neurological disorders.
In 2015, China and India were the most populated countries and also showed a high prevalence of neurological disorders in their populations. Other than China and India, Indonesia, the United States, Russia, Pakistan, Bangladesh, Brazil, Nigeria, and Japan are also countries with a high prevalence of neurological disorders.
Primarily, IoT was used in manufacturing industries and business to predict new opportunities with analytics. Nowadays, it is applicable, through smart devices, to different sectors such as banking, health care, finance, learning, transportation, communication, and entertainment. The effective use of IoT devices seems to be very beneficial in the health care sector. It can improve the diagnostic process, treatment efficiency, and hence the overall health of patients [18]. Real-time monitoring helps to save patients' lives in emergency events such as worsening heart conditions, seizures, and asthmatic attacks [19]. A huge amount of data (structured and unstructured) is generated by IoT devices from different sources such as hospitals, laboratories, clinics' embedded health care devices, and research centers. This data requires large-scale storage, integration, assortment, security, and analysis for remote health monitoring and medical assistance. These issues can be solved by using big data analytics to diagnose various chronic diseases such as diabetes, cancer, cardiac disorders, and psychological and neurological disorders, and to deliver precise data for potential research.
Due to uncontrollable data growth, the effective utilization and analysis of data are imperative. Effective data analysis is applicable in sales forecasting, economic analysis, disease diagnosis, social network analysis, business management, etc. Some organizations already apply analytics to systematic data in the form of reports. The challenges related to data size and computing power can be addressed by introducing the concept of big data [20].
[Figure: Characteristics of big data - velocity, variety, and related dimensions.]
[Figure: Taxonomy of soft computing techniques - machine learning (supervised learning: classification, regression; unsupervised learning: clustering, dimensionality reduction; reinforcement learning), evolutionary computation (genetic algorithm, differential evolution), and swarm intelligence (GWO, ACO, ABC, PSO, ALO, FA).]
[Figure: Generic flow of a nature-inspired optimization algorithm - update variables until the termination criteria are satisfied, then stop with the optimal solution.]
[Figure: Application areas of data analytics - manufacturing (16,700; 15%), sentiment analysis (15,600; 14%), finance (13,500; 12%), text processing (7,640; 7%), weather forecasting (1,490; 1%), among others.]
Support vector machine (SVM), random forest, naive Bayes, and ID3 are some of the dominantly used classifiers. Regression is an imperative data mining process used to determine a trend between dependent and independent variables. The aim of regression is to predict the value of the dependent variable based on the independent variable [52].
Conversely, unsupervised learning techniques are employed when the data is not labeled. These techniques are normally suited to transactional data. They can also be beneficial in finding facts from past data. Clustering and dimensionality reduction are two well-known unsupervised learning techniques. Clustering is a process of data segmentation used to divide large data sets into similar groups. Some methods for clustering are hierarchical, partitioning, grid-based, density-based, and model-based methods [53, 54].
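As a small, hedged example of the clustering idea just described, the scikit-learn sketch below partitions synthetic, unlabeled data with k-means, one of the partitioning methods mentioned; it is illustrative only.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabeled points standing in for transactional or patient data.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)      # centres of the three discovered groups
print(kmeans.labels_[:10])          # cluster assignment of the first ten points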
Data is the essential component of data mining. It is required to collect patients' data from different sources, viz. medical devices, IoT, hospitals, neurological clinics, online repositories, research centers, etc., for the diagnosis of several neurological disorders. This study emphasizes the diagnosis of several neurological disorders using different emerging techniques in computer science. This section briefly summarizes the work of different researchers on diagnosing neurological disorders using soft computing and machine learning techniques.
Emina Alickovic et al. [55] introduced a model to detect and predict seizures in patients by classifying EEG signals. The Freiburg and Children's Hospital Boston-Massachusetts Institute of Technology (CHB-MIT) EEG databases were used to conduct the study. The proposed model was defined by four components: noise removal, signal decomposition, feature selection, and classification of signals. Multi-scale principal component analysis was used for noise removal; decomposition of signals was performed using empirical mode decomposition, discrete wavelet transform, and wavelet packet decomposition; feature selection was performed using statistical methods; and signal classification was done using machine learning techniques (random forest, SVM, multi-layer perceptron [MLP], and k-nearest neighbor [KNN]). It was observed that the best results were obtained using wavelet packet decomposition with SVM. The performance of the proposed model was evaluated as accuracy (99.7%), sensitivity (99.6%), and specificity (99.8%).
Pedro P. Rebouças Filho et al. [56] proposed a feature extraction method called analysis
of brain tissue density (ABTD) to detect and classify strokes. The study was applied over
the data set of 420 CT images of patients’ skulls. Data preprocessing, segmentation, feature
extraction, and classification steps were included to devise the method. MLP, SVM, KNN,
optimal path forest (OPF), and Bayesian classifiers were used to classify the data set. The
best diagnostic accuracy (i.e., 99.3%) was shown by OPF with a Euclidean distance method.
Abdulhamit Subasi et al. [57] devised a hybrid model using SVM with GA and PSO for the detection of epileptic seizures in patients. A data set of five patients, with 100 EEG recordings each, was used to conduct the study. Discrete wavelet transformation was used for feature extraction. It was observed that the hybrid SVM with PSO showed better classification accuracy (i.e., 99.38%) compared to simple SVM and the hybrid SVM with GA.
U. Rajendra Acharya et al. [58] presented a study using a deep convolutional neural network (CNN) to diagnose epilepsy in patients by analyzing EEG signals. A data set of five patients was used in the study; 100 EEG signals were analyzed for each patient. The data were categorized into three parts: normal, preictal, and seizure. Normalization of the signals was performed using the Z-score (zero mean and unit standard deviation). Deep learning and an artificial neural network (ANN) with a 10-fold cross-validation method were used to detect seizures. The diagnostic accuracy, specificity, and sensitivity shown by the study were 88.7%, 90%, and 95%, respectively.
Benyoussef et al. [59] carried out a diagnostic study of Alzheimer's patients. The authors employed three distinct data mining techniques, namely decision tree, discriminant analysis, and logistic regression, and found that the classification results obtained using discriminant analysis were better than the outcomes of the other two approaches. The highest rate of predictive accuracy achieved using discriminant analysis was 66%.
Lama et al. [60] diagnosed Alzheimer’s disease using three important data mining
techniques, i.e., SVM, import vector machine (IVM), and rough extreme learning machine
(RELM). The authors used an MRI images data set of 214 instances collected from the
Alzheimer's disease neuroimaging initiative (ADNI) database. The authors found a better
diagnostic rate with RELM. The highest rate of prediction achieved using RELM was
76.61%.
S. R. Bhagya Shree et al. [61] conducted a study on Alzheimer’s disease diagnosis using
a naïve Bayes algorithm with the Waikato Environment for Knowledge Analysis (WEKA)
tool. A data set of 466 patients was collected for the study by conducting the neuropsycho-
logical test. A number of steps were performed to preprocess the data such as attribute selec-
tion, imbalance reduction, and randomization. After data preprocessing, feature selection was performed using the wrapper method with the synthetic minority over-sampling technique (SMOTE) filter. Data classification was performed using naive Bayes with a cross-validation
method. The model was evaluated on different platforms, viz. explorer, knowledge flow,
and Java application programming interface (API).
Doyle et al. [62] proposed a study to diagnose Alzheimer's disorder in patients from brain images using a regression method. The study used a data set with 1023 instances and 57 attributes. Accuracy, specificity, and sensitivity were 74%, 72%, and 77%, respectively.
Johnson et al. [63] used GA and logistic regression to diagnose Alzheimer's disease in patients. GA was used for feature selection and selected 8 features out of 11. Logistic regression was used as a classification technique applied with five folds.
Koikkalainen et al. [64] proposed diagnosing Alzheimer's disorder in patients using regression techniques. The authors used 786 instances from the ADNI database. The predictive rate of accuracy given by the study was 87%.
J. Maroco et al. [65] proposed a model to predict dementia disorder in patients. The authors compared the sensitivity, specificity, and accuracy of different data mining techniques, viz. neural networks, SVM, random forests, linear discriminant analysis, quadratic discriminant analysis, and logistic regression. The study reveals that random forest and linear discriminant analysis showed the highest sensitivity, specificity, and accuracy, whereas neural networks and SVM showed the lowest.
The performance of the studies is compared via the predictive rates of accuracy, sensitivity, and specificity achieved by different authors using different soft computing and machine learning techniques for the diagnosis of several neurological disorders. All these performance parameters are derived from the confusion matrix, i.e., the counts of correct and incorrect predictions against actual events.
Accuracy is the ratio of correct predictions to the total number of predictions. Figure 12.8 shows the comparison of the predictive rates of accuracy achieved by different authors in their studies.
Sensitivity is the ratio of true positives to the sum of true positives and false negatives. Figure 12.9 shows the comparison of the sensitivity achieved by different authors in their studies.
Specificity is the ratio of true negatives to the sum of true negatives and false positives. Figure 12.10 shows the comparison of the specificity achieved by different authors in their studies.
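These three measures follow directly from a binary confusion matrix; the short Python sketch below uses made-up counts (not values from the cited studies) to show the formulas applied throughout the comparison.

# Illustrative counts only: true positives, false negatives, false positives, true negatives.
TP, FN, FP, TN = 90, 10, 5, 95

accuracy = (TP + TN) / (TP + TN + FP + FN)      # correct predictions / total predictions
sensitivity = TP / (TP + FN)                    # true positives / (true positives + false negatives)
specificity = TN / (TN + FP)                    # true negatives / (true negatives + false positives)

print(f"accuracy={accuracy:.2%} sensitivity={sensitivity:.2%} specificity={specificity:.2%}")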
It is observed from Figures 12.8–12.10 that the ranges of accuracy, sensitivity, and specificity lie between 60–99.7%, 54–100%, and 62–98.2%, respectively. Parkinson's disease and seizures have shown the highest performance among the disorders. Also, SVM techniques have shown better predictive results. The publication summary for the diagnosis of different neurological disorders over the last 10 years is shown in Figure 12.11.
In the last 10 years, the most work was done on dementia, epilepsy, and stroke diagnosis using soft computing techniques, and since 2018 work on multiple sclerosis has been increasing consistently, second only to epilepsy. Table 12.2 shows the detailed publication trends along with citations for some of the neurological disorder diagnosis articles.
From Table 12.2, it is observed that authors from different countries, viz. India, the United States, Korea, Malaysia, the UK, Australia, Brazil, Singapore, etc., have successfully employed different computing techniques for the diagnosis of different neurological disorders. The citation counts of these articles lie between 1 and 148. Also, a wide range of publishers, viz. Elsevier, Springer, BMJ, PMC, Hindawi, PubMed, and the Public Library of Science, have published articles on neurological disorder diagnosis using soft computing and machine learning techniques.
It is observed that different soft computing and machine learning techniques, such as naive
Bayes, random forest, SVM, ANN, genetic algorithm, and PSO, etc., have been extensively
used to diagnose different neurological disorders, viz. epilepsy, stroke, Alzheimer’s, menin-
gitis, etc. However, insignificant attention has been given to design and analyze the diag-
nostic methods for neurological disorders such as epilepsy, stroke, Alzheimer’s, meningitis,
Parkinson’s, brain cancer, etc., by receiving data from different IoT devices. Therefore, an
intelligent system is required for the early and precise diagnosis of different neurological
disorders with IoT and big data concepts.
Figure 12.8 The accuracy achieved by different studies for neurological disorder diagnosis.
Figure 12.9 Sensitivity achieved by different studies for neurological disorder diagnosis.
Figure 12.10 Specificity achieved by different studies for neurological disorder diagnosis.
Figure 12.11 Publication trend from 2008 to 2018 for neurological disorder diagnosis using
emerging techniques.
Table 12.2 Publication details along with citations used in the study.
[Figure: Proposed neurological disorder diagnostic framework - patient data (biological, cognitive, psychological, and social factors, plus input from friends and family) are collected from sources such as cameras, medical devices, low-energy-consumption devices, and diagnostic centres; stored on big data platforms (data warehouses, distributed platforms, cloud storage); and analyzed through data preprocessing, data mining (regression, classification, Bayesian and stochastic methods, simulation, optimization), descriptive and predictive analytics, data modelling, and data visualization.]
Initially, data on patients, viz. biological, behavioral, cognitive, psychological, social, and emotional factors, have to be collected from different sources such as medical devices, diagnostic and research centers, and electronic medical records (EMRs). Moreover, live data from patients can also be collected using IoT devices such as sensors and wearable devices. IoT can also be used to optimize energy, as a large amount of energy is wasted in the health care sector by different medical devices; this problem can be solved by introducing low-power-consumption IoT devices to collect the data sets of neurologically disordered patients. After collecting data from different sources, it is important to store this big data over
different platforms for the effective utilization of data. Data warehouse, distributed plat-
forms, multi-dimensional storage, and cloud storage are some important big data storage
methods. Different big data analytics techniques like descriptive, prescriptive, and predic-
tive can be used for data analysis [66]. Different preprocessing techniques will be used to
handle noise and missing values and to remove outliers from the data. Three different feature selection techniques, called filter, wrapper, and embedded, and their hybridization can be
used to select more suitable attributes. Additionally, emerging nature-inspired computing
(ACO, PSO, ABC, FA, ALO, GWO, etc.) techniques may be used for better feature selection.
The basic objective of feature selection is to remove the unwanted attributes so that the
data mining process can be expedited. The use of nature-inspired computing techniques will improve the predictive rate of accuracy and reduce the error rate as well as the time needed to process data in the diagnostic system. Different big data analytics techniques, viz. descriptive, predictive, and prescriptive analytics, are used to perform an effective analysis of the collected data set. The diagnostic analysis is then performed using machine learning, evolutionary, and nature-inspired computing techniques for the early diagnosis of different neurological disorders. Finally, the rate of classification can be further improved by incorporating the concept of granular computing [67].
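A hedged sketch of the feature selection and diagnostic-analysis stages of the proposed framework is given below; it uses a conventional filter-based selector and a random forest on synthetic data as stand-ins for the nature-inspired selectors and the real IoT patient data, which are outside the scope of a short example.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for preprocessed patient records (30 attributes, binary diagnosis).
X, y = make_classification(n_samples=500, n_features=30, n_informative=8, random_state=1)

pipeline = make_pipeline(
    SelectKBest(score_func=f_classif, k=8),        # filter method: keep the 8 strongest attributes
    RandomForestClassifier(n_estimators=100, random_state=1),
)
scores = cross_val_score(pipeline, X, y, cv=5)     # predictive accuracy after feature selection
print("mean accuracy: {:.2%}".format(scores.mean()))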
Challenges
● Data collection is difficult, as the data may be in unstructured and semi-structured for-
mats.
● Energy-optimized devices are required to improve the efficiency and reduce the power consumption of the IoT devices used to collect live data.
● Data security is a critical issue to protect patients’ data from unauthorized access.
12.6 Conclusion
Neurological disorders cover several nerve and brain-related issues and have cognitive,
biological, and behavioral repercussions. Here, the focus was on the types of neurologi-
cal disorders, and how they differ from psychological disorders. It is observed that dementia, stroke, and epilepsy have a high prevalence, although their corresponding death rates are comparatively low. Unfortunately, the death rate of some disorders such as tetanus, brain cancer, motor neuron disease, and stroke is high, i.e., 27%, 19%, 17%, and 15%, respectively. Further-
more, it is observed from the statistics that China and India have shown a high prevalence
of neurological disorders in their population. The role and relation of IoT and big data are
presented in the study. Different authors have implemented various machine learning and
soft computing techniques, viz. naive Bayes, random forest, ANN, SVM, GA, and PSO, etc.,
in order to diagnose different neurological disorders. The performance of each study was measured on three parameters, viz. accuracy, sensitivity, and specificity, whose ranges lie between 60–99.7%, 54–100%, and 62–100%, respectively. Parkinson's disease and seizures have
shown the highest performance among other disorders. Also, SVM has shown better pre-
dictive results. Finally, to fulfill future requirements, an intelligent neurological disorder
diagnostic framework is proposed.
References
1 WHO Library Cataloguing-in-Publication Data (2006). Neurological Disorders: Public Health Challenges.
2 Gautam, R. and Sharma, M. (2020). Prevalence and Diagnosis of Neurological Disorders
Using Different Deep Learning Techniques: A Meta-Analysis. Journal of Medical Systems
44 (2): 49.
3 Meyer, M.A. (2016). Neurologic Disease: A Modern Pathophysiologic Approach to Diagno-
sis and Treatment. Heidelberg: Springer.
4 Roth, G. and Dicke, U. Evolution of nervous systems and brains. In: Neurosciences - from
Molecule to Behavior: A University Textbook (eds. C.G. Galizia and P.-M. Lledo). Berlin,
Heidelberg: Springer Spektrum.
5 Sokoloff, L. (1996). Circulation in the central nervous system. In: Comprehensive Human
Physiology (eds. R. Greger and U. Windhorst), 561–578. Berlin, Heidelberg: Springer.
6 Pleasure, D.E. (1999). Diseases affecting both the peripheral and central nervous sys-
tems. In: Basic Neurochemistry: Molecular, Cellular and Medical Aspects, 6e (eds. G.J.
Siegel, B.W. Agranoff, R.W. Albers, et al.). Philadelphia: Lippincott-Raven.
7 https://www.ucsfhealth.org/conditions/neurological_disorders [accessed on 20 October
2018]
8 Aarli, J.A. (2005). Neurology and psychiatry: “Oh East is East and West is West …”. Neu-
ropsychiatric Disease and Treatment 1 (4): 285–286.
9 Brandt, T., Caplan, L.R., and Kennard, C. (2003). Neurological Disorders Course and
Treatment, 2e.
10 Charles, S. and Walinga, J. (2014). Defining psychological disorders. In: Introduction to
Psychology, 1st Canadian Edition. Vancouver, BC: BCcampus.
11 Sharifi, M.S. (2013). Treatment of neurological and psychiatric disorders with deep brain
stimulation; raising hopes and future challenges. Basic and Clinical Neuroscience 4 (3):
266–270.
12 https://medlineplus.gov/neurologicdiseases.html [Accessed on 20 October, 2018]
13 Galetta, S.L., Liu, G.T., and Volpe, N.J. (1996). Diagnostic tests in neuro-ophthalmology.
Neurologic Clinics 14 (1): 201–222.
14 Tegueu, C.K., Nguefack, S., Doumbe, J. et al. (2013). The spectrum of neurological dis-
orders presenting at a neurology clinic in Yaoundé, Cameroon. The Pan African Medical
Journal 14: 148.
15 GBD 2015 Neurological Disorders Collaborator Group (2017). Global, regional, and
national burden of neurological disorders during 1990–2015: a systematic analysis for
the Global Burden of Disease Study 2015. Global Health Metrics 16 (11): 877–897.
16 Tsiatsis, V., Karnouskos, S., Holler, J. et al. Internet of Things Technologies and Applica-
tions for a New Age of Intelligence, 2. New York: Academic Press.
17 Elhoseny, M., Abdelaziz, A., Salama, A.S. et al. (2018). A hybrid model of the Inter-
net of Things and cloud computing to manage big data in health services applications.
Future Generation Computer Systems 86: 1383–1394.
18 Farahani, B., Firouzi, F., Chang, V. et al. (2018). Towards fog-driven IoT eHealth:
promises and challenges of IoT in medicine and healthcare. Future Generation Com-
puter Systems 78 (2): 659–676.
19 Firouzia, F., Rahmani, A.M., Mankodiy, K. et al. (2018). Internet-of-Things and big data
for smarter healthcare: from device to architecture, applications and analytics. Future
Generation Computer Systems 78 (2): 583–586.
20 Bhatt, C., Dey, N., and Ashour, A.S. (2017). Internet of Things and Big Data Technologies
for Next Generation Healthcare. Cham: Springer.
21 Picciano, A.G. (2012). The evolution of big data and learning analytics in American
higher education. Journal of Asynchronous Learning Networks 16 (3): 9–20.
22 McAfee A. and Brynjolfsson E. (2012) Big data: the management revolution. Cambridge,
MA: Harvard Business Review.
23 Big Data is the Future of Healthcare (2012). Cognizant 20–20 insights.
24 Gandomi, A. and Haider, M. (2015). Beyond the hype: big data concepts, methods, and
analytics. International Journal of Information Management 35 (2): 137–144.
25 Kruse, C.S., Goswami, R., Raval, Y., and Marawi, S. (2016). Challenges and opportuni-
ties of big data in healthcare: a systematic review. JMIR Medical Informatics 4 (4).
26 Agarwal, P. and Mehta, S. (2014). Nature-inspired algorithms: state-of-art, problems and
prospects. International Journal of Computers and Applications 100 (14): 14–21.
27 Mitchell, M. (1996). An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press.
28 Kaur, P. and Sharma, M. (2017). A survey on using nature inspired computing for fatal
disease diagnosis. International Journal of Information System Modeling and Design 8 (2):
70–91.
29 Blum, C. and Li, X. (2008). Swarm intelligence in optimization. In: Swarm Intelligence,
Natural Computing Series (eds. C. Blum and D. Merkle). Berlin, Heidelberg: Springer.
30 Dorigo, M. and Di Caro, G. (1992). Ant Colony Optimization Meta-Heuristic,
PhD Thesis, Chapter 2.
31 Kennedy, J. and Eberhart, R. (1995). Particle Swarm Optimization. IEEE.
32 Karaboga, D. (2005). Artificial bee colony algorithm. Scholarpedia 5 (3): 6915.
33 Yang, X.-S. and He, X. (2013). Firefly algorithm: recent advances and applications. Inter-
national Journal of Swarm Intelligence 1 (1): 36–50.
34 Mirjalili, S. (2015). The ant lion optimizer, advances in engineering software. Advances
in Engineering Software 83: 80–90.
35 Mirjalili, S., Mirjalili, S.M., and Lewis, A. (2014). Grey wolf optimizer. Advances in Engi-
neering Software 69: 46–61.
36 Askarzadeh, A. (2016). A novel metaheuristic method for solving constrained engi-
neering optimization problems: crow search algorithm. Computers and Structures 169:
1–12.
37 Mirjalili, S. and Lewis, A. (2016). The whale optimization algorithm. Advances in Engi-
neering Software 95: 51–67.
38 Dhinesh Babu, L.D. and Krishna, P.V. (2013). Honey bee behaviour inspired the load
balancing of tasks in a cloud computing environment. Applied Soft Computing 13 (5):
2292–2303.
39 Rajabioun, R. (2011). Cuckoo optimization algorithm. Applied Soft Computing 11 (8):
5508–5518.
40 Zheng, Y.-J. (2015). Water wave optimization: a new nature-inspired metaheuristic.
Computers & Operations Research 55: 1–11.
41 Kaveh, A. and Khayatazad, M. (2012). A new meta-heuristic method: ray optimization.
Computers & Structures 112: 283–294.
42 Chakraborty, A. and Kar, A.K. Swarm intelligence: A review of algorithms. In:
Nature-Inspired Computing and Optimization (eds. S. Patnaik, X.S. Yang and K.
Nakamatsu), 475–494. Cham: Springer.
43 Sharma, M. et al. (2013). Stochastic analysis of DSS queries for a distributed
databasedesign. International Journal of Computer Applications 83 (5): 36–42.
44 Sharma, M., Singh, G., and Singh, R. (2018). A review of different cost-based distributed
query optimizers. In: Progress in Artificial Intelligence, vol. 8.1, 45–62.
45 Long, N.C., Meesad, P., and Unger, H. (2015). A highly accurate firefly based algorithm
for heart disease prediction. Expert Systems with Applications 42 (21): 8221–8231.
46 Fu, Q., Wang, Z., and Jiang, Q. (2010). Delineating soil nutrient management zones
based on fuzzy clustering optimized by PSO. Mathematical and Computer Modelling 51
(11–12), 2010: 1299–1305.
47 Fürnkranz, J. et al. (2012). Foundations of Rule Learning, Cognitive Technologies.
Springer-Verlag Berlin Heidelberg.
48 Royal Society (Great Britain). Machine learning: the power and promise of computers
that learn by example, The Royal Society.
49 Jiawei, H., Micheline, K., and Jian, P. (2011). Data Mining: Concepts and Techniques, 3e.
Hoboken, NJ: Elsevier.
50 Sharma, M., Sharma, S., and Singh, G. (2018). Performance analysis of statistical and
supervised learning techniques instock data mining. Data 3 (4): 54.
51 Sharma, M., Singh, G., and Singh, R. (2017). Stark assessment of lifestyle based human
disorders using data mining based learning techniques. IRBM 36 (6): 305–324.
52 Ian, W. and Eibe, F. (2005). Data Mining: Practical Machine Learning Tools and Tech-
niques, 2e. Hoboken, NJ: Elsevier.
53 Kaur, P. and Sharma, M. (2018). Analysis of data mining and soft computing techniques
in prospecting diabetes disorder in human beings: areview. International Journal of
Pharmaceutical Sciences and Research 9 (7): 2700–2719.
54 Xu, D. and Tian, Y. (2015). A comprehensive survey of clustering algorithms. Annals of
Data Science 2 (2): 165–193.
55 Alickovic, E., Kevric, J., and Subasi, A. (2018). Performance evaluation of empirical
mode decomposition, discrete wavelet transform, and wavelet packed decomposition for
automated epileptic seizure detection and prediction. Biomedical Signal Processing and
Control 39: 94–102.
56 RebouçasFilhoa, P.P., Sarmentoa, R.M., Holandaa, G.B., and de Alencar Lima, D. (2017).
A new approach to detect and classify stroke in skull CT images via analysis of brain
tissue densities. Computer Methods and Programs in Biomedicine 148: 27–43.
References 263
57 Subasi, A., Kevric, J., and Abdullah Canbaz, M. (2019). Epileptic seizure detection using
hybrid machine learning methods. Neural Computing and Applications 31.1: 317–325.
58 Rajendra Acharya, U., Oh, S.L., Hagiwara, Y. et al. (2017). Deep convolutional neural
network for the automated detection and diagnosis of a seizure using EEG signals.
Computers in Biology and Medicine 100: 270–278.
59 Benyoussef, E.M., Elbyed, A., and El Hadiri, H. (2017). Data mining approaches
for Alzheimer’s disease diagnosis. In: Ubiquitous Networking (eds. E. Sabir, A.
García Armada, M. Ghogho and M. Debbah), 619–631. Cham: Springer.
60 Lama, R.K., Gwak, J., Park, J.S., and Lee, S.W. (2017). Diagnosis of Alzheimer’s disease
based on structural MRI images using a regularized extreme learning machine and PCA
features. Journal of Healthcare Engineering 2017: 1–11.
61 Bhagya Shree, S.R. and Sheshadri, H.S. (2018). Diagnosis of Alzheimer’s disease using
naive Bayesian classifier. Neural Computing and Applications 29 (1): 123–132.
62 Doyle, O.M., Westman, E., Marquand, A.F. et al. (2014). Predicting progression of
Alzheimer’s disease using ordinalregression. PLoSOne 9 (8): 1–10.
63 Johnson, P., Vandewater, L., Wilson, W. et al. (2014). Genetic algorithm with logistic
regression for prediction of progression to Alzheimer’s disease. BMC Bioinformatics 15:
1–14.
64 Koikkalainen, J., Pölönen, H., Mattila, J. et al. (2012). Improved classification of
Alzheimer’s disease data via removal of nuisance variability. PLoS One 7 (2): e31112.
65 Maroco, J., Silva, D., Rodrigues, A. et al. (2011). Data mining methods in the prediction
of dementia: a real-data comparison of the accuracy, sensitivity and specificity of linear
discriminant analysis, logistic regression, neural networks, support vector machines,
classification trees and random forests. BMC Research Notes 4 (299): 1–14.
66 Sharma, M., Singh, G., and Singh, R. (2018). Accurate prediction of life style based dis-
orders by smart healthcare using machine learning andprescriptive big data analytics.
Data Intensive Computing Applications for Big Data 29: 428.
67 Sharma, M., Singh, G., and Singh, R. (2018). An advanced conceptual diagnostic health-
careframework for diabetes and cardiovascular disorders. EAI Endorsed Transactions on
Scalable Information Systems 5: 1–11.
68 Rajamanickam, Y., Rajendra Acharya, U., and Hagiwara, Y. (2018). A novel Parkin-
son’s disease diagnosisindex using higher-order spectra features in EEG signals. Neural
Computing and Applications 30 (4): 1225–1235.
69 Abedi, V., Goyal, N., Tsivgoulis, G., and Hosseinichimeh, N. (2017). Novel screening tool
for stroke using artificialneural network. Stroke 48 (6): 1678–1681.
70 Kellner, C.P., Sauvageau, E., Snyder, K.V. et al. (2018). The VITAL study and overall
pooled analysis with the VIPS non-invasive stroke detection device. Journal of NeuroInt-
erventional Surgery: 1–7.
71 Nilashi, M., Ibrahim, O., Ahmadi, H. et al. (2018). A hybrid intelligent system for the
prediction of Parkinson’s disease progression using machine learning techniques. Biocy-
bernetics and Biomedical Engineering 38 (8): 1–15.
72 Lee, E.-J., Kim, Y.-H., Kim, N., and Kanga, D.-W. (2017). Deep into the brain: artificial-
intelligence in stroke imaging. Journal of Stroke 19 (3): 277–285.
73 SugunaNanthini, B. and Santhi, B. (2014). Seizure detection using SVM classifier on
EEGsignal. Journal of Applied Sciences 14 (14): 1658–1661.
264 12 A Smart and Promising Neurological Disorder Diagnostic System
74 Nesibe, Y., Gülay, T., and Cihan, K. (2015). Epilepsy diagnosis using artificial neu-
ral network learned by PSO. Turkish Journal of Electrical Engineering and Computer
Sciences 23: 421–432.
75 Patnaik, L.M. and Manyam, O.K. (2008). Epileptic EEG detection using neural networks
and post-classification. Computer Methods and Programs in Biomedicine 91: 100–109.
76 Awad, M. and Khanna, R. Bioinspired computing: swarm intelligence. In: Efficient
Learning Machines, 105–125. Berkeley, CA: Apress.
77 Bansal, J.C., Singh, P.K., and Pal, N.R. (eds.). Evolutionary and Swarm Intelligence
Algorithms. Springer, Berlin, Heidelberg.
265
13
Comments-Based Analysis of a Bug Report Collection System and Its Applications
13.1 Introduction
In the new era of the World Wide Web, plenty of structured and unstructured data is
available. Structured data is associated with a database while unstructured data can be tex-
tual or nontextual. The analysis of unstructured data to discern patterns and trends that
are relevant to the users is known as text mining or text analytics. Text mining emerged
in the late 1990s and discovers hidden relationships and complex patterns from large tex-
tual sources. It uses several techniques such as classification, decision trees, clustering, link
analysis, and so on. These techniques can be applied to high-dimensional data using dimen-
sionality reduction statistical techniques such as singular value decomposition and support
vector machine. Text analytics is evolving rapidly and has become an asset for researchers
as it is capable of addressing diverse challenges in all fields.
In this research work, text analytics is used to uncover various patterns and trends from software bug reports reported in issue tracking systems. An issue tracking system is used to track and record all kinds of issues, such as bugs, new features, enhancements, tasks, and sub-tasks, or any other complaints in the system. There exist different issue tracking systems, such as Bugzilla, Jira, Trac, Mantis, and many others. The Jira issue tracking system was developed by the Atlassian company and is used for bug tracking, issue tracking, and project management. Bug reports of 20 distinct Apache Software Foundation (ASF) projects under the Jira repository over the last nine years (2010–2018) are extracted. Several bug attributes, such as Bug Id, priority name, status, resolution, one-line description,
developer assigned to, fix version, component they belong to, number of comments made
for each bug, long description of bugs, and comments made among various contributors
are extracted using the tool Bug Report Collection System (BRCS) [1]. Bug reports contain
useful hidden information that can be beneficial for software developers, test leads, and
project managers. Based on severity, bugs are broadly classified into two categories: severe
and nonsevere. Severe bugs are critical in nature and can cause system crash or can degrade
the performance of software. Therefore, enormous open bugs of a high severity level are
undesirable. To discover more valuable and profitable information from long descriptions
and comments of bug reports, text analytics is used. Textual data is preprocessed using
tokenization, stemming, and stop word removal. The most frequently occurring words are
mined and correlated, and then these words are extracted to categorize bugs into various
types of errors such as logical code error, input/output (I/O), network, and resource alloca-
tion errors, or to predict the severity of newly reported bugs. These aspects are analyzed to assist test leads, project managers, and developers in resolving critical open bugs, classifying bugs into various categories of error, predicting severity using keywords, and predicting duplicate bug reports using clustering, all without reading an entire bug report. This will aid in quick bug resolution, improving the system's performance and saving time.
To the best of our knowledge, this is the first work that analyzes various trends from
software bug reports to benefit test leads in maintenance processes of software projects.
This analysis is performed in the form of research questions.
Research Question 1: Is the performance of software affected by open bugs that are critical
in nature?
Research Question 2: How can test leads improve the performance of software systems?
Research Question 3: Which are the most error-prone areas that can cause system failure?
Research Question 4: Which are the most frequent words and keywords to predict critical
bugs?
Research Question 5: What is the importance of frequent words mined from bug reports?
● Open bugs of various projects are analyzed, and the results show that 24% of open bugs are of a high severity level (blocker, critical, and major); these are critical and need to be resolved as a high priority to prevent failure of software systems.
● The most contributing developers of various projects, based on the maximum number of bugs resolved and comments made, are extracted; this will assist test leads in assigning critical open bugs to them.
● Co-occurring and correlated words are extracted and bugs are categorized into various
types of errors, such as logical code error, network error, I/O error, and resource allocation
error.
● It is established that the severity of a newly reported bug report can be predicted using
frequently occurring words and most associated words.
● Frequent words of various projects are clustered using k-means and hierarchical cluster-
ing, which will help developers to detect duplicate bug reports.
The organization of the paper is as follows: Section 13.2 gives a brief description of the
issue tracking system and bug report statistics studied in this work, followed by related
work on data extraction process and various application of comments of bug reports in
Section 13.3. Section 13.4 describes the data collection process and analysis of bug reports
in done in Section 13.5. Threats to validity are discussed in Section 13.6, followed by a
conclusion.
13.2 Background
13.2.1 Issue Tracking System
An issue tracking system is a software repository used by organizations for software main-
tenance and evolution activities. It provides a shared platform where team members of an
organization can plan, track, and report various issues and releases of software. It gives a software team a single view of all elements, regardless of whether an element is a bug, a task, a related subtask, or a new feature request. This single view of information helps the team to prioritize their goals and to assign them to the right team member at the right time. It
also records the progress of every issue until it is resolved. There are several issue tracking
systems such as Bugzilla,1 Codebeamer,2 Trac,3 and Jira.4 The Jira issue tracking system is
developed by the Atlassian software company and provides an easy-to-use user interface. Its
prime advantage is that it provides easy integration with other Atlassian tools. It is devised as a multi-project request tracking system, which allows projects to be sorted into different
categories, where each project can have its own settings with respect to filters, workflows,
reports, issue types, etc. Jira is selected over other issue tracking repositories due to the
following reasons:
● Jira provides easy installation and does not require any environment preparation.
● It provides an intuitive and easy-to-use user interface.
● It supports a multi-project environment.
● It supports a variety of plugins to increase its functionality.
Among several types of issues maintained under the Jira repository, bug reports are an
essential software artifact. Thus, bug reports of various projects of ASF based on Jira repos-
itory are extracted. A bug report is characterized by useful standard fields of information
such as title of bug, its resolution, severity, developer assigned to, date, component to which
they belong, and so on. It also consists long description of bug reports and long comments
made by several developers to discuss and share experience about the bug resolution. In this
work, various fields of bug reports along with its long description and threaded comments
made among various contributors are extracted and analyzed.
1 www.bugzilla.org.
2 www.intland.com/products/codebeamer.html.
3 http://trac.edgewall.org.
4 www.atlassian.com/software/jira.
Figure 13.1 Statistics of bug reports of the 20 projects of the Apache Software Foundation (number of bug reports per project).
Figure 13.2 (a) Number of bug reports based on resolution. (b) Number of bug reports based on severity (Minor, Trivial, Blocker, Critical, Major).
A range of bugIds is entered, between which all bug reports will be extracted and saved in the form of Microsoft Excel files [1]. In this research work, an extension to the BRCS tool is made. Along with several bug attributes, the number of comments reported for each bug and the detailed conversations among various contributors in the form of comments, along with the contributor's name and the date of each comment, are recorded. Stack traces, source code, and other structural information are also extracted. This extraction of information in the form of natural language is used in several applications, which are discussed in Section 13.3.2. A comparison of previous work on the extraction of software artifacts is given in Table 13.1.
Table 13.1 Comparison of previous work on the extraction of software artifacts.
N. Bettenburg et al. [2]: bug reports (stack traces, source code, patches); tool: InfoZilla; source: Eclipse; method: several filters are implemented to extract these elements; limitation: several filters based on elements need to be implemented.
Y. Yuk et al. [4]: bug reports; tool: none; source: Mozilla; method: pages of bug reports are crawled and then transformed to data traces; limitation: web pages are crawled through a parsing method, no tool is used.
R. Malhotra [5]: log files; tool: CMS; source: JTreeView; method: logs are obtained from the CVS repository using the "diff" command.
M. Nayrolles et al. [3]: bug reports; tool: BUMPER; source: Gnome, Eclipse, GitHub, NetBeans, Apache Software Foundation; method: web-based tool that extracts data based on queries in two modes, basic and advanced; limitation: sound knowledge of the query language is required to use the tool.
Our work: bug reports (metadata, long description, comments); tool: BRCS; source: 20 projects of the Apache Software Foundation; method: bug reports are accessed through REST APIs.
[16], and conditional random fields [17]. Extractive document summarization has been
done using query-focused techniques in which sentences are extracted based on queries
entered [18–20]. Along with text summarization, summarization of bug reports is an emerg-
ing field that has motivated researchers to generate useful summaries. Summarization of
bug reports is an effective technique to reduce the length of bug reports, keeping intact
all important and useful description needed to resolve and fix bugs. S. Rastkar et al. used
an e-mail summarizer developed by Murray et al. [21] to discover the similarity between
e-mail threads and software bug reports. It was found that an e-mail summarizer can be
used for bug summarization [22]. To improve the performance of this finding, the Bug Report Corpus (BRC) was built during their subsequent study. The BRC comprises 36 bug reports from four open-source projects: Mozilla, Eclipse, Gnome, and KDE. A manual summary of each bug report was created by three annotators and a gold standard summary (GSS) was formed. The GSS consists of those sentences that were selected by all three annotators, and a logistic regression classifier was trained on it. For automatic generation of bug report summaries, four groups of features (structural, participant, lexical, and length features) were extracted and the probability of each sentence was computed. Sentences with a high probability value form the summary of a bug report [23]. He Jiang et al. modified the BRC corpus by adding the duplicate bug reports of each master bug report. It was proposed that training a classifier on a corpus of master and duplicate bug reports could generate a more accurate summary of a master bug report. The PageRank algorithm was applied to compute textual similarity between the master and duplicate bug reports, and sentences were ranked based on similarity. Also, the BRC classifier proposed by Rastkar et al. [23] was used to extract features for each sentence and the probability of each sentence was calculated. Sentences extracted from the PageRank and BRC classifiers are merged using a ranking merger and, finally, the summary of the master bug report is created [24]. I. Ferriera et al. rank comments of bug reports using various ranking techniques, and the approach is evaluated on 50 bug reports of open-source projects [25].
In contrast to supervised extractive summarization of bug reports, a couple of
research works focus on unsupervised extractive summarization. Mani et al. proposed a centroid-based summarization approach that removes noise from sentences on the basis of question, investigative, code, and other formats. The approach was evaluated on the SDS and DB2 corpora [26]. This research work was extended by Nithya et al., who focus on duplicate bug report detection using the noise reducer [27]. Lotufo et al. generate a summary of bug reports based on hypotheses; four summarizers were implemented, one for each hypothesis and one for all three hypotheses combined. The sentences were ranked based on relevance and the Markov chain method was used to generate the summary [28].
considered due to low occurrence in issue comments. Five variants of classifier based on
each emotion were constructed based on support vector machine, naïve Bayes, single-layer perceptron, k-nearest neighbor, and random forest. The performance of each classifier
is evaluated using a bootstrap validation approach. The results confirmed that issue
comments do convey emotional information and it is possible to identify emotions through
emotion-driven keywords [29]. Q. Umer et al. propose an emotion-based automatic approach for priority prediction of bug reports. For this approach, bug reports of four open-source projects from Bugzilla are extracted and the summary of each bug report is preprocessed. After preprocessing, emotion analysis is performed on each bug to identify emotion words using an emotion-based corpus, and an emotion value is assigned. A support vector machine classifier is then trained on the data set based on the emotion value, and the priority of bug reports is predicted. The approach is evaluated using
various performance metrics such as precision, recall, and F-score. It was concluded that
an emotion-based priority prediction helps developers to assign appropriate priority to
bug reports [30]. G. Yang et al. proposed an approach to predict the severity of bugs using
emotion similarity. For this approach, an emotion word-based dictionary (EWD) is created
and similar emotions bug reports are identified using a smoothed unigram model (UM)
based KL-divergence method. An emotion-score is assigned to each bug report and an
emotion similarity multinomial (ES-multinomial) machine learning technique is used.
The approach is evaluated on five open-source projects, namely, Eclipse, Gnu, Jboss,
Mozilla, and Wireshark. Results are compared with other machine learning techniques,
i.e., naïve Bayes multinomial and EWD multinomial. It was concluded that the proposed
approach outperforms other techniques [31]. In contrast to emotions, sentiment-based analysis is performed by J. Ding et al., who developed a sentiment analysis tool, SentiSW, which classifies issue comments into three categories (negative, positive, and neutral) and generates a <sentiment, entity> tuple. Six machine learning classifiers
such as random forest, bootstrap aggregating, gradient boosting tree, naïve Bayes, ridge
regression, and support vector machine are implemented in the Scikit-learn tool. The
approach is evaluated using 10-fold cross-validation technique on 3000 issue comments.
The tool achieves a mean precision of 68.71%, recall of 63.98%, and 77.19% accuracy, which
outperforms other existing tools [32].
is also extracted, along with long threaded discussions in the form of comments among various contributors. Along with comments, the name of the contributor and the time and date of each comment are also fetched. Bug reports of 20 projects of ASF are
extracted, namely, Ambari, Camel, Apache Cordova, Derby, Accumulo, Hadoop-Common,
Hadoop-Hdfs, Hbase, Hive, Ignite, Kafka, Lucene, Mesos, Maven, Ofbiz, Pig, Qpid, Sling,
Spark, and Groovy. The information is retrieved in the form of various reports, which are
discussed in Section 13.4.3. Figure 13.3 represents a snapshot of various attributes of bug
reports that have been extracted.
● To interact with the Jira server applications remotely, Jira REST APIs are used, which are the standard interface to the Jira repository.
● BRCS accesses Jira's REST APIs, which provide access to data entities via URI paths. An HTTP request is generated and, in response, bug report data is collected and parsed into objects.
● The communication format used by the Jira REST APIs is JSON, together with standard HTTP methods such as GET, PUT, POST, and DELETE. The general structure of the URI is http://host:port/context/rest/api-name/api-version/resource-name. The api-name used by our application is api, which is used for everything other than authentication-related operations (which use auth). The current api-version is 2.
● In the next step, the names of the Apache projects from which data needs to be extracted are entered, along with the StartIssue Id and EndIssue Id of the bugs. The process then fetches all the bugs between the entered IssueIds. The field IssueTypeId is checked: if the response from the server is 1, the issue is a bug and all relevant information is extracted; if IssueTypeId is other than 1, the issue is of another type and an error report is generated. A minimal sketch of such a request is given after this list.
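Purely as an illustration of this request flow, the following Python sketch fetches issues over the Jira REST API with the requests library; the host, project key, and issue range are placeholder values rather than the exact inputs used by BRCS, and the bug check is done here by issue-type name instead of by IssueTypeId.

import requests

JIRA_HOST = "https://issues.apache.org/jira"   # placeholder Jira instance
PROJECT_KEY = "KAFKA"                          # hypothetical project key
START_ID, END_ID = 1, 50                       # hypothetical StartIssue Id / EndIssue Id range

for issue_number in range(START_ID, END_ID + 1):
    # General URI structure: http://host:port/context/rest/api-name/api-version/resource-name
    uri = f"{JIRA_HOST}/rest/api/2/issue/{PROJECT_KEY}-{issue_number}"
    response = requests.get(uri)               # HTTP GET; the reply body is JSON
    if response.status_code != 200:
        continue                               # skip missing or inaccessible issues
    fields = response.json()["fields"]
    if fields["issuetype"]["name"] != "Bug":   # keep only bug-type issues
        continue
    print(f"{PROJECT_KEY}-{issue_number}:", fields["summary"], "|", fields["status"]["name"])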
Figure: working of BRCS. A URI path is accessed, the REST API is called via an HTTP request, and the HTTP response generated is parsed.
13.5 Analysis of Bug Reports
This section analyzes bug reports of various software projects of ASF based on the textual descriptions of bug reports and the comments of various contributors. The results are analyzed in the form of various research questions.
Figure 13.5 Number of open bugs per severity level (Blocker, Critical, Major, Minor, Trivial) for the 20 projects of ASF.
13.5.1 Research Question 1: Is the Performance of Software Affected by Open Bugs that Are Critical in Nature?
Because critical open bugs can alter the functionality of software projects and may cause system failure, a large number of open bugs of high severity is undesirable.
To evaluate this aspect, the number of open bugs at each severity level is obtained. It was established that 3.3% of open bugs are blockers, which can cause system failure, 7% are critical bugs, and 12.6% are major bugs; these open bugs can cause failure of functionality and reduce the performance of the software system. The results will assist test leads in assigning these critical open bugs to developers in order to prevent system crashes and to improve the performance of software systems. Figure 13.5 represents the number of open bugs per severity level for 20 distinct projects of ASF. It shows that each project has a greater number of open bugs with the "major" severity level, which can degrade the system's performance. Figure 13.6 depicts the percentage of open bugs per severity level, which need to be resolved on a priority basis. It shows that 9% of open bugs are blocker bugs, which can cause system failure, and 18% and 33% of open bugs are critical and major, respectively, due to which the system's performance can decline.
Figure 13.6 Percentage of open bugs per severity level (major 33%, minor 22%, with the remainder blocker, critical, and trivial).
13.5.2 Research Question 2: How Can Test Leads Improve the Performance
of Software Systems?
Open-source projects can be modified and shared by any end user as they are publicly accessible. In open-source software projects, many developers contribute to various projects in several ways, such as bug fixing, bug resolution, adding new enhancements, requesting new features, providing feedback, and many others. As software evolves, the number of developers contributing to the software project changes. Some developers complete their assigned tasks and leave in the middle of the project. Only a few developers remain throughout the entire life cycle of the software. In this scenario, it is challenging for test leads to find developers who can resolve high-priority and high-severity bugs at the earliest. To help test leads and project managers improve the system's performance, the top five developers of each project are retrieved based on the number of bugs resolved and their contribution in the form of the number of comments made for resolving bugs. The developers making the maximum number of comments are those most involved in the bug resolution process. Therefore, test leads can assign bugs with a high severity level to these developers for their quick resolution. Figure 13.7a–d represents the names of the developers with the maximum contribution in the bug resolution process for each project. The results are retrieved from the comments of bug reports of various projects of ASF.
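A sketch of how the most contributing developers could be tallied from the extracted records is given below; the DataFrame and its column names are hypothetical stand-ins for the BRCS output.

import pandas as pd

# Hypothetical per-bug records: which developer resolved the bug and how many comments they made.
records = pd.DataFrame({
    "project":   ["Kafka", "Kafka", "Kafka", "Hive", "Hive"],
    "developer": ["dev_a", "dev_a", "dev_b", "dev_c", "dev_d"],
    "resolved":  [1, 0, 1, 1, 1],
    "comments":  [3, 5, 2, 4, 1],
})

# Rank developers within each project by bugs resolved and comments made; keep the top five.
top_contributors = (records.groupby(["project", "developer"])[["resolved", "comments"]]
                    .sum()
                    .sort_values(["resolved", "comments"], ascending=False)
                    .groupby(level="project")
                    .head(5))
print(top_contributors)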
13.5.3 Research Question 3: Which Are the Most Error-Prone Areas that Can
Cause System Failure?
To explore the most error-prone areas, the textual attributes of bug reports, i.e., the one-line description, the long description, and the comments, are analyzed. It was discovered that four common types of errors occur frequently: logical errors, I/O errors, network errors, and resource allocation errors. Certain keywords are determined and, to help developers and test leads uncover these errors, several associated words are identified. The presence of such keywords can indicate the presence of a particular type of error. This will help developers locate the type of error without reading an entire bug report and can result in the quick resolution of a bug.
To accomplish this, the data has been preprocessed using standard preprocessing steps in the R language, namely tokenization, stemming, and stop word removal. For preprocessing, the "tm" and "NLP" packages are installed. Tokenization is the segmentation of a string of text into words called tokens. Stemming eliminates the prefixes and suffixes from a word and converts it into a stem word; lastly, stop words are removed. Stop words are the most commonly used words in a language and do not contribute much to the meaning. These
preprocessing steps are illustrated below:
Example: The boy was running after the bus.
Tokenization: “the,” “boy,” “was,” “running,” “after,” “the,” “bus”
Stemming: Running → run
Stop word removal: “the” is removed.
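The chapter performs these steps with R's tm and NLP packages; purely as an illustration, the same steps can be reproduced on the example sentence in Python with NLTK (stop word removal is done before stemming here for simplicity, and the punkt and stopwords corpora must be downloaded once).

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords  # requires nltk.download("punkt") and nltk.download("stopwords")

sentence = "The boy was running after the bus."

# Tokenization: segment the string into word tokens.
tokens = [t.lower() for t in word_tokenize(sentence) if t.isalpha()]
# Stop word removal: drop very common words such as "the" and "was".
content_words = [t for t in tokens if t not in stopwords.words("english")]
# Stemming: strip affixes, e.g. "running" becomes "run".
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in content_words]

print(tokens)         # ['the', 'boy', 'was', 'running', 'after', 'the', 'bus']
print(content_words)  # ['boy', 'running', 'bus']
print(stems)          # ['boy', 'run', 'bu']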
After preprocessing of the text data, a document term matrix is constructed and the sparsity of the matrix is reduced by a factor of 0.98 using the removeSparseTerms() function. The frequency of each word is computed using the findFreqTerms() function.
Figure 13.7 (a)–(d) Most contributing developers for the 20 projects of the Apache Software Foundation.
Table 13.2 Frequently used keywords to identify each type of error.
Logical code error: throws, exception, divide, unhandled, pointer, throw, uncaught, null pointer, raises, system out of memory exception, trigger, dividezero, rethrow.
Input/output error: logging, build, classpath, inputformat, api-jarfilemissing, log, loggers, imports, initialized, requestchannels, map output location get file, displayed, console log.
Resource allocation error: task_allocation, buffers, memory, synchronized, configuration, memory failure, runtime, dynamic, bucketcache.
Network related error: datanode, localhost, address, port, domain, security, process, https, global, interfaces, binding, virtual, bindexcept, limits.
Figure 13.8 Code for finding correlated words.
For each type of error, correlated words around particular keywords are determined. Correlation indicates the co-occurrence of words in a document. The findAssocs() function is used, which returns the words that co-occur with the searched term. To classify a bug report into a particular type of error, a list of keywords is established. Table 13.2 represents frequently used keywords for identifying each type of error. For illustration, a snapshot of the implemented code is shown in Figure 13.8. The association graphs for the various types of error are depicted for the Kafka project in Figure 13.9a–d.
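The chapter implements this step with R's removeSparseTerms(), findFreqTerms(), and findAssocs(); the sketch below reproduces the idea with scikit-learn, where frequent terms are read off a document-term matrix and the words "associated" with a keyword are those whose per-document counts correlate with it. The toy documents are placeholders for the extracted descriptions and comments.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy bug-report texts standing in for the extracted long descriptions and comments.
docs = [
    "null pointer exception thrown in consumer",
    "uncaught exception raised on divide by zero",
    "build classpath logging error in api imports",
    "datanode port binding error on localhost",
]

# Document-term matrix; the R workflow additionally prunes terms with removeSparseTerms(0.98).
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs).toarray()
terms = list(vectorizer.get_feature_names_out())

# Most frequent terms, analogous to findFreqTerms().
frequency = dtm.sum(axis=0)
print(sorted(zip(terms, frequency), key=lambda tf: -tf[1])[:5])

# Words correlated with a chosen keyword, analogous to findAssocs(dtm, "exception", 0.5).
keyword_counts = dtm[:, terms.index("exception")]
for i, term in enumerate(terms):
    if term != "exception" and dtm[:, i].std() > 0:
        corr = np.corrcoef(keyword_counts, dtm[:, i])[0, 1]
        if corr > 0.5:
            print(term, round(corr, 2))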
Figure 13.9 (a)–(d) Association graphs of the various types of error for the Kafka project.
Figure 13.10 (a)–(b) Frequency and association plots for severe bugs: (a) word-frequency plot for the Apache Cordova project; (b) association plot for highly severe associated words.
Frequent words and words associated with severe bugs, per project:
Hadoop-Hdfs: fix, error, commands, message, typo, log, tests, report | blocks, remove, failure, use, incorrect, storage, webhdfs.
Apache Cordova: work, run, file, error, incorrect, remove, plugins, platform | crashes, version, configxml, warning, xcode, event, destroy, forcibly.
Hadoop-Common: message, default, log, error, typo, options, root | unable, written, upgrade, stuck, timeout, tokens, zero, retry, consequence, shutdown, close, returned.
Hive: error, exception, null, partition, throws | join, columns, query, execution, dynamic, views, vectorize.
Ambari: error, default, message, check, button, page, clusters | upgrade, failed, python, install, hosts.
Groovy: access, breaks, default, causes, delegate, throws | allow, assert, deadlock, closure, exception, fail.
Hbase: typo, warning, incorrect, missing, log, remove, wrong | fails, abort, block, exception, split, deadlock, inconsistent, authfailedexception, datablock, compress, preemptive, fast.
Lucene: can, field, fix, incorrect, query, test | bug, broken, causes, failure, exception, nullpointerexception.
Mesos: broken, add, log, link, type, wrong, warning | bad, crashes, fail, invalid, shutdown, implement, update.
Maven: error, plugin, pom, version, warning, fix | causes, deploy, configuration, exception, failure, invalid, nullpointerexception.
Qpid: console, deleted, error, exchange, test, broker | crash, deadlock, fail, exception, throws, timeout, unable.
Sling: tests, missing, due, integration, error | null, volatile, exception, fails, content, npe.
Spark: fix, incorrect, error, return, wrong, missing, default | block, fail, parquet, exception, executor, invalid, thrift.
Accumulo: error, fix, missing, checking, test | lock, blocks, deletion, misconfigured, nameerror, shell server it delete throws.
Kafka: unrelated, irregular, relative, ahead, commitsync | indefinitely, unclosed, abstract fetcher thread shutdown, test broker failure.
Clustering groups similar data together and supports improved decision making. Clustering is of two types: k-means and hierarchical clustering. K-means clustering groups the data based on similarity and splits the data set into k groups. Clusters are formed by computing the distance between each pair of observations. The Euclidean distance is used to find the similarity between two elements (a, b), as defined in Eq. (13.1).
d(a, b) = √( Σ_{i=1}^{n} (a_i − b_i)² )    (13.1)
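As a small numeric illustration of Eq. (13.1), the distance between two word-frequency vectors can be computed directly; the vectors here are arbitrary placeholders.

import numpy as np

# Two hypothetical term-frequency vectors a and b.
a = np.array([3.0, 0.0, 2.0, 1.0])
b = np.array([1.0, 1.0, 2.0, 0.0])

# Eq. (13.1): d(a, b) = sqrt(sum_i (a_i - b_i)^2)
d = np.sqrt(np.sum((a - b) ** 2))
print(d)  # sqrt(6) ~= 2.449, the same value as np.linalg.norm(a - b)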
Figure 13.11 K-means cluster plot (CLUSPLOT) of frequent words for the Apache Cordova project.
Hierarchical clustering does not require a predefined number of clusters and represents the data in a tree-based structure called a dendrogram. Cluster analysis has been widely used in various fields such as duplicate bug report detection [33], clustering defect reports [34], and categorization and labeling of bug reports [35]. To evaluate the importance of the frequent words of bug reports, k-means and hierarchical clustering are performed on the 20 projects of ASF, and similar words are grouped into clusters. The results are illustrated for the Apache Cordova project. The k-means clustering in Figure 13.11 forms three clusters, showing a dissimilarity of 74.75% between clusters 1 and 2, whereas cluster 3 contains highly similar words, thus depicting overlapping. The hierarchical clustering in Figure 13.12 depicts several clusters in the form of a dendrogram. Clusters of similar words are thus used in many applications and are significant.
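A compact sketch of both clustering steps on a small word-frequency matrix is given below, using scikit-learn and SciPy; the chapter itself produces the cluster plot and dendrogram in R (clusplot and hclust with complete linkage), and the words and counts here are placeholders.

import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical frequency vectors for a handful of frequent words (rows) across documents (columns).
words = ["android", "cordova", "plugin", "crash", "error", "build"]
X = np.array([
    [5, 0, 1, 0],
    [4, 1, 1, 0],
    [3, 1, 0, 1],
    [0, 4, 1, 2],
    [0, 3, 2, 2],
    [1, 0, 4, 3],
])

# K-means: partition the words into k groups by Euclidean distance to the cluster centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(dict(zip(words, kmeans.labels_)))

# Hierarchical clustering: build a dendrogram with complete linkage, as in hclust(*, "complete").
tree = linkage(X, method="complete")
print(dict(zip(words, fcluster(tree, t=3, criterion="maxclust"))))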
Figure 13.12 Dendrogram of the most similar words for the Apache Cordova project (hclust, complete linkage).
13.7 Conclusion
Software bug reports reported in the issue tracking system contain valuable and informative data that is useful for several applications, such as emotion mining, bug summarization,
bug report duplicate detection, and many others. Analysis of this useful data is highly desirable. In this research work, a quantitative analysis of the data in bug reports is performed to discern various patterns and trends. The analysis is conducted on bug reports of 20 projects of ASF under the Jira repository. It was found that 24% of open bugs are critical in nature, can cause system failure, and are a high priority for resolution. Test leads can assign critical open bugs to the most contributing developers for quick resolution. Frequent words and the most correlated words are extracted from the long descriptions and comments of bug reports, which can be used to categorize bugs into various types of errors and to predict the severity of a newly reported bug in the software. Clusters of frequently occurring keywords are formed using unsupervised clustering techniques to help developers detect duplicate bug reports. This effective and practical analysis of bug reports of the issue tracking system will help software developers, test leads, and project managers in the software maintenance process.
References
1 Kaur, A. (2017). Bug report collection system (BRCS), 697–701. https://doi.org/10.1007/1-84628-262-4_10.
2 Bettenburg, N., Zimmermann, T., and Kim, S. (2008). Extracting structural information from bug reports, 1–4.
3 Nayrolles, M. (2016). BUMPER: a tool for coping with natural language searches of millions of bugs and fixes. https://doi.org/10.1109/SANER.2016.71.
4 Yuk, Y. and Jung, W. (2013). Comparison of extraction methods for bug tracking system analysis, 2–3.
5 Malhotra, R. (2014). CMS tool: calculating defect and change data from software project repositories. ACM SIGSOFT Software Engineering Notes 39 (1): 1–5. https://doi.org/10.1145/2557833.2557849.
6 Gong, Y. and Liu, X. (2001). Generic text summarization using relevance measure and
latent semantic analysis. In: Proceedings of the 24th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval - SIGIR 01, 19–25.
ACM https://doi.org/10.1145/383952.383955.
7 Wang, D., Li, T., Zhu, S., and Ding, C. (2008). Multi-document summarization via
sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of
the 31st annual international ACM SIGIR conference on Research and development in
information retrieval, 307–314.
8 Hua, W., Wang, Z., Wang, H. et al. (2016). Understand short texts by harvesting and analyzing semantic knowledge. IEEE Transactions on Knowledge and Data Engineering 29 (3): 499–512.
9 Jha, N. and Mahmoud, A. (2018). Using frame semantics for classifying and summarizing
application store reviews. Empirical Software Engineering https://doi.org/10.1007/s10664-
018-9605-x.
10 Erkan, G. and Radev, D.R. (2004). LexRank: graph-based lexical centrality as salience
in text summarization. Journal of Artificial Intelligence Research 22: 457–479. https://doi
.org/10.1613/jair.1523.
14
Sarcasm Detection Algorithms Based on Sentiment Strength
14.1 Introduction
Sentiments have become a puzzle to be solved these days. Verbal or written expressions
of sentiments are tough to comprehend because of the innovative ways people have been
adapting in order to express them. Where sentiments used to be a binary value earlier with
just positive and negative values to look for, the advent of sarcasm has made the idea a little
more explicit. Sarcasm is when someone decides to use words of opposite meaning to what
he/she is actually feeling. Sarcasm is the new trendsetter and is so widely used and appre-
ciated that those who do not know it have started to learn it. So, the text we come across
in our day-to-day lives, be it on Amazon reviews or Twitter feeds or maybe the daily news
headlines are a carrier of sarcasm, in some way or the other. If we wish to detect the senti-
ment values accurately, we need an algorithm that detects the types of sarcastic expressions
along with the positive and negative emotions. According to the linguist Camp [1], sarcasm is broadly of four types: propositional, embedded, "like"-prefixed, and illocutionary. Hyperbole, a type of embedded sarcasm, is considered a strong marker of sarcasm. It is recognized
when a sentence has both positive and negative polarity. The presence of these contradict-
ing sentiment values is a pointer to hyperbolic sarcasm. Table 14.1 gives a few examples of hyperbolic sarcasm.
The examples in Table 14.1 show hyperbolic sarcasm. As observed, each of the sentences has positive- and negative-polarity words in it. For instance, in the sentence "A friend in need is a pest indeed," the word friend is a positive word and pest is a negative word. Both words carry equal weight in positive and negative polarity. Hence, the sentence is a hyperbolic sarcasm. In all the other sentences of Table 14.1, the bold phrases contain positive as well as negative words of similar weightage on the sentiment polarity scale. These sentences are a few from the human-annotated corpus of hyperbolic sarcasm.
This chapter works on a rule-based approach to detect at least one type of sarcasm (hyperbole) along with the general sarcasm and non-sarcasm classes. It uses SentiStrength1 implemented in Python to find the sentiment strengths. The algorithms proposed in this paper detect sarcastic, hyperbolic, and non-sarcastic sentences. The sarcastic data
set used for the experiment is taken from the social media pages dedicated to sarcastic
posts on Facebook, Instagram, and Pinterest. The posts are collected manually from social
media pages dedicated to sarcastic quotes and so are human annotated. The corpus has
1500 sentences with 500 sarcastic, 500 positive, and 500 negative sentences. Positive and
1 http://sentistrength.wlv.ac.uk.
Table 14.2 Examples for general sarcasm, positive sentences, and negative sentences.
Sarcasm:
The trouble of being punctual is that no one is there to appreciate it.
I correct the auto-correct more than the auto-correct corrects me.
Marriage is one of the chief causes of divorce.
Positive:
The best preparation for tomorrow is doing your best today.
Learn from yesterday, live for today, hope for tomorrow.
Positive action combined with positive thinking results in success.
Negative:
Tears come from the heart and not from the brain.
The person who tries to keep everyone happy often ends up feeling lonely.
The worst kind of sad is not being able to explain why.
negative sentences are also taken from an already labeled data set of Amazon reviews from
the University of California–Irvine (UCI) machine learning repository.2
The remainder of the chapter is arranged in the following manner: a literature survey of relevant papers elaborating various approaches to sarcasm detection, the experimental setup, the results, and the evaluation of the results.
Examples of sarcastic, positive, and negative sentences are given in Table 14.2.
2 https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences.
292 14 Sarcasm Detection Algorithms Based on Sentiment Strength
features and distance-weighted similarity features over four types of word embeddings: GloVe, LSA, Word2Vec, and dependency weights. It is concluded experimentally that considering only word-embedding features is not enough and that sarcasm detection is better when they are used along with other features. Word embedding suffers from several issues like incorrect
is used along with other features. Word embedding suffers from several issues like incorrect
senses, contextual incompleteness, and metaphorical nonsarcastic sentences.
Joshi et al. [5] in their survey paper, have covered all the aspects of automatic sarcasm
detection, including statistical approaches. They have identified learning algorithms such
as SVM, logistic regression, and naïve Bayes as the most commonly used algorithms. They
have also pointed out the use of the (balanced) Winnow algorithm to get high-ranking features. Irony, political humor, and education-labeled sentences are easily identified by naïve Bayes, while the sequential nature of output labels is easily handled by SVM-HMM.
Joshi et al. [6] used the concept of "sentence completion" for detecting sarcasm. They adopted two approaches: in the first, they considered all the words occurring in a sentence as incongruous candidate words and then took similarity measures; in the second, they cut the number of words down to half by eliminating redundant comparisons for incongruous words. The researchers have
used Word2Vec and WordNet as the similarity metric for both the approaches. The authors
have observed that if the exact incongruous words are known, then the model gives much
better results than other approaches. The model suffers errors due to the absence of Word-
Net senses and inaccurate sentence completion.
Muresan et al. [7] have reported on a method for building a repository of tweets that
are sarcastic in nature and are explicitly labeled by their authors. They used the Twitter API to extract tweets of a sarcastic nature, detected by the hashtags #sarcasm or #sarcastic, and other sentiment tweets (purely positive or purely negative) by their hashtags for happy, lucky, and joy, and for sadness, angry, and frustrated. The authors have compared the annotated corpus
with the positive and negative sentiment utterances and have investigated the effect of
lexical features like n-grams or dictionary-based as well as pragmatic factors like emoti-
cons, punctuation, and contexts on the machine learning approach used for sarcastic sen-
tences. Their final corpus had 900 tweets falling under the category of sarcastic, positive,
and negative in nature. They have used and compared the effects of three popular classi-
fication algorithms naïve Bayes, SVM, and logistic regression. They have conducted four
classifications: Sarcasm_Positive_Negative, Sarcasm_NonSarcasm, Sarcasm_Positive, and
Sarcasm_Negative.
Reganti et al. [8] have shown results for the classifiers random forest, SVM, decision trees, and logistic regression. They have taken n-grams and sentiment lexica as bag-of-words
features for detecting satire in a Twitter data set, newswire articles, and Amazon product
reviews. Their results show that a combined approach with all features considered together
gives better recognition rate than the features used alone.
Bharti et al. [9] recognized that verbal sarcasm is composed of heavy tonal stress on cer-
tain words or specific gestures like the rolling of eyes. However, these markers are missing
in case of textual sarcasm. This paper has discussed sarcasm detection based on parsing
and interjection words. They have used NLTK and TEXTBLOB packages of Python to find
14.2 Literature Survey 293
the POS tags and parses of tweets. The PBLGA algorithm, proposed in the paper, takes tweets as
input and detects sarcasm based on sentences with positive situation-negative sentiment
and negative situation-positive sentiment. The second algorithm proposed by the paper
detects sarcasm based on interjection words present in the sentence. These approaches give
good recall and precision.
Riloff et al. [10] attempted to find sarcasm as a dividing line between a positive sentiment expressed in a negative context. For this, bootstrap learning is used, where rich phrases of positive sentiments and negative situations are learned by the proposed model. Any tweet
encountered with positive sentiment or a negative situation marks the source of sarcasm.
Then those sentences are tested for contrasts in polarity. The sentences with a contrast
between the two features are categorized as sarcastic.
Rajadesingan et al. [11] try to look into the psychological and behavioral aspect of sarcasm
utterance and detection. This paper has tried to capture different types of sarcasm depend-
ing on the user’s current and past tweets. Through this paper, the researcher has proposed
SCUBA approach, i.e., sarcasm classification using a behavioral modeling approach. This
approach proposes that, given an unlabeled tweet from a user along with his or her past tweets, it can be automatically detected whether the current tweet is sarcastic or not.
Kunneman et al. [12] identify linguistic markers of hyperbole as an essential feature to
identify sarcasm. Since hyperbolic words are intensity carriers, their presence makes sarcasm detection much easier than it is otherwise. For example: "the weather is good" is
less likely to be sarcastic but “fantastic weather” is easily detected as hyperbolic sarcasm
because of the word “fantastic.” Intensifiers, thus, strengthen an utterance and play an
essential role in hyperbole identification. Authors have used the balanced Winnow clas-
sification. Tweets were tokenized, less frequent words were removed, and the punctua-
tions, emoticons, and capitalizations were preserved as possible sarcasm markers. N-grams
(n = 1,2,3) were used as features for the data set.
Felbo et al. [13] used emoji occurrences with tweets as sarcasm indicators. The
researchers show that the millions of texts with emojis available on social media can
be used to train models and make them capable of representing emotional content in
text. They have developed a pretrained DeepMoji model and observed that the diversity
of emojis is crucial for its better performance. Their methodology includes pretraining only on English tweets that contain emoticons and no URLs, which makes the model learn the emotional content of texts in a richer way. The DeepMoji model uses a 256-dimension embedding layer to project every word into the vector space.
Maynard et al. [14] have focused on hashtag tokenization because it contains crucial sen-
timent information. While tokenizing tweets, these hashtags get tokenized as one unit, which affects the sentiment polarity of the sentence. The authors have used the Viterbi algorithm to
find the best matches in the process. They have also used hashtags for detecting scope of
sentiment in a tweet. In this paper, the authors have also considered the impact of differ-
ent values of sarcastic modifiers on the meaning of tweets and have observed the change in
polarity of sentiment expressed.
294 14 Sarcasm Detection Algorithms Based on Sentiment Strength
Joshi et al. [15] have discovered thematic structures in a big corpus. The paper attempts
to find out the occurrence of sarcasm dominant topics and difference in the distribution
of sentiment in sarcastic and nonsarcastic content. It also focuses on hyperbolic sarcasm
having positive words with a negative implication according to the context. The proba-
bility of a word falling into sentiment or topic, sentiment distribution in tweets of a par-
ticular label and the distribution of topic over a label are the features taken up by the
researchers. It was an initial but promising work where sarcasm detection was based on
various sarcasm-prevalent topics. They detected topics based on two approaches, first with
only sarcastic tweets and the second one with all kinds of tweets to capture the prevalence
of sarcasm.
14.3 Experiment
This chapter attempts to find sarcasm along with positive and negative sentiments in sentences using the proposed algorithms and evaluates their performance using a confusion matrix. The experiment consists of the following steps: data collection, finding sentiment strengths with SentiStrength, classification, and evaluation.
Algorithm 14.1 Sarc_Detector

def sarc_detector(p_i, n_i):
    # p_i: positive sentiment strength, n_i: negative sentiment strength (from SentiStrength)
    score_i = p_i + n_i
    if score_i <= 0:
        if score_i == 0:
            if p_i != 1 and n_i != -1:
                print("Hyperbole")
            else:
                print("Neutral")
        else:
            print("Sarcasm")
    else:
        print("Non-Sarcasm")
Flowchart of Algorithm 14.1: the score score_i = P_i + N_i is computed and, depending on its sign and components, the sentence is labeled Hyperbole, Neutral, Sarcasm, or Non-Sarcasm.
Algorithm 14.2 Sarc_Detector (extended)

def sarc_detector_extended(p_i, n_i):
    # p_i: positive sentiment strength, n_i: negative sentiment strength (from SentiStrength)
    score_i = p_i + n_i
    if score_i <= 0:
        if score_i == 0:
            if p_i != 1 and n_i != -1:
                print("Hyperbole")
            else:
                print("Neutral")
        else:
            if p_i == 1 and n_i < -1:
                print("Negative")
            else:
                print("Sarcasm")
    else:
        if p_i > 1 and n_i == -1:
            print("Positive")
        else:
            print("Hyperbole*")
Algorithm 14.2 attempts to detect hyperbole, along with positive and negative sentences. However, hyperbole* needs human annotation to be confirmed. The algorithms are discussed in detail in the next subsection.
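For experimentation, Algorithm 14.2 can be wrapped as a function that returns the label; the sketch below does this, and the (P, N) strength pairs are hypothetical stand-ins for the values SentiStrength would return (they mirror the cases discussed in Section 14.3.5).

def classify(p, n):
    """Label a sentence from its positive strength p and negative strength n (Algorithm 14.2)."""
    score = p + n
    if score <= 0:
        if score == 0:
            return "Hyperbole" if (p != 1 and n != -1) else "Neutral"
        return "Negative" if (p == 1 and n < -1) else "Sarcasm"
    return "Positive" if (p > 1 and n == -1) else "Hyperbole*"

# Hypothetical strength pairs; in the experiment these come from SentiStrength.
for p, n in [(1, -1), (2, -2), (2, -1), (1, -2), (3, -4), (3, -2)]:
    print((p, n), "->", classify(p, n))
# (1, -1) -> Neutral, (2, -2) -> Hyperbole, (2, -1) -> Positive,
# (1, -2) -> Negative, (3, -4) -> Sarcasm, (3, -2) -> Hyperbole*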
Flowchart of Algorithm 14.2: the score score_i = P_i + N_i is computed and, depending on its sign and on the component strengths, the sentence is labeled Hyperbole, Neutral, Negative, Positive, Sarcasm, or Hyperbole*.
as hyperbolic in a similar way as above. Further, it looks for nonzero negative scores and checks whether the negative score results from a neutral positive strength. In such a case, the algorithm marks the sentence as purely negative, as its score results only from the negative strength of the sentence.
In the case of a negative score, if the positive component is not neutral but is less than the negative component, then the sentence is likely to be sarcastic. Moreover, if the overall score is positive, then the components are checked for a neutral negative and a nonneutral positive value. In such a case, the sentence is a purely positive sentence. If the positive score of a sentence is the result of a significant positive strength but a nonneutral negative value, then again the sentence is a possible "hyperbole*." This hyperbole needs to be annotated manually to get good results; such cases can be hyperbolic sarcasm as well as impure positive sentences.
The formulae used in the Algorithm 14.1 and 14.2 are as follows:
For positive:
Pi + Ni > 0, (14.1)
where,
Pi ≠ 1 and
Ni = −1.
The sentences with only positive sentiment words and no negative words are classified
as purely positive sentences.
For negative:
Pi + Ni < 0, (14.2)
where,
Pi = 1 and
Ni ≠ −1.
The sentences spotted with high negative strengths and neutral positive strength are cat-
egorized as purely negative sentences.
For sarcasm:
Pi + Ni <= 0, (14.3)
where,
Pi ≠ 1
Ni ≠ −1 and
|Ni| > Pi.
Sarcastic sentences are observed to be usually carrying negative words, i.e., general prac-
tice is to express a positive feeling wrapped up in negative words. So, sentences having both
polarities in them but the negativity supersedes the positivity, that has good chances of
being sarcastic in nature.
For hyperbole:
Pi + Ni = 0, (14.4)
where,
Pi = |Ni| and
Pi, |Ni| ≠ 1.
14.3.5 Classification
The above algorithms were derived from the following observations:
a) The sentences having all positive words and no negative words can be classified as pos-
itive sentences.
b) The sentences having all negative words and no positive words can be classified as neg-
ative sentences.
c) Hyperbolic sentences are known to have both positive and negative sentiments expressed in them in equal measure.
d) General sarcasm has varied proportions of negative and positive sentiment strengths.
Table 14.3 describes the cases of sentiment prediction considered by the algorithms and
Table 14.4 gives a deeper understanding with suitable examples.
14.3.5.1 Explanation
In Table 14.4, which shows example cases for the logic discussed in Table 14.3, the first, second, and third sentences are all neutral as per their scores, but each in a slightly different way. The first sentence does not have any sentiment word contributing to its formation, and hence both its positive and negative scores are neutral (1 and −1), summing up to give a neutral overall score to the sentence (case1). The case is similar for the second sentence, where no sentiment word is involved but the sentence, as per the human annotator, is a sarcastic sentence; hence it is wrongly classified by the algorithm as neutral (case1). Moreover, the third sentence has two sentiment words, "trouble" and "appreciate," with equal negative and positive strengths (i.e., −2 and 2, respectively). So, its overall sentiment score again sums up to be neutral, but the algorithm rightly classifies it as hyperbole (case2).
Table 14.3 Patterns used by the extended Algorithm 14.2 to detect positive, negative,
hyperbole, and sarcasm.
The fourth and fifth sentences are both good examples of case3 sentences. They both contain
positive sentiment words, "best" and "hope," with good positive scores (2 and 3), and they
lack any negative sentiment words, so their overall scores sum up to be positive, rendering
the sentences purely positive.
The sixth and seventh sentences are examples of case4, where there are equal chances of
the sentence being a hyperbolic sarcasm or a positive sentence. The sixth sentence has a
negative token, "sorry," and a positive word, "enjoying," but the weight of the positive word
is greater than that of the negative one (3 and −2), which leaves the score of the sentence
positive. The sentence, however, is an example of hyperbolic sarcasm. The seventh sentence
of Table 14.4 has similar sentiment strengths to the sixth sentence. Due to the presence of a
negative and a positive word in the same sentence, "difficult" and "beautiful," with the
positive sentiment strength greater than the negative one (3 and −2), the sentence qualifies
as a positive sentence, which is indeed right. That is why case4 requires human annotation
to be sure: the case where both sentiment words are present but a higher positive sentiment
strength beats a low negative strength, giving a positive score, can be both a hyperbolic
sarcasm and a positive sentence.
The eighth and ninth sentences are case5 examples. In the eighth sentence, "Marriage is
one of the chief causes of divorce," the word "divorce" has a negative sentiment strength of −2
but, ironically, its opposite word "marriage" does not have a positive score, so the sentence is
rendered negative in the absence of a positive sentiment score. Such anomalies are rare
exceptions, and the algorithm fails to recognize them correctly: the sentence is, as a matter
of fact, sarcastic, but the algorithm classifies it as negative. The ninth sentence
has a lot of negative words, i.e., "dark," "sad," and "broken," and no positive ones, so the
sentence turns out to be neutral on positivity and gets a −4 on negative sentiment strength.
This sentence is correctly classified as a case5 negative sentence.
Last but not least, sentences like the tenth and eleventh ones in Table 14.4 have both
positive and negative sentiment words: "smile" (weight 3) and "sad" (weight −4) in the tenth
sentence, and "talent" (weight 2) and "doesn't" (weight −3) in the eleventh. In both, the
negative weight is greater than the positive one and the overall score of the sentence comes
out negative. Such sentences, containing both positive and negative sentiment weights but
giving an overall negative score, are classified as sarcastic; this is case6 of the
algorithm.
14.3.6 Evaluation
The algorithms proposed by the paper classify the sentences as positive, negative, sarcastic,
and hyperbolic. The classification results then undergo evaluation on the following criteria:
F-Score: It is the accuracy measure of the test, calculated from the values of precision and
recall:
F-Score (F) = 2 × (Prec. × Rec.) / (Prec. + Rec.)
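For instance, per-class values of the kind reported later in Table 14.6 can be computed from raw counts with a few lines of Python. This is a minimal sketch; the false-positive count used in the example call is an assumed number for illustration, not a figure from Table 14.5.

def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-score from the true positives, false positives,
    and false negatives of one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f_score

# Hyperbole: 68 of 100 instances detected (see Table 14.5), with an assumed 20 false alarms.
print(precision_recall_f(tp=68, fp=20, fn=32))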
The results obtained after implementing the extended algorithm are shown in Tables 14.5
and 14.6. Table 14.5 shows the total number of instances given to algorithms for testing and
how many were correctly identified by the algorithm. It also presents the total number of
sentences that were not correctly identified by the extended algorithm. The extended algo-
rithm was able to detect 68 hyperboles out of 100 hyperbolic sentences, whereas it detected
270 out of 400 general sarcastic sentences correctly. For positive and negative sentences, the
extended algorithm was able to detect 440 out of 500 and 350 out of 500 correctly respec-
tively. The classification results were evaluated on the confusion matrix on the criteria of
accuracy, recall, precision, and F-measure. Table 14.6 displays the evaluation results of the
extended algorithm. The table shows the accuracy, precision, recall, and F-measure values
for all four types of sentiments. Accuracy values for all four sentiments (in bold) show the
performance of the extended algorithm. The extended algorithm was 88% accurate in case of
hyperbole, 84% in case of identifying overall sarcasm and gave 87% and 77% accurate results
for positive and negative sentiments respectively. Apart from accuracy values, the algo-
rithm shows 68% and 88% recall for hyperbole and positive sentiments respectively, and 68%
and 75% precision values for sarcasm and negative sentiments. These results are depicted
graphically for analysis purposes (Figures 14.2 and 14.3). In the first graph (Figure 14.2) the
Table 14.5 True positive and true negative values of the classification result.
Sentiment    Total instances    Correctly identified    Not correctly identified
Hyperbole    100                68                      32
Sarcasm      400                270                     130
Positive     500                440                     60
Negative     500                350                     150
Table 14.6 Evaluation results of the classification done by the extended algorithm.
(Figure 14.2: "Classification Result" – chart over Hyperbole, Sarcasm, Positive, and Negative; vertical axis 0–600 instances.)
(Figure 14.3: "Evaluation Results" – chart over Hyperbole, Sarcasm, Positive, and Negative; vertical axis 0–1.)
topmost line shows total number of instances, and the next line is for correctly identified
instances, which is visibly close to the total number of instances. The third line is for the
sentences that were not correctly classified by the algorithm, which is visibly close to the
0 line. This shows that the performance of algorithm is very good since the errors are very
low in count. The next graph (Figure 14.3) shows the evaluation results graphically.
Observation of the graph tells us that the algorithm gave its best results for the positive
sentiment, as all four evaluation criteria lie above the 70% mark. Apart from that, the
accuracy line for all four sentiments tops the chart, which shows that the proposed extended
algorithm was able to give great results at identifying the sentiments accurately.
14.5 Conclusion
Sarcastic sentences are known to behave differently from regular sentiment sentences.
Due to this ambiguous nature of sarcasm, it is complicated to identify. This paper has tried
to adopt a logical yet straightforward approach of catching the scope of sarcasm in a sentence
via its sentiment strength. We calculate the sentiment strength of sentences.
These strengths are in two parts, positive strength and negative strength. The devised
algorithms then calculate these strengths. The algorithms are formulated on the basis of human
observations of positive, negative, sarcasm, and hyperbole. It takes in the positive strength,
negative strength, and their sum to decide in a unique way if the sentence falls into one of
the four sentiment categories or is neutral. This experiment was performed on a data set of
1500 sentences. The results show that this may prove to be a right approach to rule-based
sarcasm detection. This approach saves the time and complexity of data cleaning. The algo-
rithms can be extended further and can be made to detect all types of sarcasm along with
all modes of positive and negative sentiments. Future works may aim toward developing a
system or algorithm that can detect all sentiments altogether.
References
10 Riloff, E., Qadir, A., Surve, P. et al. (2013). Sarcasm as contrast between a positive sen-
timent and negative situation. In: EMNLP, Association of Computational Linguistics, vol.
13, 704–714.
11 Rajadesingan, A., Zafarani, R., and Liu, H. (2015). Sarcasm detection on twitter: a
behavioral modeling approach. In: Proceedings of the Eighth ACM International Confer-
ence on Web Search and Data Mining, 97–106. ACM.
12 Kunneman, F., Liebrecht, C., Van Mulken, M., and Van den Bosch, A. (2015). Signal-
ing sarcasm: from hyperbole to hashtag. Information Processing & Management 51 (4):
500–509.
13 Felbo, B., Mislove, A., Søgaard, A. et al. (2017). Using millions of emoji occurrences to
learn any-domain representations for detecting sentiment, emotion and sarcasm. arXiv
preprint arXiv:1708.00524.
14 Maynard, D. and Greenwood, M.A. (2014). Who cares about sarcastic tweets? Investigat-
ing the impact of sarcasm on sentiment analysis. In: LREC, 4238–4243.
15 Joshi, A., Jain, P., Bhattacharyya, P. et al. (2016). 'Who would have thought of that!': a
hierarchical topic model for extraction of sarcasm-prevalent topics and sarcasm detection.
arXiv preprint arXiv:1611.04326.
16 Kotzias, D., Denil, M., De Freitas, N., and Smyth, P. (2015). From group to individ-
ual labels using deep features. In: Proceedings of the 21th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 597–606. ACM.
15
SNAP: Social Network Analysis Using Predictive Modeling
15.1 Introduction
Predictive analytics uses archive data, machine learning, and artificial intelligence for
prediction. Predictive analytics has two steps: first, archive data is fed into algorithms and
patterns are created, and second, current data is used on same algorithms and patterns for
predictions. Digitization of every resource has aided the predictive analytics. Currently,
hundreds of tools have been invented that can be used to predict probabilities in an
automated way and decrease human labor.
Predictive analytics involves identifying “what is” needed from archive data, then to study
whether the archive data used meets our needs. Then the algorithm is modified to learn
from a data set to predict appropriate outcomes. It is important to use an accurate and
authentic data set and update the algorithm regularly.
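In code terms, the two steps amount to fitting a model on archived data and then applying it to current data. The fragment below is a minimal sketch with scikit-learn; the numbers are synthetic and only illustrate the workflow, not any of the case studies discussed here.

from sklearn.linear_model import LinearRegression

# Step 1: feed archived (historical) data into an algorithm to learn patterns.
history_years = [[2015], [2016], [2017], [2018]]      # synthetic archive
history_values = [120, 135, 150, 170]                 # e.g., past yearly figures (synthetic)
model = LinearRegression().fit(history_years, history_values)

# Step 2: apply the learned patterns to current data to obtain a prediction.
print(model.predict([[2019]]))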
To the best of our knowledge and wisdom, the predictive analytics process can be detailed as
follows [1]:
1) Define project: All the minute details like outcomes, scope, deliverables, objectives, and
data sets to be used are defined.
2) Data collection: It involves data mining with complete customer interaction.
3) Data analysis: Extracting important information via inspection, cleaning, and modeling.
4) Statistics: Validation and testing.
5) Modeling: Accurate predictive models are prepared.
6) Deployment: Models prepared are deployed in everyday decision-making process to get
results and output.
7) Model monitoring: Maintenance of the models.
Some case studies where predictive analytics can play a major role are as follows:
● While deciding cutoffs for college admissions, previous years' cutoffs are studied.
● The biggest benefit of predictive analytics has been in the field of industry and business.
● Losses incurred due to natural calamities like tsunamis and earthquakes can be reduced
by adopting timely predictive analytics practices, analyzing seismic zones, and evacuating
people from those zones.
● Predictive analytics has continuously been used in weather forecasting for many years.
● Predictive analytics can be used in the field of pharmacy to decide the production of
medicine with the arrival of a particular season.
● Similarly, predictive analytics is used in the industry to decide the sales and production
of goods in different regions.
● For prediction of jobs in the IT industry.
● To predict the agricultural produce for the season and requirements of the products such
as pesticides, tools, etc.
● To predict rainfall for the season.
● Stock market.
Some pitfall areas where predictive analytics cannot be applied:
● Unforeseen scenarios or abrupt incidents.
● Successful implementation of government policies for urban and rural citizens.
● Accuracy and authenticity of archived data and the tools used.
● Whether the movie released will be successful or not.
● Sport matches outcome.
Some applications [2] of predictive analytics are customer relationship management,
health care, collection analytics, cross sell, fraud detection, risk management, direct mar-
keting. Predictive analytics can be used in statistical analysis and visualization, predictive
modeling and data mining, decision management and deployment, as well as big data
analytics.
The literature survey of different research papers undertaken as part of the current
research work is summarized below.
recognizing the barrier, exploiting operational data, improving mining techniques, and
increasing marketing effectiveness. With data increasing at a faster pace, it becomes
important to pick a unique combination of competitive strengths and optimize the analytics
dimensions.
In [5], author(s) have discussed what big data is, its history, the emergence of big data, big
data analytic tools, and the issues and challenges of big data. Multiple big data applications
have been presented, and a comparison of data mining techniques has also been considered.
In [6], author(s) have discussed the importance of predictive analytics in industry for
decision-making processes. The technological cycle of predictive analytics is described. Cost
is an important characteristic of the big data epoch, so it becomes a duty to select the
best technology for predictive analytics. Therefore, author(s) have proposed architectural
solutions based on international data mining standards, creating analytic platforms for
industrial information analytics systems. Issues of mobility and computational efficiency
in predictive analytics technology have also been discussed.
In [7], author(s) have described what is predictive analytics, its uses, applications,
and technologies. Then, author(s) have applied predictive analytics at both micro- and
macro-levels of granularity. Articles applied at the “macro” level entitled “Predicting
Elections for Multiple Countries Using Twitter and Polls,” “Financial Crisis Forecasting via
Coupled Market State Analysis,” and “Dynamic Business Network Analysis for Correlated
Stock Price Movement Prediction” have been discussed.
In [8], author(s) have discussed the importance of predictive analytics in the education field
by addressing areas such as enrollment management and curriculum development. Predictive
analytics helps organizations grow, compete, enforce, improve, satisfy, learn, and act.
Goal-directed practices are observed for organization success. Author(s) have presented a
case study at Delhi Technological University, sponsored by All India Council for Technical
Education to determine the effectiveness of predictive analytics in the field of education.
The process used involves the following four steps:
1) Data collection
2) Build the predictive model
3) Model validation
4) Analytics enabled decision making
In the author(s)' opinion, the deployment of cloud computing techniques will facilitate
cloud-based predictive analytics in a cost-effective and efficient manner.
In [9], author(s) have presented a predictive visual analytics system for predicting event
patterns. Future events are predicted by combining contextually similar cases that occurred
in the past. Social media data and news media data have been used for the analysis. Two
cases have been discussed for evaluating the system: the Germanwings crash in the Alps on
March 24, 2015 and a heavy snow storm on the east coast of the United States on January
26, 2015. The steps for predictive visual analytics are: topic extraction, predictive analysis,
and visual analytics.
Predictive modeling can be fully utilized for improvements where the total possible outcomes
are known. So, in [10], author(s) have predicted the outcomes of sports matches. To make the
predictions more accurate, a model based on knowledge discovery in databases has been
used. The model is capable of providing suggestions for improvements in itself to provide
better predictions and reconstruct itself. The model predicts on the basis of various factors,
for example, performance of the individual, performance of the team, atmosphere, type of
match, health problems, etc. This model is best suited for games in groups.
In [11], author(s) have applied big data analytics to agricultural practices to increase profit
by increasing production and helping the farmers, thereby also reducing farmer suicide rates.
The main factors considered in this work are crop yield and land area under cultivation. Five
different data sets were integrated, and a regression technique was used to find the relation
between the two factors stated above.
Since predictive analysis engines for Python programs are really few, in [12], author(s) have
proposed a Python predictive analysis based on observing neighboring calculations. Their aim
is to detect bugs. A novel encoding scheme has been incorporated for the handling of dynamic
features. A prototype was prepared and evaluated on real-world Python programs. The
prototype framework has two steps: the first is to collect execution traces of a passing run,
and the second is to encode the traces and some unexecuted branches into symbolic constraints.
In [13], author(s) have discussed the importance of social network analysis with the help
of a Twitter data set. Such analysis is important for the well-being of society. It becomes
necessary to understand social networking in today's world of technology and to gain useful
knowledge from it. Author(s) have analyzed Twitter user profiles on the basis of some
characteristics. The analyses include clustering, classification, and detection of outliers
among the user profiles. The incorporated approach is related to big data in terms of the
volume of data generated; hence, big data analytics can also be considered. The presented
approach involves two steps: extraction and analysis.
In [14], author(s) have discussed the application of social network analysis in the field of
data fusion. Parallel computing based methodology has been proposed for data extraction.
The methodology will also help in enhancing the fusion quality. Hop count weighted and
path salience approaches were taken into consideration. Results are visualized as a cumu-
lative associated data graph. Results also show sensitivity of betweenness, closeness, and
degree centrality measure of social network.
Big data plays an important role in the field of education, and with data comes predictive
modeling. Analytics is important to deduce results and impacts. In [1], author(s) have
used prediction- and classification-based algorithms on students' data to analyze the
performance of first-year bachelor students in a computer application course. Different
algorithms were used, and a comparative analysis was carried out with the WEKA tool to
find which one was the best. According to the author(s)' results, the multilayer perceptron
algorithm performed the best.
In [2], author(s) have discussed predictive modeling using machine learning techniques
and tools in the field of health care. It is an emerging field since there’s a need for reduction
in price of medicines and other health care–related things, such as, health tests, health care
monitoring devices, surgical instruments, etc. Author(s) have also discussed other machine
learning applications and showcased its importance in the field of health care. Author(s)
have discussed what is machine learning, its techniques, algorithms, and its tools.
In [15], an analysis of big data management in a Windows environment is carried out on a
cloud environment with the help of the Aneka tool. A remote electronic voting machine
application has been analyzed using MongoDB, a big data tool. The paper presents a secured
voting-based application using a login ID and a randomly generated password, which is hence
difficult to hack. Further, the views are encrypted and saved in the database.
Social networks are a wide area of research these days. So, in [16], author(s) have done a
comparative study of two different mathematical models based on mixed integer linear
programming for the analysis of social networks. Their aim is to determine which existing
model best suits the social network analysis criteria. The first model works on minimizing the
maximum of the cluster diameters, while the second model works on minimizing the minimum
distance between objects of the same cluster. As per the author(s)' results, both models are
adequate choices with respect to different parameters.
In [17], author(s) have presented predictive modeling approach on the Bombay stock
exchange, the Dow Jones Industrial Average, the Hang Seng Index, the NIFTY 50, the NAS-
DAQ, and the NIKKEI daily index prices. Equity markets have been analyzed in detail to
check whether they follow pure random walk–based models. Dynamically changing stock
market prices have been studied thoroughly. Predictive modeling uses different machine
learning algorithms for building frameworks to predict future index prices. Some of the
algorithms are: adaptive neuro-fuzzy inference system, support vector regression, dynamic
evolving neuro-fuzzy inference system, Jordan neural network, and random forest. As per
analysis conducted, prices can be forecasted beforehand using effective predictive modeling
algorithms. Pure random walks are not followed. According to the author(s), this research can
provide business markets with huge profits. It can also help in developing different trading
strategies.
Predictive analytics is used for better decision making and thereby better results. Due to
their complexity and dynamic nature, these predictive analytics techniques cannot readily be
used on cloud computing. Therefore, in [18], author(s) have devised methods for predictive
analytics on cloud systems based on quality of service parameters. Along with the models,
some simulation tools have been developed for graph transformation systems, models for
self-adaptive systems, and adaptation mechanisms.
In [19], author(s) have proposed an error recovery solution for a micro electrode dot array
using real-time sensing and droplet operations. These methodologies work on predictive
analytics and adaptive droplet aliquot operations. Analytics has been used for determining
minimum droplet aliquot size that can be obtained on a micro electrode dot array biochip. A
predictive model has been proposed describing completion rate of mixing leading to mixed
error recovery control flow.
In [20], author(s) have used a Hadoop platform for processing k nearest neighbors and
classification. The advantage of using Hadoop is that it is scalable, easy to use, and robust to
node failures. The aim of the analysis was to find fast classification processing techniques
for enhancing classification accuracy, utilizing logistic regression and k nearest neighbors.
Further, the analysis enhances the accuracy, true positive, false positive, precision, recall,
sensitivity, and specificity rates of classification. The prediction-based classification
process involves the following four steps:
1. Selecting appropriate data set.
2. Temporal distribution of data set.
In Section 15.2, various applications have been discussed in the field of predictive modeling
and social network analysis. Table 15.1 summarizes the application name and the area of
application.
The data set [14] used is a real-time data set from Cambridge University. During a three-day
conference, nine people were given devices (nodes) and their movements were recorded for all
three days, that is, when these nodes came in contact with each other, the duration of contact,
etc. The data set describes the contacts that were recorded by all devices distributed during
this experiment.
A few random rows of the data set are given below.
Source node  Destination node  Initial time of contact  End time of contact  No. of times of contact  Time between two consecutive meetings
1 8 121 121 1 0
1 3 236 347 1 0
1 4 236 347 1 0
1 5 121 464 1 0
1 8 585 585 2 464
(Table 15.1 – columns: paper reference number, application domain, application.)
15.4.1 A Few Analyses Made on the Data Set Are Given Below
15.4.1.1 Duration of Each Contact Was Found
Source node  Destination node  Initial time of contact  End time of contact  No. of times of contact  Time between two consecutive meetings  Duration of each contact
1 8 121 121 1 0 0
1 3 236 347 1 0 111
1 4 236 347 1 0 111
1 5 121 464 1 0 343
1 8 585 585 2 464 0
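A minimal sketch of this computation with pandas is given below. The column names and the file name are assumptions made for illustration, since the data set itself provides only unlabeled columns.

import pandas as pd

# Assumed column layout, following the table above.
cols = ["source", "destination", "t_start", "t_end",
        "n_contacts", "gap_to_previous"]
df = pd.read_csv("contacts.dat", sep=r"\s+", names=cols)   # hypothetical file name

# Duration of each contact = end time of contact - initial time of contact (seconds).
df["duration"] = df["t_end"] - df["t_start"]
print(df.head())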
15.4.1.2 Total Number of Contacts of Source Node with Destination Node Was Found
for All Nodes
For example, the analysis for node 1 is given below. The total number of times node 1 came in
contact with each node was found.
Source node  Destination node  Number of contacts
1 2 19
1 3 8
1 4 17
1 5 16
1 6 2
1 7 27
1 8 24
1 9 17
According to this analysis, the most active node was node 2 and the most passive node was
node 5.
15.4.1.3 Total Duration of Contact of Source Node with Each Node Was Found
For example, the duration of contact of the first node with each other node is given below.
Source node  Destination node  Duration of contact (s)
1 2 1241
1 3 19708
1 4 38963
1 5 6556
1 6 7
1 7 2921
1 8 1176
1 9 1212
According to this analysis, the most active node was node 3 and the least active was node 5.
Only those nodes have been considered that had interaction in less than 50% of the total
number of contacts; that is, they might have come in contact with each other many times,
but the number of times they actually interacted is less than half of the total number of
contacts. For example, suppose node A contacted node B 10 times but interacted with B only
4 times; the remaining 6 times they only came close to each other without any interaction.
Such node pairs were considered for this analysis.
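A sketch of this filtering is shown below. It continues the hypothetical DataFrame from Section 15.4.1.1 and assumes an additional per-contact interaction flag column, which is not part of the six columns listed earlier.

# Aggregate per (source, destination) pair: number of contacts (NOC),
# total duration of contact (DOC), and total number of interactions.
pairs = (df.groupby(["source", "destination"])
           .agg(NOC=("duration", "size"),
                DOC=("duration", "sum"),
                interactions=("interaction", "sum")))   # 'interaction' column is assumed

# Keep only pairs that interacted in fewer than half of their contacts.
mobility_pattern = pairs[pairs["interactions"] < 0.5 * pairs["NOC"]]
print(mobility_pattern.reset_index())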
The analysis is given below.
Mobility pattern
Source node  Destination node  NOC  DOC  Total no. of interactions
1 2 19 1241 6
1 6 2 7 1
1 7 27 2921 127
1 8 24 1176 7
1 9 17 1212 4
2 1 32 4803 13
2 5 11 595 3
2 8 10 1597 4
3 5 3 0 0
3 8 2 579 1
5 2 2 0 0
5 3 1 589 1
5 8 16 2269 7
5 9 2 4 1
6 1 3 0 0
7 1 26 2185 10
7 4 16 55761 7
7 5 3 4 1
7 6 13 2728 5
7 8 1 0 0
7 9 8 6364 3
8 3 13 844 5
8 5 23 1210 6
8 7 7 1277 3
9 1 3 122 1
9 5 1 0 0
9 8 1 0 0
15.4.1.5 Unidirectional Contact, That Is, Only One Node Contacts the Second Node but
Not Vice Versa
Such nodes are:
Node pair  Duration of contact in one direction  Duration in the opposite direction
3–5   0     589
1–6   0     7
9–5   0     4
2–5   595   0
7–8   0     1277
8–9   4434  0
15.4.1.6 Graphical Representation of the Duration of Contacts with Each Node Is
Given Below
Duration of contact of node 1 with each node:
Node 4: 54.28%, node 3: 27.45%, node 5: 9.13%, node 7: 4.07%, node 2: 1.73%, node 9: 1.69%, node 8: 1.64%, node 6: 0.01%.
(Pie charts: the corresponding percentage shares of total contact duration for each of the other source nodes with every other node.)
15.4.1.7 Rank and Percentile for Number of Contacts with Each Node
Rank defines the relative position of an element in a list of objects [24]. Percentile defines
the percentage of observations in the group that fall below the given value [25–29]. Rank and
percentile have been calculated using the built-in data analysis feature of Microsoft Excel.
In each of the listings below, the first table gives the source node, the destination node, and
the number of contacts; the second table is the Excel rank-and-percentile output, whose first
column is the point (row) index of the destination node in the first table, followed by the
number of contacts, the rank, and the percentile.
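The same calculation can be reproduced outside Excel. The sketch below uses pandas on the node 1 counts from the listing that follows; the percentile is computed Excel-style as the share of values strictly below the given value, so the last decimal may differ slightly from the rounding shown in the listings.

import pandas as pd

# Number of contacts of node 1 with each other node (values from the listing below).
contacts = pd.Series({2: 19, 3: 8, 4: 17, 5: 16, 6: 2, 7: 27, 8: 24, 9: 17})

rank = contacts.rank(method="min", ascending=False).astype(int)   # 1 = most contacts
n = len(contacts)
percentile = contacts.apply(lambda v: (contacts < v).sum() / (n - 1) * 100)

table = pd.DataFrame({"contacts": contacts, "rank": rank,
                      "percentile": percentile.round(1)}).sort_values("rank")
print(table)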
Rank and percentile for number of contacts of node 1 with each node:
1 2 19
1 3 8
1 4 17
1 5 16
1 6 2
1 7 27
1 8 24
1 9 17
6 27 1 100.00
7 24 2 85.70
1 19 3 71.40
3 17 4 42.80
8 17 4 42.80
4 16 6 28.50
2 8 7 14.20
5 2 8 0.00
Rank and percentile for number of contacts of node 2 with each node:
2 1 32
2 3 59
2 4 24
2 5 11
2 6 20
2 7 51
2 8 10
2 9 34
2 59 1 100.00
6 51 2 85.70
8 34 3 71.40
1 32 4 57.10
3 24 5 42.80
5 20 6 28.50
4 11 7 14.20
7 10 8 0.00
Rank and percentile for number of contacts of node 3 with each node:
3 1 6
3 2 43
3 4 29
3 5 3
3 6 7
3 7 22
3 8 2
3 9 66
8 66 1 100.00
2 43 2 85.70
3 29 3 71.40
6 22 4 57.10
5 7 5 42.80
1 6 6 28.50
4 3 7 14.20
7 2 8 0.00
Rank and percentile for number of contacts of node 4 with each node:
4 1 36
4 2 16
4 3 19
4 5 18
4 6 28
4 7 16
4 8 34
4 9 10
1 36 1 100.00
7 34 2 85.70
5 28 3 71.40
3 19 4 57.10
4 18 5 42.80
2 16 6 14.20
6 16 6 14.20
8 10 8 0.00
Rank and percentile for number of contacts of node 5 with each node:
5 1 13
5 2 2
5 3 1
5 4 11
5 6 7
5 7 17
5 8 16
5 9 2
6 17 1 100.00
7 16 2 85.70
1 13 3 71.40
4 11 4 57.10
5 7 5 42.80
2 2 6 14.20
8 2 6 14.20
3 1 8 0.00
Rank and percentile for number of contacts of node 6 with each node:
6 1 3
6 2 12
6 3 8
6 4 31
6 5 6
6 7 17
6 8 11
6 9 9
4 31 1 100.00
6 17 2 85.70
2 12 3 71.40
7 11 4 57.10
8 9 5 42.80
3 8 6 28.50
5 6 7 14.20
1 3 8 0.00
Rank and percentile for number of contacts of node 7 with each node:
7 1 26
7 2 36
7 3 45
7 4 16
7 5 4
7 6 13
7 8 1
7 9 8
3 45 1 100.00
2 36 2 85.70
1 26 3 71.40
4 16 4 57.10
6 13 5 42.80
8 8 6 28.50
5 4 7 14.20
7 1 8 0.00
Rank and percentile for number of contacts of node 8 with each node:
8 1 24
8 2 6
8 3 13
8 4 35
8 5 23
8 6 39
8 7 7
8 9 24
6 39 1 100.00
4 35 2 85.70
1 24 3 57.10
8 24 3 57.10
5 23 5 42.80
3 13 6 28.50
7 7 7 14.20
2 6 8 0.00
Rank and percentile for number of contacts of node 9 with each node:
9 1 3
9 2 34
9 3 71
9 4 25
9 5 1
9 6 8
9 7 9
9 8 1
3 71 1 100.00
2 34 2 85.70
4 25 3 71.40
7 9 4 57.10
6 8 5 42.80
1 3 6 28.50
5 1 7 0.00
8 1 7 0.00
15.4.1.8 The Data Set Covers Three Days, with Time Measured in Seconds; Some of the
Day-Wise Analyses Conducted on the Data Set Are Given Below
i. Total Duration of Contact and Total Number of Contacts for Node 2 on Day 1 with all the
other nodes.
Source node  Destination node  Total duration of contact  Total number of contacts
2 1 358 6
2 3 115 3
2 4 237 3
2 5 7 2
2 6 241 4
2 7 1209 13
2 8 1438 3
2 9 0 0
Similar analyses can be conducted for other nodes for all three days.
ii. Total Duration of Contacts and Total Number of Contacts for three days individually for
Node 1.
Destination node  Duration day 1  Duration day 2  Duration day 3
2 21 1220 0
3 462 19246 0
4 18596 19775 592
5 6556 0 0
6 7 0 0
7 2182 739 0
8 1176 0 0
9 0 111 1101
Destination node  Contacts day 1  Contacts day 2  Contacts day 3
2 8 11 0
3 2 6 0
4 8 8 1
5 16 0 0
6 1 1 0
7 15 12 0
8 20 0 4
9 0 4 12
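Day-wise aggregates like the two tables above can be obtained by bucketing the start times into days. The fragment below is a sketch that continues the hypothetical DataFrame from Section 15.4.1.1 and assumes the trace starts at time 0, so that each 86,400-second block corresponds to one of the three days.

# Assign each contact to a day of the three-day trace (time is in seconds).
df["day"] = df["t_start"] // 86_400 + 1

node1 = df[df["source"] == 1]
daily = (node1.groupby(["destination", "day"])
              .agg(total_duration=("duration", "sum"),
                   total_contacts=("duration", "size"))
              .unstack(fill_value=0))
print(daily)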
(Bar chart: duration of contact for each day for node 1 with the other nodes, corresponding to the table above; vertical axis 0–45,000 seconds.)
(Bar chart: total number of contacts each day for node 1 with the other nodes, corresponding to the table above; vertical axis 0–30.)
Source node  Destination node  Total duration of contact  Total number of contacts
1 2 1241 19
1 3 19708 8
1 4 38963 17
1 5 6556 16
1 6 7 2
1 7 2921 27
1 8 1176 24
1 9 1212 16
References
6 Dorogov, A.Y. (2015). Technologies of predictive analytics for big data. In: 2015 XVIII
International Conference on Soft Computing and Measurements (SCM). IEEE.
7 Brown, D.E., Abbasi, A., and Lau, R.Y.K. (2015). Predictive analytics: predictive model-
ing at the micro level. IEEE Intelligent Systems 30 (3): 6–8.
8 Rajni, J. and Malaya, D.B. (2015). Predictive analytics in a higher education context. IT
Professional 17 (4): 24–33.
9 Yeon, H. and Jang, Y. (2015). Predictive visual analytics using topic composition. In:
Proceedings of the 8th International Symposium on Visual Information Communication
and Interaction. ACM.
10 Grover, P. and Johari, R. (2016). PAID: predictive agriculture analysis of data integration
in India. In: 2016 3rd International Conference on Computing for Sustainable Global
Development (INDIACom). IEEE.
11 Zhao, B. and Chen, L. (2016). Prediction model of sports results based on knowledge
discovery in database. In: 2016 International Conference on Smart Grid and Electrical
Automation (ICSGEA). IEEE.
12 Xu, Z. et al. (2016). Python predictive analysis for bug detection. In: Proceedings of
the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software
Engineering. ACM.
13 Iglesias, J.A. et al. (2016). Social network analysis: Evolving Twitter mining. In: 2016
IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE.
14 Farasat, A. et al. (2016). Social network analysis with data fusion. IEEE Transactions on
Computational Social Systems 3 (2): 88–99.
15 Varshney, K., Johari, R., and Ujjwal, R.L. (2017). BiDuR: big data analysis in UAF using
R-voting system. In: 2017 6th International Conference on Reliability, Infocom Technologies
and Optimization (Trends and Future Directions) (ICRITO). IEEE.
16 Pirim, H. (2017). Mathematical programming for social network analysis. In: 2017 IEEE
International Conference on Big Data (Big Data). IEEE.
17 Ghosh, I., Sanyal, M.K., and Jana, R.K. (2017). Fractal inspection and machine
learning-based predictive Modelling framework for financial markets. Arabian Journal
for Science and Engineering: 1–15.
18 De Oliveira, P.A. (2017). Predictive analysis of cloud systems. In: 2017 IEEE/ACM 39th
International Conference on Software Engineering Companion (ICSE-C). IEEE.
19 Zhong, Z., Li, Z., and Chakrabarty, K. (2017). Adaptive error recovery in MEDA biochips
based on droplet-aliquot operations and predictive analysis. In: 2017 IEEE/ACM
International Conference on Computer-Aided Design (ICCAD). IEEE.
20 Raj, J.A.S., Fernando, J.L., and Raj, Y.S. (2017). Predictive analytics on political data.
In: 2017 World Congress on Computing and Communication Technologies (WCCCT). IEEE.
21 Seth, S. and Johari, R. (2018). Evolution of data mining techniques: a case study using
MongoDB. In: Proceedings of the 12th INDIACom; 2018 5th International Conference on
Computing for Sustainable Global Development, 14–16 March 2018. IEEE.
22 Seth, S. and Johari, R. (2018). Statistical survey of data mining techniques: a walk
through approach using MongoDB. In: International Conference on Innovative Computing
and Communication, 5–6 May 2018.
23 Sun, Z. et al. (2018). Energy evaluation and prediction system based on data mining.
In: 2018 33rd Youth Academic Annual Conference of Chinese Association of Automation
(YAC). IEEE.
24 Garg, P. et al. (2018). Trending pattern analysis of twitter using spark streaming. In:
International Conference on Application of Computing and Communication Technologies.
Singapore: Springer.
25 https://crawdad.org
26 https://en.wikipedia.org/wiki/Predictive_analytics
27 https://www.predictiveanalyticstoday.com/what-is-predictive-analytics/#content-anchor
28 https://en.wikipedia.org/wiki/Rank
29 https://en.wikipedia.org/wiki/Percentile
16
Intelligent Data Analysis for Medical Applications
16.1 Introduction
Most developing countries are faced with increased patient mortality from several diseases
primarily as a result of the inadequacy of medical specialists. This insufficiency is impossi-
ble to overcome in a brief span of time. However, institutions of higher learning are capable
of taking immediate action by producing as many doctors as possible. Many lives will be
impacted, however, while waiting for students to complete the journey of becoming a doctor and then a specialist.
Currently, for proper diagnosis and treatment, it is required that patients consult special-
ists. Certain medical practitioners do not possess enough experience or expertise to handle
certain high-risk diseases. Nonetheless, the diseases may increase in severity by the time
patients get access to specialists, which may take a few days, weeks, or even months. A
patient may suffer for the rest of their life if the high-risk disease cannot be addressed at an
early enough stage.
With the use of computer technology, mortality and the wait time to see a specialist can be
effectively reduced. Doctors can be assisted in making decisions without consulting
specialists by developing specialized software that emulates human intelligence. Software
cannot replace a specialist, yet it can aid doctors and specialists in the process of predicting
a patient's condition and diagnosis from certain rules/experiences. Patients can then be
shortlisted for further treatment on the basis of high-risk factors or symptoms, or if they are
predicted to be highly affected by certain diseases or illnesses [1–3]. Applying intelligent
data analysis (IDA) techniques in medical applications can reduce time, cost, the demand on
human expertise, and medical error.
A program was developed to make clinical decisions that help health professionals; it is
called a medical decision support system (MDSS). The process involves dealing with medical
data and knowledge domains in order to diagnose a patient's condition and recommend a
suitable treatment for that particular patient. Another system, developed to assist with
monitoring, managing, and interpreting a patient's medical history, as well as providing aid
to the patient and the medical practitioner, is the patient-centric health information
system. This system would improve decision making in medical science, increase patient
compliance, and reduce iatrogenic diseases and medical errors.
IDA research in medicine is, just like any research, targeted at direct or indirect
enhancements in the provision of health care. Techniques and methods can only be tested
through test cases that represent real-world problems. Functional IDA proposals for medicine
are accompanied by conditions that specify the range of real-life functions addressed by such
propositions; an in-depth interpretation of the resulting system thus constitutes a crucial feature.
The use of IDA systems in the clinical environment is another consideration of IDA. Such a
system is used as a well-informed assistant that gathers data and comprehends it in order to
assist physicians in performing their tasks with greater efficiency and effectiveness. The
physician can diagnose the disease in the stipulated time and perform appropriate actions if
they get the correct information at the right time. The information revolution has made
viable the collection and storage of large volumes of data from various origins on information
media. This data can be single-case (single patient) or multiple-case based (multiple
patients). Utilization of raw data in problem solving is infeasible because it is noisy or
insufficient. However, if important information is picked out by computationally intelligent
means, the data can be converted into a form that can be mined. After mining, relevant and
operational knowledge/information, extracted at the correct level, is ready to use and to aid
the physician in diagnosing a patient. The rapidly emerging globality of information and data
raises the following critical issues:
(i) The provision of standards in terminology, vocabularies, and formats to support mul-
tilinguality and sharing of data.
(ii) Standards for interfaces between different sources of data.
(iii) Integration of heterogeneous types of data, including images and signals.
(iv) Standards for the abstraction and visualization of data.
(v) Reusability of data, knowledge, and tools.
(vi) Standards for electronic patient records.
A panel discussion on artificial intelligence (AI) in medicine (AIME 97) identified the
above-cited issues. The exercise of determining guidelines of any variety is a convoluted
duty. Nonetheless, some guidelines are essential to enable intercommunication, and thus
integration, between different origins of data. Correct patient management is the most
important aim of health care, and it is directly related to the proper utilization of a very
invaluable resource, namely clinical data. Hence investment in the development of appropriate
IDA methods, techniques, and tools that can be used for the analysis of clinical data is
thoroughly justified. A major focus needs to be given to this research by the relevant
communities and groups.
The 1990s witnessed an increase in studies of intelligent systems aimed at the effective
utilization of such systems for current requirements. Various studies combined two or more
techniques, and the functioning of the systems was evaluated to ensure their performance.
An intelligent referral system for primary child health care was established with the primary
objective of reducing mortality in children, specifically in rural areas. The system succeeded
in addressing common pediatric complaints after a thorough consideration of crucial risk
factors, such as immunization, weight monitoring, nutrition, and developmental milestones.
The intelligent referral system for primary child health
care used an expert system for the purpose of collecting a patient's history data. Various
expert systems have been built, viz. HERMES, an expert system for the diagnosis of chronic
liver diseases; SETH, an expert system for managing the treatment of acute drug poisoning;
PROVANES, a hybrid expert system for critical patients in anesthesiology; and ISS, for the
diagnosis of sexually transmitted diseases [11].
models, IDA must frame the relations and principles to provide assistance in making a
decision in complex situations. It should be clear that IDA is neither data mining nor
knowledge discovery in databases (KDD), although it may share some data mining methods and
parts of its process with KDD. The union of background knowledge (BK) and data analysis is
the main objective of IDA for providing the user with information.
Preparation of data and its mining: the problem that arises in the preparation of data and its
mining is the lack of resources, or rather, the lack of true resources. The major sources of
data preparation and data mining are:
a. Medical records taken for research purposes
b. Textbooks
c. Insurance claims
d. Hospital’s/clinic’s personal records
a. The medical records taken for research purposes: Such data is secondary rather than primary
data, i.e., it depends on the source of data collection, namely humans. Such data could be
age-specific, gender-specific, lifestyle-specific, and diet-specific, thus depending on physical
and chemical properties and aspects, and so it cannot be trusted solely. A very common aspect
to consider is exposure to radiation: most human beings and animals are now exposed to
radiation, and each individual's exposure level depends upon (1) residential and working
locations relative to the radiation source, (2) past diseases, as some medications could have
built up immunity to certain radiation exposure, and (3) diet, physical health, and
exercise/yoga. Thus, for research purposes, such data could prove to be an effective data set
but cannot be fully trusted.
b. Textbooks: Theoretical information may or may not prove useful in practical applications;
for example, not all fevers are the same, and not all heart surgeries are the same (some
hospitals require the legal approval of the patient's family before operating on serious cases).
The information could be from old sources and, since most people in today's world are exposed
to radiation, some changes exist in them as a result of which certain medical operations
cannot be performed; yet such people are a very crucial source of information for future
purposes and deeper research in the medical field.
c. Insurance claims: Insurance claims can be falsely made for profit and business purposes;
many cases have been filed as a result of false medical claims. Data abstracted from such
sources could lead to false information, which would not only decrease the efficiency of the
IDA-based model but would also increase the input of financial and time-based resources, and
could otherwise lead to the patient's death or some serious complication.
d. Hospital’s/clinic’s personal records: Hospital’s/clinic’s personal records are some of the
most vital sources for data preparation and data mining, but most of the conversations
that take place in such environments are casual, i.e., they take place in personal chats
like text SMS, etc., or are face-to-face talks, thus no record exists of such crucial data.
Sometimes it’s important to ensure certain information doesn’t come out on paper in
order to prevent complexities for a medical professional’s personal career and as well
as the hospital’s reputation. One such example might be that information like a sudden
critical abnormality arising in the patient’s body, such as internal bleeding, swelling,
etc., and the doctor’s immediate response to it, which comes from years of practice. No
16.1 Introduction 337
records of such experienced and immediate thoughts that come to a doctor’s mind could
ever be made, as they are like solutions to a problem that come to one’s mind under
pressure only.
One thing common to all of the above is that data that is true would surely be present in all
of the records; hence IDA researchers look for the information that is common to all.
One major problem, besides all of the above, is that a model, once prepared, would eventually
fall behind, i.e., become outdated within months, because medicine is the most dynamic field:
situations vary from country to country and human to human, and research at the micro and
macro levels in the medical field grows exponentially each day. Old treatment patterns could
suddenly become current again with certain changes, and the latest treatments could fail
through falling out of common use because of changes in the chemical composition of medicines
and new machinery.
AI in the medical field was primarily related to the development of medical expert systems in
the late seventies and eighties, whose primary aim was to aid diagnostic decision making in
specialized medical domains [6]. MYCIN by Shortliffe was a keystone research effort in this
field, and it was accompanied by various attempts that led to specific diagnostic expert
systems, for example PUFF, PIP, VM, CASNET, HODGKINS, HEADMED, CENTAUR, ONCOCIN, MDX, ABEL,
GALEN, and various others. The systems made to support diagnosis in internal medicine are the
most generalized and comprehensive: INTERNIST-1, which supports expert-defined rules, and its
successor, CADUCEUS, which along with INTERNIST-1 constitutes a structure of pathophysiological
states reflecting wide expertise about the issues [7]. At this early stage, expert system
research dealt with major obstacles such as knowledge acquisition, as well as reasoning,
representation, and explanation.
From the early days of knowledge-based systems, and particularly of expert systems, rules were
the prime formalism for expressing knowledge in a symbolic way. The advantages of rules are
uniformity, simplicity, transparency, and inferential ease, which have made them the most
commonly adopted representation of real-world information. Rules expressed at a level of
abstraction are derived from domain experts; they are correct from the expert's perspective
and are certainly understandable to the expert, since these formulations are their rules of
thumb. However, rules defined by humans carry a considerable risk of being laced with the
individual prejudices of an expert, even though each rule may appear to form a clear, modular
chunk of information. The knowledge base as a whole can be replete with gaps, inconsistencies,
and other defects, primarily because of its flat organization (the absence of a global,
hierarchical, and comprehensive organization of the rules).
Furthermore, the expert would provide essential knowledge required for focusing and guiding
the learning. Irrespective of whether these rules are received directly from experts or
learned, the layout has to be intuitive, acceptably expressive, and simple in order to be
applied in a particular application.
indirectly by improving services and managing adversities in the health care sector (record
storage and data processing).
One of the most important reasons, which could even be interpreted as the sole purpose, for
integrating data analysis with health care is to increase the quality of care, save patients'
lives, and reduce the cost of the health facilities offered.
details to create the final model [8]. Abstraction, in general, is the process of reducing the
characteristics of a database to a set of essential characteristics.
The most efficient way to understand data abstraction is to think of the way the word
"abstract" is used when one talks about long documents. The abstract is the shortened and
simplified form; we often read it to get a glimpse of the facts before reading the entire
document.
The three formal abstraction layers we usually use are:
● User model: how the user describes the database; it is very important to ensure fixed
writing patterns in it.
● Logical model: a much more formal and more detailed database description, often rendered as
an entity relationship (ER) model.
● Physical model: further addition of details such as indexing, data types, etc.
4. Classification: Classification manages things that already have well-defined labels, unlike
cluster detection. It relies on training data – information used to train the model so that it
can classify new items using some algorithm. For instance, the differences between the content
found in spam and in legitimate messages are identified by spam filters; this is achievable by
identifying large sets of e-mails as spam and then training the model on them.
5. Regression: This technique makes predictions from the connections that exist in a database
or data set. For instance, the future engagements of a particular user on Facebook can be
predicted from the user's history – likes, photo tags, comments, hashtags, frequently
contacted friends and similar data, communications with other users, friend requests received
and sent, and other activities on the site. Another instance is the use of data on the
relationship between the education level and income of a family to anticipate the best choice
of neighborhood to move to. Thus, regression allows the relations within a data set/database
to be studied in order to predict future behavior or outcomes.
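As a toy illustration of the difference between the two techniques, the sketch below uses scikit-learn on synthetic numbers; none of the data comes from a medical or social data set, it merely mirrors the spam and income examples above.

from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: labelled training data (spam = 1, legitimate = 0).
messages = [[0.9, 12], [0.1, 3], [0.8, 9], [0.2, 2]]    # synthetic message features
labels = [1, 0, 1, 0]
spam_filter = LogisticRegression().fit(messages, labels)
print(spam_filter.predict([[0.7, 10]]))                  # label for a new message

# Regression: predict a numeric outcome from relationships in the data.
education_years = [[10], [12], [16], [18]]               # synthetic
income = [30, 38, 55, 70]                                 # synthetic, in thousands
model = LinearRegression().fit(education_years, income)
print(model.predict([[14]]))                              # predicted income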
(Diagram: stages of the decision-making process – problem recognition, problem definition, alternative generation, alternative analysis, model development, choice, and implementation.)
approaches are present while AI continues to be one of various methodological areas from
which excellent and essential ideas are incorporated.
1. Clinical decision support systems (CDSS): A computerized application that supplements and
assists clinicians in informed decision making by giving them evidence-based information from
patients' data. A CDSS involves three constituents: a language system, an information system,
and a problem-processing system [14, 19].
2. Intelligent decision support system (IDSS): A system that assists decision making while
demonstrating intelligent behavior, which might involve reasoning as well as learning.
Knowledge-based systems, rule-based expert systems, or neural network systems are applied to
achieve such learning and reasoning.
3. Artificial intelligence (AI): AI refers to the art of arming computers with intelligence at
par with humans. It can be attained by a combination of software and hardware systems
that can execute jobs that are rule-based and involve decision making [15].
4. Artificial neural network (ANN): A mathematical model reflecting the functional aspects and
structure of a biological neural network. It mimics, in a plain way, similarly as the
Figure 16.2 Intelligent system for decision support/expert analysis in layout design (labelled components include a knowledge acquisition module, a preference inferencing agent (PIA), a user interface, benchmark layouts, and high-ranking layouts).
● Possible hospital regulations may prevent the extensive use of automated diagnosis.
● In enough cases, the ML algorithms are surpassed by classical medical methodologies
(e.g., the biopsy) or human expertise.
16.5 Conclusion
Recent advancements of information systems in modern hospitals and health care institutions
have resulted in a surge in the volume of medical data collected. This mandates that relevant
systems be put in place that can be used to obtain the relevant characteristics of medical
data and to address them. IDA, which is in its nascent research stages, was developed to meet
this requirement and primarily intends to reduce the gap between data gathering and data
interpretation. The idea of a knowledge-driven data abstraction method was conceived in the
mid-eighties; however, active research in this field took place less than 10 years ago
(thereby representing one of the early attempts) and was triggered by ongoing research on
temporal reasoning in the medical field (thus the interest focused on temporal data
abstractions).
To date, the two technologies have progressed exclusive of each other, even though both have
the common aim of analyzing patient data in an intelligent manner, but for separate purposes:
machine learning for extracting knowledge, and data abstraction for formulating a high degree
of useful information on a single patient. As suggested above, the use of data abstraction in
the context of machine learning is an idea worth exploring. The technology of data abstraction
needs to become more mature. The main focus so far has been the derivation of temporal trends.
This work needs to be continued but also made applicable to different forms of abstraction,
such as periodic abstractions. The huge volume of raw data involved requires computational
efficiency, and operating such systems in real time is a major concern.
The final stage of experimental design involves a meticulous choice of patients, their
characteristics, and the various hypotheses to be confirmed. The introduction of data
warehouses changes this selective approach to data collection, and all the data gathered is
devoid of any pointed purpose. Still, medical data collected and collated in warehouses is a
very important resource for the probable unearthing of new knowledge. IDA researchers ought to
provide methods for both ends of this range, as the evidence of the quality of such tools will
be their applicability in medical institutions and their acceptability by experts of the
medical field.
References
1 Yoshida, H., Jain, A., Ichalkaranje, A. et al. (eds.) (2007). Advanced Computational Intel-
ligence Paradigms in Healthcare, vol. 1. Heidelberg: Springer.
2 Vaidya, S., Jain, L.C., and Yoshida, H. (eds.) (2008). Advanced Computational Intelligence
Paradigms in Healthcare, vol. 2. Heidelberg: Springer.
3 Sardo, M., Vaidya, S., and Jain, L.C. (eds.) (2008). Advanced Computational Intelligence
Paradigms in Healthcare, vol. 3. Heidelberg: Springer.
4 Bellazzi, R. and Zupan, B. Intelligent data analysis in medicine and pharmacology: a
position statement. 10.11.1.26.166.
5 Brahman, S. and Jain, L.C. (eds.) Advanced Computational Intelligence Paradigms in
Healthcare 5, SCI, vol. 326, 3–. Springer.
6 Patel, V.L., Shortliffe, E.H., Stefanelli, M. et al. (2009). The coming of age of artificial
intelligence in medicine. Artificial Intelligence in Medicine 46 (1): 5–17.
7 Brahman, S. and Jain, L.C. Intelligent decision support systems in healthcare (book chapter).
8 Moskovitch, R. and Shahar, Y. Temporal data mining based on temporal abstractions. Medical
Informatics Research Center, Department of Information Systems Engineering, Ben Gurion
University, Beer Sheva, Israel.
9 Khare, V.R. and Chougule, R. (2012). Decision support for improved service effective-
ness using domain aware text mining. Knowledge-Based Systems 33: 29–40.
10 Khademolqorani, S. and Hamadani, A.Z. (2013). An adjusted decision support sys-
tem through data mining and multiple criteria decision making. Procedia – Social and
Behavioral Sciences 73: 388–395.
11 Keravnou, E., Garbay, C., Baud, R., and Wyatt, J. (eds.) (1997). Artificial Intelligence in
Medicine: 6th Conference on Artificial Intelligence in Medicine Europe, AIME '97, Grenoble,
France, March 23–26, 1997, Proceedings.
12 Foster, D., McGregor, C., and El-Masri, S. (2005). A survey of agent-based intelligent
decision support systems to support clinical management and research.
13 Viademonte, S. and Burstein, F. (2006). From knowledge discovery to computational
intelligence: a framework for intelligent decision support systems, chapter 4. In: Intel-
ligent Decision-Making Support Sytems (eds. J.N.D. Gupta, G.A. Forgionne and M.T.
Mora), 57–78. Springer-Verlag London Limited.
14 Basu, R., Fevrier-Thomas, U., and Sartipi, K. (2011). Incorporating Hybrid CDSS in
Primary Care Practice Management. McMaster eBusiness Research Centre.
15 Kukar, M., Kononenko, I., and Silvester, T. (1996). Machine learning in prognosis of the
femoralneck fracture recovery. Artificial Intelligence in Medicine 8: 431–451.
16 Lavrǎc, N. and Mozetı̌c, I. (1992). Second generation knowledge acquisition methods
and their application to medicine. In: Deep Models for Medical Knowledge en- Gineering
(ed. E.T. Ker-avnou), 177–199. Elsevier.
17 Kononenko, I., Bratko, I., and Kukar, M. Application of machine learning to medical
diagnosis. In: Machine Learing and Data Mining: Methods and Applications (eds. R.S.
Michalski, I. Bratko and M. Kubat). Willey in press.
18 Larizza, C., Bellazzi, R., and Riva, A. (1997). Temporal abstractions for diabetic patient’s
management. Proc. AIME-97. In: Lecture Notes in Artificial Intelligence, vol. 1211,
319–330. Springer.
19 El-Sappagh, S.H. and El-Masri, S. (2014). A distributed clinical decision support system
architecture. Journal of King Saud University-Computer and Information Sciences 26 (1):
69–78.
17
Bruxism Detection Using Single-Channel C4-A1 on Human Sleep S2 Stage Recording
17.1 Introduction
Sleep bruxism is an oral parafunctional disorder caused by grinding or clenching of the teeth
during sleep, i.e., it is related to excessive sleep-related arousal activity. The tension and
strain cause problems in the muscles and other structures near the jawbone, earaches,
headaches, damage to the teeth, and disorders of the jaw joints. These indications as a whole
are typically labeled as temporomandibular joint (TMJ) pain. In certain persons, bruxism can be
recurrent [1], as a result of headaches, injured teeth, and other problems. Since one can
experience significant sleep bruxism and yet be unaware of it until problems mature [2], it is
important to recognize the signs and indications of bruxism in order to seek dental care.
Bruxism during daylight or wakefulness is usually a semi-voluntary "clenching" movement of the
jaw (Figure 17.1). During the night, bruxism episodes are accompanied by loud involuntary
grinding. This research refers solely to bruxism during sleep. Sleep bruxism is defined in the
international classification of sleep disorders as a movement disorder characterized by
grinding or clenching of the teeth during sleep [3, 4].
The main symptoms of bruxism are increased tooth sensitivity, a dull headache, indentations on
the tongue, flattened or cracked teeth, damage from chewing on the inside of the cheek,
soreness, exposure of deeper layers of the tooth, tired jaw muscles, etc. The main causes of
bruxism are irregular alignment of the upper and lower teeth, sleep problems, anxiety,
aggression, anger, stress, Parkinson's disease, etc. Sleep bruxism is mostly found in
youngsters.
Figure 17.1 Differences between the teeth of a bruxism patient and normal human teeth (effects of teeth grinding versus healthy teeth: abfraction, gum recession, attrition).
Sleep bruxism is a disturbance that is very harmful to the teeth and gives rise to parallel
conditions such as pain in the jaw. Only 60 seconds of data are used in this work. The work is
very helpful for rapid prognosis and reduces the burden of headache, jaw pain, brain stroke,
etc., because all of the above can arise as side effects of sleep bruxism; it therefore helps
prevent new diseases from developing as side effects of sleep bruxism. The cost and time of
prognosis using our new method are much lower. In this work, a sleep bruxism patient's
prognosis is obtained from only one channel, recorded in the S2 sleep stage. This channel gives
a very good and accurate result, so the method is very accurate compared to other prognostic
systems.
i) Economy
Sleep deprivation costs the British economy about 40 billion euros in one year.
j) Mortality
Poor sleep increases mortality risk.
● Obstructive Sleep Apnea: In obstructive sleep apnea, breathing during sleep is repeatedly dis-
turbed by a blockage of airflow. About 5% of men and 3% of women report obstructive
sleep apnea [28–31]. At present, only about 10% of obstructive sleep apnea cases are diag-
nosed.
● Central Sleep Apnea: In central sleep apnea, the brain's breathing control centers are
imbalanced during sleep. Additionally, the effects of sleep alone can eliminate
the brain's directive for the body to breathe [32–34].
● Complex Sleep Apnea: Complex sleep apnea is a combination of central sleep apnea
and obstructive sleep apnea [35, 36]. The reported prevalence of this apnea ranges from 0.56 to
18%. It is generally detected during the treatment of obstructive sleep apnea with continu-
ous positive airway pressure (CPAP) [37].
d) Narcolepsy
Narcolepsy is a sleep and neurological disorder produced by the brain's inability to con-
trol sleep-wake cycles. The core features of narcolepsy are cataplexy and fatigue. The
syndrome is also frequently associated with sudden sleep attacks. To appreciate the fun-
damentals of narcolepsy, it is essential to first understand the structure of normal sleep.
Sleep happens in succession: we initially enter light sleep stages and then move into gradually
deeper stages. A deep sleep stage is called non-rapid eye movement (NREM) sleep. Narcolepsy
affects both genders similarly, usually first develops in youth, and may remain unrecognized as
patients slowly grow older. Familial association with narcolepsy is rather rare, but a mixture
of inherited and environmental issues may be the source of this sleep syndrome [47–51].
e) Bruxism
Bruxism is a type of sleep disorder in which a person involuntarily grinds and clenches the teeth.
The symptoms of bruxism are fractured, flattened, and chipped teeth. The main causes
of bruxism are disturbances in sleep and asymmetrical arrangement of the teeth.
The risk factors for bruxism include side effects of psychiatric medicines, drinking alcohol,
and smoking [52].
f) Rapid Eye Movement Behavioral Disorder (RBD)
Dreaming is purely a "mental" activity that is experienced in the mind while the
body is at rest. However, people who suffer from RBD start acting out their dreams. They
physically move their limbs or even get up and engage in activities associated with wak-
ing. Persons with RBD lack the normal muscle paralysis of REM sleep, which permits them to act
out their dreams in a dramatic and violent manner while they are in the REM sleep stage. Some-
times they talk, twitch, and jerk during dreaming for years before they fully act out their REM
dreams. The main symptoms of RBD are leaping, punching, kicking, etc. People with
neurodegenerative disorders, such as Parkinson's disease and multiple system atrophy, are also
at higher risk of suffering from RBD [53, 54].
● NREM-1: This is the lightest sleep stage, in which the patient is easily awakened. The eyes
close and open slowly, so the motion of the eyes is gradually slow. The person may recall
bits of graphic images when awakened from a dream. The brain transitions from alpha waves
(8–13 Hz) to theta waves (4–7 Hz). Five percent to 10% of the total sleep is completed in
this stage.
● NREM-2: The eyes are fully closed and brainwaves slow down. Theta waves are observed, and
the sleeper enters a stage from which awakening is slower. In this stage, 45–50% of the
total sleep occurs. Sleep spindles range from 12–14 Hz.
● NREM-3: Eye movement stops and the brain slows down further in this stage.
● NREM-4: The brain produces delta waves. In this stage the human goes all the way into deep
sleep. All the organs, such as the brain, muscles, etc., are in a relaxed mode. This accounts
for 15–25% of the sleep completed by the human [8].
● NREM-5: The eyes are closed but sleep is disrupted in this stage. Only 1% of the total sleep
is completed in this stage.
According to this system, we have three different landmarks that help in electrode placement.
These are the inion, the nasion, and the preauricular points anterior to the ears. A prominent
bump is present at the back of the head; it is the lowest point of the skull at the back of
the head and is known as the inion. The nasion is defined as the point that lies between the
forehead and the nose.
● Mark the distance that is 20% of the nasion-to-inion distance from FP1 toward F3. At the
intersection point there will be the true F3 mark. Similarly, mark 20% of the nasion-to-inion
distance from FP2 toward F4; at the intersection point there is the true F4 mark.
● Now obtain the preliminary C3 mark by measuring half of the distance from Fp1 to O1;
similarly, obtain the preliminary C4 mark by measuring from Fp2 to O2.
● Find the intersection point of the first and second marks. This intersection point is the
true C3. Similarly, mark half of the distance Fp2–O2 and find the intersection point of
the first and second marks. This intersection point is the true C4.
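As a small illustrative sketch (not from the chapter), the fractional measurements above can be expressed directly in code; the head measurements below are hypothetical example values in centimetres, and only the 20% and midpoint fractions come from the text.

```python
# Hypothetical head measurements in cm; only the fractions (20%, 50%) follow the text.
nasion_to_inion = 36.0   # assumed arc length from nasion to inion over the vertex
fp1_to_o1 = 34.0         # assumed left parasagittal arc from Fp1 to O1

f3_offset = 0.20 * nasion_to_inion    # 20% of the nasion-inion distance, from FP1 toward F3
c3_preliminary = 0.50 * fp1_to_o1     # midpoint of the Fp1-O1 arc gives the preliminary C3 mark

print(f"Mark F3 {f3_offset:.1f} cm from FP1 along the FP1-F3 direction")
print(f"Mark preliminary C3 {c3_preliminary:.1f} cm from Fp1 along the Fp1-O1 arc")
```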
A related paper examined the association between sleep bruxism, verbal bullying at school, and
life satisfaction among Brazilian adolescents. The high school years are an age of changes and
conflicts. In previous research, bruxism is treated as a health issue in which the lower and
upper teeth are crushed together. Deregibus et al. [69] studied the repeatability of
identifying sleep bruxism events using combined surface electromyography (EMG) and heart rate
signals measured with the Bruxoff device; there were ten subjects, five males and five females,
with an average age of 30.2 years. Castroflorio et al. [70] compared the detection of bruxism
by the combination of heart rate and EMG using a compact portable device. They used eleven
healthy and fourteen bruxism subjects in the work and found that the compact device gives good
accuracy. Additionally, researchers have used portable EMG and electrocardiogram (ECG)
recordings for the detection of bruxism; the accuracy of such systems is 62.2% [71]. Pawel S.
Kostka and Ewaryst J. Tkacz [72] used twelve patients and the first ten hours of sleep
recording. They found that multi-source data analysis with sympathovagal estimation is helpful
in the early detection of bruxism disorder. Nantawachara Jirakittayakorn and Yodchanan
Wongsawat [73] designed a bruxism detection system based on the masseter muscle using an EMG
instrument. Our research team has diagnosed bruxism using physiological signals such as ECG,
EEG, and EMG. We used around three hundred samples from bruxism and healthy subjects, and two
sleep stages, REM and W (wake). We applied a low pass filter, the Hamming window technique, and
the power spectral density method to obtain the normalized power. The specificities of this
research for the different signals were 92.09%, 94.48%, 77.07%, and 77.16% [7, 52, 74].
(Flowchart of the proposed method: Start → Data Collection → Normalized Power of the Signal → Comparisons of Normal and Bruxism Normalized Power → Stop.)
The sleep recordings were provided by the PhysioNet website (http://physionet.org), which is used in this research work [77–79].
(Filter frequency response sketch: amplitude versus frequency, with the passband marked between FL and FH.)
We analyzed the sleep recording of subjects such as bruxism and normal. The recorded
channels of the bruxism data in the S2 sleep stage are C3-P3, C4-P4, C4-A1, DX1-DX2,
EMG1-EMG2, ECG1-ECG2, FP1-F3, FP2-F4, F3-C3, F4-C4, F7-T3, F8-T4, P3-O1, P4-O2,
ROC-LOC, SX1-SX2, T3-T5, and T4-T6 (Figure 17.4). Additionally, the normal data in the
S2 sleep stage are ADDOME, C3-P3, C4-P4, C4-A1, Dx1-DX2, EMG1-EMG2, ECG1-ECG2,
F2-F4, F4-C4, F1-F3, F3-C3, HR, LOC-ROC, P4-O2, P3-O1, Posizione, ROC-LOC, SX1-SX2,
SpO2, TERMISTORE, and TORACE (Figure 17.5). The common channels of both bruxism
and normal are C4-A1, C4-P4, C3-P3, Dx1-DX2, EMG1-EMG2, ECG1-ECG2, F3-C3, F4-C4,
P4-O2, P3-O1, ROC-LOC, and SX1-SX2. We extracted the single C4-A1 channel of the
EEG signal for the S2 sleep stage. The extracted channel for both subjects, bruxism
and normal, is shown in Figures 17.6 and 17.7.
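As a minimal sketch (not the authors' code) of this extraction step, the C4-A1 channel can be read from one of the PhysioNet recordings with the MNE library, assuming the record has been downloaded as an EDF file and contains a channel literally labeled "C4-A1"; the file name and the S2 start time are illustrative assumptions.

```python
import mne

# Assumed file name of a downloaded PhysioNet sleep recording (EDF format).
raw = mne.io.read_raw_edf("brux1.edf", preload=True, verbose=False)
print(raw.ch_names)                                   # inspect the recorded channel labels

c4a1 = raw.copy().pick_channels(["C4-A1"])            # keep only the common C4-A1 EEG channel
fs = c4a1.info["sfreq"]                               # sampling frequency in Hz

# Extract the 60-second epoch analyzed in this chapter; the S2 start time (in seconds)
# would come from the hypnogram annotations and is an assumed value here.
s2_start = 1200.0
data = c4a1.get_data(start=int(s2_start * fs), stop=int((s2_start + 60) * fs))
signal = data[0]                                      # 1-D array of EEG samples
```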
The C4-A1 channel of the EEG signal in the S2 sleep stage was filtered using a low pass
filter. The cutoff frequency of the filter is 25 Hz. We applied this filtering to both
subjects, bruxism and normal, as shown in Figures 17.8 and 17.9. After filtering the channel,
we applied the Hanning window technique to both the bruxism and normal data
(Figures 17.10 and 17.11). The estimated power spectral densities are shown
in Figures 17.12 and 17.13. We used the Welch technique for the estimation of the power.
Figure 17.4 Loading of the bruxism data for the EEG signal; all 18 channels recorded in the S2 sleep stage are present (amplitude in μV versus time in seconds).
Figure 17.5 Loading of the normal data for the EEG signal in the S2 sleep stage; all 21 recorded channels are present.
The Welch method divides the signal time series into segments. We applied this method to the
bruxism and normal data to find the difference between the two subjects for the detection
system.
The filtered signal is passed through the Hanning window for both the bruxism data and the
normal data of the S2 sleep stage. The Hanning window introduces negligible noise, which helps
the accuracy of the system (Figures 17.10 and 17.11).
The PSD is estimated by the Welch method, which estimates the signal power at different
frequencies by splitting the time series into segments; it is applied to the windowed signal of
both the bruxism data and normal data of the S2 sleep stage (Figures 17.12 and 17.13).
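A minimal SciPy sketch of the processing chain just described is given below; the filter order and the Welch segment length are illustrative choices (the chapter states only the 25 Hz cutoff), and `signal` and `fs` are the channel samples and sampling rate from the loading sketch above.

```python
import numpy as np
from scipy.signal import butter, filtfilt, welch, get_window

def lowpass(x, fs, cutoff=25.0, order=4):
    """Zero-phase Butterworth low-pass filter with the chapter's 25 Hz cutoff."""
    b, a = butter(order, cutoff / (fs / 2.0), btype="low")
    return filtfilt(b, a, x)

filtered = lowpass(signal, fs)

# Hanning window applied over the whole 60 s epoch, as in Figures 17.10 and 17.11.
windowed = filtered * get_window("hann", filtered.size)

# Welch PSD: the time series is split into overlapping segments, each segment is
# windowed, its periodogram is computed, and the periodograms are averaged.
freqs, psd = welch(filtered, fs=fs, window="hann",
                   nperseg=int(4 * fs), noverlap=int(2 * fs))
```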
Figure 17.6 Extracted single-channel C4-A1 of the bruxism subject for the S2 sleep stage.
Figure 17.7 Extracted single-channel C4-A1 of the normal subject for the S2 sleep stage.
17.8 Result
We computed the normalized value of the power spectral density. The normalized power specifies
the fraction of a specific EEG activity out of the total power. We obtained the normalized
power of the bruxism and normal data for the S2 sleep stage. The EEG signal contains several
waves, so we computed the normalized value for the theta, beta, and alpha waves.
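A minimal sketch of this normalized-power computation is shown below; the band limits are the conventional EEG ranges (theta 4–7 Hz, alpha 8–13 Hz, beta 13–30 Hz) and are assumptions, since the chapter does not list its exact band edges. `freqs` and `psd` come from the Welch sketch above.

```python
import numpy as np

# Conventional band limits in Hz (assumed; the chapter does not state its exact edges).
bands = {"theta": (4.0, 7.0), "alpha": (8.0, 13.0), "beta": (13.0, 30.0)}

def normalized_band_power(freqs, psd, lo, hi):
    """Fraction of the total spectral power that lies inside the [lo, hi] band."""
    total = np.trapz(psd, freqs)
    mask = (freqs >= lo) & (freqs <= hi)
    return np.trapz(psd[mask], freqs[mask]) / total

normalized = {name: normalized_band_power(freqs, psd, lo, hi)
              for name, (lo, hi) in bands.items()}
print(normalized)   # fraction of total power carried by each wave
```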
In Table 17.1, the normalized power of the theta wave for the C4-A1 channel is between
0.27037 and 0.35722 for the bruxism data and between 0.2774 and 0.27919 for the normal data.
The difference is 0.08685 for bruxism and 0.00179 for normal. The differences of the normalized
power for the C4-A1 channel of the theta wave in the S2 sleep stage are greater for bruxism
than for normal.
Figure 17.8 Filtered C4-A1 channel of the S2 sleep stage for bruxism; we used a low pass filter.
Figure 17.9 Filtered C4-A1 channel of the S2 sleep stage for the normal subject; we used a low pass filter.
In Table 17.2, the normalized power of the beta wave for the C4-A1 channel is between
0.0020038 and 0.035637 for the bruxism data and is 0.00087–0.00087 for the normal data.
The difference is 0.0336332 for bruxism and 0.000244 for normal. The differences of the
normalized power for the C4-A1 channel of the beta wave in the S2 sleep stage are greater for
bruxism than for normal.
In Table 17.3, the normalized power of the alpha wave for the C4-A1 channel is between
0.10453 and 0.22804 for the bruxism data and between 0.06829 and 0.077265 for the normal data.
The difference is 0.12351 for bruxism and 0.008975 for normal. The differences of the
normalized power for the C4-A1 channel of the alpha wave in the S2 sleep stage are greater for
bruxism than for normal.
Figure 17.10 Sampled C4-A1 channel of the S2 sleep stage for bruxism using the Hanning window. The Hanning window has negligible noise, so it helps the accuracy of the system.
Figure 17.11 Sampled C4-A1 channel of the S2 sleep stage for the normal subject using the Hanning window. The Hanning window has negligible noise, so it helps the accuracy of the system.
Figure 17.14 presents the normalized values of both subjects for the single-channel C4-A1 of
the S2 sleep stage. We observed that the differences in the normalized power for bruxism are
greater than for normal for all waves of the EEG signal. The normalized power of the beta wave
for both bruxism and normal is smaller than that of the theta and alpha waves. Additionally,
the difference for the alpha wave is higher than for the other waves, theta and beta.
Figure 17.12 Estimation of the power spectral density using the Welch method on the bruxism data for the S2 sleep stage (frequency axis in Hz).
Figure 17.13 Estimation of the power spectral density using the Welch method on the normal data for the S2 sleep stage.
17.9 Conclusions
Sleep is a physical phenomenon of human beings, and all living organisms must get good sleep,
because disturbed sleep generates many diseases, of which bruxism is one. We designed a
detection system for bruxism disorder using the single-channel C4-A1 recording of the S2 sleep
stage. We found that the differences in the normalized power of the C4-A1 channel of the EEG
signal for the normal human are smaller than those for the bruxism patient for all waves, such
as theta, beta, and alpha.
Table 17.1 The comparative analysis between bruxism and a normal human for the C4-A1
channel of the theta wave of the EEG signals in the S2 sleep stage.
Table 17.2 The comparative analysis between bruxism and normal human for the C4-A1 channel
of the beta wave of the EEG signals in the S2 sleep stage.
Normalized powers of the beta wave of the EEG signals for the C4-A1 channel: bruxism patient 0.035637, 0.0020038; normal human 0.00087, 0.000244.
Differences of the same subject's normalized powers: bruxism patient 0.0336332; normal human 0.000244.
Remarks: high normalized power of the bruxism patient; low normalized power of the normal human.
Table 17.3 The comparative analysis between bruxism and normal human for the C4-A1 channel
of the alpha wave of the EEG signals in the S2 sleep stage.
We also conclude that the difference for the alpha wave is higher than for the beta and theta
waves in both subjects, bruxism and normal. A future prospect of this research is the detection
of bruxism using artificial intelligence techniques.
Figure 17.14 Graphical representation of the normalized values of the single-channel C4-A1 in the S2 sleep stage for both subjects, bruxism and normal (y-axis: differences of the normalized powers; x-axis: theta, beta, and alpha EEG waves). The differences in the normalized power for bruxism are greater than for normal. The normalized power of the beta wave for bruxism and normal is smaller than for the theta and alpha waves.
Acknowledgments
We would like to thank Prof. Behra, Prof. Naseem, Prof. Siddiqui, Prof. Shailendra, Prof.
Hasin Alam, Prof. Chandel, Prof. Mohd Ahmad, Dr. Dennis, Dr. Ijaz Gul, Dr. Bilal Faheem
Chaudhary, Dr. Wazir Ali, and Dr. Deepika Reddy for the useful discussion, proofreading,
and motivation of the research chapter. The National Natural Science Foundation of China
under grant 61771100 supports this work. The authors do not have any competing interests.
Abbreviations
NREM Nonrapid eye movement
EEG Electroencephalogram
EMG Electromyogram
EOG Electrooculography
ECG Electrocardiogram
IMF Intrinsic mode function
PSD Power spectral density
PSG Polysomnography
REM Rapid eye movement
SGRM Sparse group representation model
TMJ Temporomandibular joint
UN United Nations
US United States
References
45 M. B. Bin Heyat, F. Akhtar, M. Ammar, B. Hayat, and S. Azad, “Power Spectral Density
are used in the Investigation of insomnia neurological disorder,” in XL- Pre Congress
Symposium, 2016, pp. 45–50.
46 H. Bb, F. Akhtar, A. Mehdi, S. Azad, S. Azad, and S. Azad, “Normalized Power are used
in the Diagnosis of Insomnia Medical Sleep Syndrome through EMG1-EMG2 Channel,”
Austin J. Sleep Disord., vol. 4, no. 1, pp. 2–4, 2017.
47 R. Pelayo, “Narcolepsy,” in Encyclopedia of the Neurological Sciences, 2014.
48 B. R. Kornum et al., “Narcolepsy,” Nature Reviews Disease Primers. 2017.
49 T. E. Scammell, “Narcolepsy,” New England Journal of Medicine. 2015.
50 R. Pelayo, “Narcolepsy,” in Encyclopedia of the Neurological Sciences, 2014.
51 T. Rahman, O. Farook, B. Bin Heyat, and M. M. Siddiqui, “An Overview of Narcolepsy,”
Int. Adv. Res. J. Sci. Eng. Technol., vol. 3, no. 3, pp. 2393–2395, 2016.
52 D. Lai, M. B. Bin Heyat, F. I. Khan, and Y. Zhang, “Prognosis of Sleep Bruxism Using
Power Spectral Density Approach Applied on EEG Signal of Both EMG1-EMG2 and
ECG1-ECG2 Channels,” IEEE Access, vol. 7, pp. 82553–82562, 2019.
53 S. Stevens and C. L. Comella, “Rapid eye movement sleep behavior disorder,” in Parkin-
son’s Disease and Nonmotor Dysfunction: Second Edition, 2013.
54 J. F. Gagnon, R. B. Postuma, S. Mazza, J. Doyon, and J. Montplaisir, "Rapid-eye-movement
sleep behaviour disorder and neurodegenerative diseases," Lancet Neurology, 2006.
55 R. K. Malhotra and A. Y. Avidan, “Sleep Stages and Scoring Technique,” in Atlas of
Sleep Medicine, 2014.
56 B. Bin Heyat, Y. M. Hasan, and M. M. Siddiqui, “EEG signals and wireless transfer of
EEG Signals,” Int. J. Adv. Res. Comput. Commun. Eng., vol. 4, no. 12, pp. 10–12, 2015.
57 M. Bin Heyat and M. M. Siddiqui, “Recording of EEG, ECG, EMG Signal,” vol. 5, no.
10, pp. 813–815, 2015.
58 T. Kirschstein and R. Köhling, “What is the source of the EEG?,” Clin. EEG Neurosci.,
2009.
59 F. L. Da Silva, “EEG: Origin and measurement,” in EEG - fMRI: Physiological Basis,
Technique, and Applications, 2010.
60 M. Toscani, T. Marzi, S. Righi, M. P. Viggiano, and S. Baldassi, “Alpha waves: A neural
signature of visual suppression,” Exp. Brain Res., 2010.
61 D. L. T. Anderson and A. E. Gill, “Beta dispersion of inertial waves,” J. Geophys. Res.
Ocean., 1979.
62 R. E. Dustman, R. S. Boswell, and P. B. Porter, "Beta brain waves as an index of alert-
ness," Science, 1962.
63 J. R. Hughes, “Gamma, fast, and ultrafast waves of the brain: Their relationships with
epilepsy and behavior,” Epilepsy and Behavior. 2008.
64 F. Amzica and M. Steriade, “Electrophysiological correlates of sleep delta waves,” Elec-
troencephalogr. Clin. Neurophysiol., 1998.
65 D. L. Schacter, “EEG theta waves and psychological phenomena: A review and analy-
sis,” Biol. Psychol., 1977.
66 U. Herwig, P. Satrapi, and C. Schönfeldt-Lecuona, “Using the International 10-20 EEG
System for Positioning of Transcranial Magnetic Stimulation,” Brain Topogr., 2003.
67 A. Morley, L. Hill, and A. G. Kaditis, “10-20 System EEG Placement,” Eur. Respir. Soc.,
2016.
68 Trans Cranial Technologies Ltd., “10 / 20 System Positioning Manual,” Technol. Trans
Cranial, 2012.
69 A. Deregibus, T. Castroflorio, A. Bargellini, and C. Debernardi, “Reliability of a portable
device for the detection of sleep bruxism,” Clin. Oral Investig., 2014.
70 T. Castroflorio, A. Deregibus, A. Bargellini, C. Debernardi, and D. Manfredini, “Detec-
tion of sleep bruxism: Comparison between an electromyographic and electrocardio-
graphic portable holter and polysomnography,” J. Oral Rehabil., 2014.
71 T. Castroflorio, A. Bargellini, G. Rossini, G. Cugliari, A. Deregibus, and D. Manfredini,
“Agreement between clinical and portable EMG/ECG diagnosis of sleep bruxism,” J. Oral
Rehabil., 2015.
72 P. S. Kostka and E. J. Tkacz, “Multi-sources data analysis with sympatho-vagal balance
estimation toward early bruxism episodes detection,” in Proceedings of the Annual Inter-
national Conference of the IEEE Engineering in Medicine and Biology Society, EMBS,
2015.
73 N. Jirakittayakorn and Y. Wongsawat, “An EMG instrument designed for bruxism detec-
tion on masseter muscle,” in BMEiCON 2014 - 7th Biomedical Engineering International
Conference, 2015.
74 M. B. B. Heyat, D. Lai, F. Akhtar, M. A. B. Hayat et al., "Short Time Frequency Analysis of
Theta Activity for the Diagnosis of Bruxism on EEG Sleep," in Advanced Computational
Intelligence Techniques for Virtual Reality in Healthcare, Studies in Computational Intelli-
gence, D. Gupta, A. Hassanien et al., Eds. Springer, 2020, pp. 63–83.
75 A. R. Hassan and M. I. H. Bhuiyan, “An automated method for sleep staging from EEG
signals using normal inverse Gaussian parameters and adaptive boosting,” Neurocomput-
ing, 2017.
76 A. R. Hassan, “Computer-aided obstructive sleep apnea detection using normal inverse
Gaussian parameters and adaptive boosting,” Biomed. Signal Process. Control, 2016.
77 A. L. Goldberger et al., “Components of a New Research Resource for Complex Physio-
logic Signals,” Circulation, 2000.
78 A. L. Goldberger et al., “PhysioBank, PhysioToolkit, and PhysioNet,” Circulation, 2000.
79 A. L. Goldberger et al., “PhysioBank, PhysioToolkit, and PhysioNet: components of a
new research resource for complex physiologic signals.,” Circulation, 2000.
80 J. Barros and R. I. Diego, “On the use of the Hanning window for harmonic analysis in
the standard framework,” IEEE Trans. Power Deliv., 2006.
81 Y. Li, X. L. Hu, J. Wang, X. H. Liu, and H. T. Li, "FFT algorithm sampled with
Hanning window for roll eccentricity analysis and compensation," J. Iron Steel Res.,
2007.
82 P. D. Welch, "The use of fast Fourier transform for the estimation of power spectra: A
method based on time averaging over short, modified periodograms," IEEE Trans. Audio
Electroacoust., vol. 15, no. 2, pp. 70–73, 1967.
83 H. R. Gupta and R. Mehra, “Power Spectrum Estimation using Welch Method for
various Window Techniques,” Int. J. Sci. Res. Eng. Technol., 2013.
18
Handwriting Analysis for Early Detection of Alzheimer's Disease
18.1 Introduction and Background
In Alzheimer's disease (AD), neurons in the brain become damaged or die, resulting in loss of
cognitive abilities, so day-to-day activities can become difficult to perform. Alzheimer's
patients develop other symptoms of neural degeneration such as impaired reasoning skills,
impaired decision making, aphasia, apraxia, impaired visuospatial abilities, and changes in
behavior and overall personality.
There is currently no drug that can reverse the symptoms of this disease. However, there
are medications called acetylcholinesterase inhibitors that delay the progression of AD,
especially in the early to moderate stages [3]. Therefore, early diagnosis can be very helpful
not only for the estimated 27 million patients worldwide but also for their caregivers, by pro-
viding cost-effective treatments [1]. But it is a difficult task to recognize AD in its early stages
because by the time patients seek medical opinion, they already show signs of memory loss
and cognitive impairment. It is important to distinguish between healthy aging and the
onset of Alzheimer’s. The state of mild cognitive impairment (MCI) is clinically treated as
an “in-between” stage. This is due to the fact that the conversion rate from MCI to AD is about
10–30%, whereas the rate is only 1–2% for patients not diagnosed with MCI [4]. MCI does not
result in cognitive and functional impairments as severe as those displayed in AD, but patients
with MCI have a significant chance of progressing to AD.
The exact causes of AD are not known but there are certain risk factors. Age is one risk fac-
tor. People older than 65 years are most likely to develop AD. Any immediate family member
suffering from Alzheimer’s can be a risk. Experts have also identified some genes associated
with Alzheimer’s. It is important to mention that having one or more of these risk factors
does not imply one is suffering from AD. Scientists believe that the development of abnormal
structures called amyloid plaques and neurofibrillary tangles plays an important role. Amyloid
clumps are protein fragments that can damage the brain's nerve cells. These insoluble, dense
protein buildups are found in the hippocampus of the AD patient's brain. This region is
responsible for memory management. It helps convert short-term memory to long-term memory (the
hippocampus also supports spatial navigation, which is consistent with the tendency of AD
patients to become disoriented and lost). Neurofibrillary tangles are also insoluble
substances, which can clog the brain. Microtubules in the brain can be compared to a trans-
port system, which helps the neurons deliver nutrients and information to other cells. The
tau proteins, which are responsible for stabilization of these microtubules, are damaged by
these tangles. It causes the tau proteins to be chemically altered, resulting in the eventual
collapse of the system. Memory loss is believed to be a direct result of this.
Symptoms of AD: Alzheimer’s patients display all or a combination of these symptoms
frequently and in an ongoing manner, which worsen with each increasing stage of the
disease:
1. Memory loss
2. Tasks previously classified as easy and frequently performed get harder and harder to
complete
3. Decision making and problem solving are affected
4. Problem in reading, speaking, and/or writing
5. Confusion regarding date, time, and place
6. Behavior and personality changes
7. Uncharacteristic and sudden shyness
Even though it’s more prevalent in the senior population, early onset of AD is also possible
in people who are in their forties and fifties. The symptoms of early onset AD include trouble
care and often display significant confusion and a sense of being lost. The sleeping pattern
might change and trouble may occur regarding control of bladder and bowel movements.
Severely affected patients require round the clock care. They lose awareness and cannot
respond or communicate properly.
Directly diagnosing AD is difficult because as of now, the only way to confirm is to look
at brain tissue after death. Therefore, a series of tests to determine the mental and cogni-
tive abilities and rule out other probable causes of the conditions exhibited by the patients
are conducted. The first step is examining family history for genetic links. Several mental,
physical, neurological, and imaging tests are conducted. MRIs are used to detect bleeding,
inflammation, and structural issues. CT scans are used to detect any abnormality through
x-ray imaging. Plaque buildup is detected using positron emission tomography (PET) scans.
Blood tests are used to detect genes that might increase the risk factor.
Maintaining a healthy lifestyle by exercising regularly, avoiding substance or alcohol
abuse, eating a vegan diet, and maintaining an active social life is considered the best way to
prevent the onset of AD.
Writing and drawing are complex activities that require cognitive, psychomotor, and bio-
physical processes; they rely on a network of cognitive, kinesthetic, and motor abilities
[5]. Therefore, handwriting can give us a clue about the writer's personality, behavior, and
neurophysiologic functioning. Good hand–eye coordination, spatial cognitive ability, and
dexterity are also important. As a result, people suffering from NGD, which affects these
functions of the patient's body, have difficulty writing. Significant change in handwriting is
one of the noticeable early features of AD. The cerebral cortex, basal ganglia, and cerebellum
are required to carry out the complex task of writing [6].
Handwriting analysis has been associated with AD since the first patient was examined
by Alois Alzheimer in 1907. The omission of certain syllables and repetition of others was
observed. Agraphia was reported among AD patients in the early stages [7]. Lexicosemantic
errors were observed, which transformed into phonological errors as the disease progressed and
the dementia became more severe. Recent studies have used dynamic collection and analysis of
handwriting, such as [8], in which kinematic measures of the handwriting of persons with MCI
were compared with those of persons with mild AD and healthy controls to study differences in
their handwriting for different written tasks. Goal-oriented tasks showed significant slowdown
of motor movements. Handwritten signatures can also be used for an early diagnosis of NGD [9].
A signature not only represents the first and last names of the person but also provides
significant information regarding their writing system and their mental and emotional state.
Hence it is used as a biometric identifier for important verifications.
The sigma-lognormal model for the signature representation was used for a cost-effective
analysis in terms of AD. Movement time and smoothness were the parameters in the study.
AD and MCI patients demonstrated slower, shakier, less coordinated, and inconsistent writ-
ing, compared to their healthy counterparts.
A study [10] conducted for comparing copying tasks between healthy people, MCI
patients and AD patients was able to classify 69–72% of the participants correctly, even
though the performance in the MCI group was poor.
Thus the inability to communicate via writing is seen as an early manifestation of AD.
Graphology can therefore be used as a diagnostic tool.
1) Personal Relationships: Handwriting analysis can provide deep insights into the per-
sonal, social, and professional nature and judge compatibility.
2) Educational: In the field of education for the counseling of students
3) Personnel Selection: In the selection of personnel, handwriting analysis is an invaluable
tool for helping to choose the most suitable person for the job.
4) Police Profiling
A variational auto encoder (VAE) is a neural network that can encode images into a vector in
the latent space of z real numbers. This encoded vector is treated as a random sample drawn
from a z-dimensional normal distribution. The decoder network then decodes the encoded vector
representation and obtains the original image. Random samples can be drawn from the
z-dimensional normal distribution and fed into the decoder network, from which new images can
be obtained that are absent from the data set we trained on.
The architecture of the encoder is represented below (Figure 18.1):
The pivotal facet of the VAE lies in the fact that its latent representation, y ∈ ℝ^K, is drawn
from a particular Gaussian distribution (i.e., p(y) = N(y | μ, Σ), where μ denotes the mean and
Σ denotes the covariance). The VAE is trained in an unsupervised manner, as the output is
basically a reconstructed version of the input. Let the given set of training data be
S = {sn}, n = 1, …, N. The VAE comprises a probabilistic encoder qθ(y|s), which finds the
latent representation y for a given input s, and a probabilistic decoder pϕ(s|y), which
reconstructs the input data from the given latent representation, where θ denotes the network
parameters of the encoder and ϕ the network parameters of the decoder.
While learning the VAE, the variational bound ℒVAE(θ, ϕ) is optimized with respect to the
parameters θ and ϕ of the encoder and decoder:
$$
\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) = \sum_{n=1}^{N} \left( -\mathbb{E}_{y \sim q_{\theta}(y \mid s_n)}\left[\log p_{\phi}(s_n \mid y)\right] + \mathrm{KL}\left(q_{\theta}(y \mid s_n) \,\|\, p(y)\right) \right)
$$
Here, the first term is the reconstruction error, computed by taking the expectation with
respect to the distribution of y, while the second term is a regularizer: the Kullback-Leibler
(KL) divergence between the estimated distribution qθ(y|sn) and the true prior p(y). This
divergence measures the amount of information lost when using q to represent p. The parameters
can be estimated using the auto-encoding variational Bayes (AEVB) algorithm [11].
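A minimal PyTorch sketch of this variational bound for one batch is shown below, assuming a Gaussian encoder that outputs a mean and log-variance and a Bernoulli (binary cross-entropy) reconstruction term, as used for the binarized handwritten characters later in the chapter; it is an illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vae_loss(s, s_reconstructed, mu, logvar):
    # -E_{y ~ q(y|s)}[log p(s|y)]: reconstruction error under a Bernoulli decoder.
    recon = F.binary_cross_entropy(s_reconstructed, s, reduction="sum")
    # KL(q(y|s) || p(y)) with p(y) = N(0, I), in closed form for a diagonal Gaussian.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```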
The pictorial representation of an encoder and decoder is given in Figure 18.2.
The encoder is a neural network whose input is a data point and whose output is a hidden
representation y, with weights and biases θ. Specifically, let the input s be a 15 by 15 pixel
photo of a handwritten digit, i.e., 225-dimensional data.
Figure 18.2 Encoder and decoder: the data s is encoded by qθ(y|s) (inference) and reconstructed by pϕ(s|y) (generative).
Figure 18.3 The input image is encoded into a latent distribution with mean μ and variance σ², from which pϕ(s|y) produces the reconstructed image.
The encoder "encodes" this 225-dimensional data into a latent (hidden) representation space y,
which has far fewer than 225 dimensions. This is conventionally termed a "bottleneck" because
the encoder must be capable of efficiently compressing the data into this lower-dimensional
space.
From Figure 18.2, we denote the encoder by qθ(y|s). The lower-dimensional space is stochastic:
the encoder outputs the parameters of qθ(y|s), which is a Gaussian probability density. This
distribution can be sampled to obtain noisy values of the representation y (Figure 18.3).
The decoder is also a neural network: its input is the representation y, its output is the
parameters of the probability distribution of the data, and it has weights and biases ϕ. The
decoder is denoted by pϕ(s|y). Continuing with the handwritten digit example, let us assume
that the images are black and white, with each pixel represented as 0 or 1. The probability
distribution of a single pixel can then be represented using a Bernoulli distribution. The
latent representation of a digit y is given as input to the decoder, and the output obtained is
225 Bernoulli parameters, one for each of the 225 pixels in the image. The decoder thus decodes
the real-valued numbers in y into 225 real-valued numbers between 0 and 1. Some information is
lost as the representation goes from a smaller to a larger dimensionality [12].
The requirements of a VAE are as follows:
● It is essential to use two tractable distributions:
– The prior distribution p(y) must be easy to sample from
– The conditional likelihood p(s|y, θ) should be computable
● In practice this implies that the two distributions under discussion are rarely complex,
for example, uniform, Gaussian, or even isotropic Gaussian
and scenes. The success of image analysis depends on the reliability of segmentation, but
an accurate partitioning of an image is generally a very challenging problem.
Segmentation techniques are either contextual or noncontextual. The latter takes no
account of spatial relationships between features in an image and groups pixels together
on the basis of some global attribute, e.g., gray level or color. Contextual techniques
additionally exploit these relationships, e.g., group together pixels with similar gray
levels and close spatial locations.
In handwriting image segmentation, digital handwriting is segmented into three differ-
ent types of segments, i.e., word segmentation, letter segmentation, and line segmenta-
tion, each used for different processing. The line segmentation is done by locating optimal
text and gap zones. The words and characters are subsequently located using the same
strategy by scanning each line.
Step 3: Image Preprocessing
Image preprocessing needs to be done properly so that accurate results are obtained and
errors due to external factors such as noise are minimized. Image preprocessing is the
technique in which the handwritten sample is translated into a format that can be efficiently
processed in further steps. These steps involve binarization, noise removal, line segmentation,
word segmentation, and character segmentation. Binarization converts the grayscale image into a
binary image. Noise removal techniques are applied to enhance the quality of the image.
Step 4: Line, Word, and Character Segmentation
For line segmentation, the image is first converted to grayscale and then binarized
inversely, such that the background is dark and the text is light. This image is then
dilated and the contours are found using the findContours() function in the OpenCV
library. Each detected contour is stored as a vector of points. The retrieval mode used is
RETR_EXTERNAL, which returns only the extreme outer contours. The approximation
method used is CHAIN_APPROX_SIMPLE, which compresses horizontal, vertical, and
diagonal segments and leaves only their end points. For example, an upright rectangular
contour is encoded with four points.
For word segmentation the above method is applied on the segmented lines and the char-
acters are subsequently segmented using the isolated words.
STEP 3: Convert the grayscale image to binary image using proper threshold and invert the
image
STEP 4: Dilate the binary image using kernel of 5 × 35 matrix of ones
STEP 5: Find contours from the dilated image and consider only the extreme outer contours
STEP 6: Extract the region of interest from the image using those contours and save them
as images, which are basically words in the lines
Algorithm for Character Segmentation
STEP 1: Load the sample words as images
STEP 2: Resize the contour containing the word using bicubic interpolation over 4 × 4 pixel
neighborhood
STEP 3: Convert it to grayscale
STEP 4: Convert the grayscale image to binary image using proper threshold and invert the
image
STEP 5: Dilate the binary image using kernel of 5 × 5 matrix of ones
STEP 6: Find contours from the dilated image and consider only the extreme outer contours
STEP 7: Extract the region of interest from the image using those contours and save them
as images, which are basically characters in the words
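A minimal OpenCV (version 4.x assumed) sketch of the word and character segmentation steps above is given below; the 5 × 35 and 5 × 5 dilation kernels follow the listed steps, while the kernel used for the initial line segmentation and the file name are assumptions, since that part of the procedure is not shown here.

```python
import cv2
import numpy as np

def segment(image, kernel_shape):
    """Binarize, invert, dilate, and extract the extreme outer contours as sub-images."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    dilated = cv2.dilate(binary, np.ones(kernel_shape, np.uint8))
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [image[y:y + h, x:x + w]
            for x, y, w, h in map(cv2.boundingRect, contours)]

sample = cv2.imread("handwriting_sample.png")                    # placeholder file name
lines = segment(sample, (5, 100))                                # assumed kernel for whole lines
words = [w for line in lines for w in segment(line, (5, 35))]    # 5 x 35 kernel of ones
chars = [c for word in words for c in segment(word, (5, 5))]     # 5 x 5 kernel of ones
```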
Step 5: Implementation and Use of VAE
The segmented characters are passed through the VAE system. The characters are then
reconstructed and classified to obtain the common features from the handwritten sam-
ples. The clustering will be helpful for the purpose of understanding whether the samples
have any features that can be used as a parameter in order to help diagnose AD. It can
also help in classifying the stage of the AD depending on the handwriting as, with the
disease getting progressively worse, the handwriting gets more and more illegible or error
prone.
Algorithm for Variational Auto encoder
STEP 1: The image is read
STEP 2: The image is converted to desirable format for required channels and size
STEP 3: The image pixel intensity value is stored in an array
STEP 4: The array is shaped to coincide with the input shape of the VAE model
STEP 5: The data is split into training and test set
STEP 6: The encoder model is constructed using fully connected layers, which are used to
calculate the mean and variance
STEP 7: The decoder, a sequential model with fully connected layers, is constructed
STEP 8: The input and output shapes for the encoder and decoder are used to construct the
autoencoder
STEP 9: The model is compiled using negative log normal loss function
STEP 10: The middle hidden layer represents the mean and variance layer using KL; the
model is trained as a whole and the mean and variance is updated for every batch of data
in each epoch using back propagation.
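A minimal sketch following these steps is given below; the chapter's wording suggests a Keras-style sequential model, but an equivalent PyTorch version is shown, with the 15 × 15 = 225-pixel character images described earlier as input and the `vae_loss` function from the earlier sketch. The layer sizes and latent dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, in_dim=225, hidden=128, latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())   # STEP 6: fully connected encoder
        self.fc_mu = nn.Linear(hidden, latent)        # mean of q(y|s)
        self.fc_logvar = nn.Linear(hidden, latent)    # log-variance of q(y|s)
        self.dec = nn.Sequential(                     # STEP 7: fully connected decoder
            nn.Linear(latent, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim), nn.Sigmoid()   # Bernoulli pixel parameters
        )

    def forward(self, s):
        h = self.enc(s)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        y = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized sample
        return self.dec(y), mu, logvar

model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(batch):                                # STEP 10: update per batch via backpropagation
    recon, mu, logvar = model(batch)                  # batch: (N, 225) pixel intensities in [0, 1]
    loss = vae_loss(batch, recon, mu, logvar)         # reconstruction + KL, as in the earlier sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```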
(Flowchart: Start → read all character images → the image is binarized and inverted → regions of interest are extracted to obtain words → Stop.)
(Figure 18.4 panels: Line 1 through Line 8.)
Figure 18.4 Line segment from handwritten sample from patients suffering from AD.
Figure 18.5 (a and b) Word segmentation samples produced from the segmented line.
Another feature of patients affected by AD is that double letters can have unclear/
incorrect writing. Problems in spelling phonologically similar words are also evident. As
the disease progresses, the patient’s motor control gets progressively impaired. The result
is shaky, disconnected handwriting (Figure 18.11).
Figure 18.8 Clusters of reconstructed images using VAE. (a) Cluster for "e," (b) cluster for "l," (c) cluster for "o," (d) cluster for "t" and "w."
18.4 Conclusion
In this study, a method is proposed to extract the common features for patients suffering
from the neurodegenerative AD. This method can help in the early diagnosis of the
disease. This is a cost-effective and noninvasive diagnostic aid. No particular pattern
has been identified as an irrefutable sign of AD, but the results from the study
can help test the cognitive abilities of the patient. The study can be further modified to
add more functionality, such as monitoring, to track the gradual progression of the disease
and also identify the stage the patient is in at the time of the sample. This can
only continue up to the point where the patient still has some functional ability intact;
in the last stage, the patient may be overwhelmed by memory loss, confusion, and
impaired functional ability and be unable to provide sufficient data to analyze with
this model.
The study can also be applied to extract features for other NGDs with slight modifi-
cations in the implementation to determine whether handwriting analysis can be used
as an early diagnostic method or not. This study will also benefit from the availability of
a large training set and dynamic data collection, which will enable better image recon-
struction. The dynamic collection will provide more precise parameters for more accurate
clustering.
References
1 Ranginwala, N.A., Hynan, L.S., Weiner, M.F., and White, C.L.I. (2008). Clinical criteria
for the diagnosis of Alzheimer disease: still good after all these years. The American
Journal of Geriatric Psychiatry 16 (5): 384–388.
2 Alzheimer’s Association (2015). Alzheimer’s disease facts and figures. Alzheimer’s &
Dementia 11 (3): 332–384.
3 McKhann, G., Drachman, D., Folstein, M. et al. (1984). Clinical diagnosis of Alzheimer's
disease: report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and
Human Services Task Force on Alzheimer's Disease. Neurology 34 (7): 939–944.
4 Ward, A., Arrighi, H.M., Michels, S., and Cedarbaum, J.M. (2012). Mild cognitive
impairment: disparity of incidence and prevalence estimates. Alzheimer's & Demen-
tia 8 (1): 14–21.
5 Tseng, M.H. and Cermak, S.A. (1993). The influence of ergonomic factors and percep-
tual motor abilities on handwriting performance. American Journal of Occupational
Therapy 47 (10): 919–926.
6 Kandel, E.R., Schwartz, J.H., and Jessell, T.M. (2000). Principles of Neural Science, 4e.
McGraw-Hill Medical.
7 Platel, H., Lambert, J., Eustache, F. et al. (1993). Characteristics and evolution of writing
impairment in Alzheimer's disease. Neuropsychologia 31 (11): 1147–1158.
8 Werner, P., Rosenblum, S., Bar-On, G. et al. (2006). Handwriting process variables
discriminating mild Alzheimer's disease and mild cognitive impairment. Journal of
Gerontology: Psychological Sciences 61 (4): 228–236.
9 Pirlo, G., Cabrera, M.D., Ferrer-Ballester, M.A. et al. (2015). Early diagnosis of neurode-
generative diseases by handwritten signature analysis. In: New Trends in Image Analysis
and Processing, ICIAP 2015 Workshops, 290–297, Springer.
10 Werner, P., Rosenblum, S., Bar On, G. et al. (2006). Handwriting process variables dis-
criminating mild Alzheimer’s disease and mild cognitive impairment. The Journals of
Gerontology. Series B, Psychological Sciences and Social Sciences 61 (4): P228–P236.
11 Kingma, D.P. and Welling, M. (2014). Auto-encoding variational Bayes. arXiv:1312.6114.
12 https://jaan.io/what-is-variational-autoencoder-vae-tutorial