Final Thesis
Thesis Report
JANUARY 2022
ACKNOWLEDGEMENT
First and foremost, I would like to express my sincere gratitude to my mentor, Akshita
Bandari, for guiding me throughout this project. I would also like to convey my sincere
thanks to my parents, my friends, and the fellow learners of my batch, who have supported
me in the project.
ABSTRACT
Machine learning has contributed a great deal to the scientific world in recent years. As
technology evolves, the amount of data generated is growing exponentially. This growth has
also brought a problem known as data imbalance, which occurs when one class dominates the
distribution of the other. Imbalanced data can cause severe problems because most machine
learning algorithms work best with evenly distributed classes. It also makes metrics such
as accuracy, which we commonly use in classification, misleading: a model may reach more
than 95% accuracy without classifying any minority-class instance correctly. Current
techniques for handling imbalanced data include assigning class weights, sampling (which
divides into under-sampling and oversampling), and boosting algorithms such as AdaBoost
and XGBoost. In this research, we use a thyroid disease dataset that is available on
Kaggle and contains around 3700 instances. The dataset is severely imbalanced, with only
6% of cases classified as positive for thyroid disease. We compare the methods mentioned
above for handling data imbalance and improving the performance of machine learning
algorithms such as Logistic Regression, Decision Trees, Random Forests, and Naïve Bayes.
These combinations are evaluated with metrics such as the AUC-ROC curve, specificity,
precision, recall, and F1-score. By identifying the best combination, we can mitigate the
effect of data imbalance.
LIST OF TABLES
LIST OF FIGURES
Figure 4.9. Box plots for Bivariate Analysis - 1
LIST OF ABBREVIATIONS
Abbreviation Expansion
T3 Triiodothyronine
BA Balanced Accuracy
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
1.1 Introduction
2.1 Introduction
2.4 Summary
3.1 Introduction
3.2.4 Exploratory Data Analysis
4.1 Introduction
4.4.4 Summary
5.1 Introduction
Summary
CHAPTER 6: CONCLUSION
6.1 Conclusion
REFERENCES
CHAPTER 1
INTRODUCTION
This research deals with thyroid disease data that is publicly available on Kaggle and the
UCI Machine Learning Repository. The dataset consists of around 3700 instances and 30
attributes. About 94% of the instances are negative cases, indicating severe data
imbalance. First, we perform data cleaning, feature engineering, and exploratory data
analysis. Then, we apply multiple machine learning algorithms (Logistic Regression,
Decision Tree, Random Forest, and Naive Bayes) to see the effect of data imbalance on each
model's performance. Next, we try different combinations of sampling techniques and
boosting algorithms to enhance the performance of these models. Since accuracy is
misleading on imbalanced data, we use other metrics such as precision, recall, F1-score,
and the AUC-ROC curve to evaluate the models. Most of these metrics can be derived from the
confusion matrix. We compare all the combinations and analyse their advantages and
disadvantages to suggest an application that can help mitigate the effect of data
imbalance.
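To illustrate why accuracy is not helpful here, the following minimal sketch derives the metrics from the confusion matrix. The labels are hypothetical and only mimic the 94:6 class split of the thyroid data; the function names are our own and are not taken from any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Divide safely; an undefined ratio (0/0) is reported as 0.0."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred, positive=1):
    """Derive the evaluation metrics from the confusion matrix counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical labels with the same 94:6 split as the thyroid data.
y_true = [0] * 94 + [1] * 6
always_negative = [0] * 100  # a "model" that never predicts the sick class
baseline = evaluate(y_true, always_negative)
print(baseline)
```

The always-negative baseline scores high on accuracy and specificity while its recall and F1-score are zero, which is why we report the latter metrics alongside accuracy.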
1.2 Problem Statement
The problem statement is to search for a suitable combination of a machine learning model
and a data imbalance handling technique that mitigates the effect of data imbalance in
classifying the minority classes correctly.
• To perform effective EDA to detect any underlying patterns and to get the data ready
for modeling.
• To employ various classification algorithms and study the effect of data imbalance on
their performance.
1.4 Research Questions
This research tries to compare the performance of sampling techniques and boosting methods
both individually and combined in handling data imbalance problems. The study starts by
highlighting the difficulty faced in handling skewed data. Then we try out different
combinations of methods mentioned above to address the issues. These are a few research
questions framed from the related works section:
This study's scope is to combine various sampling and boosting techniques to improve the
performance of classification algorithms. We use the Logistic Regression, Decision Tree,
Random Forest, and Naïve Bayes algorithms in this research. We explore sampling techniques
such as random oversampling, random under-sampling, ADASYN, SMOTE, and its variants, in
combination with boosting methods such as AdaBoost, Gradient Boosting, and XGBoost. The
study aims to handle data imbalance and reduce false alarms effectively. All methods are
evaluated with metrics such as precision, recall, F1-score, g-mean, and the AUC-ROC curve.
This research can be extended further by using deep learning algorithms.
1.6 Significance of the Study
This research tries to study the effects of data imbalance in classification. We will be using
thyroid disease data containing around 3700 instances, of which only 6% are positive,
indicating the extreme skewness of the data. This study intends to identify any underlying
patterns with the Exploratory Data Analysis and then move on to handle data imbalance. As
mentioned in the related works section, many studies have worked on sampling techniques
and boosting methods. The novelty of this study lies in combining the sampling techniques
with boosting strategies. Data imbalance is a serious concern in many real-world applications
such as thyroid disease diagnosis, credit card fraud detection, and market segmentation. Most
previous studies focused on classifying the minority class correctly, but we should also
deal with false alarms. Classifying a healthy person as sick or flagging a genuine
transaction as fraud has serious consequences, as does failing to identify a rare instance.
By combining sampling and boosting methods, we aim to reduce false alarms effectively. Our
final solution will aid in building an application that can handle data imbalance
effectively.
Introduction – This section gives a brief background of the research, its aims, and
objectives. Further, we will discuss the scope and significance of the research in this
chapter.
Literature Review – This section explores the state-of-the-art techniques available for
dealing with data imbalance. Further, we discuss the research papers that
influence our present research.
Research Methodology – This section gives a complete picture of the roadmap of this
research. We will discuss data cleaning, data pre-processing, modeling, data
imbalance handling techniques, and evaluation metrics in detail.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
Classification algorithms such as Logistic Regression, Decision Trees, and Random Forests
often run into data problems such as data imbalance. It has been a persistent problem for
years, since modern applications of classification algorithms frequently involve highly
imbalanced data. A great deal of research has been carried out over the years to study and
mitigate the effect of data imbalance, and we discuss the work most closely related to ours
in this section. Data imbalance occurs when one class dominates the distribution of the
other. Imbalanced data can cause severe problems because most machine learning algorithms
work best with evenly distributed classes. (Shakeel et al.) state that “a data set is said
to be imbalanced when the number of instances of majority class outnumbers the number of
instances of minority class by a large proportion.” Similarly, (IEEE Communications Society
and Institute of Electrical and Electronics Engineers.) note that an imbalanced
distribution of classes appears when one class has a higher ratio than the other class.
Over the years, researchers have tried different machine learning concepts to overcome the
problem of data imbalance. For example, (Ebenuwa et al., 2019a) proposed feature selection
using variance ranking techniques to handle data imbalance. (Jawaharlal Nehru Engineering
College. Department of Computer Science & Engineering et al.) deployed the random forest
algorithm to counter data imbalance. Further, (IEEE Computational Intelligence Society et
al., 2018) proposed sampling techniques to handle extreme data imbalance in a streaming
environment. Among these concepts, sampling and boosting have stood out as the
state-of-the-art approaches. This research deploys combinations of several sampling
techniques, classification algorithms, and boosting methods. The following sections review
the research done in these areas.
2.2 Classification Algorithms
Classification algorithms are supervised machine learning models that assign data points to
discrete classes based on patterns learned from training data. Examples of classification
algorithms are Logistic Regression, Decision Trees, Random Forests, Naive Bayes, and
Support Vector Machines. This section discusses a few studies that explored these
classification algorithms in detail. Logistic regression is one of the earliest and
best-known classification algorithms. It classifies a data point by calculating the
log-odds. (Cheng et al., 2006) mention that a “logistic regression (LR) model may be used
to predict the probabilities of the classes based on the input features after ranking them
according to their relative importance.” In other words, feature selection can be done
explicitly with logistic regression. They also concluded that the logistic regression model
can reduce the number of features substantially while still achieving high classification
accuracy. Decision trees are a supervised machine learning model inspired by the human
thinking process. They are among the most robust algorithms and are suitable for both
classification and regression. (Ishaq et al., 2021a) mention that decision trees create a
tree-like structure that classifies the data based on some key attributes. In many cases,
however, a single decision tree is not sufficient, which led to the development of
ensemble algorithms. (IEEE Systems and Institute of Electrical and Electronics Engineers.)
mention that Random Forest is a combination of several decision trees in which each tree
depends on an independent dataset, and all of them follow the same distribution. Even
though Random Forest produces better results than a single decision tree, it is not well
suited to handling data imbalance, especially when working with a large dataset. Naïve
Bayes is a classification algorithm derived from Bayes’ theorem, and it operates under the
assumption that predictors are independent of each other. (Misra et al.) mention that
“Naïve Bayes chooses the decision based on the highest probability. It estimates unknown
probabilities from known values, which means it considers prior knowledge and logic to be
applied to the decision-making process.” Support Vector Machines (SVM) are supervised
machine learning algorithms that can be applied to classification, regression, and outlier
analysis. (Institute of Electrical and Electronics Engineers.) defines SVM as follows:
“SVM is an assorted research algorithm that helps in performing the analysis in a precise
way.”
The “Review of Random Forest Classification Techniques to Resolve Data Imbalance” study
(Jawaharlal Nehru Engineering College. Department of Computer Science & Engineering et al., n.d.)
tested the performance of random forest variants such as Balanced Random Forest and
Weight-Sensitive Random Forest, and of other classification techniques such as Naive Bayes,
AdaBoost, and Support Vector Machine, on an imbalanced dataset. They used under-sampling
and oversampling techniques to handle data imbalance and noise in the data. They concluded
that even though the random forest variants took more time than the others, they handled
data imbalance more effectively than the other classification algorithms.
Further, the “Interactive Thyroid Disease Prediction System Using Machine Learning
Technique” study (Institute of Electrical and Electronics Engineers.) compares different
algorithms such as Decision Trees, K-means clustering, SVM, and Neural Networks on thyroid
disease identification problems. They used a different metric, the mean absolute error, to
evaluate each model. They concluded that there is a need to develop predictive models that
require only a minimum number of patient parameters to diagnose thyroid disease, which can
save money and time.
In “Prediction of Thyroid Disease Using Data Mining Techniques” (Institute of
Electrical and Electronics Engineers. Madras Section and Institute of Electrical and
Electronics Engineers.), the authors used data mining algorithms such as SVM, ID3, and
Naïve Bayes to identify the correlation between different thyroid hormones and thyroid
diseases. They evaluated the models on speed, accuracy, performance, and treatment cost.
Finally, they concluded that all these techniques helped minimize the noise in the data.
The “Variance Ranking Attributes Selection Techniques for Binary Classification Problem in
Imbalance Data” study (Ebenuwa et al., 2019) used three classification algorithms (Logistic
Regression, Naive Bayes, and Decision Tree) on three different datasets. They compared
attribute selection methods such as variance ranking, Pearson correlation, and information
gain, using metrics such as precision, recall, F-measure, and ROC area to evaluate the
models. They compared the model results with and without attribute selection and concluded
that attribute selection can play a vital role in handling data imbalance. Further, they
concluded that no single technique can be used for all types of datasets; the technique
should be selected on the basis of the use-case requirements.
2.3 Data Imbalance Handling Techniques
In this section, we discuss studies that examine the handling of data imbalance in detail.
We focus on research that deals with sampling techniques such as oversampling,
under-sampling, SMOTE and its variants, and ROSE, and with boosting algorithms such as
Gradient Boosting, AdaBoost, and XGBoost.
Sampling is a statistical technique for selecting a subset of the data, known as a sample,
that preserves the properties of the complete data, known as the population. Sampling helps
us analyze the population's characteristics without going through every data point. This
technique is useful in handling data imbalance problems. There are two divisions of
sampling: oversampling and under-sampling. Under-sampling refers to downsizing the majority
class to balance the data. In
(IEEE Circuits and Systems Society and Institute of Electrical and Electronics Engine
, it is mentioned that under sampling becomes ineffective as it leads to
significant data loss.
Oversampling tries to balance the data distribution by duplicating minority class
instances. Since we only duplicate existing instances, it does not lead to data loss. The
“Handling class imbalance in high-dimensional biomedical datasets” study (Pes, 2019)
combines feature selection with sampling-based class balancing strategies. The author
worked on a highly imbalanced genomic dataset available in the Gene Expression Machine
Learning Repository and used the Random Forest model as the classification algorithm. The
study concluded that the model's performance in terms of false positives improved with the
combination of sampling and feature selection methods. However, (Mathew et al., 2018)
highlighted that random oversampling can lead to overfitting on the minority class. This
problem gave rise to a new form of oversampling known as informed oversampling. Informed
oversampling methods generate synthetic minority instances to balance the data
distribution. Because these methods add new data rather than removing or copying existing
data, they avoid data loss and reduce the risk of overfitting. Examples of these methods
are SMOTE, ROSE, and their variants.
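The two random sampling strategies described above can be sketched as follows. This is a minimal, dependency-free illustration that assumes a binary problem in which the majority class really is the larger one; the helper names and the toy data are hypothetical.

```python
import random


def random_undersample(X, y, majority, seed=0):
    """Keep every minority row and an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority]
    minority_idx = [i for i, label in enumerate(y) if label != majority]
    kept = rng.sample(majority_idx, k=len(minority_idx))  # the rest is discarded
    idx = sorted(kept + minority_idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def random_oversample(X, y, minority, seed=0):
    """Duplicate random minority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    extra = rng.choices(minority_idx, k=len(majority_idx) - len(minority_idx))
    idx = list(range(len(y))) + extra  # originals plus duplicated minority rows
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy data: 94 negative rows and 6 positive rows with a single feature each.
X = [[float(i)] for i in range(100)]
y = [0] * 94 + [1] * 6

X_under, y_under = random_undersample(X, y, majority=0)
X_over, y_over = random_oversample(X, y, minority=1)
print(len(y_under), len(y_over))
```

Under-sampling shrinks the toy data to 12 rows, which shows the data loss discussed above, while oversampling grows it to 188 rows that contain many exact copies of the six positive rows, which is the source of the overfitting risk.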
Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique that
synthetically creates minority class instances from the existing ones. It generates new
instances by interpolating between a minority instance and its k-nearest neighbours (KNN)
in the region of feature space where minority instances lie together. (Yan et al., 2019)
define it as follows: “SMOTE is a technique that generates a certain number of minority
class samples according to the similarity of two random minority class samples to balance
the distribution between the majority and minority class samples.” They used a constructive
covering algorithm for data cleaning and for identifying key minority samples, and
evaluated the performance of SMOTE and its variants using metrics such as precision and
F-score. They concluded that the combination of SMOTE and the constructive covering
algorithm performs better than the other algorithms compared. The “A Review on Handling
Imbalanced Data” survey (SVS College of Engineering and Institute of Electrical and
Electronics Engineers, n.d.) discussed in detail the characteristics of data imbalance,
such as small disjuncts, overlapping, and data shifts, and the methods needed to handle
them. It compared different under-sampling, oversampling, and hybrid sampling techniques
combined with classification algorithms such as Decision Tree, Random Forests, SVM, and
ensemble methods, as published in other studies. It concluded that in most cases, the SMOTE
method succeeded in handling data imbalance effectively.
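The core idea of SMOTE can be sketched in a few lines. This is a simplified illustration for numeric features only, not the reference implementation: the function name `smote_sketch`, the default `k`, and the toy points are our own assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points between minority points and their neighbours.

    `minority` is a list of numeric feature vectors (at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Four hypothetical minority points forming a unit square in feature space.
minority_points = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote_sketch(minority_points, n_new=10, k=2)
print(new_points[:3])
```

Because every synthetic point lies on a line segment between two existing minority points, the new data stays inside the region already occupied by the minority class instead of repeating existing rows.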
Many variants of SMOTE have since been developed, such as SVM SMOTE, Borderline SMOTE, and
Safe-Level SMOTE. The “Handling Class Imbalance Problem using Oversampling Techniques:
Review” study (IEEE Communications Society and Institute of Electrical and Electronics
Engineers.) explained that Borderline SMOTE identifies the instances on or near the class
border, which are more prone to misclassification, and generates synthetic data from them.
It also explained that Safe-Level SMOTE assigns a safe level value and creates synthetic
instances that lie closer to the safe level values. The study used oversampling techniques
such as SMOTE, Borderline SMOTE, and Safe-Level SMOTE combined with classification
algorithms such as Naive Bayes, Support Vector Machine, and nearest neighbor classification
on six different datasets. It concluded that Safe-Level SMOTE outperformed all other
methods on the F-score and g-mean metrics. Further, the “Application of Balancing
Techniques with Ensemble Approach for Credit Card Fraud Detection” study (Shaw et al.,
n.d.) used data balancing techniques such as down-up sampling, the SMOTE family, and ADASYN
with a Random Forest model. The study concluded that the SVM SMOTE method combined with the
random forest model performed best, with the highest recall, precision, and F-score values.
Random Over Sampling Examples (ROSE) is another technique, based on bootstrapping, that
generates balanced synthetic data and can handle both categorical and continuous data. The
“Predictive Models with Resampling: A Comparative Study of Machine Learning Algorithms
and their Performances on Handling Imbalanced Datasets” study (Chakravarthy et al., 2019)
compared the sampling techniques SMOTE and ROSE on five classification algorithms: Cart
5.0, Random Forest, Neural nets, K means, and Support Vector Machine. They used two highly
imbalanced datasets, Open corrosion and Colorectal, and concluded that the ROSE technique
delivered more consistent results than SMOTE.
Boosting algorithms are a type of ensemble algorithm that combines the predictive power of
multiple weak learners to create one strong ensemble with low bias. Examples of boosting
algorithms are AdaBoost, Gradient Boosting, and XGBoost. AdaBoost is one of the most
popular algorithms for classification and has real-world applications such as face
detection and text classification. (Liu et al., 2017) explained that “the main idea of
Adaboost is to construct a succession of weak learners through different training sets with
different weights.” However, one disadvantage of AdaBoost is that it can overfit as the
number of base learners increases. Gradient Boosting is often considered one of the
strongest algorithms in machine learning and can be used with both categorical and
continuous data. It is, however, prone to overfitting, and variants such as XGBoost have
been developed to overcome this disadvantage. XGBoost is an optimized variant of the
gradient boosting technique that is highly effective, portable, and flexible. (Ogunleye and
Wang, 2020) describe XGBoost as a highly efficient boosting algorithm that can outperform
many other boosting algorithms. Their study used XGBoost with hyperparameters tuned with
optimization algorithms such as the genetic algorithm, the annealing algorithm, and the
Particle Swarm Optimization (PSO) algorithm to diagnose chronic kidney disease (CKD). They
used specificity, sensitivity, precision, and ROC-AUC to measure the performance of the
model and concluded that the XGBoost model achieved state-of-the-art results in diagnosing
chronic kidney disease, with accuracy, specificity, and sensitivity of 1.0 each.
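The idea of training successive weak learners on differently weighted data can be made concrete with a single AdaBoost round. The sketch below follows the classic binary formulation with labels in {-1, +1} and weights that sum to one; the function name and the toy predictions are hypothetical, and the weak learner itself is left out.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """Run one AdaBoost round; return the learner weight and the new sample weights."""
    # Weighted error of the current weak learner (weights are assumed to sum to 1).
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0) and division by 0
    alpha = 0.5 * math.log((1 - error) / error)
    # t * p is +1 for a correct prediction and -1 for a mistake, so misclassified
    # samples are scaled up and correctly classified samples are scaled down.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five samples with equal starting weights; the weak learner misses the second one.
y_true = [1, 1, -1, -1, -1]
y_pred = [1, -1, -1, -1, -1]
start_weights = [0.2] * 5
alpha, next_weights = adaboost_round(start_weights, y_true, y_pred)
print(alpha, next_weights)
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. On imbalanced data the misclassified samples are often minority instances, which is one reason boosting is a natural partner for sampling techniques.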
The “Improving the Prediction of Heart Failure Patients' Survival Using SMOTE and
Effective Data Mining Techniques” study (Ishaq et al., 2021b) used SMOTE in combination
with supervised learning techniques such as Random Forests, Extra Trees classifier,
AdaBoost, and XGBoost. They used accuracy, precision, and recall to evaluate the models.
They concluded that SMOTE improved the performance of all algorithms, and that the
combination of SMOTE and Random Forest achieved state-of-the-art results. The “A Data
Augmentation-Based Framework to Handle Class Imbalance Problem for Alzheimer's Stage
Detection” study (Afzal et al., 2019) used a transfer learning-based method on multi-class
Alzheimer's data. To handle data imbalance, they used data augmentation techniques such as
cropping and rotating the images with multiple parameters. They concluded that data
augmentation increased the model's overall performance and helped surpass the previous
state-of-the-art results. The “Synthetic oversampling with the majority class: A new
perspective on handling extreme imbalance” study (Sharma et al., 2018) used SMOTE along
with the Sampling With the Majority (SWIM) technique, which is designed to overcome the
limitations of SMOTE. They tested this combination on 13 datasets and three classification
algorithms and concluded that the combination of SMOTE with SWIM achieved better results
than SMOTE alone.
2.4 Summary:
CHAPTER 3
RESEARCH METHODOLOGY
3.1 Introduction
This chapter discusses the research methodology from the initial stage of data selection to
the final stage of model evaluation. It covers data selection, data understanding, data
cleaning, exploratory data analysis, pre-processing, modeling, sampling techniques,
boosting algorithms, and evaluation metrics. In data selection, we discuss the origin of
the dataset and the reasons for selecting it for this research. In the data understanding
step, we discuss the attributes and the distribution of the data. The data cleaning phase
consists of handling null values, performing sanity checks, analysing outliers, checking
the data types of attributes, and dropping redundant attributes. In exploratory data
analysis, we perform univariate, bivariate, and multivariate analysis to detect any
underlying patterns in the data. This step is also known as data visualization, where we
try to understand the correlation between independent attributes. In data pre-processing,
we make the data model-ready by creating dummy variables, scaling, and splitting the data
into train and test sets. In the modeling step, we discuss the machine learning algorithms
used in this research: Logistic Regression, Decision Trees, Random Forest, and Naive Bayes.
Under sampling techniques, we briefly discuss under-sampling, oversampling, SMOTE and its
variants, and ROSE. Under boosting algorithms, we discuss AdaBoost, XGBoost, and Gradient
Boosting. In the evaluation metrics phase, we discuss metrics such as the ROC-AUC curve,
precision, and F1-score. Each of these steps and techniques is discussed in detail, and the
chapter closes with a summary of our research methodology. The goal is to build a suitable
combination of a machine learning model and a data imbalance handling technique that
mitigates the effect of data imbalance in classifying the minority class correctly.
Figure 3.1.1: Process Flow Diagram
3.2 Research Methodology
In this research, we use the Thyroid disease dataset obtained from the UCI Machine Learning
Repository, where the raw data is available as a CSV file. The dataset consists of the test
results of patients who were tested for thyroid-related diseases. The reason for selecting
this dataset is its imbalanced class distribution. Its columns record age, gender,
sickness, goitre, tumour, and the measured values of various thyroid hormones such as TSH,
T3, TT4, T4U, FTI, and TBG. The dataset was provided by the Garavan Institute in Sydney,
Australia. The result is negative if the patient tests negative for thyroid disease and
sick if the patient tests positive for the disease.
Data imbalance has been a severe obstacle in applying machine learning algorithms to
real-world applications. Applications that suffer from data imbalance include thyroid
disease diagnosis, cancer diagnosis, and market segmentation. Generally, these algorithms
work best when the classes are evenly distributed. This research deals with the Thyroid
disease dataset available on Kaggle and the UCI Machine Learning Repository. There are 3772
instances in the dataset, of which only 6% are positive for thyroid disease. The dataset
contains the basic details of the patient, such as age and gender. It also includes patient
conditions such as sick, tumor, goitre, and hyperthyroid, and the medication or treatment
the patient receives, such as I131_treatment and lithium. The dataset also contains the
measured values of thyroid hormones such as TSH, T3, TT4, T4U, and FTI. The attribute
referral_source refers to the origin of the patient. Finally, the attribute class is the
dependent variable and denotes whether the patient is sick or not. In total, the dataset
contains 30 attributes. The table below gives the description and data type of each
attribute.
Attribute                  Data Type                Description
Age                        Numeric                  Age of the patient
Sex                        Boolean (f/m)            Indicates the patient's gender
on_thyroxine               Boolean (f/t)            Indicates if the patient is using thyroxine medication
on_antithyroid_medication  Boolean (f/t)            Indicates if the patient is on anti-thyroid medication
sick                       Boolean (f/t)            Indicates if the patient is sick or not
Pregnant                   Boolean (f/t)            Indicates if the patient is pregnant or not
thyroid_surgery            Boolean (f/t)            Indicates if the patient has undergone thyroid surgery
I131_treatment             Boolean (f/t)            Indicates if the patient is receiving I131 treatment
lithium                    Boolean (f/t)            Indicates if the patient is using lithium or not
goitre                     Boolean (f/t)            Indicates if the patient is suffering from goitre or not
tumour                     Boolean (f/t)            Indicates if the patient has a tumor or not
hypopituitary              Boolean (f/t)            Indicates if the patient has hypo-pituitary disorder
psych                      Boolean (f/t)            Indicates if the patient is referred to psych evaluation
TSH_measured               Boolean (f/t)            Indicates if the patient's TSH value is measured or not
T3_measured                Boolean (f/t)            Indicates if the patient's T3 value is measured or not
TT4_measured               Boolean (f/t)            Indicates if the patient's TT4 value is measured or not
T4U_measured               Boolean (f/t)            Indicates if the patient's T4U value is measured or not
FTI_measured               Boolean (f/t)            Indicates if the patient's FTI value is measured or not
TSH                        Numeric                  Indicates the TSH value of the patient
T3                         Numeric                  Indicates the T3 value of the patient
TT4                        Numeric                  Indicates the TT4 value of the patient
T4U                        Numeric                  Indicates the T4U value of the patient
FTI                        Numeric                  Indicates the FTI value of the patient
referral_source            String                   Indicates the referral source of the patient
Class                      Boolean (Negative/sick)  Indicates if the patient is suffering from thyroid disease or not
Table 3.2.2: Dataset Details
Data cleaning is a crucial step in any machine learning project, and it often takes up the
bulk of the project time. This step includes dealing with null values, dropping redundant
attributes, and performing sanity checks.
This step starts with finding the percentage of null values in each attribute. If the share
of null values in an attribute is exceedingly high, we drop the attribute; otherwise, we
fill in the null values. If the attribute is numerical, we use its median value to fill the
null values, and if it is categorical, we use its mode.
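The null-handling rule above can be sketched as follows. The 50% drop threshold is an assumption made for illustration (the text only says "exceedingly high"), the column values are hypothetical, and missing entries are represented by `None`.

```python
from statistics import median, mode


def null_share(column):
    """Fraction of missing (None) entries in a column."""
    return sum(value is None for value in column) / len(column)


def impute(column, numeric):
    """Fill missing entries with the median (numeric) or the mode (categorical)."""
    observed = [value for value in column if value is not None]
    fill = median(observed) if numeric else mode(observed)
    return [fill if value is None else value for value in column]


def clean_nulls(table, numeric_cols, drop_threshold=0.5):
    """Drop columns that are mostly empty and impute the remaining ones."""
    cleaned = {}
    for name, column in table.items():
        if null_share(column) > drop_threshold:  # assumed cut-off for "too many nulls"
            continue
        cleaned[name] = impute(column, numeric=name in numeric_cols)
    return cleaned


# Hypothetical excerpt: TBG is mostly missing, TSH and sex have one gap each.
table = {
    "TSH": [1.3, None, 4.1, 0.98, 2.0],
    "sex": ["F", "M", None, "F", "F"],
    "TBG": [None, None, None, None, 26.0],
}
cleaned = clean_nulls(table, numeric_cols={"TSH", "TBG"})
print(cleaned)
```

The median is preferred over the mean for numerical attributes because it is less sensitive to extreme values, which are common in hormone measurements.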
Sanity checks
Sanity checks verify the completeness and quality of the data, which is essential when the
data comes from external sources. In this step, we check that all attributes have the
correct data types and that the range of numerical columns is consistent. Then, we look for
duplicate instances in the data and drop them. We also drop redundant columns that have
only one unique value. Finally, we can create new variables from the existing ones that may
be more correlated with the target variable. This step is also known as feature
engineering.
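A minimal sketch of these checks is shown below, with each instance stored as a dictionary. The rows, the `constant_flag` column, and the plausible age range of 0 to 120 are hypothetical and serve only to illustrate the idea.

```python
def drop_duplicate_rows(rows):
    """Return the rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the whole row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def constant_columns(rows):
    """Names of columns that hold a single unique value and carry no information."""
    return [name for name in rows[0] if len({row[name] for row in rows}) == 1]


def out_of_range(rows, column, low, high):
    """Rows whose value in `column` falls outside the plausible range [low, high]."""
    return [row for row in rows if not (low <= row[column] <= high)]


# Hypothetical instances: one exact duplicate and one implausible age.
rows = [
    {"age": 41, "sex": "F", "constant_flag": "f"},
    {"age": 41, "sex": "F", "constant_flag": "f"},
    {"age": 455, "sex": "M", "constant_flag": "f"},
]
unique_rows = drop_duplicate_rows(rows)
redundant = constant_columns(unique_rows)
suspicious = out_of_range(unique_rows, "age", low=0, high=120)
print(len(unique_rows), redundant, suspicious)
```

Range violations are only reported here rather than fixed automatically, since deciding whether to correct, cap, or drop such an instance usually needs domain knowledge.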
In exploratory data analysis, we try to identify any underlying patterns in the data by
grouping the data and visualizing it. Data visualization is of three types: univariate,
bivariate, and multivariate. We can use plots such as count, bar, line, box, and scatter
plots and heat maps to visualize the data. Through this step, we try to find the
correlation between independent variables of the data. We can also perform outlier analysis
as a part of data cleaning in this step.
Univariate Analysis
In this step, we try to understand the composition and distribution of each variable
individually. For categorical variables, we use count plots to understand their composition,
whereas for numerical variables, we can use a box plot or histogram to understand their
distribution.
Bivariate Analysis
This step tries to understand the correlation between any pair of variables. This step is
essential because the correlation between two independent variables may lead to drastic
changes in the model’s performance. We can use box plots, bar plots, scatter plots, and joint
plots for this step.
Multivariate Analysis
In this step, we try to understand the correlation of each variable with the target variable
taking the help of a heat map. We can also build pivot tables or groups to aggregate the data
at a granular level to understand the relation between more than two variables at a time.
In this step, we perform the operations needed to make the data model-ready. Data
normalization helps machine learning algorithms to converge faster. Then, we need to
create dummy variables for categorical variables using encoding techniques. Our final step
before building a model is the Train-Test split, in which we divide our dataset into two sets,
train and test. Generally, we use the 75:25 ratio to split the dataset.
Logistic Regression
Logistic regression is one of the earliest and most famous classification algorithms. It is a
supervised classification algorithm that is used in several real-world applications. In
(Institute of Electrical and Electronics Engineers and Manav Rachna International Institute of R
, Logistic Regression is defined as a classification model that gives a binomial
outcome based on the values of input variables. It assumes there is no multicollinearity
between the independent variables and that only significant variables are included in the
model. The model uses the sigmoid function to convert the linear combination of the input
variables, z, into a probability p.
$$ p = \frac{1}{1 + e^{-z}} $$
Figure 3.2. Equation for sigmoid function
Logistic regression is also known as the Logit model because it models the log-odds of the
target variable as a linear function of the inputs. We can choose a cut-off probability to
classify the target variable.
Some of the advantages of logistic regression are that it is simple to implement, requires
little computational power, and is easy to regularize. However, it can overfit and is
strongly affected by multicollinearity. This model also cannot solve non-linear
problems. Its real-world applications include customer churn prediction, disease
diagnosis, and predicting the probability of a corrupt or faulty system.
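As a small illustration of the sigmoid function and the cut-off value discussed above, the sketch below maps a score z to a probability and then to a class label. The function names and the default cut-off of 0.5 are assumptions made for this example.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z: float, cutoff: float = 0.5) -> int:
    """Return 1 if the predicted probability reaches the cut-off, else 0."""
    return int(sigmoid(z) >= cutoff)

for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

Raising the cut-off makes the model more conservative about predicting the positive class, which trades sensitivity for precision.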
Decision Trees
Decision trees are supervised machine learning models inspired by the human
decision-making process. They are among the most robust algorithms and are suitable for
both classification and regression. A decision tree predicts the target variable by learning
simple decision rules inferred from the data features in the form of a tree. The first node at
the top of the tree is called the root. The tree splits into branches at internal nodes, and
these internal nodes are split further until they end in leaf nodes. Leaf nodes are the bottom
nodes that cannot be split further. Decision trees split data into multiple sets, which are
divided again to arrive at a decision. The root node and each internal node act as an
if-then-else condition on a variable's value to split the data. The variable for each split is
chosen using an impurity measure: the variable that reduces impurity the most is selected.
Some impurity measures are Gini impurity, Classification Error, and Entropy.
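$$ \varepsilon = 1 - \max_i (p_i) $$
Figure 3.4. Equation for Classification error
$$ G = \sum_{i=1}^{C} p_i (1 - p_i) $$
$$ D = -\sum_{i=1}^{C} p_i \log_2 p_i $$
Here p_i is the proportion of instances of class i in the node, and C is the number of classes.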
After selecting the variable with the help of an impurity measure, we split the tree at the
root node, and this process is repeated for each internal node until all of them are split
into leaf nodes. A decision tree resembles an upside-down tree, and the process is quite
similar to the human decision-making process.
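The three impurity measures named above (classification error, Gini impurity, and entropy) can be computed from the class proportions in a node. The sketch below is illustrative; the function names and the example proportions are assumptions, not part of the original work.

```python
import math

def classification_error(p):
    """1 minus the share of the most frequent class."""
    return 1.0 - max(p)

def gini(p):
    """Sum of p_i * (1 - p_i) over all classes."""
    return sum(pi * (1.0 - pi) for pi in p)

def entropy(p):
    """Negative sum of p_i * log2(p_i); classes with p_i = 0 contribute 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

pure = [1.0, 0.0]    # node containing a single class
mixed = [0.5, 0.5]   # node split evenly between two classes

for name, p in (("pure", pure), ("mixed", mixed)):
    print(name, classification_error(p), gini(p), entropy(p))
```

All three measures are zero for a pure node and reach their maximum for an evenly mixed node, which is why a split that lowers them is preferred.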
Figure 3.7. Example of Decision Tree implementation
Hyperparameters are the parameters that we pass to the learning model to control the
model's training. They are the modeler's choices for tuning the behavior of the learning
model. They significantly affect the model's output, and each of them can be searched over a
list of candidate values. The process of finding optimal values for them is called
hyperparameter tuning. The decision tree has a rich set of hyperparameters, which gives the
modeler fine control over its behavior. The below table contains the names and descriptions
of the hyperparameters for a decision tree.
Unlike many other machine learning algorithms, decision trees are easy to interpret,
understand, and visualize. They can handle both numerical and categorical data. They can
handle multicollinearity effectively. They are fast and efficient compared to many other
classification algorithms. They do not require the normalization of numerical variables.
However, decision trees are highly prone to overfitting. We can use the hyperparameter
max_depth to prevent the tree from overfitting. Decision trees are also highly sensitive to
changes in the dataset: even a small change in the data can make the tree unstable.
Random Forest
Random forest is one of the most popular ensemble algorithms. It is a combination of
several decision trees in which each tree is trained on an independently drawn random
sample of the data, and all the samples follow the same distribution. The training process
of the random forest model is therefore different from that of a single decision tree. Even
though a random forest usually gives better results than a decision tree, it is more complex
and harder to interpret. Random forest is a black-box model, which makes it difficult to
explain how the model arrives at a prediction. Random forests are more stable, more diverse,
and less prone to overfitting than a decision tree. The random forest model can use all
the hyperparameters of a decision tree. In addition, it has the n_estimators hyperparameter,
which specifies the number of trees to be used in the model. We can use GridSearchCV, which
combines an exhaustive grid search with cross-validation, to try out all the possible
combinations of hyperparameters and rank them by a metric of our choice.
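A grid search of this kind could look like the sketch below, which uses scikit-learn's GridSearchCV with a random forest. The synthetic data, the max_depth values, and the choice of F1 as the scoring metric are assumptions made for illustration; only the n_estimators values follow the grid used later in this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the real dataset (assumption).
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

# Candidate values; max_depth values are illustrative only.
param_grid = {"n_estimators": [2, 4, 5, 6, 8], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",   # metric used to rank the combinations
    cv=5,           # 5-fold cross-validation
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Every combination in the grid is fitted once per fold, so the number of model fits grows quickly with the size of the grid.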
Naïve Bayes
Naïve Bayes is a classification algorithm derived from the Bayes' theorem, and it operates
under the principle that predictors are independent of each other.
(Jawaharlal Nehru Engineering College. Depart
defined that “Naive Bayes
classifiers are a collection of classification algorithms that follows Bayes' principle and
chooses the decision based on the highest probability”. It estimates unknown probabilities
from known values, which means it considers prior knowledge and logic to be applied to the
decision-making process.
$$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $$
Naive Bayes classifiers are extremely fast and simple, and they require relatively little
training data. They can handle numerical and categorical data and scale well with the number
of predictors and data points. They are used in real-world applications such as spam mail
detection, real-time prediction, and recommendation systems. However, if the test dataset
contains a category value that does not appear in the training dataset, the naive Bayes
classifier assigns it a probability of zero and cannot make a meaningful prediction for
that instance.
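To make Bayes' theorem concrete, the sketch below computes a posterior probability from a prior and two likelihoods. The prior of 0.06 mirrors the share of positive cases in the dataset; the two likelihood values are hypothetical and chosen only for illustration.

```python
def posterior(prior: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A | B) using Bayes' theorem.

    P(B) is expanded with the law of total probability:
    P(B) = P(B | A) P(A) + P(B | not A) P(not A).
    """
    evidence = p_b_given_a * prior + p_b_given_not_a * (1.0 - prior)
    return p_b_given_a * prior / evidence

# A: patient has thyroid disease, B: some test result is observed.
p = posterior(prior=0.06, p_b_given_a=0.9, p_b_given_not_a=0.1)  # hypothetical likelihoods
print(round(p, 3))
```

Even with a fairly informative observation, the posterior stays well below the likelihood P(B | A) because the prior for the minority class is so small.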
3.3.2 Sampling Techniques
Sampling is a statistical technique that helps us find a subset of data known as a sample with
all the properties of the complete data known as a population. Sampling helps us analyze the
population's characteristics without going through every data point. This technique is
beneficial in handling data imbalance problems. There are two divisions of sampling,
oversampling and under sampling.
Under Sampling
Under-sampling refers to reducing the number of majority class instances to balance the data
distribution. Since we are deleting data instances, it can cause loss of information and
lead to underfitting of the model.
Over Sampling
Oversampling refers to duplicating minority class instances to balance the data distribution.
The model will not underfit since we are not deleting data, but it can overfit the minority
class due to duplication. Many techniques are derived from oversampling, such as
SMOTE and its variants, and ADASYN.
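Random under-sampling and random oversampling, as described above, can be sketched with pandas alone. The toy dataset and variable names are hypothetical; synthetic techniques such as SMOTE and ADASYN generate new minority instances instead of duplicating existing ones and are not shown here.

```python
import pandas as pd

# Hypothetical toy dataset: 8 majority (0) and 2 minority (1) instances.
df = pd.DataFrame({"x": range(10), "target": [0] * 8 + [1] * 2})
majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Under-sampling: keep only as many majority rows as there are minority rows.
under = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

# Oversampling: duplicate minority rows by sampling with replacement.
over = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

print(under["target"].value_counts().to_dict())
print(over["target"].value_counts().to_dict())
```

Both results are balanced, but the under-sampled set has discarded six majority rows while the oversampled set contains repeated copies of the two minority rows.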
SMOTE Family:
ADASYN
Boosting algorithms are a type of Ensemble algorithm that combines the predictive power of
multiple weak learners to create one strong ensemble with low bias. Examples of boosting
algorithms are Adaboost, Gradient Boosting, CatBoost and XGBoost.
Adaboost
Adaptive Boosting (AdaBoost) is one of the most popular algorithms for classification and
has real-world applications such as face detection and text classification. AdaBoost
constructs a succession of weak learners, each trained on a reweighted version of the
training set in which previously misclassified instances receive higher weights. However,
one disadvantage of AdaBoost is that it can overfit as the number of base learners
increases.
Gradient Boosting
XGBoost
Evaluation of models is a critical step in any machine learning project. We need to identify
the best metrics to estimate the model's efficiency for our use case.
(Makki et al., 2019) explains that the Confusion Matrix is one of the most widely used tools
for calculating metrics such as accuracy, precision, F-score, specificity, and sensitivity.
For binary classification, it is a 2x2 matrix that stores four values: True Positives (TP),
False Positives (FP), False Negatives (FN), and True Negatives (TN). True Positives are the
positive instances that are correctly classified as positive. False Positives are the
negative instances that are incorrectly classified as positive. False Negatives are the
positive instances that are incorrectly classified as negative. True Negatives are the
negative instances that are correctly classified as negative.
• “Accuracy is defined as the ratio of total correct predictions to the total predictions”.
$$ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} $$
Figure 3.9. Equation for Accuracy
• “Precision is defined as the ratio of correctly predicted positives to the total predicted
positives”. It is also known as positive predictive value.
$$ \text{Precision} = \frac{TP}{TP + FP} $$
Figure 3.10. Equation for Precision
$$ \text{Sensitivity} = \frac{TP}{TP + FN} $$
Figure 3.11. Equation for Sensitivity
• “Specificity is defined as the ratio of correctly identified negatives to the total
negatives”. It is also known as Selectivity or True Negative Rate.
$$ \text{Specificity} = \frac{TN}{TN + FP} $$
Figure 3.12. Equation for Specificity
• “False Positive Rate (FPR) is defined as the ratio of negatives falsely classified as
positives to the total negatives”. It is also known as Fallout.
$$ FPR = \frac{FP}{FP + TN} $$
Figure 3.13. Equation for FPR
• “False Negative Rate (FNR) is defined as the ratio of positives falsely classified as
negatives to the total positives”. It is also known as Miss rate.
$$ FNR = \frac{FN}{FN + TP} $$
Figure 3.14. Equation for FNR
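$$ BA = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right) $$
Figure 3.15. Equation for BA
$$ \text{F1 Score} = \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}} $$
Figure 3.16. Equation for F1 Score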
Among the above, we will be using accuracy, precision, sensitivity, specificity, balanced
accuracy, and F1 score as metrics to evaluate the models in this research. We will also be
using the AUC-ROC curve, which is discussed below.
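The metrics selected above follow directly from the four confusion-matrix counts. The sketch below is a minimal illustration; the function name and the example counts are hypothetical and are not results from this research.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the evaluation metrics from the four confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical counts for 1000 instances, 60 of which are positive.
m = confusion_metrics(tp=10, fp=10, fn=50, tn=930)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the accuracy is 94% even though only 10 of the 60 positives are found, which is the reason accuracy alone is not trusted on imbalanced data.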
AUC – ROC Curve
The AUC-ROC curve, also known as the Area Under the Curve of the Receiver Operating
Characteristic curve, is an evaluation metric used for binary classification. The ROC curve
plots sensitivity (true positive rate) against 1 − specificity (false positive rate) across
classification thresholds. AUC has a range of 0 ≤ AUC ≤ 1. An AUC score of 0 indicates
that the model classifies all positives as negatives and vice versa. An AUC score of 0.5
indicates that the model is no better than a random guess. An AUC score of 1 indicates that
the model classifies all positives and negatives correctly. To conclude, as the AUC
score increases, the efficiency of the model increases.
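One way to build intuition for these AUC values is the equivalent interpretation of AUC as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes AUC this way in plain Python; the function name and toy scores are assumptions, and in practice a library routine such as scikit-learn's roc_auc_score would normally be used.

```python
def auc_score(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
print(auc_score(y_true, [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(auc_score(y_true, [0.1, 0.2, 0.8, 0.9]))   # 1.0
print(auc_score(y_true, [0.9, 0.8, 0.2, 0.1]))   # 0.0
```

The three calls correspond to a partly correct ranking, a perfect ranking, and a completely inverted ranking.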
Please make sure that all the packages are updated to their latest versions.
3.4.2 Hardware Requirements
Laptop – Windows/Mac
RAM – 6 GB or higher
Processor – Intel Core i5 or higher
GPU – Intel integrated or higher
OS – Windows 10 / Mac Catalina or equivalent
Google Drive
3.5 Summary
In this section, we have discussed all the research steps in detail. We have started with data
cleaning techniques, Exploratory Data Analysis steps, and data processing techniques. Then,
we have discussed each of the classification algorithms, sampling techniques, and boosting
algorithms that we will use in this research. We have discussed the evaluation metrics
suitable to address the data imbalance problem. We finished the section with our research's
software and hardware requirements.
CHAPTER 4
ANALYSIS AND IMPLEMENTATION
4.1 Introduction
This chapter will discuss implementing the steps such as data understanding, data cleaning,
Exploratory Data Analysis, Model building, and model evaluation. We will also discuss the
outcomes of each step in this chapter.
This research uses the Thyroid disease dataset available on Kaggle and in the UCI repository.
This dataset consists of 3772 instances and 30 attributes. It is severely
imbalanced, as only about 6% of the instances belong to the minority class.
• The variables age and sex indicate the age and gender of the patient. The variables
sick and pregnant indicate if the patient is sick or pregnant.
• The variables lithium, goitre, hypopituitary, tumour, and psych indicate if the patient is
suffering from the respective conditions.
• The variable thyroid_surgery indicates if the patient has undergone thyroid surgery.
The variable referral_source indicates the location from where the patient was
referred.
• The variables TSH_measured, T3_measured, T4U_measured, TT4_measured,
TBG_measured, and FTI_measured indicate if the respective hormone values of the patient
have been measured.
• The variables TSH, T3, T4U, TBG, and FTI indicate the respective hormone levels of the
patient.
• The variable Class is the target variable, and it indicates if the patient is suffering
from a thyroid-related disease.
In this section, we will have a look at the statistical properties such as count, sum, mean,
percentile values, minimum value, maximum value, mode of all numerical attributes.
Here we can notice some discrepancies in the data, such as the maximum value of age being
455, the maximum value of TSH being 530, and the absence of data for the TBG variable. All
of these will be taken care of in the data cleaning steps.
4.3 Data Cleaning
Data cleaning is a crucial step in any machine learning project, and it makes the data ready
for the Exploratory Data Analysis step. Data cleaning includes checking data types, handling
duplicate rows, dealing with null values, dropping redundant attributes, and performing
sanity checks.
This step will start with finding the percentage of null values in each attribute. If the null
values in an attribute are exceedingly high, we will drop that; otherwise, we try to fill in null
values. If the attribute is numerical, we will use the median value of the attribute to fill the
null values, and we will use the mode value if the attribute is categorical. In our case, we
dropped the variable TBG since it had more than 90% null values. We used the mode to replace
null values for the variable sex. For the variables representing measured hormone values, we
replaced null values with 0, since every null value has its corresponding measured flag set
to false.
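The null-value handling described above can be sketched as follows. The toy values are made up, the 90% threshold is taken from the TBG case, and the age column is added only to show the median rule for a numerical attribute; this is an illustration rather than the exact code used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing the dataset's column names; values are made up.
df = pd.DataFrame({
    "age": [41.0, np.nan, 23.0, 70.0],
    "sex": ["F", np.nan, "F", "M"],
    "TSH_measured": ["t", "f", "t", "t"],
    "TSH": [1.3, np.nan, 4.1, 0.9],
    "TBG": [np.nan, np.nan, np.nan, np.nan],
})

# Percentage of null values in each attribute.
null_pct = df.isnull().mean() * 100

# Drop attributes whose share of nulls exceeds the threshold (90% assumed).
df = df.drop(columns=null_pct[null_pct > 90].index)

# Numerical attribute: fill with the median.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical attribute: fill with the mode.
df["sex"] = df["sex"].fillna(df["sex"].mode()[0])

# Hormone value that was never measured: fill with 0.
df["TSH"] = df["TSH"].fillna(0)

print(df)
```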
We need to ensure that all the data variables are declared with appropriate data types. In our
case, the variable age is declared as a float variable while its values are integers. So, we
changed the data type of the variable age to int. We mapped the values of our categorical
variables, except for the sex variable, to 0 and 1 instead of f and t.
We need to ensure that only significant variables are included in the model. So, we need to
drop any insignificant variables in the dataset. In our case, the variable TBG_measured
contains only one unique value, making the variable insignificant. So, we dropped the
variable TBG_measured.
Duplicate Rows
We need to identify and delete duplicate rows in the dataset. Otherwise, they can lead to
overfitting. In our dataset, there are 63 duplicated rows, and we dropped them.
Range Checks
We need to ensure that the range of all numerical variables is in order. For example, in our
dataset, the variable age has a maximum value of 455, which is impossible, so we
replaced this value with the second-highest value, i.e., 95. The range check also revealed that
many of the hormone variables have outliers.
We identified that all the hormone variables have outliers. In order to handle outliers, we
need to clip the variable values. We will observe details of the outliers from the box plot,
such as the side of the plot, extent of the outliers, and declare lower and upper values to clip
the variables. For example, after observing the box plot of the T3 variable, we identified that
outliers are concentrated on the right side of the plot, i.e., towards the maximum. So, we
declared the lower value as the minimum value of T3 and set the upper value as the 90th
percentile of the T3 variable.
For the T4U variable, we observed from the box plot that the outliers are present on both ends.
So, we set the lower value at the 10th percentile and the upper value at the 80th percentile
to clip the T4U variable.
In the same way, we have handled outliers for the remaining hormone variables in the
dataset.
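The clipping step can be sketched with pandas as shown below, using the T3 rule described above (minimum as the lower bound, 90th percentile as the upper bound). The values in the series are hypothetical.

```python
import pandas as pd

# Hypothetical T3 values with a few large outliers on the right side.
t3 = pd.Series([0.5, 1.0, 1.5, 2.0, 2.0, 2.5, 2.5, 3.0, 9.0, 12.0], name="T3")

lower = t3.min()            # no outliers towards the minimum
upper = t3.quantile(0.90)   # the 90th percentile caps the right tail

# Values outside [lower, upper] are replaced by the nearest bound.
t3_clipped = t3.clip(lower=lower, upper=upper)

print(upper)
print(t3_clipped.tolist())
```

For a variable with outliers on both ends, such as T4U, the same call would use `quantile(0.10)` and `quantile(0.80)` as the two bounds.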
Feature Engineering
Feature engineering is a technique in the data cleaning process that creates new features
from the raw data using domain knowledge.
• We created a new variable named TSH_Normal, which indicates whether the TSH
levels of the patient are normal or not.
• We created a variable named age_50, which indicates whether the age of the patient is
greater than or equal to 50. If age ≥ 50, it is recorded as 1; otherwise, it is 0.
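As an illustration of the second bullet, the age_50 flag can be derived with a single vectorized comparison. The ages below are hypothetical; the TSH_Normal variable is not sketched because its cut-off values are not stated here.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the boundary value 50.
df = pd.DataFrame({"age": [23, 49, 50, 72]})

# 1 if age >= 50, otherwise 0.
df["age_50"] = np.where(df["age"] >= 50, 1, 0)

print(df)
```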
4.4 Exploratory Data Analysis
In this step, we try to understand the composition and distribution of each variable
individually. For categorical variables, we use bar plots to understand their composition,
whereas for numerical variables, we can use a box plot or histogram to understand their
distribution. For categorical variables, univariate analysis also helps us identify whether
the variables are imbalanced.
The above figure shows that most of our categorical variables are severely imbalanced, with
the majority class occupying around 90% of the distribution. However, the variables created
as part of feature engineering are well distributed compared to our existing categorical
variables. We can see the distribution of those in the below figure.
Further, the count plot of our target variable makes it clear that the data is severely
imbalanced with only 6.2 % of minority class labels.
We used Distribution plots to analyze the distribution of all numerical variables in our
dataset. Besides the age variable, all the other numerical variables are heavily skewed. We
can see the distribution of numerical variables in the below plot.
Further, the analysis of the age variable indicated that most of the people who are sick are
aged 50 or above. This observation led to the creation of a new variable named age_50 that
indicates whether the person's age is 50 or above. This age_50 variable is well balanced
compared to other variables in the dataset.
4.4.2 Bivariate Analysis
In this section, we try to understand the relationship between pairs of variables. This
step is essential because the correlation between two independent variables may lead to
drastic changes in the model’s performance. We can use box plots, bar plots, scatter plots,
and joint plots for this step. To analyse the relationship between a numerical and a
categorical variable, we can use a box plot. In the below figure, we can see the relationship
between each numerical variable and the target variable.
We can see that the variables age, TSH, and FTI are positively correlated with the target
variable, i.e., their values tend to be higher when the target variable is 1. Meanwhile,
the variables T3, TT4, and T4U are negatively correlated with the target variable, i.e., their
values tend to be lower when the target variable is 1.
Figure 4.10. Box plots for Bivariate Analysis – 2
In the above plot, we can see the relationship between our numerical variables and the gender
variable. The variables TT4, T4U, and T3 tend to take higher values for females, whereas the
other variables have almost identical distributions for both males and females.
Figure 4.11. Bar plots for Bivariate Analysis
In the above figure, we plotted our categorical variables to find their correlation with the
target variable. As we can see, apart from the variables gender and sick, all the other plots are
heavily skewed, highlighting the data imbalance problem in the dataset.
4.4.3 Multivariate Analysis
The multivariate analysis involves visualizing the relation between two variables in the
presence of one or more other variables. We can also build pivot tables or groups to
aggregate the data at a granular level to understand the relation between more than two
variables at a time. In this step, we will be using Swarm plots, Violin plots, Cat plots, and
Heat maps to visualize the relation between the variables.
In the above plot, we visualized the relation between our hormone variables with the target
variable in the presence of the gender variable. We can observe that there is a significant
change in values for every hormone variable for sick patients for both genders.
In this step, we perform a correlation analysis to find the correlation between our
independent variables and the target variable. We use a heat map to visualize the
correlation table.
We can see from the above figure that the variable age has the highest positive correlation
(0.16) with the target variable, and the variable T3 has the highest negative correlation (-0.19)
with the target variable. We can also see some independent variables with a strong correlation
between them, indicating multicollinearity.
Figure 4.14. Cat plot for Multivariate Analysis
In the above figure, we used a cat plot to understand the relation of four variables, age,
gender, sick, and pregnant, with the target variable. We can see that there are no patients
who are sick, pregnant, and suffering from thyroid disease at the same time. Most of the
patients suffering from thyroid disease are neither sick nor pregnant.
4.4.4 Summary
The Exploratory data analysis results revealed that the variables T3, age, and age_50
significantly correlate with the target variable. We understood the composition and
distribution of all the independent variables. We can conclude that all the original variables
are highly imbalanced. We introduced the new variables age_50, All_harmones_measured, and
TSH_Normal, which have a better data distribution, as a part of feature engineering. We also
concluded that some of the independent variables are correlated with each other, indicating
multicollinearity.
4.5 Data Preprocessing
In this section, we will apply methods such as normalizing numerical variables, encoding
categorical values, and splitting the dataset into train and test datasets to make the data ready
for modeling.
4.5.1 Encoding
Encoding is the process of converting categorical variables into numerical dummy variables
that take two values, 0 and 1, so that machine learning algorithms can use them. Only two
categorical variables need to be encoded: sex and referral_source. After creating
the dummy variables, we dropped the original variables from the dataset.
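The encoding step can be sketched with pandas' get_dummies, which creates one 0/1 column per category and removes the original columns. The category values shown for referral_source are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical rows; the referral_source categories are placeholders.
df = pd.DataFrame({
    "age": [41, 23, 70],
    "sex": ["F", "M", "F"],
    "referral_source": ["SVI", "other", "SVHC"],
})

# One dummy column per category; the original columns are dropped.
encoded = pd.get_dummies(df, columns=["sex", "referral_source"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```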
In this step, we divide our dataset into two parts, train and test data sets which are in turn
used to teach the modeling algorithm and then test it. We divide the dataset in 75:25 ratio for
train and test datasets.
4.5.3 Normalization
Normalization rescales numerical variables to a common scale to help the machine learning
algorithms converge faster. We use a standard scaler for this process, which rescales each
variable to zero mean and unit variance rather than to a fixed [0,1] range. We fit the scaler
on the train dataset and transform it, but we only use the transform function for the test
dataset so that no information from the test data leaks into the model.
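The split and scaling steps can be sketched with scikit-learn as shown below. The synthetic data, the stratified split, and the random seed are assumptions made for the example; the 75:25 ratio and the fit-on-train, transform-on-test pattern follow the description above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 numerical features, 10% positives.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = np.array([1] * 20 + [0] * 180)

# 75:25 split; stratify keeps the class ratio similar in both parts (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```

The train columns end up with mean 0 exactly, while the test columns are only close to 0, because the test data is scaled with statistics it did not contribute to.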
Figure 4.15. Heat map for Dummy Variables
Before moving onto modeling, the above figure is a heat map to visualize the correlation
between the new variables created during Encoding and the target variable. We can see that
among these new variables, refered_by_svm has a 0.3 correlation with the target variable,
which is the highest in the whole dataset. This shows the importance of encoding categorical
variables before starting modeling.
4.6 Modeling
In this section, we started off by making four individual models namely Logistic Regression,
Decision Tree, Random Forest, and Naïve Bayes. For Logistic Regression and Naïve Bayes
models, we have used Cross Validation with 5-folds. We fitted the train data and generated
predictions for both train and test datasets. Then, we designed a custom function to calculate
the metrics such as accuracy, precision, AUC ROC Curve, Sensitivity, Specificity and F1
score.
For Decision Tree and Random Forest models, we used Grid Search to handle
hyperparameter tuning along with the cross-validation method. After running all the possible
combinations, we filtered out the best model and fitted the data. Then, we generated
predictions for both train and test sets and sent them to a custom function to calculate the
metrics. In the below table, we can refer to the hyperparameters used in the model.
n_estimators - [2,4,5,6,8]
Table 4.2: Hyperparameter Tuning - I
After reviewing the performance of each model and studying the effect of data imbalance, we
used seven different sampling techniques, Under Sampling, Over Sampling, ADASYN,
SMOTE, SVM SMOTE, Borderline SMOTE, and KNN SMOTE, to create seven different
datasets that are well equipped to fight data imbalance. After creating the datasets, we have
visualized the distribution of each of those datasets using bar plots.
We also tested out the Boosting and Bagging algorithms such as AdaBoost, XGBoost,
Gradient Boosting, Bagging individually. We used Grid Search to tune the hyperparameters
and the Cross-validation technique to try out all possible combinations. After running the
models, we calculated the metrics and stored them in a separate data frame.
Bagging [1,5,10,20,25,30] -
Table 4.3: Hyperparameter Tuning – II
CHAPTER 5
RESULTS AND DISCUSSION
5.1 Introduction
This section discusses the evaluation and the results of all the trained models on both train
and test data sets. These models are evaluated with the help of an array of metrics suited for
the problems caused by data imbalance. We will provide a thorough comparison of the
efficiencies of the models for various industrial use cases.
Logistic Regression
First, we will check the results of individual Logistic regression model and then the results of
combinations of sampling techniques and logistic regression model.
From the above table, we can see that the train and test dataset's results do not vary
significantly, indicating that the model is not overfitting. However, all the metrics other
than accuracy have significantly lower values, indicating that the model can benefit further
from the sampling techniques.
Under_Sampling     Train  85.00%  90.00%  90.00%  86.00%  12.00%   8.00%
Under_Sampling     Test   83.00%  86.00%  86.00%  81.00%  20.00%   8.00%
Over_Sampling      Train  89.00%  89.00%  90.00%  89.00%  10.00%  11.00%
Over_Sampling      Test   90.00%  91.00%  91.00%  91.00%   8.00%  10.00%
ADASYN             Train  95.00%  94.00%  94.00%  94.00%   6.00%   6.00%
ADASYN             Test   94.00%  94.00%  94.00%  94.00%   6.00%   6.00%
SMOTE              Train  94.00%  94.00%  94.00%  94.00%   6.00%   6.00%
SMOTE              Test   94.00%  94.00%  94.00%  94.00%   5.00%   6.00%
SVM_SMOTE          Train  94.00%  94.00%  94.00%  94.00%   5.00%   6.00%
SVM_SMOTE          Test   94.00%  95.00%  94.00%  94.00%   5.00%   6.00%
Borderlevel_SMOTE  Train  94.00%  93.00%  93.00%  93.00%   8.00%   6.00%
Borderlevel_SMOTE  Test   94.00%  94.00%  94.00%  93.00%   7.00%   5.00%
KNN_SMOTE          Train  96.00%  96.00%  96.00%  96.00%   3.00%   4.00%
KNN_SMOTE          Test   95.00%  96.00%  96.00%  95.00%   4.00%   5.00%
From the above table, we can see that all the sampling techniques have increased the
efficiency of our Logistic regression model. The KNN_SMOTE technique outperformed all other
techniques in almost all the metrics. However, the metrics FPR and FNR were better on
the train dataset than on the test dataset. The original Logistic regression model has a
better False Negative Rate value when compared to all the sampling techniques. Overall, we
can conclude that sampling techniques helped increase the efficiency of the logistic
regression model on most of the metrics.
Decision Tree
First, we will check the results of individual Decision Tree model and then the results of
combinations of sampling techniques and decision tree model.
We can see that there is a significant change in values for almost all the metrics from train
dataset to test dataset. This indicates that the decision tree is overfitting. Now, we will see the
results of combinations of decision tree with the sampling techniques.
KNN_SMOTE Test 97.00% 98.00% 98.00% 1.00% 3.00% 98.00%
From the above table, we can see that all the sampling techniques have improved the
efficiency of our decision tree model, but the SVM SMOTE has produced consistent results
in all the metrics for both train and test datasets.
Random Forest
First, we will check the results of individual Random Forest model and then the results of
combinations of sampling techniques and Random Forest model.
From the above table, we can understand the effect of data imbalance on the efficiency of our
random forest model. Also, there is a significant change in metrics from train dataset to test
indicating overfitting of the model. Now, we will see the results of combinations of random
forest model with the sampling techniques.
SMOTE             Train  98.00%  98.00%  98.00%  98.00%  1.00%  2.00%
SMOTE             Test   97.00%  98.00%  98.00%  97.00%  2.00%  3.00%
SVM_SMOTE         Train  98.00%  98.00%  98.00%  98.00%  1.00%  2.00%
SVM_SMOTE         Test   97.00%  98.00%  98.00%  97.00%  2.00%  3.00%
Borderline_SMOTE  Train  98.00%  99.00%  98.00%  98.00%  1.00%  2.00%
Borderline_SMOTE  Test   98.00%  98.00%  98.00%  98.00%  1.00%  2.00%
KNN_SMOTE         Train  99.00%  99.00%  99.00%  99.00%  1.00%  1.00%
KNN_SMOTE         Test   98.00%  98.00%  98.00%  98.00%  1.00%  2.00%
Table 5.6. Random Forest with Sampling Techniques Results
From the above table, we can see that all the sampling techniques have improved the
efficiency of our random forest model, but the under-sampling method failed to solve the
problem of overfitting. KNN SMOTE has produced consistent results in all the metrics for
both train and test datasets.
Naive Bayes
First, we will check the results of individual Naïve Bayes classifier and then the results of
combinations of sampling techniques and naïve bayes classifier.
This poor performance of the Naive Bayes classifier is due to the zero probability error. This
error occurs when a combination of an independent variable's value and a target class never
appears in the training data, so the estimated conditional probability for that combination
is zero. Now, we will see the results of combinations of the naive Bayes classifier with
the sampling techniques.
Sampling Technique  Data Set  Precision  AUC_ROC  BA      F1 - score  FPR    FNR
Under_Sampling      Train     43.00%     66.00%   66.00%  60.00%      1.00%  68.00%
Under_Sampling      Test      37.00%     60.00%   60.00%  53.00%      8.00%  73.00%
Over_Sampling       Train     56.00%     60.00%   60.00%  71.00%      2.00%  78.00%
Over_Sampling       Test      56.00%     61.00%   61.00%  71.00%      2.00%  76.00%
ADASYN              Train     65.00%     72.00%   72.00%  78.00%      2.00%  54.00%
ADASYN              Test      63.00%     71.00%   71.00%  77.00%      2.00%  56.00%
SMOTE               Train     64.00%     72.00%   72.00%  77.00%      2.00%  54.00%
SMOTE               Test      65.00%     72.00%   72.00%  78.00%      2.00%  53.00%
SVM_SMOTE           Train     64.00%     70.00%   70.00%  77.00%      3.00%  56.00%
SVM_SMOTE           Test      63.00%     70.00%   70.00%  76.00%      3.00%  56.00%
Borderline_SMOTE    Train     64.00%     71.00%   72.00%  77.00%      2.00%  55.00%
Borderline_SMOTE    Test      64.00%     72.00%   72.00%  77.00%      2.00%  55.00%
KNN_SMOTE           Train     66.00%     73.00%   74.00%  79.00%      1.00%  52.00%
KNN_SMOTE           Test      65.00%     73.00%   73.00%  78.00%      1.00%  53.00%
Table 5.8. Naïve Bayes Classifier with Sampling Techniques Results
Sampling techniques helped increase the efficiency of the Naive Bayes classifier, but the
results are still significantly lower than those of logistic regression, decision tree, and
random forest.
5.2.2 Bagging and Boosting Algorithms
Bagging
First, we will check the results of individual Bagging algorithm and then the results of
combinations of sampling techniques and bagging.
We can see from the above table that the bagging model is clearly overfitting. Now, we will
see the results of combinations of bagging with the sampling techniques. We will try to find
the combination that can reduce overfitting.
From the above table, we can see that all the sampling techniques except under-sampling
have helped the bagging model reduce overfitting.
AdaBoost
First, we will check the results of the individual AdaBoost algorithm and then the results of
combinations of sampling techniques and AdaBoost.
Dataset             Train     Test
Accuracy            97.00%    97.00%
Precision           80.00%    70.00%
AUC_ROC Curve       85.00%    84.00%
Sensitivity         72.00%    70.00%
Specificity         99.00%    98.00%
Balanced Accuracy   86.00%    84.00%
F1 - score          76.00%    70.00%
FPR                 28.00%    30.00%
FNR                 1.00%     2.00%
Table 5.11: AdaBoost Algorithm Results
We can see from the above table that the Adaboost model is clearly overfitting. Now, we will
see the results of combinations of Adaboost with the sampling techniques. We will try to find
the combination that can reduce overfitting.
From the above table, we can see that all the sampling techniques have helped the AdaBoost
model reduce overfitting. The sampling technique KNN_SMOTE stands out with the lowest
False Positive and False Negative Rates.
Gradient Boosting
First, we will check the results of the individual Gradient Boosting algorithm and then the
results of combinations of sampling techniques and gradient boosting.
Dataset             Train     Test
Accuracy            99.00%    98.00%
Precision           94.00%    84.00%
AUC_ROC Curve       93.00%    90.00%
Sensitivity         85.00%    80.00%
Specificity         100.00%   99.00%
Balanced Accuracy   92.00%    90.00%
F1 - score          89.00%    82.00%
FPR                 15.00%    20.00%
FNR                 0.00%     1.00%
Table 5.13: Gradient Boosting Results
From the above table, we can see signs of overfitting, and the False Positive Rate is high for
both the train and test datasets (15% and 20%, respectively). Now, we will see the results of
combinations of Gradient Boosting with the sampling techniques.
From the above table, we can see that all the sampling techniques have helped the Gradient
Boosting model reduce the False Positive Rate. SMOTE and its variants have produced similar
results with the gradient boosting algorithm.
XGBoost
First, we will check the results of individual Extreme Gradient Boosting algorithm and then
the results of combinations of sampling techniques and XGBoost.
Dataset             Train     Test
Accuracy            98.00%    98.00%
Precision           92.00%    83.00%
AUC_ROC Curve       92.00%    90.00%
Sensitivity         85.00%    82.00%
Specificity         99.00%    99.00%
Balanced Accuracy   92.00%    90.00%
F1 - score          88.00%    82.00%
FPR                 15.00%    18.00%
FNR                 1.00%     1.00%
Table 5.15: XGBoost Results
From the above table, we can conclude that the model is overfitting, and metrics such as
sensitivity, F1 Score, and False Positive Rate fall short of the desired level. Now, we will
see the results of combinations of XGBoost with the sampling techniques to overcome these
disadvantages.
All the sampling techniques except under-sampling have achieved state-of-the-art results in
combination with the XGBoost model.
We have compared the models to check which combinations achieved state-of-the-art results. In
real-world use cases, models are evaluated according to the conditions in which they are used.
For example, in cancer diagnosis, identifying the rare positive cases is more important than
the usual accuracy metric. Other use cases, such as fraud detection and fault detection,
emphasize reducing false alarms because false alarms cause more damage there. So, in this
section, we rank the combinations of all the models discussed under different metrics. We
consider the metrics precision, AUC ROC curve, F1 score, and False Positive Rate from now on,
and present the top five combinations for each of these metrics.
Precision
F1 – Score
Figure 5.3. Top 5 combinations for F1 – Score
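The top-five selection used in this section can be sketched as a simple sort over the result rows. As an assumption for illustration, the sketch uses only the Naive Bayes test rows from Table 5.8, because those are the only per-combination values reproduced in the text; the function name `top_n` is hypothetical, and FPR is used exactly as it is labelled in that table.

```python
# Test-set rows for Naive Bayes copied from Table 5.8 (values in percent).
rows = [
    {"combination": "Naive Bayes + Under_Sampling", "F1": 53.0, "FPR": 8.0},
    {"combination": "Naive Bayes + Over_Sampling", "F1": 71.0, "FPR": 2.0},
    {"combination": "Naive Bayes + ADASYN", "F1": 77.0, "FPR": 2.0},
    {"combination": "Naive Bayes + SMOTE", "F1": 78.0, "FPR": 2.0},
    {"combination": "Naive Bayes + SVM_SMOTE", "F1": 76.0, "FPR": 3.0},
    {"combination": "Naive Bayes + Borderline_SMOTE", "F1": 77.0, "FPR": 2.0},
    {"combination": "Naive Bayes + KNN_SMOTE", "F1": 78.0, "FPR": 1.0},
]


def top_n(rows, metric, n=5, higher_is_better=True):
    """Return the n best rows for one metric; ties keep their table order."""
    ranked = sorted(rows, key=lambda row: row[metric], reverse=higher_is_better)
    return ranked[:n]


# F1 is better when higher, FPR is better when lower.
top_f1 = [row["combination"] for row in top_n(rows, "F1")]
top_fpr = [row["combination"] for row in top_n(rows, "FPR", higher_is_better=False)]

print(top_f1)
print(top_fpr)
```

The same helper would apply to precision and AUC ROC once the rows for all model combinations are collected in one list.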
5.4 Summary
The last section considered the models' results on the test datasets only. We picked the top
five combinations for metrics such as precision, AUC ROC Curve, F1 Score, False Positive Rate,
and False Negative Rate. We observed that the combinations (Random Forest + SVM SMOTE +
AdaBoost) and (Random Forest + KNN SMOTE + AdaBoost) performed consistently across all the
metrics. Further, we can conclude that combining the Random Forest classification algorithm
with AdaBoost and the SMOTE family of sampling techniques achieved the best results.
CHAPTER 6
CONCLUSION
6.1 Conclusion
In the modeling process, we started by building four individual models, Logistic Regression,
Decision Tree, Random Forest, and Naive Bayes Classifier, with their respective
hyperparameters. We evaluated the models using a custom-designed function that calculates
precision, AUC ROC Curve, F1-score, sensitivity, and specificity for each model. After
evaluating the models, we understood the effect of data imbalance on each of them. We
created seven different datasets using seven different sampling techniques: under-sampling,
oversampling, ADASYN, SMOTE, SVM SMOTE, Borderline SMOTE, and KNN SMOTE.
Then, we tried out Bagging, AdaBoost, Gradient Boosting, and XGBoost models. We tried
out all the combinations of classification models with sampling techniques, Bagging, and
Boosting algorithms.
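The custom evaluation function is not reproduced in this report, so the following is only a minimal sketch of what such a function could look like. The name `evaluate`, the toy labels, and the use of the standard confusion-matrix definitions (FPR = FP / (FP + TN), FNR = FN / (FN + TP)) are assumptions; AUC ROC is left out because it needs predicted scores rather than hard labels.

```python
def evaluate(y_true, y_pred):
    """Return confusion-matrix metrics for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)  # also called recall
    specificity = ratio(tn, tn + fp)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
    }


# Hypothetical labels: 3 positives and 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Calling the same function on the train and test predictions of each combination gives the pairs of rows that the result tables compare.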
Since accuracy is not a useful metric under data imbalance, we picked out the top five
combinations of models for metrics such as Precision, AUC ROC Curve, F1 score, and False
Positive Rate. We concluded that the combination of the Random Forest algorithm with
SMOTE family sampling techniques and the AdaBoost algorithm produced state-of-the-art
results.
6.2 Future Scope for Improvement
Below, we discuss how this research on handling data imbalance can be taken forward.
We worked with machine learning algorithms in modeling, and we can extend this by
using deep learning algorithms for classification. We can further work on building
pipelines consisting of the techniques mentioned above.
REFERENCES
Afzal, S., Maqsood, M., Nazir, F., Khan, U., Aadil, F., Awan, K.M., Mehmood, I. and Song, O.Y.,
(2019) A Data Augmentation-Based Framework to Handle Class Imbalance Problem for Alzheimer’s
Stage Detection. IEEE Access, 7, pp.115528–115539.
Chakravarthy, A.D., Bonthu, S., Chen, Z. and Zhu, Q., (2019) Predictive models with
resampling: A comparative study of machine learning algorithms and their performances on
handling imbalanced datasets. In: Proceedings - 18th IEEE International Conference on
Machine Learning and Applications, ICMLA 2019. Institute of Electrical and Electronics
Engineers Inc., pp.1492–1495.
Cheng, Q., Varshney, P.K. and Arora, M.K., (2006) Logistic regression for feature selection
and soft classification of remote sensing data. IEEE Geoscience and Remote Sensing Letters,
3(4), pp.491–494.
IEEE Circuits and Systems Society and Institute of Electrical and Electronics Engineers.,
(n.d.) 2019 IEEE International Symposium on Circuits and Systems (ISCAS) : proceedings :
ISCAS 2019 : Sapporo, Japan, May 26-29 2019.
IEEE Communications Society and Institute of Electrical and Electronics Engineers, (n.d.)
2017 International Conference on Advances in Computing, Communications and Informatics
(ICACCI) : 13-16 Sept. 2017.
IEEE Systems, M. and Institute of Electrical and Electronics Engineers, (n.d.) ICNSC 2018 :
the 15th IEEE International Conference on Networking, Sensing and Control : March 27-29,
2018, Zhuhai, China.
Institute of Electrical and Electronics Engineers and Manav Rachna International Institute of
Research and Studies, (n.d.) Proceedings of the International Conference on Machine
Learning, Big Data, Cloud and Parallel Computing : trends, prespectives and prospects :
COMITCON-2019 : 14th-16th February, 2019.
Ishaq, A., Sadiq, S., Umer, M., Ullah, S., Mirjalili, S., Rupapara, V. and Nappi, M., (2021a)
Improving the Prediction of Heart Failure Patients’ Survival Using SMOTE and Effective
Data Mining Techniques. IEEE Access, 9, pp.39707–39716.
Ogunleye, A. and Wang, Q.G., (2020) XGBoost Model for Chronic Kidney Disease
Diagnosis. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 17(6),
pp.2131–2140.
Shaw, R.N., Walde, P., Galgotias University, Institute of Electrical and Electronics Engineers
and IEEE Industry Applications Society, (n.d.) 2019 International Conference on Computing,
Power and Communication Technologies (GUCON) : Galgotias University, Greater Noida,
UP, India, Sep 27-28, 2019.
Liu, X., Dai, Y., Zhang, Y., Yuan, Q. and Zhao, L., (2017) A preprocessing method of
AdaBoost for mislabeled data classification. In: Proceedings of the 29th Chinese Control and
Decision Conference, CCDC 2017. Institute of Electrical and Electronics Engineers Inc.,
pp.2738–2742.
Makki, S., Assaghir, Z., Taher, Y., Haque, R., Hacid, M.S. and Zeineddine, H., (2019) An
Experimental Study With Imbalanced Classification Approaches for Credit Card Fraud
Detection. IEEE Access, 7, pp.93010–93022.
Ishaq, A., Sadiq, S., Umer, M., Ullah, S., Mirjalili, S., Rupapara, V. and Nappi, M., (2021b)
Improving the Prediction of Heart Failure Patients’ Survival Using SMOTE and Effective
Data Mining Techniques. IEEE Access, 9, pp.39707–39716.
Pes, B., (2019) Handling Class Imbalance in High-Dimensional Biomedical Datasets. In:
Proceedings - 2019 IEEE 28th International Conference on Enabling Technologies:
Infrastructure for Collaborative Enterprises, WETICE 2019. Institute of Electrical and
Electronics Engineers Inc., pp.150–155.
Sharma, S., Bellinger, C., Krawczyk, B., Zaiane, O. and Japkowicz, N., (2018) Synthetic
Oversampling with the Majority Class: A New Perspective on Handling Extreme Imbalance.
In: Proceedings - IEEE International Conference on Data Mining, ICDM. Institute of
Electrical and Electronics Engineers Inc., pp.447–456.
Yan, Y., Liu, R., Ding, Z., Du, X., Chen, J. and Zhang, Y., (2019) A parameter-free cleaning
method for SMOTE in imbalanced classification. IEEE Access, 7, pp.23537–23548.
APPENDIX B: RESEARCH PROPOSAL
Data Imbalance handling techniques: a comprehensive study
STUDENT ID - 980445
Research Proposal
Liverpool John Moores University – Master’s in Data Science July 2021
Abstract
Machine learning has contributed a lot to the scientific world in recent years. As technology
evolves, the amount of data generated is also growing exponentially. This growth also caused
a problem known as Data Imbalance. It occurs when the distribution of one class dominates
the other. Imbalanced data can cause severe problems because machine learning algorithms
work best with equally distributed data. This problem disables metrics such as Accuracy,
which we commonly use in the case of classification. We may end up having more than 95%
accuracy without classifying any of the minority class instances correctly. Currently, we have
some techniques to handle imbalanced data, such as assigning class weights, sampling, AdaBoost,
and XGBoost. Further, sampling has two divisions known as under-sampling and oversampling. In
this research, we will be using a thyroid disease dataset that is available on Kaggle and
contains around 3700 instances. This dataset is severely imbalanced, with only
6% of cases classified as positive for thyroid disease. We will compare the methods
mentioned above to handle data imbalance and improve the performance of various machine
learning algorithms such as Logistic regression, Decision trees, Random forests, and Naïve
Bayes. These combinations will be evaluated with the help of metrics such as the AUC-ROC
curve, Specificity, Precision, Recall, and F1-score. By identifying the best combination,
we can mitigate the effect caused by data imbalance.
List of Figures
Figure 1: Process Flow Diagram
Figure 2: Research Plan
List of Tables
Table 1: List of Abbreviations
Table 2: Data set details
List of Abbreviations
Abbreviation Expansion
T3 Triiodothyronine
1. Background
prevalent in use cases such as disease classification, churn prediction, predicting natural
disasters, etc. To overcome this problem, we can adopt methods like assigning class
weights, sampling, and boosting. Sampling means drawing a sample from the population.
There are two types of sampling techniques that balance the data, either by deleting majority
class examples or by adding minority class examples. There are also hybrid sampling techniques
which use both sampling methods. SMOTE is another sampling technique; it creates synthetic
minority examples by interpolating between a minority example and its nearest minority class
neighbours. There are many variants of SMOTE, such as SVM SMOTE, Safe level SMOTE, and
KMEANS SMOTE. We can also use boosting methods such as AdaBoost, Gradient Boosting, XGBoost,
and CatBoost for handling data imbalance. Boosting is an ensemble method that trains a
sequence of weak learners and combines their predictions to create a strong learner. These
algorithms can reduce the bias of the model, reducing incorrect classifications.
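Of the methods listed above, assigning class weights is the simplest to write down. The sketch below uses a common inverse-frequency heuristic, weight = n_samples / (n_classes * class_count); this particular formula and the function name `balanced_class_weights` are assumptions for illustration and are not stated in this proposal. The label counts approximate the roughly 3700 instances and 6% positive rate mentioned for the thyroid data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Approximate class counts: 6% of 3700 instances are positive (1).
labels = [1] * 222 + [0] * 3478
weights = balanced_class_weights(labels)

# Each class contributes the same total weight after reweighting.
total_positive = weights[1] * 222
total_negative = weights[0] * 3478
print(weights, total_positive, total_negative)
```

With these weights, an error on a positive case costs about sixteen times as much as an error on a negative case, which offsets the imbalance without changing the data itself.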
This research will deal with thyroid disease data publicly available on the Kaggle and UCI
repositories. It consists of 3700 instances and 30 attributes. This dataset contains around 94%
negative cases, indicating severe data imbalance. First, we perform data cleaning, feature
engineering, and exploratory data analysis. Then, we deploy multiple machine learning
algorithms such as Logistic Regression, Decision Tree, Random Forest, and Naive Bayes to
see the effect of data imbalance on the model's performance. Then, we will try different
combinations of sampling techniques and boosting algorithms to enhance the performance of
the models mentioned above. Since accuracy is not helpful, we will use other metrics like
Precision, AUC-ROC curve, Recall, and F1-score to evaluate the model. We can derive these
metrics from the confusion matrix. We will compare all the possible combinations and
analyse the advantages and disadvantages to suggest an application that can help mitigate the
effect of data imbalance.
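The reason accuracy is not helpful here can be shown with a short worked example. It assumes exactly 3700 instances with 6% positives, which approximates the figures quoted above, and a hypothetical baseline classifier that predicts the negative class for every instance.

```python
n_total = 3700
n_positive = round(0.06 * n_total)  # 222 positive cases
n_negative = n_total - n_positive   # 3478 negative cases

y_true = [1] * n_positive + [0] * n_negative
y_pred = [0] * n_total  # baseline: always predict the majority class

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

accuracy = correct / n_total
recall = true_positives / n_positive

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The baseline reaches 94% accuracy while finding none of the positive cases, which is why recall, precision, and F1-score are needed alongside accuracy.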
2. Related Works
2.1 Introduction
Classification algorithms such as Logistic Regression, Decision Trees, and Random Forests
often run into data problems such as data imbalance. It has been a consistent problem for
years since modern-day applications of classification algorithms include dealing with highly
imbalanced data. A plethora of research has been carried out over the years to study and
mitigate the effect of data imbalance. We will discuss a few related to our research in this
section. Data imbalance occurs when the distribution of one class dominates the other.
Imbalanced data can cause severe problems because machine learning algorithms work best
with equally distributed data. In (Shakeel et al.), it is defined that a data set is said to be
imbalanced when the number of instances of majority class outnumbers the number of
instances of minority class by a large proportion. Further, in (IEEE Communications Society
and Institute of Electrical and Electronics Engineers.), it is mentioned that the imbalanced
distribution of classes in datasets appears when one class has a higher ratio than the other
class. Over the years, researchers have tried to use different machine learning concepts to
overcome the problem of data imbalance. For example, (Ebenuwa et al., 2019a) proposed to
employ feature selection using variance ranking techniques to handle data imbalance.
(Jawaharlal Nehru Engineering College. Department of Computer Science & Engineering et
al.) deployed the random forest algorithm to counter data imbalance. Further, in (IEEE
Computational Intelligence Society et al., 2018), it was proposed to use sampling techniques
to handle extreme data imbalance in a streaming environment. Among these approaches, sampling
and boosting have stood out for producing state-of-the-art results. This research will deploy
combinations of several sampling techniques, classification algorithms, and boosting
methodologies. Next, we take a closer look at the research done in the aforementioned
areas.
Classification algorithms are supervised machine learning models that assign data points to
discrete classes based on patterns learned from labelled training data. Examples of
classification algorithms are Logistic Regression, Decision Trees, Random Forests, Naive Bayes,
and Support Vector Machines. This section will discuss a few pieces of research that explored
these classification algorithms in detail. Logistic regression is one of the earliest and most
widely used classification algorithms. It classifies a data point by modelling the log odds of
the positive class as a linear function of the input features. In (Cheng et al., 2006), it is
mentioned that a logistic regression (LR) model may be used to predict the probabilities of
the classes based on the input features after ranking them according to their relative
importance. In other words, feature selection can be carried out explicitly with logistic
regression. They also concluded that the logistic regression model can reduce the number of
features substantially while still achieving a high classification accuracy. Decision trees
are a supervised machine learning model inspired by the human thinking process.
They are among the most robust algorithms and are suitable for both classification and
regression. In (Ishaq et al., 2021a), it is mentioned that decision trees are used to create a
tree-like structure which classifies the data based on some key attributes.
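The log-odds formulation of logistic regression mentioned above can be sketched in a few lines. The weights, bias, and feature values are hypothetical and chosen only to make the arithmetic easy to follow; they are not fitted to any dataset in this research.

```python
import math


def predict_proba(features, weights, bias):
    """Map the linear log odds to a probability with the logistic function."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


def predict(features, weights, bias, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical model: log odds = -0.5 + 0.8 * x1 - 1.2 * x2
weights = [0.8, -1.2]
bias = -0.5
example = [2.0, 0.5]  # log odds are about 0.5, so the probability exceeds 0.5

probability = predict_proba(example, weights, bias)
label = predict(example, weights, bias)
print(probability, label)
```

A log-odds value of zero corresponds to a probability of 0.5, so the sign of the linear term decides the class at the default threshold. On imbalanced data the threshold itself is one more setting that can be adjusted.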
However, in many cases a single decision tree is not sufficient, which led to the development
of ensemble algorithms. (IEEE Systems and Institute of Electrical and Electronics Engineers.)
mentioned that Random Forest is a combination of several decision trees in which each tree
depends on an independent dataset, and all of them follow the same distribution. Even though
Random Forest produces better results than a single decision tree, it is not well suited to
handling data imbalance, especially when working with a large dataset. Naïve Bayes is a
classification algorithm derived from Bayes' theorem, and it operates under the assumption
that predictors are independent of each other given the class. In (Misra et al.), it is
mentioned that Naïve Bayes chooses the class with the highest probability. It estimates
unknown probabilities from known values, which means that prior knowledge and logic are
applied to the decision-making process. Support Vector Machines (SVM) are supervised machine
learning algorithms that can be used for classification, regression, and outlier analysis. In
(Institute of Electrical and Electronics Engineers.), they are defined as an assorted research
algorithm that helps in performing the analysis in a precise way.
The Review of Random Forest Classification Techniques to Resolve Data Imbalance research
(Jawaharlal Nehru Engineering College. Department of Computer Science & Engineering et al., n.d.)
tested the performance of random forest variants such as Balanced random forest and Weight
sensitive random forest, and of other classification techniques such as Naive Bayes, AdaBoost,
and Support Vector Machine, on an imbalanced dataset. They used under-sampling and
oversampling techniques to handle data imbalance and noise in the data. They concluded that
even though random forest variants took more time than others,
they effectively handled data imbalance problems compared to other classification
algorithms. Further, the Interactive Thyroid Disease Prediction System Using Machine
Learning Technique research (Institute of Electrical and Electronics Engineers.) compares
different algorithms such as Decision Trees, K-means clustering, SVM, and Neural
Networks on thyroid disease identification problems. They used the mean absolute error to
evaluate each model. They concluded that there is a need to develop predictive models that
require only a minimum number of parameters from a person to diagnose thyroid
disease, which can save money and time.
In The Prediction of Thyroid Disease Using Data Mining Techniques (Institute of Electrical
and Electronics Engineers. Madras Section and Institute of Electrical and Electronics
Engineers.), data mining algorithms such as SVM, ID3, and Naïve Bayes were used to
identify the correlation between different thyroid hormones and thyroid diseases. They
mentioned that the models are evaluated based on their speed, accuracy, performance, and
treatment cost. Finally, they concluded that all these techniques helped minimize the noise
in the data.
The Variance Ranking Attributes Selection Techniques for Binary Classification Problem in
Imbalance Data research (Ebenuwa et al., 2019) used three different classification
algorithms, Logistic regression, Naive Bayes, and Decision tree, on three different datasets.
They compared various attribute selection methods such as Variance ranking, Pearson C, and
Information gain, and used metrics such as precision, recall, F-measure, and ROC area to
evaluate the models. They compared the results of the aforementioned models with and without
attribute selection techniques and concluded that attribute selection could play a vital role
in handling data imbalance. Further, they concluded that no single technique could be used
for all types of datasets. Depending on some intrinsic properties of the data items, each
known technique must be used on the correct dataset.
In this section, we will discuss a few pieces of research on handling data imbalance. We will
focus on the research that deals with various sampling techniques such as oversampling,
under-sampling, SMOTE and its variants, and ROSE, and with various boosting algorithms such as
Gradient Boosting, AdaBoost, and XGBoost.
2.3.1 Sampling Techniques
Sampling is a statistical technique that helps us find a subset of data known as a sample with
all the properties of the complete data known as a population. Sampling helps us analyze the
population's characteristics without going through every data point. This technique is
beneficial in handling data imbalance problems. There are two divisions of sampling,
oversampling and under-sampling. Under-sampling refers to downsizing the majority class
to balance the data. In (IEEE Circuits and Systems Society and Institute of Electrical and
Electronics Engineers.), it is mentioned that under-sampling becomes ineffective as it leads
to significant data loss.
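The two divisions of sampling can be sketched with the standard library alone. This is a minimal illustration of random under-sampling and random oversampling, assuming each class is held in a plain list; the function names and the toy class sizes (94 majority and 6 minority examples) are hypothetical.

```python
import random


def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority class the size of the minority."""
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, rng):
    """Duplicate random minority examples until both classes are equal in size."""
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(majority), list(minority) + extra


rng = random.Random(0)
majority = list(range(94))        # hypothetical majority examples
minority = list(range(100, 106))  # hypothetical minority examples

under_major, under_minor = random_undersample(majority, minority, rng)
over_major, over_minor = random_oversample(majority, minority, rng)

# Under-sampling discards 88 majority examples; oversampling adds 88 duplicates.
print(len(under_major), len(under_minor), len(over_major), len(over_minor))
```

The sketch makes the trade-off visible: under-sampling throws away most of the majority class, while oversampling keeps everything but repeats the same few minority examples many times.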
Oversampling tries to balance the data distribution by duplicating minority class examples.
Since we are duplicating existing examples, it will not lead to data loss. The Handling class
imbalance in high-dimensional biomedical datasets research (Pes, 2019) combines feature
selection with sampling-based class balancing strategies. They worked on a highly
imbalanced genomic data set which is available in the Gene Expression Machine Learning
Repository. They used the Random Forest model as the classification algorithm. They
concluded that the model's performance in terms of false positives improved when they used
the combination of sampling and feature engineering methods. However, (Mathew et al.,
2018) clearly highlighted that random oversampling could lead to overfitting on the
minority class. This problem gave rise to a new form of oversampling known as
informed oversampling. Informed oversampling methods generate synthetic
minority examples to balance the data distribution. Because these methods add new data rather
than deleting or duplicating existing data, they avoid data loss and reduce the risk of
overfitting. Examples of these methods are SMOTE, ROSE, and their variants.
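To show how an informed oversampling method generates synthetic minority examples, the sketch below follows the basic SMOTE idea of interpolating between a minority example and one of its nearest minority class neighbours. It is a simplified illustration, not the implementation used in any of the reviewed studies; the function name `smote_sample` and the toy points are hypothetical.

```python
import math
import random


def smote_sample(minority, k, n_new, rng):
    """Create n_new synthetic points between minority points and their neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


rng = random.Random(42)
# Hypothetical minority class points with two features.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
synthetic = smote_sample(minority, k=2, n_new=5, rng=rng)
print(synthetic)
```

Every synthetic point lies on a line segment between two real minority points, so the new examples stay inside the region the minority class already occupies instead of repeating existing points exactly.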
have used a Constructive covering algorithm for data cleaning and identifying key minority
samples. They evaluated the performance of SMOTE and its variants using metrics such as
Precision and F-score. They concluded that the combination of SMOTE and the Constructive
covering algorithm performs better than the other algorithms compared. The survey A Review on
Handling Imbalanced Data (SVS College of Engineering and Institute of Electrical and
Electronics Engineers, n.d.) discussed in detail the characteristics of data imbalance, such
as small disjuncts, overlapping, and data shifts, and the methods needed to handle it.
It compared different under-sampling, oversampling, and hybrid sampling
techniques combined with different classification algorithms such as Decision Tree, Random
Forests, SVM, and Ensemble methods, as published in other research. It concluded that in
most cases, the SMOTE method succeeded in handling data imbalance effectively.
Many variants of SMOTE came into existence, such as SVM SMOTE, Borderline SMOTE,
and Safe level SMOTE. The Handling Class Imbalance Problem using Oversampling
Techniques: Review research (IEEE Communications Society and Institute of Electrical and
Electronics Engineers.) explained that Borderline SMOTE identifies the instances on or near
the borderline, which are more prone to misclassification, and generates synthetic data from
them. It also explained that Safe level SMOTE assigns a safe level value and creates
synthetic instances which lie closer to the safe level values. The study used oversampling
techniques such as SMOTE, Borderline SMOTE, and Safe level SMOTE combined with
classification algorithms such as Naive Bayes, Support Vector Machine, and Nearest
neighbor classification on six different datasets. It concluded that Safe level SMOTE
outperformed all other methods based on the F-score and g-mean metrics. Further, the
Application of Balancing Techniques with Ensemble Approach for Credit Card Fraud
Detection research (Shaw et al., n.d.) used data balancing techniques such as Down-up
sampling, the SMOTE family, and ADASYN together with the Random Forest model. The research
concluded that the SVM SMOTE method combined with the random forest model performed best,
with the highest recall, precision, and F-score values.
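The way Borderline SMOTE picks the instances that are prone to misclassification can be sketched as a neighbour count. The rule used here, that a minority point is borderline when at least half but not all of its k nearest neighbours belong to the majority class, is the commonly described one and is an assumption with respect to the reviewed paper; the function name and the toy points are hypothetical.

```python
import math


def borderline_minority(minority, majority, k):
    """Return minority points whose k nearest neighbours are mostly majority."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    danger = []
    for point in minority:
        others = [(q, label) for q, label in labelled if q is not point]
        others.sort(key=lambda item: math.dist(point, item[0]))
        n_majority = sum(1 for _, label in others[:k] if label == 0)
        # All-majority neighbourhoods are treated as noise, not as borderline.
        if k / 2 <= n_majority < k:
            danger.append(point)
    return danger


# Hypothetical points: a safe minority cluster and two minority points
# sitting next to the majority class.
minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.3), (2.0, 2.0), (1.9, 1.9)]
majority = [(2.1, 2.0), (2.0, 2.2), (2.4, 1.9), (3.0, 3.0), (3.2, 3.1)]

danger = borderline_minority(minority, majority, k=3)
print(danger)
```

Only the points returned by this step would then be used as starting points for synthetic interpolation, which concentrates the new examples along the class boundary.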
Random Over Sampling Examples (ROSE) is another bootstrap-based technique that
generates balanced synthetic data and can handle categorical and continuous data. The
Predictive Models with Resampling: A Comparative Study of Machine Learning Algorithms
and their Performances on Handling Imbalanced Datasets research (Chakravarthy et al., 2019)
compared the sampling techniques SMOTE and ROSE on five different classification algorithms:
Cart 5.0, Random Forest, Neural nets, K means, and Support Vector Machine. They used two
highly imbalanced datasets, Open corrosion and Colorectal. They concluded that the ROSE
technique delivered more consistent results than SMOTE.
Boosting algorithms are a type of ensemble algorithm that combines the predictive power of
multiple weak learners to create one strong ensemble with low bias. Examples of boosting
algorithms are AdaBoost, Gradient Boosting, and XGBoost. AdaBoost is one of the most
popular algorithms for classification and has real-world applications such as face
detection and text classification. (Liu et al., 2017) explained the main idea of AdaBoost as
constructing a succession of weak learners through different training sets with different
weights. However, one disadvantage of AdaBoost is that it can overfit as the number of base
learners increases. Gradient Boosting is often considered to be one of the strongest
algorithms in machine learning. This algorithm can be used with both categorical and
continuous data. It is prone to overfitting, and variants such as XGBoost have been developed
to overcome this disadvantage. XGBoost is an optimized variant of the gradient boosting
technique that is highly effective, portable, and flexible. In (Ogunleye and Wang, 2020),
XGBoost is defined as a highly efficient boosting algorithm that can outperform many other
boosting algorithms. This research used XGBoost with hyperparameters tuned by optimization
algorithms such as the Genetic algorithm, the annealing algorithm, and the Particle Swarm
Optimization (PSO) algorithm to diagnose Chronic Kidney Disease (CKD). They used Specificity,
Sensitivity, precision, and ROC-AUC metrics to evaluate the performance of the XGBoost
model. Further, they concluded that the XGBoost model achieved state-of-the-art results in
diagnosing chronic kidney disease, with accuracy, specificity, and sensitivity each equal to
1.0.
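The idea of building a succession of weak learners on differently weighted training sets can be made concrete with a small AdaBoost sketch. It uses one-dimensional threshold rules (decision stumps) as weak learners and labels in {-1, +1}; the toy data and function names are hypothetical, and the code illustrates the weight update only, not the implementation evaluated in the cited work.

```python
import math


def best_stump(xs, ys, weights):
    """Find the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, n_rounds):
    """Train n_rounds stumps, reweighting the examples after each round."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples gain weight, correctly classified ones lose it.
        weights = [
            w * math.exp(-alpha * y * (polarity if x >= threshold else -polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Combine the stumps by a weighted vote."""
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical data that no single threshold can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, n_rounds=3)
predictions = [predict(ensemble, x) for x in xs]
print(predictions)
```

Each single stump misclassifies at least three of the ten points, but the weighted vote of three stumps fits all of them, because every round concentrates on the examples the previous rounds got wrong.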
The Improving the Prediction of Heart Failure Patients' Survival Using SMOTE and
Effective Data Mining Techniques research (Ishaq et al., 2021b) used SMOTE in
combination with various supervised learning techniques such as Random forests, Extra tree
classifier, AdaBoost, and XGBoost. They used the metrics accuracy, precision, and
recall to evaluate the models. They concluded that SMOTE had improved the
performance of all algorithms, and the combination of SMOTE and Random forest
achieved state-of-the-art results. The study A Data Augmentation-Based Framework to Handle
Class Imbalance Problem for Alzheimer's Stage Detection (Afzal et al., 2019) used a
transfer learning-based method on multi-class Alzheimer's data. To handle data imbalance,
they used data augmentation techniques such as cropping the image and rotating the image
with multiple parameters. They concluded that data augmentation increased the
model's overall performance and helped surpass the state-of-the-art results. The Synthetic
oversampling with the majority class: A new perspective on handling extreme imbalance
research (Sharma et al., 2018) used SMOTE along with the Sampling WIth the Majority (SWIM)
technique. This technique is designed to overcome the limitations of SMOTE. They tested this
combination on 13 datasets and three different classification algorithms. They also tried
these combinations at various imbalance ratios. They concluded that the combination of
SMOTE with SWIM achieved more significant results than SMOTE alone.
2.4 Summary:
3. Research Questions
This research tries to compare the performance of sampling techniques and boosting methods
both individually and combined in handling data imbalance problems. The study starts by
highlighting the difficulty faced in handling skewed data. Then we try out different
combinations of methods mentioned above to address the issues. These are a few research
questions framed from the related works section:
• To perform effective EDA to detect any underlying patterns and to get the data ready
for modeling.
• To compare the combinations of sampling methods and boosting techniques to
minimize the effects of skewed data.
• To provide a detailed report of the performance of the above combinations and
analyse their pros and cons.
This research aims to determine the effects of data imbalance in classification. We will be
using thyroid disease data containing around 3700 instances, of which only 6% are positive,
indicating the extreme skewness of the data. This study intends to identify any underlying
patterns with the Exploratory Data Analysis and then move on to handle data imbalance. As
mentioned in the related works section, many studies have worked on sampling techniques
and boosting methods. The novelty of this study lies in combining the sampling techniques
with boosting strategies. Data imbalance is a serious concern in many real-world applications
such as thyroid disease diagnosis, credit card fraud detection, and market segmentation. Most
of the previous studies focused on classifying minority classes correctly, but we should also
deal with false alarms. Classifying a person without disease as sick or identifying a genuine
transaction as fraud will have serious consequences. By combining sampling and boosting
methods, we aim to reduce false alarms effectively. Our final solution will aid in building an
application that can handle data imbalance effectively.
This study's scope is to combine various sampling and boosting techniques to increase the
performance of classification algorithms. We will be using the Logistic regression, Decision
trees, Random Forests, and Naïve Bayes algorithms in this research. This research will explore
sampling techniques such as random oversampling, random under-sampling, ROSE, and
SMOTE and its variants in combination with boosting methods such as AdaBoost, Gradient
Boosting, and XGBoost. This study aims to handle the data imbalance problems and reduce false
alarms effectively. All the methods will be evaluated with the help of metrics like Precision,
AUC-ROC curve, Recall, g-mean, and F1-score. Further, this research can be extended by using
deep learning algorithms.
7. Research Methodology
7.1. Introduction
Data imbalance has been a severe obstacle in applying machine learning algorithms to
real-world applications. Applications that suffer from data imbalance include thyroid
disease diagnosis, cancer diagnosis, and market segmentation. Generally, these algorithms
work best when data is distributed evenly. Many procedures have been developed to handle
this problem, including sampling and boosting methods. This research tries to connect
sampling and boosting techniques to determine the best combination for handling data
imbalance. It uses the Thyroid disease dataset available on Kaggle and the UCI Machine
Learning Repository. This dataset consists of around 3700 instances and 30 attributes, and
it is severely imbalanced, as only 6% of the instances belong to the minority class. We will
be using metrics such as Precision, the AUC-ROC curve, F1-score, and Recall to evaluate our
combinations. This research is separated into various phases, which are listed below:
Step 1: Data Understanding & Cleaning:
In the data understanding step, we take a closer look at our research data. This step
includes gaining access to the data and analysing its metadata. Data cleaning is a crucial
step in any machine learning project, and it often takes up the bulk of the project time. It
includes dealing with null values, verifying the data types of attributes, dropping
unnecessary attributes and columns with only one unique value, performing sanity checks,
deriving new variables, etc. These operations generally vary depending on the project
requirements.
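The cleaning operations described above can be sketched in pandas as follows. This is only a
minimal illustration on a toy frame, not the cleaning code used in this research: the helper
name `clean_thyroid_frame`, the column `constant_flag`, and the use of "?" as the
missing-value marker are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean_thyroid_frame(df, missing_token="?"):
    """Mark missing values, map f/t flags to 0/1, and drop constant columns.

    `missing_token` is an assumption: the raw file is assumed to mark
    missing values with a placeholder string.
    """
    out = df.replace(missing_token, np.nan)

    # Map Boolean-like columns that contain only 'f'/'t' to 0/1.
    for col in out.columns:
        values = set(out[col].dropna().unique())
        if values and values <= {"f", "t"}:
            out[col] = out[col].map({"f": 0, "t": 1})

    # Drop columns that carry a single unique value.
    constant_cols = [c for c in out.columns if out[c].nunique(dropna=True) <= 1]
    return out.drop(columns=constant_cols)


# Toy data; `constant_flag` is a hypothetical column with one unique value.
toy = pd.DataFrame({
    "age": [41, 23, 46, 70],
    "sick": ["f", "f", "t", "f"],
    "TSH": ["1.3", "4.1", "?", "0.72"],
    "constant_flag": ["f", "f", "f", "f"],
})

cleaned = clean_thyroid_frame(toy)
# Verify the data type: hormone values should be numeric, not strings.
cleaned["TSH"] = pd.to_numeric(cleaned["TSH"])
print(cleaned.dtypes)
```

Imputing or dropping the remaining null values is left out on purpose, since that choice
depends on the attribute and on the project requirements.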
Step 2: Exploratory Data Analysis:
In this step, we try to identify any underlying patterns in the data by grouping the data and
visualizing it. Data visualization can be univariate, bivariate, or multivariate. We can use
count plots, bar plots, line plots, box plots, scatter plots, and heat maps to visualize the
data. Through this step, we try to gain insights from the data.
Step 3: Data Preparation:
In this step, we will make the data ready for the modelling phase. This step includes adding
dummy variables, performing normalization, and splitting the data into train and test sets.
Step 4: Build & Tune Models:
We will be using a random forest algorithm to build the baseline model. Then we will use
k-fold cross-validation and hyperparameter tuning to make the model more effective. We will
assess the effect of data imbalance on this model, and then we will use different sampling
methods, boosting techniques, and their combinations to mitigate that effect. We will analyse
the effect of each sampling and boosting technique individually, along with its advantages
and disadvantages.
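As an illustration of the tuning step, the sketch below runs a small grid search over a
random forest with stratified k-fold cross-validation. It uses synthetic data with roughly
6% positives in place of the thyroid dataset, and the parameter grid, the number of folds,
and the F1 scoring choice are assumptions made for the example rather than the settings of
this research.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the thyroid data: about 6% of rows are positive.
X, y = make_classification(
    n_samples=600,
    n_features=10,
    n_informative=5,
    weights=[0.94, 0.06],
    random_state=0,
)

# Stratified folds keep the class ratio similar in every fold, which matters
# when the minority class is small.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Deliberately small grid (an assumption) so the example runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # F1 on the positive class instead of accuracy
    cv=cv,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy keeps the search from rewarding models that simply
predict the majority class.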
Step 5: Model Evaluation:
This step is project-specific, and the metrics used will vary based on the requirements or
problems faced by the project. Accuracy is not insightful in the case of data imbalance, so
we will use metrics such as Precision, Specificity, AUC-ROC, Recall, and F1-score. For
more detailed information, please refer to the 'Evaluation Metrics' section.
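A short worked example shows why accuracy is not insightful here. It assumes a dataset of
3772 instances with 6% positives (about 226 cases, a rounded figure used only for this
illustration) and a trivial classifier that labels every patient as negative.

```python
# Assumed class counts: 6% of 3772 instances, rounded to a whole number.
n_total = 3772
n_positive = round(0.06 * n_total)
n_negative = n_total - n_positive

# A trivial classifier that predicts "negative" for every patient.
tp, fn = 0, n_positive   # every sick patient is missed
tn, fp = n_negative, 0   # every healthy patient is classified correctly

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

The accuracy is about 94% even though no sick patient is detected, which is why the
evaluation relies on Recall, Precision, Specificity, and F1-score instead.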
Step 6: Results & Conclusion:
In this step, we will populate a table containing the performance of all the combinations we
have tried. After that, we will analyse the performance of each combination and discuss its
pros and cons. We conclude with the best combination available to us and discuss the future
scope of this project.
Figure 1: Process Flow Diagram
7.2. Dataset Description
This research paper will deal with the Thyroid disease dataset available on Kaggle and the
UCI Machine Learning Repository. The reason for selecting this dataset is the imbalance of
its class distribution: there are 3772 instances in the dataset, out of which only 6% are
positive for Thyroid disease. The dataset contains the basic details of the patient, such as
age and gender. It also includes indicators of the patient's condition, such as sick, tumour,
goitre, and hyperthyroid, and the medication or treatment the patient is receiving, such as
I131_treatment and lithium. Finally, this dataset also contains the measured values of
different thyroid hormones like TSH, T3, TT4, T4U, FTI, etc.
In total, this dataset contains 30 attributes. Table 2 gives the description and data type
of the attributes.
Table 2: Dataset Details
Attribute                   Data Type                 Description
Age                         Numeric                   Age of the patient
Sex                         Boolean (f/m)             Indicates patient’s gender
on_thyroxine                Boolean (f/t)             Indicates if the patient is using thyroxine medication
on_antithyroid_medication   Boolean (f/t)             Indicates if the patient is on anti-thyroid medication.
sick                        Boolean (f/t)             Indicates if the patient is sick or not.
Pregnant                    Boolean (f/t)             Indicates if the patient is pregnant or not.
thyroid_surgery             Boolean (f/t)             Indicates if the patient has undergone thyroid surgery.
I131_treatment              Boolean (f/t)             Indicates if the patient is receiving I131 treatment.
lithium                     Boolean (f/t)             Indicates if the patient is using lithium or not.
goitre                      Boolean (f/t)             Indicates if the patient is suffering from goitre or not.
tumour                      Boolean (f/t)             Indicates if the patient has a tumour or not.
hypopituitary               Boolean (f/t)             Indicates if the patient has hypo-pituitary disorder.
psych                       Boolean (f/t)             Indicates if the patient is referred to psych evaluation.
TSH_measured                Boolean (f/t)             Indicates if the patients’ TSH value is measured or not.
T3_measured                 Boolean (f/t)             Indicates if the patients’ T3 value is measured or not.
TT4_measured                Boolean (f/t)             Indicates if the patients’ TT4 value is measured or not.
T4U_measured                Boolean (f/t)             Indicates if the patients’ T4U value is measured or not.
FTI_measured                Boolean (f/t)             Indicates if the patients’ FTI value is measured or not.
TSH                         Numeric                   Indicates the TSH value of patient.
T3                          Numeric                   Indicates the T3 value of patient.
TT4                         Numeric                   Indicates the TT4 value of patient.
T4U                         Numeric                   Indicates the T4U value of patient.
FTI                         Numeric                   Indicates the FTI value of patient.
referral_source             String                    Indicates the referral source of patient.
Class                       Boolean (Negative/sick)   Indicates if the patient is suffering from Thyroid disease or not.
7.3. Data Pre-processing
In this step, we perform the operations needed to make the data model-ready. To clean the
data, we need to carry out a few procedures such as dealing with null values, handling
duplicate rows, checking the data types of attributes, mapping Boolean variables, performing
sanity checks, dealing with outlier values, etc. These steps may vary depending on the
research. Next, we need to perform EDA to discover the underlying patterns. Feature
Engineering is a procedure where we derive new features that are more relevant to the
outcome. Data normalization puts the attributes on a comparable scale, which helps many
machine learning algorithms converge faster. Our final step before building a model is the
Train-Test split, in which we divide our dataset into two sets, train and test. Generally,
we use a 75:25 ratio to split the dataset.
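The preparation steps above (dummy variables, a 75:25 train-test split, and normalization)
could look like the following sketch. The toy frame, the referral source labels "A", "B",
"C", the stratified split, and the choice of `MinMaxScaler` are assumptions made for
illustration, not the exact configuration of this research.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the cleaned data: 200 rows, 10% positive (assumed values).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=n).astype(float),
    "TSH": rng.gamma(2.0, 2.0, size=n),
    "referral_source": rng.choice(["A", "B", "C"], size=n),  # placeholder labels
    "Class": [1] * 20 + [0] * 180,
})

# Dummy variables for the categorical attribute.
X = pd.get_dummies(toy.drop(columns="Class"), columns=["referral_source"], drop_first=True)
y = toy["Class"]

# 75:25 split; stratify=y keeps the class ratio the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets,
# so that no information from the test set leaks into training.
numeric_cols = ["age", "TSH"]
scaler = MinMaxScaler().fit(X_train[numeric_cols])
X_train = X_train.copy()
X_test = X_test.copy()
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train.shape, X_test.shape, int(y_train.sum()), int(y_test.sum()))
```

Stratifying the split is worth considering with a minority class this small, because a
purely random split can leave the test set with very few positive cases.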
7.4. Models
This step will start with a random forest model and tune its hyperparameters. Then, we will
analyse the effect of data imbalance on the performance of our model. Next, we will try out
each of the sampling methods and boosting methods individually. For the purposes of this
research, these methods should be sufficient for the problem we are addressing: they are
widely used in industry and are frequent starting points for tackling data imbalance. We
will discuss the effect of these methods along with their pros and cons in dealing with data
imbalance. Then, we will build combinations of sampling and boosting methods based on their
individual performances.
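One possible shape of such a combination is sketched below: the training set is balanced by
random oversampling and a boosting model is then fitted on it. The data is synthetic, the
helper `random_oversample` is written for this example, and AdaBoost with 50 estimators is
an arbitrary choice. In practice the imbalanced-learn package listed under the software
requirements offers ready-made samplers such as SMOTE.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split


def random_oversample(X, y, random_state=0):
    """Duplicate minority rows at random until all classes have equal counts."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    order = rng.permutation(np.concatenate(keep))
    return X[order], y[order]


# Synthetic stand-in for the thyroid data: about 6% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.94, 0.06], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Resample the training data only; the test set keeps its original skew.
X_bal, y_bal = random_oversample(X_train, y_train)

model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_bal, y_bal)
y_pred = model.predict(X_test)

recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
print(f"recall = {recall:.3f}, precision = {precision:.3f}")
```

Keeping the test set untouched matters: evaluating on resampled data would overstate how
well the combination handles the original class distribution.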
7.5. Evaluation Metrics
For handling data imbalance, we need to focus on classifying minority class instances
correctly without raising too many false alarms, so we will use metrics chosen for that
purpose: Precision, Recall, Specificity, the AUC-ROC curve, F1-score, etc. Most of these
metrics can be derived from the confusion matrix. Recall, also known as sensitivity or the
True Positive Rate, denotes the performance of our model in classifying minority (positive)
class instances. Specificity, also known as the True Negative Rate, denotes its performance
on the majority (negative) class. False alarms are the cases where we classify a healthy
person as sick, that is, false positives. Their effect shows up in precision and specificity
rather than in recall: the false alarm rate equals 1 - specificity. The ROC curve is a graph
that displays the trade-off between sensitivity and specificity across classification
thresholds. The F1-score is the harmonic mean of precision and recall.
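For reference, the confusion-matrix metrics can be computed as in the sketch below. The
function name `imbalance_metrics` and the example counts are hypothetical and are used only
to illustrate the standard definitions.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute common imbalance-aware metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity / TPR
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # TNR
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "false_alarm_rate": 1.0 - specificity,
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
    }


# Hypothetical counts: 50 sick patients and 950 healthy patients.
metrics = imbalance_metrics(tp=40, fp=10, tn=940, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The g-mean is the geometric mean of recall and specificity, so it is high only when the
model performs well on both classes.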
8. Required Resources
Hardware Requirements:
Laptop – Windows/Mac
RAM – 6 GB or higher
Processor – Intel Core i5 or higher
GPU – Intel integrated or higher
OS – Windows 10 / macOS Catalina or equivalent
Google Drive
Software Requirements:
Python
Anaconda
Jupyter notebook
NumPy
Pandas
Matplotlib
Seaborn
Plotly
scikit-learn (sklearn)
imbalanced-learn (imblearn)
SciPy
statsmodels
Please make sure that all the packages are updated to their latest versions.
9. Research Plan
89
References
Afzal, S., Maqsood, M., Nazir, F., Khan, U., Aadil, F., Awan, K.M., Mehmood, I. and Song,
O.Y., (2019) A Data Augmentation-Based Framework to Handle Class Imbalance Problem
90
for Alzheimer’s Stage Detection. IEEE Access, 7, pp.115528–115539.
Amin, A., Anwar, S., Adnan, A., Nawaz, M., Howard, N., Qadir, J., Hawalah, A. and
Hussain, A., (2016) Comparing Oversampling Techniques to Handle the Class Imbalance
Problem: A Customer Churn Prediction Case Study. IEEE Access, 4, pp.7940–7957.
Chakravarthy, A.D., Bonthu, S., Chen, Z. and Zhu, Q., (2019) Predictive models with
resampling: A comparative study of machine learning algorithms and their performances on
handling imbalanced datasets. In: Proceedings - 18th IEEE International Conference on
Machine Learning and Applications, ICMLA 2019. Institute of Electrical and Electronics
Engineers Inc., pp.1492–1495.
Ebenuwa, S.H., Sharif, M.S., Alazab, M. and Al-Nemrat, A., (2019) Variance Ranking
Attributes Selection Techniques for Binary Classification Problem in Imbalance Data. IEEE
Access, 7, pp.24649–24666.
IEEE Communications Society and Institute of Electrical and Electronics Engineers, (n.d.)
2017 International Conference on Advances in Computing, Communications and Informatics
(ICACCI) : 13-16 Sept. 2017.
Institute of Electrical and Electronics Engineers, (n.d.) 2018 2nd International Conference on
Informatics and Computational Sciences (ICICoS).
Institute of Electrical and Electronics Engineers. Madras Section and Institute of Electrical
and Electronics Engineers, (n.d.) 2019 5th International Conference on Advanced Computing
& Communication Systems (ICACCS).
Ishaq, A., Sadiq, S., Umer, M., Ullah, S., Mirjalili, S., Rupapara, V. and Nappi, M., (2021)
Improving the Prediction of Heart Failure Patients’ Survival Using SMOTE and Effective
91
Data Mining Techniques. IEEE Access, 9, pp.39707–39716.
Pes, B., (2019) Handling Class Imbalance in High-Dimensional Biomedical Datasets. In:
Proceedings - 2019 IEEE 28th International Conference on Enabling Technologies:
Infrastructure for Collaborative Enterprises, WETICE 2019. Institute of Electrical and
Electronics Engineers Inc., pp.150–155.
Sharma, S., Bellinger, C., Krawczyk, B., Zaiane, O. and Japkowicz, N., (2018) Synthetic
Oversampling with the Majority Class: A New Perspective on Handling Extreme Imbalance.
In: Proceedings - IEEE International Conference on Data Mining, ICDM. Institute of
Electrical and Electronics Engineers Inc., pp.447–456.
Shaw, R.N., Walde, P., Galgotias University, Institute of Electrical and Electronics Engineers
and IEEE Industry Applications Society, (n.d.) 2019 International Conference on Computing,
Power and Communication Technologies (GUCON) : Galgotias University, Greater Noida,
UP, India, Sep 27-28, 2019.
SVS College of Engineering and Institute of Electrical and Electronics Engineers, (n.d.)
Proceedings of the 2018 International Conference on Current Trends towards Converging
Technologies : 01 - 03, March 2018.
Yan, Y., Liu, R., Ding, Z., Du, X., Chen, J. and Zhang, Y., (2019) A parameter-free cleaning
method for SMOTE in imbalanced classification. IEEE Access, 7, pp.23537–23548.
92
93