Article
A Method for Analyzing the Performance Impact of Imbalanced
Binary Data on Machine Learning Models
Ming Zheng 1,2, * , Fei Wang 1 , Xiaowen Hu 1 , Yuhao Miao 3 , Huo Cao 1 and Mingjing Tang 4,5, *
1 School of Computer and Information, Anhui Normal University, Wuhu 241002, China
2 Anhui Provincial Key Laboratory of Network and Information Security, Wuhu 241002, China
3 Affiliated Institution of Anhui Normal University, Wuhu 241002, China
4 School of Life Science, Yunnan Normal University, Kunming 650500, China
5 Engineering Research Center of Sustainable Development and Utilization of Biomass Energy,
Ministry of Education, Yunnan Normal University, Kunming 650500, China
* Correspondence: mzheng@ahnu.edu.cn (M.Z.); tmj@ynnu.edu.cn (M.T.)
Abstract: Machine learning models may not be able to effectively learn and predict from imbalanced
data in the fields of machine learning and data mining. This study proposed a method for analyzing
the performance impact of imbalanced binary data on machine learning models. It systematically
analyzes 1. the relationship between varying performance in machine learning models and imbalance
rate (IR); 2. the performance stability of machine learning models on imbalanced binary data. In
the proposed method, the imbalanced data augmentation algorithms are first designed to obtain the
imbalanced dataset with gradually varying IR. Then, in order to obtain more objective classification
results, the evaluation metric AFG, arithmetic mean of area under the receiver operating characteristic
curve (AUC), F-measure and G-mean are used to evaluate the classification performance of machine
learning models. Finally, based on AFG and coefficient of variation (CV), the performance stability
evaluation method of machine learning models is proposed. Experiments of eight widely used
machine learning models on 48 different imbalanced datasets demonstrate that the classification
performance of machine learning models decreases with the increase of IR on the same imbalanced
data. Meanwhile, the classification performances of LR, DT and SVC are unstable, while GNB, BNB, KNN, RF and GBDT are relatively stable and not susceptible to imbalanced data. In particular, the BNB has the most stable classification performance. The Friedman and Nemenyi post hoc statistical tests also confirmed this result. The SMOTE method is used in oversampling-based imbalanced data augmentation, and determining whether other oversampling methods can obtain consistent results needs further research. In the future, an imbalanced data augmentation algorithm based on undersampling and hybrid sampling should be used to analyze the performance impact of imbalanced binary data on machine learning models.
Keywords: machine learning models; imbalanced data; machine learning; data mining; performance impact
MSC: 68T09
Citation: Zheng, M.; Wang, F.; Hu, X.; Miao, Y.; Cao, H.; Tang, M. A Method for Analyzing the Performance Impact of Imbalanced Binary Data on Machine Learning Models. Axioms 2022, 11, 607. https://doi.org/10.3390/axioms11110607
Academic Editor: Jong-Min Kim
Received: 19 September 2022; Accepted: 28 October 2022; Published: 1 November 2022
1. Introduction
In the fields of data mining and machine learning, imbalanced data classification is a ubiquitous natural phenomenon. Imbalanced data classification is a kind of supervised learning in which the distribution of response variables in the dataset varies greatly across classes. In binary or multiclass datasets affected by imbalanced data classification, a response variable with fewer samples is referred to as a positive class or a minority class, whereas a response variable with more samples is known as a negative class or a majority class. Because machine learning models are based on situations in which the data distribution is relatively balanced, these machine learning models may experience different degrees of defects when faced with imbalanced data classification and
may thus become inefficient [1]. For example, suppose there is a patient dataset containing
990 normal patients and 10 cancer patients. If we do not modify the machine learning
model or improve the distribution of data and use the machine learning model directly,
it will tend to predict that all patients in the dataset are normal. This can lead to a terrible situation in which cancer patients miss the optimal treatment time because they cannot be accurately identified, thus endangering their lives. Therefore,
enhancing the analysis and understanding of machine learning models for imbalanced
data classification has important theoretical significance and application value [2].
Current studies on imbalanced data classification are mainly concerned with two approaches [3]. The first involves designing new or improving existing machine learning models, which generally entails reducing the sensitivity of the classification algorithm to
imbalanced data. Typically, either ensemble learning is used to increase the robustness of
the machine learning models [4] or a cost-sensitive learning method [5] is used to make the
cost of misclassifying the minority class higher than that of misclassifying the majority class.
The second approach is to use a sampling method to balance the dataset on the data level,
mainly via oversampling [6,7], undersampling [3] or hybrid sampling [8]. The purpose of
oversampling is to increase the number of samples in the minority class, thus improving the
distribution of data among classes. Undersampling has the same purpose as oversampling,
but instead removes samples from the majority class. Finally, hybrid sampling is intended
to balance the dataset by combining the two aforementioned sampling methods.
This study aims to provide a method to analyze the performance impact of imbalanced
binary data on machine learning models. Hence, proposing new techniques addressing
imbalanced data classification is not the focus. The method proposed in this study not
only analyzes the relationship between varying performance in machine learning models
and IR, but also analyzes the performance stability of the machine learning models on the
imbalanced datasets. The main contributions of this study can be summarized as follows.
(1) To obtain the imbalanced dataset with gradually varying IR and belonging to
the same distribution, this study proposes three different augmentation algorithms of
imbalanced data by combining the oversampling method, the undersampling method and
the hybrid sampling method, respectively.
(2) This study proposes a performance evaluation metric AFG by analyzing and
combining evaluation metrics AUC, F-measure and G-mean.
(3) Our comparative study systematically analyzes the relationship between varying
performance in machine learning models and IR, as well as the performance stability of
eight machine learning models on 48 benchmark imbalanced datasets, which can provide
an important reference value for imbalanced data classification application developers and
researchers.
The remainder of this study is organized as follows. Section 2 describes the proposed
approach in detail. Experiment settings are given in Section 3, and the experimental results
are discussed in Section 4. The related works are discussed in Section 5. Finally, Section 6
briefly summarizes our study and presents the conclusions.
2. Proposed Method
The overall framework of the method for analyzing the performance impact of imbal-
anced binary data on machine learning models is shown in Figure 1.
Specifically, in the framework, a group of imbalanced datasets with decreasing IR is first generated by the augmentation algorithms. How to augment an
imbalanced dataset with decreasing IR will be described in detail in Section 2.1. Then, in
Section 2.2, in order to obtain the relationship between varying performance in machine
learning models and IR, we use AFG, the arithmetic mean of AUC, F-measure and G-mean,
to evaluate the classification performance of machine learning models. Finally, in Section 2.3,
the performance stability of machine learning models on imbalanced datasets is evaluated
by combining AFG and CV. Meanwhile, statistical tests are applied to further verify whether
the performance stability of these machine learning models is significantly different.
Figure 1. Overall framework of the proposed method.
Algorithm 1 first divides the original imbalanced data into majority class samples Tmajority and minority class samples Tminority according to the class label of samples. Among them, the number of minority class samples is n1, and the number of majority class samples is n2. Then, the IR of the original imbalanced data T is calculated and denoted with r; note that r is the value rounded down. Traversing r, each traversal will calculate the oversampling ratio. According to the oversampling ratio and the original imbalanced data T, the oversampling approach is used to generate minority class samples, the generated minority class samples are merged with the original imbalanced data T to get new imbalanced data, and the above steps are repeated until the end of the loop to finally obtain the augmented imbalanced dataset Taugment (a group of imbalanced datasets with decreasing IR). The relationship between the augmentation process of imbalanced data and IR change in Algorithm 1 is shown in Figure 2.
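As an illustration, the following Python sketch implements an oversampling-based augmentation in the spirit of Algorithm 1, assuming imbalanced-learn's SMOTE as the oversampling approach and assuming that the target IR is lowered by one unit per iteration (r − 1, r − 2, ..., 1); the exact oversampling ratio prescribed by Algorithm 1 may differ.

# Sketch of an oversampling-based imbalanced data augmentation (in the spirit of
# Algorithm 1). Assumptions: binary labels, SMOTE as the oversampling approach,
# target IR decreasing by one unit per iteration.
from collections import Counter
from imblearn.over_sampling import SMOTE

def augment_by_oversampling(X, y, random_state=0):
    counts = Counter(y)
    n1, n2 = min(counts.values()), max(counts.values())   # minority, majority sizes
    r = n2 // n1                                           # IR rounded down
    augmented = []
    for target_ir in range(r - 1, 0, -1):                  # target IRs: r-1, ..., 1
        # sampling_strategy = desired minority/majority ratio after resampling
        smote = SMOTE(sampling_strategy=1.0 / target_ir, random_state=random_state)
        X_res, y_res = smote.fit_resample(X, y)
        augmented.append((X_res, y_res))
    return augmented                                       # datasets with decreasing IR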
Similarly, Algorithm 2 first divides the original imbalanced data into majority class
samples Tmajority and minority class samples Tminority according to the class of samples.
Among them, the number of minority class samples is n1 , and the number of majority
class samples is n2 . Then, the IR of the original imbalanced data T is calculated and
denoted with r; note that r is the value rounded down again. Traversing r, each traversal
will calculate the undersampling ratio. According to the undersampling ratio and original
imbalanced data T, the undersampling approach is used to delete majority class samples,
remove deleted majority class samples from the original imbalanced data T to get new
imbalanced data and the above steps are repeated until the end of the loop to finally
obtain the augmented imbalanced dataset Taugment (a group of imbalanced datasets with
decreasing IR). The relationship between the augmentation process of imbalanced data and
IR change in Algorithm 2 is shown in Figure 3.
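An undersampling-based counterpart, analogous to Algorithm 2, can be sketched in the same way; here imbalanced-learn's RandomUnderSampler stands in for the undersampling approach, and the per-iteration ratio is again an assumption rather than the exact formula of Algorithm 2.

# Sketch of an undersampling-based augmentation (analogous to Algorithm 2).
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

def augment_by_undersampling(X, y, random_state=0):
    counts = Counter(y)
    n1, n2 = min(counts.values()), max(counts.values())
    r = n2 // n1
    augmented = []
    for target_ir in range(r - 1, 0, -1):                  # target IRs: r-1, ..., 1
        rus = RandomUnderSampler(sampling_strategy=1.0 / target_ir,
                                 random_state=random_state)
        X_res, y_res = rus.fit_resample(X, y)              # remove majority samples
        augmented.append((X_res, y_res))
    return augmented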
Last, the imbalanced data augmentation algorithm based on the hybrid sampling method is introduced. After applying Algorithm 3, the augmented imbalanced dataset with decreasing IR can be obtained.

Algorithm 3 Hybrid sampling based imbalanced data augmentation
Input: T: Original imbalanced data.
Output: Taugment: Augmented imbalanced dataset.
Procedure Begin
1. T ← Tmajority ∪ Tminority
2. n1 ← size of Tminority
3. n2 ← size of Tmajority
4. r ← int(n2/n1)
5. for i = 0 to r − 1 do
6.   oversampling ratio ← (n1 + n2)/rn2
7.   undersampling ratio ← rn1/[(n1 + n2)(r − 1)]
8.   generated minority class samples ← oversampling approach (T, oversampling ratio)
9.   deleted majority class samples ← undersampling approach (T, undersampling ratio)
10.  Taugment[i] ← generated minority class samples ∪ (T − deleted majority class samples)
11.  return Taugment[i]
12. end for
13. return Taugment
14. End

Algorithm 3 first divides the original imbalanced data into majority class samples Tmajority and minority class samples Tminority according to the class of samples. Among them, the number of minority class samples is n1, and the number of majority class samples is n2. Then, the IR of the original imbalanced data T is calculated and denoted with r; note that r is the value rounded down. Traversing r, each traversal will calculate the oversampling ratio and the undersampling ratio. According to the oversampling ratio, the undersampling ratio and the original imbalanced data T, the oversampling approach and the undersampling approach are used to generate the minority class samples and delete the majority class samples, respectively. The generated minority class samples are merged and the deleted majority class samples are removed from the original imbalanced data T to obtain new imbalanced data, and the above steps are repeated until the end of the loop, to finally obtain the augmented imbalanced dataset with decreasing IR. The relationship between the augmentation process of imbalanced data and IR change in Algorithm 3 is shown in Figure 4.
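For illustration, a hybrid-sampling step can be sketched by combining the two resamplers. In the sketch below the IR reduction is split evenly (geometrically) between SMOTE oversampling and random undersampling; this split is an assumption made for illustration, whereas Algorithm 3 prescribes its own oversampling and undersampling ratios.

# Sketch of a hybrid-sampling augmentation (in the spirit of Algorithm 3): each
# iteration both generates synthetic minority samples and removes majority samples
# until the target IR is reached. The even split between the two resamplers is an
# assumption of this sketch.
import math
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def augment_by_hybrid_sampling(X, y, random_state=0):
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    n1, n2 = counts[minority], counts[majority]
    r = n2 // n1
    augmented = []
    for target_ir in range(r - 1, 0, -1):                   # target IRs: r-1, ..., 1
        n1_new = int(math.sqrt(n1 * n2 / target_ir))        # grow the minority part-way
        n2_new = target_ir * n1_new                          # shrink the majority the rest
        X_over, y_over = SMOTE(sampling_strategy={minority: n1_new},
                               random_state=random_state).fit_resample(X, y)
        X_res, y_res = RandomUnderSampler(sampling_strategy={majority: n2_new},
                                          random_state=random_state).fit_resample(X_over, y_over)
        augmented.append((X_res, y_res))
    return augmented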
It should be noted that the three imbalanced data augmentation algorithms have good flexibility. This is because the resampling methods of the three augmentation algorithms are not fixed, and can be arbitrary oversampling, undersampling and hybrid sampling methods, so they are represented in italics. Meanwhile, the above three augmentation algorithms have the same purpose and are all designed to augment imbalanced data into a group of imbalanced datasets with decreasing IR. As long as the resampling methods in the above augmentation algorithms are good enough, the augmented data and the original imbalanced data belong to the same distribution. Because the SMOTE (synthetic minority oversampling technique) [6] is a classical and widely used oversampling method in the studies of imbalanced data classification [9–11], this study uses a SMOTE-based augmentation algorithm to augment the imbalanced binary data.

2.2. Performance Evaluation Metric
The evaluation metrics AUC, F-measure and G-mean are widely used to evaluate the classification performance of machine learning models for imbalanced data classification [12–14]. To facilitate the introduction of the calculation rules of the evaluation metrics, the confusion matrix was first established, as detailed in Table 1.

Table 1. Binary classification confusion matrix.

                    Predicted Positive       Predicted Negative
Actual positive     True positives (TP)      False negatives (FN)
Actual negative     False positives (FP)     True negatives (TN)
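The AFG score used throughout this study is the arithmetic mean of AUC, F-measure and G-mean. A minimal Python sketch of its computation follows, assuming 0/1 class labels, the F-measure taken as the F1-score of the positive (minority) class, and the G-mean taken as the geometric mean of the recalls of the two classes.

# Sketch of the AFG evaluation metric: the arithmetic mean of AUC, F-measure and
# G-mean. Assumptions: binary 0/1 labels, F-measure = F1 of the positive class,
# G-mean = sqrt(TPR * TNR).
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, recall_score

def afg_score(y_true, y_pred, y_score, positive_label=1):
    auc = roc_auc_score(y_true, y_score)                            # AUC from scores
    f_measure = f1_score(y_true, y_pred, pos_label=positive_label)
    tpr = recall_score(y_true, y_pred, pos_label=positive_label)    # minority recall
    tnr = recall_score(y_true, y_pred, pos_label=1 - positive_label)  # majority recall
    g_mean = np.sqrt(tpr * tnr)
    return (auc + f_measure + g_mean) / 3.0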
3. Experiment Settings
The benchmark imbalanced data are described in Section 3.1. We briefly introduce the
eight machine learning models for training and classifying imbalanced data in Section 3.2.
Section 3.3 explains the experimental flow design. Finally, Section 3.4 introduces the statistical test method.
Figure 6. Experimental flowchart (the imbalanced data are randomly split into a 90% training set and a 10% testing set, and this process is performed 100 times).
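As a sketch of this flow, the snippet below repeats a random 90%/10% split 100 times, computes AFG on each testing set (using the afg_score sketch from Section 2.2), and summarizes stability by the coefficient of variation of the AFG scores; whether the splits are stratified and whether CVAFG is reported in percent are assumptions of this sketch.

# Sketch of the experimental flow (Figure 6) and of the CV_AFG stability measure:
# 100 random 90%/10% splits, AFG on each testing set, then CV = std / mean (in %).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB   # any of the eight models could be used

def cv_afg(X, y, model=None, runs=100):
    model = model if model is not None else BernoulliNB()
    scores = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1,
                                                  stratify=y, random_state=seed)
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_te)
        y_score = model.predict_proba(X_te)[:, 1]
        scores.append(afg_score(y_te, y_pred, y_score))        # AFG per repetition
    scores = np.asarray(scores)
    return 100.0 * scores.std(ddof=1) / scores.mean()          # CV_AFG in percent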
3.4. Statistical Test Method
This study employed non-parametric testing to analyze and compare whether there are significant differences in the performance stability of machine learning models on imbalanced datasets. These tests have been used in several empirical studies and are highly recommended in the field of machine learning and data mining [20,30] to confirm experimental results. The non-parametric test procedure consists of three steps. First, ranking scores are computed by assigning a rank score to each machine learning model in each imbalanced dataset. Because the smaller the CVAFG, the more stable the performance of the machine learning model, 8 is assigned to the most unstable machine learning model, 7 to the second most unstable machine learning model, and so on. The ranking score of the most stable machine learning model is 1. Then, the mean ranking scores of eight
machine learning models on 48 imbalanced datasets are computed. Next, the Friedman test
is used to determine whether these machine learning models deliver the same performance
stability. If performance stability differs, the hypothesis that all machine learning models
have the same performance stability is rejected; if the performance stability of the machine
learning model is significantly different, a post hoc test is needed to further distinguish
each machine learning model. Finally, when the hypothesis that all machine learning
models have the same performance stability is rejected, the Nemenyi post hoc test is
applied to check whether the control machine learning model (usually the most stable
one) significantly outperforms the remaining machine learning models. The Nemenyi
post hoc procedures enable calculating the critical distance of the mean ranking score
difference. If the difference between the mean ranking scores of the two machine learning
models exceeds the critical distance, the hypothesis that the performance stability of the
two machine learning models is the same is rejected at a specified level of significance α
(i.e., there exist significant differences); in this study, α = 0.05.
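A minimal sketch of this procedure is given below, assuming scipy's Friedman test and the usual Nemenyi critical distance CD = q_alpha * sqrt(k(k + 1)/(6N)) for k models and N datasets; the critical value q_alpha for k = 8 at alpha = 0.05 (about 3.031) is taken from standard tables and should be treated as an assumption of the sketch.

# Sketch of the statistical test procedure: rank CV_AFG per dataset (rank 1 = most
# stable), apply the Friedman test across datasets, and compute the Nemenyi
# critical distance for pairwise comparison of mean ranks.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_nemenyi(cv_matrix, q_alpha=3.031):
    """cv_matrix has shape (N datasets, k models) and holds CV_AFG values."""
    n_datasets, k = cv_matrix.shape
    ranks = np.vstack([rankdata(row) for row in cv_matrix])   # smaller CV -> lower rank
    mean_ranks = ranks.mean(axis=0)                           # mean rank per model
    stat, p_value = friedmanchisquare(*cv_matrix.T)           # one argument per model
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))  # Nemenyi critical distance
    return mean_ranks, p_value, cd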
Figure 7. Relationship between varying performance in GNB, BNB, KNN, LR and IR.
Figure 8. Relationship between varying performance in RF, DT, GBDT, SVC and IR.
[Figure: CV (%) of the eight machine learning models (GNB, BNB, KNN, LR, RF, DT, GBDT and SVC) on each imbalanced dataset.]
Table 3. Mean ranking scores of eight machine learning models with CVAFG on 48 imbalanced datasets. Bold values indicate the best machine learning model for each row.

Algorithms   GNB      BNB      KNN      LR       RF       DT       GBDT     SVC
CVAFG        3.4583   3.4063   3.9167   6.0000   4.1250   5.6354   3.8438   5.6146
Figure 10. Result of the Nemenyi post hoc test.

Generally, the control method is the optimal method. In this study, the control method is the machine learning model with the most stable classification performance, and therefore the machine learning model BNB is the control method. Figure 10 reveals that the performance stability of BNB is better than that of the other seven machine learning models. Moreover, the performance stability of BNB is significantly better than that of the three machine learning models LR, DT and SVC.

5. Related Works
At present, the issue of imbalanced data classification has attracted wide attention
in the field of artificial intelligence and data mining. In view of the performance impact
of imbalanced data on machine learning models, researchers have also carried out lots of
exploratory work.
Mazurowski et al. [31] explored two methods of neural network training: classical
backpropagation (BP) and particle swarm optimization (PSO) with clinically relevant
training criteria, and used the simulation data and the real clinical data of breast cancer
diagnosis to verify that the performance of the classification algorithm will deteriorate even
if there is a slight imbalance in the training data. The experimental results further show
that the BP algorithm is better than the PSO algorithm for imbalanced data, especially for
imbalanced data with smaller samples and more features.
Loyola-González et al. [32] analyzed and studied the performance impact of using
resampling methods for contrast pattern-based classifiers in imbalanced data classification
issues. Experimental results show that there are statistically significant differences between
using the contrast pattern-based classifiers before and after applying resampling methods.
Yu et al. [16] proposed an approach to analyzing the impact of class imbalance in
the field of software defect prediction. In this method, the original imbalanced data is
transformed into a set of new datasets with increasing IR by the undersampling approach.
The AUC evaluation metric and CV were used to evaluate the performance of the prediction
models. The experimental results show that the performance of C4.5, Ripper and SMO
prediction models decreases with the increase of IR, while the classification performance of
Logistic Regression, Naive Bayes and Random Forest prediction models is more stable.
Luque et al. [33] conducted extensive and systematic research on the impact of class
imbalance on classification performance measurement through the simulation results
obtained by binary classifiers. A new performance measurement method of imbalanced
data based on the binary confusion matrix is defined. From the simulation results, several
clusters of performance metrics have been identified that involve the use of G-mean or
Bookmaker Informedness as the best null-biased metrics if their focus on classification
successes presents no limitation for the specific application where they are used. However,
if classification errors must also be considered, then the Matthews correlation coefficient
arises as the best choice.
Lu et al. [19] took the Bayesian optimal classifier as the research object, and theoretically
studied the influence of class imbalance on classification results. They proposed a data
measure called the Bayes Imbalance Impact Index (BI3 ). The experiment shows that BI3
can be used as a standard to explain the impact of imbalance on data classification.
Kovács [34] presented a detailed empirical comparison of 85 variants of minority oversampling techniques, involving 104 imbalanced datasets for evaluation.
The goal of this work is to set a new baseline in the field and determine the oversampling
principles leading to the best results under general circumstances.
Thabtah et al. [35] studied the impact of varying class imbalance ratios on classifier
accuracy, by highlighting the precise nature of the relationship between the degree of class
imbalance and the corresponding effects on classifier performance. They hope to help
researchers to better tackle the problem. The experiments use 10-fold cross-validation on a
large number of datasets and determine that the relationship between the class imbalance
ratio and the accuracy is convex.
A comparative summary of previous efforts in this field is provided in Table 4. The
columns of the table correspond to the following criteria.
• MFs indicates whether this approach is validated on the imbalanced data from multi-
ple fields, yes (Y), no (N).
• EMs indicates whether this approach uses multiple evaluation metrics to obtain more
objective experimental results, yes (Y), no (N).
• BDs indicates whether the experiment uses imbalanced data with more than
10,000 observations, yes (Y), no (N).
• CAs indicates how many machine learning models are used in the experiment.
6. Conclusions
In both theoretical research and practical application, imbalanced data classification
is a widespread phenomenon. When dealing with an imbalanced dataset, standard classification models may exhibit different degrees of defects and may thus become inefficient. To
analyze the performance impact of imbalanced data on machine learning models, we not
only analyzed the relationship between varying performance in machine learning models
and IR, but also analyzed the performance stability of the machine learning models on
imbalanced datasets. Specifically, we empirically evaluated the eight widely used machine
learning models (GNB, BNB, KNN, LR, RF, DT, GBDT and SVC) on 48 different imbalanced
datasets based on a proposed imbalanced data augmentation algorithm, AFG and CV AFG .
The experimental results demonstrate that the classification performance of LR, DT and SVC is unstable and easily affected by imbalanced data, while the classification performance of GNB, BNB, KNN, RF and GBDT is relatively stable and not susceptible to imbalanced
data. In particular, the BNB machine learning model has the most stable classification
performance. Statistical tests confirm the validity of the experimental results.
Because the method for analyzing the performance impact of imbalanced data on ma-
chine learning models proposed in this study is universal and will not be limited to a certain
field, it can be applied to imbalanced data classification in multiple fields, so as to guide
relevant researchers in choosing appropriate machine learning models when faced with
imbalanced data classification issues. For example, when there is no condition to improve
the distribution of imbalanced data or improve the machine learning models, machine
learning models with relatively stable performance can be selected for imbalanced data
classification, such as GNB, BNB, KNN, RF and GBDT. When we need to improve the ma-
chine learning models, we can select those algorithms that are unstable and easily affected
by imbalanced data, such as LR, DT and SVC. Clustering different imbalanced datasets
and using different validation techniques [36] to analyze the classification performance of
machine learning models will be the focus of our future work.
Institutional Review Board Statement: This work does not contain any studies with human partici-
pants or animals performed by any of the authors.
Acknowledgments: This work was supported by the Major Project of Natural Science Research in
Colleges and Universities of Anhui Province (KJ2021ZD0007); the 2021 cultivation project of Anhui
Normal University (2021xjxm049); Wuhu Science and Technology Bureau Project.
Conflicts of Interest: The authors declare that they have no conflict of interest.
References
1. Jing, X.-Y.; Zhang, X.; Zhu, X.; Wu, F.; You, X.; Gao, Y.; Shan, S.; Yang, J.-Y. Multiset feature learning for highly imbalanced data
classification. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 139–156. [CrossRef] [PubMed]
2. Zheng, M.; Li, T.; Zhu, R.; Tang, Y.; Tang, M.; Lin, L.; Ma, Z. Conditional Wasserstein generative adversarial network-gradient
penalty-based approach to alleviating imbalanced data classification. Inf. Sci. 2020, 512, 1009–1023. [CrossRef]
3. Zheng, M.; Li, T.; Zheng, X.; Yu, Q.; Chen, C.; Zhou, D.; Lv, C.; Yang, W. UFFDFR: Undersampling framework with denoising,
fuzzy c-means clustering, and representative sample selection for imbalanced data classification. Inf. Sci. 2021, 576, 658–680.
[CrossRef]
4. Liang, D.; Yi, B.; Cao, W.; Zheng, Q. Exploring ensemble oversampling method for imbalanced keyword extraction learning in
policy text based on three-way decisions and SMOTE. Expert Syst. Appl. 2022, 188, 116051. [CrossRef]
5. Kim, K.H.; Sohn, S.Y. Hybrid neural network with cost-sensitive support vector machine for class-imbalanced multimodal data.
Neural Netw. 2020, 130, 176–184. [CrossRef]
6. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell.
Res. 2002, 16, 321–357. [CrossRef]
7. Lunardon, N.; Menardi, G.; Torelli, N. ROSE: A Package for Binary Imbalanced Learning. R J. 2014, 6, 79–89. [CrossRef]
8. Al, S.; Dener, M. STL-HDL: A new hybrid network intrusion detection system for imbalanced dataset on big data environment.
Comput. Secur. 2021, 110, 102435. [CrossRef]
9. Raghuwanshi, B.S.; Shukla, S. SMOTE based class-specific extreme learning machine for imbalanced learning. Knowl.-Based Syst.
2020, 187, 104814. [CrossRef]
10. Sun, J.; Li, H.; Fujita, H.; Fu, B.; Ai, W. Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble
combined with SMOTE and time weighting. Inf. Fusion 2020, 54, 128–144. [CrossRef]
11. Pan, T.; Zhao, J.; Wu, W.; Yang, J. Learning imbalanced datasets based on SMOTE and Gaussian distribution. Inf. Sci. 2020, 512, 1214–1233.
[CrossRef]
12. Saini, M.; Susan, S. VGGIN-Net: Deep Transfer Network for Imbalanced Breast Cancer Dataset. IEEE/ACM Trans. Comput. Biol.
Bioinform. 2022. [CrossRef] [PubMed]
13. Zhu, Q.; Zhu, T.; Zhang, R.; Ye, H.; Sun, K.; Xu, Y.; Zhang, D. A Cognitive Driven Ordinal Preservation for Multi-Modal
Imbalanced Brain Disease Diagnosis. IEEE Trans. Cogn. Dev. Syst. 2022. [CrossRef]
14. Sun, Y.; Cai, L.; Liao, B.; Zhu, W.; Xu, J. A Robust Oversampling Approach for Class Imbalance Problem with Small Disjuncts.
IEEE Trans. Knowl. Data Eng. 2022. [CrossRef]
15. Douzas, G.; Bacao, F. Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning. Expert Syst. Appl. 2017, 82, 40–52.
[CrossRef]
16. Yu, Q.; Jiang, S.; Zhang, Y.; Wang, X.; Gao, P.; Qian, J. The impact study of class imbalance on the performance of software defect
prediction models. Chin. J. Comput. 2018, 41, 809–824.
17. Forkman, J. Estimator and tests for common coefficients of variation in normal distributions. Commun. Stat.—Theory Methods
2009, 38, 233–251. [CrossRef]
18. Fernandes, E.R.; de Carvalho, A.C.; Yao, X. Ensemble of classifiers based on multiobjective genetic sampling for imbalanced data.
IEEE Trans. Knowl. Data Eng. 2019, 32, 1104–1115. [CrossRef]
19. Lu, Y.; Cheung, Y.; Tang, Y.Y. Bayes Imbalance Impact Index: A Measure of Class Imbalanced Data Set for Classification Problem.
IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 3525–3539. [CrossRef]
20. Leski, J.M.; Czabański, R.; Jezewski, M.; Jezewski, J. Fuzzy Ordered c-Means Clustering and Least Angle Regression for Fuzzy
Rule-Based Classifier: Study for Imbalanced Data. IEEE Trans. Fuzzy Syst. 2019, 28, 2799–2813. [CrossRef]
21. Moraes, R.M.; Ferreira, J.A.; Machado, L.S. A New Bayesian Network Based on Gaussian Naive Bayes with Fuzzy Parameters for
Training Assessment in Virtual Simulators. Int. J. Fuzzy Syst. 2020, 23, 849–861. [CrossRef]
22. Raschka, S. Naive bayes and text classification i-introduction and theory. arXiv 2014, arXiv:1410.5329.
23. Shi, F.; Cao, H.; Zhang, X.; Chen, X. A Reinforced k-Nearest Neighbors Method with Application to Chatter Identification in High
Speed Milling. IEEE Trans. Ind. Electron. 2020, 67, 10844–10855. [CrossRef]
24. Adeli, E.; Li, X.; Kwon, D.; Zhang, Y.; Pohl, K. Logistic regression confined by cardinality-constrained sample and feature selection.
IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 1713–1728. [CrossRef]
25. Chai, Z.; Zhao, C. Enhanced random forest with concurrent analysis of static and dynamic nodes for industrial fault classification.
IEEE Trans. Ind. Inform. 2019, 16, 54–66. [CrossRef]
26. Esteve, M.; Aparicio, J.; Rabasa, A.; Rodriguez-Sala, J.J. Efficiency analysis trees: A new methodology for estimating production
frontiers through decision trees. Expert Syst. Appl. 2020, 162, 113783. [CrossRef]
27. Wen, Z.; Shi, J.; He, B.; Chen, J.; Ramamohanarao, K.; Li, Q. Exploiting GPUs for efficient gradient boosting decision tree training.
IEEE Trans. Parallel Distrib. Syst. 2019, 30, 2706–2717. [CrossRef]
28. Alam, S.; Sonbhadra, S.K.; Agarwal, S.; Nagabhushan, P. One-class support vector classifiers: A survey. Knowl.-Based Syst. 2020,
196, 105754. [CrossRef]
29. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.
Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
30. Li, L.; He, H.; Li, J. Entropy-based Sampling Approaches for Multi-class Imbalanced Problems. IEEE Trans. Knowl. Data Eng. 2020,
32, 2159–2170. [CrossRef]
31. Mazurowski, M.A.; Habas, P.A.; Zurada, J.M.; Lo, J.Y.; Baker, J.A.; Tourassi, G.D. Training neural network classifiers for medical
decision making: The effects of imbalanced datasets on classification performance. Neural Netw. 2008, 21, 427–436. [CrossRef]
[PubMed]
32. Loyola-González, O.; Martínez-Trinidad, J.F.; Carrasco-Ochoa, J.A.; García-Borroto, M. Study of the impact of resampling methods
for contrast pattern based classifiers in imbalanced databases. Neurocomputing 2016, 175, 935–947. [CrossRef]
33. Luque, A.; Carrasco, A.; Martín, A.; de las Heras, A. The impact of class imbalance in classification performance metrics based on
the binary confusion matrix. Pattern Recognit. 2019, 91, 216–231. [CrossRef]
34. Kovács, G. An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced
datasets. Appl. Soft Comput. 2019, 83, 105662. [CrossRef]
35. Thabtah, F.; Hammoud, S.; Kamalov, F.; Gonsalves, A. Data imbalance in classification: Experimental evaluation. Inf. Sci. 2020,
513, 429–441. [CrossRef]
36. Guarino, A.; Lettieri, N.; Malandrino, D.; Zaccagnino, R.; Capo, C. Adam or Eve? Automatic users’ gender classification via
gestures analysis on touch devices. Neural Comput. Appl. 2022, 34, 18473–18495. [CrossRef]