Article
A Method for Analyzing the Performance Impact of Imbalanced
Binary Data on Machine Learning Models
Ming Zheng 1,2, * , Fei Wang 1 , Xiaowen Hu 1 , Yuhao Miao 3 , Huo Cao 1 and Mingjing Tang 4,5, *
1 School of Computer and Information, Anhui Normal University, Wuhu 241002, China
2 Anhui Provincial Key Laboratory of Network and Information Security, Wuhu 241002, China
3 Affiliated Institution of Anhui Normal University, Wuhu 241002, China
4 School of Life Science, Yunnan Normal University, Kunming 650500, China
5 Engineering Research Center of Sustainable Development and Utilization of Biomass Energy,
Ministry of Education, Yunnan Normal University, Kunming 650500, China
* Correspondence: mzheng@ahnu.edu.cn (M.Z.); tmj@ynnu.edu.cn (M.T.)
Abstract: Machine learning models may not be able to effectively learn and predict from imbalanced
data in the fields of machine learning and data mining. This study proposed a method for analyzing
the performance impact of imbalanced binary data on machine learning models. It systematically
analyzes 1. the relationship between varying performance in machine learning models and imbalance
rate (IR); 2. the performance stability of machine learning models on imbalanced binary data. In
the proposed method, the imbalanced data augmentation algorithms are first designed to obtain the
imbalanced dataset with gradually varying IR. Then, in order to obtain more objective classification
results, the evaluation metric AFG, arithmetic mean of area under the receiver operating characteristic
curve (AUC), F-measure and G-mean are used to evaluate the classification performance of machine
learning models. Finally, based on AFG and coefficient of variation (CV), the performance stability
evaluation method of machine learning models is proposed. Experiments of eight widely used
machine learning models on 48 different imbalanced datasets demonstrate that the classification
performance of machine learning models decreases with the increase of IR on the same imbalanced
data. Meanwhile, the classification performances of LR, DT and SVC are unstable, while GNB, BNB, KNN, RF and GBDT are relatively stable and not susceptible to imbalanced data. In particular, the BNB has the most stable classification performance. The Friedman and Nemenyi post hoc statistical tests also confirmed this result. The SMOTE method is used in oversampling-based imbalanced data augmentation, and determining whether other oversampling methods can obtain consistent results needs further research. In the future, an imbalanced data augmentation algorithm based on undersampling and hybrid sampling should be used to analyze the performance impact of imbalanced binary data on machine learning models.
Keywords: machine learning models; imbalanced data; machine learning; data mining; performance impact
MSC: 68T09
Citation: Zheng, M.; Wang, F.; Hu, X.; Miao, Y.; Cao, H.; Tang, M. A Method for Analyzing the Performance Impact of Imbalanced Binary Data on Machine Learning Models. Axioms 2022, 11, 607. https://doi.org/10.3390/axioms11110607
Academic Editor: Jong-Min Kim
Received: 19 September 2022; Accepted: 28 October 2022; Published: 1 November 2022
1. Introduction
In the fields of data mining and machine learning, imbalanced data classification is a ubiquitous natural phenomenon. Imbalanced data classification is a kind of supervised learning in which the distribution of response variables in the dataset varies greatly across classes. In binary or multiclass datasets affected by imbalanced data classification, a response variable with fewer samples is referred to as a positive class or a minority class, whereas a response variable with more samples is known as a negative class or a majority class. Because machine learning models are based on situations in which the data distribution is relatively balanced, these machine learning models may experience different degrees of defects when faced with imbalanced data classification and
may thus become inefficient [1]. For example, suppose there is a patient dataset containing
990 normal patients and 10 cancer patients. If we do not modify the machine learning
model or improve the distribution of data and use the machine learning model directly,
it will tend to predict that all patients in the dataset are normal. This can lead to a terrible situation in which cancer patients miss the optimal treatment time because they cannot be accurately identified, thus endangering their lives. Therefore,
enhancing the analysis and understanding of machine learning models for imbalanced
data classification has important theoretical significance and application value [2].
Current studies on imbalanced data classification are mainly concerned with two approaches [3]. The first involves designing new or improving existing machine learning models, which generally entails reducing the sensitivity of the classification algorithm to
imbalanced data. Typically, either ensemble learning is used to increase the robustness of
the machine learning models [4] or a cost-sensitive learning method [5] is used to make the
cost of misclassifying the minority class higher than that of misclassifying the majority class.
The second approach is to use a sampling method to balance the dataset on the data level,
mainly via oversampling [6,7], undersampling [3] or hybrid sampling [8]. The purpose of
oversampling is to increase the number of samples in the minority class, thus improving the
distribution of data among classes. Undersampling has the same purpose as oversampling,
but instead removes samples from the majority class. Finally, hybrid sampling is intended
to balance the dataset by combining the two aforementioned sampling methods.
This study aims to provide a method to analyze the performance impact of imbalanced
binary data on machine learning models. Hence, proposing new techniques addressing
imbalanced data classification is not the focus. The method proposed in this study not
only analyzes the relationship between varying performance in machine learning models
and IR, but also analyzes the performance stability of the machine learning models on the
imbalanced datasets. The main contributions of this study can be summarized as follows.
(1) To obtain the imbalanced dataset with gradually varying IR and belonging to
the same distribution, this study proposes three different augmentation algorithms of
imbalanced data by combining the oversampling method, the undersampling method and
the hybrid sampling method, respectively.
(2) This study proposes a performance evaluation metric AFG by analyzing and
combining evaluation metrics AUC, F-measure and G-mean.
(3) Our comparative study systematically analyzes the relationship between varying
performance in machine learning models and IR, as well as the performance stability of
eight machine learning models on 48 benchmark imbalanced datasets, which can provide
an important reference value for imbalanced data classification application developers and
researchers.
The remainder of this study is organized as follows. Section 2 describes the proposed
approach in detail. Experiment settings are given in Section 3, and the experimental results
are discussed in Section 4. The related works are discussed in Section 5. Finally, Section 6
briefly summarizes our study and presents the conclusions.
2. Proposed Method
The overall framework of the method for analyzing the performance impact of imbal-
anced binary data on machine learning models is shown in Figure 1.
Specifically, in the framework, a group of imbalanced datasets with decreasing IR is first generated by the augmentation algorithms. How to augment an
imbalanced dataset with decreasing IR will be described in detail in Section 2.1. Then, in
Section 2.2, in order to obtain the relationship between varying performance in machine
learning models and IR, we use AFG, the arithmetic mean of AUC, F-measure and G-mean,
to evaluate the classification performance of machine learning models. Finally, in Section 2.3,
the performance stability of machine learning models on imbalanced datasets is evaluated
by combining AFG and CV. Meanwhile, statistical tests are applied to further verify whether
the performance stability of these machine learning models is significantly different.
Figure 1. Overall framework of the proposed method.
Algorithm 1 first divides the original imbalanced data into majority class samples Tmajority and minority class samples Tminority according to the class label of samples. Among them, the number of minority class samples is n1, and the number of majority class samples is n2. Then, the IR of the original imbalanced data T is calculated and denoted with r; note that r is the value rounded down. Traversing r, each traversal will calculate the oversampling ratio. According to the oversampling ratio and the original imbalanced data T, the oversampling approach is used to generate minority class samples, the generated minority class samples are merged with the original imbalanced data T to get new imbalanced data, and the above steps are repeated until the end of the loop to finally obtain the augmented imbalanced dataset Taugment (a group of imbalanced datasets with decreasing IR). The relationship between the augmentation process of imbalanced data and IR change in Algorithm 1 is shown in Figure 2.
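As an illustration, the following Python sketch implements an oversampling-based augmentation in the spirit of Algorithm 1, assuming imbalanced-learn's SMOTE as the oversampling approach and assuming that the target IR is lowered by one unit per iteration (r − 1, r − 2, ..., 1); the exact oversampling ratio prescribed by Algorithm 1 may differ.

# Sketch of an oversampling-based imbalanced data augmentation (in the spirit of
# Algorithm 1). Assumptions: binary labels, SMOTE as the oversampling approach,
# target IR decreasing by one unit per iteration.
from collections import Counter
from imblearn.over_sampling import SMOTE

def augment_by_oversampling(X, y, random_state=0):
    counts = Counter(y)
    n1, n2 = min(counts.values()), max(counts.values())   # minority, majority sizes
    r = n2 // n1                                           # IR rounded down
    augmented = []
    for target_ir in range(r - 1, 0, -1):                  # target IRs: r-1, ..., 1
        # sampling_strategy = desired minority/majority ratio after resampling
        smote = SMOTE(sampling_strategy=1.0 / target_ir, random_state=random_state)
        X_res, y_res = smote.fit_resample(X, y)
        augmented.append((X_res, y_res))
    return augmented                                       # datasets with decreasing IR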
Similarly, Algorithm 2 first divides the original imbalanced data into majority class
samples Tmajority and minority class samples Tminority according to the class of samples.
Among them, the number of minority class samples is n1 , and the number of majority
class samples is n2 . Then, the IR of the original imbalanced data T is calculated and
denoted with r; note that r is the value rounded down again. Traversing r, each traversal
will calculate the undersampling ratio. According to the undersampling ratio and original
imbalanced data T, the undersampling approach is used to delete majority class samples,
remove deleted majority class samples from the original imbalanced data T to get new
imbalanced data and the above steps are repeated until the end of the loop to finally
obtain the augmented imbalanced dataset Taugment (a group of imbalanced datasets with
decreasing IR). The relationship between the augmentation process of imbalanced data and
IR change in Algorithm 2 is shown in Figure 3.
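An undersampling-based counterpart, analogous to Algorithm 2, can be sketched in the same way; here imbalanced-learn's RandomUnderSampler stands in for the undersampling approach, and the per-iteration ratio is again an assumption rather than the exact formula of Algorithm 2.

# Sketch of an undersampling-based augmentation (analogous to Algorithm 2).
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

def augment_by_undersampling(X, y, random_state=0):
    counts = Counter(y)
    n1, n2 = min(counts.values()), max(counts.values())
    r = n2 // n1
    augmented = []
    for target_ir in range(r - 1, 0, -1):                  # target IRs: r-1, ..., 1
        rus = RandomUnderSampler(sampling_strategy=1.0 / target_ir,
                                 random_state=random_state)
        X_res, y_res = rus.fit_resample(X, y)              # remove majority samples
        augmented.append((X_res, y_res))
    return augmented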
Last, the imbalanced data augmentation algorithm based on the hybrid sampling method is introduced. After applying Algorithm 3, the augmented imbalanced dataset with decreasing IR can be obtained.

Algorithm 3 Hybrid sampling based imbalanced data augmentation
Input: T: Original imbalanced data.
Output: Taugment: Augmented imbalanced dataset.
Procedure Begin
1. T ← Tmajority ∪ Tminority
2. n1 ← size of Tminority
3. n2 ← size of Tmajority
4. r ← int(n2/n1)
5. for i = 0 to r − 1 do
6.   oversampling ratio ← (n1 + n2)/rn2
7.   undersampling ratio ← rn1/[(n1 + n2)(r − 1)]
8.   generated minority class samples ← oversampling approach (T, oversampling ratio)
9.   deleted majority class samples ← undersampling approach (T, undersampling ratio)
10.  Taugment[i] ← generated minority class samples ∪ (T − deleted majority class samples)
11.  return Taugment[i]
12. end for
13. return Taugment
14. End

Algorithm 3 first divides the original imbalanced data into majority class samples Tmajority and minority class samples Tminority according to the class of samples. Among them, the number of minority class samples is n1, and the number of majority class samples is n2. Then, the IR of the original imbalanced data T is calculated and denoted with r; note that r is the value rounded down. Traversing r, each traversal will calculate the oversampling ratio and the undersampling ratio. According to the oversampling ratio, the undersampling ratio and the original imbalanced data T, the oversampling approach and the undersampling approach are used to generate the minority class samples and delete the majority class samples, respectively. The generated minority class samples are merged and the deleted majority class samples are removed from the original imbalanced data T to obtain new imbalanced data, and the above steps are repeated until the end of the loop, to finally obtain the augmented imbalanced dataset with decreasing IR. The relationship between the augmentation process of imbalanced data and IR change in Algorithm 3 is shown in Figure 4.
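For illustration, a hybrid-sampling step can be sketched by combining the two resamplers. In the sketch below the IR reduction is split evenly (geometrically) between SMOTE oversampling and random undersampling; this split is an assumption made for illustration, whereas Algorithm 3 prescribes its own oversampling and undersampling ratios.

# Sketch of a hybrid-sampling augmentation (in the spirit of Algorithm 3): each
# iteration both generates synthetic minority samples and removes majority samples
# until the target IR is reached. The even split between the two resamplers is an
# assumption of this sketch.
import math
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def augment_by_hybrid_sampling(X, y, random_state=0):
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    n1, n2 = counts[minority], counts[majority]
    r = n2 // n1
    augmented = []
    for target_ir in range(r - 1, 0, -1):                   # target IRs: r-1, ..., 1
        n1_new = int(math.sqrt(n1 * n2 / target_ir))        # grow the minority part-way
        n2_new = target_ir * n1_new                          # shrink the majority the rest
        X_over, y_over = SMOTE(sampling_strategy={minority: n1_new},
                               random_state=random_state).fit_resample(X, y)
        X_res, y_res = RandomUnderSampler(sampling_strategy={majority: n2_new},
                                          random_state=random_state).fit_resample(X_over, y_over)
        augmented.append((X_res, y_res))
    return augmented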
It should be noted that the three imbalanced data augmentation algorithms have good flexibility. This is because the resampling methods of the three augmentation algorithms are not fixed, and can be arbitrary oversampling, undersampling and hybrid sampling methods, so they are represented in italics. Meanwhile, the above three augmentation algorithms have the same purpose and are all designed to augment imbalanced data into a group of imbalanced datasets with decreasing IR. As long as the resampling methods in the above augmentation algorithms are good enough, the augmented data and the original imbalanced data belong to the same distribution. Because the SMOTE (synthetic minority oversampling technique) [6] is a classical and widely used oversampling method in the studies of imbalanced data classification [9–11], this study uses a SMOTE-based augmentation algorithm to augment the imbalanced binary data.

2.2. Performance Evaluation Metric
The evaluation metrics AUC, F-measure and G-mean are widely used to evaluate the classification performance of machine learning models for imbalanced data classification [12–14]. To facilitate the introduction of the calculation rules of the evaluation metrics, the confusion matrix was first established, as detailed in Table 1.

Table 1. Binary classification confusion matrix.

                    Predicted Positive       Predicted Negative
Actual positive     True positives (TP)      False negatives (FN)
Actual negative     False positives (FP)     True negatives (TN)
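The AFG score used throughout this study is the arithmetic mean of AUC, F-measure and G-mean. A minimal Python sketch of its computation follows, assuming 0/1 class labels, the F-measure taken as the F1-score of the positive (minority) class, and the G-mean taken as the geometric mean of the recalls of the two classes.

# Sketch of the AFG evaluation metric: the arithmetic mean of AUC, F-measure and
# G-mean. Assumptions: binary 0/1 labels, F-measure = F1 of the positive class,
# G-mean = sqrt(TPR * TNR).
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, recall_score

def afg_score(y_true, y_pred, y_score, positive_label=1):
    auc = roc_auc_score(y_true, y_score)                            # AUC from scores
    f_measure = f1_score(y_true, y_pred, pos_label=positive_label)
    tpr = recall_score(y_true, y_pred, pos_label=positive_label)    # minority recall
    tnr = recall_score(y_true, y_pred, pos_label=1 - positive_label)  # majority recall
    g_mean = np.sqrt(tpr * tnr)
    return (auc + f_measure + g_mean) / 3.0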
3. Experiment Settings
The benchmark imbalanced data are described in Section 3.1. We briefly introduce the
eight machine learning models for training and classifying imbalanced data in Section 3.2.
Section 3.3 explains the experimental flow design. Finally, Section 3.4 introduces the statistical test method.
Figure 6. Experimental flowchart (the imbalanced data are randomly split into a 90% training set and a 10% testing set, and this process is performed 100 times).
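As a sketch of this flow, the snippet below repeats a random 90%/10% split 100 times, computes AFG on each testing set (using the afg_score sketch from Section 2.2), and summarizes stability by the coefficient of variation of the AFG scores; whether the splits are stratified and whether CVAFG is reported in percent are assumptions of this sketch.

# Sketch of the experimental flow (Figure 6) and of the CV_AFG stability measure:
# 100 random 90%/10% splits, AFG on each testing set, then CV = std / mean (in %).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB   # any of the eight models could be used

def cv_afg(X, y, model=None, runs=100):
    model = model if model is not None else BernoulliNB()
    scores = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1,
                                                  stratify=y, random_state=seed)
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_te)
        y_score = model.predict_proba(X_te)[:, 1]
        scores.append(afg_score(y_te, y_pred, y_score))        # AFG per repetition
    scores = np.asarray(scores)
    return 100.0 * scores.std(ddof=1) / scores.mean()          # CV_AFG in percent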
3.4. Statistical Test Method
This study employed non-parametric testing to analyze and compare whether there are significant differences in the performance stability of machine learning models on imbalanced datasets. These tests have been used in several empirical studies and are highly recommended in the field of machine learning and data mining [20,30] to confirm experimental results. The non-parametric test procedure consists of three steps. First, ranking scores are computed by assigning a rank score to each machine learning model in each imbalanced dataset. Because the smaller the CVAFG, the more stable the performance of the machine learning model, 8 is assigned to the most unstable machine learning model, 7 to the second most unstable machine learning model, and so on. The ranking score of the most stable machine learning model is 1. Then, the mean ranking scores of eight
machine learning models on 48 imbalanced datasets are computed. Next, the Friedman test
is used to determine whether these machine learning models deliver the same performance
stability. If performance stability differs, the hypothesis that all machine learning models
have the same performance stability is rejected; if the performance stability of the machine
learning model is significantly different, a post hoc test is needed to further distinguish
each machine learning model. Finally, when the hypothesis that all machine learning
models have the same performance stability is rejected, the Nemenyi post hoc test is
applied to check whether the control machine learning model (usually the most stable
one) significantly outperforms the remaining machine learning models. The Nemenyi
post hoc procedures enable calculating the critical distance of the mean ranking score
difference. If the difference between the mean ranking scores of the two machine learning
models exceeds the critical distance, the hypothesis that the performance stability of the
two machine learning models is the same is rejected at a specified level of significance α
(i.e., there exist significant differences); in this study, α = 0.05.
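A minimal sketch of this procedure is given below, assuming scipy's Friedman test and the usual Nemenyi critical distance CD = q_alpha * sqrt(k(k + 1)/(6N)) for k models and N datasets; the critical value q_alpha for k = 8 at alpha = 0.05 (about 3.031) is taken from standard tables and should be treated as an assumption of the sketch.

# Sketch of the statistical test procedure: rank CV_AFG per dataset (rank 1 = most
# stable), apply the Friedman test across datasets, and compute the Nemenyi
# critical distance for pairwise comparison of mean ranks.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_nemenyi(cv_matrix, q_alpha=3.031):
    """cv_matrix has shape (N datasets, k models) and holds CV_AFG values."""
    n_datasets, k = cv_matrix.shape
    ranks = np.vstack([rankdata(row) for row in cv_matrix])   # smaller CV -> lower rank
    mean_ranks = ranks.mean(axis=0)                           # mean rank per model
    stat, p_value = friedmanchisquare(*cv_matrix.T)           # one argument per model
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))  # Nemenyi critical distance
    return mean_ranks, p_value, cd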
Figure 7. Relationship between varying performance in GNB, BNB, KNN, LR and IR.
Figure 8. Relationship between varying performance in RF, DT, GBDT, SVC and IR.
[Figure: CV (%) of the eight machine learning models (GNB, BNB, KNN, LR, RF, DT, GBDT and SVC) on each imbalanced dataset.]
Table 3. Mean ranking scores of eight machine learning models with CVAFG on 48 imbalanced datasets. Bold values indicate the best machine learning model for each row.

Algorithms   GNB      BNB      KNN      LR       RF       DT       GBDT     SVC
CVAFG        3.4583   3.4063   3.9167   6.0000   4.1250   5.6354   3.8438   5.6146
Figure 10. Result of the Nemenyi post hoc test.

Generally, the control method is the optimal method. In this study, the control method is the machine learning model with the most stable classification performance, and therefore the machine learning model BNB is the control method. Figure 10 reveals that the performance stability of BNB is better than that of the other seven machine learning models. Moreover, the performance stability of BNB is significantly better than that of the three machine learning models LR, DT and SVC.

5. Related Works
At present, the issue of imbalanced data classification has attracted wide attention
in the field of artificial intelligence and data mining. In view of the performance impact
of imbalanced data on machine learning models, researchers have also carried out lots of
exploratory work.
Mazurowski et al. [31] explored two methods of neural network training: classical
backpropagation (BP) and particle swarm optimization (PSO) with clinically relevant
training criteria, and used the simulation data and the real clinical data of breast cancer
diagnosis to verify that the performance of the classification algorithm will deteriorate even
if there is a slight imbalance in the training data. The experimental results further show
that the BP algorithm is better than the PSO algorithm for imbalanced data, especially for
imbalanced data with smaller samples and more features.
Loyola-González et al. [32] analyzed and studied the performance impact of using
resampling methods for contrast pattern-based classifiers in imbalanced data classification
issues. Experimental results show that there are statistically significant differences between
using the contrast pattern-based classifiers before and after applying resampling methods.
Yu et al. [16] proposed an approach to analyzing the impact of class imbalance in
the field of software defect prediction. In this method, the original imbalanced data is
transformed into a set of new datasets with increasing IR by the undersampling approach.
The AUC evaluation metric and CV were used to evaluate the performance of the prediction
models. The experimental results show that the performance of C4.5, Ripper and SMO
prediction models decreases with the increase of IR, while the classification performance of
Logistic Regression, Naive Bayes and Random Forest prediction models is more stable.
Luque et al. [33] conducted extensive and systematic research on the impact of class
imbalance on classification performance measurement through the simulation results
obtained by binary classifiers. A new performance measurement method of imbalanced
data based on the binary confusion matrix is defined. From the simulation results, several
clusters of performance metrics have been identified that involve the use of G-mean or
Bookmaker Informedness as the best null-biased metrics if their focus on classification
successes presents no limitation for the specific application where they are used. However,
if classification errors must also be considered, then the Matthews correlation coefficient
arises as the best choice.
Lu et al. [19] took the Bayesian optimal classifier as the research object, and theoretically
studied the influence of class imbalance on classification results. They proposed a data
measure called the Bayes Imbalance Impact Index (BI3 ). The experiment shows that BI3
can be used as a standard to explain the impact of imbalance on data classification.
Kovács [34] presented a detailed empirical comparison of 85 variants of minority oversampling techniques, involving 104 imbalanced datasets for evaluation.
The goal of this work is to set a new baseline in the field and determine the oversampling
principles leading to the best results under general circumstances.
Thabtah et al. [35] studied the impact of varying class imbalance ratios on classifier
accuracy, by highlighting the precise nature of the relationship between the degree of class
imbalance and the corresponding effects on classifier performance. They hope to help
researchers to better tackle the problem. The experiments use 10-fold cross-validation on a
large number of datasets and determine that the relationship between the class imbalance
ratio and the accuracy is convex.
A comparative summary of previous efforts in this field is provided in Table 4. The
columns of the table correspond to the following criteria.
• MFs indicates whether this approach is validated on the imbalanced data from multi-
ple fields, yes (Y), no (N).
• EMs indicates whether this approach uses multiple evaluation metrics to obtain more
objective experimental results, yes (Y), no (N).
• BDs indicates whether the experiment uses imbalanced data with more than
10,000 observations, yes (Y), no (N).
• CAs indicates how many machine learning models are used in the experiment.
6. Conclusions
In both theoretical research and practical application, imbalanced data classification
is a widespread phenomenon. When dealing with an imbalanced dataset, standard classification models may exhibit different degrees of defects and may thus become inefficient. To
analyze the performance impact of imbalanced data on machine learning models, we not
only analyzed the relationship between varying performance in machine learning models
and IR, but also analyzed the performance stability of the machine learning models on
imbalanced datasets. Specifically, we empirically evaluated the eight widely used machine
learning models (GNB, BNB, KNN, LR, RF, DT, GBDT and SVC) on 48 different imbalanced
datasets based on a proposed imbalanced data augmentation algorithm, AFG and CV AFG .
The experimental results demonstrate that the classification performance of LR, DT and SVC is unstable and easily affected by imbalanced data, while the classification performance of GNB, BNB, KNN, RF and GBDT is relatively stable and not susceptible to imbalanced
data. In particular, the BNB machine learning model has the most stable classification
performance. Statistical tests confirm the validity of the experimental results.
Because the method for analyzing the performance impact of imbalanced data on ma-
chine learning models proposed in this study is universal and will not be limited to a certain
field, it can be applied to imbalanced data classification in multiple fields, so as to guide
relevant researchers in choosing appropriate machine learning models when faced with
imbalanced data classification issues. For example, when there is no condition to improve
the distribution of imbalanced data or improve the machine learning models, machine
learning models with relatively stable performance can be selected for imbalanced data
classification, such as GNB, BNB, KNN, RF and GBDT. When we need to improve the ma-
chine learning models, we can select those algorithms that are unstable and easily affected
by imbalanced data, such as LR, DT and SVC. Clustering different imbalanced datasets
and using different validation techniques [36] to analyze the classification performance of
machine learning models will be the focus of our future work.
Institutional Review Board Statement: This work does not contain any studies with human partici-
pants or animals performed by any of the authors.
Acknowledgments: This work was supported by the Major Project of Natural Science Research in
Colleges and Universities of Anhui Province (KJ2021ZD0007); the 2021 cultivation project of Anhui
Normal University (2021xjxm049); Wuhu Science and Technology Bureau Project.
Conflicts of Interest: The authors declare that they have no conflict of interest.
References
1. Jing, X.-Y.; Zhang, X.; Zhu, X.; Wu, F.; You, X.; Gao, Y.; Shan, S.; Yang, J.-Y. Multiset feature learning for highly imbalanced data
classification. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 139–156. [CrossRef] [PubMed]
2. Zheng, M.; Li, T.; Zhu, R.; Tang, Y.; Tang, M.; Lin, L.; Ma, Z. Conditional Wasserstein generative adversarial network-gradient
penalty-based approach to alleviating imbalanced data classification. Inf. Sci. 2020, 512, 1009–1023. [CrossRef]
3. Zheng, M.; Li, T.; Zheng, X.; Yu, Q.; Chen, C.; Zhou, D.; Lv, C.; Yang, W. UFFDFR: Undersampling framework with denoising,
fuzzy c-means clustering, and representative sample selection for imbalanced data classification. Inf. Sci. 2021, 576, 658–680.
[CrossRef]
4. Liang, D.; Yi, B.; Cao, W.; Zheng, Q. Exploring ensemble oversampling method for imbalanced keyword extraction learning in
policy text based on three-way decisions and SMOTE. Expert Syst. Appl. 2022, 188, 116051. [CrossRef]
5. Kim, K.H.; Sohn, S.Y. Hybrid neural network with cost-sensitive support vector machine for class-imbalanced multimodal data.
Neural Netw. 2020, 130, 176–184. [CrossRef]
6. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell.
Res. 2002, 16, 321–357. [CrossRef]
7. Lunardon, N.; Menardi, G.; Torelli, N. ROSE: A Package for Binary Imbalanced Learning. R J. 2014, 6, 79–89. [CrossRef]
8. Al, S.; Dener, M. STL-HDL: A new hybrid network intrusion detection system for imbalanced dataset on big data environment.
Comput. Secur. 2021, 110, 102435. [CrossRef]
9. Raghuwanshi, B.S.; Shukla, S. SMOTE based class-specific extreme learning machine for imbalanced learning. Knowl.-Based Syst.
2020, 187, 104814. [CrossRef]
10. Sun, J.; Li, H.; Fujita, H.; Fu, B.; Ai, W. Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble
combined with SMOTE and time weighting. Inf. Fusion 2020, 54, 128–144. [CrossRef]
11. Pan, T.; Zhao, J.; Wu, W.; Yang, J. Learning imbalanced datasets based on SMOTE and Gaussian distribution. Inf. Sci. 2020, 512, 1214–1233.
[CrossRef]
12. Saini, M.; Susan, S. VGGIN-Net: Deep Transfer Network for Imbalanced Breast Cancer Dataset. IEEE/ACM Trans. Comput. Biol.
Bioinform. 2022. [CrossRef] [PubMed]
13. Zhu, Q.; Zhu, T.; Zhang, R.; Ye, H.; Sun, K.; Xu, Y.; Zhang, D. A Cognitive Driven Ordinal Preservation for Multi-Modal
Imbalanced Brain Disease Diagnosis. IEEE Trans. Cogn. Dev. Syst. 2022. [CrossRef]
14. Sun, Y.; Cai, L.; Liao, B.; Zhu, W.; Xu, J. A Robust Oversampling Approach for Class Imbalance Problem with Small Disjuncts.
IEEE Trans. Knowl. Data Eng. 2022. [CrossRef]
15. Douzas, G.; Bacao, F. Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning. Expert Syst. Appl. 2017, 82, 40–52.
[CrossRef]
16. Yu, Q.; Jiang, S.; Zhang, Y.; Wang, X.; Gao, P.; Qian, J. The impact study of class imbalance on the performance of software defect
prediction models. Chin. J. Comput. 2018, 41, 809–824.
17. Forkman, J. Estimator and tests for common coefficients of variation in normal distributions. Commun. Stat.—Theory Methods
2009, 38, 233–251. [CrossRef]
18. Fernandes, E.R.; de Carvalho, A.C.; Yao, X. Ensemble of classifiers based on multiobjective genetic sampling for imbalanced data.
IEEE Trans. Knowl. Data Eng. 2019, 32, 1104–1115. [CrossRef]
19. Lu, Y.; Cheung, Y.; Tang, Y.Y. Bayes Imbalance Impact Index: A Measure of Class Imbalanced Data Set for Classification Problem.
IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 3525–3539. [CrossRef]
20. Leski, J.M.; Czabański, R.; Jezewski, M.; Jezewski, J. Fuzzy Ordered c-Means Clustering and Least Angle Regression for Fuzzy
Rule-Based Classifier: Study for Imbalanced Data. IEEE Trans. Fuzzy Syst. 2019, 28, 2799–2813. [CrossRef]
21. Moraes, R.M.; Ferreira, J.A.; Machado, L.S. A New Bayesian Network Based on Gaussian Naive Bayes with Fuzzy Parameters for
Training Assessment in Virtual Simulators. Int. J. Fuzzy Syst. 2020, 23, 849–861. [CrossRef]
22. Raschka, S. Naive bayes and text classification i-introduction and theory. arXiv 2014, arXiv:1410.5329.
23. Shi, F.; Cao, H.; Zhang, X.; Chen, X. A Reinforced k-Nearest Neighbors Method with Application to Chatter Identification in High
Speed Milling. IEEE Trans. Ind. Electron. 2020, 67, 10844–10855. [CrossRef]
24. Adeli, E.; Li, X.; Kwon, D.; Zhang, Y.; Pohl, K. Logistic regression confined by cardinality-constrained sample and feature selection.
IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 1713–1728. [CrossRef]
25. Chai, Z.; Zhao, C. Enhanced random forest with concurrent analysis of static and dynamic nodes for industrial fault classification.
IEEE Trans. Ind. Inform. 2019, 16, 54–66. [CrossRef]
26. Esteve, M.; Aparicio, J.; Rabasa, A.; Rodriguez-Sala, J.J. Efficiency analysis trees: A new methodology for estimating production
frontiers through decision trees. Expert Syst. Appl. 2020, 162, 113783. [CrossRef]
27. Wen, Z.; Shi, J.; He, B.; Chen, J.; Ramamohanarao, K.; Li, Q. Exploiting GPUs for efficient gradient boosting decision tree training.
IEEE Trans. Parallel Distrib. Syst. 2019, 30, 2706–2717. [CrossRef]
28. Alam, S.; Sonbhadra, S.K.; Agarwal, S.; Nagabhushan, P. One-class support vector classifiers: A survey. Knowl.-Based Syst. 2020,
196, 105754. [CrossRef]
29. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.
Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
30. Li, L.; He, H.; Li, J. Entropy-based Sampling Approaches for Multi-class Imbalanced Problems. IEEE Trans. Knowl. Data Eng. 2020,
32, 2159–2170. [CrossRef]
31. Mazurowski, M.A.; Habas, P.A.; Zurada, J.M.; Lo, J.Y.; Baker, J.A.; Tourassi, G.D. Training neural network classifiers for medical
decision making: The effects of imbalanced datasets on classification performance. Neural Netw. 2008, 21, 427–436. [CrossRef]
[PubMed]
32. Loyola-González, O.; Martínez-Trinidad, J.F.; Carrasco-Ochoa, J.A.; García-Borroto, M. Study of the impact of resampling methods
for contrast pattern based classifiers in imbalanced databases. Neurocomputing 2016, 175, 935–947. [CrossRef]
33. Luque, A.; Carrasco, A.; Martín, A.; de las Heras, A. The impact of class imbalance in classification performance metrics based on
the binary confusion matrix. Pattern Recognit. 2019, 91, 216–231. [CrossRef]
34. Kovács, G. An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced
datasets. Appl. Soft Comput. 2019, 83, 105662. [CrossRef]
35. Thabtah, F.; Hammoud, S.; Kamalov, F.; Gonsalves, A. Data imbalance in classification: Experimental evaluation. Inf. Sci. 2020,
513, 429–441. [CrossRef]
36. Guarino, A.; Lettieri, N.; Malandrino, D.; Zaccagnino, R.; Capo, C. Adam or Eve? Automatic users’ gender classification via
gestures analysis on touch devices. Neural Comput. Appl. 2022, 34, 18473–18495. [CrossRef]