Article history:
Received 18 November 2022
Revised 8 January 2023
Accepted 19 January 2023
Available online 1 February 2023

Keywords:
Ensemble learning
Ensemble methods
Machine learning
Deep learning
Ensemble deep learning

Abstract
In machine learning, two approaches outperform traditional algorithms: ensemble learning and deep learning. The former refers to methods that integrate multiple base models in the same framework to obtain a stronger model that outperforms them all. The success of an ensemble method depends on several factors, including how the baseline models are trained and how they are combined. In the literature, there are common approaches to building an ensemble model that have been successfully applied in several domains. On the other hand, deep learning-based models have improved the predictive accuracy of machine learning across a wide range of domains. Despite the diversity of deep learning architectures, their ability to deal with complex problems, and their ability to extract features automatically, the main challenge in deep learning is that tuning the optimal hyperparameters requires a great deal of expertise and experience, which makes it a tedious and time-consuming task. Numerous recent research efforts have been made to bring ensemble learning into deep learning to overcome this challenge. Most of these efforts focus on simple ensemble methods that have some limitations. Hence, this review paper provides a comprehensive review of the various strategies for ensemble learning, especially in the case of deep learning. It also explains in detail the various features or factors that influence the success of ensemble methods. In addition, it presents and accurately categorizes several research efforts that used ensemble learning in a wide range of domains.
© 2023 Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Contents
1. Introduction 758
2. Trends of ensemble learning 758
3. Foundations of ensemble learning 760
3.1. Data sampling 761
3.2. Training baseline classifiers 762
3.3. Fusion method 762
3.3.1. Voting method 762
3.3.2. Meta learning method 763
4. Ensemble methods 764
4.1. Common ensemble methods 764
4.1.1. Bagging 764
4.1.2. Boosting 764
4.1.3. Stacking 764
https://doi.org/10.1016/j.jksuci.2023.01.014
1319-1578/© 2023 Published by Elsevier B.V. on behalf of King Saud University.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
titles, abstract, and keywords. Fig. 1 shows the number of articles published for the search term "Ensemble Learning" each year in the abovementioned period. The figure shows that the number of articles found using this term was estimated at 25,262, indicating an increase in the ensemble learning trend over several years. In addition, Fig. 2 shows the number of articles that discussed the search term "Ensemble Learning" in all fields. From the figure, it can be noted that the field of computer sciences has the highest estimated number of articles mentioned, estimated at 16,782 documents. Fig. 3 shows the number of articles published for the search term "Ensemble Deep Learning" each year in the abovementioned period. The figure shows that the number of articles found using this term was estimated at 6,173, indicating increased interest from researchers in this trend. Also, Fig. 4 shows the number of articles that discussed the search term "Ensemble Deep Learning" in all fields. From the figure, it can be noted that the field of computer sciences has the highest estimated number of articles mentioned, estimated at 4,520 documents.

Fig. 1. The trends of the search term "Ensemble Learning" in Scopus from 2014 to 2021 (Scopus, 2023).
Fig. 2. The different fields of the search term "Ensemble Learning" in Scopus from 2014 to 2021 (Scopus, 2023).
Fig. 3. The trends of the search term "Ensemble Deep Learning" in Scopus from 2014 to 2021 (Scopus, 2023).
Fig. 4. The different fields of the search term "Ensemble Deep Learning" in Scopus from 2014 to 2021 (Scopus, 2023).

According to the above statistical information, it is clear that research in ensemble learning and ensemble deep learning is growing faster each year due to its ability to improve prediction performance. According to estimates, the largest number of articles using "Ensemble Learning" and "Ensemble Deep Learning" appeared in 2021, with roughly 7,160 and 2,340 documents, respectively. In addition, ensemble learning and deep ensemble learning have been applied in several fields, especially computer science, which shows the highest utilization rate of ensemble learning and deep ensemble learning, at 30% and 35.1%, respectively.

3. Foundations of ensemble learning

The general framework of any ensemble learning system is to use an aggregation function G to combine a set of h baseline classifiers, c_1, c_2, ..., c_h, towards predicting a single output. Given a dataset of size n with features of dimension m, D = {(x_i, y_i)}, 1 <= i <= n, x_i in R^m, the prediction of the output based on this ensemble method is given by Eq. (1):

y_i = \phi(x_i) = G(c_1, c_2, \ldots, c_h)    (1)

Fig. 5 illustrates the general abstract framework of ensemble learning. All ensembles are made up of a collection of baseline classifiers (the classifier ensemble) that are trained on input data and produce predictions, which are then combined to produce an aggregate prediction (Lakshminarayanan et al., 2017). Ensemble strategies differ in how the baseline classifiers are selected and trained. Two strategies generate diversity among the base classifiers based on their nature: homogeneous or heterogeneous ensembles, as shown in Fig. 6 (Seijo-Pardo et al., 2017). A homogeneous ensemble (da Conceição et al., 2015) consists of baseline classifiers of the same type, with each classifier trained on different data. The feature selection method in this strategy is the same for the different training data. The main difficulty in the homogeneous form is generating diversity from the same learning algorithm. In contrast, heterogeneous ensembles (da Conceição et al., 2016) consist of baseline classifiers of different types, with each classifier trained on the same data. In heterogeneous ensembles, the feature selection method differs for the same training data. Finally, homogeneous ensemble methods are more appealing to researchers since they are easier to understand and apply. Also, it is less costly to build homogeneous ensembles than heterogeneous ones (Hosni et al., 2019).

Generally, any ensemble framework can be viewed and defined using three characteristics that affect its performance. The first is the dependency among the trained baseline models, that is, whether they are trained sequentially or in parallel. The second characteristic is the fusion method, which involves choosing a suitable process for combining the outputs of the baseline classifiers, using either weighted voting or a meta-learning method. The third characteristic is the heterogeneity of the involved baseline classifiers, whether homogeneous or heterogeneous. Table 1 summarizes the characteristics of the popular ensemble methods. In what follows, these characteristics are discussed in detail.

Table 1. Categorization of ensemble methods.

3.1. Data sampling

The selection of a data sampling method is one of the most important factors affecting the performance of the ensemble system. In the ensemble system, we need diversity in the data sampling decisions of the baseline classifiers. There are two strategies for sampling from the training dataset in an ensemble system: the independent datasets strategy and the dependent datasets strategy.
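To make the sampling idea concrete, the following is a minimal sketch (not taken from the paper) of drawing independent bootstrap replicates with NumPy; the toy dataset, the number of replicates, and the function name are illustrative assumptions.

```python
import numpy as np

def bootstrap_samples(X, y, n_subsets, rng=None):
    """Draw `n_subsets` bootstrap replicates (sampling rows with replacement)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    subsets = []
    for _ in range(n_subsets):
        idx = rng.integers(0, n, size=n)   # indices drawn with replacement
        subsets.append((X[idx], y[idx]))
    return subsets

# Example: three replicates of a toy dataset, one per baseline classifier.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
replicates = bootstrap_samples(X, y, n_subsets=3, rng=0)
```

Each replicate would then be used to train one baseline classifier, which is the source of diversity discussed above.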
3.3. Fusion method

Output fusion refers to integrating the outputs of the baseline classifiers into a single output. There are two fusion methods: the voting method and the meta-learning method. We will explain each of them in the following subsections.

3.3.1. Voting method

1. Max Voting: The first and most popular voting method is max voting (Kim et al., 2003), often known as majority voting or hard voting. The idea of max voting is to collect the predictions for each class label and predict the class label with the most votes, as shown in Eq. (2). For example, assume we combine three classifiers, C1, C2, and C3, that assign the classifications [0, 0, 1] to a training sample, so that y = mode[0, 0, 1] = 0; we would therefore categorize the sample as "class 0". Max voting is often used in the bagging method. Another variant of max voting is soft voting. Soft voting involves collecting the predicted probabilities for each class label and predicting the class label with the largest probability, as shown in Eq. (3). Max voting is distinguished from soft voting in that, once we know the prediction of each baseline classifier, we do not need to store any other information about the probability distributions of the predictions. Soft voting, on the other hand, needs to store and use all the distribution values, which makes it more expensive both computationally and in terms of storage. However, in soft voting we can use various rules to calculate the prediction, such as the maximum or the average probability values (Delgado, 2022). In general, the max voting method has the advantage of being the simplest voting method and easy to understand. Its drawbacks include the computational expense of using several baseline models; additionally, max voting is of little use when the baseline classifiers all produce the same predictions, and it may not fit all problems (Nti et al., 2020).

\hat{y} = \mathrm{mode}[C_1(x), C_2(x), \ldots, C_n(x)]    (2)

where \hat{y} is the class label predicted via the majority (plurality) vote of the classifiers C_n.

\hat{y} = \arg\max_i \sum_{j=1}^{n} w_j p_{ij}    (3)

where w_j is the weight that can be assigned to the j-th classifier and p_{ij} is the probability it assigns to the i-th class label.

2. Averaging Voting: The second voting method is averaging voting (Montgomery et al., 2012). The idea of averaging voting is that predictions are extracted from multiple models and the average of the predictions is used to make the final prediction. The average prediction is calculated using the arithmetic mean, i.e., the sum of the predictions divided by the total number of predictions, as shown in Eq. (4). For instance, suppose the ensemble contains three classifiers: C1(x) = [0.9, 0.1], C2(x) = [0.2, 0.8], and C3(x) = [0.6, 0.4]. The mean prediction would be as follows: for class 0, y0 = (0.9 + 0.2 + 0.6)/3 = 0.566, and for class 1, y1 = (0.1 + 0.8 + 0.4)/3 = 0.433, which yields the prediction y = 0. The average voting method has the advantage of being strong from the point of view of predictive power. In addition, it is more accurate than majority voting and reduces overfitting. Also, average voting is a natural competitor to max voting for the bagging method. The drawbacks of the average voting method include being computationally more expensive than max voting, as it requires averaging the prediction results of all the baseline models. One limitation of the averaging voting method is that it assumes that all baseline models in the ensemble are equally effective; however, this is not always the case, as some models may be better than others (Hopkinson et al., 2020).

\hat{y} = \arg\max_i \frac{1}{n} \sum_{j=1}^{n} w_{ij}    (4)

where w_{ij} is the probability of the i-th class label given by the j-th classifier and n is the number of classifiers.

3. Weighted Average Voting: The third voting method is weighted average voting, which is a slightly modified version of averaging voting (Latif-Shabgahi, 2004). The idea of weighted average voting is that different weights are given to the baseline learners, indicating the importance of each model in the prediction. By multiplying each prediction by the weight of its classifier to produce a weighted sum and then dividing the result by the sum of the classifier weights, these weights can be used to calculate the weighted average for each class 0 or class 1, as shown in Eq. (5). For instance, suppose the ensemble contains three classifiers, C1(x) = [97.2, 2.8], C2(x) = [100.0, 0.0], and C3(x) = [95.8, 4.2], with constant weights [0.84, 0.87, 0.75] for the ensemble members. For class 0, y0 = ((97.2 * 0.84) + (100.0 * 0.87) + (95.8 * 0.75)) / (0.84 + 0.87 + 0.75) = 97.763, and for class 1, y1 = ((2.8 * 0.84) + (0 * 0.87) + (4.2 * 0.75)) / (0.84 + 0.87 + 0.75) = 2.237, which yields the prediction y = 0. The weighted average voting method is more accurate than the simple average voting method. The challenge in using a weighted average ensemble is choosing each member's relative weight. Also, the computation is more expensive than the average voting method, as it requires calculating the weighted average of the prediction results of all the baseline models, which limits its application (Khan et al., 2020).

\bar{y} = \frac{\sum_{j=1}^{m} w_j x_j}{\sum_{j=1}^{m} w_j}    (5)

where \bar{y} is the weighted average, m is the number of terms to be averaged, w_j are the weights applied to the x values, and x_j are the data values to be averaged.
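As an illustration only (not code from the paper), the following NumPy sketch reproduces the three voting rules above; the probability vectors and weights are the toy values used in the worked examples.

```python
import numpy as np

# Toy predicted class-probability vectors [P(class 0), P(class 1)] from three classifiers.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])

# Max (hard/majority) voting, Eq. (2): each classifier votes for its argmax class.
votes = probs.argmax(axis=1)                      # -> [0, 1, 0]
hard_pred = np.bincount(votes).argmax()           # -> class 0

# Soft/average voting, Eq. (4): average the probabilities, then take the argmax.
avg = probs.mean(axis=0)                          # -> [0.566..., 0.433...]
soft_pred = avg.argmax()                          # -> class 0

# Weighted average voting, Eq. (5), using the worked example from the text.
scores = np.array([[97.2, 2.8],
                   [100.0, 0.0],
                   [95.8, 4.2]])
w = np.array([0.84, 0.87, 0.75])
weighted = (w[:, None] * scores).sum(axis=0) / w.sum()   # -> [~97.76, ~2.24]
weighted_pred = weighted.argmax()                 # -> class 0
```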
3.3.2. Meta learning method

The second fusion method is meta-learning (Soares et al., 2004), also known as "learning to learn", which is the process of learning from learners. The term "meta-learning" covers learning based on previous experience with other tasks. Therefore, it is used to improve the performance and results of a learning algorithm by changing some aspects of the learning algorithm based on experimental results. The meta-learning method differs from traditional machine-learning models in that it involves more than one learning stage, where the individual inducer outputs serve as input to a meta-learner that generates the final output (Kuruvayil and Palaniswamy, 2021).

Over the past five years, interest in meta-learning has increased, especially after 2017. With the increased use of advanced machine learning algorithms, the difficulties of training these algorithms have led to an increased interest in meta-learning. Machine learning algorithms face many challenges, such as high operational costs due to the many experiments performed during the training phase, which take a long time to find the model that achieves the best performance for a certain dataset. Meta-learning helps to meet these challenges by improving learning algorithms and finding learning algorithms that perform better (Kuruvayil and Palaniswamy, 2022). In addition, the benefits of meta-learning include speeding up learning processes by reducing the number of experiments required, helping learning algorithms adapt better to changing conditions, and optimizing hyperparameters to achieve optimal results. Moreover, this method provides an opportunity to tackle many challenges of deep learning, including data size, computational complexity, and generalization. The challenge in meta-learning is to learn from experience in a systematic, data-driven manner (Hospedales et al., 2021). There are many meta-learning methods, the most common of which is stacking (Haghighi and Omranpour, 2021). Implementing meta-learning raises several challenges, represented by defining an appropriate meta-learning approach and by the computation time complexity, whether due to a large amount of available data, multiple baseline models, or multiple levels of meta-learning (Monteiro et al., 2021).
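The idea that "inducer outputs serve as input to a meta-learner" can be sketched as follows; this is an illustration assuming scikit-learn is available, and the choice of base learners and meta-learner is mine, not the paper's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
base_learners = [RandomForestClassifier(random_state=0),
                 SVC(probability=True, random_state=0)]

# Level-0: out-of-fold predicted probabilities become the meta-learner's input features.
meta_features = np.column_stack([
    cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    for clf in base_learners
])

# Level-1: the meta-learner learns how to combine the base outputs into a final prediction.
meta_learner = LogisticRegression().fit(meta_features, y)
```

Using out-of-fold predictions (rather than in-sample predictions) is one common way to keep the meta-learner from simply memorizing the base learners' training errors.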
4. Ensemble methods

This section presents two aspects. The first includes the structure of the most popular ensemble learning methods and lists each method's benefits, drawbacks, and implementation challenges separately. The second presents the idea of deep ensemble learning and the advantages of its application compared to traditional ensemble learning. It also discusses the deep learning challenges that ensemble deep learning overcomes. Moreover, it introduces the different strategies for applying ensemble deep learning and the advantages of each strategy, with an explanation of the factors that can affect its performance.

4.1. Common ensemble methods

Three popular ensemble learning methods can be used to improve the machine learning process: bagging, boosting, and stacking. We will discuss how each method works and its characteristics regarding the nature of data generation, the nature of the training of the baseline classifiers, and the appropriate fusion methods. In addition, the benefits, drawbacks, and implementation challenges of each method will be covered.

4.1.1. Bagging

The bagging method (Breiman, 1996), also known as bootstrap aggregating, is a completely data-specific algorithm. It refers to creating multiple small subsets of data from the actual dataset. The goal of bagging is to create more diverse predictive models by adjusting the stochastic distribution of the training datasets, where small changes in the training data set lead to significant changes in the model predictions. Bagging is shorthand for the combination of bootstrapping and aggregating. In bootstrapping, the ensemble models are trained on bootstrap replicates of the training dataset. In aggregation, the final result is obtained by majority voting over the models' predictions. Bagging offers the advantage of reducing variance, thus mitigating overfitting, and it also performs well on high-dimensional data. Its drawbacks are that it is computationally expensive, has high bias, and leads to a loss of interpretability of the model (Bühlmann and Yu, 2002). The Random Forests (RF) algorithm (Breiman, 2001) is a good example of bagging. There are several challenges in implementing the bagging method: determining the optimal number of base learners and subsets, the maximum number of bootstrap samples per subset, and, in addition, the fusion method for integrating the outputs of the base classifiers from among the various voting methods. In summary, the bagging method uses parallel ensemble techniques where the baseline learners are generated simultaneously, as there is no data dependency, and the fusion methods rely on different voting schemes. The function of bagging is shown in Eq. (6):

f(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)    (6)

where the f_b(x) are the weak learners and B is the number of bootstrap sets.
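For instance, a bagging ensemble following Eq. (6) can be assembled with scikit-learn as sketched below; the dataset, the number of estimators, and the base learner are illustrative choices, not the paper's experiments.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# B = 50 decision trees, each fit on a bootstrap replicate of the training set;
# predictions are aggregated by voting, as in Eq. (6).
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)
print("bagging accuracy:", bagging.score(X_test, y_test))
```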
4.1.2. Boosting

The boosting method was first presented by Freund and Schapire in 1997 (Freund et al., 1996) and is a sequential process where each subsequent model attempts to correct the errors of the previous model. Boosting combines multiple weak learners sequentially in a very adaptive way, whereby each model in the sequence is fitted giving more importance to the observations in the dataset that the previous models in the sequence handled badly. Boosting, like bagging, can be used for regression and classification problems. Boosting algorithms include three main types, namely Adaptive Boosting (AdaBoost) (Freund et al., 2003), Stochastic Gradient Boosting (SGB) (Friedman, 2001), and Extreme Gradient Boosting (XGB), also known as XGBoost (Friedman et al., 2000). Several studies have applied various types of boosting. For example, the AdaBoost algorithm is implemented in Sun et al. (2016) for noise detection and in Asbai and Amrouche (2017) for speech feature extraction. The XGB algorithm is implemented in Haumahu et al. (2021) for fake news classification. The SGB algorithm is implemented in Shin (2019) for early prediction of safety accidents at construction sites. Boosting provides ease of interpretation of the model and helps reduce variance and bias in a machine learning ensemble. The drawback of boosting is that each classifier must fix the errors of its predecessors. Implementing boosting involves several challenges, represented by the difficulty of scaling its sequential training; it is computationally costly and more vulnerable to overfitting as the number of iterations increases. Finally, it can be noted that boosting algorithms can be slower to train than bagging, and the large number of parameters can also affect the behavior of the model. In summary, the boosting method uses sequential ensemble techniques where the different learners learn sequentially, as there is data dependency, and the fusion methods rely on different voting schemes. The function of boosting is shown in Eq. (7):

f(x) = \sum_{t} a_t h_t(x)    (7)

which creates a strong classifier f(x) from several weak classifiers h_t(x). This is done by building a model from the training data and then creating further models that attempt to correct the errors of the previous ones, with a_t the weight given to the t-th weak learner.
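A comparable AdaBoost sketch, again only an illustration of Eq. (7) with assumed settings rather than the configurations used in the studies cited above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak learners (depth-1 trees by default) are added sequentially; each round
# re-weights the training samples that previous rounds misclassified, and the
# final prediction is the weighted combination of weak learners as in Eq. (7).
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)
boosting.fit(X_train, y_train)
print("boosting accuracy:", boosting.score(X_test, y_test))
```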
4.1.3. Stacking

The stacking method (Smyth and Wolpert, 1997), also known as stacked generalization, is a model ensembling technique used to combine information from multiple predictive models to generate a new model (the meta-model). The architecture of a stacking model involves two or more base models, referred to as level-0 models, and a meta-model that combines the predictions of the base models, referred to as the level-1 model. The level-0 models (base models) are fit on the training data and their predictions are compiled, whereas the level-1 model (meta-model) learns how best to combine the base models' predictions. The outputs of the base models used as input to the meta-model may be probability values or class labels in the case of classification (Ma et al., 2018). The stacking method typically performs better than any of the individual trained models. For instance, a stacking ensemble learning system was proposed by Divina et al. (2018) to forecast electric energy usage in Spain and by Qiu et al. (2014) to forecast electric energy usage in Australia. Stacking has the benefit of a deeper comprehension of the data, making it more precise and effective. Overfitting is a major issue with model stacking because many predictors that all predict the same target are merged. In addition, multi-level stacking is data-hungry (a lot of data is needed for training) and time-consuming (each layer adds multiple models) (Xiong et al., 2021). Implementing stacking involves several challenges, represented by identifying the appropriate number of baseline models, and which baseline models can be relied upon to generate better predictions, when designing a stacking ensemble from scratch. There is also the difficulty of interpreting the final model, and the computation time complexity increases when the amount of available data grows exponentially; a highly complex model could take months to run. Finally, the problem of multi-label classification raises many issues, such as overfitting and the curse of dimensionality, due to the high dimensionality of the data (Chatzimparmpas et al., 2020). In summary, the stacking method uses parallel ensemble techniques where the baseline learners are generated simultaneously, as there is no data dependency, and the fusion method relies on meta-learning. The function of stacking is shown in Eq. (8):

f_s(x) = \sum_{i=1}^{n} a_i f_i(x)    (8)

A formal stacking concept: here, we make predictions from several models (m_1, m_2, m_3, ..., m_n) to build a new model, where the new model is used to make predictions on the test dataset. Stacking seeks to increase the predictive power of a model. The basic idea of stacking is to "stack" the predictions of (m_1, m_2, m_3, ..., m_n) by a linear combination with weights a_i (i = 1, 2, ..., n).
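The level-0/level-1 structure described above maps directly onto scikit-learn's StackingClassifier; the sketch below is illustrative, and the base models and meta-model are assumed choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

level0 = [("rf", RandomForestClassifier(random_state=0)),
          ("svm", SVC(probability=True, random_state=0))]
level1 = LogisticRegression()          # meta-model that combines the base predictions

stack = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
stack.fit(X_train, y_train)
print("stacking accuracy:", stack.score(X_test, y_test))
```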
4.2. Ensemble deep learning

In recent years, deep learning, or deep neural learning, has led to a series of achievements in various tasks (Arel et al., 2010). Deep learning architectures have shown great success in almost all challenges related to machine learning across different areas, such as NLP (Mohammed and Kora, 2019; Elnagar et al., 2020), computer vision (Haque et al., 2020; Brunetti et al., 2018), speech recognition (Jaouedi et al., 2020; Noda et al., 2015), and machine translation (Popel et al., 2020). Deep neural network models are nonlinear methods that learn through a stochastic training algorithm. This means that they are highly flexible, able to learn complex relationships between variables and to approximate any mapping function. The downside of this flexibility is that the models suffer from high variance. The high variance of a deep model can be addressed by the ensemble deep learning approach, by training multiple deep models for the problem and combining their predictions. Hence, ensemble deep learning methods refer to training several baseline deep models and combining them under some rule to make predictions. Ensemble deep learning aims to effectively combine the major benefits of several deep learning models with those of an ensemble learning system (Mohammed and Kora, 2021). Despite the power of ensemble deep learning methods in improving prediction performance, most of the ensemble deep learning literature focuses on applying only majority voting algorithms to enhance performance, due to their simplicity.

Ensemble learning based on deep learning models is more difficult than ensemble learning based on traditional classifiers, because deep neural networks contain millions to billions of parameters and need a lot of time and space to train multiple base deep learners. Thus, hyper-parameters are a challenge in the application of ensemble deep learning techniques. Ensemble learning strategies are formed in the context of manipulating either the data level or the baseline model level. Manipulation at the data level is done by sampling or cross-validating (re-sampling) the data to create new training sets for training different base learners. In manipulation at the level of the basic models, deep learning is distinguished by more diverse strategies than traditional machine learning, namely the possibility of reducing the number of hyper-parameters used in the ensemble of base deep models by selecting the same model and changing its hyper-parameters (Saleh et al., 2022). Fig. 9 shows four strategies through which ensemble deep learning can be conducted: (A) applying many different basic models to the same data; (B) applying different structures of the same basic model to the same data; (C) applying many different basic models to many different data samples; and (D) applying different structures of the same basic model to many different data samples. Comparing these strategies shows that strategies A and C are compatible with both deep learning models and traditional learning techniques, whereas strategies B and D apply only to deep learning models and cannot be used with traditional learning techniques, which makes the ensemble deep learning strategies diverse. In addition, strategies B and D enable ensemble deep learning to reduce the hyper-parameters of the baseline deep models by deriving different structures of the same basic model through altering some of the hyper-parameter values. In addition to these strategies, the strength of an ensemble deep learning system depends on the ensemble system design: identifying the most effective deep learning models to address the problem, determining the appropriate number of baseline deep learning models (such as three or more), and determining the optimal ratio for data splitting (such as 80-20, 70-30, or 60-40). Moreover, we consider factors that may affect the deep ensemble system, such as defining the nature of data generation, training the deep baseline models, and deciding the most appropriate fusion method for combining the outputs of the baseline classifiers, as previously mentioned. These three factors affect the general framework of the ensemble system.
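As a sketch of the fusion step in strategies (B)/(D) — several structures or hyper-parameter settings of the same basic model whose outputs are combined — the following NumPy example is illustrative only; the file names, validation accuracies, and weights are hypothetical, and the per-model probability arrays are assumed to come from separately trained deep networks.

```python
import numpy as np

# Assume each baseline deep model (e.g., the same CNN trained with different
# hyper-parameter settings) has produced class probabilities of shape
# (n_samples, n_classes) on the same test set and saved them to disk.
probs_per_model = [np.load(f"model_{k}_test_probs.npy") for k in range(3)]  # hypothetical files

# Simple (unweighted) fusion: average the softmax outputs and take the argmax.
avg_probs = np.mean(probs_per_model, axis=0)
ensemble_pred = avg_probs.argmax(axis=1)

# Weighted fusion: weight each model by its validation accuracy (hypothetical values).
val_acc = np.array([0.91, 0.89, 0.93])
w = val_acc / val_acc.sum()
weighted_pred = np.tensordot(w, np.stack(probs_per_model), axes=1).argmax(axis=1)
```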
5. Evaluating ensembles

With the emergence of ensemble learning approaches, a great deal of research has been conducted to evaluate ensemble methods (Hashino et al., 2007; Zhang et al., 2016; Das and Sengur, 2010; Hosni et al., 2019). Evaluation is crucial to determining the effectiveness of a certain ensemble method. There are several criteria for evaluating ensembles, including predictive performance. Other criteria, such as the computational complexity or the comprehensibility of the generated ensemble, can also be important. In the following, we summarize the different evaluation criteria for ensemble learning.

5.1. Predictive performance

Predictive performance metrics have always been the primary criterion for judging the performance of classifiers. Predictive performance measures are considered objective and quantifiable, so they are often used to benchmark machine learning algorithms in practice. The first step in applying predictive performance evaluation is to use a suitable dataset. The holdout technique is a typical approach for measuring predictive performance, where the given dataset is randomly divided into two subsets: a training set and a test set. Other versions of the holdout method may be utilized; it is normal procedure to resample the data, that is, to divide it into training and test sets in different ways. Two common resampling methods are random subsampling and n-fold cross-validation (Dai, 2013).

There are common measures for evaluating an ensemble model. Accuracy is one of the most popular and simplest metrics and is defined in Eq. (9):

\text{Accuracy} = \frac{\text{number of true predictions}}{\text{total number of predictions}}    (9)

In some cases, accuracy is insufficient and can be deceptive when evaluating an ensemble model on imbalanced class distributions. In that scenario, other measures can be used as alternatives, such as Recall, Precision, Specificity, and F-Measure (Kadam et al., 2019).

Recall, also known as sensitivity, measures the ensemble model's capability to identify positive samples and is defined in Eq. (10):

\text{Recall} = \frac{\text{true positive}}{\text{positive}}    (10)

where true positive denotes the number of true positive observations and positive denotes the number of positive observations.

Another well-known performance metric is precision. It quantifies how many instances classified as positive are actually positive. Formally, precision is defined in Eq. (11):

\text{Precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}    (11)

Likewise, specificity measures how well the model identifies negative samples. It is defined in Eq. (12):

\text{Specificity} = \frac{\text{true negative}}{\text{negative}}    (12)

where true negative denotes the number of true negative observations and negative denotes the number of negative observations.

There is commonly a trade-off between the precision and recall metrics: attempting to enhance one measure often results in a fall in the other. The F-Measure quantifies this trade-off by calculating the harmonic mean of precision and recall. More specifically, this measure is defined in Eq. (13):

\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}    (13)
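The measures in Eqs. (9)-(13) can be computed directly from confusion-matrix counts, as in this small sketch (toy counts, for illustration only):

```python
def evaluate(tp, fp, tn, fn):
    accuracy    = (tp + tn) / (tp + fp + tn + fn)                 # Eq. (9)
    recall      = tp / (tp + fn)                                  # Eq. (10), sensitivity
    precision   = tp / (tp + fp)                                  # Eq. (11)
    specificity = tn / (tn + fp)                                  # Eq. (12)
    f_measure   = 2 * precision * recall / (precision + recall)   # Eq. (13)
    return accuracy, recall, precision, specificity, f_measure

# Toy confusion-matrix counts for a binary ensemble classifier.
print(evaluate(tp=80, fp=10, tn=95, fn=15))
```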
5.2. Computational complexity

The computational complexity of the ensemble approach is an additional essential aspect to consider. Generally, the computational cost refers to the amount of CPU time required by each ensemble model. The computational cost is divided into two complexity metrics: the computational cost of training and creating the ensemble model, and the computational cost of predicting a new instance. The computational cost of prediction is relatively small compared to the computational cost of training the ensemble; nonetheless, this metric should be addressed. In terms of memory, a smaller ensemble model needs less memory to keep its components. Furthermore, smaller ensembles perform prediction faster.

5.3. Other criteria

In addition to computational complexity and prediction accuracy, other considerations may be made when selecting the best ensemble method. These criteria include the interpretability, scalability, usability, and robustness of the ensemble model. Interpretability (Carvalho et al., 2019) refers to the ability of a user to understand the ensemble outcomes. However, interpretability is typically a subjective metric. One of the quantitative metrics and indicators that can help us evaluate this criterion is compactness. The compactness of an ensemble can be evaluated using the number of classifiers involved and the complexity of each classifier.

On the other hand, scalability refers to the capacity of the ensemble approach to construct a classification model given large amounts of data. Independent ensemble methods are considered more scalable than dependent methods, as the classifiers involved in the ensemble can be trained in parallel. Usability is another metric, which assesses the user's ability to understand how to adjust the ensemble models they employ. Broadly speaking, a good ensemble method should provide a comprehensive set of control parameters that can be easily adjusted.
6. Application domains

This section highlights applications of ensemble learning across different domains, using either traditional or deep learning models as baseline classifiers. In general, we briefly summarize the baseline classifiers applied, the ensemble techniques used, and the domain used in their experiments.

6.1. Applications of traditional ensemble learning

This part discusses applications of traditional ensemble learning in various domains, including image classification, natural language processing (NLP), and others. Table 2 summarizes some works that presented ensemble learning methods in machine learning in different fields.

Table 2. Applications of ensemble learning in the machine learning approach.

In the image classification domain, the researchers in Wang et al. (2013) applied voting based on SVM for image retrieval using the COREL images database (Liu et al., 2011). In particular, in medical image classification, the researchers in Cortes et al. (2014) suggested boosting based on deep decision trees (DT) for image classification using several breast cancer datasets. The researchers in Kuznetsov et al. (2014) used AdaBoost based on DT for multi-class classification using 8 UCI datasets (Fernández-Delgado et al., 2014). The researchers in Livieris et al. (2019) applied voting and bagging based on kNN and DT to classify lung abnormalities from chest X-rays using three benchmark datasets (Kermany et al., 2018). The researchers in Anwar et al. (2014) proposed bagging based on many classifiers (KNN, DT, RF, and LR) using seven datasets from various diseases (such as cancer, diabetes, heart disease, sonar, etc.). The researchers in Bharathidason and Venkataeswaran (2014) applied voting and bagging based on RF using a heart disease dataset (Makhtar et al., 2012). The researchers in Shipp and Kuncheva (2002) proposed voting based on NB using a breast cancer dataset (Antoniou et al., 2000). The researchers in Mishra and Mishra (2015) applied voting based on NB using six medical benchmark datasets (leukemia, breast cancer, lung cancer, hepatitis, lymphoma, and embryonal tumors). The researchers in Cho and Won (2003) applied voting based on SVM and KNN using three leukemia cancer datasets. Bashir et al. (2015) applied voting and bagging based on SVM and LR using five heart disease datasets. That same year, Bashir et al. (2015) applied voting based on SVM and DT using breast cancer diagnosis datasets. The researchers in Kang et al. (2015) proposed two ensemble methods (bagging and boosting) based on SVMs for the treatment of patients' diabetes using the dataset of Li and Maguire (2010).

In addition, in the NLP domain for the English language, the authors in Wang et al. (2014) used two popular ensemble methods (bagging and boosting) based on five base learners (NB, ME, DT, KNN, SVM) on ten public sentiment analysis datasets. The authors in Xia et al. (2011) used stacking based on three algorithms, namely NB,
ME, and SVM, on five datasets. The authors in Li et al. (2010) and Xia et al. (2016) applied a voting method based on both LR and SVM using reviews extracted from Amazon.com (Rushdi-Saleh et al., 2011). The authors in Araque et al. (2017) applied voting methods based on different machine learning classifiers (NB, ME, and SVM) on seven public datasets of movie reviews. The authors in Alrehili and Albalawi (2019) suggested three ensemble methods (voting, bagging, and boosting) based on NB and SVM using English customer review datasets (Alrehili and Albalawi, 2019). The authors in Saleena et al. (2018) applied voting based on different baseline classifiers (SVM, RF, NB, and LR) on several English tweet datasets. The authors in Dedhia and Ramteke (2017) used AdaBoost based on three classifiers (NB, SVM, and ME) using several English tweet datasets. The authors in Perikos and Hatzilygeroudis (2016) applied bagging based on NB and ME using datasets from different English news portals. The authors in Fersini et al. (2016) used voting based on NB, DT, and SVM on English movie review datasets (Chen et al., 2012). The authors in Onan et al. (2016) proposed three ensemble methods (bagging, AdaBoost, and stacking) based on five classifiers (BLR, NB, LDA, LR, and SVM) using nine public English sentiment analysis datasets from different domains (Whitehead and Yaeger, 2009). The authors in Kanakaraj and Guddeti (2015) suggested bagging and boosting based on both NB and SVM using English movie reviews (Pang and Lee, 2005). The authors in Fersini et al. (2014) proposed voting and bagging based on different baseline classifiers (ME, SVM, and NB) on several English movie and product review datasets (Täckström and McDonald, 2011; Pang and Lee, 2005). The authors in Prusa et al. (2015) applied KNN, SVM, and LR with both bagging and boosting using the English Sentiment140 corpus (Go et al., 2009). The authors in Wilson et al. (2006) introduced boosting based on a DT classifier on the English MPQA corpus (Wiebe et al., 2005). The authors in Tsutsumi et al. (2007) applied stacking based on two classifiers (SVM and ME) using the English movie review dataset (Pang and Lee, 2005). The authors in Hassan et al. (2013) proposed boosting based on SVM using three English product review forum datasets (Abbasi et al., 2010; Abbasi et al., 2008a). The authors in Fouad et al. (2018) compared the performance of a voting method based on three classifiers (SVM, NB, and LR) using several English tweet datasets. The authors in Rodriguez-Penagos et al. (2013) introduced voting based on SVM on the English SemEval 2013 dataset (Dzikovska et al., 2013). The authors in Clark and Wicentwoski (2013) suggested voting based on NB using the English SemEval-2013 dataset (Nakov et al., 2016). The authors in Da Silva et al. (2014) applied voting based on four baseline classifiers (SVM, RF, and LR) using several English tweet datasets. In multiclass sentiment classification, Sharma et al. (2018) proposed bagging based on SVM using several English movie review datasets. In contrast, for the Arabic language, the authors in Saeed et al. (2022) applied both voting and stacking for spam detection based on five baseline classifiers (SVM, NB, LR, DT, KNN) using two datasets from the Opinion Spam Corpus (Li et al., 2011). Besides, in other languages and dialects, the authors in Su et al. (2012) applied both voting and stacking based on two algorithms (ME and SVM) using two datasets covering three domains of Chinese reviews (book, hotel, and notebook). The authors in Li et al. (2012) suggested stacking based on SVM and KNN using several Chinese food review datasets. The authors in Lu and Tsou (2010) applied stacking based on three classifiers, NB, ME, and SVM, using a Chinese dataset (Seki et al., 2008). The authors in Pasupulety et al. (2019) introduced stacking based on two baseline classifiers (SVM and RF) for predicting stock prices of companies using India's National Stock Exchange (NSE) datasets (Kumar and Misra, 2018). The authors in Oussous et al. (2018) proposed voting and stacking based on three baseline classifiers (MNB, SVM, and ME) using a Moroccan tweets dataset (Tratz et al., 2013). The authors in Ekbal and Saha (2011) suggested voting based on diverse classification methods such as SVM, ME, and RF for named entity recognition in three Indian languages (Bengali, Hindi, and Telugu) using a Bengali news corpus (Ekbal and Bandyopadhyay, 2008). The authors in Abbasi et al. (2008b) proposed boosting based on SVM using several Middle Eastern web forums.

Moreover, in diverse other fields, Stamatatos and Widmer (2002) used a voting method based on SVM for music performer recognition using datasets of several pianists' playing. Chen et al. (2019) applied the bagging method based on Fisher's linear discriminant function (FLDA) for potential groundwater assessment in the Ningtiaota area in Shaanxi, China, using a database with 66 groundwater spring locations. Zareapoor and Shamsolmoali (2015) suggested bagging based on three machine learning algorithms (SVM, NB, and KNN) for credit card fraud prediction, using a dataset of 100,000 credit card transaction records (Hormozi et al., 2013). Shahzad and Lavesson (2013) proposed voting based on NB, DT, and KNN for malware detection using three datasets of malicious threats (Shahzad et al., 2010). Anifowose et al. (2013) applied bagging with RF to predict petroleum reservoir properties using six datasets from a giant carbonate reservoir in the Middle East and a drilling site on the Northern Marion platform of North America (Helmy et al., 2010). Kulkarni et al. (2018) suggested voting based on SVM, NB, and RF for a crop recommendation system that maps input soil data to a recommended crop type, Kharif or Rabi. Erdoğan and Namlı (2019) applied voting and stacking based on SVM for living environment prediction. In Cai et al. (2020), voting based on SVM and LR was applied to predict surface chloride concentration. Seker and Ocak (2019) proposed bagging based on three classifiers (RF, LR, and linear regression) to predict roadheader performance using several datasets.

6.2. Applications of ensemble deep learning

Ensemble learning methods built on deep learning models outperform traditional ensemble learning in many domains, including image classification, natural language processing (NLP), and others. Table 3 summarizes some works that presented ensemble learning methods in deep learning in different fields.

Table 3. Applications of ensemble learning in the deep learning approach.

In the image classification domain, Wang et al. (2020) applied a stacking method based on multiple CNNs using the CIFAR-10 dataset (Pandit and Kumar, 2020). Also, Zhang et al. (2019) applied a stacking method based on multiple CNNs for image deblurring, using the GoPro dataset (Marques et al., 2021) and the Video Deblurring dataset (Wu et al., 2020). Waltner et al. (2019) proposed a boosting method based on CNNs for image retrieval using the biggest available retrieval datasets. Chen et al. (2019) and Chen et al. (2018) proposed a deep boosting framework that integrates CNNs into the boosting algorithm, using two benchmark datasets (Set12 and BSD68) (Thakur et al., 2019). Can Malli et al. (2016) suggested voting based on CNNs for apparent age estimation ("face detection") using the IMDB-WIKI dataset (Russakovsky et al., 2015). Opitz et al. (2017) applied boosting of CNNs using several image retrieval datasets (Liu et al., 2016). Mosca and Magoulas (2016) applied boosting of CNNs using two image datasets, namely MNIST (LeCun, 1998) and CIFAR-10 (Pandit and Kumar, 2020). Walach and Wolf (2016) proposed boosting CNNs for object counting in images using different image datasets, namely mall crowd counting (Chen et al., 2012), UCF 50 crowd counting (Idrees et al., 2013), and UCSD (Chan et al., 2008). Moghimi et al. (2016) applied boosting CNNs using several image datasets, namely Cars (Krause et al., 2013) and Aircraft (Gosselin et al., 2014). Yang et al. (2015) proposed boosting CNNs for face detection using the ImageNet dataset (Krizhevsky et al., 2012).
Li et al. (2015) suggested stacking based on a simplified neural network module (SNNM) using four face image datasets (Jiang et al., 2013). Zhang et al. (2020) applied a boosting method on the CIFAR-10 dataset (Pandit and Kumar, 2020), containing 60,000 colored images, to train the CNN. In particular, in medical image classification, the authors of Ali et al. (2020) applied a smart healthcare system for heart disease prediction using ensemble deep learning and feature fusion approaches; the proposed system achieved an accuracy of 98.5%. The authors of Alshazly et al. (2019) suggested voting based on CNNs for visual recognition tasks (ear recognition) using several ear datasets. The authors of Ortiz et al. (2016) applied voting based on deep belief networks using a large dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) (Hinrichs et al., 2009). The authors of Codella et al. (2017) proposed voting based on deep residual networks (DRN) and CNNs for melanoma recognition in dermoscopy images; the voting method achieved an accuracy of 76% using a dermoscopic image dataset containing 1,279 images (Mendonca et al., 2015). The authors of Tasci et al. (2021) applied voting based on CNNs for tuberculosis detection on two TB CXR image datasets (Sharma et al., 2017); the voting method achieved accuracy rates of 97.5% and 97.69% on the two datasets, respectively. The authors of Cha et al. (2019) suggested voting based on nine CNNs to classify eardrum and external auditory canal features; the voting achieved an average accuracy of 93.67% using a large database of 910,544 images (Locketz et al., 2016). The authors of Guo et al. (2020) proposed a voting method for automated cervical precancer screening using 30,000 images from several datasets; the voting method combined the assessments of three deep learning architectures, RetinaNet, Deep SVDD, and a CNN, with an average accuracy and F-score of 91.6% and 0.89, respectively. The authors of Khamparia et al. (2020) applied a voting method based on CNNs for disease prediction related to neuromuscular disorders using two neuromuscular disorder datasets (Bakay et al., 2006).

In addition, in the NLP domain, Mohammed and Kora (2021) proposed a novel ensemble for multilingual text classification using six benchmark datasets and compared the performance of the proposed method with other ensemble methods; the results show that the proposed method outperforms state-of-the-art ensemble methods. Deng et al. (2012) suggested a stacking method based on a deep convex network (DCN) for spoken language understanding (SLU) problems; the stacking method achieved an accuracy of 91.88% using the ATIS dataset, which consists of 5,871 sentences (Wen et al., 2005). The authors in Xu et al. (2016) proposed a soft voting ensemble based on CNN and LSTM using the SemEval 2013 dataset (Dzikovska et al., 2013). Chen et al. (2017) presented voting based on a CNN-RNN model using a large document dataset (Lewis et al., 2004). Akhtyamova et al. (2017) suggested a voting method based on CNNs for predicting drug safety using
English reviews from health forums (Karimi et al., 2015). In Araque they need a lot of knowledge and experience to tune the optimal
et al. (2017) applied both voting and stacking based on several hyperparameters aiming at reaching a global minimum error.
deep learning models, namely CNN, LSTM, and GRU, using seven However, finding the optimal hyperparameters requires an
English movie review datasets. In Al-Omari et al. (2019) applied exhausting technique in the search space, which in turn becomes
voting based on Bi_LSTM for English fake news detection using a tedious and time-consuming task. Thus, several research efforts
NLP4IF 2019 (Barrón-Cedeno et al., 2019). In Nguyen and Le have applied deep ensemble learning in many fields, and most of
Nguyen (2019) applied voting based on CNN and LSTM using five these efforts are articulated around simple ensemble methods. This
English datasets from movie reviews (Koh et al., 2010). In Livieris paper provided a comprehensive review of the various strategies
et al. (2020) proposed CNNs based on bagging and stacking using for ensemble learning, especially in the case of deep learning.
several English review datasets. In Haralabopoulos et al. (2020) The paper also illustrated the recent trends in ensemble learning
applied both voting and stacking based on several deep learning using quantitative analysis of several research papers. Moreover,
models, namely LSTM, GRU, CNN, RCNN, and DNN, using two Eng- the paper offered various factors that influence ensemble methods’
lish tweets datasets (SemEval (Bethard et al., 2016), Toxic Com- success, including sampling the training data, training the baseline
ment (van Aken et al., 2018)). In Mohammadi and Shaverizade models, and the fusion techniques of the baseline models. Also, the
(2021) applied stacking based on four deep learning models, papers discussed the pros and cons of each ensemble method.
namely CNN, LSTM, GRU, and BiLSTM using English review dataset Additionally, the paper extensively introduced and presented sev-
(SemEval) (Bethard et al., 2016). In Deriu et al. (2016) proposed eral research efforts that used ensemble learning in a wide range of
stacking ensemble based on CNN for English tweets classification domains and categorized these efforts into either traditional
by using SemEval-2016 dataset (Bethard et al., 2016). In contrast, machine or deep learning models as baseline classifiers. It is worth
in Heikal et al. (2018) applied voting based on the combination noting that an ensemble of deep learning models using simple
of CNN and LSTM models using Arabic dataset (ASTD) (Nabil averaging methods is not a smart choice and is very sensitive to
et al., 2015). In Alharbi et al. (2021) applied a voting method based biased baseline models. On the other hand, Injecting diversity in
on LSTM and GRU using five datasets from Arabic tweets. ensemble deep learning can become robust to the biased baseline
Moreover, in the diverse fields, in Zhang et al. (2020) proposed a models. The diversity can be achieved by training different base-
system that jointly learns the grasping and the stacking policies line deep learning architectures over several data samples. The
through the grasping for stacking network (GSNet) for enables a diversity, however, is limited by the computation cost and the
robotic arm to correctly pick boxes from a table and put it on a availability of suitable data to be sampled.
platform. In Wang et al. (2019) proposed an Adaboost method
based on DNN for security level classification. The dataset is the
assessment results of 100 Android terminals (including smart- Declaration of Competing Interest
phones, smart bracelets, tablet PC) and from schools, hospitals, fac-
tories, and other environments. In Liu et al. (2014) applied boosted The authors declare that they have no known competing finan-
deep belief network for facial expression recognition/shape cial interests or personal relationships that could have appeared
changes based on the CK + database (contains 327 expression to influence the work reported in this paper.
images) (Seyyedsalehi and Seyyedsalehi, 2014). The authors of
Deng and Platt (2014) applied the stacking method based on both
RNN and CNN for speech recognition using TIMIT dataset (Garofolo References
et al., 1993). The authors of Liu et al. (2017) applied stacking based
Abbasi, A., Chen, H., Salem, A., 2008a. Sentiment analysis in multiple languages:
on back propagation neural networks (BPNN) for flood forecasting. Feature selection for opinion classification in web forums. ACM Trans. Informat.
Han et al. (2016) applied boosting CNNs for recognizing facial Syst. (TOIS) 26 (3), 1–34.
action units. In Tur et al. (2012) applied a stacking method based Abbasi, A., Chen, H., Thoms, S., Fu, T., 2008b. Affect analysis of web forums and blogs
using correlation ensembles. IEEE Trans. Knowledge Data Eng. 20 (9), 1168–
on deep convex networks (DCNs) to semantic utterance classifica- 1180.
tion by the dataset of utterances from the users of a spoken dialog Abbasi, A., France, S., Zhang, Z., Chen, H., 2010. Selecting attributes for sentiment
system. In Palangi et al. (2014) applied stacking RNN for speech classification using feature relation networks. IEEE Trans. Knowl. Data Eng. 23
(3), 447–462.
recognition systems based on TIMIT dataset (Garofolo et al., 1993).
Abellán, J., Mantas, C.J., 2014. Improving experimental studies about ensembles of
classifiers for bankruptcy prediction and credit scoring. Expert Syst. Appl. 41
(8), 3825–3830.
7. Conclusion

In machine learning, reducing the bias and the variance of models is one of the key factors determining the success of the learning process. It has been shown in the literature that merging the outputs of different classification algorithms can decrease the generalization error without increasing the variance of the model; this is the essence of so-called ensemble learning. Numerous research efforts have preferred ensemble learning over single-model learning in various domains. The main advantage of ensemble learning is that combining several individual models improves prediction performance and yields a stronger model that outperforms each of them. The literature offers several ensemble techniques for boosting classification algorithms, and the main difference between any two ensemble methods lies in how the baseline models are trained and how they are combined. Several research efforts have introduced ensemble learning into deep learning models to remedy the problems that appear during the learning process of deep learning models. Usually, the main challenge of deep learning models is the effort needed to train and tune them, and most existing ensemble efforts simply combine machine or deep learning models as baseline classifiers. It is worth noting that an ensemble of deep learning models fused by simple averaging is not a smart choice, as it is very sensitive to biased baseline models. On the other hand, injecting diversity into an ensemble of deep learning models can make it robust to biased baseline models. This diversity can be achieved by training different baseline deep learning architectures over several data samples. The diversity, however, is limited by the computation cost and the availability of suitable data to be sampled.
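To make the diversity-by-data-sampling idea concrete, the following is a minimal sketch under stated assumptions: each baseline learner is trained on a different bootstrap sample and with a different capacity, and the predictions are fused by simple averaging. Scikit-learn decision trees and synthetic data are used only to keep the example small; they are stand-ins for the deep architectures and real datasets discussed in this review, not an implementation from any cited work.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic stand-in data; in the ensemble deep learning setting, X would be
# text or image features and the members would be CNN/LSTM-style models.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

members = []
for i in range(7):
    # Diversity source 1: a different bootstrap sample for every baseline model.
    X_boot, y_boot = resample(X_tr, y_tr, random_state=i)
    # Diversity source 2: a different model capacity for every baseline model.
    members.append(DecisionTreeClassifier(max_depth=3 + i, random_state=i).fit(X_boot, y_boot))

# Fusion by simple averaging of the predicted class probabilities.
avg_proba = np.mean([m.predict_proba(X_te) for m in members], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)

print("single baseline accuracy:", accuracy_score(y_te, members[0].predict(X_te)))
print("averaged ensemble accuracy:", accuracy_score(y_te, ensemble_pred))

The same two levers, varying the data each member sees and varying the member architectures, are what make averaging-based ensemble deep learning less sensitive to any single biased baseline model, at the computational cost noted above.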
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Abbasi, A., Chen, H., Salem, A., 2008a. Sentiment analysis in multiple languages: Feature selection for opinion classification in web forums. ACM Trans. Informat. Syst. (TOIS) 26 (3), 1–34.
Abbasi, A., Chen, H., Thoms, S., Fu, T., 2008b. Affect analysis of web forums and blogs using correlation ensembles. IEEE Trans. Knowledge Data Eng. 20 (9), 1168–1180.
Abbasi, A., France, S., Zhang, Z., Chen, H., 2010. Selecting attributes for sentiment classification using feature relation networks. IEEE Trans. Knowl. Data Eng. 23 (3), 447–462.
Abellán, J., Mantas, C.J., 2014. Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring. Expert Syst. Appl. 41 (8), 3825–3830.
Aburomman, A.A., Reaz, M.B.I., 2016. A novel svm-knn-pso ensemble method for intrusion detection system. Appl. Soft Comput. 38, 360–372.
Ain, Q.T., Ali, M., Riaz, A., Noureen, A., Kamran, M., Hayat, B., Rehman, A., 2017. Sentiment analysis using deep learning techniques: a review. Int. J. Adv. Comput. Sci. Appl. 8 (6), 424.
Akhtyamova, L., Ignatov, A., Cardiff, J., 2017. A large-scale cnn ensemble for medication safety analysis. In: International Conference on Applications of Natural Language to Information Systems. Springer, pp. 247–253.
Alharbi, A., Kalkatawi, M., Taileb, M., 2021. Arabic sentiment analysis using deep learning and ensemble methods. Arabian J. Sci. Eng. 46 (9), 8913–8923.
Ali, F., El-Sappagh, S., Islam, S.R., Kwak, D., Ali, A., Imran, M., Kwak, K.-S., 2020. A smart healthcare monitoring system for heart disease prediction based on ensemble deep learning and feature fusion. Informat. Fusion 63, 208–222.
Al-Omari, H., Abdullah, M., AlTiti, O., Shaikh, S., 2019. Justdeep at nlp4if 2019 task 1: Propaganda detection using ensemble deep learning models. In: Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda, pp. 113–118.
Alrehili, A., Albalawi, K., 2019. Sentiment analysis of customer reviews using ensemble method, pp. 1–6.
Alshazly, H., Linse, C., Barth, E., Martinetz, T., 2019. Ensembles of deep learning models and transfer learning for ear recognition. Sensors 19 (19), 4139.
Anifowose, F., Labadin, J., Abdulraheem, A., 2013. Ensemble model of artificial neural networks with randomized number of hidden neurons. In: 2013 8th International Conference on Information Technology in Asia (CITA). IEEE, pp. 1–5.
Antoniou, A.C., Gayther, S.A., Stratton, J.F., Ponder, B.A., Easton, D.F., 2000. Risk models for familial ovarian and breast cancer. Genetic Epidemiol.: Off. Publ. Int. Genetic Epidemiol. Soc. 18 (2), 173–190.
Anwar, H., Qamar, U., Muzaffar Qureshi, A.W., 2014. Global optimization ensemble model for classification methods. Sci. World J. 2014.
Araque, O., Corcuera-Platas, I., Sánchez-Rada, J.F., Iglesias, C.A., 2017. Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Syst. Appl. 77, 236–246.
Arel, I., Rose, D.C., Karnowski, T.P., 2010. Deep machine learning-a new frontier in artificial intelligence research [research frontier]. IEEE Comput. Intell. Mag. 5 (4), 13–18.
Asbai, N., Amrouche, A., 2017. Boosting scores fusion approach using front-end diversity and adaboost algorithm, for speaker verification. Comput. Electr. Eng. 62, 648–662.
Bakay, M., Wang, Z., Melcon, G., Schiltz, L., Xuan, J., Zhao, P., Sartorelli, V., Seo, J., Pegoraro, E., Angelini, C., et al., 2006. Nuclear envelope dystrophies show a transcriptional fingerprint suggesting disruption of rb–myod pathways in muscle regeneration. Brain 129 (4), 996–1013.
Barrón-Cedeno, A., Da San Martino, G., Jaradat, I., Nakov, P., 2019. Proppy: A system to unmask propaganda in online news. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 9847–9848.
Bashir, S., Qamar, U., Khan, F.H., 2015. Bagmoov: A novel ensemble for heart disease prediction bootstrap aggregation with multi-objective optimized voting. Austral. Phys. Eng. Sci. Med. 38 (2), 305–323.
Bashir, S., Qamar, U., Khan, F.H., 2015. Heterogeneous classifiers fusion for dynamic breast cancer diagnosis using weighted vote based ensemble. Quality Quantity 49 (5), 2061–2076.
Bebis, G., Georgiopoulos, M., 1994. Feed-forward neural networks. IEEE Potentials 13 (4), 27–31.
Bethard, S., Savova, G., Chen, W.-T., Derczynski, L., Pustejovsky, J., Verhagen, M., 2016. Semeval-2016 task 12: Clinical tempeval. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 1052–1062.
Bharathidason, S., Venkataeswaran, C.J., 2014. Improving classification accuracy based on random forest model with uncorrelated high performing trees. Int. J. Comput. Appl. 101 (13), 26–30.
Breiman, L., 1996. Bagging predictors. Machine Learn. 24 (2), 123–140.
Breiman, L., 2001. Random forests. Machine Learn. 45 (1), 5–32.
Brunetti, A., Buongiorno, D., Trotta, G.F., Bevilacqua, V., 2018. Computer vision and deep learning techniques for pedestrian detection and tracking: A survey. Neurocomputing 300, 17–33.
Bühlmann, P., Yu, B., 2002. Analyzing bagging. Annals Stat. 30 (4), 927–961.
Cai, R., Han, T., Liao, W., Huang, J., Li, D., Kumar, A., Ma, H., 2020. Prediction of surface chloride concentration of marine concrete using ensemble machine learning. Cem. Concr. Res. 136, 106164.
Can Malli, R., Aygun, M., Kemal Ekenel, H., 2016. Apparent age estimation using ensemble of deep learning models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 9–16.
Carvalho, D.V., Pereira, E.M., Cardoso, J.S., 2019. Machine learning interpretability: A survey on methods and metrics. Electronics 8 (8), 832.
Catal, C., Tufekci, S., Pirmit, E., Kocabag, G., 2015. On the use of ensemble of classifiers for accelerometer-based activity recognition. Appl. Soft Comput. 37, 1018–1022.
Cha, D., Pae, C., Seong, S.-B., Choi, J.Y., Park, H.-J., 2019. Automated diagnosis of ear disease using ensemble deep learning with a big otoendoscopy image database. EBioMedicine 45, 606–614.
Chan, A.B., Liang, Z.-S.J., Vasconcelos, N., 2008. Privacy preserving crowd monitoring: Counting people without people models or tracking. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 1–7.
Chatzimparmpas, A., Martins, R.M., Kucher, K., Kerren, A., 2020. Stackgenvis: Alignment of data, algorithms, and models for stacking ensemble learning using performance metrics. IEEE Trans. Visual Comput. Graphics 27 (2), 1547–1557.
Chen, L., Wang, W., Nagarajan, M., Wang, S., Sheth, A., 2012. Extracting diverse sentiment expressions with target-dependent polarity from twitter. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 6, no. 1, pp. 50–57.
Chen, K., Loy, C.C., Gong, S., Xiang, T., 2012. Feature mining for localised crowd counting. BMVC 1 (2), 3.
Chen, G., Ye, D., Xing, Z., Chen, J., Cambria, E., 2017. Ensemble application of convolutional and recurrent neural networks for multi-label text categorization. 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, pp. 2377–2383.
Chen, C., Xiong, Z., Tian, X., Wu, F., 2018. Deep boosting for image denoising. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–18.
Chen, W., Pradhan, B., Li, S., Shahabi, H., Rizeei, H.M., Hou, E., Wang, S., 2019. Novel hybrid integration approach of bagging-based fisher's linear discriminant function for groundwater potential analysis. Nat. Resour. Res. 28 (4), 1239–1258.
Chen, C., Xiong, Z., Tian, X., Zha, Z.-J., Wu, F., 2019. Real-world image denoising with deep boosting. IEEE Trans. Pattern Anal. Machine Intell. 42 (12), 3071–3087.
Cho, S.-B., Won, H.-H., 2003. Machine learning in dna microarray analysis for cancer classification. In: Proceedings of the First Asia-Pacific Bioinformatics Conference on Bioinformatics 2003-Volume 19, pp. 189–198.
Clark, S., Wicentwoski, R., 2013. Swatcs: Combining simple classifiers with estimated accuracy. In: Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pp. 425–429.
Codella, N.C., Nguyen, Q.-B., Pankanti, S., Gutman, D.A., Helba, B., Halpern, A.C., Smith, J.R., 2017. Deep learning ensembles for melanoma recognition in dermoscopy images. IBM J. Res. Dev. 61 (4/5), pp. 5–1.
Collobert, R., Weston, J., 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning, pp. 160–167.
Cortes, C., Mohri, M., Syed, U., 2014. Deep boosting. In: International Conference on Machine Learning. PMLR, pp. 1179–1187.
da Conceição, L.R., da Costa, C.E., Rocha, G.N.d., Pereira-Filho, E.R., Zamian, J.R., 2015. Ethanolysis optimisation of jupati (raphia taedigera mart.) oil to biodiesel using response surface methodology. J. Brazil. Chem. Soc. 26, 1321–1330.
da Conceição, L.R.V., Carneiro, L.M., Rivaldi, J.D., de Castro, H.F., 2016. Solid acid as catalyst for biodiesel production via simultaneous esterification and transesterification of macaw palm oil. Ind. Crops Prod. 89, 416–424.
Dai, Q., 2013. A competitive ensemble pruning approach based on cross-validation technique. Knowl.-Based Syst. 37, 394–414.
Das, R., Sengur, A., 2010. Evaluation of ensemble methods for diagnosing of valvular heart disease. Expert Syst. Appl. 37 (7), 5110–5115.
Da Silva, N.F., Hruschka, E.R., Hruschka Jr, E.R., 2014. Tweet sentiment analysis with classifier ensembles. Decis. Support Syst. 66, 170–179.
Dedhia, C., Ramteke, J., 2017. Ensemble model for twitter sentiment analysis. In: 2017 International Conference on Inventive Systems and Control (ICISC). IEEE, pp. 1–5.
Delgado, R., 2022. A semi-hard voting combiner scheme to ensemble multi-class probabilistic classifiers. Appl. Intell. 52 (4), 3653–3677.
Deng, L., Platt, J., 2014. Ensemble deep learning for speech recognition. In: Proc. Interspeech.
Deng, L., Tur, G., He, X., Hakkani-Tur, D., 2012. Use of kernel deep convex networks and end-to-end learning for spoken language understanding. 2012 IEEE Spoken Language Technology Workshop (SLT). IEEE, pp. 210–215.
Deng, L., Yu, D., et al., 2014. Deep learning: methods and applications. Found. Trends Signal Process. 7 (3–4), 197–387.
Deriu, J., Gonzenbach, M., Uzdilli, F., Lucchi, A., Luca, V.D., Jaggi, M., 2016. Swisscheese at semeval-2016 task 4: Sentiment classification using an ensemble of convolutional neural networks with distant supervision. In: Proceedings of the 10th International Workshop on Semantic Evaluation, no. CONF, pp. 1124–1128.
Divina, F., Gilson, A., Goméz-Vela, F., García Torres, M., Torres, J.F., 2018. Stacking ensemble learning for short-term electricity consumption forecasting. Energies 11 (4), 949.
Dong, X., Yu, Z., Cao, W., Shi, Y., Ma, Q., 2020. A survey on ensemble learning. Front. Comput. Sci. 14 (2), 241–258.
Dzikovska, M.O., Nielsen, R.D., Brew, C., Leacock, C., Giampiccolo, D., Bentivogli, L., Clark, P., Dagan, I., Dang, H.T., 2013. Semeval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge. North Texas State Univ Denton, Tech. Rep.
Ekbal, A., Bandyopadhyay, S., 2008. Web-based bengali news corpus for lexicon development and pos tagging. Polibits 37, 21–30.
Ekbal, A., Saha, S., 2011. A multiobjective simulated annealing approach for classifier ensemble: Named entity recognition in indian languages as case studies. Expert Syst. Appl. 38 (12), 14760–14772.
Elnagar, A., Al-Debsi, R., Einea, O., 2020. Arabic text classification using deep learning models. Informat. Process. Manage. 57 (1), 102121.
Erdoğan, Z., Namlı, E., 2019. A living environment prediction model using ensemble machine learning techniques based on quality of life index. J. Ambient Intell. Humanized Comput., 1–17.
Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D., 2014. Do we need hundreds of classifiers to solve real world classification problems? J. Machine Learn. Res. 15 (1), 3133–3181.
Fersini, E., Messina, E., Pozzi, F.A., 2014. Sentiment analysis: Bayesian ensemble learning. Decision Support Syst. 68, 26–38.
Fersini, E., Messina, E., Pozzi, F.A., 2016. Expressive signals in social media languages to improve polarity detection. Informat. Process. Manage. 52 (1), 20–35.
Fouad, M.M., Gharib, T.F., Mashat, A.S., 2018. Efficient twitter sentiment analysis system with feature selection and classifier ensemble. In: International Conference on Advanced Machine Learning Technologies and Applications. Springer, pp. 516–527.
Freund, Y., Schapire, R.E., et al., 1996. Experiments with a new boosting algorithm. 96, pp. 148–156.
Freund, Y., Iyer, R., Schapire, R.E., Singer, Y., 2003. An efficient boosting algorithm for combining preferences. J. Machine Learn. Res. 4 (Nov), 933–969.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Annals Stat., 1189–1232.
Friedman, J., Hastie, T., Tibshirani, R., 2000. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Annals Stat. 28 (2), 337–407.
Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., 1993. Darpa timit acoustic-phonetic continous speech corpus cd-rom. NIST speech disc 1–1.1. NASA STI/Recon Technical Report N 93, 27403.
Ge, R., Feng, G., Jing, X., Zhang, R., Wang, P., Wu, Q., 2020. Enacp: An ensemble learning model for identification of anticancer peptides. Front. Genet. 11, 760.
Go, A., Bhayani, R., Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, vol. 1, no. 12, p. 2009.
Gosselin, P.-H., Murray, N., Jégou, H., Perronnin, F., 2014. Revisiting the fisher vector for fine-grained classification. Pattern Recognit. Lett. 49, 92–98.
Guo, P., Xue, Z., Mtema, Z., Yeates, K., Ginsburg, O., Demarco, M., Long, L.R., Schiffman, M., Antani, S., 2020. Ensemble deep learning for cervix image selection toward improving reliability in automated cervical precancer screening. Diagnostics 10 (7), 451.
Haghighi, F., Omranpour, H., 2021. Stacking ensemble model of deep learning and its application to persian/arabic handwritten digits recognition. Knowl.-Based Syst. 220, 106940.
Han, S., Meng, Z., Khan, A.-S., Tong, Y., 2016. Incremental boosting convolutional neural network for facial action unit recognition. Adv. Neural Informat. Process. Syst. 29, 109–117.
Haque, A., Milstein, A., Fei-Fei, L., 2020. Illuminating the dark spaces of healthcare with ambient intelligence. Nature 585 (7824), 193–202.
Haralabopoulos, G., Anagnostopoulos, I., McAuley, D., 2020. Ensemble deep learning for multilabel binary classification of user-generated content. Algorithms 13 (4), 83.
Hashino, T., Bradley, A., Schwartz, S., 2007. Evaluation of bias-correction methods for ensemble streamflow volume forecasts. Hydrol. Earth Syst. Sci. 11 (2), 939–950.
Hassan, A., Abbasi, A., Zeng, D., 2013. Twitter sentiment analysis: A bootstrap ensemble framework. 2013 International Conference on Social Computing. IEEE, pp. 357–364.
Haumahu, J., Permana, S., Yaddarabullah, Y., 2021. Fake news classification for indonesian news using extreme gradient boosting (xgboost). IOP Conference Series: Materials Science and Engineering, vol. 1098, no. 5. IOP Publishing, p. 052081.
Heikal, M., Torki, M., El-Makky, N., 2018. Sentiment analysis of arabic tweets using deep learning. Proc. Comput. Sci. 142, 114–122.
Helmy, T., Fatai, A., Faisal, K., 2010. Hybrid computational models for the characterization of oil and gas reservoirs. Expert Syst. Appl. 37 (7), 5353–5363.
Hinrichs, C., Singh, V., Mukherjee, L., Xu, G., Chung, M.K., Johnson, S.C., Initiative, A.D.N., et al., 2009. Spatially augmented lpboosting for ad classification with evaluations on the adni dataset. Neuroimage 48 (1), 138–149.
Hopkinson, B.M., King, A.C., Owen, D.P., Johnson-Roberson, M., Long, M.H., Bhandarkar, S.M., 2020. Automated classification of three-dimensional reconstructions of coral reefs using convolutional neural networks. PloS One 15 (3), e0230671.
Hormozi, E., Akbari, M.K., Hormozi, H., Javan, M.S., 2013. Accuracy evaluation of a credit card fraud detection system on hadoop mapreduce. The 5th Conference on Information and Knowledge Technology. IEEE, pp. 35–39.
Hosni, M., Abnane, I., Idri, A., de Gea, J.M.C., Alemán, J.L.F., 2019. Reviewing ensemble classification methods in breast cancer. Comput. Methods Programs Biomed. 177, 89–112.
Hospedales, T., Antoniou, A., Micaelli, P., Storkey, A., 2021. Meta-learning in neural networks: A survey. IEEE Trans. Pattern Anal. Machine Intell. 44 (9), 5149–5169.
Huang, S., Wang, B., Qiu, J., Yao, J., Wang, G., Yu, G., 2016. Parallel ensemble of online sequential extreme learning machine based on mapreduce. Neurocomputing 174, 352–367.
Idrees, H., Saleemi, I., Seibert, C., Shah, M., 2013. Multi-source multi-scale counting in extremely dense crowd images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2547–2554.
Jaouedi, N., Boujnah, N., Bouhlel, M.S., 2020. A new hybrid deep learning model for human action recognition. J. King Saud Univ.-Comput. Informat. Sci. 32 (4), 447–453.
Jiang, Z., Lin, Z., Davis, L.S., 2013. Label consistent k-svd: Learning a discriminative dictionary for recognition. IEEE Trans. Pattern Anal. Machine Intell. 35 (11), 2651–2664.
Kadam, V.J., Jadhav, S.M., Vijayakumar, K., 2019. Breast cancer diagnosis using feature ensemble learning based on stacked sparse autoencoders and softmax regression. J. Medical Syst. 43 (8), 1–11.
Kamilaris, A., Prenafeta-Boldú, F.X., 2018. Deep learning in agriculture: A survey. Comput. Electron. Agric. 147, 70–90.
Kanakaraj, M., Guddeti, R.M.R., 2015. Performance analysis of ensemble methods on twitter sentiment analysis using nlp techniques. In: Proceedings of the 2015 IEEE 9th International Conference on Semantic Computing (IEEE ICSC 2015). IEEE, pp. 169–170.
Kang, S., Kang, P., Ko, T., Cho, S., Rhee, S.-J., Yu, K.-S., 2015. An efficient and effective ensemble of support vector machines for anti-diabetic drug failure prediction. Expert Syst. Appl. 42 (9), 4265–4273.
Karimi, S., Metke-Jimenez, A., Kemp, M., Wang, C., 2015. Cadec: A corpus of adverse drug event annotations. J. Biomed. Informat. 55, 73–81.
Kermany, D.S., Goldbaum, M., Cai, W., Valentim, C.C., Liang, H., Baxter, S.L., McKeown, A., Yang, G., Wu, X., Yan, F., et al., 2018. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172 (5), 1122–1131.
Khamparia, A., Singh, A., Anand, D., Gupta, D., Khanna, A., Arun Kumar, N., Tan, J., 2020. A novel deep learning-based multi-model ensemble method for the prediction of neuromuscular disorders. Neural Comput. Appl. 32 (15), 11083–11095.
Khan, W., Ghazanfar, M.A., Azam, M.A., Karami, A., Alyoubi, K.H., Alfakeeh, A.S., 2020. Stock market prediction using machine learning classifiers and social media, news. J. Ambient Intell. Humanized Comput., 1–24.
Kim, H.-C., Pang, S., Je, H.-M., Kim, D., Bang, S.Y., 2003. Constructing support vector machine ensemble. Pattern Recognit. 36 (12), 2757–2767.
Koh, N.S., Hu, N., Clemons, E.K., 2010. Do online reviews reflect a product's true perceived quality? an investigation of online movie reviews across cultures. Electron. Commer. Res. Appl. 9 (5), 374–385.
Krause, J., Stark, M., Deng, J., Fei-Fei, L., 2013. 3d object representations for fine-grained categorization. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561.
Krawczyk, B., Minku, L.L., Gama, J., Stefanowski, J., Woźniak, M., 2017. Ensemble learning for data stream analysis: A survey. Informat. Fusion 37, 132–156.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. Adv. Neural Informat. Process. Syst. 25, 1097–1105.
Kulkarni, N.H., Srinivasan, G., Sagar, B., Cauvery, N., 2018. Improving crop productivity through a crop recommendation system using ensembling technique. In: 2018 3rd International Conference on Computational Systems and Information Technology for Sustainable Solutions (CSITSS). IEEE, pp. 114–119.
Kumar, G., Misra, A.K., 2018. Commonality in liquidity: Evidence from india's national stock exchange. J. Asian Econ. 59, 1–15.
Kumar, A., Kim, J., Lyndon, D., Fulham, M., Feng, D., 2016. An ensemble of fine-tuned convolutional neural networks for medical image classification. IEEE J. Biomed. Health Informat. 21 (1), 31–40.
Kumar, V., Aydav, P.S.S., Minz, S., 2021. Multi-view ensemble learning using multi-objective particle swarm optimization for high dimensional data classification. J. King Saud Univ.-Comput. Informat. Sci.
Kuruvayil, S., Palaniswamy, S., 2021. Emotion recognition from facial images with simultaneous occlusion, pose and illumination variations using meta-learning. J. King Saud Univ.-Comput. Informat. Sci.
Kuruvayil, S., Palaniswamy, S., 2022. Emotion recognition from facial images with simultaneous occlusion, pose and illumination variations using meta-learning. J. King Saud Univ.-Comput. Informat. Sci. 34 (9), 7271–7282.
Kuznetsov, V., Mohri, M., Syed, U., 2014. Multi-class deep boosting.
Lakshminarayanan, B., Pritzel, A., Blundell, C., 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. Adv. Neural Informat. Process. Syst. 30.
Latif-Shabgahi, G.-R., 2004. A novel algorithm for weighted average voting used in fault tolerant computing systems. Microprocess. Microsyst. 28 (7), 357–361.
LeCun, Y., 1998. The mnist database of handwritten digits, http://yann.lecun.com/exdb/mnist/.
Lewis, D.D., Yang, Y., Russell-Rose, T., Li, F., 2004. Rcv1: A new benchmark collection for text categorization research. J. Machine Learn. Res. 5 (Apr), 361–397.
Li, Y., Maguire, L., 2010. Selecting critical patterns based on local geometrical and statistical information. IEEE Trans. Pattern Anal. Machine Intell. 33 (6), 1189–1201.
Li, S., Lee, S.Y., Chen, Y., Huang, C.-R., Zhou, G., 2010. Sentiment classification and polarity shifting. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 635–643.
Li, F.H., Huang, M., Yang, Y., Zhu, X., 2011. Learning to identify review spam. In: Twenty-second International Joint Conference on Artificial Intelligence.
Li, W., Wang, W., Chen, Y., 2012. Heterogeneous ensemble learning for chinese sentiment classification. J. Informat. Comput. Sci. 9 (15), 4551–4558.
Li, J., Chang, H., Yang, J., 2015. Sparse deep stacking network for image classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29, no. 1.
Liu, G.-H., Li, Z.-Y., Zhang, L., Xu, Y., 2011. Image retrieval based on micro-structure descriptor. Pattern Recogn. 44 (9), 2123–2133.
Liu, P., Han, S., Meng, Z., Tong, Y., 2014. Facial expression recognition via a boosted deep belief network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1805–1812.
Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1096–1104.
Liu, F., Xu, F., Yang, S., 2017. A flood forecasting model based on deep learning algorithm via integrating stacked autoencoders with bp neural network. 2017 IEEE Third International Conference on Multimedia Big Data (BigMM). IEEE, pp. 58–61.
Livieris, I.E., Kanavos, A., Tampakas, V., Pintelas, P., 2019. A weighted voting ensemble self-labeled algorithm for the detection of lung abnormalities from x-rays. Algorithms 12 (3), 64.
Livieris, I.E., Iliadis, L., Pintelas, P., 2020. On ensemble techniques of weight-constrained neural networks. Evolv. Syst., 1–13.
Locketz, G.D., Li, P.M., Fischbein, N.J., Holdsworth, S.J., Blevins, N.H., 2016. Fusion of computed tomography and propeller diffusion-weighted magnetic resonance imaging for the detection and localization of middle ear cholesteatoma. JAMA Otolaryngol.-Head Neck Surg. 142 (10), 947–953.
Lu, B., Tsou, B.K., 2010. Combining a large sentiment lexicon and machine learning for subjectivity classification. 2010 International Conference on Machine Learning and Cybernetics, vol. 6. IEEE, pp. 3311–3316.
Lu, X., Van Roy, B., 2017. Ensemble sampling. Adv. Neural Informat. Process. Syst. 30.
Ma, Z., Wang, P., Gao, Z., Wang, R., Khalighi, K., 2018. Ensemble of machine learning algorithms using the stacked generalization approach to estimate the warfarin dose. PloS One 13 (10), e0205872.
Makhtar, M., Yang, L., Neagu, D., Ridley, M., 2012. Optimisation of classifier ensemble for predictive toxicology applications. In: 2012 UKSim 14th International Conference on Computer Modelling and Simulation. IEEE, pp. 236–241.
Marques, J., Alves, R.M.F., Oliveira, H.C., Mendonca, M., Souza, J.R., 2021. An evaluation of machine learning methods for speed-bump detection on a gopro dataset. Anais da Academia Brasileira de Ciencias 93 (1), e20190734.
Mendonca, T., Celebi, M., Mendonca, T., Marques, J., 2015. Ph2: A public database for the analysis of dermoscopic images. Dermoscopy image analysis.
Mishra, S., Mishra, D., 2015. Adaptive multi-classifier fusion approach for gene expression dataset based on probabilistic theory. J. Korean Stat. Soc. 44 (2), 247–260.
Moghimi, M., Belongie, S.J., Saberian, M.J., Yang, J., Vasconcelos, N., Li, L.-J., 2016. Boosted convolutional neural networks. In: BMVC, vol. 5, p. 6.
Mohammadi, A., Shaverizade, A., 2021. Ensemble deep learning for aspect-based sentiment analysis. Int. J. Nonlinear Anal. Appl. 12, 29–38.
Mohammed, A., Kora, R., 2019. Deep learning approaches for arabic sentiment analysis. Social Network Anal. Min. 9 (1), 1–12.
Mohammed, A., Kora, R., 2021. An effective ensemble deep learning framework for text classification. J. King Saud Univ.-Comput. Informat. Sci. 2021.
Monteiro, J.P., Ramos, D., Carneiro, D., Duarte, F., Fernandes, J.M., Novais, P., 2021. Meta-learning and the new challenges of machine learning. Int. J. Intell. Syst. 36 (11), 6240–6272.
Montgomery, J.M., Hollenbach, F.M., Ward, M.D., 2012. Improving predictions using ensemble bayesian model averaging. Polit. Anal. 20 (3), 271–291.
Mosca, A., Magoulas, G.D., 2016. Deep incremental boosting. In: GCAI, pp. 293–302.
Nabil, M., Aly, M., Atiya, A., 2015. Astd: Arabic sentiment tweets dataset. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2515–2519.
Nakov, P., Rosenthal, S., Kiritchenko, S., Mohammad, S.M., Kozareva, Z., Ritter, A., Stoyanov, V., Zhu, X., 2016. Developing a successful semeval task in sentiment analysis of twitter and other social media texts. Language Resourc. Eval. 50 (1), 35–65.
Nguyen, H.T., Le Nguyen, M., 2019. An ensemble method with sentiment features and clustering support. Neurocomputing 370, 155–165.
Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H.G., Ogata, T., 2015. Audio-visual speech recognition using deep learning. Appl. Intell. 42 (4), 722–737.
Nti, I.K., Adekoya, A.F., Weyori, B.A., 2020. A comprehensive evaluation of ensemble learning for stock-market prediction. J. Big Data 7 (1), 1–40.
Onan, A., Korukoğlu, S., Bulut, H., 2016. A multiobjective weighted voting ensemble classifier based on differential evolution algorithm for text sentiment classification. Expert Syst. Appl. 62, 1–16.
Opitz, M., Waltner, G., Possegger, H., Bischof, H., 2017. Bier-boosting independent embeddings robustly. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5189–5198.
Ortiz, A., Munilla, J., Gorriz, J.M., Ramirez, J., 2016. Ensembles of deep learning architectures for the early diagnosis of the alzheimer's disease. Int. J. Neural Syst. 26 (07), 1650025.
Oussous, A., Lahcen, A.A., Belfkih, S., 2018. Improving sentiment analysis of moroccan tweets using ensemble learning. In: International Conference on Big Data, Cloud and Applications. Springer, pp. 91–104.
Palangi, H., Deng, L., Ward, R.K., 2014. Recurrent deep-stacking networks for sequence classification. In: 2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP). IEEE, pp. 510–514.
Pandit, S., Kumar, S., 2020. Improvement in convolutional neural network for cifar-10 dataset image classification. Int. J. Comput. Appl. 176, 25–29.
Pang, B., Lee, L., 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In: ACL.
Pasupulety, U., Anees, A.A., Anmol, S., Mohan, B.R., 2019. Predicting stock prices using ensemble learning and sentiment analysis. In: 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE). IEEE, pp. 215–222.
Perikos, I., Hatzilygeroudis, I., 2016. Recognizing emotions in text using ensemble of classifiers. Eng. Appl. Artif. Intell. 51, 191–201.
Polikar, R., 2012. Ensemble learning. In: Ensemble Machine Learning. Springer, pp. 1–34.
Popel, M., Tomkova, M., Tomek, J., Kaiser, Ł., Uszkoreit, J., Bojar, O., Žabokrtskỳ, Z., 2020. Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nat. Commun. 11 (1), 1–15.
Prusa, J., Khoshgoftaar, T.M., Dittman, D.J., 2015. Using ensemble learners to improve classifier performance on tweet sentiment data. 2015 IEEE International Conference on Information Reuse and Integration. IEEE, pp. 252–257.
Qiu, X., Zhang, L., Ren, Y., Suganthan, P.N., Amaratunga, G., 2014. Ensemble deep learning for regression and time series forecasting. 2014 IEEE Symposium on Computational Intelligence in Ensemble Learning (CIEL). IEEE, pp. 1–6.
Rodriguez-Penagos, C., Atserias, J., Codina-Filba, J., García-Narbona, D., Grivolla, J., Lambert, P., Saurí, R., 2013. Fbm: Combining lexicon-based ml and heuristics for social media polarities. In: Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pp. 483–489.
Rokach, L., 2019. Ensemble learning: Pattern classification using ensemble methods. World Sci. 85.
Rushdi-Saleh, M., Martín-Valdivia, M.T., Ureña-López, L.A., Perea-Ortega, J.M., 2011. Oca: Opinion corpus for arabic. J. Am. Soc. Informat. Sci. Technol. 62 (10), 2045–2054.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115 (3), 211–252.
Saeed, R.M., Rady, S., Gharib, T.F., 2022. An ensemble approach for spam detection in arabic opinion texts. J. King Saud Univ.-Comput. Informat. Sci. 34 (1), 1407–1416.
Sagi, O., Rokach, L., 2018. Ensemble learning: A survey. Wiley Interdiscip. Rev.: Data Min. Knowledge Discov. 8 (4), e1249.
Saleena, N., et al., 2018. An ensemble classification system for twitter sentiment analysis. Proc. Comput. Sci. 132, 937–946.
Saleh, H., Mostafa, S., Alharbi, A., El-Sappagh, S., Alkhalifah, T., 2022. Heterogeneous ensemble deep learning model for enhanced arabic sentiment analysis. Sensors 22 (10), 3707.
Scopus, 2023. Scopus preview, https://scopus.com/.
Seijo-Pardo, B., Porto-Díaz, I., Bolón-Canedo, V., Alonso-Betanzos, A., 2017. Ensemble feature selection: homogeneous and heterogeneous approaches. Knowl.-Based Syst. 118, 124–139.
Seker, S.E., Ocak, I., 2019. Performance prediction of roadheaders using ensemble machine learning techniques. Neural Comput. Appl. 31 (4), 1103–1116.
Seki, Y., Evans, D.K., Ku, L.-W., L.S. 0001, Chen, H.-H., Kando, N., 2008. Overview of multilingual opinion analysis task at ntcir-7. In: NTCIR. Citeseer, pp. 185–203.
Seyyedsalehi, S.Z., Seyyedsalehi, S.A., 2014. Simultaneous learning of nonlinear manifolds based on the bottleneck neural network. Neural Proces. Lett. 40 (2), 191–209.
Shahzad, R.K., Lavesson, N., 2013. Comparative analysis of voting schemes for ensemble-based malware detection. J. Wireless Mobile Netw., Ubiquitous Comput. Dependable Appl. 4 (1), 98–117.
Shahzad, R.K., Haider, S.I., Lavesson, N., 2010. Detection of spyware by mining executable files. In: 2010 International Conference on Availability, Reliability and Security. IEEE, pp. 295–302.
Sharma, A., Raju, D., Ranjan, S., 2017. Detection of pneumonia clouds in chest x-ray using image processing approach. In: 2017 Nirma University International Conference on Engineering (NUiCONE). IEEE, pp. 1–4.
Sharma, S., Srivastava, S., Kumar, A., Dangi, A., 2018. Multi-class sentiment analysis comparison using support vector machine (svm) and bagging technique-an ensemble method. In: 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE). IEEE, pp. 1–6.
Shin, Y., 2019. Application of stochastic gradient boosting approach to early prediction of safety accidents at construction site. Adv. Civil Eng. 2019.
Shipp, C.A., Kuncheva, L.I., 2002. Relationships between combination methods and measures of diversity in combining classifiers. Informat. Fus. 3 (2), 135–148.
Smyth, P., Wolpert, D., 1997. Stacked density estimation. Adv. Neural Informat. Process. Syst. 10.
Soares, C., Brazdil, P.B., Kuba, P., 2004. A meta-learning method to select the kernel width in support vector regression. Machine Learn. 54 (3), 195–209.
Stamatatos, E., Widmer, G., 2002. Music performer recognition using an ensemble of simple classifiers. ECAI, 335–339.
Su, Y., Zhang, Y., Ji, D., Wang, Y., Wu, H., 2012. Ensemble learning for sentiment classification. In: Workshop on Chinese Lexical Semantics. Springer, pp. 84–93.
Sultana, N., Sharma, N., Sharma, K.P., Verma, S., 2020. A sequential ensemble model for communicable disease forecasting. Curr. Bioinform. 15 (4), 309–317.
Sun, B., Chen, S., Wang, J., Chen, H., 2016. A robust multi-class adaboost algorithm for mislabeled noisy data. Knowl.-Based Syst. 102, 87–102.
Täckström, O., McDonald, R., 2011. Semi-supervised latent variable models for sentence-level sentiment analysis. In: The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
Tang, J., Su, Q., Su, B., Fong, S., Cao, W., Gong, X., 2020. Parallel ensemble learning of convolutional neural networks and local binary patterns for face recognition. Comput. Methods Programs Biomed. 197, 105622.
Tasci, E., Uluturk, C., Ugur, A., 2021. A voting-based ensemble deep learning method focusing on image augmentation and preprocessing variations for tuberculosis detection. Neural Comput. Appl., 1–15.
Thakur, R.S., Yadav, R.N., Gupta, L., 2019. State-of-art analysis of image denoising methods using convolutional neural networks. IET Image Proc. 13 (13), 2367–2380.
Tratz, S., Briesch, D., Laoudi, J., Voss, C. Tweet conversation annotation tool with a focus on an arabic dialect, moroccan darija. In: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pp. 135–139.
Tsai, C.-F., Lin, Y.-C., Yen, D.C., Chen, Y.-M., 2011. Predicting stock returns by classifier ensembles. Appl. Soft Comput. 11 (2), 2452–2459.
Tsutsumi, K., Shimada, K., Endo, T., 2007. Movie review classification based on a multiple classifier. In: Proceedings of the 21st Pacific Asia Conference on Language, Information and Computation, pp. 481–488.
Tur, G., Deng, L., Hakkani-Tür, D., He, X., 2012. Towards deeper understanding: Deep convex networks for semantic utterance classification. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 5045–5048.
Valle, C., Saravia, F., Allende, H., Monge, R., Fernández, C., 2010. Parallel approach for ensemble learning with locally coupled neural networks. Neural Process. Lett. 32 (3), 277–291.
van Aken, B., Risch, J., Krestel, R., Löser, A., 2018. Challenges for toxic comment classification: An in-depth error analysis. In: ALW.
Walach, E., Wolf, L., 2016. Learning to count with cnn boosting. In: European Conference on Computer Vision. Springer, pp. 660–676.
Waltner, G., Opitz, M., Possegger, H., Bischof, H., 2019. Hibster: Hierarchical boosted deep metric learning for image retrieval. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp. 599–608.
Wang, X.-Y., Zhang, B.-B., Yang, H.-Y., 2013. Active svm-based relevance feedback using multiple classifiers ensemble and features reweighting. Eng. Appl. Artif. Intell. 26 (1), 368–381.
Wang, G., Sun, J., Ma, J., Xu, K., Gu, J., 2014. Sentiment classification: The contribution of ensemble learning. Decision Support Syst. 57, 77–93.
Wang, F., Jiang, D., Wen, H., Song, H., 2019. Adaboost-based security level classification of mobile intelligent terminals. J. Supercomput. 75 (11), 7460–7478.
Wang, B., Xue, B., Zhang, M., 2020. Particle swarm optimisation for evolving deep neural networks for image classification by evolving and stacking transferable blocks. 2020 IEEE Congress on Evolutionary Computation (CEC). IEEE, pp. 1–8.
Wen, Y.-H., Lee, T.-T., Cho, H.-J., 2005. Missing data treatment and data fusion toward travel time estimation for atis. J. Eastern Asia Soc. Transport. Stud. 6, 2546–2560.
Whitehead, M., Yaeger, L., 2009. Building a general purpose cross-domain sentiment mining model. 2009 WRI World Congress on Computer Science and Information Engineering, vol. 4. IEEE, pp. 472–476.
Wiebe, J., Wilson, T., Cardie, C., 2005. Annotating expressions of opinions and emotions in language. Language Resourc. Eval. 39 (2), 165–210.
Wilson, T., Wiebe, J., Hwa, R., 2006. Recognizing strong and weak opinion clauses. Comput. Intell. 22 (2), 73–99.
Wu, J., Yu, X., Liu, D., Chandraker, M., Wang, Z., 2020. David: Dual-attentional video deblurring. In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 2365–2374.
Xia, R., Zong, C., Li, S., 2011. Ensemble of feature sets and classification algorithms for sentiment classification. Informat. Sci. 181 (6), 1138–1152.
Xia, R., Xu, F., Yu, J., Qi, Y., Cambria, E., 2016. Polarity shift detection, elimination and ensemble: A three-stage model for document-level sentiment analysis. Informat. Process. Manage. 52 (1), 36–45.
Xiong, Y., Ye, M., Wu, C., 2021. Cancer classification with a cost-sensitive naive bayes stacking ensemble. Comput. Mathe. Methods Med. 2021.
Xu, S., Liang, H., Baldwin, T., 2016. Unimelb at semeval-2016 tasks 4a and 4b: An ensemble of neural networks and a word2vec based model for sentiment classification. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 183–189.
Yang, B., Yan, J., Lei, Z., Li, S.Z., 2015. Convolutional channel features. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 82–90.
Yu, Y., Si, X., Hu, C., Zhang, J., 2019. A review of recurrent neural networks: Lstm cells and network architectures. Neural Comput. 31 (7), 1235–1270.
Zareapoor, M., Shamsolmoali, P., et al., 2015. Application of credit card fraud detection: Based on bagging ensemble classifier. Procedia Comput. Sci. 48 (2015), 679–685.
Zhang, W., Zou, H., Luo, L., Liu, Q., Wu, W., Xiao, W., 2016. Predicting potential side effects of drugs by recommender methods and ensemble learning. Neurocomputing 173, 979–987.
Zhang, H., Dai, Y., Li, H., Koniusz, P., 2019. Deep stacked hierarchical multi-patch network for image deblurring. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5978–5986.
Zhang, W., Jiang, J., Shao, Y., Cui, B., 2020. Snapshot boosting: a fast ensemble framework for deep neural networks. Science China Informat. Sci. 63 (1), 1–12.
Zhang, J., Zhang, W., Song, R., Ma, L., Li, Y., 2020. Grasp for stacking via deep reinforcement learning. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 2543–2549.