DroidFusion: A Novel Multilevel Classifier Fusion Approach for Android Malware Detection
Abstract—Android malware has continued to grow in volume and complexity, posing significant threats to the security of mobile devices and the services they enable. This has prompted increasing interest in employing machine learning to improve Android malware detection. In this paper, we present a novel classifier fusion approach based on a multilevel architecture that enables effective combination of machine learning algorithms for improved accuracy. The framework (called DroidFusion) generates a model by training base classifiers at a lower level and then applies a set of ranking-based algorithms on their predictive accuracies at the higher level in order to derive a final classifier. The induced multilevel DroidFusion model can then be utilized as an improved accuracy predictor for Android malware detection. We present experimental results on four separate datasets to demonstrate the effectiveness of our proposed approach. Furthermore, we demonstrate that the DroidFusion method can also effectively enable the fusion of ensemble learning algorithms for improved accuracy. Finally, we show that the prediction accuracy of DroidFusion, despite only utilizing a computational approach in the higher level, can outperform stacked generalization, a well-known classifier fusion method that employs a meta-classifier approach in its higher level.

Index Terms—Android malware detection, classifier fusion, ensemble learning, machine learning, mobile security, stacked generalization.

I. INTRODUCTION
In recent years, Android has become the leading mobile operating system with a substantially higher percentage of the global market share. Over 1 billion Android devices have been sold with an estimated 65 billion app downloads from Google Play alone [1]. The growth in the popularity of Android and the proliferation of third party app markets has also made it a popular target for malware. Last year, McAfee reported that there were more than 12 million Android malware samples with nearly 2.5 million new samples discovered every year [2]. Android malware can be embedded in a variety of applications such as banking apps, gaming apps, lifestyle apps, educational apps, etc. These malware-infected apps can then compromise security and privacy by allowing unauthorized access to privacy-sensitive information, rooting devices, turning devices into remotely controlled bots, etc.
Zero-day Android malware has the ability to evade traditional signature-based defences. Hence, there is an urgent need to develop more effective detection methods. Recently, machine learning-based methods are increasingly being applied to Android malware detection. However, classifier fusion approaches have not been extensively explored as they have been in other domains like network intrusion detection.
In this paper, we present and investigate a novel classifier fusion approach that utilizes a multilevel architecture to increase the predictive power of machine learning algorithms. The framework, called DroidFusion, is designed to induce a classification model for Android malware detection by training a number of base classifiers at the lower level. A set of ranking-based algorithms are then utilized to derive combination schemes at the higher level, one of which is selected to build a final model. The framework is capable of leveraging not only traditional singular learning algorithms like decision trees or naive Bayes, but also ensemble learning algorithms like random forest, random subspace, boosting, etc. for improved classification accuracy.
In order to demonstrate the effectiveness of the DroidFusion approach, we performed extensive experiments on four datasets derived from extracting features from two publicly available and widely used malware sample collections (i.e., Android Malgenome project [3] and DREBIN [4]) and a collection of samples provided by Intel Security (formerly,
The results of experiments with singular classifiers and ensemble classifiers are presented.
4) Furthermore, we present results of a performance comparison of DroidFusion with stacked generalization (or stacking), a well-known classifier fusion method that is also based on a multilevel architecture.
5) Datasets that we created from the feature extraction process with DREBIN and Malgenome project malware samples are released in the supplementary material.
The rest of this paper is structured as follows. Section II discusses related work while Section III presents the DroidFusion framework. The investigation methodology is presented in Section IV, while Section V presents results with analyses and discussion. Finally, the conclusion is given in Section VI.

II. RELATED WORK
In this section, we review related work on machine learning-based Android malware detection. Static and/or dynamic analysis is used to extract model training features, and both methods have pros and cons. Static analysis is prone to obfuscation [5], but is generally faster and less resource intensive than dynamic analysis. Dynamic analysis is resistant to obfuscation but can be hampered by anti-virtualization [6]–[9] and code coverage limitations [10], [34].

A. Static Analysis With Traditional Classifiers
Recent Android malware detection works that employ machine learning with static features include the following. DroidMat [11] proposed applying k-means and k-nearest neighbor (k-NN) algorithms based on static features from permissions, intents, and application program interface (API) calls, to classify apps as benign or malware. Arp et al. [4] proposed SVM based on permissions, API calls, network access, etc. for lightweight on-device detection. Yerima et al. [12], [14] proposed an eigenspace analysis approach, as well as random forest ensemble learning models. The machine learning-based detection methods proposed in the papers were based on API calls, intents, permissions, and embedded commands. Varsha et al. [15] investigated SVM, random forest, and rotation forests on three datasets; their detection method employed static features extracted from the manifest and application executable files.
Sharma and Dash [16] utilized API calls and permissions to build naive Bayes and k-NN-based detection systems. In [17], API classes were used with random forest, J48, and SVM classifiers. Wang et al. [18] evaluated the usefulness of risky permissions for malware detection using SVM, decision trees, and random forest. DAPASA [19] focused on detecting malware piggybacked onto benign apps by utilizing sensitive subgraphs to construct five features depicting invocation patterns. The features are fed into machine learning algorithms, i.e., random forest, decision tree, k-NN, and PART, with random forest yielding the best detection performance.
Cen et al. [20] proposed a detection method based on API calls from decompiled code and permissions. Their proposed method applies a probabilistic discriminative model based on regularized logistic regression (RLR). RLR is compared to SVM, decision tree, k-NN, and naive Bayes with information priors and hierarchical mixture of naive Bayes.
Wang et al. [52] applied logistic regression, linear SVM, decision tree, and random forest with static analysis for the detection of malicious apps. They utilized app-specific static features and platform-specific static features for training the machine learning algorithms. The authors reported a maximum true positive rate (TPR) of 96% and false positive rate (FPR) of 0.06% with the logistic regression classifier based on experiments conducted on 18 363 malware apps and 217 619 benign apps.
Other research papers that have investigated static features with machine learning for Android malware detection include [21]–[23], [45], [47], [48], and [54].

B. Dynamic and Hybrid Analysis With Traditional Classifiers
Some of the detection methods utilized dynamic features with machine learning, for example AntiMalDroid [24]. AntiMalDroid is a dynamic analysis behavior-based malware detection framework that uses logged behavior sequences as features with SVM. DroidDolphin [25] also employed SVM with dynamically obtained features. Afonso et al. [26] utilized dynamic API calls and system call traces and investigated SVM, J48, IBk (an instance-based classifier), BayesNet K2, BayesNet TAN, random forest, and naive Bayes. Alzaylaee et al. [27] investigated SVM, naive Bayes, PART, random forest, J48, multilayer perceptron (MLP), and simple logistic by comparing their performances on real phones versus emulators using dynamically obtained features.
Ni et al. [46] proposed a real-time malicious behavior detection system that records API calls, permission uses, and other real-time features such as user operations. In their paper, they used SVM and naive Bayes algorithms for detection with these run-time features.
Mahindru and Singh [53] extracted 123 dynamic permissions from 11 000 Android applications which were subsequently applied to several individual machine learning classifiers including naive Bayes, decision tree, random forest, simple logistic, and k-star. In their experiments, simple logistic was found to perform marginally better than the others but the malware classification accuracy of random forest, decision tree (J48), and simple logistic were comparable.
Other works, such as MARVIN [28], adopt a hybrid static and dynamic feature-based approach with machine learning (SVM and L2 regularized linear classifier). MARVIN assesses the risk associated with unknown Android apps in the form of a malice score ranging from 0 to 10. Similarly, Su et al. [49] adopted a hybrid static and dynamic feature approach by performing experiments on 1200 (900 clean and 300 malware) samples. Several machine learning algorithms were investigated including Bayes net, naive Bayes, k-NN, J48, and SVM. The best overall accuracy of 91.1% was attained with SVM.

C. Android Malware Detection With Classifier Fusion
Previous works in intrusion detection systems such as [29]–[32] investigated classifier fusion for improving detection accuracy. This method is also being applied
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
to the detection of Android malware. For example, Milosevic et al. [50] investigated a classifier fusion approach with static analysis based on Android permissions and source code-based analysis. They used SVM, C4.5, decision trees, random tree, random forests, JRip, and linear regression classifiers. The authors experimented with ensembles that contained odd combinations of three and five classifiers using the majority voting fusion method. The best fusion model achieved an accuracy rate of 95.6% using the source-code-based features. However, the number of samples used in the experiments was limited (387 samples for the permissions-based experiments and 368 for source code-based analysis).
Yerima et al. [13] compared several classifier fusion methods, i.e., majority vote, product of probabilities, maximum probability, and average of probabilities using J48, naive Bayes, PART, RIDOR, and simple logistic classifiers. The classifiers were trained with static features extracted from 6863 app samples, and in the experiments presented, the fused models performed better than the single classifiers.
Wang et al. [51] extracted 11 types of static features and employed multiple classifiers in a majority vote fusion approach. The classifiers include SVM, k-NN, naive Bayes, classification and regression tree (CART), and random forest. Their experiments on 116 028 app samples showed more robustness with the majority voting ensemble than with the individual base classifiers.
Idrees et al. [55] utilized permissions and intents as features to train machine learning models and applied classifier fusion for improved performance. Their experiments were performed on 1745 app samples starting with a performance comparison between MLP, decision table, decision tree, random forest, naive Bayes, and sequential minimal optimization classifiers. The decision table, MLP, and decision tree classifiers were then combined using three schemes: 1) average of probabilities; 2) product of probabilities; and 3) majority voting. Coronado-De-Alba et al. [33] proposed and investigated a classifier fusion method based on random forest and random committee ensemble classifiers. Their approach embeds random forest within random committee to produce a meta-ensemble model. The meta-model outperformed the individual classifiers in experiments performed with 1531 malware and 1531 benign samples. Table I summarizes papers that have investigated classifier fusion for Android malware detection.
In contrast to all of the existing Android malware detection works, this paper proposes a novel classifier fusion approach that utilizes four ranking-based algorithms within a multilevel framework (DroidFusion). We evaluated DroidFusion extensively and compared its performance to stacking and other classifier fusion methods. Next, we present DroidFusion.

TABLE I. Overview of some of the papers that apply classifier fusion for Android malware detection. NB = naive Bayes; SL = simple logistic; LR = linear regression; DT = decision tree; Prod P = product of probabilities; and Max P = maximum probability. (Table body not recovered.)

III. DROIDFUSION: GENERAL PURPOSE FRAMEWORK FOR CLASSIFIER FUSION
The DroidFusion framework consists of a multilevel architecture for classifier fusion. It is designed as a general purpose classifier fusion system, so that it can be applied to both traditional singular classifiers and ensemble classifiers (which themselves employ a base classifier usually to produce different randomly induced models that are subsequently combined). At the lower level, the (DroidFusion) base classifiers are trained on a training set using a stratified N-fold cross-validation technique to estimate their relative predictive accuracies. The outcomes are utilized by four different ranking-based algorithms (in the higher layer) that define certain criteria for the selection and subsequent combination of a subset (or all) of the applicable base classifiers. The outcomes of the ranking algorithms are combined in pairs in order to find the strongest pair, which is subsequently used to build the final DroidFusion model (after testing against an unweighted parallel combination of the base classifiers).

A. DroidFusion Model Construction
The model building, i.e., training process is distinct from the prediction or testing phase, as the former utilizes a training-validation set to build a multilevel ensemble classifier which is then evaluated on a separate test set in the latter phase. Fig. 1 illustrates the two-level architecture of DroidFusion. It shows the training paths (solid arrows) and the testing/prediction path (dashed arrows). First, at the lower level each base classifier undergoes an N-fold cross-validation-based estimate of class performance accuracies. Let the N-fold cross validated predictive accuracies for K base classifiers be expressed by $P_{base}$, a K-tuple of the class accuracies of the K base classifiers

$$P_{base} = \{[P_{1m}, P_{1b}], [P_{2m}, P_{2b}], \ldots, [P_{Km}, P_{Kb}]\}. \tag{1}$$

The elements of $P_{base}$ are applied to the ranking-based algorithms: the average accuracy-based (AAB) ranking scheme, the class differential-based (CDB) ranking scheme, the ranked aggregate of per class performance-based (RAPC) scheme, and the ranked aggregate of average accuracy and class differential-based (RACD) scheme, described later in Section III-B. Let X be the total number of instances with M malware and B benign instances, where the M instances possess a label L = 1 denoting malware and the B instances from X possess a label L = 0
denoting benign. All X instances are also represented by feature vectors with f binary representations, where f is the number of features extracted from the given app. The features in the vectors take on 0 or 1 representing the absence or presence of the given feature. Additionally, after the N-fold cross-validation process (as shown in Fig. 1), a set of K-tuple class predictions are derived for every instance x, given by

$$V(x) = \{v_1, v_2, \ldots, v_k\}, \quad \forall k \in \{1, \ldots, K\}. \tag{2}$$

Note that $v_1, v_2, \ldots, v_k$ could be crisp predictions or probability estimates from the base classifiers. Adding the original (known) class label, l, we obtain

$$\dot{V}(x) = \{v_1, v_2, \ldots, v_k, l\}, \quad \forall k \in \{1, \ldots, K\},\ l \in \{0, 1\}. \tag{3}$$

$P_{base}$ and $\dot{V}(x), \forall x \in X$ will be utilized in the level-2 computation during the DroidFusion model construction. Let us denote the set of four ranking-based schemes by S = {S1, S2, S3, S4}. The pairwise combinations of the elements of S will result in six possibilities

$$\phi = \{S1S2, S1S3, S1S4, S2S3, S2S4, S3S4\}. \tag{4}$$

Our goal is to select the best pair of ranking-based schemes from S, and if its performance exceeds that of an unweighted combination of the original base classifiers, it would be selected to construct the final DroidFusion model. In the event that the unweighted combination performance is greater, DroidFusion will be configured to apply a majority vote (or average of probabilities) of the base classifiers in the final constructed model. In order to estimate the accuracy performance of each scheme in S or each pairwise combination in set $\phi$, a reclassification of the X instances (in the training-validation set) is performed for each scheme or pair of schemes. The reclassification is accomplished using $\dot{V}(x), x \in X$ based on the criteria defined by the schemes in S using $P_{base}$. Each scheme in S derives a set of Z weights that will be applied with $\dot{V}(x), x \in X$ for every instance during the reclassification process.
Let $\omega_i, i \in \{1, \ldots, Z\}, Z \le K$ be the set of weights derived for a particular scheme in S. Then, to reclassify an instance x according to the scheme's criterion, its class prediction will be given by

$$C_{Sj}(x) = \begin{cases} 1 : & \text{if } \dfrac{\sum_{i=1}^{Z} \omega_i v_i}{\sum_{i=1}^{Z} \omega_i} \ge 0.5 \\ 0 : & \text{otherwise} \end{cases} \quad \forall j \in \{1, 2, 3, 4\}. \tag{5}$$

Hence, the benign class accuracy performance for the given scheme is calculated from

$$P^{ben}_{Sj} = \frac{\sum_{x=1}^{X} \left(C_{Sj}(x) + 1\right) \,\big|\, C_{Sj}(x) = 0,\ l(x) = 0}{B} \tag{6}$$

where B is the number of benign instances, while the malware accuracy performance is calculated from

$$P^{mal}_{Sj} = \frac{\sum_{x=1}^{X} C_{Sj}(x) \,\big|\, C_{Sj}(x) = 1,\ l(x) = 1}{X - B}. \tag{7}$$

Thus the average performance accuracy is simply

$$\dot{P}_{Sj} = \frac{B \cdot P^{ben}_{Sj} + (X - B) \cdot P^{mal}_{Sj}}{X}. \tag{8}$$

Likewise, to determine the performance of each pairwise combination in $\phi$: let $\omega_i, i \in \{1, \ldots, Z\}, Z \le K$ be the first set of weights derived for the first scheme in the pair, and let $\mu_i, i \in \{1, \ldots, Z\}, Z \le K$ be those derived for the second scheme in the pair. Then, to reclassify the X instances in the
training-validation set according to the combination pair, the class prediction of each instance x will be given by

$$C_{SjSn}(x) = \begin{cases} 1 : & \text{if } \dfrac{\sum_{i=1}^{Z} \omega_i v_i + \sum_{i=1}^{Z} \mu_i v_i}{\sum_{i=1}^{Z} \omega_i + \sum_{i=1}^{Z} \mu_i} \ge 0.5 \\ 0 : & \text{otherwise} \end{cases} \quad \forall j \in \{1, 2, 3, 4\},\ \forall n \in \{1, 2, 3, 4\},\ j \ne n,\ SjSn \equiv SnSj. \tag{9}$$

Therefore, computing benign class accuracy and malware class accuracy will utilize

$$P^{ben}_{SjSn} = \frac{\sum_{x=1}^{X} \left(C_{SjSn}(x) + 1\right) \,\big|\, C_{SjSn}(x) = 0,\ l(x) = 0}{B} \tag{10}$$

and

$$P^{mal}_{SjSn} = \frac{\sum_{x=1}^{X} C_{SjSn}(x) \,\big|\, C_{SjSn}(x) = 1,\ l(x) = 1}{X - B} \tag{11}$$

respectively. The average performance accuracy for the pairwise schemes will then be given by

$$\dot{P}_{SjSn} = \frac{B \cdot P^{ben}_{SjSn} + (X - B) \cdot P^{mal}_{SjSn}}{X}, \quad \forall j \in \{1, 2, 3, 4\},\ \forall n \in \{1, 2, 3, 4\},\ j \ne n,\ SjSn \equiv SnSj. \tag{12}$$

Equivalently, the unweighted majority vote class prediction for instance x is given by

$$C_{mv}(x) = \begin{cases} 1 : & \text{if } \dfrac{\sum_{k=1}^{K} v_k}{K} \ge 0.5 \\ 0 : & \text{otherwise} \end{cases} \quad \forall k \in \{1, \ldots, K\}. \tag{13}$$

Hence, the benign class accuracy performance for the unweighted scheme will be given by

$$P^{ben}_{mv} = \frac{\sum_{x=1}^{X} \left(C_{mv}(x) + 1\right) \,\big|\, C_{mv}(x) = 0,\ l(x) = 0}{B}. \tag{14}$$

Likewise, the malware class accuracy performance for the unweighted scheme is given by

$$P^{mal}_{mv} = \frac{\sum_{x=1}^{X} C_{mv}(x) \,\big|\, C_{mv}(x) = 1,\ l(x) = 1}{X - B}. \tag{15}$$

Finally, the average accuracy performance for the unweighted scheme is given by

$$\dot{P}_{mv} = \frac{B \cdot P^{ben}_{mv} + (X - B) \cdot P^{mal}_{mv}}{X}. \tag{16}$$

After all the reclassifications are completed, and the average accuracies computed, the applicable scheme that will be utilized to construct the DroidFusion model is selected thus

B. Proposed Ranking-Based Algorithms
The design of our proposed algorithms is influenced by the observation that most typical classifiers perform differently for both classes. That is, class accuracy performance for benign and malware are very rarely equal in magnitude. The proposed ranking-based algorithms include the following.
1) An AAB ranking scheme.
2) A CDB ranking scheme.
3) An RAPC-based scheme.
4) An RACD-based scheme.
1) Average Accuracy-Based Ranking Scheme: With the AAB method, the ranking is designed to be directly proportional to the average prediction accuracies across the classes. In this case, base classifiers with larger overall accuracy performance will rank higher. AAB does not take into account how well a base classifier performs for a particular class. Let AAB be the first scheme S1, from set S. The algorithm is summarized as follows.
Let $P_{base}$ be the set of performance accuracies $P_{k,c} \in P_{base}$ of K base classifiers. If m denotes malware and b, benign, then the average accuracy of the kth base classifier is given by

$$a_k = 0.5 \times \sum_{c = m, b} P_{k,c} \,\Big|\, k \in \{1, \ldots, K\},\ 0 < P_{k,c} \le 1. \tag{18}$$

Let $A \leftarrow a_k, \forall k \in \{1, \ldots, K\}$ be a set of the average predictive accuracies, to which a ranking function $Rank_{desc}(\cdot)$ is applied

$$\bar{A} \leftarrow Rank_{desc}(A). \tag{19}$$

Thus, $\bar{A}$ contains an ordered ranking of the level-1 base classifiers' average predictive accuracies in descending order. Next, the top Z rankings are utilized in weight assignments as follows:

$$\omega_1 = Z,\ \omega_2 = Z - 1,\ \ldots,\ \omega_Z = 1, \quad Z \le K. \tag{20}$$

Thus, the AAB class prediction C(x) for instance x in the training-validation set is given by (5) or given by (9) when used in the pairwise combination with another scheme.
2) Class Differential-Based Ranking Scheme: With the CDB method, the ranking is directly proportional to the average predictive accuracy and inversely proportional to the absolute value of the performance difference between the classes. Assuming a binary classification problem, this approach will be less likely to favor the decision from a base classifier that exhibits much higher accuracy in one class over the other but will assign larger weights to good classifiers that perform relatively well in both classes. The CDB procedure is described as follows.
Suppose the CDB method is taken as scheme S2, let the average accuracy of each base classifier be given by $a_k$ in (18)
With $\bar{D}$ containing the ordered rankings of the $d_k$ values, the top Z rankings are also utilized to assign weights according to (20). Thus, the S2 = CDB class prediction for an instance

Then, for each base classifier, aggregate the values and apply the ranking function $Rank_{desc}(\cdot)$
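
To make the level-2 computation more concrete, the AAB weighting in (18)-(20), the weighted vote in (5), and the class accuracies in (6)-(8) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`aab_weights`, `weighted_vote`, `class_accuracies`) are hypothetical, base classifiers outside the top Z are assumed to receive weight 0, ties in average accuracy are assumed to be broken by classifier order, and the accuracies and votes in the usage lines are made-up values.

```python
from typing import List, Sequence, Tuple


def aab_weights(p_base: Sequence[Tuple[float, float]], z: int) -> List[int]:
    """AAB weights per (18)-(20), returned in base-classifier order.

    p_base[k] = (malware accuracy, benign accuracy) of base classifier k.
    The top-z classifiers by average accuracy get weights z, z-1, ..., 1.
    Assumption: classifiers outside the top z get weight 0 (left out).
    """
    avg = [0.5 * (p_m + p_b) for p_m, p_b in p_base]  # eq. (18)
    # eq. (19): descending rank; Python's stable sort breaks ties by index.
    order = sorted(range(len(avg)), key=lambda k: avg[k], reverse=True)
    weights = [0] * len(avg)
    for rank, k in enumerate(order[:z]):
        weights[k] = z - rank  # eq. (20)
    return weights


def weighted_vote(weights: Sequence[float], votes: Sequence[float]) -> int:
    """Eq. (5): 1 (malware) if the weighted mean of the votes is >= 0.5."""
    # Requires at least one non-zero weight (z >= 1).
    score = sum(w * v for w, v in zip(weights, votes)) / sum(weights)
    return 1 if score >= 0.5 else 0


def class_accuracies(preds: Sequence[int],
                     labels: Sequence[int]) -> Tuple[float, float, float]:
    """Eqs. (6)-(8): benign accuracy, malware accuracy, weighted average.

    Assumes both classes are present in labels (B > 0 and X - B > 0).
    """
    x = len(labels)
    b = sum(1 for lab in labels if lab == 0)  # number of benign instances
    ben_hits = sum(1 for c, lab in zip(preds, labels) if c == 0 and lab == 0)
    mal_hits = sum(1 for c, lab in zip(preds, labels) if c == 1 and lab == 1)
    p_ben = ben_hits / b                           # eq. (6)
    p_mal = mal_hits / (x - b)                     # eq. (7)
    p_avg = (b * p_ben + (x - b) * p_mal) / x      # eq. (8)
    return p_ben, p_mal, p_avg


# Usage with made-up numbers: three base classifiers, top Z = 2.
p_base = [(0.90, 0.96), (0.97, 0.95), (0.80, 0.99)]
w = aab_weights(p_base, z=2)
votes = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]  # V(x) for four instances
labels = [1, 1, 0, 0]                                  # known labels l(x)
preds = [weighted_vote(w, v) for v in votes]
print(w, preds, class_accuracies(preds, labels))
```

The same `class_accuracies` helper would apply unchanged to the pairwise predictions of (9) or the majority vote of (13), since (10)-(12) and (14)-(16) share the form of (6)-(8).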
