IEEE - INDIACom 2018 Paper
Abstract— The quality of software can be improved by determining the faulty portions of the software in the initial phases of software development. There are various machine learning techniques in the literature that can be used to create fault prediction models using object-oriented metrics. These models allow developers and software practitioners to predict faulty classes and concentrate the constrained testing resources on these weaker portions of the software. In this work, we analyze and assess the predictive capability of six machine learning techniques. The results are validated using seven open source software.

Keywords— Fault Prediction; Machine Learning Techniques; Object-Oriented Metrics; Software Quality

I. INTRODUCTION

The prime aim of the software industry is to develop effective, quality software products which fulfill their requirements and satisfy customers. However, in order to do so, existing faults in the software should be removed as early as possible. A fault detected in the early phases of the software life cycle can be corrected more easily than a fault found in the later phases of the software development lifecycle [1]. It has been ascertained that the cost to correct a fault increases exponentially if it is detected in the later phases [1]. Thus, various researchers have rigorously developed and evaluated several software fault prediction models, which are capable of early detection of faults. Remedial actions can be effectively taken by software developers to modify the software product by removal of these faults. Moreover, software managers will also be able to effectively plan resource usage by assigning more resources to fault-prone components of a software. These steps would ensure better quality software products at optimum costs.

While developing software fault prediction models, researchers have evaluated a wide range of software metrics. These metrics include process metrics, procedural metrics or Object-Oriented (OO) metrics [2]. Recent reviews on software fault prediction studies have ascertained OO metrics to be widely used in this domain [2-3]. Therefore, this study uses the Chidamber and Kemerer (CK) metrics suite [4], a popularly used OO metrics suite for developing defect prediction models [2-3]. The CK metrics suite contains metrics which represent various structural properties of an OO software such as its cohesion, size, reusability etc. In order to develop fault prediction models, we also require fault data from previous versions of the software product for training. Thus, we develop software fault prediction models using previous defect data and OO metrics to identify components of a software which are prone to faults in the upcoming releases of the software.

A key factor while developing software fault prediction models is the use of an efficient modeling technique. These modeling techniques are classification algorithms which learn from the provided historical fault data of the software and identify faulty classes in the new versions on the basis of their learning. Traditionally, statistical techniques such as logistic regression (LR) were used for modeling software prediction models. However, various fault prediction studies in the literature have advocated the use of Machine Learning (ML) techniques for this task [2,5]. ML techniques are capable of extracting worthy information from complex or difficult problem scenarios in less time [2]. Therefore, this study analyzes the applicability of these algorithms in the software fault prediction domain. Furthermore, the results of the ML algorithms are compared with those of the traditional statistical technique, LR. It may also be observed that the results obtained by ML algorithms vary on different datasets. Thus, it is important to validate them using datasets from different domains to confirm their effectiveness. Therefore, this study analyzes the capability of six ML algorithms viz. Adaboost (AB), Bagging (BG), Decision Tree (J48), LogitBoost (LB), Naïve Bayes (NB) and Random Forest (RF) for developing software fault prediction models. Furthermore, the study compares the results of these ML algorithms with LR for developing defect prediction models. The comparison is performed statistically using the Friedman test. This study explores the following research questions (RQs):

RQ1: What is the effectiveness of ML algorithms (AB, BG, J48, LB, NB and RF) for developing prediction models which determine faulty classes in a software?

RQ2: What is the comparative performance of ML algorithms (AB, BG, J48, LB, NB and RF) with the statistical technique LR for developing prediction models which determine faulty classes in a software?

The above-mentioned research questions are answered by analyzing the performance of software fault prediction models on seven open-source datasets. The performance of the fault prediction models is assessed using four measures viz. precision, recall, F-measure and the Area Under the Receiver Operating Characteristic Curve (AUC). Though precision and recall are traditional performance measures, the use of AUC is supported by various studies [2,6] as it is a robust performance measure. The results point out the RF method as the best for developing software fault prediction models.

Section II of this study gives a brief overview of related literature. The research background and the various ML algorithms are discussed in Sections III and IV respectively. Section V describes the study's results, followed by threats to validity in Section VI. Future work and conclusions are mentioned in Section VII.
II. RELATED LITERATURE

Extensive research has been conducted in the domain of software defect prediction. A number of review studies have extensively evaluated the studies conducted in this domain on various parameters. A review study by Radjenovic et al. [7] assessed 106 studies from 1991 to 2011 to evaluate the relevance of various software metrics for developing defect prediction models. According to their survey, object-oriented metrics have been widely used in literature studies as compared to process metrics or traditional metrics extracted from source code. A review by Catal and Diri [5] of 74 fault prediction studies ascertained that the use of publicly available datasets has increased significantly over the years. Moreover, the review confirmed ML techniques as popular choices for developing defect prediction models. This finding was supported by a recent review conducted by Malhotra [2]. She also confirmed the need for more studies which assess the comparative performance of ML techniques with statistical techniques in the domain of software defect prediction. Certain recent reviews have also investigated studies which use search-based techniques for developing software defect prediction models [8-9]. A review by Afzal and Torkar [8] assessed genetic programming for developing defect prediction models and another one by Malhotra et al. [9] analyzed the use of several search-based techniques for their effectiveness in this domain.

It may be noted that a wide category of techniques has been evaluated for software defect prediction, which includes statistical, ML and the recently explored search-based algorithms. Though statistical techniques have been found effective in this domain [10-12], the use of ML algorithms has yielded improved results [2]. Dejaeger et al. [13] assessed the use of several Bayesian network classifiers along with statistical and other common ML algorithms for developing defect prediction models. Several other studies too, such as the ones by Gyimothy et al. [14], Pai and Dugan [15], Vandecruys et al. [16] and Chen et al. [17], evaluated the use of both statistical and ML algorithms. A recent study by Tantithamthavorn et al. [18] used 12 model validation techniques using a statistical and two ML algorithms on 18 datasets.

As pointed out by Lessmann et al. [19], there were few studies in the literature which statistically assess the effectiveness of the developed defect prediction models. However, some of the recent key studies which use ML algorithms in this domain have conducted rigorous statistical analysis to determine their effectiveness. Malhotra [20] evaluated the efficiency of 18 ML algorithms for developing defect prediction models using the Friedman and post-hoc Nemenyi tests on several Android datasets by developing both within-project and inter-release validation models. A study by Arar and Ayan [21] also assessed the applicability of 3 ML and 3 search-based algorithms on four NASA datasets using the Friedman test. Similarly, a study by Harman et al. [22] evaluated the capability of a hybridized technique, i.e. Genetic Algorithms with Support Vector Machine, on Hadoop datasets using the Wilcoxon test. De Carvalho et al. [6] also evaluated the use of Multi-objective Particle Swarm Optimization along with several ML algorithms and assessed its capability using the Wilcoxon test in the domain of software defect prediction. Zhou et al. [23], Arisholm et al. [24], Okutan et al. [25] and Canfora et al. [26] also used statistical analysis to determine the effectiveness of ML algorithms in this domain.

III. RESEARCH BACKGROUND

The various software metrics used in the study, along with the dependent variable fault-proneness, are discussed in this section. The details of the various datasets used in the study, along with the performance measures, are also provided in this section.

A. OO Metrics and Fault-proneness

Fault-proneness is defined as the likelihood that a specific class would contain faulty code in the upcoming releases of the software product. It is binary in nature with values as "fault-prone" or "not fault-prone".

The metrics belonging to the CK metrics suite are used as predictors in this study. Another common measure of size known as Lines of Code (LOC), which counts the number of lines of source code in a class, is also used as an independent variable. The CK metrics suite consists of two measures of reusability, viz. Number of Children (NOC), which represents the number of direct subclasses, and Depth of Inheritance Tree (DIT), which represents the position of the class in the tree hierarchy. The Coupling Between Objects (CBO) metric counts the number of coupled classes for a specific class. Also, another coupling measure, Response For a Class (RFC), represents the number of methods which can respond to a class's message. The cohesive nature of a class is addressed by the variables shared between pairs of its methods, which is represented by the Lack of Cohesion among Methods (LCOM) metric of a class. Weighted Methods per Class (WMC) estimates the number of methods in a specific class.
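To make these definitions concrete, consider the following small, hypothetical Java example (ours, not drawn from the studied datasets); the comments note approximate metric values under the informal definitions given above.

    import java.util.ArrayList;
    import java.util.List;

    // Shape: DIT = 1 (taking Object as depth 0), NOC = 1 (only Circle inherits it).
    class Shape {
        protected String name;
        double area() { return 0.0; }              // one method, so WMC(Shape) = 1
    }

    // Circle: DIT = 2, NOC = 0, WMC = 2 (area and describe, with unit weights).
    class Circle extends Shape {
        private double radius;                      // used by area(): shared state keeps LCOM low
        private List<String> log = new ArrayList<>(); // reference to another class raises CBO

        @Override
        double area() {
            return Math.PI * radius * radius;
        }

        String describe() {                         // RFC includes describe(), area() and the
            log.add("described");                   // external methods invoked here (e.g. List.add)
            return name + ": " + area();
        }
    }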
B. Datasets Description
In order to develop software fault prediction models, we use seven open source data sets in this study. All seven data sets are developed using Java. The description of these data sets in terms of the total number of classes and the number of faulty classes is mentioned in Table I. Some of the data sets are extracted from the PROMISE repository [27], while others are collected using a data extraction tool named Defect Collection and Reporting System (DCRS), developed by students of Delhi Technological University [28]. The DCRS tool is capable of extracting data from data sets which use GIT as a version control repository. The tool computes OO metrics with the aid of the CKJM tool [29].
TABLE I. DATA SET DETAILS
Dataset Name     Number of data points   Number of classes with faults
Bsf              75                      55
Click            403                     85
Zuzel            30                      13
Xerxes           441                     71
Ivy              614                     600
Log4j            351                     23
Wspomaganiepi    19                      12
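As an illustration (a sketch under our own assumptions, not part of the study's tooling), a data set of this form could be loaded through WEKA's Java API as follows; the file name ivy.arff and the position of the fault label as the last attribute are assumptions.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadDataset {
        public static void main(String[] args) throws Exception {
            // Hypothetical ARFF file holding CK metrics plus a binary fault label.
            Instances data = DataSource.read("ivy.arff");
            // Assume the fault-proneness label is the last attribute.
            data.setClassIndex(data.numAttributes() - 1);

            // Count fault-prone vs. not fault-prone classes, as in Table I.
            int faulty = 0;
            for (int i = 0; i < data.numInstances(); i++) {
                if (data.instance(i).classValue() == 1.0) faulty++;
            }
            System.out.println(data.numInstances() + " classes, " + faulty + " faulty");
        }
    }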
C. Performance Measures

The study uses four performance measures for evaluating the developed software fault prediction models. The definitions of these performance measures are as follows:
• Recall (Rec.): It is the ratio of correctly predicted faulty classes amongst the actually present faulty classes. It is also known as sensitivity.

• Precision (Prec.): It is the ratio of correctly identified faulty classes amongst the predicted faulty classes.

• F-measure (FM): It is computed as the harmonic mean of precision and recall.

• AUC: The AUC is a stable performance measure as it simultaneously optimizes both recall and the percentage of correctly predicted non-faulty classes (specificity). It achieves an optimum cut-off which balances both recall and specificity.
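To make the first three definitions concrete, the following small sketch (ours) computes precision, recall and F-measure for the fault-prone class from hypothetical confusion-matrix counts; AUC additionally requires the ranked prediction scores and is obtained directly from WEKA in the experiments.

    public class FaultPredictionMeasures {
        public static void main(String[] args) {
            // Hypothetical counts for the fault-prone class:
            // tp = faulty classes predicted faulty, fp = non-faulty predicted faulty,
            // fn = faulty classes missed by the model.
            int tp = 40, fp = 10, fn = 12;

            double precision = (double) tp / (tp + fp);  // correct among predicted faulty
            double recall    = (double) tp / (tp + fn);  // correct among actually faulty
            double fMeasure  = 2 * precision * recall / (precision + recall); // harmonic mean

            System.out.printf("Prec=%.3f Rec=%.3f FM=%.3f%n", precision, recall, fMeasure);
        }
    }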
IV. MACHINE LEARNING ALGORITHMS

In this section, we briefly discuss the various ML techniques used in the study. We used WEKA as a simulation tool for the investigated ML algorithms. The default parameter settings of the WEKA tool were used for each ML technique.
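As an illustrative sketch, the seven investigated techniques can be instantiated through WEKA's Java API as follows; leaving each object unconfigured corresponds to the default parameter settings described below.

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.Logistic;
    import weka.classifiers.meta.AdaBoostM1;
    import weka.classifiers.meta.Bagging;
    import weka.classifiers.meta.LogitBoost;
    import weka.classifiers.trees.J48;
    import weka.classifiers.trees.RandomForest;

    public class Techniques {
        // Each classifier is left at WEKA's defaults, as in the study.
        static Classifier[] classifiers() {
            return new Classifier[] {
                new AdaBoostM1(),   // AB: boosting with decision stumps
                new Bagging(),      // BG: bagging with REP trees
                new J48(),          // J48: pruned C4.5 decision tree
                new LogitBoost(),   // LB: additive logistic regression
                new NaiveBayes(),   // NB
                new RandomForest(), // RF
                new Logistic()      // LR: the statistical baseline
            };
        }
    }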
Boosting is the process of aggregating various weak classifiers in order to construct a strong classifier. Thus, boosting is an ensemble of various classifiers where each classifier is trained using slightly different training data. The AB algorithm uses boosting with a distribution of weights on the training data to yield effective results [30]. WEKA uses a seed value of 1, a weight threshold of 100 and decision stump as the base classifier for AB. The LB algorithm uses AB as an additive model along with the cost function provided by LR [30]. The default parameter settings for the LB technique in the WEKA tool include a likelihood threshold of -1.79, decision stump as the base classifier, a weight threshold of 100, a shrinkage of 1.0, 10 iterations, and a seed value of 1.

J48 is a decision tree algorithm which uses normalized information gain as the criterion for splitting [30]. For each of the predictor variables, the information gain is computed and the attribute with the highest information gain is designated as the root node. This process is performed recursively. The default parameter settings in WEKA were a confidence factor of 0.25, three folds, a seed value of 1 and a pruned tree.

NB is a Bayesian method which develops a classifier based on probability. Each of the predictor variables is assumed to be independent.

Bagging (BG) is another type of ensemble learning method which creates bootstrap samples from the original data by repeatedly sampling the dataset. The sampling is done in accordance with a uniform probability distribution. It may be noted that the size of a bootstrap sample is exactly the same as that of the original data. The parameters used by the WEKA tool for the BG technique include 100% as the bag size, REP tree as the base classifier, a seed value of 1 and ten iterations.

RF is also an ensemble learner which consists of several decision trees. The individual trees of the forest are constructed with various subsets of the training data. However, these subsets are constructed randomly and with replacement. The outcome of the forest is determined as the mode of the outputs generated by the individual constituent trees. A forest of 100 trees, a seed value of 1 and a maximum depth of 0 (unlimited) is used as the parameter settings by the WEKA tool for RF.
V. RESULTS AND ANALYSIS

This section describes the results of the study along with answers to each of the investigated RQs.

A. RQ1: What is the effectiveness of ML algorithms (AB, BG, J48, LB, NB and RF) for developing prediction models which determine faulty classes in a software?

The fault prediction models in the study are developed using ten-fold cross validation [31]. This strategy divides the given data sets into ten disjoint subsets. A model is developed by providing nine of these subsets as the training data. The tenth remaining subset is used for validating the developed model. This process is repeated until the prediction values on each of the ten subsets are obtained, i.e. till all the subsets are used exactly once for the purpose of validation. As discussed in Section III.C, we evaluate the developed software fault prediction models using recall, precision, FM and AUC. Table II states the precision and recall values obtained by the developed software fault prediction models on all the datasets. The precision values of all the models ranged from 0.736-0.975, indicating effective fault prediction models.
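The following sketch (ours, under the assumptions noted in the comments) illustrates this ten-fold cross-validation and evaluation procedure using WEKA's Java API:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            // Hypothetical ARFF file; the fault label is assumed to be the last attribute.
            Instances data = DataSource.read("ivy.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Ten-fold cross validation with a fixed random seed.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new RandomForest(), data, 10, new Random(1));

            // Report the four measures for the fault-prone class
            // (assumed here to be class index 1).
            int faultProne = 1;
            System.out.printf("Prec=%.3f Rec=%.3f FM=%.3f AUC=%.3f%n",
                    eval.precision(faultProne), eval.recall(faultProne),
                    eval.fMeasure(faultProne), eval.areaUnderROC(faultProne));
        }
    }

Repeating the same loop over all seven classifiers and all seven data sets would reproduce the structure of Tables II and III.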
TABLE II. PRECISION AND RECALL VALUES
Similarly, the recall values in the majority of the cases ranged from 0.759-0.977. Only in two cases were the recall values below 0.60. Thus, the investigated ML algorithms, along with the LR technique, developed effective fault prediction models with acceptable precision and recall values. Table III states the FM and AUC values obtained by the developed fault prediction models. The FM values ranged from 0.615-0.966 for all the developed defect prediction models. Similarly, the AUC values in the majority of the cases ranged from 0.6-0.9. Only in three specific cases were the AUC values less than 0.6. In fact, in most of the cases, the AUC values were greater than 0.7. This indicates the effectiveness of the ML techniques and the statistical technique LR for developing software fault prediction models. The average values of F-measure and AUC obtained by all the techniques over all the datasets are depicted in Figure 1.

According to the figure, the average AUC value obtained by only the J48 technique was less than that of the LR technique. In all other cases, the average AUC values obtained by the ML techniques were greater than that of the LR technique. The average FM values obtained by the ML techniques were also competitive with the LR technique.

The range of AUC values for all the ML techniques on all the investigated data sets was 0.669-0.854 for AB, 0.603-0.829 for BG, 0.413-0.758 for J48, 0.619-0.868 for LB, 0.604-0.852 for NB and 0.666-0.849 for RF. The range of AUC values for models developed using the LR technique was 0.583-0.801, which was lower than those of most of the ML techniques. This indicates the superiority of the ML techniques.

B. RQ2: What is the comparative performance of ML algorithms (AB, BG, J48, LB, NB and RF) with the statistical technique LR for developing prediction models which determine faulty classes in a software?
In order to compare the performance of the six ML algorithms with the statistical technique LR, we use the Friedman statistical test on AUC values. This is because literature studies advocate the use of AUC as a stable performance measure. The test was conducted at a cut-off of 0.05.
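As an illustrative sketch (ours; the AUC values below are placeholders, not the study's results), the Friedman statistic can be computed from a datasets-by-techniques matrix of AUC values by ranking the techniques on each dataset and applying the standard chi-square formulation:

    import java.util.Arrays;

    public class FriedmanTest {
        public static void main(String[] args) {
            // auc[i][j] = AUC of technique j on dataset i (placeholder values).
            // Rows are the N datasets, columns the k compared techniques.
            double[][] auc = {
                {0.80, 0.75, 0.62, 0.78, 0.70, 0.81, 0.68},
                {0.72, 0.70, 0.55, 0.74, 0.69, 0.76, 0.64},
                {0.85, 0.79, 0.70, 0.83, 0.77, 0.84, 0.72},
            };
            int n = auc.length, k = auc[0].length;

            // Sum of ranks per technique; rank 1 = highest AUC on a dataset,
            // with tied values receiving their average rank.
            double[] rankSum = new double[k];
            for (double[] row : auc) {
                for (int j = 0; j < k; j++) {
                    double rank = 1;
                    double ties = 0;
                    for (int m = 0; m < k; m++) {
                        if (row[m] > row[j]) rank++;
                        else if (m != j && row[m] == row[j]) ties++;
                    }
                    rankSum[j] += rank + ties / 2.0;
                }
            }

            // Friedman chi-square: (12 / (N k (k+1))) * sum(Rj^2) - 3 N (k+1).
            double sumSq = Arrays.stream(rankSum).map(r -> r * r).sum();
            double chiSq = 12.0 / (n * k * (k + 1)) * sumSq - 3.0 * n * (k + 1);
            System.out.println("Friedman chi-square = " + chiSq);
        }
    }

When the computed statistic exceeds the chi-square critical value at the 0.05 level with k-1 degrees of freedom, the null hypothesis of equal performance across the techniques is rejected.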
We evaluated the following hypothesis: