
Are SonarQube Rules Inducing Bugs?

Valentina Lenarduzzi, Lahti-Lappeenranta University, Lahti-Lappeenranta, Finland (valentina.lenarduzzi@lut.fi)
Francesco Lomio, Tampere University, Tampere, Finland (francesco.lomio@tuni.fi)
Heikki Huttunen, Tampere University, Tampere, Finland (heikki.huttunen@tuni.fi)
Davide Taibi, Tampere University, Tampere, Finland (davide.taibi@tuni.fi)

Abstract—Background. The popularity of tools for analyzing Technical Debt, and particularly the popularity of SonarQube, is increasing rapidly. SonarQube proposes a set of coding rules which represent something wrong in the code that will soon be reflected in a fault or will increase maintenance effort. However, our local companies were not confident in the usefulness of the rules proposed by SonarQube and contracted us to investigate the fault-proneness of these rules.
Objective. In this work we aim at understanding which SonarQube rules are actually fault-prone and which machine learning models can be adopted to accurately identify fault-prone rules.
Method. We designed and conducted an empirical study on 21 well-known mature open-source projects. We applied the SZZ algorithm to label the fault-inducing commits. We analyzed the fault-proneness by comparing the classification power of eight machine learning models.
Result. Among the 202 rules defined for Java by SonarQube, only 25 can be considered to have relatively low fault-proneness. Moreover, violations considered as "bugs" by SonarQube were generally not fault-prone and, consequently, the fault-prediction power of the model proposed by SonarQube is extremely low.
Conclusion. The rules applied by SonarQube for calculating technical debt should be thoroughly investigated and their harmfulness needs to be further confirmed. Therefore, companies should carefully consider which rules they really need to apply, especially if their goal is to reduce fault-proneness.

I. INTRODUCTION

The popularity of tools for analyzing technical debt, such as SonarQube, is increasing rapidly. In particular, SonarQube has been adopted by more than 85K organizations1, including nearly 15K public open-source projects2. SonarQube analyzes code compliance against a set of rules. If the code violates a rule, SonarQube adds the time needed to refactor the violated rule as part of the technical debt. SonarQube also identifies a set of rules as "bugs", claiming that they "represent something wrong in the code and will soon be reflected in a fault"; moreover, it also claims that zero false positives are expected from "bugs"3.

Four local companies, which have been using SonarQube for more than five years to detect possible issues in their code, reported that their developers do not believe that the rules classified as bugs can actually result in faults. Moreover, they also reported that the manual customization of the SonarQube out-of-the-box set of rules (named "the Sonar way"4) is very subjective and that their developers did not manage to agree on a common set of rules that should be enforced. Therefore, the companies asked us to understand whether it is possible to use machine learning to reduce the subjectivity of the customization of the SonarQube model, considering only rules that are actually fault-prone in their specific context.

SonarQube is not the most used static analysis tool on the market. Other tools such as Checkstyle, PMD, and FindBugs are used more widely, especially in open-source projects [1] and in research [2]. However, the adoption of another tool in the DevOps pipeline requires extra effort for the companies, including the training and the maintenance of the tool itself. If the SonarQube rules actually turned out to be fault-prone, our companies would not need to invest extra effort to adopt and maintain other tools.

To the best of our knowledge, no studies have investigated the fault-proneness of SonarQube rules; therefore, we accepted the challenge and designed and conducted this study. Only a limited number of studies have considered SonarQube rule violations [3], [4], but they did not investigate the impact of the SonarQube violations considered as "bugs" on faults.

The goal of this work is twofold:
• Analyze the fault-proneness of SonarQube rule violations and, in particular, understand whether rules classified as "bugs" are more fault-prone than security and maintainability rules.
• Analyze the accuracy of the quality model provided by SonarQube in order to understand the fault-prediction accuracy of the rules classified as "bugs".

SonarQube and issue-tracking systems adopt similar terms for different concepts. Therefore, in order to clarify the terminology adopted in this work, we define SQ-Violation as a violated SonarQube rule that generated a SonarQube "issue", and fault as an incorrect step, process, data definition, or any unexpected behavior in a computer program introduced by a developer and reported in the Jira issue tracker. We also use the term "fault-fixing" commit for commits where the developers have clearly reported bug-fixing activity, and "fault-inducing" commit for those commits that are responsible for the introduction of a fault.

1 https://www.sonarqube.org
2 https://sonarcloud.io/explore/projects
3 SonarQube Rules: https://tinyurl.com/v7r8rqo
4 SonarQube Quality Profiles: https://tinyurl.com/wkejmgr

The remainder of this paper is structured as follows. In Section II we present the background of this work, introducing SonarQube and the SQ-Violations adopted in this work (Section II-A) and the different machine learning algorithms applied (Section II-B). In Section III we describe the case study design. Section IV presents the obtained results. Section V identifies the threats to validity, while Section VI describes related work. Finally, conclusions are drawn in Section VII.

II. BACKGROUND

A. SonarQube

SonarQube is one of the most common open-source static code analysis tools, adopted both in academia [5], [2] and in industry [1]. SonarQube is provided as a service by the sonarcloud.io platform, or it can be downloaded and executed on a private server.

SonarQube calculates several metrics, such as the number of lines of code and the code complexity, and verifies the code's compliance against a specific set of "coding rules" defined for the most common development languages. If the analyzed source code violates a coding rule, or if a metric is outside a predefined threshold, SonarQube generates an "issue". SonarQube includes Reliability, Maintainability, and Security rules.

Reliability rules, also named "bugs", create issues (code violations) that "represent something wrong in the code" and that will soon be reflected in a bug. "Code smells" are considered "maintainability-related issues" in the code that decrease code readability and code modifiability. It is important to note that the term "code smells" adopted in SonarQube does not refer to the commonly known code smells defined by Fowler et al. [6], but to a different set of rules. Fowler et al. [6] consider code smells as a "surface indication that usually corresponds to a deeper problem in the system", which can be an indicator of different problems (e.g., bugs, maintenance effort, and code readability), while rules classified by SonarQube as "Code Smells" refer only to maintenance issues. Moreover, only four of the 22 smells proposed by Fowler et al. are included in the rules classified as "Code Smells" by SonarQube (Duplicated Code, Long Method, Large Class, and Long Parameter List).

SonarQube also classifies the rules into five severity levels5: Blocker, Critical, Major, Minor, and Info.

In this work, we focus on the SQ-Violations, which are reliability rules classified as "bugs" by SonarQube, as we are interested in understanding whether they are related to faults.

SonarQube includes more than 200 rules for Java (version 6.4). In the replication package (Section III-D) we report all the violations present in our dataset. In the remainder of this paper, the column "squid" represents the original rule id (SonarQube ID) defined by SonarQube. We did not rename it, to ease the replicability of this work. In the remainder of this work, we refer to the different SQ-Violations by their id (squid). The complete list of violations can be found in the file "SonarQube-rules.xlsx" in the online raw data.

B. Machine Learning Techniques

In this section, we describe the machine learning techniques adopted in this work to predict the fault-proneness of SQ-Violations. Due to the nature of the task, all the models were used for classification. We compared eight machine learning models. Among these, we used one generalized linear model: Logistic Regression [7]; one tree-based classifier: Decision Tree [8]; and six ensemble classifiers: Bagging [9], Random Forest [10], Extremely Randomized Trees [11], AdaBoost [12], Gradient Boosting [13], and XGBoost [14], which is an optimized implementation of Gradient Boosting. All the models except XGBoost were implemented using the Scikit-Learn library6, applying the default parameters when building the models. For the ensemble classifiers we always used 100 estimators. The XGBoost classifier was implemented using the XGBoost library7 and was also trained with 100 estimators.
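As an illustration of this setup, the sketch below instantiates the eight classifiers with Scikit-Learn and the XGBoost library, using default parameters and 100 estimators for the ensemble models, as described above. The feature matrix X and labels y are placeholders for the commit-level data introduced in Section III; this is a minimal sketch, not the replication-package code.

```python
# Minimal sketch of the model setup described above (default parameters,
# 100 estimators for the ensemble classifiers). X and y are placeholders
# for the per-commit violation counts and fault-inducing labels.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Bagging": BaggingClassifier(n_estimators=100),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100),
    "XGBoost": XGBClassifier(n_estimators=100),
}

def fit_all(models, X_train, y_train):
    # Train every classifier on the same training split.
    for name, model in models.items():
        model.fit(X_train, y_train)
    return models
```
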
1) Logistic Regression [7]: Contrary to linear regression, which is used to predict a numerical value, Logistic Regression is used for predicting the category of a sample. In particular, a binary Logistic Regression model is used to estimate the probability of a binary outcome (0 or 1) given a set of independent variables. Once the probabilities are known, they can be used to classify an input into one of the two classes, based on its probability of belonging to either of the two.

Like all linear classifiers, Logistic Regression projects the $P$-dimensional input $x$ into a scalar through a dot product of the learned weight vector $w$ and the input sample: $w \cdot x + w_0$, where $w_0 \in \mathbb{R}$ is the constant intercept. To obtain a result that can be interpreted as a class membership probability—a number between 0 and 1—Logistic Regression passes the projected scalar through the logistic function (sigmoid). This function, for any given input, returns an output value between 0 and 1. The logistic function is defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}},$$

and the class probability of a sample $x \in \mathbb{R}^P$ is modeled as

$$Pr(c = 1 \mid x) = \frac{1}{1 + e^{-(w \cdot x + w_0)}}.$$

Logistic Regression is trained through maximum likelihood: the model's parameters are estimated so as to maximize the likelihood of observing the training labels given the inputs, with respect to the parameters $w$ and $w_0$. We chose to use this model as a baseline, as it requires limited computational resources, is easy to implement, and is fast to train.
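For concreteness, here is a minimal sketch of the scoring step described above; the weight values are made up for illustration and are not taken from the study.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps the projected scalar to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters for a 3-feature model (illustrative only).
w = np.array([0.8, -0.2, 1.5])
w0 = -1.0

x = np.array([1.0, 3.0, 0.5])      # one input sample
p = sigmoid(np.dot(w, x) + w0)     # Pr(c = 1 | x)
label = int(p >= 0.5)              # classify by thresholding the probability
print(f"Pr(c=1|x) = {p:.3f}, predicted class = {label}")
```
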

5 SonarQube Issues and Rules Severity: https://docs.sonarqube.org/display/SONAR/Issues (last access: May 2018)
6 https://scikit-learn.org
7 https://xgboost.readthedocs.io

2) Decision Tree Classifier [8]: Uses a decision tree to return an output given a series of input variables. Its tree structure is characterized by a root node and multiple internal nodes, which correspond to the input variables, and leaves, which correspond to the outputs. The nodes are linked to one another through branches, each representing a test. The output is given by the decision path taken. A decision tree is structured like an if-then-else diagram: given the value of the variable in the root node, the result of a test leads to subsequent nodes through the branches. This process is iterated over the input variables (one per node) until the output, represented by the leaves of the tree, is reached.

In order to create the best structure, assigning each input variable to a different node, a series of metrics can be used. Among these are the Gini impurity and the information gain:
• Gini impurity measures how often a randomly chosen input would be wrongly classified if it were assigned to a randomly chosen class;
• Information gain measures how important the information obtained at each node is with respect to its outcome: the more important the information obtained in one node, the purer the resulting split.

In our models we used the Gini impurity measure to generate the tree, as it is more computationally efficient (see the sketch after this subsection). The reasons behind the choice of decision tree models and Logistic Regression are their simplicity and easy implementation. Moreover, the data does not need to be normalized, and the structure of the tree can be easily visualized. However, this model is prone to overfitting, and therefore it does not generalize the data well. Furthermore, it does not perform well with imbalanced data, as it generates a biased structure.
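A minimal sketch of the Gini impurity computation mentioned above; the class counts are illustrative, not taken from the study.

```python
from collections import Counter

def gini_impurity(labels):
    # Probability that a randomly chosen sample would be misclassified
    # if it were labeled randomly according to the class distribution.
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# Illustrative node contents: 0 = non-fault-inducing, 1 = fault-inducing.
node = [0, 0, 0, 0, 1, 1]
left, right = [0, 0, 0, 0], [1, 1]          # a candidate split

# Weighted impurity of the split; the split minimizing it is preferred.
weighted = (len(left) * gini_impurity(left) +
            len(right) * gini_impurity(right)) / len(node)
print(gini_impurity(node), weighted)        # 0.444..., 0.0 (a pure split)
```
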
3) Random Forest [10]: is an ensemble technique that helps to overcome the overfitting issues of the decision tree. The term ensemble indicates that these models use a set of simpler models to solve the assigned task. In this case, Random Forest uses an ensemble of decision trees.

An arbitrary number of decision trees is generated, each considering a randomly chosen subset of the samples of the original dataset [9]. This subset is created with replacement, hence a sample can appear multiple times. Moreover, in order to reduce the correlation between the individual decision trees, a random subset of the features of the original dataset is considered at each split; in this case, the subset is created without replacement. Each tree is therefore trained on its own subset of the data and is able to give a prediction on new unseen data. The Random Forest classifier uses the results of all these trees and averages them to assign a label to the input. By randomly generating multiple decision trees and averaging their results, the Random Forest classifier is able to generalize the data better. Moreover, using the random subspace method, the individual trees are not correlated with one another. This is particularly important when dealing with a dataset with many features, as the probability of them being correlated with each other increases.

4) Bagging [9]: Exactly like the Random Forest model, the Bagging classifier is applied to an arbitrary number of decision trees, which are constructed by choosing a subset of the samples of the original dataset. The difference with respect to the Random Forest classifier lies in the way the split point is decided: while in the Random Forest algorithm the splitting point is decided based on a random subset of the variables, the Bagging algorithm is allowed to look at the full set of variables to find the point minimizing the error. This translates into structural similarities between the trees, which do not resolve the overfitting problem related to the single decision tree. This model was included as a means of comparison with newer and better-performing models.

5) Extremely Randomized Trees [11]: (ExtraTrees) provides a further degree of randomization with respect to the Random Forest. In the Random Forest model, the individual trees are created by randomly choosing subsets of the dataset features. In the ExtraTrees model, the way each node in the individual decision trees is split is also randomized. Instead of using the metrics seen before (Gini impurity and information gain) to find the optimal split for each node, the cut-off choice for each node is completely randomized, and the resulting splitting rule is decided based on the best random split. Due to its characteristics, especially the way the splits are made at the node level, the ExtraTrees model is less computationally expensive than the Random Forest model, while retaining a higher generalization capability compared to single decision trees.

6) AdaBoost [12]: is another ensemble algorithm, based on boosting [15], where the individual decision trees are grown sequentially. Moreover, a weight is assigned to each sample of the training set. Initially, all the samples are assigned the same weight. The model trains the first tree in order to minimize the classification error and, after the training is over, it increases the weights of those samples in the training set which were misclassified. It then grows another tree, and the whole model is trained again with the new weights. This process continues until a predefined number of trees has been generated or the accuracy of the model cannot be improved anymore. Thanks to the many decision trees, as for the other ensemble algorithms, AdaBoost is less prone to overfitting and can therefore generalize the data better. Moreover, it automatically selects the most important features for the task it is trying to solve. However, it can be more susceptible to the presence of noise and outliers in the data.

7) Gradient Boosting [13]: also uses an ensemble of individual decision trees which are generated sequentially, as in AdaBoost. Gradient Boosting at first trains only one decision tree and, after each iteration, grows a new tree in order to minimize the loss function. Similarly to AdaBoost, the process stops when the predefined number of trees has been created or when the loss function no longer improves.

8) XGBoost [14]: can be viewed as a better-performing implementation of the Gradient Boosting algorithm, as it allows for faster computation and parallelization. For this reason it can yield better performance compared to the latter, and it can be more easily scaled for use with high-dimensional data.
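The differences described above between Bagging, Random Forest, and ExtraTrees mostly reduce to how samples and features are drawn for each tree and how splits are chosen. The sketch below illustrates how those choices map onto common Scikit-Learn parameters; it is an assumption-laden illustration, not taken from the paper's replication package.

```python
# Illustrative configuration of the three bagging-style ensembles discussed above.
# The key difference is how each tree chooses its split points.
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier)

# Bagging: bootstrap samples, but every split may consider all features,
# so the trees tend to be structurally similar.
bagging = BaggingClassifier(n_estimators=100, bootstrap=True)

# Random Forest: bootstrap samples plus a random subset of features per split
# (max_features), which decorrelates the trees.
random_forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")

# ExtraTrees: on top of feature subsampling, the split thresholds themselves
# are drawn at random and only the best random split is kept.
extra_trees = ExtraTreesClassifier(n_estimators=100, max_features="sqrt")
```
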
III. CASE STUDY DESIGN

We designed our empirical study as a case study based on the guidelines defined by Runeson and Höst [16]. In this section, we describe the empirical study, including the goal and the research questions, the study context, the data collection, and the data analysis.

TABLE I
THE SELECTED PROJECTS

Project Name    Analyzed commits   Last commit LOC   Faults   SonarQube Violations
Ambari          9727               396775            3005     42348
Bcel            1255               75155             41       8420
Beanutils       1155               72137             64       5156
Cli             861                12045             59       37336
Codec           1644               34716             57       2002
Collections     2847               119208            103      11120
Configuration   2822               124892            153      5598
Dbcp            1564               32649             100      3600
Dbutils         620                15114             21       642
Deamon          886                3302              4        393
Digester        2132               43177             23       4945
FileUpload      898                10577             30       767
Io              1978               56010             110      4097
Jelly           1914               63840             45       5057
Jexl            1499               36652             58       34802
Jxpath          596                40360             43       4951
Net             2078               60049             160      41340
Ognl            608                35085             15       4945
Sshd            1175               139502            222      8282
Validator       1325               33127             63       2048
Vfs             1939               59948             129      3604
Sum             39,518             1,464,320         4,505    231,453

A. Goal and Research Questions

As reported in Section I, our goals are to analyze the fault-proneness of SonarQube rule violations (SQ-Violations) and the accuracy of the quality model provided by SonarQube. Based on the aforementioned goals, we derived the following three research questions (RQs).

RQ1 Which are the most fault-prone SQ-Violations?
In this RQ, we aim to understand whether the introduction of a set of SQ-Violations is correlated with the introduction of faults in the same commit, and to prioritize the SQ-Violations based on their fault-proneness. Our hypothesis is that a set of SQ-Violations should be responsible for the introduction of bugs.

RQ2 Are SQ-Violations classified as "bugs" by SonarQube more fault-prone than other rules?
Our hypothesis is that reliability rules ("bugs") should be more fault-prone than maintainability rules ("code smells") and security rules.

RQ3 What is the fault prediction accuracy of the SonarQube quality model based on violations classified as "bugs"?
SonarQube claims that whenever a violation is classified as a "bug", a fault will develop in the software. Therefore, we aim at analyzing the fault prediction accuracy of the rules that are classified as "bugs" by measuring their precision and recall.

Fig. 1. The Data Analysis Process (pipeline depicted in the figure: commit labeling with SZZ producing labeled commits; residual analysis based on ΔIND + ΔFIX with a >95% threshold; importance extraction and selection of the best model; overall validation with the machine learning models Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, Extremely Randomized Trees, AdaBoost, and XGBoost)

B. Study Context

In agreement with the four companies, we considered open source projects available in the Technical Debt Dataset [17]. The reason for considering open source projects instead of their private projects is that not all the companies would have allowed us to perform a historical analysis of all their commits. Moreover, with closed source projects the whole process could not be replicated and verified transparently.

For this purpose, the four companies together selected 21 out of the 31 projects available, based on the ones that were most similar to their internal projects, considering similar project age, size, usage of patterns, and other criteria that we cannot report for reasons of NDA.

The dataset includes the analysis with SonarQube of each commit of the projects, from their first commit until the end of 2015, information on all the Jira issues, and a classification of the fault-inducing commits performed with the SZZ algorithm [18].

In Table I, we report the list of projects we considered, together with the number of analyzed commits, the project size (LOC) of the last analyzed commit, the number of faults identified in the selected commits, and the total number of SQ-Violations.

C. Data Analysis

Before answering our RQs, we first executed the eight machine learning (ML) models, then compared their accuracy, and finally performed the residual analysis. The next subsections describe the analysis process in detail, as depicted in Figure 1.

1) Machine Learning Execution: In this step we aim at comparing the fault-proneness prediction power of SQ-Violations by applying the eight machine learning models described in Section II-B. Therefore, we aim at predicting the fault-proneness of a commit (labeled with the SZZ algorithm) by means of the SQ-Violations introduced in the same commit. We used the SQ-Violations introduced in each commit as independent variables (predictors) to determine whether a commit is fault-inducing (dependent variable).
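A minimal sketch of how such a commit-level design matrix could be assembled is shown below; the commit identifiers, column names, and input records are hypothetical placeholders, not the actual schema of the Technical Debt Dataset.

```python
import pandas as pd

# Hypothetical per-commit records: how many violations of each rule (squid)
# were introduced, plus the SZZ label. Not the real dataset schema.
records = [
    {"commit": "c1", "S1192": 2, "S1481": 0, "S00112": 1, "fault_inducing": 1},
    {"commit": "c2", "S1192": 0, "S1481": 1, "S00112": 0, "fault_inducing": 0},
    {"commit": "c3", "S1192": 0, "S1481": 0, "S00112": 3, "fault_inducing": 0},
]
df = pd.DataFrame(records).set_index("commit")

# Independent variables: SQ-Violations introduced in each commit.
X = df.drop(columns=["fault_inducing"])
# Dependent variable: SZZ label (1 = fault-inducing commit).
y = df["fault_inducing"]
```
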
After training the eight models described in Section II-B, we performed a second analysis, retraining the models using a drop-column mechanism [19]. This mechanism is a simplified variant of the exhaustive search [20], which iteratively tests every subset of features for its classification performance. The full exhaustive search is very time-consuming, requiring $2^P$ train-evaluation steps for a $P$-dimensional feature space. Instead, we only drop individual features one at a time, rather than all possible groups of features.

More specifically, a model is trained $P$ times, where $P$ is the number of features, iteratively removing one feature at a time, from the first to the last of the dataset. The difference in cross-validated test accuracy between the newly trained model and the baseline model (the one trained with the full set of features) defines the importance of that specific feature. The more the accuracy of the model drops, the more important the specific feature is for the classification.

The feature importance of the SQ-Violations has been calculated for all the machine learning models described, but we considered only the importance calculated by the most accurate model (cross-validated with all $P$ features, as described in the next section), as the feature importances of a poor classifier are likely to be less reliable.
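The following sketch illustrates the drop-column importance described above, under the assumption that the cross-validated AUC is the accuracy measure being compared (the paper reports AUC in Section IV). The model_factory callable and the data are placeholders, and plain Scikit-Learn splitting is used for brevity instead of the temporally ordered folds described in the next subsection.

```python
# Sketch of drop-column feature importance: retrain the model once per
# feature with that single column removed and measure how much the
# cross-validated AUC drops with respect to the full-feature baseline.
from sklearn.model_selection import cross_val_score

def drop_column_importance(model_factory, X, y, cv=10):
    baseline = cross_val_score(model_factory(), X, y,
                               cv=cv, scoring="roc_auc").mean()
    importances = {}
    for feature in X.columns:
        reduced = X.drop(columns=[feature])
        score = cross_val_score(model_factory(), reduced, y,
                                cv=cv, scoring="roc_auc").mean()
        # Positive value: removing the feature hurts the model.
        importances[feature] = baseline - score
    return importances

# Example usage with the XGBoost configuration from Section II-B:
# importances = drop_column_importance(lambda: XGBClassifier(n_estimators=100), X, y)
```
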
2) Accuracy Comparison: Apart from ranking the SQ-Violations by their importance, we first need to confirm the validity of the prediction model. If the predictions obtained from the ML techniques are not accurate, the feature ranking would also become questionable. To assess the prediction accuracy, we performed a 10-fold cross-validation, dividing the data into 10 parts, i.e., we trained the models ten times, always using 1/10 of the data as the testing fold. For each fold, we evaluated the classifiers by calculating a number of accuracy metrics (see below). The data related to each project were split into 10 sequential parts, thus respecting the temporal order and the proportion of data for each project. The models were trained iteratively on the groups of data preceding the test set. The temporal order was also respected for the groups included in the training set: as an example, in fold 1 we used group 1 for training and group 2 for testing, in fold 2 groups 1 and 2 were used for training and group 3 for testing, and so on for the remaining folds.

As accuracy metrics, we first calculated precision and recall. However, as suggested by [21], these two measures present some biases, as they are mainly focused on positive examples and predictions and do not capture any information about the rates and kinds of errors made. The contingency matrix (also named confusion matrix) and the related f-measure help to overcome this issue. Moreover, as recommended by [21], the Matthews Correlation Coefficient (MCC) should also be considered to understand possible disagreement between actual values and predictions, as it involves all four quadrants of the contingency matrix. From the contingency matrix, we also retrieved the true negative rate (TNR), which measures the percentage of negative samples correctly categorized as negative; the false positive rate (FPR), which measures the percentage of negative samples misclassified as positive; and the false negative rate (FNR), which measures the percentage of positive samples misclassified as negative. The true positive rate is left out, as it is equivalent to the recall. The way these measures are calculated can be found in Table II.

TABLE II
ACCURACY METRICS FORMULAE

Accuracy Measure   Formula
Precision          TP / (FP + TP)
Recall             TP / (FN + TP)
MCC                (TP * TN - FP * FN) / sqrt((FP + TP)(FN + TP)(FP + TN)(FN + TN))
f-measure          2 * (precision * recall) / (precision + recall)
TNR                TN / (FP + TN)
FPR                FP / (TN + FP)
FNR                FN / (FN + TP)
TP: True Positive; TN: True Negative; FP: False Positive; FN: False Negative

Finally, to graphically compare the true positive and false positive rates, we calculated the Receiver Operating Characteristic (ROC) curve and the related Area Under the ROC Curve (AUC): the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.

In our dataset, the proportion of the two types of commits is not even: a large majority (approx. 90%) of the commits are non-fault-inducing, and a plain accuracy score would reach high values simply by always predicting the majority class. On the other hand, the ROC curve (as well as the precision and recall scores) is informative even in seriously unbalanced situations.

3) SQ-Violations Residual Analysis: The results from the previous ML techniques show a set of SQ-Violations related to fault-inducing commits. However, the relations obtained in the previous analysis do not imply causation between faults and SQ-Violations. In this step, we analyze which violations were introduced in the fault-inducing commits and then removed in the fault-fixing commits. We performed this comparison at the file level. Moreover, we did not consider cases where the same violation was introduced in the fault-inducing commit, removed, re-introduced in commits not related to the same fault, and finally removed again during the fault-fixing commit.

In order to understand which SQ-Violations were introduced in the fault-inducing commits (IND) and then removed in the fault-fixing commit (FIX), we analyzed the residuals of each SQ-Violation by calculating:

Residual = ΔIND + ΔFIX

where ΔIND and ΔFIX are calculated as:

ΔIND = #SQ-Violations introduced in the fault-inducing commit
ΔFIX = #SQ-Violations removed in the fault-fixing commit

Figure 2 schematizes the residual analysis.

Fig. 2. Residuals Analysis (schematic example in the figure: violations introduced in the fault-inducing commits a2d7c9e49, 21 Jun 2016, and a2d7c9e57, 22 Jun 2016, and removed in the fault-fixing commit E8bfdb92, 11 Jul 2016)

We calculated the residuals for each commit/fix pair, verifying the introduction of the SQ-Violation Vi in the fault-inducing commit (IND) and the removal of the violation in the fault-fixing commit (FIX). If ΔIND was lower than zero, no SQ-Violations were introduced in the fault-inducing commit; therefore, we tagged such a commit as not related to faults.

For each violation, the analysis of the residuals led us to two groups of commits:
• Residual > 0: The SQ-Violations introduced in the fault-inducing commits were not removed during the fault-fixing.
• Residual ≤ 0: All the SQ-Violations introduced in the fault-inducing commits were removed during the fault-fixing. If Residual < 0, other SQ-Violations of the same type, already present in the code before the bug-inducing commit, were also removed.

For each SQ-Violation, we calculated descriptive statistics so as to understand the distribution of the residuals. Then, we calculated the residual sum of squares (RSS) as:

RSS = Σ (Residual)²

We also calculated the percentage of residuals equal to zero as:

(#zero residuals / #residuals) × 100%

Based on the residual analysis, we consider violations where the percentage of zero residuals was higher than 95% as a valid result.
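A small sketch of the residual computation described above, assuming that removals are counted with a negative sign so that a fully compensated violation yields a residual of zero (consistent with the example in Figure 2); the counts are illustrative.

```python
# Residual analysis sketch for one SQ-Violation across commit/fix pairs.
# delta_ind: violations of this rule introduced in the fault-inducing commit.
# delta_fix: violations of this rule removed in the fault-fixing commit,
#            counted with a negative sign so that Residual = delta_ind + delta_fix.
pairs = [
    {"delta_ind": 2, "delta_fix": -2},   # fully removed during the fix -> residual 0
    {"delta_ind": 1, "delta_fix": 0},    # not removed -> residual > 0
    {"delta_ind": 1, "delta_fix": -3},   # pre-existing violations also removed -> residual < 0
]

residuals = [p["delta_ind"] + p["delta_fix"] for p in pairs]
rss = sum(r ** 2 for r in residuals)                               # residual sum of squares
pct_zero = 100.0 * sum(r == 0 for r in residuals) / len(residuals)

print(residuals, rss, pct_zero)   # [0, 1, -2] 5 33.3...
# Rules with pct_zero > 95% are treated as consistently removed during fault fixing.
```
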
4) RQ1: Which are the most fault-prone SQ-Violations?: In order to analyze RQ1, we combined the results obtained from the best ML technique and from the residual analysis. Therefore, if a violation has a high correlation with faults but its percentage of zero residuals is very low, we can discard it from our model, since it will be valuable only in a limited number of cases. As we cannot claim a cause-effect relationship without a controlled experiment, the results of the residual analysis are a step towards the identification of this relationship and the reduction of spurious correlations.

5) RQ2: Are SQ-Violations classified as "bugs" by SonarQube more fault-prone than other rules?: The comparison of rules classified as "bugs" with other rules has been performed considering the results of the best ML technique and the residual analysis, comparing the number of violations classified as "bugs" that resulted to be fault-prone in RQ1. We expect "bugs" to be among the most fault-prone rules.

6) RQ3: What is the fault prediction accuracy of the SonarQube quality model based on violations classified as "bugs"?: Since SonarQube considers every SQ-Violation tagged as a "bug" as "something wrong in the code that will soon be reflected in a bug", we also analyzed the accuracy of the model provided by SonarQube. In order to answer our RQ3, we calculated the percentage of SQ-Violations classified as "bugs" that resulted in being highly fault-prone according to the previous analysis. Moreover, we also analyzed the accuracy of the model by calculating all the accuracy measures reported in Section III-C2.

D. Replicability

In order to allow the replication of our study, we published the raw data in the replication package8.

8 Replication Package: https://figshare.com/s/fe5d04e39cb74d6f20dd

IV. RESULTS

In this work, we considered more than 37 billion effective lines of code and retrieved a total of 1,464,320 violations from 39,518 commits scanned with SonarQube. Table I reports the list of projects together with the number of analyzed commits and the size (in lines of code) of the latest analyzed commit. We retrieved a total of 4,505 faults reported in the issue trackers.

All of the 202 rules available in SonarQube for Java were found in the analyzed projects. For reasons of space limitation, we refer to the SQ-Violations only by their SonarQube id number (squid). The complete list of rules, together with their descriptions, is reported in the online replication package (file SonarQube-rules.xlsx). Note that in column "Type", MA means Major, MI means Minor, CR means Critical, and BL means Blocker.

A. RQ1: Which are the most fault-prone SQ-Violations?

In order to answer this RQ, we first analyzed the importance of the SQ-Violations by means of the most accurate ML technique and then we performed the residual analysis.

1) SQ-Violations Importance Analysis: As shown in Figure 3, XGBoost resulted in the most accurate model among the eight machine learning techniques applied to the dataset. The 10-fold cross-validation reported an average AUC of 0.83. Table III (columns under RQ1) reports the average reliability measures for the eight models. Although the different measures have different strengths and weaknesses (see Section III-C2), all the measures consistently show that XGBoost is the most accurate technique. The ROC curves of all models are depicted in Figure 3, while the reliability results of all the 10-fold models are available in the online replication package.

Therefore, we selected XGBoost as the classification model for the next steps and utilized the feature importance calculated by applying the drop-column method to this classifier. The XGBoost classifier was retrained removing one feature at a time sequentially.

TABLE III
MODEL RELIABILITY (columns under RQ1: average over the 10-fold validation models; column SQ "bugs": SonarQube "bugs" model, used for RQ2 and RQ3)

Measure     Logistic Regr.   Decision Tree   Bagging   Random Forest   Extra Trees   AdaBoost   Gradient Boosting   XGBoost   SQ "bugs"
Precision   0.417            0.311           0.404     0.532           0.427         0.481      0.516               0.608     0.086
Recall      0.076            0.245           0.220     0.156           0.113         0.232      0.192               0.182     0.028
MCC         0.162            0.253           0.279     0.266           0.203         0.319      0.300               0.318     0.032
f-measure   0.123            0.266           0.277     0.228           0.172         0.301      0.275               0.275     0.042
TNR         0.996            0.983           0.990     0.995           0.995         0.993      0.995               0.997     0.991
FPR         0.004            0.002           0.010     0.004           0.005         0.007      0.005               0.003     0.009
FNR         0.924            0.755           0.779     0.844           0.887         0.768      0.808               0.818     0.972
AUC         0.670            0.501           0.779     0.802           0.775         0.791      0.825               0.832     0.509

Fig. 3. ROC Curve (average over the 10-fold validation models). Reported AUC values: XGBoost 83.21%, GradientBoost 82.49%, RandomForest 80.18%, AdaBoost 79.14%, Bagging 77.85%, ExtraTrees 77.45%, LogisticRegression 67.04%, DecisionTrees 50.14%.

Fig. 4. Comparison of violations introduced in fault-inducing commits and removed in fault-fixing commits (log-scale counts per squid; series: All Violations, Introduced in Fault-inducing commits, Introduced in Fault-inducing commits and Removed in Fault-fixing commits).

23 SQ-Violations were ranked with an importance higher than zero by XGBoost. In Table V, we report the SQ-Violations with an importance higher than or equal to 0.01% (the column "Intr. & Rem. (%)" reports the number of violations introduced in the fault-inducing commits AND removed in the fault-fixing commits). The remaining SQ-Violations are reported in the raw data for reasons of space.

The combination of the 23 violations guarantees a good classification power, as reported by the AUC of 0.83. However, the drop-column algorithm demonstrates that SQ-Violations have a very low individual importance. The most important SQ-Violation has an importance of 0.62%. This means that the removal of this variable from the model would decrease the accuracy (AUC) by only 0.62%. Three other violations have a similar importance (higher than 0.5%), while the others are slightly lower.

2) Model Accuracy Validation: The analysis of residuals shows that several SQ-Violations are introduced in fault-inducing commits in more than 50% of cases. 32 SQ-Violations out of 202 had been introduced in the fault-inducing commits and then removed in the fault-fixing commit in more than 95% of the faults. The application of XGBoost also confirmed an importance higher than zero for 26 of these SQ-Violations. This confirms that developers, even if not using SonarQube, pay attention to these 32 rules, especially in case of refactoring or bug-fixing.

Table V reports the descriptive statistics of the residuals, together with the percentage of residuals equal to zero (number of SQ-Violations introduced during fault-inducing commits and removed during fault-fixing commits). Column "Res >95%" shows a checkmark (X) when the percentage of residuals equal to zero was higher than 95%. Figure 4 compares the number of violations introduced in fault-inducing commits and the number of violations removed in the fault-fixing commits.

B. Manual Validation of the Results

In order to understand the possible causes and to validate the results, we manually analyzed 10 randomly selected instances for each of the first 20 SQ-Violations ranked as most important by the XGBoost algorithm.

The first immediate result is that, in 167 of the 200 manually inspected violations, the bug induced in the fault-inducing commit was not fixed by the same developer that induced it. We also noticed that violations related to duplicated code and empty statements (e.g., "method should not be empty") always generated a fault (in the randomly selected cases). When committing an empty method (often containing only a "TODO" note), developers often forgot to implement it and then used it without realizing that the method did not return the expected value. An extensive application of unit testing could definitely reduce this issue; however, we are aware that this is a very common practice in several projects. Moreover, SQ-Violations such as 1481 (unused private variables should be removed) and 1144 (unused private methods should be removed) unexpectedly turned out to be an issue. In several cases, we discovered that methods that were not used, but were expected to be used by other methods, resulted in a fault. As an example, if a method A calls another method B to compose a result message, not calling method B results in the loss of the information provided by B.

TABLE IV
SONARQUBE CONTINGENCY MATRIX (PREDICTION MODEL BASED ON SQ-VIOLATIONS CONSIDERED AS "BUG" BY SONARQUBE)

Predicted    Actual IND   Actual NOT IND
IND          32           342
NOT IND      1,124        38,020

C. RQ2: Are SQ-Violations classified as "bugs" by SonarQube more fault-prone than other rules?

Out of the 57 violations classified as "bugs" by SonarQube, only three (squid 1143, 1147, 1764) were considered fault-prone, with a very low importance from XGBoost and with residuals higher than 95%. Rules classified as "code smells", on the other hand, were frequently violated in fault-inducing commits. Considering all the SQ-Violations, out of the 40 SQ-Violations that we identified as fault-prone, 37 are classified as "code smells" and one as a security "vulnerability".

When comparing severity with the fault-proneness of the SQ-Violations, only three SQ-Violations (squid 1147, 2068, 2178) were associated with the highest severity level (Blocker). However, the fault-proneness of these rules is extremely low (importance <= 0.14%). Looking at the remaining violations, we can see that the severity level is not related to the importance reported by the XGBoost algorithm, since rules of different severity levels are distributed homogeneously across all importance levels.

D. RQ3: Fault prediction accuracy of the SonarQube model

"Bug" violations were introduced in 374 commits out of the 39,518 analyzed commits. Therefore, we analyzed which of these commits were actually fault-inducing commits. Based on SonarQube's statement, all these commits should have generated a fault.

All the accuracy measures (Table III, column SQ "bugs") confirm the very low prediction power of "bug" violations. The vast majority of "bug" violations never become a fault. The results are also confirmed by the extremely low AUC (50.95%) and by the contingency matrix (Table IV). The results of the SonarQube model also confirm the results obtained in RQ2. Violations classified as "bugs" should be classified differently, since they are hardly ever injected in fault-inducing commits.
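As a sanity check, the accuracy figures for the SonarQube "bugs" model can be recomputed directly from the contingency matrix in Table IV using the formulae of Table II; the short sketch below reproduces the precision, recall, f-measure, and MCC values reported in Table III (0.086, 0.028, 0.042, and 0.032).

```python
from math import sqrt

# Contingency matrix of the SonarQube "bugs" model (Table IV).
TP, FP = 32, 342        # commits predicted fault-inducing
FN, TN = 1_124, 38_020  # commits predicted non-fault-inducing

precision = TP / (TP + FP)                                   # ~0.086
recall = TP / (TP + FN)                                      # ~0.028
f_measure = 2 * precision * recall / (precision + recall)    # ~0.042
mcc = (TP * TN - FP * FN) / sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))           # ~0.032

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f-measure={f_measure:.3f} MCC={mcc:.3f}")
```
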
V. THREATS TO VALIDITY

In this section, we discuss the threats to validity, including internal, external, construct, and reliability validity. We also explain the different tactics adopted [22].

Construct Validity. The results might be biased regarding the mapping between faults and commits. We relied on the ASF practice of tagging commits with the issue ID; however, in some cases, developers could have tagged a commit differently. Moreover, the results could also be biased due to detection errors of SonarQube. We are aware that static analysis tools suffer from false positives. In this work we aimed at understanding the fault-proneness of the rules adopted by the tools without modifying them, so as to reflect the real impact that developers would experience while using the tools. In future work, we are planning to replicate this study manually validating a statistically significant sample of violations, to assess the impact of false positives on the achieved findings. As for the analysis timeframe, we analyzed commits until the end of 2015, considering all the faults raised until the end of March 2018. We expect that the vast majority of the faults should have been fixed; however, it is possible that some of these faults had still not been identified and fixed.

Internal Validity. Threats can be related to the causation between SQ-Violations and fault-fixing activities. As for the identification of the fault-inducing commits, we relied on the SZZ algorithm [18]. We are aware that in some cases the SZZ algorithm might not have identified fault-inducing commits correctly, because of the limitations of the line-based diff provided by git and because in some cases bugs can be fixed by modifying code in locations other than the lines that induced them. Moreover, we are aware that the imbalanced data could have influenced the results (approximately 90% of the commits were non-fault-inducing). However, the application of solid machine learning techniques commonly applied to imbalanced data could help to reduce this threat.

External Validity. We selected 21 projects from the ASF, which incubates only certain systems that follow specific and strict quality rules. Our case study was not based on only one application domain. This was avoided since we aimed to find general mathematical models for the prediction of the number of bugs in a system. Choosing only one or a very small number of application domains could have been an indication of the non-generality of our study, as only prediction models from the selected application domain would have been chosen. The selected projects stem from a very large set of application domains, ranging from external libraries, frameworks, and web utilities to large computational infrastructures. The dataset only includes Java projects. We are aware that different programming languages, and projects at different maturity levels, could provide different results.

Reliability Validity. We do not exclude the possibility that other statistical or machine learning approaches, such as Deep Learning, might have yielded similar or even better accuracy than our modeling approach.

VI. RELATED WORK

In this section, we introduce the related work, analyzing the literature on SQ-Violations and fault prediction.

TABLE V
SUMMARY OF THE MOST IMPORTANT SONARQUBE VIOLATIONS RELATED TO FAULTS (XGBOOST IMPORTANCE > 0.2%)

SQUID                  Severity   Type   # Occ.   Intr. & Rem. (%)   Intr. in fault-ind. (%)   Mean     Min    Max    Stdev    RSS    XGBoost Imp. (%)   Res. >95%
S1192                  CRITICAL   CS     1815     50.87              95.10                     245.60   -861   2139   344.42   1726   0.66               X
S1444                  MINOR      CS     96       2.69               97.92                     4.59     -7     73     10.34    94     0.62               X
Useless Import Check   MAJOR      CS     1026     28.76              97.27                     33.37    -170   351    61.58    998    0.41               X
S00105                 MINOR      CS     263      7.37               97.72                     1.96     -13    32     10.22    257    0.41               X
S1481                  MINOR      CS     568      15.92              95.25                     10.41    -6     83     14.60    541    0.39               X
S1181                  MAJOR      CS     200      5.61               97.00                     8.87     0      88     13.43    194    0.31               X
S00112                 MAJOR      CS     1644     46.08              94.77                     188.26   -279   1529   270.34   1558   0.29
S1132                  MINOR      CS     704      19.73              93.75                     121.75   -170   694    134.91   660    0.24
Hidden Field           MAJOR      CS     584      16.37              92.98                     26.96    -12    143    29.42    543    0.23
S134                   CRITICAL   CS     1272     35.65              94.65                     70.66    -66    567    88.07    1204   0.20

Falessi et al. [3] studied the distribution of 16 metrics and 106 SQ-Violations in an industrial project. They applied a what-if approach with the goal of investigating what could happen if a specific SQ-Violation had not been introduced in the code, and whether the number of faulty classes decreases in case the violation is not introduced. They compared four ML techniques, applying the same techniques to a modified version of the code where they manually removed SQ-Violations. Results showed that 20% of faults were avoidable if the code smells had been removed.

Tollin et al. [4] investigated whether SQ-Violations would lead to an increase in the number of changes (code churn) in subsequent commits. The study was applied to two different industrial projects, written in C# and JavaScript. They reported that classes affected by more SQ-Violations have a higher change-proneness. However, they did not prioritize or classify the most change-prone SQ-Violations.

Digkas et al. [23] studied weekly snapshots of 57 Java projects of the ASF, investigating the amount of technical debt paid back over the course of the projects and what kind of issues were fixed. They considered SQ-Violations with severity marked as Blocker, Critical, and Major. The results showed that only a small subset of all issue types was responsible for the largest percentage of technical debt repayment. Their results thus confirm our initial assumption that there is no need to fix all issues: rather, by targeting particular violations, the development team can achieve higher benefits. However, their work does not consider how the issues actually relate to faults.

Falessi and Reichel [24] developed an open-source tool (MIND) to analyze the technical debt interest occurring due to violations of quality rules. Interest is measured by means of various metrics related to fault-proneness. The tool uses SonarQube rules and linear regression to estimate the defect-proneness of classes. The aim of MIND is to answer developers' questions like: is it worth refactoring this piece of code? Differently from our work, the actual type of issue causing the defect was not considered.

Codabux and Williams [25] propose a predictive model to prioritize technical debt. They extracted class-level metrics for defect- and change-prone classes using Scitool Understanding and the Jira Extracting Tool from Apache Hive and determined significant independent variables for defect- and change-prone classes, respectively. They then used a Bayesian approach to build a prediction model to determine the "technical debt proneness" of each class. Their model requires the identification of "technical debt items", which requires manual input. These items are ultimately ranked and given a risk probability by the predictive framework.

Saarimäki investigated the diffuseness of SQ-Violations in the same dataset we adopted [26] and the accuracy of the SonarQube remediation time [27].

Regarding the detection of other code quality rules, seven different machine learning approaches (Random Forest, Naive Bayes, Logistic Regression, IBl, IBk, VFI, and J48) [28] were successfully applied to six code smells (Lazy Class, Feature Envy, Middle Man Message Chains, Long Method, Long Parameter Lists, and Switch Statement) and 27 software metrics (including Basic, Class Employment, Complexity, Diagrams, Inheritance, and MOOD) as independent variables.

Code smell detection was also investigated from the point of view of how the severity of code smells can be classified through machine learning models [29], such as J48, JRip, Random Forest, Naive Bayes, SMO, and LibSVM, with the best agreement achieved in the detection of three code smells (God Class, Large Class, and Long Parameter List).

VII. DISCUSSION AND CONCLUSION

SonarQube classifies 57 rules as "bugs", claiming that sooner or later they will generate faults. Four local companies contacted us to investigate the fault prediction power of the SonarQube rules, possibly using machine learning, so as to understand whether they can rely on the SonarQube default rule-set or whether they can use machine learning to customize the model more accurately.

We conducted this work analyzing a set of 21 well-known open source projects selected by the companies, analyzing the presence of all 202 SonarQube-detected violations in the complete project history. The study considered 39,518 commits, including more than 38 billion lines of code, 1.4 million violations, and 4,505 faults mapped to the commits.

To understand which SQ-Violations have the highest fault-proneness, we first applied eight machine learning approaches to identify the SQ-Violations that are common in commits labeled as fault-inducing. As for the application of the different machine learning approaches, we can see an important difference in their accuracy, with a difference of more than 53% between the worst model (Decision Trees, AUC = 47.3% ± 3%) and the best model (XGBoost, AUC = 83.32% ± 10%). This also confirms what we reported in Section II-B: ensemble models, like XGBoost, can generalize the data better compared to Decision Trees and hence turn out to be more scalable. The use of many weak classifiers yields an overall better accuracy, as can be seen from the fact that the boosting algorithms (AdaBoost, GradientBoost, and XGBoost) are the best performers for this classification task, followed closely by the Random Forest classifier and ExtraTrees.

As the next step, we checked the percentage of commits where a specific violation was introduced in the fault-inducing commit and then removed in the fault-fixing commit, accepting only those violations where the percentage of cases in which the same violation was added in the fault-inducing commit and removed in the fault-fixing commit was higher than 95%.

Our results show that 26 violations can be considered fault-prone according to the XGBoost model. However, the analysis of the residuals showed that 32 SQ-Violations were commonly introduced in a fault-inducing commit and then removed in the fault-fixing commit, but only two of them are considered fault-prone by the machine learning algorithms. It is important to notice that all the SQ-Violations that are removed in more than 95% of cases during fault-fixing commits are also selected by XGBoost, confirming their importance.

When we looked at which of the SQ-Violations were considered fault-prone in the previous step, only four of them are also classified as "bugs" by SonarQube. The remaining fault-prone SQ-Violations are mainly classified as "code smells" (SonarQube claims that "code smells" increase maintenance effort but do not create faults). The analysis of the accuracy of the fault prediction power of the SonarQube model based on "bugs" showed an extremely low fitness, with an AUC of 50.94%, confirming that violations classified as "bugs" almost never resulted in a fault.

An important outcome is related to the application of the machine learning techniques. Not all the techniques performed equally, and XGBoost was the most accurate and fastest technique in all the projects. Therefore, the application of XGBoost to historical data is a good alternative to the manual tuning of the model, where developers would select the rules they believe to be important based on their experience.

The results confirmed the impression of the developers of our companies. Their developers still consider SonarQube very useful to help develop clean code that adheres to company standards and to help new developers write code that can be easily understood by other developers. Before the execution of this study, the companies were trying to avoid violating the rules classified as bugs, hoping to reduce fault-proneness. However, after the execution of this study, the companies individually customized their set of rules, considering only coding-standard aspects and rules classified as "security vulnerabilities". The main result for the companies is that they will need to invest in the adoption of other tools to reduce fault-proneness and, therefore, we will need to replicate this work considering other tools such as FindBugs and PMD, but also commercial tools such as Coverity Scan, Cast Software, and others.

Based on the overall results, we can summarize the following lessons learned:

Lesson 1: SonarQube violations are not good predictors of fault-proneness if considered individually, but they can be good predictors if considered together. Machine learning techniques, such as XGBoost, can be used to effectively train a customized model for each company.

Lesson 2: SonarQube violations classified as "bugs" do not seem to be the cause of faults.

Lesson 3: SonarQube violation severity is not related to fault-proneness and, therefore, developers should carefully consider severity as a decision factor for refactoring a violation.

Lesson 4: Technical debt should be calculated differently, and the non-fault-prone rules should not be accounted as "fault-prone" (or "buggy") components of the technical debt, while several "code smells" rules should be carefully considered as potentially fault-prone.

The lessons learned confirm our initial hypothesis about the fault-proneness of the SonarQube violations. However, we are not claiming that SonarQube violations are not harmful in general. We are aware that some violations could be more prone to changes [3], decrease code readability, or increase the maintenance effort.

Our recommendation to companies using SonarQube is to customize the rule-set, taking into account which violations to consider, since the refactoring of several SQ-Violations might not lead to a reduction in the number of faults. Furthermore, since the rules in SonarQube constantly evolve, companies should continuously reconsider the adopted rules.

Research on technical debt should focus more on validating which rules are actually harmful from different points of view and which will account for a higher technical debt if not refactored immediately.

Future work includes the replication of this work considering the severity levels of SonarQube rules and their importance. We are working on the definition of a more accurate model for predicting TD [30]. Moreover, we are planning to investigate whether classes that SonarQube identifies as problematic are more fault-prone than those not affected by any problem. Since this work did not confirm the fault-proneness of SonarQube rules, the companies are interested in finding other static analysis tools for this purpose. Therefore, we are planning to replicate this study using other tools such as FindBugs, Checkstyle, PMD, and others. Moreover, we will focus on the definition of recommender systems integrated in the IDEs [31], [32], to alert developers about the presence of potentially problematic classes based on their (evolution of) change- and fault-proneness and to rank them based on the potential benefits provided by their removal.

REFERENCES

[1] Carmine Vassallo, Sebastiano Panichella, Fabio Palomba, Sebastian Proksch, Harald C. Gall, and Andy Zaidman. How developers engage with static analysis tools in different contexts. In Empirical Software Engineering, 2019.
[2] Valentina Lenarduzzi, Alberto Sillitti, and Davide Taibi. A survey on code analysis tools for software maintenance prediction. In 6th International Conference in Software Engineering for Defence Applications, pages 165–175. Springer International Publishing, 2020.
[3] D. Falessi, B. Russo, and K. Mullen. What if I had no smells? 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 78–84, Nov 2017.
[4] I. Tollin, F. Arcelli Fontana, M. Zanoni, and R. Roveda. Change prediction through coding rules violations. EASE'17, pages 61–64, New York, NY, USA, 2017. ACM.
[5] Valentina Lenarduzzi, Alberto Sillitti, and Davide Taibi. Analyzing forty years of software maintenance models. In 39th International Conference on Software Engineering Companion, ICSE-C '17, pages 146–148, Piscataway, NJ, USA, 2017. IEEE Press.
[6] M. Fowler and K. Beck. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., 1999.
[7] D. R. Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society, Series B (Methodological), 20(2):215–242, 1958.
[8] Leo Breiman, Jerome Friedman, Charles J. Stone, and R.A. Olshen. Classification and Regression Trees. 1984.
[9] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[10] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[11] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
[12] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[13] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.
[14] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. Pages 785–794, New York, New York, USA, 2016. ACM Press.
[15] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
[16] P. Runeson and M. Höst. Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering, 14(2):131–164, 2009.
[17] Valentina Lenarduzzi, Nyyti Saarimäki, and Davide Taibi. The Technical Debt Dataset. In 15th Conference on PREdictive Models and Data Analytics in Software Engineering, PROMISE '19, 2019.
[18] J. Śliwerski, T. Zimmermann, and A. Zeller. When do changes induce fixes? MSR '05, pages 1–5, New York, NY, USA, 2005. ACM.
[19] Terence Parr, Kerem Turgutlu, Christopher Csiszar, and Jeremy Howard. Beware default random forest importances. http://explained.ai/rf-importance/index.html. Accessed: 2018-07-20.
[20] Hyunjin Yoon, Kiyoung Yang, and Cyrus Shahabi. Feature subset selection and feature ranking for multivariate time series. IEEE Transactions on Knowledge and Data Engineering, 17(9):1186–1198, 2005.
[21] D. M. W. Powers. Evaluation: From precision, recall and f-measure to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies, 2(1):37–63, 2011.
[22] R.K. Yin. Case Study Research: Design and Methods, 4th Edition (Applied Social Research Methods, Vol. 5). SAGE Publications, Inc., 4th edition, 2009.
[23] G. Digkas, M. Lungu, P. Avgeriou, A. Chatzigeorgiou, and A. Ampatzoglou. How do developers fix issues and pay back technical debt in the Apache ecosystem? Pages 153–163, March 2018.
[24] D. Falessi and A. Reichel. Towards an open-source tool for measuring and visualizing the interest of technical debt. Pages 1–8, 2015.
[25] Z. Codabux and B.J. Williams. Technical debt prioritization using predictive analytics. ICSE '16, pages 704–706, New York, NY, USA, 2016. ACM.
[26] Nyyti Saarimäki, Valentina Lenarduzzi, and Davide Taibi. On the diffuseness of code technical debt in open source projects of the Apache ecosystem. International Conference on Technical Debt (TechDebt 2019), 2019.
[27] N. Saarimäki, M.T. Baldassarre, V. Lenarduzzi, and S. Romano. On the accuracy of SonarQube technical debt remediation time. SEAA Euromicro 2019, 2019.
[28] N. Maneerat and P. Muenchaisri. Bad-smell prediction from software design model using machine learning techniques. Pages 331–336, May 2011.
[29] Francesca Arcelli Fontana and Marco Zanoni. Code smell severity classification using machine learning techniques. Knowledge-Based Systems, 128(C):43–58, July 2017.
[30] Valentina Lenarduzzi, Antonio Martini, Davide Taibi, and Damian Andrew Tamburri. Towards surgically-precise technical debt estimation: Early results and research roadmap. In Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation, MaLTeSQuE 2019, pages 37–42, 2019.
[31] Andrea Janes, Valentina Lenarduzzi, and Alexandru Cristian Stan. A continuous software quality monitoring approach for small and medium enterprises. In Proceedings of the 8th ACM/SPEC International Conference on Performance Engineering Companion, pages 97–100, 2017.
[32] Valentina Lenarduzzi, Christian Stan, Davide Taibi, Davide Tosi, and Gustavs Venters. A dynamical quality model to continuously monitor software maintenance. 2017.
