SonarQube Rules
Abstract—Background. The popularity of tools for analyzing Technical Debt, and particularly the popularity of SonarQube, is increasing rapidly. SonarQube proposes a set of coding rules, [...]

[...] box set of rules (named ”the Sonar way”) is very subjective, and their developers did not manage to agree on a common set of rules that should be enforced. Therefore, the companies [...]

5 SonarQube Issues and Rules Severity: https://docs.sonarqube.org/display/SONAR/Issues (Last access: May 2018)
6 https://scikit-learn.org
7 https://xgboost.readthedocs.io
2) Decision Tree Classifier [8]: Utilizes a decision tree to return an output given a series of input variables. Its tree structure is characterized by a root node and multiple internal nodes, which are represented by the input variables, and by leaves, corresponding to the outputs. The nodes are linked to one another through branches, each representing a test. The output is given by the decision path taken. A decision tree is structured as an if-then-else diagram: in this structure, given the value of the variable in the root node, the result of a test leads to subsequent nodes through branches. This process is iterated for all the input variables (one for each node) until it reaches the output, represented by the leaves of the tree. In order to create the best structure, assigning each input variable to a different node, a series of metrics can be used. Amongst these we can find the Gini impurity and the information gain:
• Gini impurity measures how often a randomly chosen input would be wrongly classified if it were assigned to a randomly chosen class;
• Information gain measures how much information the test at each node provides about the outcome: the more informative the test performed in a node, the purer the resulting split.
In our models we used the Gini impurity measure to generate the tree, as it is more computationally efficient. The reasons behind the choice of decision tree models and Logistic Regression are their simplicity and ease of implementation. Moreover, the data does not need to be normalized, and the structure of the tree can be easily visualized. However, this model is prone to overfitting and therefore cannot generalize to unseen data. Furthermore, it does not perform well with imbalanced data, as it generates a biased structure.
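For reference, the two split criteria can be stated compactly as follows (standard textbook definitions written in our own notation, not taken verbatim from the paper), where p_k is the proportion of samples of class k at node t, H denotes the entropy, and N_L, N_R are the sizes of the left and right children produced by a candidate split of the N samples at t:

Gini(t) = 1 - \sum_k p_k^2
H(t) = - \sum_k p_k \log_2 p_k
IG(split) = H(t) - (N_L / N) H(t_L) - (N_R / N) H(t_R)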
3) Random Forest [10]: is an ensemble technique that helps to overcome the overfitting issues of the decision tree. The term ensemble indicates that these models use a set of simpler models to solve the assigned task. In this case, Random Forest uses an ensemble of decision trees. An arbitrary number of decision trees is generated, each considering a randomly chosen subset of the samples of the original dataset [9]. This subset is created with replacement, hence a sample can appear multiple times. Moreover, in order to reduce the correlation between the individual decision trees, a random subset of the features of the original dataset is also selected; in this case, the subset is created without replacement. Each tree is therefore trained on its own subset of the data, and it is able to give a prediction on new unseen data. The Random Forest classifier uses the results of all these trees and averages them to assign a label to the input. By randomly generating multiple decision trees and averaging their results, the Random Forest classifier is able to better generalize the data. Moreover, using the random subspace method, the individual trees are not correlated with one another. This is particularly important when dealing with a dataset with many features, as the probability of the features being correlated with each other increases.
4) Bagging [9]: Exactly like the Random Forest model, the Bagging classifier is applied to an arbitrary number of decision trees which are constructed choosing a subset of the samples of the original dataset. The difference with the Random Forest classifier is in the way in which the split point is decided: while in the Random Forest algorithm the splitting point is decided based on a random subset of the variables, the Bagging algorithm is allowed to look at the full set of variables to find the point minimizing the error. This translates into structural similarities between the trees, which do not resolve the overfitting problem related to the single decision tree. This model was included as a means of comparison with newer and better-performing models.
5) Extremely Randomized Trees (ExtraTrees) [11]: provides a further degree of randomization over the Random Forest. For the Random Forest model, the individual trees are created by randomly choosing subsets of the dataset features. In the ExtraTrees model, the way each node in the individual decision trees is split is also randomized. Instead of using the metrics seen before to find the optimal split for each node (Gini impurity and information gain), the cut-off choice for each node is completely randomized, and the resulting splitting rule is decided based on the best random split. Due to its characteristics, especially related to the way the splits are made at the node level, the ExtraTrees model is less computationally expensive than the Random Forest model, while retaining a higher generalization capability compared to the single decision trees.
6) AdaBoost [12]: is another ensemble algorithm, based on boosting [15], where the individual decision trees are grown sequentially. Moreover, a weight is assigned to each sample of the training set. Initially, all the samples are assigned the same weight. The model trains the first tree in order to minimize the classification error and, after the training is over, it increases the weights of those samples in the training set which were misclassified. It then grows another tree and the whole model is trained again with the new weights. This process continues until a predefined number of trees has been generated or the accuracy of the model cannot be improved anymore. Due to the many decision trees, as for the other ensemble algorithms, AdaBoost is less prone to overfitting and can, therefore, generalize better to the data. Moreover, it automatically selects the most important features for the task it is trying to solve. However, it can be more susceptible to the presence of noise and outliers in the data.
7) Gradient Boosting [13]: also uses an ensemble of individual decision trees which are generated sequentially, as in AdaBoost. Gradient Boosting trains at first only one decision tree and, after each iteration, grows a new tree in order to minimize the loss function. Similarly to AdaBoost, the process stops when the predefined number of trees has been created or when the loss function no longer improves.
8) XGBoost [14]: can be viewed as a better-performing implementation of the Gradient Boosting algorithm, as it allows for faster computation and parallelization. For this reason it can yield better performance compared to the latter, and it can be more easily scaled for use with high-dimensional data.
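As a rough illustration of how the eight classifiers described in this Section can be compared, the following sketch trains each of them with stratified 10-fold cross-validation and reports the mean AUC, using the scikit-learn and XGBoost libraries referenced in the footnotes. It is not the authors' original pipeline: the feature matrix and labels below are randomly generated placeholders standing in for the per-commit violation data.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              ExtraTreesClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from xgboost import XGBClassifier

# Placeholder data: rows are commits, columns are violation counts,
# y marks the (imbalanced) fault-inducing commits.
rng = np.random.RandomState(42)
X = rng.poisson(1.0, size=(1000, 50))
y = rng.binomial(1, 0.1, size=1000)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTrees": DecisionTreeClassifier(criterion="gini"),
    "RandomForest": RandomForestClassifier(n_estimators=100),
    "Bagging": BaggingClassifier(n_estimators=100),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "GradientBoost": GradientBoostingClassifier(n_estimators=100),
    "XGBoost": XGBClassifier(n_estimators=100, eval_metric="logloss"),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.2%} (+/- {auc.std():.2%})")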
III. CASE STUDY DESIGN

We designed our empirical study as a case study based on the guidelines defined by Runeson and Höst [16]. In this Section, we describe the empirical study, including the goal and the research questions, the study context, the data collection, and the data analysis.

TABLE I
THE SELECTED PROJECTS

Project Name  | Analyzed commits | Last commit LOC | Faults | SonarQube Violations
Ambari        | 9727   | 396775    | 3005  | 42348
Bcel          | 1255   | 75155     | 41    | 8420
Beanutils     | 1155   | 72137     | 64    | 5156
Cli           | 861    | 12045     | 59    | 37336
Codec         | 1644   | 34716     | 57    | 2002
Collections   | 2847   | 119208    | 103   | 11120
Configuration | 2822   | 124892    | 153   | 5598
Dbcp          | 1564   | 32649     | 100   | 3600
Dbutils       | 620    | 15114     | 21    | 642
Daemon        | 886    | 3302      | 4     | 393
Digester      | 2132   | 43177     | 23    | 4945
FileUpload    | 898    | 10577     | 30    | 767
Io            | 1978   | 56010     | 110   | 4097
Jelly         | 1914   | 63840     | 45    | 5057
Jexl          | 1499   | 36652     | 58    | 34802
Jxpath        | 596    | 40360     | 43    | 4951
Net           | 2078   | 60049     | 160   | 41340
Ognl          | 608    | 35085     | 15    | 4945
Sshd          | 1175   | 139502    | 222   | 8282
Validator     | 1325   | 33127     | 63    | 2048
Vfs           | 1939   | 59948     | 129   | 3604
Sum           | 39,518 | 1,464,320 | 4,505 | 231,453

A. Goal and Research Questions

As reported in Section I, our goals are to analyze the fault-proneness of SonarQube rule violations (SQ-Violations) and the accuracy of the quality model provided by SonarQube. Based on the aforementioned goals, we derived the following three research questions (RQs).

RQ1 Which are the most fault-prone SQ-Violations?
In this RQ, we aim to understand whether the introduction of a set of SQ-Violations is correlated with the introduction of faults in the same commit and to prioritize the SQ-Violations based on their fault-proneness. Our hypothesis is that a set of SQ-Violations should be responsible for the introduction of bugs.
RQ2 Are SQ-Violations classified as ”bugs” by SonarQube more fault-prone than other rules?
Our hypothesis is that reliability rules (”bugs”) should be more fault-prone than maintainability rules (”code smells”) and security rules.

RQ3 What is the fault prediction accuracy of the SonarQube quality model based on violations classified as ”bugs”?
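To make the data behind these research questions concrete, the sketch below shows one possible way to assemble a per-commit dataset in which each commit is described by the SQ-Violations it introduces and labeled as fault-inducing or not by an SZZ-style analysis. The file names and column names are hypothetical illustrations, not the artifacts of this study.

import pandas as pd

# Hypothetical inputs:
#   violations.csv  - one row per (commit, rule) with the number of
#                     violations of that rule introduced by the commit
#   szz_labels.csv  - one row per commit with a boolean fault_inducing
#                     flag produced by an SZZ-style analysis
violations = pd.read_csv("violations.csv")   # commit_hash, rule_id, introduced
labels = pd.read_csv("szz_labels.csv")       # commit_hash, fault_inducing

# Pivot to a commit-by-rule matrix (one column per SQ-Violation).
features = violations.pivot_table(index="commit_hash", columns="rule_id",
                                  values="introduced", aggfunc="sum",
                                  fill_value=0)

# Attach the label; commits that introduce no violation get all-zero rows.
dataset = (labels.set_index("commit_hash")
                 .join(features, how="left")
                 .fillna(0))

X = dataset.drop(columns=["fault_inducing"])
y = dataset["fault_inducing"].astype(int)
print(X.shape, y.mean())  # feature matrix size and share of fault-inducing commits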
Fig. 3. ROC Curve (Average between 10-fold validation models): XGBoost (AUC = 83.21%), GradientBoost (AUC = 82.49%), RandomForest (AUC = 80.18%), AdaBoost (AUC = 79.14%), Bagging (AUC = 77.85%), ExtraTrees (AUC = 77.45%), LogisticRegression (AUC = 67.04%), DecisionTrees (AUC = 50.14%)
Fig. 4. Comparison of Violations introduced in fault-inducing commits and removed in fault-fixing commits (series: All Violations; Introduced in Fault-inducing commits; Introduced in Fault-inducing commits and Removed in Fault-fixing commits)

23 SQ-Violations have been ranked with an importance higher than zero by the XGBoost. In Table V, we report the SQ-Violations with an importance higher than or equal to 0.01% (column ”Intr. & Rem. (%)” reports the number of violations introduced in the fault-inducing commits AND removed in the fault-fixing commits). The remaining SQ-Violations are reported in the raw data for reasons of space.
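A minimal sketch of how such a ranking can be obtained from a fitted XGBoost model, reusing the hypothetical per-commit matrix X and label vector y assembled in the earlier sketch (an illustration, not the authors' exact code):

import pandas as pd
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=100, eval_metric="logloss")
model.fit(X, y)

# Rank the SQ-Violations by the importance the booster assigns to them and
# keep only those with a non-zero importance.
importance = (pd.Series(model.feature_importances_, index=X.columns)
                .sort_values(ascending=False))
print(importance[importance > 0])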
The combination of the 23 violations guarantees a good classification power, as reported by the AUC of 0.83. However, the drop-column algorithm demonstrates that SQ-Violations have a very low individual importance. The most important SQ-Violation has an importance of 0.62%. This means that the removal of this variable from the model would decrease the accuracy (AUC) by only 0.62%. Three other violations have a similar importance (higher than 0.5%), while the others are slightly lower.
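The drop-column procedure mentioned above can be approximated as sketched below: the model is retrained without one feature at a time and the decrease in cross-validated AUC is recorded. X and y are again the hypothetical per-commit data from the earlier sketches, not the study's released dataset.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

def drop_column_importance(X, y, n_splits=10):
    """AUC drop caused by removing each feature and retraining the model."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    model = XGBClassifier(n_estimators=100, eval_metric="logloss")
    baseline = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    drops = {}
    for column in X.columns:
        reduced = X.drop(columns=[column])
        auc = cross_val_score(model, reduced, y, cv=cv, scoring="roc_auc").mean()
        drops[column] = baseline - auc  # e.g. 0.0062 means a 0.62% AUC loss
    return drops

# importance_drops = drop_column_importance(X, y)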
2) Model Accuracy Validation: The analysis of residuals shows that several SQ-Violations are introduced in fault-inducing commits in more than 50% of cases. 32 SQ-Violations out of 202 had been introduced in the fault-inducing commits and then removed in the fault-fixing commit in more than 95% of the faults. The application of the XGBoost also confirmed an importance higher than zero for 26 of these SQ-Violations. This confirms that developers, even if not using SonarQube, pay attention to these 32 rules, especially in case of refactoring or bug-fixing.
Table V reports the descriptive statistics of residuals, together with the percentage of violations introduced in the fault-inducing commits and removed during fault-fixing commits (column ”Intr. & Rem. (%)”). Column ”Res >95%” shows a checkmark (X) when the percentage of residuals=0 was higher than 95%. Figure 4 compares the number of violations introduced in fault-inducing commits and the number of violations removed in the fault-fixing commits.
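The quantity behind column ”Intr. & Rem. (%)” can be illustrated with a short pandas computation. The input table and its column names are hypothetical stand-ins for the study's mapping between faults, fault-inducing commits, and fault-fixing commits.

import pandas as pd

# Hypothetical table: one row per (fault, rule) with boolean flags saying
# whether the rule was introduced in the fault-inducing commit and removed
# in the corresponding fault-fixing commit.
faults = pd.read_csv("fault_rule_changes.csv")
# columns: fault_id, rule_id, introduced_in_inducing, removed_in_fixing

faults["intr_and_rem"] = (faults["introduced_in_inducing"]
                          & faults["removed_in_fixing"])

per_rule = faults.groupby("rule_id").agg(
    n_faults=("fault_id", "nunique"),
    intr_and_rem=("intr_and_rem", "sum"),
)
per_rule["intr_and_rem_pct"] = 100 * per_rule["intr_and_rem"] / per_rule["n_faults"]

# Rules whose violation was both introduced in the fault-inducing commit and
# removed in the fault-fixing commit in more than 95% of the faults.
print(per_rule[per_rule["intr_and_rem_pct"] > 95])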
B. Manual Validation of the Results

In order to understand the possible causes and to validate the results, we manually analyzed 10 randomly selected instances for each of the first 20 SQ-Violations ranked as most important by the XGBoost algorithm.
The first immediate result is that, in 167 of the 200 manually inspected violations, the bug induced in the fault-inducing commit was not fixed by the same developer that induced it. We also noticed that violations related to duplicated code and empty statements (e.g., ”method should not be empty”) always generated a fault (in the randomly selected cases). When committing an empty method (often containing only a ”TODO” note), developers often forgot to implement it and then used it without realizing that the method did not return the expected value. An extensive application of unit testing could definitely reduce this issue; however, we are aware that this is a very common practice in several projects. Moreover, SQ-Violations such as 1481 (unused private variables should be removed) and 1144 (unused private methods should be removed) unexpectedly resulted to be an issue. In several cases, we discovered that methods that were not used, but were expected to be used by other methods, resulted in a fault. For example, if a method A calls another method B to compose a result message, not calling method B results in the loss of the information provided by B.

C. RQ2: Are SQ-Violations classified as ”bugs” by SonarQube more fault-prone than other rules?

Out of the 57 violations classified as ”bugs” by SonarQube, only three (squid 1143, 1147, 1764) were considered fault-prone, with a very low importance from the XGBoost and with residuals higher than 95%. However, rules classified as ”code smells” were frequently violated in fault-inducing commits. Considering all the SQ-Violations, out of the 40 SQ-Violations that we identified as fault-prone, 37 are classified as ”code smells” and one as a security ”vulnerability”.
When comparing severity with the fault-proneness of the SQ-Violations, only three SQ-Violations (squid 1147, 2068, 2178) were associated with the highest severity level (blocker). However, the fault-proneness of these rules is extremely low (importance <= 0.14%). Looking at the remaining violations, we can see that the severity level is not related to the importance reported by the XGBoost algorithm, since rules of different severity levels are distributed homogeneously across all importance levels.

D. RQ3: Fault prediction accuracy of the SonarQube model

”Bug” violations were introduced in 374 commits out of the 39,518 analyzed commits. Therefore, we analyzed which of these commits were actually fault-inducing commits. Based on SonarQube's statement, all these commits should have generated a fault.
All the accuracy measures (Table III, column ”RQ2”) confirm the very low prediction power of ”bug” violations. The vast majority of ”bug” violations never become a fault. Results are also confirmed by the extremely low AUC (50.95%) and by the contingency matrix (Table IV). The results of the SonarQube model also confirm the results obtained in RQ2. Violations classified as ”bugs” should be classified differently, since they are hardly ever injected in fault-inducing commits.

TABLE IV
SONARQUBE CONTINGENCY MATRIX (PREDICTION MODEL BASED ON SQ-VIOLATIONS CONSIDERED AS ”BUG” BY SONARQUBE)

Predicted         | Actual IND | Actual NOT IND
IND               | 32         | 342
NOT IND           | 1,124      | 38,020

V. THREATS TO VALIDITY

In this Section, we discuss the threats to validity, including internal, external, construct validity, and reliability. We also explain the different adopted tactics [22].
Construct Validity. As for construct validity, the results might be biased regarding the mapping between faults and commits. We relied on the ASF practice of tagging commits with the issue ID. However, in some cases, developers could have tagged a commit differently. Moreover, the results could also be biased due to detection errors of SonarQube. We are aware that static analysis tools suffer from false positives. In this work we aimed at understanding the fault-proneness of the rules adopted by the tools without modifying them, so as to reflect the real impact that developers would have while using the tools. In future works, we are planning to replicate this work manually validating a statistically significant sample of violations, to assess the impact of false positives on the achieved findings. As for the analysis timeframe, we analyzed commits until the end of 2015, considering all the faults raised until the end of March 2018. We expect that the vast majority of the faults should have been fixed. However, it could be possible that some of these faults were still not identified and fixed.
Internal Validity. Threats can be related to the causation between SQ-Violations and fault-fixing activities. As for the identification of the fault-inducing commits, we relied on the SZZ algorithm [18]. We are aware that in some cases the SZZ algorithm might not have identified fault-inducing commits correctly, because of the limitations of the line-based diff provided by git, and also because in some cases bugs can be fixed by modifying code in locations other than the lines that induced them. Moreover, we are aware that the imbalanced data could have influenced the results (approximately 90% of the commits were non-fault-inducing). However, the application of solid machine learning techniques, commonly applied with imbalanced data, could help to reduce this threat.
External Validity. We selected 21 projects from the ASF, which incubates only certain systems that follow specific and strict quality rules. Our case study was not based only on one application domain. This was avoided since we aimed to find general mathematical models for the prediction of the number of bugs in a system. Choosing only one or a very small number of application domains could have been an indication of the non-generality of our study, as only prediction models from the selected application domain would have been chosen. The selected projects stem from a very large set of application domains, ranging from external libraries, frameworks, and web utilities to large computational infrastructures. The dataset only included Java projects. We are aware that different programming languages and different project maturity levels could provide different results.
Reliability Validity. We do not exclude the possibility that other statistical or machine learning approaches, such as Deep Learning or others, might have yielded similar or even better accuracy than our modeling approach.

VI. RELATED WORK

In this Section, we introduce the related works analyzing the literature on SQ-Violations and fault prediction.
TABLE V
SUMMARY OF THE MOST IMPORTANT SONARQUBE VIOLATIONS RELATED TO FAULTS (XGBOOST IMPORTANCE > 0.2%)
Falessi et al. [3] studied the distribution of 16 metrics and 106 SQ-Violations in an industrial project. They applied a What-if approach with the goal of investigating what could happen if a specific SQ-Violation had not been introduced in the code, and whether the number of faulty classes would decrease in that case. They compared four ML techniques, applying the same techniques on a modified version of the code where they had manually removed SQ-Violations. Results showed that 20% of faults would have been avoidable if the code smells had been removed.
Tollin et al. [4] investigated whether introduced SQ-Violations would lead to an increase in the number of changes (code churn) in the next commits. The study was applied to two different industrial projects, written in C# and JavaScript. They reported that classes affected by more SQ-Violations have a higher change-proneness. However, they did not prioritize or classify the most change-prone SQ-Violations.
Digkas et al. [23] studied weekly snapshots of 57 Java projects of the ASF, investigating the amount of technical debt paid back over the course of the projects and what kind of issues were fixed. They considered SQ-Violations with severity marked as Blocker, Critical, and Major. The results showed that only a small subset of all issue types was responsible for the largest percentage of technical debt repayment. Their results thus confirm our initial assumption that there is no need to fix all issues. Rather, by targeting particular violations, the development team can achieve higher benefits. However, their work does not consider how the issues actually relate to faults.
Falessi and Reichel [24] developed an open-source tool, MIND, to analyze the technical debt interest occurring due to violations of quality rules. Interest is measured by means of various metrics related to fault-proneness. They use SonarQube rules and linear regression to estimate the defect-proneness of classes. The aim of MIND is to answer developers' questions such as: is it worth refactoring this piece of code? Differently from our work, the actual type of issue causing the defect was not considered.
Codabux and Williams [25] proposed a predictive model to prioritize technical debt. They extracted class-level metrics for defect- and change-prone classes using SciTools Understand and the Jira Extracting Tool from Apache Hive and determined significant independent variables for defect- and change-prone classes, respectively. Then they used a Bayesian approach to build a prediction model to determine the ”technical debt proneness” of each class. Their model requires the identification of ”technical debt items”, which requires manual input. These items are ultimately ranked and given a risk probability by the predictive framework.
Saarimäki investigated the diffuseness of SQ-Violations in the same dataset we adopted [26] and the accuracy of the SonarQube remediation time [27].
Regarding the detection of other code quality rules, 7 different machine learning approaches (Random Forest, Naive Bayes, Logistic Regression, IBl, IBk, VFI, and J48) [28] were successfully applied to 6 code smells (Lazy Class, Feature Envy, Middle Man Message Chains, Long Method, Long Parameter Lists, and Switch Statement), with 27 software metrics (including Basic, Class Employment, Complexity, Diagrams, Inheritance, and MOOD) as independent variables.
Code smell detection was also investigated from the point of view of how the severity of code smells can be classified through machine learning models [29] such as J48, JRip, Random Forest, Naive Bayes, SMO, and LibSVM, with the best agreement achieved in the detection of 3 code smells (God Class, Large Class, and Long Parameter List).

VII. DISCUSSION AND CONCLUSION

SonarQube classifies 57 rules as ”bugs”, claiming that they will sooner or later generate faults. Four local companies contacted us to investigate the fault prediction power of the SonarQube rules, possibly using machine learning, so as to understand if they can rely on the SonarQube default rule-set or if they can use machine learning to customize the model more accurately.
We conducted this work analyzing a set of 21 well-known open source projects selected by the companies, analyzing the presence of all 202 SonarQube detected violations in the complete project history. The study considered 39,518 commits, including more than 38 billion lines of code, 1.4 million violations, and 4,505 faults mapped to the commits.
To understand which SQ-Violations have the highest fault-proneness, we first applied eight machine learning approaches to identify the SQ-Violations that are common in commits labeled as fault-inducing. As for the application of the different machine learning approaches, we can see an important difference in their accuracy, with a difference of more than 53% between the worst model (Decision Trees, AUC=47.3%±3%) and the best model (XGBoost, AUC=83.32%±10%). This also confirms what we reported in Section II-B: ensemble models, like XGBoost, can generalize the data better than Decision Trees and hence are more scalable. The use of many weak classifiers yields an overall better accuracy, as can be seen by the fact that the boosting algorithms (AdaBoost, GradientBoost, and XGBoost) are the best performers for this classification task, followed closely by the Random Forest classifier and the ExtraTrees.
As a next step, we checked the percentage of commits where a specific violation was introduced in the fault-inducing commit and then removed in the fault-fixing commit, accepting only those violations where the percentage of cases in which the same violations were added in the fault-inducing commit and removed in the fault-fixing commit was higher than 95%.
Our results show that 26 violations can be considered fault-prone according to the XGBoost model. However, the analysis of the residuals showed that 32 SQ-Violations were commonly introduced in a fault-inducing commit and then removed in the fault-fixing commit, but only two of them are considered fault-prone by the machine learning algorithms. It is important to notice that all the SQ-Violations that are removed in more than 95% of cases during fault-fixing commits are also selected by the XGBoost, further confirming their importance.
When we looked at which of the SQ-Violations were considered as fault-prone in the previous step, only four of them are also classified as ”bugs” by SonarQube. The remaining fault-prone SQ-Violations are mainly classified as ”code smells” (SonarQube claims that ”code smells” increase maintenance effort but do not create faults). The analysis of the accuracy of the fault prediction power of the SonarQube model based on ”bugs” showed an extremely low fitness, with an AUC of 50.94%, confirming that violations classified as ”bugs” almost never resulted in a fault.
An important outcome is related to the application of the machine learning techniques. Not all the techniques performed equally, and XGBoost was the most accurate and fastest technique in all the projects. Therefore, the application of XGBoost to historical data is a good alternative to the manual tuning of the model, where developers would select which rules they believe are important based on their experience.
The results confirmed the impression of the developers of our companies. Their developers still consider SonarQube very useful to help develop clean code that adheres to company standards and to help new developers write code that can be easily understood by other developers. Before the execution of this study, the companies were trying to avoid violating the rules classified as bugs, hoping to reduce fault-proneness. However, after the execution of this study, the companies individually customized the set of rules, considering only coding standards aspects and rules classified as ”security vulnerabilities”. The main result for the companies is that they will need to invest in the adoption of other tools to reduce the fault-proneness, and therefore we will need to replicate this work considering other tools such as FindBugs and PMD, but also commercial tools such as Coverity Scan, Cast Software, and others.
Based on the overall results, we can summarize the following lessons learned:
Lesson 1: SonarQube violations are not good predictors of fault-proneness if considered individually, but can be good predictors if considered together. Machine learning techniques, such as XGBoost, can be used to effectively train a customized model for each company.
Lesson 2: SonarQube violations classified as ”bugs” do not seem to be the cause of faults.
Lesson 3: SonarQube violation severity is not related to the fault-proneness and, therefore, developers should carefully consider the severity as a decision factor for refactoring a violation.
Lesson 4: Technical debt should be calculated differently, and the non-fault-prone rules should not be accounted as ”fault-prone” (or ”buggy”) components of the technical debt, while several ”code smells” rules should be carefully considered as potentially fault-prone.
The lessons learned confirm our initial hypothesis about the fault-proneness of the SonarQube violations. However, we are not claiming that SonarQube violations are not harmful in general. We are aware that some violations could be more prone to changes [3], decrease code readability, or increase the maintenance effort.
Our recommendation to companies using SonarQube is to customize the rule-set, taking into account which violations to consider, since the refactoring of several SQ-Violations might not lead to a reduction in the number of faults. Furthermore, since the rules in SonarQube constantly evolve, companies should continuously re-consider the adopted rules.
Research on technical debt should focus more on validating which rules are actually harmful from different points of view and which will account for a higher technical debt if not refactored immediately.
Future works include the replication of this work considering the severity levels of SonarQube rules and their importance. We are working on the definition of a more accurate model for predicting TD [30]. Moreover, we are planning to investigate whether classes that SonarQube identifies as problematic are more fault-prone than those not affected by any problem. Since this work did not confirm the fault-proneness of SonarQube rules, the companies are interested in finding other static analysis tools for this purpose. Therefore, we are planning to replicate this study using other tools such as FindBugs, Checkstyle, PMD, and others. Moreover, we will focus on the definition of recommender systems integrated in the IDEs [31][32], to alert developers about the presence of potentially problematic classes based on their (evolution of) change- and fault-proneness and to rank them based on the potential benefits provided by their removal.
REFERENCES

[1] Carmine Vassallo, Sebastiano Panichella, Fabio Palomba, Sebastian Proksch, Harald C. Gall, and Andy Zaidman. How developers engage with static analysis tools in different contexts. In Empirical Software Engineering, 2019.
[2] Valentina Lenarduzzi, Alberto Sillitti, and Davide Taibi. A survey on code analysis tools for software maintenance prediction. In 6th International Conference in Software Engineering for Defence Applications, pages 165–175. Springer International Publishing, 2020.
[3] D. Falessi, B. Russo, and K. Mullen. What if I had no smells? 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 78–84, Nov 2017.
[4] I. Tollin, F. Arcelli Fontana, M. Zanoni, and R. Roveda. Change prediction through coding rules violations. EASE '17, pages 61–64, New York, NY, USA, 2017. ACM.
[5] Valentina Lenarduzzi, Alberto Sillitti, and Davide Taibi. Analyzing forty years of software maintenance models. In 39th International Conference on Software Engineering Companion, ICSE-C '17, pages 146–148, Piscataway, NJ, USA, 2017. IEEE Press.
[6] M. Fowler and K. Beck. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., 1999.
[7] D. R. Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society, Series B (Methodological), 20(2):215–242, 1958.
[8] Leo Breiman, Jerome Friedman, Charles J. Stone, and R. A. Olshen. Classification and Regression Trees. 1984.
[9] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[10] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[11] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
[12] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[13] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine.
[14] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. pages 785–794, New York, New York, USA, 2016. ACM Press.
[15] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
[16] P. Runeson and M. Höst. Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering, 14(2):131–164, 2009.
[17] Valentina Lenarduzzi, Nyyti Saarimäki, and Davide Taibi. The Technical Debt Dataset. In 15th Conference on PREdictive Models and Data Analytics in Software Engineering, PROMISE '19, 2019.
[18] J. Śliwerski, T. Zimmermann, and A. Zeller. When do changes induce fixes? MSR '05, pages 1–5, New York, NY, USA, 2005. ACM.
[19] Terence Parr, Kerem Turgutlu, Christopher Csiszar, and Jeremy Howard. Beware default random forest importances. http://explained.ai/rf-importance/index.html. Accessed: 2018-07-20.
[20] Hyunjin Yoon, Kiyoung Yang, and Cyrus Shahabi. Feature subset selection and feature ranking for multivariate time series. IEEE Transactions on Knowledge and Data Engineering, 17(9):1186–1198, 2005.
[21] D. M. W. Powers. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies, 2(1):37–63, 2011.
[22] R. K. Yin. Case Study Research: Design and Methods, 4th Edition (Applied Social Research Methods, Vol. 5). SAGE Publications, Inc., 4th edition, 2009.
[23] G. Digkas, M. Lungu, P. Avgeriou, A. Chatzigeorgiou, and A. Ampatzoglou. How do developers fix issues and pay back technical debt in the Apache ecosystem? volume 00, pages 153–163, March 2018.
[24] D. Falessi and A. Reichel. Towards an open-source tool for measuring and visualizing the interest of technical debt. pages 1–8, 2015.
[25] Z. Codabux and B. J. Williams. Technical debt prioritization using predictive analytics. ICSE '16, pages 704–706, New York, NY, USA, 2016. ACM.
[26] Nyyti Saarimäki, Valentina Lenarduzzi, and Davide Taibi. On the diffuseness of code technical debt in open source projects of the Apache ecosystem. International Conference on Technical Debt (TechDebt 2019), 2019.
[27] N. Saarimäki, M. T. Baldassarre, V. Lenarduzzi, and S. Romano. On the accuracy of SonarQube technical debt remediation time. SEAA Euromicro 2019, 2019.
[28] N. Maneerat and P. Muenchaisri. Bad-smell prediction from software design model using machine learning techniques. pages 331–336, May 2011.
[29] Francesca Arcelli Fontana and Marco Zanoni. Code smell severity classification using machine learning techniques. Knowledge-Based Systems, 128(C):43–58, July 2017.
[30] Valentina Lenarduzzi, Antonio Martini, Davide Taibi, and Damian Andrew Tamburri. Towards surgically-precise technical debt estimation: Early results and research roadmap. In Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation, MaLTeSQuE 2019, pages 37–42, 2019.
[31] Andrea Janes, Valentina Lenarduzzi, and Alexandru Cristian Stan. A continuous software quality monitoring approach for small and medium enterprises. In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering Companion, pages 97–100, 2017.
[32] Valentina Lenarduzzi, Christian Stan, Davide Taibi, Davide Tosi, and Gustavs Venters. A dynamical quality model to continuously monitor software maintenance. 2017.