When and Why Your Code Starts To Smell Bad (And Whether The Smells Go Away)
Abstract—Technical debt is a metaphor introduced by Cunningham to indicate "not quite right code which we postpone making it right". One noticeable symptom of technical debt is represented by code smells, defined as symptoms of poor design and implementation choices. Previous studies showed the negative impact of code smells on the comprehensibility and maintainability of code. While the repercussions of smells on code quality have been empirically assessed, there is still only anecdotal evidence on when and why bad smells are introduced, what their survivability is, and how they are removed by developers. To empirically corroborate such anecdotal evidence, we conducted a large empirical study over the change history of 200 open source projects. This study required the development of a strategy to identify smell-introducing commits, the mining of over half a million commits, and the manual analysis and classification of over 10K of them. Our findings mostly contradict common wisdom, showing that most smell instances are introduced when an artifact is created, and not as a result of its evolution. At the same time, 80 percent of smells survive in the system. Also, among the 20 percent of removed instances, only 9 percent are removed as a direct consequence of refactoring operations.
1 INTRODUCTION

The technical debt metaphor introduced by Cunningham [22] explains well the trade-offs between delivering the most appropriate but still immature product in the shortest time possible [14], [22], [42], [47], [70]. Bad code smells (shortly "code smells" or "smells"), i.e., symptoms of poor design and implementation choices [27], represent one important factor contributing to technical debt, and possibly affecting the maintainability of a software system [42]. In the past and, most notably, in recent years, several studies investigated the relevance that code smells have for developers [60], [90], the extent to which code smells tend to remain in a software system for long periods of time [4], [17], [48], [64], as well as the side effects of code smells, such as an increase in change- and fault-proneness [37], [38] or a decrease of software understandability [1] and maintainability [72], [88], [89]. While the repercussions of code smells on software quality have been empirically proven, there is still a noticeable lack of empirical evidence related to how, when, and why they occur in software projects, as well as whether, after how long, and how they are removed [14]. This represents an obstacle for an effective and efficient management of technical debt. Also, understanding the typical life-cycle of code smells and the actions undertaken by developers to remove them is of paramount importance in the conception of recommender tools for developers' support. In other words, only a proper understanding of the phenomenon would allow the creation of recommenders able to highlight the presence of code smells and suggest refactorings only when appropriate, hence avoiding information overload for developers [53].
Common wisdom suggests that urgent maintenance activities and pressure to deliver features while prioritizing time-to-market over code quality are often the causes of such smells. Generally speaking, software evolution has always been considered as one of the reasons behind "software aging" [61] or "increasing complexity" [44], [55], [87]. Also, one of the common beliefs is that developers remove code smells from the system by performing refactoring operations. However, to the best of our knowledge, there is no comprehensive empirical investigation into when and why code smells are introduced in software projects, how long they survive, and how they are removed.

In this paper we fill the void in terms of our understanding of code smells, reporting the results of a large-scale empirical study conducted on the change history of 200 open source projects belonging to three software ecosystems, namely Android, Apache and Eclipse. The study aims at investigating (i) when smells are introduced in software projects, (ii) why they are introduced (i.e.,
under what circumstances smell introductions occur and who are the developers responsible for introducing smells), (iii) how long they survive in the system, and (iv) how they are removed. To address these research questions, we developed a metric-based methodology for analyzing the evolution of code entities in the change histories of software projects to determine when code smells start manifesting themselves and whether this happens suddenly (i.e., because of a pressure to quickly introduce a change) or gradually (i.e., because of medium-to-long range design decisions). We mined over half a million commits and we manually analyzed over 10K of them to understand how code smells are introduced and removed from software systems. We are unaware of any published study on technical debt, in general, and code smells, in particular, of comparable size. The obtained results allowed us to report quantitative and qualitative evidence on when and why smells are introduced and removed from software projects, as well as the implications of these results, often contradicting common wisdom. In particular, our main findings show that (i) most of the code smells are introduced when the (smelly) code artifact is created in the first place, and not as the result of maintenance and evolution activities performed on such an artifact, (ii) 80 percent of code smells, once introduced, are not removed by developers, and (iii) the 20 percent of removed code smells are very rarely (in 9 percent of cases) removed as a direct consequence of refactoring activities.

The paper makes the following notable contributions:

1) A methodology for identifying smell-introducing changes, namely a technique able to analyze change history information in order to detect the commit which introduced a code smell;
2) A large-scale empirical study involving three popular software ecosystems aimed at reporting quantitative and qualitative evidence on when and why smells are introduced in software projects, what their survivability is, and how code smells are removed from the source code, as well as the implications of these results, often contradicting common wisdom;
3) A publicly available comprehensive dataset [80] that enables others to conduct further similar or different empirical studies on code smells (as well as to completely reproduce our results).

Implications of the Study. From a purely empirical point of view, the study aims at confirming and/or contradicting the common wisdom about software evolution and the manifestation of code smells. From a more practical point of view, the results of this study can help distinguish among different situations that can arise in software projects, and in particular cases where:

- Smells are introduced when a (sub)system has been conceived. Certainly, in such cases smell detectors can help identify potential problems, although this situation can trigger even more serious alarms related to potentially poor design choices made in the system since its inception (i.e., technical debt that smell detectors will not be able to identify from a system's snapshot only), which may require careful redesign in order to avoid worse problems in the future.
- Smells occur suddenly in correspondence to a given change, pointing out cases for which recommender systems may warn developers of emergency maintenance activities being performed and the need to consider refactoring activities whenever possible.
- The symptom simply highlights—as also pointed out in previous studies [60], [90]—the intrinsic complexity, size (or any other smell-related characteristic) of a code entity, and there is little or nothing one can do about that. Often, some situations that seem to fall into the two cases above should be considered in this category instead.
- Smells manifest themselves gradually. In such cases, smell detectors can identify smells only when they actually manifest themselves (e.g., some metrics go above a given threshold) and suggest refactoring actions. Instead, in such circumstances, tools monitoring system evolution and identifying metric trends, combined with history-based smell detectors [58], should be used.

In addition, our findings, which are related to the very limited refactoring actions undertaken by developers to remove code smells, call for further studies aimed at understanding the reasons behind this result. Indeed, it is crucial for the research community to study and understand whether:

- developers perceive (or don't) the code smells as harmful, and thus simply do not care about removing them from the system; and/or
- developers consider the cost of refactoring code smells too high when considering possible side effects (e.g., bug introduction [9]) and expected benefits; and/or
- the available tools for the identification/refactoring of code smells are not sufficient/effective/usable from the developers' perspective.

Paper Structure. Section 2 describes the study design, while Sections 3 and 4 report the study results and discuss the threats to validity, respectively. Following the related work (Section 5), Section 6 concludes the paper, outlining lessons learned and promising directions for future work.

2 STUDY DESIGN

The goal of the study is to analyze the change history of software projects with the purpose of investigating when code smells are introduced and fixed by developers and the circumstances and reasons behind smell appearances.

TABLE 1
Characteristics of Ecosystems Under Analysis

Ecosystem  #Proj.  #Classes     KLOC      #Commits  #Issues  Mean Story Length  Min-Max Story Length
Apache     100     4-5,052      1-1,031   207,997   3,486    6                  1-15
Android    70      5-4,980      3-1,140   107,555   1,193    3                  1-6
Eclipse    30      142-16,700   26-2,610  264,119   124      10                 1-13
Overall    200     -            -         579,671   4,803    6                  1-15

More specifically, the study aims at addressing the following four research questions (RQs):

RQ1: When are Code Smells Introduced? This research question aims at investigating to what extent the common wisdom suggesting that "code smells are introduced as a consequence of continuous maintenance and evolution activities performed on a code artifact" [27] applies. Specifically, we study "when" code smells are introduced in software systems, to understand whether smells are introduced as soon as a code entity is created, whether smells are suddenly introduced in the context of specific maintenance activities, or whether, instead, smells appear "gradually" during software evolution. To this aim, we investigated the presence of possible trends in the history of code artifacts that characterize the introduction of specific types of smells.
RQ2: Why are Code Smells Introduced? The second research
question aims at empirically investigating under which
circumstances developers are more prone to introduce code
smells. We focus on factors that are indicated as possible
causes for code smell introduction in the existing literature
[27]: the commit goal (e.g., is the developer implementing a
new feature or fixing a bug?), the project status (e.g., is the
change performed in proximity to a major release deadline?),
and the developer status (e.g., a newcomer or a senior project
member?).
RQ3: What is the Survivability of Code Smells? In this
research question we aim to investigate how long a smell
remains in the code. In other words, we want to study the
survivability of code smells, that is the probability that a
code smell instance survives over time. To this aim, we
employ a statistical method called survival analysis [66].
In this research question, we also investigate differences
of survivability among different types of code smells.
RQ4: How do Developers Remove Code Smells? The fourth
and last research question aims at empirically
investigating whether and how developers remove code
smells. In particular, we want to understand whether code
smells are removed using the expected and suggested
refactoring operations for each specific type of code smell
(as suggested by Fowler [27]), whether they are removed
using “unexpected refactorings”, or whether such a
removal is a side effect of other changes. To achieve this
goal, we manually analyzed 979 commits removing code
smells by following an open coding process inspired by
grounded theory [21].
identifying smell-introducing commits. Our tool mines the entire change history of ri, checks out each commit in chronological order, and runs an implementation of the DECOR smell detector based on the original rules defined by Moha et al. [50]. DECOR identifies smells using detection rules based on the values of internal quality metrics.4 The choice of DECOR is driven by the fact that (i) it is a state-of-the-art smell detector having a high accuracy in detecting smells [50]; and (ii) it applies simple detection rules that allow it to be very efficient. Note that we ran DECOR on all source code files contained in ri only for the first commit of ri. For the subsequent commits, DECOR has been executed only on code files added or modified in each specific commit, to save computational time. As output, our tool produces, for each source code file fj ∈ ri, the list of commits in which fj has been involved, specifying whether fj has been added, deleted, or modified and whether fj was affected, in that specific commit, by one of the five considered smells.

4. An example of a detection rule exploited to identify Blob classes can be found at http://tinyurl.com/paf9gp6

Starting from the data generated by the HistoryMiner, we compute, for each type of smell (smellk) and for each source code file (fj), the number of commits performed on fj since the first commit involving fj and adding the file to the repository, up to the commit in which DECOR detects that fj is affected by smellk. Clearly, such numbers are only computed for files identified as affected by the specific smellk.

When analyzing the number of commits needed for a smell to affect a code component, we can have two possible scenarios. In the first scenario, smell instances are introduced during the creation of source code artifacts, i.e., in the first commit involving a source code file. In the second scenario, smell instances are introduced after several commits and, thus, as a result of multiple maintenance activities. For the latter scenario, besides running the DECOR smell detector for the project snapshot related to each commit, we compute, for each source code artifact, a set of quality metrics (see Table 2). As done for DECOR, quality metrics are computed for all code artifacts only during the first commit, and updated at each subsequent commit for added and modified files. The purpose of this analysis is to understand whether the trend followed by such metrics differs between files affected by a specific type of smell and files not affected by such a smell. For example, we expect that classes becoming Blobs will exhibit a higher growth rate than classes that are not going to become Blobs.

In order to analyze the evolution of the quality metrics, we need to identify the function that best approximates the data distribution, i.e., the values of the considered metrics computed in a sequence of commits. We found that the best model is the linear function (more details are available in our technical report [80]). Note that we only consider linear regression models using a single metric at a time (i.e., we did not consider more than one metric in the same regression model), since our interest is to observe how a single metric in isolation describes the smell-introducing process. We consider the building of more complex regression models based on more than one metric as part of our future work.

Having identified the model to be used, we compute, for each file fj ∈ ri, the regression line of its quality metric values. If file fj is affected by a specific smellk, we compute the regression line considering the quality metric values computed for each commit involving fj, from the first commit (i.e., where the file was added to the versioning system) to the commit where the instance of smellk was detected in fj. Instead, if fj is not affected by any smell, we consider only the first n commits involving the file fj, where n is the average number of commits required by smellk to affect code instances.
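For illustration, the per-file slope computation described above can be sketched as follows. This is a minimal example with hypothetical metric values; it assumes numpy and indexes commits 0..n−1 on the x-axis, which is one reasonable choice rather than necessarily the paper's exact setup:

```python
import numpy as np

def metric_slope(metric_values):
    # Fit an ordinary least-squares line to a file's metric history;
    # the slope summarizes how fast the metric grows across commits.
    x = np.arange(len(metric_values))
    slope, intercept = np.polyfit(x, metric_values, deg=1)
    return slope

# Hypothetical LOC history of a file over its first five commits:
print(metric_slope([120, 135, 190, 310, 520]))  # steep positive slope
```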
Then, for each metric reported in Table 2, we compare the distributions of regression line slopes for smell-free and smelly files. The comparison is performed using a two-tailed Mann-Whitney U test [20]. The results are intended as statistically significant at α = 0.05. We also estimate the magnitude of the observed differences using Cliff's delta (d), a non-parametric effect size measure [31] for ordinal data. We follow the guidelines provided by Grissom and Kim [31] to interpret the effect size values: small for d < 0.33 (positive as well as negative values), medium for 0.33 ≤ d < 0.474, and large for d ≥ 0.474.
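This statistical comparison can be reproduced with standard tools; the following sketch (hypothetical slope values; scipy assumed) applies the two-tailed Mann-Whitney U test together with a direct implementation of Cliff's delta and the interpretation thresholds given above:

```python
from scipy.stats import mannwhitneyu

def cliffs_delta(xs, ys):
    # Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs.
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

smelly_slopes = [91.8, 45.2, 60.3, 88.1, 72.9]  # hypothetical data
clean_slopes = [0.4, 1.1, 0.2, 2.5, 0.9]

stat, p = mannwhitneyu(smelly_slopes, clean_slopes, alternative="two-sided")
d = cliffs_delta(smelly_slopes, clean_slopes)
magnitude = "small" if abs(d) < 0.33 else "medium" if abs(d) < 0.474 else "large"
print(p < 0.05, d, magnitude)
```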
Overall, the data extraction for RQ1 (i.e., the smell detection and metrics computation at each commit for the 200 systems) took eight weeks on a Linux server with seven quad-core 2.67 GHz CPUs (28 cores) and 24 GB of RAM.

2.2.2 Why Are Code Smells Introduced?

One challenge arising when answering RQ2 is represented by the identification of the specific commit (or also, possibly, a set of commits) where the smell has been introduced (from now on referred to as a smell-introducing commit). Such information is crucial to explain under which circumstances these commits were performed. A trivial solution would have been to use the results of our RQ1 and consider the commit cs, in which DECOR detects for the first time a smell instance smellk in a source code file fj, as the commit introducing the smell in fj. However, while this solution would work for smell instances that are introduced in the first commit involving fj (there is no doubt on the commit that introduced the smell), it would not work for smell instances that are the consequence of several changes, performed in n different commits involving fj. In such a situation, on one hand, we cannot simply assume that the first commit in which DECOR identifies the smell is the one introducing that smell, because the smell appearance might be the result of several small changes performed across the n commits. On the other hand, we cannot assume that all n commits performed on fj are those (gradually) introducing the smell, since just some of them might have pushed fj toward a smelly direction. Thus, to identify the smell-introducing commits for a file fj affected by an instance of a smell (smellk), we use the following heuristic:

Fig. 1. Example of identifying smell-introducing commits.

Fig. 1 reports an example aimed at illustrating the identification of smell-introducing commits for a file fj. Suppose that fj has been involved in eight commits (from c1 to c8), and that in c8 a Blob instance has been identified by DECOR in fj. Also, suppose that the results of our RQ1 showed that the LOC metric is the only one "characterizing" the Blob introduction, i.e., the slope of the LOC regression line for Blobs is significantly different from the one of the regression line built for classes which are not affected by the Blob smell. The black line in Fig. 1 represents the LOC regression line computed among all the involved commits, having a slope of 1.3. The gray lines represent the regression lines between pairs of commits (ci−1, ci), where ci is not classified as a smell-introducing commit (their slope is lower than 1.3). Finally, the red-dashed lines represent the regression lines between pairs of commits (ci−1, ci), where ci is classified as a smell-introducing commit (their slope is higher than 1.3). Thus, the smell-introducing commits in the example depicted in Fig. 1 are c3, c5, and c7. Overall, we obtained 9,164 smell-introducing commits in the 200 systems, which we used to answer RQ2.
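As a rough sketch consistent with the Fig. 1 example (hypothetical data; numpy assumed; the paper's actual heuristic is defined over the metrics selected in RQ1, so this is an illustration rather than the exact procedure), commits whose commit-pair slope exceeds the slope of the regression line built over the file's whole history are flagged:

```python
import numpy as np

def smell_introducing_commits(metric_values):
    # Flag commits whose pairwise slope (between c_{i-1} and c_i, with a
    # unit step on the x-axis) exceeds the slope of the regression line
    # computed over the file's whole history, as in the Fig. 1 example.
    x = np.arange(len(metric_values))
    overall_slope, _ = np.polyfit(x, metric_values, deg=1)
    flagged = []
    for i in range(1, len(metric_values)):
        if metric_values[i] - metric_values[i - 1] > overall_slope:
            flagged.append(i + 1)  # 1-based commit index
    return flagged

# Eight commits; a Blob is detected at c8; LOC grows in bursts:
print(smell_introducing_commits([10, 10, 14, 14, 18, 18, 22, 22]))  # [3, 5, 7]
```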
It is important to highlight that, while the survival analysis is designed to deal with censored data, we perform a cleaning of our dataset aimed at reducing possible biases caused by censored intervals before running the analysis. In particular, code smell instances introduced too close to the end of the observed change history can potentially influence our results, since in these cases the period of time needed for their removal is too short to be analyzed. Thus, we excluded from our survival analysis all censored intervals for which the last-smell-introducing commit was "too close" to the last commit we analyzed in the system's change history (i.e., for which the developers did not have "enough time" to fix them). To determine a threshold suitable to remove only the subset of smell instances actually too close to the end of the analyzed change history, we study the distribution of the number of days needed to fix the code smell instances (i.e., the length of the closed smelly intervals) in our dataset and, then, we choose an appropriate threshold (see Section 3.3).

5. https://cran.r-project.org/package=survival
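As an illustration of how censored data and this cleaning step interact, here is a minimal sketch using the Python lifelines library (an assumption made for illustration; the paper relies on the R survival package, see the footnote above; durations are hypothetical):

```python
from lifelines import KaplanMeierFitter

# One entry per smelly interval: (duration in days, removed?).
# removed=False marks a censored interval, i.e., the smell was still
# alive at the end of the observed change history.
intervals = [(30, True), (400, True), (1200, False), (80, True), (2000, False)]

# Cleaning step: drop censored intervals shorter than a threshold
# derived from the closed intervals (here, their median).
closed = sorted(d for d, removed in intervals if removed)
threshold = closed[len(closed) // 2]
kept = [(d, r) for d, r in intervals if r or d >= threshold]

kmf = KaplanMeierFitter()
kmf.fit(durations=[d for d, _ in kept], event_observed=[r for _, r in kept])
print(kmf.survival_function_)  # estimated P(smell still alive at day t)
```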
2.2.4 How Do Developers Remove Code Smells?

In order to understand how code smells disappear from the system, we manually analyzed a randomly selected set of 979 smell-removing commits. Such a set represents a 95 percent statistically significant stratified sample with a 5 percent confidence interval of the 1,426 smell-removing commits in our dataset.

3 ANALYSIS OF THE RESULTS

This section reports our analysis of the results achieved in our study and aims at answering the four research questions formulated in Section 2.

3.1 When Are Code Smells Introduced?

Fig. 2 shows the distribution of the number of commits required by each type of smell to manifest itself. The results are grouped by ecosystem; we also report the Overall results (all ecosystems together).

As we can observe in Fig. 2, in almost all the cases the median number of commits needed by a smell to affect code components is zero, except for Blob on Android (median = 3) and Complex Class on Eclipse (median = 1). In other words, most of the smell instances (at least half of them) are introduced when a code entity
is added to the versioning system. This is quite a surprising finding, considering the common wisdom that smells are generally the result of continuous maintenance activities performed on a code component [27].

However, the box plots also indicate (i) the presence of several outliers, and (ii) that for some smells, in particular Blob and Complex Class, the distribution is quite skewed. This means that, besides smell instances introduced in the first commit, there are also several smell instances that are introduced as a result of several changes performed on the file during its evolution. In order to better understand such a phenomenon, we analyzed how the values of some quality metrics change during the evolution of such files.

TABLE 4

Table 4 presents the descriptive statistics (mean and median) of the slope of the regression line computed, for each metric, for both smelly and clean files. Also, Table 4 reports the results of the Mann-Whitney test and Cliff's d effect size (Large, Medium, or Small) obtained when analyzing the difference between the slopes of regression lines for clean and smelly files. Column cmp of Table 4 shows a ↑ (↓) if for the metric m there is a statistically significant difference in m's slope between the two groups of files (i.e., clean and smelly), with the smelly ones exhibiting a higher (lower) slope; a "-" is shown when the difference is not statistically significant.

The analysis of the results reveals that for all the smells but Functional Decomposition, the files affected by smells show a higher slope than clean files. This suggests that the files that will be affected by a smell exhibit a steeper growth in terms of metric values than files that are not becoming smelly. In other words, when a smell is going to appear, its operational indicators (metric value increases) occur very fast (not gradually). For example, considering the Apache ecosystem, we can see a clear difference between the growth of LOC in Blob and clean classes. Indeed, the latter have a mean growth in terms of LOC characterized by a slope of 0.40, while the slope for Blobs is, on average, 91.82. To make the interpretation of such data clear, let us suppose we plot both regression lines on the Cartesian plane. The regression line for Blobs will have an inclination of 89.38 degrees, indicating an abrupt growth of LOC, while the inclination of the regression line for clean classes will be 21.8 degrees, indicating a less steep increase of LOC. The same happens when considering the LCOM cohesion metric (the higher the LCOM, the lower the class cohesion). For the overall dataset, the slope for classes that will become Blobs is 849.90, as compared to 0.25 for clean classes. Thus, while the cohesion of classes generally decreases over time, classes destined to become Blobs exhibit cohesion loss orders of magnitude faster than clean classes. In general, the results in Table 4 show strong differences in the metrics' slope between clean and smelly files, indicating that it could be possible to create recommenders warning developers when the changes performed on a specific code component show a dangerous trend potentially leading to the introduction of a bad smell.
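The inclination figures above are simply the arctangent of the slopes; a quick check using Python's standard math module reproduces the numbers quoted in the text:

```python
import math

def inclination_degrees(slope):
    # Angle between a regression line with the given slope and the x-axis.
    return math.degrees(math.atan(slope))

print(round(inclination_degrees(91.82), 2))  # 89.38 (classes becoming Blobs)
print(round(inclination_degrees(0.40), 2))   # 21.8  (clean classes)
```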
The Functional Decomposition (FD) smell deserves a separate discussion.

TABLE 5
RQ2: Commit-Goal Tags to Smell-Introducing Commits

For all types of smells, the percentage of smell-introducing commits tagged as enhancement ranges between 60 and 66 percent. Note that by enhancement we mean changes applied by developers to existing features aimed at improving them. For example, a Functional Decomposition was introduced in the class CreateProjectFromArchetypeMojo of Apache Maven when the

complex and critical tasks [91]. Thus, it is likely that their commits are more prone to introducing design problems.
Summary for RQ2. Smells are generally introduced by developers when enhancing existing features or implementing new ones. As expected, smells are generally introduced in the last month before issuing a deadline, while there is a considerable number of instances introduced in the first year from the project startup. Finally, developers who introduce smells are generally the owners of the file, and they are more prone to introducing smells when they have higher workloads.

3.3 What Is the Survivability of Code Smells?

We start by analyzing the data for smells that have been removed from the system, i.e., those for which there is a closed interval delimited by a last-smell-introducing commit and a smell-removing commit. Fig. 3 shows the box plot of the distribution of the number of days needed to fix a code smell instance for the different ecosystems. The box plots, depicted in log-scale, show that while a few code smell instances are fixed after a long period of time (i.e., even over 500 days), most of the instances are fixed in a relatively short time.

TABLE 8
Descriptive Statistics of the Number of Days a Smell Remained in the System Before Being Removed

Table 8 shows the descriptive statistics of the distribution of the number of days when aggregating all code smell types considered in our study. We can notice considerable differences in the statistics for the three analyzed ecosystems. In particular, the median values of such distributions are 40, 101 and 135 days for Android, Apache and Eclipse projects, respectively. While it is difficult to speculate on the reasons why code smells are fixed quicker in the Android ecosystem than in the Apache and Eclipse ones, it is worth noting that, on one hand, Android apps generally have a much smaller size with respect to systems in the Apache and Eclipse ecosystems (i.e., the average size, in terms of KLOC, is 415 for Android, while it is 1,417 for Apache and

We also selected a different threshold for each ecosystem when excluding code smell instances introduced too close to the end of the observed change history, needed to avoid cases in which the period of time needed for removing the smell is too short to be analyzed (see Section 2.2.3). Analyzing the distribution, we decided to choose the median as threshold, since it is a central value not affected by outliers, as opposed to the mean. Also, the median values of the distributions are small enough to consider the discarded smells in the censored interval close to the end of the observed change history (if compared, for example, to the mean time to remove a smell). Therefore, we used as threshold values 40, 101 and 135 days, respectively, for Android, Apache and Eclipse projects. Note that the censored intervals that we did not exclude were opportunely managed by the survival model.

Fig. 4 shows the number of modifications (i.e., commits modifying the smelly file) performed by the developer between the introduction and the removal of the code smell instance. These results clearly show that most of the code smell instances are removed after a few commits, generally no more than five commits for Android and Apache, and ten for Eclipse. By combining what has been observed in terms of the number of days and the number of commits a smell remains in the system before being removed, we can conclude that if code smells are removed, this usually happens after few commits from their introduction, and in a relatively short time.
7. As also done for the survival model, for the sake of consistency, the data reported in Table 9 exclude code smell instances introduced too close to the end of the analyzed change history.
As we can see, the vast majority of code smells (81.4 percent, on average) are not removed, and this result is consistent across the three ecosystems (83 percent in Android, 87 percent in Apache, and 74 percent in Eclipse). The most refactored smell is the Blob with, on average, 27 percent of refactored instances. This might be due to the fact that such a smell is more visible than others due to the large size of the classes affected by it.
TABLE 9
Percentage of Code Smells Removed and Not in the Observed Change History

Further insights about the survivability of the smells across the three ecosystems are provided by the survival models (i.e., Figs. 5 and 6). The survival of Complex Class (blue line) and Spaghetti Code (brown line) is much higher in systems belonging to the Apache ecosystem with respect to systems belonging to the Android and Eclipse ecosystems. Indeed, these two smell types are the ones exhibiting the highest survivability in Apache and the lowest survivability in Android and Eclipse. Similarly, we can notice that the survival curves for CDSBP (green) and FD (yellow) exhibit quite different shapes between Eclipse (higher survivability) and the other two ecosystems (lower survivability). Despite these differences, the outcome that can be drawn from the observation of the survival models is one and valid across all the ecosystems and for all smell types: the survivability of code smells is very high, with over 50 percent of smell instances still "alive" after 1,000 days and 1,000 commits from their introduction.
Finally, we analyzed differences in the survivability of code smell instances affecting "born-smelly" artifacts (i.e., code files containing the smell instance since their creation) and "not-born-smelly" artifacts (i.e., code files in which the code smell has been introduced as a consequence of maintenance and evolution activities). Here there could be two possible scenarios: on the one hand, developers might be less prone to refactor and fix born-smelly artifacts than not-born-smelly artifacts, since the code smell is somehow part of the original design of the code component. On the other hand, it could also be the case that the initial design is smelly because it is simpler to realize and release, while code smell removal is planned as a future activity. Neither of these conjectures has been confirmed by the performed data analysis. As an example, we report the results achieved for the CDSBP smell.

Fig. 7 shows the survivability of born-smelly and not-born-smelly artifacts for the CDSBP instances. In this case, in two of the three analyzed ecosystems the survivability of born-smelly artifacts is actually higher, thus confirming in part the first scenario drawn above. However, when looking at the results for Complex Class instances (Fig. 8), such a trend is not present in Android and Apache, and it is exactly the opposite in Eclipse (i.e., not-born-smelly artifacts survive longer than the born-smelly ones). Such trends have also been observed for the other analyzed smells and, in some cases, contradictory trends were observed for the same smell in the three ecosystems (see [80]). Thus, it is not really possible to draw any conclusions on this point.
Fig. 7. Survival probability of CDSBP instances affecting born-smelly and not-born-smelly artifacts.
Fig. 8. Survival probability of Complex Class instances affecting born-smelly and not-born-smelly artifacts.
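The paper compares these curves visually. For readers who want a quantitative check, a log-rank test (not part of the paper's analysis) is one standard way to compare two survival curves; a sketch with hypothetical data, again assuming the lifelines library:

```python
from lifelines.statistics import logrank_test

# Hypothetical smelly-interval durations (days) and removal flags for
# born-smelly and not-born-smelly artifacts:
born_days, born_removed = [900, 1500, 2100, 600], [False, True, False, True]
other_days, other_removed = [200, 350, 800, 120], [True, True, False, True]

result = logrank_test(born_days, other_days,
                      event_observed_A=born_removed,
                      event_observed_B=other_removed)
print(result.p_value)  # small p-value -> the two curves differ
```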
Summary for RQ3. Most of the studied code smell instances (80 percent) are not removed during the observed system's evolution. When this happens, the removal is generally performed within few commits from the introduction (∼10) and in a limited time period (∼100 days). Overall, we can observe a very high survivability of code smells, with over 50 percent of smell instances still "alive" after 1,000 days and 1,000 commits from their introduction.

TABLE 10
How Developers Remove Code Smells

Category             # Commits   %    % Excluding Unclear
Code Removal         329         34   40
Code Replacement     267         27   33
Unclear              158         16   -
Code Insertion       121         12   15
Refactoring          71          7    9
Major Restructuring  33          3    4
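The two percentage columns of Table 10 can be recomputed directly from the raw counts; the second column excludes the 158 Unclear commits from the denominator (plain Python, as a quick check):

```python
counts = {"Code Removal": 329, "Code Replacement": 267, "Unclear": 158,
          "Code Insertion": 121, "Refactoring": 71, "Major Restructuring": 33}
total = sum(counts.values())             # 979 manually analyzed commits
clear = total - counts["Unclear"]        # 821 commits with a known cause
for category, n in counts.items():
    pct = round(100 * n / total)
    pct_excl = "-" if category == "Unclear" else round(100 * n / clear)
    print(f"{category}: {pct}% ({pct_excl}% excluding Unclear)")
```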
3.4 How Do Developers Remove Code Smells?

Table 10 shows the results of the open coding procedure, aimed at identifying how developers fix code smells (or, more generally, how code smells are removed from the system). We defined the following categories:

- Code Removal. The code affected by the smell is deleted or commented out. As a consequence, the code smell instance is no longer present in the system. Also, it is not replaced by other code in the smell-removing commit.
- Code Replacement. The code affected by the smell is substantially rewritten. As a consequence, the code smell instance is no longer present in the system. Note that the code rewriting does not include any specific refactoring operation.
- Code Insertion. A code smell instance disappears after new code is added to the smelly artifact. While at first glance it might seem unlikely that the insertion of new code can remove a code smell, the addition of a new method to a class could, for example, increase its cohesion, thus removing a Blob class instance.
- Refactoring. The code smell is explicitly removed by applying one or multiple refactoring operations.
- Major Restructuring. A code smell instance is removed after a significant restructuring of the system's architecture that totally changes several code artifacts, making it difficult to track the actual operation that removed the smell. Note that this category might implicitly include the ones listed above (e.g., during the major restructuring some code has been replaced, some new code has been written, and some refactoring operations have been performed). However, it differs from the others since in this case we are not able to identify the exact code change leading to the smell removal. We only know that it is a consequence of a major system restructuring.
- Unclear. The GitHub URL used to see the commit diff (i.e., to inspect the changes implemented by the smell-removing commit) was no longer available at the time of the manual inspection.

For each of the defined categories, Table 10 shows (i) the absolute number of smell-removing commits classified in that category; (ii) their percentage over the total of 979 instances; and (iii) their percentage computed excluding the Unclear instances.

The first surprising result to highlight is that only 9 percent (71) of smell instances are removed as a result of a refactoring operation. Of these, 27 are Encapsulate Field refactorings performed to remove a CDSBP instance. Also, five additional CDSBP instances are removed by performing an Extract Class refactoring. Thus, in these five cases the smell is not even actually fixed, but just moved from one class to another. Four Extract Class refactorings have instead been performed to remove four Blob instances. The Substitute Algorithm refactoring has been applied to remove Complex Classes (ten times) and Spaghetti Code (four times). Other types of refactorings we observed (e.g., Move Method, Move Field) were only represented by one or two instances. Note that this result (i.e., few code smells are removed via refactoring operations) is in line with what was observed by Bazrafshan and Koschke [11] when studying how code clones had been removed by developers: they found that most of the clones were removed accidentally, as a side effect of other changes, rather than as the result of targeted code transformations.

One interesting example of a code smell removed using an appropriate refactoring operation relates to the class org.openejb.alt.config.ConfigurationFactory of the Apache Tomee project. The main responsibility of this class is to manage the data and configuration information for assembling an application server. Until the commit 0877b14, the class also contained a set of methods to create new jars and descriptors for such jars (through the EjbJar and EjbJarInfo classes). In the commit mentioned above, the class, affected by the Blob code smell, has been refactored using Extract Class refactoring. In particular, the developer extracted two new classes from the original class, namely OpenejbJar and EjbJarInfoBuilder, containing the extra functionalities previously contained in ConfigurationFactory.

The majority of code smell instances (40 percent) are simply removed due to the deletion of the affected code components. In particular, Blob, Complex Class, and Spaghetti Code instances are mostly fixed by removing/commenting large code fragments (e.g., no longer needed in the system). In the case of Class Data Should Be Private, the code smell frequently disappears after the deletion of public fields. As an example of a code smell removed via the deletion of code fragments, the class org.apache.subversion.javahl.ISVNClient of the Apache Subversion project was a Complex Class until the snapshot 673b5ee. Then, the developers completely deleted several methods, as explained in the commit message: "JavaHL: Remove a completely superfluous API". This resulted in the consequent removal of the Complex Class smell.

In 33 percent of the cases, smell instances are fixed by rewriting the source code in the smelly artifact. This frequently occurs in Complex Class and Spaghetti Code instances, in which the rewriting of method bodies can substantially simplify the code and/or make it more in line with object-oriented principles. Code Insertion represents 15 percent of the fixes. This happens particularly in Functional Decomposition instances, where the smelly artifacts acquire more responsibilities and are better shaped in an object-oriented flavor. Interestingly, three Blob instances were also removed by writing new code that increased their cohesion. An example of a Functional Decomposition removed by adding code is represented by the ExecutorFragment class, belonging to the org.eclipse.ocl.library.executor package of the Eclipse OCL project. The original goal of this class was to provide the description of the properties for the execution of the plug-in that allows users to parse and evaluate Object Constraint Language (OCL) constraints. In the commit b9c93f8, the developers added to the class methods to access and modify such properties, as well as the init method, which provides APIs allowing external users to define their own properties.

Finally, in 4 percent of the cases the smell instance was removed as a consequence of a major restructuring of the whole system.

Summary for RQ4. The main, surprising result of this research question is the very low percentage (9 percent) of smell instances that are removed as a direct consequence of refactoring operations. Most of the code smell instances (40 percent) are removed as a simple consequence of the deletion of the smelly artifact. Interestingly, the addition of new code can also contribute to removing code smells (15 percent of cases).

4 THREATS TO VALIDITY

The main threats related to the relationship between theory and observation (construct validity) are due to imprecisions/errors in the measurements we performed. Above all, we relied on DECOR rules to detect smells. Notice that our reimplementation uses the exact rules defined by Moha et al. [50], and has already been used in our previous work [58]. Nevertheless, we are aware that our results can be affected by (i) the thresholds used for detecting code smell instances, and (ii) the presence of false positives and false negatives.

A considerable increment/decrement of the thresholds used in the detection rules might determine changes in the set of detected code smells (and thus, in our results). In our study we used the thresholds suggested in the paper by Moha et al. [50]. As for the presence of false positives and false negatives, Moha et al. reported for DECOR a precision above 60 percent and a recall of 100 percent on Xerces 2.7.0. As for the precision, other than relying on Moha et al.'s assessment, we manually validated a subset of the 4,627 detected smell instances. This manual validation has been performed by two authors independently, and cases of disagreement were discussed. In total, 1,107 smells were validated, including 241 Blob instances, 317 Class Data Should Be Private, 166 Complex Class, 65 Spaghetti Code, and 318 Functional Decomposition. Such a (stratified) sample is deemed to be statistically significant for a 95 percent confidence level and a 10 percent confidence interval [69]. The results of the manual validation indicated a mean precision of 73 percent, and specifically 79 percent for Blob, 62 percent for Class Data Should Be Private, 74 percent for Complex Class, 82 percent for Spaghetti Code, and 70 percent for Functional Decomposition. In addition, we replicated all the analyses performed to answer our research questions by considering just the smell-introducing commits (2,555) involving smell instances that have been manually validated as true positives. The results achieved in this analysis (available in our replication package [80]) are perfectly consistent with those obtained in our paper on the complete dataset, thus confirming all our findings. Finally, we are aware that our study can also suffer from the presence of false negatives. However, (i) the sample of investigated smell instances is pretty large (4,627 instances), and (ii) DECOR's claimed recall is very high.

Another threat related to the use of DECOR is the possible presence of "conceptual" false positive instances [26], i.e., instances detected by the tool as true positives but irrelevant for developers. However, most of the code smells studied in this paper (i.e., Blob, Complex Class and Spaghetti Code) have been shown to be perceived as harmful by developers [60]. This limits the possible impact of this threat.

The overlap between the quality metrics used when building the linear regression models (RQ1) and the metrics used by DECOR for detecting code smells may bias the findings related to when code smells are introduced. In our empirical investigation we are not interested in predicting the presence of code smells over time; rather, we want to observe whether the trends of quality metrics are different for classes that will become smelly with respect to those that will not become smelly. For this reason, the use of indicators that are used by the detector to identify smells should not influence our observations. However, in most of the cases we avoided the overlap between the metrics used by DECOR and the ones used in the context of RQ1. Table 11 reports, for each smell, (i) the set of metrics used by the detector, (ii) the set of metrics evaluated in the context of RQ1, and (iii) the overlap between them. We can note that the overlap between the two sets of metrics is often minimal or even empty (e.g., in the case of Spaghetti Code). Also, it is worth noting that the detector uses specific thresholds for detecting smells, while in our case we simply look at the changes of metrics' values over time.

As explained in Section 2, the heuristic for excluding projects with incomplete history from the Project startup analysis may have failed to discard some projects. Also, we excluded the first commit of a project's history involving Java files from the analysis of smell-introducing commits, because such commits are likely to be imports from old versioning systems; therefore, we only focused our attention (in terms of the first commit) on the addition of new files during the observed history period.

Concerning the tags used to characterize smell-introducing changes, the commit classification was performed by two different authors, and the results were compared and discussed in cases of inconsistencies. Also, a second check was performed for those commits linked to issues (only 471 out of 9,164 commits), to avoid problems due to incorrect issue classification [3], [32].
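For context, sample sizes like those quoted above (the 1,107 validated instances, or the 979 smell-removing commits of RQ4) are conventionally derived with Cochran's formula plus a finite-population correction. The paper cites [69] and does not spell out its computation, so the following is only the textbook procedure, not necessarily the authors' exact one:

```python
import math

def sample_size(population, z=1.96, interval=0.10, p=0.5):
    # Cochran's sample size with finite-population correction;
    # z = 1.96 corresponds to a 95 percent confidence level.
    n0 = (z ** 2) * p * (1 - p) / (interval ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# Unstratified sample for the 4,627 detected smell instances at a
# 95 percent confidence level and a 10 percent confidence interval:
print(sample_size(4627))  # ~95; a stratified (per-smell) sample is larger
```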
The analysis of developer-related tags was performed using the Git author information instead of relying on committers (not all authors have commit privileges in open source projects, hence observing committers would give an imprecise and partial view of the reality). However, there is no guarantee that the reported authorship is always accurate and complete.
TABLE 11
Metrics Used by the Detector Compared to the Metrics Evaluated in RQ1

Code Smell: Blob
  Metrics used by DECOR: #Methods*, #Attributes*, LCOM*, MethodName, ClassName
  Metrics used in RQ1: LOC, LCOM*, WMC, RFC, CBO, #Methods*, #Attributes*
  Overlap: 3 metrics out of 5 used by DECOR. Note that in this case DECOR also uses textual aspects of the source code that we do not take into account in the context of RQ1.

Code Smell: CDSBP
  Metrics used by DECOR: #Public Attributes
  Metrics used in RQ1: LOC, LCOM, WMC, RFC, CBO, #Methods, #Attributes
  Overlap: -

Code Smell: Complex Class
  Metrics used by DECOR: WMC
  Metrics used in RQ1: LOC, LCOM, WMC*, RFC, CBO, #Methods, #Attributes
  Overlap: 1 metric in overlap between the two sets. Note that in the paper we did not only observe the growth of the WMC metric; we found that several other metrics tend to increase over time for the classes that will become smelly (e.g., LCOM and NOA).

Code Smell: Functional Decomposition
  Metrics used by DECOR: #Private Attributes, #Attributes*, ClassName
  Metrics used in RQ1: LOC, LCOM, WMC, RFC, CBO, #Methods, #Attributes*
  Overlap: 1 metric in overlap. Also in this case, we found decreasing trends for all the metrics used in RQ1, and not only for the one used by DECOR.

Code Smell: Spaghetti Code
  Metrics used by DECOR: Method LOC, #Parameters, DIT
  Metrics used in RQ1: LOC, LCOM, WMC, RFC, CBO, #Methods, #Attributes
  Overlap: -

(* marks metrics appearing in both sets.)
We are aware that the Workload tag measures the developers' activity within a single project, while in principle one could be busy on other projects or different other activities. One possibility to mitigate such a threat could have been to measure the workload of a developer within the entire ecosystem. However, in our opinion, this would have introduced some bias, i.e., assigning a high workload to developers working on several projects of the same ecosystem and a low workload to those that, while not working on other projects of the same ecosystem, could have been busy on projects outside the ecosystem. It is also important to point out that, in terms of the relationship between the Workload tag and smell introduction, we obtained consistent results across the three ecosystems, which at least mitigates the presence of a possible threat. Also, estimating the Workload by just counting commits is an approximation. However, we do not use the commit size, because there might be a small commit requiring substantial effort as well.

The proxies that we used for the survivability of code smells (i.e., the number of days and the number of commits from their introduction to their removal) should provide two different views on the survivability phenomenon. However, the level of activity of a project (e.g., the number of commits per week) may substantially change during its lifetime, thus influencing the two measured variables.

When studying the survival and the time to fix code smell instances, we relied on DECOR to assess when a code smell instance has been fixed. Since we rely on a metric-based approach, code smell instances whose metrics' values alternate between slightly below and slightly above the detection threshold used by DECOR appear as a series of different code smell instances having a short lifetime, thus introducing imprecisions in our data. To assess the extent of such imprecisions, we computed the distribution of the number of fixes for each code file and each type of smell in our dataset. We found that only between 0.7 and 2.7 percent (depending on the software ecosystem) of the files have been fixed more than once for the same type of code smell during the considered change history. Thus, such a phenomenon should only marginally impact our data.

Concerning RQ4, we relied on an open coding procedure performed on a statistically significant sample of smell-removing commits in order to understand how code smells are removed from software systems. This procedure involved three of the authors and included open discussion aimed at double-checking the classifications individually performed. Still, we cannot exclude imprecision and some degree of subjectiveness (mitigated by the discussion) in the assignment of the smell-removing commits to the different fixing/removal categories.

As for the threats that could have influenced the results (internal validity), we performed the study by comparing classes affected (and not) by a specific type of smell. However, there can also be cases of classes affected by different types of smells at the same time. Our investigation revealed that such classes represent a minority (3 percent for Android, 5 percent for Apache, and 9 percent for Eclipse); therefore, the coexistence of different types of smells in the same class is not particularly interesting to investigate, given also the complexity it would have added to the study design and to its presentation. Another threat could be represented by the fact that a commit identified as a smell-removing commit (i.e., a commit which fixes a code smell) could potentially introduce another type of smell in the same class. To assess the extent to which this could represent a threat to our study, we analyzed in how many cases this happened in our entire dataset. We found that in only four cases a fix of a code smell led to the introduction of a different code smell type in the same software artifact.

In RQ2 we studied tags related to different aspects of a software project's lifetime—characterizing commits, developers, and the project's status itself—but we are aware that there could be many other factors that could have influenced the introduction of smells. In any case, it is worth noting that it is beyond the scope of this work to make any claims related to causation of the relationship between the introduction of smells and product or process factors characterizing a software project.
TABLE 12
Number of Censored Intervals Discarded Using Different Thresholds

# Censored Intervals     Android    Apache       Eclipse
Total                    708        5,780        2,709
Discarded using 1st Q.   1 (0.1)    43 (0.7)     7 (0.3)
Discarded using Median   3 (0.4)    203 (3.5)    51 (1.9)
Discarded using 3rd Q.   26 (3.7)   602 (10.0)   274 (10.0)

Percentages are reported between brackets.
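The threshold analysis summarized in Table 12 boils down to counting, for each candidate threshold (a quartile of the closed-interval durations), the censored intervals that fall below it; a sketch with hypothetical durations (numpy assumed):

```python
import numpy as np

def discarded_censored(closed_days, censored_days):
    # For each candidate threshold (quartiles of the closed intervals),
    # count the censored intervals shorter than it (cf. Table 12).
    out = {}
    for name, q in (("1st Q.", 25), ("Median", 50), ("3rd Q.", 75)):
        threshold = np.percentile(closed_days, q)
        out[name] = sum(1 for d in censored_days if d < threshold)
    return out

closed = [12, 40, 90, 101, 135, 300, 520]     # hypothetical data
censored = [5, 60, 700, 1500, 2400]
print(discarded_censored(closed, censored))
```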
introduction of smells and product or process factors the two instances since we simply observe that DECOR was
characterizing a software project. detecting a Blob in f at commit ci and it is still detecting a Blob in
The survival analysis in the context of RQ 3 has been performed f at commit ciþ1. This means that ( i ) we will consider for the Blob
by excluding smell instances for which the developers had not instance detected at commit ci a lifetime longer than it should be,
“enough time” to fix them, and in particular censored intervals and (ii) we will not be able to study a new Blob instance. Also,
having the last-smell-introducing commit too close to the last when computing the survivability of the code smells we
commit analyzed in the project’s history. Table 12 shows the considered the smell introduced only after the last-smell-
absolute number of censored intervals discarded using different introducing-commit ( i.e., we ignored the other commits
thresholds. In our analysis, we used the median of the smelly contributing to the introduction of the smell). Basically, our RQ 3
interval (in terms of the number of days) for closed intervals as a results are conservative in the sense that they consider the
threshold. As we can observe in Table 12, this threshold allows minimum survival time of each studied code smell instance.
the removal of a relatively small number of code smells from the The main threats related to the relationship between the
analysis. Indeed, we discarded three instances (0.4 percent of the treatment and the outcome (conclusion validity) are represented
total number of censored intervals) in Android, 203 instances (3.5 by the analysis method exploited in our study. In RQ 1, we used
This is also confirmed by the distribution of the number of days composing the censored intervals, shown in Table 13, which highlights that censored intervals span a large number of days. It is worth noting that, had we selected the first quartile as the threshold, we would have removed too few code smells from the analysis (i.e., one instance in Android, 43 in Apache, and seven in Eclipse). On the other hand, a more conservative approach would have been to exclude censored data where the time interval between the last-smell-introducing commit and the last analyzed commit is greater than the third quartile of the smell-removing time distribution. In this case, we would have removed a higher number of instances with respect to the median (i.e., 26 instances in Android, 602 in Apache, and 274 in Eclipse). Moreover, as we show in our online appendix [80], this choice would not have impacted our findings (i.e., the achieved results are consistent with what we observed by using the median). Finally, we also analyzed the proportion of closed and censored intervals considering (i) the original change history (no instance removed), (ii) the first quartile as the threshold, (iii) the median as the threshold, and (iv) the third quartile as the threshold. As shown in our online appendix [80], the proportion of closed and censored intervals after excluding censored intervals using the median remains almost identical to the initial proportion (i.e., the one observed on the original change history): in most cases the difference is less than 1 percent, and in only a few cases it reaches 2 percent.

TABLE 13
Descriptive Statistics of the Number of Days Composing Censored Intervals

Ecosystem   Min   1st Qu.   Median   Mean    3rd Qu.   Max.
Android     3     513       945      1,026   1,386     2,911
Apache      0     909       1,570    1,706   2,434     5,697
Eclipse     0     1,321     2,799    2,629   4,005     5,151

Still in the context of RQ3, we considered a code smell as removed from the system in a commit ci when DECOR detects it in ci-1 but does not detect it in ci. This might lead to some imprecision when computing the lifetime of the smells. Indeed, suppose that a file f was affected by the Blob smell until commit ci (i.e., DECOR still identifies f as a Blob class in commit ci). Then, suppose that f is completely rewritten in ci+1 and that DECOR still identifies f as a Blob class. While it is clear that the Blob instance detected in commit ci is different from the one detected in commit ci+1 (since f has been completely rewritten), we are not able to discriminate between the two instances, since we simply observe that DECOR was detecting a Blob in f at commit ci and is still detecting a Blob in f at commit ci+1. This means that (i) we will assign the Blob instance detected at commit ci a lifetime longer than it should be, and (ii) we will not be able to study the new Blob instance. Also, when computing the survivability of the code smells, we considered a smell as introduced only at its last-smell-introducing commit (i.e., we ignored the other commits contributing to the introduction of the smell). Basically, our RQ3 results are conservative in the sense that they consider the minimum survival time of each studied code smell instance.
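For illustration, the removal rule described above can be expressed as a minimal sketch. It assumes that the detector has already been run on every commit and that its verdicts for a given file are available as a boolean sequence; the names are illustrative and do not correspond to the actual DECOR API.

// Minimal sketch of the removal rule: a smell instance is considered
// removed at commit i if the detector flags the file at commit i-1 but
// not at commit i. As discussed above, a complete rewrite that leaves
// the file smelly is invisible to this rule.
import java.util.ArrayList;
import java.util.List;

class SmellRemovalFinder {
    // flaggedAt.get(i) is true iff the detector flags the file at commit i
    static List<Integer> removalCommits(List<Boolean> flaggedAt) {
        List<Integer> removals = new ArrayList<>();
        for (int i = 1; i < flaggedAt.size(); i++) {
            if (flaggedAt.get(i - 1) && !flaggedAt.get(i)) {
                removals.add(i);
            }
        }
        return removals;
    }
}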
The main threats related to the relationship between the treatment and the outcome (conclusion validity) are represented by the analysis methods exploited in our study. In RQ1, we used non-parametric tests (Mann-Whitney) and effect size measures (Cliff's Delta), as well as regression analysis. Results of RQ2 and RQ4 are, instead, reported in terms of descriptive statistics and analyzed from a purely observational point of view. As for RQ3, we used the Kaplan-Meier estimator [33], which estimates the underlying survival model without making any initial assumption about the underlying distribution.
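For the interested reader, the following self-contained sketch shows how a Kaplan-Meier curve is computed from (lifetime, removed/censored) observations. It is a textbook implementation of the estimator S(t) = product, over removal times ti <= t, of (1 - di/ni), where di is the number of removals at time ti and ni the number of instances still at risk [33]; it is not the code used in our analysis.

import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

class KaplanMeier {
    // One observation per smell instance: lifetime in days, and whether
    // its removal was observed (closed interval) or not (censored).
    record Obs(double days, boolean removed) {}

    static TreeMap<Double, Double> survivalCurve(List<Obs> obs) {
        TreeSet<Double> removalTimes = new TreeSet<>();
        for (Obs o : obs) if (o.removed()) removalTimes.add(o.days());
        TreeMap<Double, Double> curve = new TreeMap<>();
        double s = 1.0;
        for (double t : removalTimes) {
            // instances still at risk just before t (censored ones included)
            long atRisk = obs.stream().filter(o -> o.days() >= t).count();
            long removals = obs.stream()
                    .filter(o -> o.removed() && o.days() == t).count();
            s *= 1.0 - (double) removals / atRisk;
            curve.put(t, s);
        }
        return curve;
    }

    public static void main(String[] args) {
        // Toy data: two observed removals and two censored instances.
        List<Obs> data = List.of(new Obs(10, true), new Obs(30, true),
                new Obs(30, false), new Obs(120, false));
        survivalCurve(data).forEach((t, s) ->
                System.out.printf("S(%.0f days) = %.3f%n", t, s));
    }
}

Censored intervals contribute to the at-risk counts without ever triggering a drop in the curve, which is exactly why they must not be silently treated as removals.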
Finally, regarding the generalization of our findings (external validity), this is, to the best of our knowledge, the largest study, in terms of the number of projects (200), concerning the analysis of code smells and of their evolution. However, we are aware that we limited our attention to only five types of smells. As explained in Section 2, this choice is justified by the need to limit the computational time, since we wanted to analyze a large number of projects. Also, we tried to diversify the types of smells by including smells representing violations of OO principles as well as "size-related" smells. Last, but not least, we made sure to include smells, such as Complex Class, Blob, and Spaghetti Code, that previous studies indicated to be perceived by developers as severe problems [60]. Our choice of the subject systems is not random, but guided by specific requirements of our underlying infrastructure. Specifically, the selected systems are written in Java, since the code smell detector used in our study only works on software systems written in this programming language. Clearly, our results cannot be generalized to other programming languages. Nevertheless, further studies aimed at replicating our work on other smells, with projects developed for other ecosystems and in other programming languages, are desirable.

5 RELATED WORK

This section reports the literature related to (i) empirical studies conducted to analyze the evolution of code smells, (ii) the impact of code smells on maintainability, and (iii) methods and tools able to detect smells in source code. Finally, we also report the empirical studies conducted in the field of refactoring.

5.1 Evolution of Smells

A first study taking into account the way code smells evolve during the evolution of a system was conducted by Chatzigeorgiou and Manakos [17].
The reported results show that (i) the number of code smell instances increases over time, and (ii) developers are reluctant to perform refactoring operations in order to remove them. On the same line are the results reported by Peters and Zaidman [62], who show that developers are often aware of the presence of code smells in the source code, but do not invest time in performing refactoring activities aimed at removing them. A partial reason for this behavior is given by Arcoverde et al. [4], who studied the longevity of code smells, showing that they often survive for a long time in the source code. The authors point to the desire to avoid API changes as one of the main reasons behind this result [4]. The analyses conducted in the context of RQ3 confirm previous findings on code smell longevity, showing that code smells tend to remain in a system for a long time. Moreover, the results of RQ4 confirm that refactoring is not the primary way in which code smells are removed.

The evolution of code smells is also studied by Olbrich et al. [56], who analyzed the evolution of two types of code smells, namely God Class and Shotgun Surgery, showing that there are periods in which the number of smells increases and periods in which this number decreases. They also show that the increase/decrease of the number of instances does not depend on the size of the system. Vaucher et al. [83] conducted a study on the evolution of the God Class smell, aimed at understanding whether such classes affect software systems for long periods of time or, instead, are refactored while the system evolves. Their goal is to define a method able to discriminate between God Class instances that have been introduced by design and God Class instances that were introduced unintentionally. Our study complements the work by Vaucher et al. [83], because we look into the circumstances behind the introduction of smells, beyond analyzing when they are introduced.

In a closely related field, Bavota et al. [7] analyzed the distribution of unit test smells in 18 software systems, providing evidence that they are widely spread, but also that most of them have a strong negative impact on code comprehensibility. On the same line, Tufano et al. [79] reported a large-scale empirical study showing that test smells are usually introduced by developers when the corresponding test code is committed to the repository for the first time, and that they tend to remain in a system for a long time. The study conducted in this paper is complementary to the one by Tufano et al., since it is focused on the analysis of the design flaws arising in production code.

Some related research has been conducted to analyze one very specific type of code smell, i.e., code clones. Göde [30] investigated to what extent code clones are removed through deliberate operations, finding significant divergences between the code clones detected by existing tools and the ones removed by developers. Bazrafshan and Koschke [11] extended the work by Göde, analyzing whether developers remove code clones through deliberate or accidental modifications, finding that the former category is the most frequent. To this aim, the authors classified the changes that removed clones into Replacement, Movement, and Deletion, thus leading to a categorization similar to the one presented in our RQ4. However, such a categorization is focused on code clones, since it considers specific types of changes aimed at modeling code clone evolution (e.g., whether the duplicated code is placed into a common superclass), while we defined a more generic taxonomy of changes applied by developers for removing a variety of code smells.

Kim et al. [40] studied the lifetime of code clones, finding that many clones are fixed within a short time, while long-lived code clones are not easy to refactor because they evolve independently. Unlike this work, our analyses revealed that other code smells generally have a long life and that, when fixed, their removal is usually performed after a few commits.

Thummalapenta et al. [76] introduced the notion of "late propagation", related to changes that have been propagated across cloned code instances at different times. An important difference between research conducted in the area of clone evolution and code smell evolution is that, differently from other code smells, clone evolution can be seen as the coevolution of multiple, similar (i.e., cloned) code elements, and such evolution can either be consistent or inconsistent (e.g., due to missing change propagation) [76]. Such a behavior does not affect the code smells studied in this paper.

Finally, related to the variables investigated in this study, and specifically to the authorship of smell-related changes, is the notion of code ownership. Rahman and Devanbu [63] studied the impact of ownership and developers' experience on software quality. The authors focused on software bugs, analyzing whether "troubled" code fragments (i.e., code involved in a fix) are the result of contributions from multiple developers. Moreover, they studied whether, and what type of, developers' experience matters in this context. The results show that code implicated in bugs is more strongly associated with contributions coming from a single developer. In addition, specialized experience on the target file is shown to be more important than the developer's general experience.

5.2 Impact of Smells on Maintenance Properties

Several empirical studies have investigated the impact of code smells on maintenance activities. Abbes et al. [1] studied the impact of two types of code smells, namely Blob and Spaghetti Code, on program comprehension. Their results show that the presence of a single code smell in a class does not have an important impact on developers' ability to comprehend the code. Instead, a combination of more code smells affecting the same code components strongly decreases developers' ability to deal with comprehension tasks. The interaction between different smell instances affecting the same code components has also been studied by Yamashita et al. [88], who confirmed that developers experience more difficulties in working on classes affected by more than one code smell. The same authors also analyzed the impact of code smells on maintainability characteristics [89]. They identified which maintainability factors are reflected by code smells and which ones are not, basing their results on (i) expert-based maintainability assessments, and (ii) observations and interviews with professional developers. Sjøberg et al. [72] investigated the impact of twelve code smells on the maintainability of software systems. In particular, the authors conducted a study with six industrial developers involved in three maintenance tasks on four Java systems. The amount of time spent by each developer in performing the required tasks was measured through an Eclipse plug-in, while a regression analysis was used to measure the maintenance effort on source code files having specific properties, including the number of smells affecting them. The achieved results show that smells do not always constitute a problem, and that class size often impacts maintainability more than the presence of smells.

Lozano et al. [48] proposed the use of change history information to better understand the relationship between code smells and design principle violations, in order to assess the severity of design flaws. The authors found that the types of maintenance activities performed over the evolution of the system should be taken into account to focus refactoring efforts. In our study, we point out how particular types of maintenance activities
(i.e., enhancement of existing features or implementation of new ones) are generally more associated with code smell introduction. Deligiannis et al. [24] performed a controlled experiment showing that the presence of the God Class smell negatively affects the maintainability of source code. Also, the authors highlight an influence played by these smells on the way developers apply the inheritance mechanism.

Khomh et al. [38] demonstrated that the presence of code smells increases code change-proneness. Also, they showed that code components affected by code smells are more fault-prone than components not affected by any smell [38]. Gatrell and Counsell [29] conducted an empirical study aimed at quantifying the effect of refactoring on the change- and fault-proneness of classes. In particular, the authors monitored a commercial C# system for twelve months, identifying the refactorings applied during the first four months. They examined the same classes for the second four months in order to determine whether the refactoring resulted in a decrease of change- and fault-proneness. They also compared such classes with the classes of the system that, during the same time period, had not been refactored. The results revealed that classes subject to refactoring have a lower change- and fault-proneness, both considering the time period in which the same classes were not refactored and the classes in which no refactoring operations were applied. Li et al. [45] empirically evaluated the correlation between the presence of code smells and the probability that a class contains errors. They studied the post-release evolution process, showing that many code smells are positively correlated with class errors. Olbrich et al. [56] conducted a study on the God Class and Brain Class code smells, reporting that these code smells were changed less frequently and had fewer defects than the other classes. D'Ambros et al. [23] also studied the correlation between the Feature Envy and Shotgun Surgery smells and the defects in a system, reporting no consistent correlation between them. Recently, Palomba et al. [60] investigated how developers perceive code smells, showing that smells characterized by long and complex code are those most perceived by developers as design problems.

5.3 Detection of Smells

Several techniques have been proposed in the literature to detect code smell instances affecting code components, and all of these approaches take their cue from the suggestions provided by four well-known books [15], [28], [65], [85]. The first one, by Webster [85], defines common pitfalls in Object-Oriented development, from project management down to implementation. Riel [65] describes more than 60 guidelines to rate the integrity of a software design. The third one, by Fowler [28], describes 22 code smells, detailing for each of them the refactoring actions to take. Finally, Brown et al. [15] define 40 code antipatterns of different nature (i.e., architectural, managerial, and in source code), together with heuristics to detect them.

From these starting points, in the last decade several approaches have been proposed to detect design flaws in source code. Travassos et al. [77] define "reading techniques", a mechanism suggesting manual inspection rules to identify defects in source code. van Emden and Moonen [82] presented jCOSMO, a code smell browser that visualizes the detected smells in the source code. In particular, they focus their attention on two Java programming smells, known as instanceof and typecast. The first occurs when there are too many instanceof operators in the same block of code, making the source code difficult to read and understand. The typecast smell appears instead when an object is explicitly converted from one class type into another, possibly performing illegal casting, which results in a runtime error. Simon et al. [71] provided a metric-based visualization tool able to discover design defects representing refactoring opportunities. For example, a Blob is detected if different sets of cohesive attributes and methods are present inside a class; in other words, a Blob is identified when there is the possibility to apply Extract Class refactoring. Marinescu [49] proposed a metric-based mechanism to capture deviations from good design principles and heuristics, called "detection strategies". Such strategies are based on the identification of symptoms characterizing a particular smell and on metrics for measuring such symptoms; thresholds on these metrics are then defined in order to build the detection rules. Lanza and Marinescu [43] showed how to exploit quality metrics to identify "disharmony patterns" in code by defining a set of thresholds based on the measurement of the exploited metrics in real software systems. Their detection strategies are formulated in four steps. In the first step, the symptoms characterizing a smell are defined. In the second step, a proper set of metrics measuring these symptoms is identified. Having this information, the next step is to define thresholds to classify a class as affected (or not) by the defined symptoms. Finally, AND/OR operators are used to combine the symptoms, leading to the final rules for detecting the smells.
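To make the four steps concrete, the following sketch encodes one Blob-style rule in the spirit of such detection strategies. The metrics are classic ones, but the threshold values are illustrative assumptions, not the calibrated values used in [43], [49], or DECOR [50].

// Steps 1-2: symptoms (large, complex, poorly cohesive, accessing
// foreign data) are measured by LOC, WMC, TCC, and ATFD. Step 3:
// thresholds (assumed values). Step 4: AND/OR combination into a rule.
class ClassMetrics {
    int loc;     // lines of code
    int wmc;     // Weighted Methods per Class
    int atfd;    // Accesses To Foreign Data
    double tcc;  // Tight Class Cohesion, in [0, 1]

    ClassMetrics(int loc, int wmc, int atfd, double tcc) {
        this.loc = loc; this.wmc = wmc; this.atfd = atfd; this.tcc = tcc;
    }
}

class BlobDetectionRule {
    static boolean isBlobCandidate(ClassMetrics m) {
        boolean largeAndComplex = m.loc > 500 && m.wmc > 47;
        boolean poorlyCohesive = m.tcc < 0.33;
        boolean usesForeignData = m.atfd > 5;
        return largeAndComplex && (poorlyCohesive || usesForeignData);
    }
}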
Munro [52] presented a metric-based detection technique able to identify instances of two smells, i.e., Lazy Class and Temporary Field, in the source code. A set of thresholds is applied to structural metrics able to capture those smells. In the case of Lazy Class, the metrics used for the identification are Number of Methods (NOM), LOC, Weighted Methods per Class (WMC), and Coupling Between Objects (CBO). Moha et al. [50] introduced DECOR, a technique for specifying and detecting code and design smells. DECOR uses a Domain-Specific Language (DSL) for specifying smells using high-level abstractions. Four design smells are identified by DECOR, namely Blob, Swiss Army Knife, Functional Decomposition, and Spaghetti Code. As explained in Section 2, in our study we rely on DECOR for the identification of code smells over the change history of the systems in our dataset, because of its good performance both in terms of accuracy and execution time.

Tsantalis and Chatzigeorgiou [78] presented JDeodorant, a tool able to detect instances of the Feature Envy smell with the aim of suggesting move method refactoring opportunities. For each method of the system, JDeodorant forms a set of candidate target classes where the method could be moved. This set is obtained by examining the entities (i.e., attributes and methods) that the method accesses from the other classes. In its current version, JDeodorant (http://www.jdeodorant.com/) is also able to detect three other code smells (i.e., State Checking, Long Method, and God Class), as well as opportunities for refactoring code clones.
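The core intuition behind such suggestions can be sketched as follows: count the entities that a method accesses in each class and propose the class it "envies" the most. This simplification is not JDeodorant's actual algorithm, which, among other things, also verifies behavior-preservation preconditions before suggesting a move.

import java.util.Map;
import java.util.Optional;

class MoveMethodAdvisor {
    // accessesByClass maps each class to the number of its attributes and
    // methods that the method under analysis accesses.
    static Optional<String> suggestTarget(String ownerClass,
                                          Map<String, Integer> accessesByClass) {
        int ownAccesses = accessesByClass.getOrDefault(ownerClass, 0);
        return accessesByClass.entrySet().stream()
                .filter(e -> !e.getKey().equals(ownerClass))
                .max(Map.Entry.comparingByValue())
                .filter(e -> e.getValue() > ownAccesses) // envies another class
                .map(Map.Entry::getKey);
    }
}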
Ligu et al. [46] introduced the identification of the Refused Bequest code smell using a combination of static source code analysis and dynamic unit test execution. Their approach aims at discovering whether subclasses really want to support the interface of their superclass [28]. In order to understand which methods are really invoked on subclass instances, they intentionally override these methods, introducing an error in the new implementation (e.g., a division by zero). If there are classes in the system invoking the method, then a failure will occur; otherwise, the method is never invoked and an instance of Refused Bequest is found.
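In Java, the idea can be sketched as follows (class and method names are purely illustrative): the inherited method is deliberately overridden with a failing body, so that running the existing test suite reveals whether any client actually exercises it on subclass instances.

class Account {
    int interestRate() { return 3; }
}

// Deliberately sabotaged override: if the whole test suite still passes,
// no client ever calls interestRate() on SavingsAccount instances, which
// hints that the inherited interface is refused.
class SavingsAccount extends Account {
    @Override
    int interestRate() {
        throw new IllegalStateException("interestRate() exercised on subclass");
    }
}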
Code smell detection can also be formulated as an optimization problem, as pointed out by Kessentini et al. [35], who presented a technique to detect design defects following the assumption that what significantly diverges from good design practices is
likely to represent a design problem. The advantage of their approach is that it does not look for specific code smells (as most approaches do) but for design problems in general. Also, in the reported evaluation, the approach was able to achieve 95 percent precision in identifying design defects [35]. Kessentini et al. [36] also presented a cooperative parallel search-based approach for identifying code smell instances, with an accuracy higher than 85 percent. Boussaa et al. [13] proposed the use of competitive coevolutionary search for the code smell detection problem. In their approach, two populations evolve simultaneously: the first generates detection rules with the aim of detecting the highest possible proportion of code smells, whereas the second population generates smells that are currently not detected by the rules of the other population. Sahin et al. [67] proposed an approach able to generate code smell detection rules using a bi-level optimization problem, in which the upper-level optimization task creates a set of detection rules maximizing the coverage of both code smell examples and the artificial code smells generated by the lower level, while the lower level is responsible for maximizing the number of artificially generated code smells. The empirical evaluation shows that this approach achieves, on average, more than 85 percent in terms of precision and recall.

The approaches described above classify classes strictly as being clean or affected by anti-patterns, while an accurate analysis of the borderline classes is missing [39]. In order to bridge this gap, Khomh et al. [39] proposed an approach based on Bayesian belief networks that provides the likelihood that a code component is affected by a smell, instead of the boolean value returned by the previous techniques. This is also one of the main characteristics of the approach based on quality metrics and B-splines proposed by Oliveto et al. [57] for identifying instances of Blobs in source code.

Besides structural information, historical data can be exploited for detecting code smells. Ratiu et al. [64] proposed to use the historical information of the suspected flawed structures to increase the accuracy of automatic problem detection. Palomba et al. [59] provided evidence that historical data can be successfully exploited to identify not only smells that are intrinsically characterized by their evolution across the program history, such as Divergent Change, Parallel Inheritance, and Shotgun Surgery, but also smells such as Blob and Feature Envy [59].
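As a concrete illustration of the history-based idea, the sketch below flags a Shotgun Surgery suspect from co-change information alone: a class that, when changed, repeatedly drags along many distinct other classes. The thresholds are illustrative assumptions, and the sketch is far simpler than the approaches cited above.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CoChangeAnalyzer {
    // Each commit is represented by the set of class names it modifies.
    static boolean shotgunSurgerySuspect(String className,
                                         List<Set<String>> commits) {
        int changes = 0;
        Map<String, Integer> coChanges = new HashMap<>();
        for (Set<String> commit : commits) {
            if (!commit.contains(className)) continue;
            changes++;
            for (String other : commit) {
                if (!other.equals(className)) {
                    coChanges.merge(other, 1, Integer::sum);
                }
            }
        }
        // Flag classes changed often together with many frequent companions.
        long frequentCompanions =
                coChanges.values().stream().filter(c -> c >= 3).count();
        return changes >= 10 && frequentCompanions >= 5;
    }
}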
5.4 Empirical Studies on Refactoring

Wang et al. [84] conducted a survey with ten industrial developers in order to understand the major factors that motivate their refactoring activities. The authors report twelve different factors pushing developers to adopt refactoring practices, classified into intrinsic and external motivators. Intrinsic motivators are those for which developers do not obtain external rewards (for example, an intrinsic motivator is Responsibility with Code Authorship, namely the desire of developers to ensure high quality for their own code). Regarding the external motivators, an example is Recognition from Others, i.e., high technical ability can help software developers gain recognition.

Murphy-Hill et al. [54] analyzed eight different datasets trying to understand how developers perform refactorings. Examples of the exploited datasets are usage data from 41 developers using the Eclipse environment, data from the Eclipse Usage Collector aggregating the activities of 13,000 developers for almost one year, and information extracted from versioning systems. Among their several interesting findings: (i) almost 41 percent of development activities contain at least one refactoring session; (ii) programmers rarely (almost 10 percent of the time) configure refactoring tools; (iii) commit messages do not help in predicting refactoring, since developers rarely report their refactoring activities explicitly in them; (iv) developers often perform floss refactoring, namely they interleave refactoring with other programming activities; and (v) most of the refactoring operations (close to 90 percent) are performed manually by developers without the help of any tool.

Kim et al. [41] presented a survey performed with 328 Microsoft engineers (83 percent of whom were developers) to investigate (i) when and how they refactor source code, (ii) whether they use automated refactoring tools, and (iii) developers' perception of the benefits, risks, and challenges of refactoring [41]. The main findings of the study are the following:

- While developers recognize refactoring as a way to improve the quality of a software system, in almost 50 percent of the cases they do not define refactoring as a behavior-preserving operation;
- The most important symptom pushing developers to perform refactoring is low readability of the source code;
- 51 percent of developers perform refactoring manually;
- The main benefits that developers observed from refactoring were improved readability (43 percent) and improved maintainability (30 percent);
- The main risk that developers fear when performing refactoring operations is bug introduction (77 percent).

Kim et al. [41] also reported the results of a quantitative analysis performed on the Windows 7 change history, showing that code components refactored over time experienced a higher reduction in the number of inter-module dependencies and post-release defects than other modules. Similar results were obtained by Kataoka et al. [34], who analyzed the history of an industrial software system, comparing the classes subject to the application of refactorings with the classes never refactored, and finding a decrease of coupling metrics.

Finally, a number of works have studied the relationship between refactoring and software quality. Bavota et al. [9] conducted a study aimed at investigating to what extent refactoring activities induce faults. They show that refactorings involving hierarchies (e.g., pull down method) induce faults very frequently; conversely, other kinds of refactorings are likely to be harmless in practice. Our study on why code smells are introduced (RQ2) reveals an additional side effect of refactoring, i.e., sometimes developers introduce code smells during refactoring operations.

Bavota et al. also conducted a study aimed at understanding the relationship between code quality and refactoring [10]. In particular, they studied the evolution of 63 releases of three open source systems in order to investigate the characteristics of code components that increase/decrease their chances of being the object of refactoring operations. Results indicate that refactoring is often not performed on classes having a low metric profile, while almost 40 percent of the time refactorings have been performed on classes affected by smells. However, just 7 percent of those refactorings actually removed the smell. The latter finding is perfectly in line with the results achieved in the context of RQ4, where we found that only 9 percent of code smell instances are removed as a direct consequence of refactoring operations.
Stroggylos and Spinellis [73] studied the impact of refactoring operations on the values of eight object-oriented quality metrics. Their results show the possible negative effects that refactoring can have on some quality metrics (e.g., an increased value of the LCOM metric). On the same line, Stroulia and Kapoor [74] analyzed the evolution of one system, observing a decrease of the LOC and NOM (Number of Methods) metrics on the classes in which a refactoring had been applied. Szoke et al. [75] performed a study on five software systems to investigate the relationship between refactoring and code quality. They show that small refactoring operations performed in isolation rarely impact software quality; on the other side, a high number of refactoring operations performed in block helps in substantially improving code quality. Alshayeb [2] investigated the impact of refactoring operations on five quality attributes, namely adaptability, maintainability, understandability, reusability, and testability. His findings highlight that the benefits brought by refactoring operations to some code classes are often counterbalanced by a decrease of quality in other classes. Our study partially confirms the findings reported by Alshayeb [2], since we show how in some cases refactoring can introduce design flaws. Moser et al. [51] conducted a case study in an industrial environment aimed at investigating the impact of refactoring on the productivity of an agile team and on the quality of the source code it produces. The achieved results show that refactoring not only increases software quality, but also helps to increase developers' productivity.

6 CONCLUSION AND LESSONS LEARNED

This paper presented a large-scale empirical study conducted over the commit history of 200 open source projects and aimed at understanding when and why bad code smells are introduced, what their survivability is, and under which circumstances they are removed. The results provide several valuable findings for the research community:

Lesson 1. Most of the time, code artifacts are affected by bad smells since their creation. This result contradicts the common wisdom that bad smells are generally introduced as a consequence of several modifications made to a code artifact. It also highlights that the introduction of most smells can simply be avoided by performing quality checks at commit time. In other words, instead of running smell detectors on the entire system from time to time, these tools could be used during commit activities (in particular circumstances, such as before issuing a release) to avoid or, at least, limit the introduction of bad code smells.
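As a sketch of what such a commit-time check could look like, the small program below could be wired into a pre-commit hook that passes the staged .java files as arguments. The detect method is a toy stand-in, not the API of DECOR or of any existing tool; a real gate would invoke an actual smell detector there.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CommitTimeSmellGate {
    public static void main(String[] args) throws Exception {
        boolean smelly = false;
        for (String file : args) {                // staged .java files
            String source = Files.readString(Path.of(file));
            List<String> smells = detect(source);
            if (!smells.isEmpty()) {
                System.err.println(file + " would introduce: " + smells);
                smelly = true;
            }
        }
        System.exit(smelly ? 1 : 0);              // non-zero aborts the commit
    }

    // Toy heuristic used only to keep the sketch self-contained: flags
    // very large files as potential Blobs.
    static List<String> detect(String source) {
        long loc = source.lines().count();
        return loc > 1000 ? List.of("possible Blob (" + loc + " LOC)") : List.of();
    }
}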
Lesson 2. Code artifacts becoming smelly as a consequence of maintenance and evolution activities are characterized by peculiar metric trends, different from those of clean artifacts. This is in agreement with previous findings on the historical evolution of code smells [48], [58], [64]. Also, such results encourage the development of recommenders able to alert software developers when the changes applied to code artifacts result in worrisome metric trends, which generally characterize artifacts that will be affected by a smell.
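A minimal sketch of the trend-based alerting we envision: fit a least-squares line to a quality metric sampled at each commit of an artifact and warn when the slope indicates degradation. The choice of metric (LOC) and the threshold are illustrative assumptions.

import java.util.List;

class MetricTrendAlert {
    // Least-squares slope of the metric over commit indexes 0..n-1.
    static double slope(List<Double> metricPerCommit) {
        int n = metricPerCommit.size();
        if (n < 2) return 0.0;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double y = metricPerCommit.get(i);
            sx += i; sy += y; sxx += (double) i * i; sxy += i * y;
        }
        return (n * sxy - sx * sy) / (n * sxx - sx * sx);
    }

    // Example policy: alert when a class grows, on average, by more than
    // 50 lines of code per commit over the observed window.
    static boolean worrisome(List<Double> locPerCommit) {
        return slope(locPerCommit) > 50.0;
    }
}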
Lesson 3. While the implementation of new features and the enhancement of existing ones remain the main activities during which developers tend to introduce smells, we found almost 400 cases in which refactoring operations introduced smells. This result is quite surprising, given that one of the goals behind refactoring is the removal of bad smells [27]. This finding highlights the need for techniques and tools aimed at assessing the impact of refactoring operations on source code before their actual application (e.g., see the recent work by Chaparro et al. [16]).

Lesson 4. Newcomers are not necessarily responsible for introducing bad smells, while developers with high workloads and release pressure are more prone to introducing smell instances. This result highlights that code inspection practices should be strengthened when developers are working under these stressful conditions.

Lesson 5. Code smells have a high survivability and are rarely removed as a direct consequence of refactoring activities. We found that 80 percent of the analyzed code smell instances survive in the system, and only a very low percentage of them (9 percent) is removed through the application of specific refactorings. While we cannot conjecture on the reasons behind such a finding (e.g., the absence of proper refactoring tools, the developers' perception of code smells, etc.), our results highlight the need for further studies aimed at understanding why code smells are not refactored by developers. Only in this way will it be possible to understand where the research community should invest its efforts (e.g., in the creation of a new generation of refactoring tools).

These lessons learned represent the main input for our future research agenda on the topic, which mainly focuses on designing and developing a new generation of code quality checkers, such as those described in Lesson 2, as well as on investigating the reasons behind developers' lack of motivation to perform refactoring activities and the factors (e.g., the intensity of the code smell) that promote or discourage the fixing of a smell instance (Lesson 5). Also, we intend to perform a deeper investigation of the factors that can potentially explain the introduction of code smells, beyond the ones already analyzed in this paper.

ACKNOWLEDGMENTS

We would like to sincerely thank the anonymous reviewers for their careful reading of our manuscript and their extremely useful comments, which were very helpful in significantly improving the paper. Michele Tufano and Denys Poshyvanyk from W&M were partially supported via NSF CCF-1525902 and CCF-1253837 grants. Fabio Palomba is partially funded by the University of Molise. This paper is an extension of "When and Why Your Code Starts to Smell Bad" that appeared in the Proceedings of the 37th IEEE/ACM International Conference on Software Engineering (ICSE 2015), Florence, Italy, pp. 403-414, 2015 [81].

REFERENCES

[1] M. Abbes, F. Khomh, Y.-G. Gueheneuc, and G. Antoniol, "An empirical study of the impact of two antipatterns, Blob and Spaghetti Code, on program comprehension," in Proc. 15th Eur. Conf. Softw. Maintenance Reengineering, 2011, pp. 181-190.
[2] M. Alshayeb, "Empirical investigation of refactoring effect on software quality," Inf. Softw. Technol., vol. 51, no. 9, pp. 1319-1326, 2009.
[3] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, and Y.-G. Gueheneuc, "Is it a bug or an enhancement?: A text-based approach to classify change requests," in Proc. Conf. Centre Adv. Studies Collaborative Res., 2008, Art. no. 23.
[4] R. Arcoverde, A. Garcia, and E. Figueiredo, "Understanding the longevity of code smells: Preliminary results of an explanatory survey," in Proc. Int. Workshop Refactoring Tools, 2011, pp. 33-36.
[5] A. Bachmann, C. Bird, F. Rahman, P. T. Devanbu, and A. Bernstein, "The missing links: Bugs and bug-fix commits," in Proc. 18th ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2010, pp. 97-106.
[6] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Reading, MA, USA: Addison-Wesley, 1999.
[7] G. Bavota, A. Qusef, R. Oliveto, A. De Lucia, and D. Binkley, "An empirical analysis of the distribution of unit test smells and their impact on software maintenance," in Proc. 28th IEEE Int. Conf. Softw. Maintenance, 2012, pp. 56-65.
[8] G. Bavota, G. Canfora, M. Di Penta, R. Oliveto, and S. Panichella, "The evolution of project inter-dependencies in a software ecosystem: The case of Apache," in Proc. IEEE Int. Conf. Softw. Maintenance, 2013, pp. 280-289.
[9] G. Bavota, B. D. Carluccio, A. De Lucia, M. Di Penta, R. Oliveto, and O. Strollo, "When does a refactoring induce bugs? An empirical study," in Proc. IEEE Int. Work. Conf. Source Code Anal. Manipulation, 2012, pp. 104-113.
[10] G. Bavota, A. De Lucia, M. Di Penta, R. Oliveto, and F. Palomba, "An experimental investigation on the innate relationship between quality and refactoring," J. Syst. Softw., vol. 107, pp. 1-14, 2015.
[11] S. Bazrafshan and R. Koschke, "An empirical study of clone removals," in Proc. 29th IEEE Int. Conf. Softw. Maintenance, Sep. 2013, pp. 50-59.
[12] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. T. Devanbu, "Don't touch my code!: Examining the effects of ownership on software quality," in Proc. 19th ACM SIGSOFT Symp. Found. Softw. Eng. and 13th Eur. Softw. Eng. Conf., 2011, pp. 4-14.
[13] M. Boussaa, W. Kessentini, M. Kessentini, S. Bechikh, and S. Ben Chikha, "Competitive coevolutionary code-smells detection," in Search Based Software Engineering. Berlin, Germany: Springer, 2013, pp. 50-65.
[14] N. Brown, et al., "Managing technical debt in software-reliant systems," in Proc. Workshop Future Softw. Eng. Res. at 18th ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2010, pp. 47-52.
[15] W. J. Brown, R. C. Malveau, W. H. Brown, H. W. McCormick III, and T. J. Mowbray, AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis, 1st ed. Hoboken, NJ, USA: Wiley, Mar. 1998.
[16] O. Chaparro, G. Bavota, A. Marcus, and M. Di Penta, "On the impact of refactoring operations on code quality metrics," in Proc. 30th IEEE Int. Conf. Softw. Maintenance Evol., 2014, pp. 456-460.
[17] A. Chatzigeorgiou and A. Manakos, "Investigating the evolution of bad smells in object-oriented code," in Proc. Int. Conf. Qual. Inf. Commun. Technol., 2010, pp. 106-115.
[18] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Trans. Softw. Eng., vol. 20, no. 6, pp. 476-493, Jun. 1994.
[19] M. Claes, T. Mens, R. Di Cosmo, and J. Vouillon, "A historical analysis of Debian package incompatibilities," in Proc. 12th Work. Conf. Mining Softw. Repositories, 2015, pp. 212-223.
[20] W. J. Conover, Practical Nonparametric Statistics, 3rd ed. Hoboken, NJ, USA: Wiley, 1998.
[21] J. Corbin and A. Strauss, "Grounded theory research: Procedures, canons, and evaluative criteria," Qualitative Sociology, vol. 13, no. 1, pp. 3-21, 1990.
[22] W. Cunningham, "The WyCash portfolio management system," OOPS Messenger, vol. 4, no. 2, pp. 29-30, 1993.
[23] M. D'Ambros, A. Bacchelli, and M. Lanza, "On the impact of design flaws on software defects," in Proc. 10th Int. Conf. Qual. Softw., Jul. 2010, pp. 23-31.
[24] I. Deligiannis, I. Stamelos, L. Angelis, M. Roumeliotis, and M. Shepperd, "A controlled experiment investigation of an object-oriented design heuristic for maintainability," J. Syst. Softw., vol. 72, no. 2, pp. 129-143, 2004.
[25] M. Fischer, M. Pinzger, and H. Gall, "Populating a release history database from version control and bug tracking systems," in Proc. 19th Int. Conf. Softw. Maintenance, 2003, Art. no. 23.
[26] F. A. Fontana, J. Dietrich, B. Walter, A. Yamashita, and M. Zanoni, "Antipattern and code smell false positives: Preliminary conceptualization and classification," in Proc. IEEE 23rd Int. Conf. Softw. Anal. Evol. Reengineering, 2016, vol. 1, pp. 609-613.
[27] M. Fowler, K. Beck, J. Brant, W. Opdyke, and D. Roberts, Refactoring: Improving the Design of Existing Code. Reading, MA, USA: Addison-Wesley, 1999.
[28] M. Fowler, Refactoring: Improving the Design of Existing Code. Boston, MA, USA: Addison-Wesley, 1999.
[29] M. Gatrell and S. Counsell, "The effect of refactoring on change and fault-proneness in commercial C# software," Sci. Comput. Program., vol. 102, pp. 44-56, 2015.
[30] N. Göde, "Clone removal: Fact or fiction?" in Proc. 4th Int. Workshop Softw. Clones, 2010, pp. 33-40.
[31] R. J. Grissom and J. J. Kim, Effect Sizes for Research: A Broad Practical Approach, 2nd ed. Mahwah, NJ, USA: Lawrence Erlbaum Associates, 2005.
[32] K. Herzig, S. Just, and A. Zeller, "It's not a bug, it's a feature: How misclassification impacts bug prediction," in Proc. 35th Int. Conf. Softw. Eng., 2013, pp. 392-401.
[33] E. Kaplan and P. Meier, "Nonparametric estimation from incomplete observations," J. Amer. Statist. Assoc., vol. 53, no. 282, pp. 457-481, 1958.
[34] Y. Kataoka, T. Imai, H. Andou, and T. Fukaya, "A quantitative evaluation of maintainability enhancement by refactoring," in Proc. Int. Conf. Softw. Maintenance, 2002, pp. 576-585.
[35] M. Kessentini, S. Vaucher, and H. Sahraoui, "Deviance from perfection is a better criterion than closeness to evil when identifying risky code," in Proc. IEEE/ACM Int. Conf. Automated Softw. Eng., 2010, pp. 113-122.
[36] W. Kessentini, M. Kessentini, H. Sahraoui, S. Bechikh, and A. Ouni, "A cooperative parallel search-based software engineering approach for code-smells detection," IEEE Trans. Softw. Eng., vol. 40, no. 9, pp. 841-861, Sep. 2014.
[37] F. Khomh, M. Di Penta, and Y.-G. Gueheneuc, "An exploratory study of the impact of code smells on software change-proneness," in Proc. 16th Work. Conf. Reverse Eng., 2009, pp. 75-84.
[38] F. Khomh, M. Di Penta, Y.-G. Gueheneuc, and G. Antoniol, "An exploratory study of the impact of antipatterns on class change- and fault-proneness," Empirical Softw. Eng., vol. 17, no. 3, pp. 243-275, 2012.
[39] F. Khomh, S. Vaucher, Y.-G. Gueheneuc, and H. Sahraoui, "A Bayesian approach for the detection of code and design smells," in Proc. 9th Int. Conf. Qual. Softw., 2009, pp. 305-314.
[40] M. Kim, V. Sazawal, D. Notkin, and G. Murphy, "An empirical study of code clone genealogies," SIGSOFT Softw. Eng. Notes, vol. 30, no. 5, pp. 187-196, Sep. 2005.
[41] M. Kim, T. Zimmermann, and N. Nagappan, "An empirical study of refactoring challenges and benefits at Microsoft," IEEE Trans. Softw. Eng., vol. 40, no. 7, pp. 633-649, Jul. 2014.
[42] P. Kruchten, R. L. Nord, and I. Ozkaya, "Technical debt: From metaphor to theory and practice," IEEE Softw., vol. 29, no. 6, pp. 18-21, Nov./Dec. 2012.
[43] M. Lanza and R. Marinescu, Object-Oriented Metrics in Practice: Using Software Metrics to Characterize, Evaluate, and Improve the Design of Object-Oriented Systems. Berlin, Germany: Springer, 2006.
[44] M. M. Lehman and L. A. Belady, Software Evolution: Processes of Software Change. London, U.K.: Academic, 1985.
[45] W. Li and R. Shatnawi, "An empirical study of the bad smells and class error probability in the post-release object-oriented system evolution," J. Syst. Softw., vol. 80, pp. 1120-1128, 2007.
[46] E. Ligu, A. Chatzigeorgiou, T. Chaikalis, and N. Ygeionomakis, "Identification of refused bequest code smells," in Proc. 29th IEEE Int. Conf. Softw. Maintenance, 2013, pp. 392-395.
[47] E. Lim, N. Taksande, and C. B. Seaman, "A balancing act: What software practitioners have to say about technical debt," IEEE Softw., vol. 29, no. 6, pp. 22-27, Nov./Dec. 2012.
[48] A. Lozano, M. Wermelinger, and B. Nuseibeh, "Assessing the impact of bad smells using historical information," in Proc. 9th Int. Workshop Principles Softw. Evol., in conjunction with the 6th ESEC/FSE Joint Meeting, 2007, pp. 31-34.
[49] R. Marinescu, "Detection strategies: Metrics-based rules for detecting design flaws," in Proc. 20th Int. Conf. Softw. Maintenance, 2004, pp. 350-359.
[50] N. Moha, Y.-G. Gueheneuc, L. Duchien, and A.-F. Le Meur, "DECOR: A method for the specification and detection of code and design smells," IEEE Trans. Softw. Eng., vol. 36, no. 1, pp. 20-36, Jan./Feb. 2010.
[51] R. Moser, P. Abrahamsson, W. Pedrycz, A. Sillitti, and G. Succi, "A case study on the impact of refactoring on quality and productivity in an agile team," in Balancing Agility and Formalism in Software Engineering, B. Meyer, J. R. Nawrocki, and B. Walter, Eds. Berlin, Germany: Springer-Verlag, 2008, pp. 252-266.
[52] M. J. Munro, "Product metrics for automatic identification of "bad smell" design problems in Java source-code," in Proc. 11th Int. Softw. Metrics Symp., Sep. 2005, p. 15.
[53] G. C. Murphy, "Houston: We are in overload," in Proc. 23rd IEEE Int. Conf. Softw. Maintenance, 2007, Art. no. 1.
[54] E. Murphy-Hill, C. Parnin, and A. P. Black, "How we refactor, and how we know it," IEEE Trans. Softw. Eng., vol. 38, no. 1, pp. 5-18, Jan./Feb. 2012.
[55] I. Neamtiu, G. Xie, and J. Chen, "Towards a better understanding of software evolution: An empirical study on open-source software," J. Softw.: Evol. Process, vol. 25, no. 3, pp. 193-218, 2013.
[56] S. Olbrich, D. S. Cruzes, V. Basili, and N. Zazworka, "The evolution and impact of code smells: A case study of two open source systems," in Proc. 3rd Int. Symp. Empirical Softw. Eng. Meas., 2009, pp. 390-400.
[57] R. Oliveto, F. Khomh, G. Antoniol, and Y.-G. Gueheneuc, "Numerical signatures of antipatterns: An approach based on B-splines," in Proc. 14th Eur. Conf. Softw. Maintenance Reengineering, Mar. 2010, pp. 248-251.
[58] F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, A. De Lucia, and D. Poshyvanyk, "Detecting bad smells in source code using change history information," in Proc. IEEE/ACM 28th Int. Conf. Automated Softw. Eng., Nov. 2013, pp. 268-278.
[59] F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, D. Poshyvanyk, and A. De Lucia, "Mining version histories for detecting code smells," IEEE Trans. Softw. Eng., vol. 41, no. 5, pp. 462-489, May 2015.
[60] F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, and A. De Lucia, "Do they really smell bad? A study on developers' perception of bad code smells," in Proc. 30th IEEE Int. Conf. Softw. Maintenance Evol., 2014, pp. 101-110.
[61] D. L. Parnas, "Software aging," in Proc. 16th Int. Conf. Softw. Eng., 1994, pp. 279-287.
[62] R. Peters and A. Zaidman, "Evaluating the lifespan of code smells using software repository mining," in Proc. Eur. Conf. Softw. Maintenance Reengineering, 2012, pp. 411-416.
[63] F. Rahman and P. T. Devanbu, "Ownership, experience and defects: A fine-grained study of authorship," in Proc. 33rd Int. Conf. Softw. Eng., 2011, pp. 491-500.
[64] D. Ratiu, S. Ducasse, T. Gîrba, and R. Marinescu, "Using history information to improve design flaws detection," in Proc. 8th Eur. Conf. Softw. Maintenance Reengineering, 2004, pp. 223-232.
[65] A. J. Riel, Object-Oriented Design Heuristics. Reading, MA, USA: Addison-Wesley, 1996.
[66] R. G. Miller, Jr., Survival Analysis, 2nd ed. Hoboken, NJ, USA: Wiley, 2011.
[67] D. Sahin, M. Kessentini, S. Bechikh, and K. Deb, "Code-smell detection as a bilevel problem," ACM Trans. Softw. Eng. Methodology, vol. 24, no. 1, pp. 6:1-6:44, Oct. 2014.
[68] G. Scanniello, "Source code survival with the Kaplan Meier estimator," in Proc. 27th IEEE Int. Conf. Softw. Maintenance, 2011, pp. 524-527.
[69] D. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, 2nd ed. London, U.K.: Chapman & Hall/CRC, 2000.
[70] F. Shull, D. Falessi, C. Seaman, M. Diep, and L. Layman, "Technical debt: Showing the way for better transfer of empirical results," in Perspectives on the Future of Software Engineering. Berlin, Germany: Springer, 2013, pp. 179-190.
[71] F. Simon, F. Steinbrückner, and C. Lewerentz, "Metrics based refactoring," in Proc. 5th Eur. Conf. Softw. Maintenance Reengineering, 2001, pp. 30-38.
[72] D. I. K. Sjøberg, A. F. Yamashita, B. C. D. Anda, A. Mockus, and T. Dybå, "Quantifying the effect of code smells on maintenance effort," IEEE Trans. Softw. Eng., vol. 39, no. 8, pp. 1144-1156, Aug. 2013.
[73] K. Stroggylos and D. Spinellis, "Refactoring: Does it improve software quality?" in Proc. 5th Int. Workshop Softw. Qual., 2007, Art. no. 10.
[74] E. Stroulia and R. Kapoor, "Metrics of refactoring-based development: An experience report," in OOIS, X. Wang, R. Johnston, and S. Patel, Eds. London, U.K.: Springer, 2001, pp. 113-122.
[75] G. Szoke, G. Antal, C. Nagy, R. Ferenc, and T. Gyimothy, "Bulk fixing coding issues and its effects on software quality: Is it worth refactoring?" in Proc. IEEE 14th Int. Work. Conf. Source Code Anal. Manipulation, 2014, pp. 95-104.
[76] S. Thummalapenta, L. Cerulo, L. Aversano, and M. Di Penta, "An empirical study on the maintenance of source code clones," Empirical Softw. Eng., vol. 15, no. 1, pp. 1-34, 2010.
[77] G. Travassos, F. Shull, M. Fredericks, and V. R. Basili, "Detecting defects in object-oriented designs: Using reading techniques to increase software quality," in Proc. 14th Conf. Object-Oriented Program. Syst. Languages Appl., 1999, pp. 47-56.
[78] N. Tsantalis and A. Chatzigeorgiou, "Identification of move method refactoring opportunities," IEEE Trans. Softw. Eng., vol. 35, no. 3, pp. 347-367, May/Jun. 2009.
[79] M. Tufano, et al., "An empirical investigation into the nature of test smells," in Proc. 31st IEEE/ACM Int. Conf. Automated Softw. Eng., 2016, pp. 4-15.
[80] M. Tufano, et al., "When and why your code starts to smell bad (and whether the smells go away): Replication package," 2014. [Online]. Available: http://www.cs.wm.edu/semeru/data/codesmells/
[81] M. Tufano, et al., "When and why your code starts to smell bad," in Proc. 37th Int. Conf. Softw. Eng., 2015, pp. 403-414.
[82] E. van Emden and L. Moonen, "Java quality assurance by detecting code smells," in Proc. 9th Work. Conf. Reverse Eng., Oct. 2002, pp. 97-106.
[83] S. Vaucher, F. Khomh, N. Moha, and Y.-G. Gueheneuc, "Tracking design smells: Lessons from a study of god classes," in Proc. 16th Work. Conf. Reverse Eng., 2009, pp. 145-158.
[84] Y. Wang, "What motivate software engineers to refactor source code? Evidences from professional developers," in Proc. IEEE Int. Conf. Softw. Maintenance, 2009, pp. 413-416.
[85] B. F. Webster, Pitfalls of Object-Oriented Development, 1st ed. Buffalo, NY, USA: M & T Books, Feb. 1995.
[86] R. Wu, H. Zhang, S. Kim, and S.-C. Cheung, "ReLink: Recovering links between bugs and changes," in Proc. 19th ACM SIGSOFT Symp. Found. Softw. Eng. and 13th Eur. Softw. Eng. Conf., 2011, pp. 15-25.
[87] G. Xie, J. Chen, and I. Neamtiu, "Towards a better understanding of software evolution: An empirical study on open source software," in Proc. IEEE Int. Conf. Softw. Maintenance, 2009, pp. 51-60.
[88] A. Yamashita and L. Moonen, "Exploring the impact of inter-smell relations on software maintainability: An empirical study," in Proc. Int. Conf. Softw. Eng., 2013, pp. 682-691.
[89] A. F. Yamashita and L. Moonen, "Do code smells reflect important maintainability aspects?" in Proc. 28th IEEE Int. Conf. Softw. Maintenance, 2012, pp. 306-315.
[90] A. F. Yamashita and L. Moonen, "Do developers care about code smells? An exploratory survey," in Proc. 20th Work. Conf. Reverse Eng., 2013, pp. 242-251.
[91] A. Zeller, Why Programs Fail: A Guide to Systematic Debugging. San Mateo, CA, USA: Morgan Kaufmann, 2009.

Michele Tufano received the master's degree in computer science from the University of Salerno, Italy. He is currently working toward the PhD degree at the College of William and Mary, Virginia, under the supervision of Prof. Denys Poshyvanyk. His research interests include software engineering, mining software repositories, software quality, software maintenance and evolution, and empirical software engineering. He is a student member of the IEEE and the ACM.

Fabio Palomba received the PhD degree in computer science from the University of Salerno, Italy, in 2017. He is a postdoctoral researcher with Delft University of Technology, The Netherlands. His research interests include software maintenance and evolution, empirical software engineering, change and defect prediction, green mining, and mining software repositories. He serves and has served as a program committee member of international conferences such as MSR, ICPC, ICSME, and others. He is a member of the IEEE and the ACM.

Gabriele Bavota received the PhD degree in computer science from the University of Salerno, Italy, in 2013. He is an assistant professor with the Università della Svizzera italiana (USI), Switzerland. His research interests include software maintenance, empirical software engineering, and mining software repositories. He is the author of more than 70 papers that appeared in international journals, conferences, and workshops. He serves as a program co-chair for ICPC'16, SCAM'16, and SANER'17. He also serves and has served as an organizing and program committee member of international conferences in the field of software engineering, such as ICSE, ICSME, MSR, ICPC, SANER, SCAM, and others. He is a member of the IEEE.

Rocco Oliveto is an associate professor with the University of Molise, Italy, where he is also the chair of the computer science program and the director of the Software and Knowledge Engineering (STAKE) Lab. He co-authored more than 100 papers on topics related to software traceability, software maintenance and evolution, search-based software engineering, and empirical software engineering. His activities span various international software engineering research communities. He has served as an organizing and program committee member of several international conferences in the field of software engineering. He was program co-chair of ICPC 2015, TEFSE 2015 and 2009, SCAM 2014, and WCRE 2013 and 2012. He was also a keynote speaker at MUD 2012. He is a member of the IEEE.
Massimiliano Di Penta is an associate professor with the University of Sannio, Italy. His research interests include software maintenance and evolution, mining software repositories, empirical software engineering, search-based software engineering, and service-centric software engineering. He is the author of more than 230 papers that appeared in international journals, conferences, and workshops. He serves and has served in the organizing and program committees of more than 100 conferences such as ICSE, FSE, ASE, ICSM, ICPC, GECCO, MSR, WCRE, and others. He has been a general co-chair of various events, including the 10th IEEE Working Conference on Source Code Analysis and Manipulation (SCAM 2010), the 2nd International Symposium on Search-Based Software Engineering (SSBSE 2010), and the 15th Working Conference on Reverse Engineering (WCRE 2008). Also, he has been a program chair of events such as the 28th IEEE International Conference on Software Maintenance (ICSM 2012), the 21st IEEE International Conference on Program Comprehension (ICPC 2013), the 9th and 10th Working Conference on Mining Software Repositories (MSR 2012 and 2013), the 13th and 14th Working Conference on Reverse Engineering (WCRE 2006 and 2007), the 1st International Symposium on Search-Based Software Engineering (SSBSE 2009), and other workshops. He is currently a member of the steering committees of ICSME, MSR, and PROMISE. Previously, he was a steering committee member of other conferences, including ICPC, SCAM, SSBSE, and WCRE. He is on the editorial boards of the IEEE Transactions on Software Engineering, the Empirical Software Engineering Journal (Springer), and the Journal of Software: Evolution and Process (Wiley). He is a member of the IEEE.

Andrea De Lucia received the laurea degree in computer science from the University of Salerno, Italy, in 1991, the MSc degree in computer science from the University of Durham, United Kingdom, in 1996, and the PhD degree in electronic engineering and computer science from the University of Naples "Federico II," Italy, in 1996. He is a full professor of software engineering in the Department of Computer Science, University of Salerno, Italy, the head of the Software Engineering Lab, and the director of the International Summer School on Software Engineering. His research interests include software maintenance and testing, reverse engineering and reengineering, source code analysis, code smell detection and refactoring, defect prediction, empirical software engineering, search-based software engineering, collaborative development, workflow and document management, visual languages, and e-learning. He has published more than 200 papers on these topics in international journals, books, and conference proceedings, and has edited books and journal special issues. He also serves on the editorial boards of international journals and on the organizing and program committees of several international conferences. He was also an at-large member of the executive committee of the IEEE Technical Council on Software Engineering. He is a senior member of the IEEE and the IEEE Computer Society.

Denys Poshyvanyk received the MS and MA degrees in computer science from the National University of Kyiv-Mohyla Academy, Ukraine, and Wayne State University, in 2003 and 2006, respectively, and the PhD degree in computer science from Wayne State University, in 2008. He is an associate professor in the Department of Computer Science, College of William and Mary (W&M), Virginia, where he leads the SEMERU Research Group. His research interests include software engineering, software evolution and maintenance, and program comprehension. His recent research projects span topics such as large-scale repository mining, traceability, mobile app (Android) development and testing, performance testing, energy consumption, and code reuse. His papers received several Best Paper Awards at ICPC'06, ICPC'07, ICSM'10, SCAM'10, and ICSM'13, and ACM SIGSOFT Distinguished Paper Awards at ASE'13, FSE'15, ICSE'15, and ICPC'16. His ICSM'06 paper received the Most Influential Paper Award in 2016. He received the NSF CAREER award (2013) and was recently selected to receive the 2016 Plumeri Award for Faculty Excellence at W&M. He served on the steering committees of the IEEE International Conference on Software Maintenance and Evolution (ICSME) and the IEEE International Conference on Program Comprehension (ICPC). He also serves on the editorial boards of the Empirical Software Engineering Journal (Springer) and the Journal of Software: Evolution and Process (Wiley). He served as a PC co-chair for ICSME'16, ICPC'13, WCRE'12, and WCRE'11. He is a member of the IEEE.