Strathprints 002676
(1996)
An empirical study evaluating depth of inheritance on the maintainability
of object-oriented software. In: Empirical Studies of Programmers:
Sixth Workshop. Intellect, pp. 39-58. ISBN 9781567502626
http://eprints.cdlr.strath.ac.uk/2676/
ISERN-96-11
Abstract
This empirical research was undertaken as part of a multi-method programme of research to
investigate unsupported claims made of object-oriented technology. A series of subject-based
laboratory experiments, including an internal replication, tested the effect of inheritance
depth on the maintainability of object-oriented software. Subjects were timed performing
identical maintenance tasks on object-oriented software with a hierarchy of three levels of
inheritance depth and equivalent object-based software with no inheritance. This was then
replicated with more experienced subjects. In a second experiment of similar design, subjects
were timed performing identical maintenance tasks on object-oriented software with a
hierarchy of five levels of inheritance depth and the equivalent object-based software.
The collected data showed that subjects maintaining object-oriented software with three
levels of inheritance depth performed the maintenance tasks significantly quicker than those
maintaining equivalent object-based software with no inheritance. In contrast, subjects
maintaining the object-oriented software with five levels of inheritance depth took longer,
on average, than the subjects maintaining the equivalent object-based software (although
statistical significance was not obtained). Subjects' source code solutions and debriefing
questionnaires provided some evidence suggesting subjects began to experience difficulties
with the deeper inheritance hierarchy.
It is not at all obvious that object-oriented software is going to be more maintainable in
the long run. These findings are sufficiently important that attempts to verify the results
should be made by independent researchers.
Daly is now with the Fraunhofer Institut (IESE), Kaiserslautern, Germany
1 Introduction
Object-oriented technology has become increasingly popular as a result of anecdotal evidence
and expert intuition despite various warnings about relying only on such evidence [Bur95],
[HHL90]. Evidence must be derived from a variety of empirical techniques, the data collected
being used to substantiate findings, identify discrepancies, and act as a platform for further
investigation. Unfortunately, not enough of this research is being performed; for
object-oriented technology this means little empirical evidence exists to support many of the
claims made of it. For example, Jones [Jon94] details a visible lack of empirical data to support
the assertions of substantial gains in software productivity and quality, reduction in defect
potential (the probable number of defects from all causes that will be encountered during
development and production) and improving defect removal efficiency (the percentage of defects
removed by any operation, e.g., code inspection), and reuse of software components. Henry et al.
[HHL90] provide a list of references which they state have made claims having qualitative
appeal, but which have little supporting quantitative data.
In contrast, related research has reported that aggregation, dynamic binding, inheritance,
and polymorphism can introduce difficulties for programmers attempting to understand,
maintain, and test object-oriented software; see [CvM93], [Dvo94], [JKZ94], [KGH+94], [LMR92],
[WH92], [WMH93]. For example, Wilde and Huitt [WH92] argue that the mechanisms of
inheritance, polymorphism, and dynamic binding are responsible for the creation of delocalised
plans: pieces of code that are conceptually related but are physically located in non-contiguous
parts of the program [SPL+88]. As a consequence, although it can be relatively easy to understand
most of the data structures and member functions individually, understanding their combined
functionality can be extremely difficult [KGH+94], [LMR92]. In addition, the use of inheritance
and polymorphism can create a large number of dependencies that need to be considered within
an object-oriented program [KGH+94], [WH92]. The number of dependencies that must be
considered is far greater than in a conventional system and, as a consequence, a maintainer can
have great difficulty identifying the impact of their changes [KGH+94].
So clearly the alleged advantages and disadvantages of the technology require substantial
empirical investigation. This realisation led to a multi-method programme of research [Dal96],
[DBM+95]. The programme of research began with an exploratory investigation where
structured interviews were conducted, with both academics and industrialists, on their opinions
of the merits and failings of the object-oriented approach. The findings of this primary
investigation were used to design and implement a questionnaire on key aspects of object-oriented
systems; the intention was to confirm (or otherwise) the findings of the first phase across a much
larger and wider practitioner group. Finally, a series of subject-based laboratory experiments
were conducted, including an internal replication, which tested one of the important and most
interesting outcomes of the questionnaire survey in a more controlled setting.
This paper details the design of these experiments and describes the procedures, subjects,
tasks, and materials. Statistical tests are applied to the time data collected and these are
interpreted in conjunction with an inductive analysis to explore possible explanations of the
data. Finally, threats to internal and external validity are discussed.
The collected data shows that subjects maintaining object-oriented software with three levels
of inheritance depth performed the maintenance tasks significantly quicker than those
maintaining equivalent object-based software with no inheritance. In contrast, subjects
maintaining the object-oriented software with five levels of inheritance depth took longer, on
average, than the subjects maintaining object-based software (although statistical significance
was not obtained). Subjects' source code solutions and debriefing questionnaires provide some
evidence suggesting subjects began to experience difficulties with the deeper inheritance hierarchy.
should still seek to scale up the findings to the maintenance of larger software systems with
professional programmers.
3 Experimental design
The experiments sought to determine if inheritance depth has an effect on the maintainability
of object-oriented software. Throughout this article the following definitions apply:
Inheritance depth: the level of a class in the hierarchy where the base class is level 1.
Consequently, any class is at level n if it has n - 1 superclasses. The level of the deepest leaf
class is quoted as the depth of the hierarchy.
Maintenance: modification of a software product after delivery to correct faults, to improve
performance or other attributes, or adapt the product to a changed environment [Sch87].
Maintainability: the ease with which a software system can be corrected when errors or
deficiencies occur, and can be expanded or contracted to satisfy new requirements [Sch87].
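The inheritance depth definition above can be made concrete with a minimal C++ sketch that
computes a class's level from a child-to-parent map. The map representation and the names
`ParentMap`, `levelOf`, `hierarchyDepth`, and `universityHierarchy` are illustrative assumptions
and not part of the study materials; the class names are those of the university system used in
the first experiment.

```cpp
#include <algorithm>
#include <map>
#include <string>

// Hypothetical representation: each class name maps to its direct superclass;
// a base class maps to the empty string. Single inheritance is assumed.
using ParentMap = std::map<std::string, std::string>;

// Level of a class: the base class is level 1, so a class with
// n - 1 superclasses is at level n. Every named parent must be in the map.
int levelOf(const ParentMap& parents, const std::string& cls) {
    int level = 1;
    std::string current = cls;
    while (!parents.at(current).empty()) {
        current = parents.at(current);
        ++level;
    }
    return level;
}

// Depth of the hierarchy: the level of the deepest class.
int hierarchyDepth(const ParentMap& parents) {
    int depth = 0;
    for (const auto& entry : parents) {
        depth = std::max(depth, levelOf(parents, entry.first));
    }
    return depth;
}

// The university hierarchy as supplied to subjects (before Professor is added).
ParentMap universityHierarchy() {
    return {
        {"Univ_Community", ""},
        {"Staff", "Univ_Community"},
        {"Student", "Univ_Community"},
        {"Lecturer", "Staff"},
        {"Secretary", "Staff"},
    };
}
```

Under these assumptions, `hierarchyDepth(universityHierarchy())` evaluates to 3, which matches
the depth quoted for the systems of the first experiment.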
Maintainability can be measured in a number of ways, e.g., by using complexity metrics
or even subjective evaluations by experts. For this study, in operational terms, a difference
in maintainability is to be measured by differences in the times it takes subjects to perform
maintenance tasks. If a software system is less or more easy to understand and modify than
another, then this difference is expected to manifest itself as differences in performance times.
In a controlled experiment on the impact of software structure on maintainability, Rombach
[Rom87] reports correlations between complexity measures and staff-hour effort. The times
reported here reflect a snap-shot view of maintainability, and relative maintainability could
change as the maintenance process evolves.
Regardless of program versions in the first experiment and its internal replication, the
maintenance task described to subjects was the same. Regarding program versions, effort was made
to ensure the maintenance tasks were sufficiently similar to allow meaningful comparisons. (Table
3 suggests there were no task effects arising from the two programs.) In the second experiment,
regardless of program versions, the maintenance task described to subjects was the same.
The programs used are regarded as good representatives of the solution spaces, i.e., they are
not contrived and are assumed to resemble the solutions most programmers would adopt.
is 0.8 of a standard deviation). With pairing, the anticipation was that these power levels would
be improved upon.
3.1.1 Procedure
The first experiment was performed through a taught postgraduate conversion course in
information technology. All of the students (see Section 3.1.2) enrolled in an object-oriented
programming class using C++ which was intensively taught over a four week period with
approximately nine hours of supervised practical time every week for the first three weeks and five
hours in the last week. Students were taught the concepts of object encapsulation, inheritance,
message passing, and polymorphism, a working knowledge of which was required to complete the
maintenance tasks. Practical exercises were based on these concepts, with students designing
and implementing their own classes and inheritance relations and integrating these with existing
code.
Students consented to their practical work being used for research purposes and the practical
tests/experiments, constituting 60% of the final class mark, were conducted during the final week
of the class. For each practical test, every student was given a sheet detailing the experimental
instructions, a packet containing the maintenance task, and a second packet containing a listing
of the source code. The experimental instructions were also explained verbally at the beginning.
(The only other information given was that different versions of the program existed, stated to
reduce students' concern about their relative performance during an individual test.)
The procedure followed for each of the two practical tests was:
1. Subjects were allowed five minutes to read the instructions and ask questions. When this
time had passed and all subjects indicated they were happy with the instructions, they
were instructed to open packet 1.
2. Packet 1 contained the maintenance task the subjects were to attempt. Subjects were
given a further ten minutes to read the task and ask questions. Again, when this time had
passed and all subjects had indicated they were happy with the maintenance task, they
were instructed to open packet 2.
3. Packet 2 contained the experimental code listing. Once packet 2 was opened, data
recording began and each subject had up to 1 hour 45 minutes to complete the maintenance task
and compile and execute the code until the program output matched the required output
provided. When subjects were of the opinion that they had completed the task, a monitor
checked their work. If the output was correct, data recording was terminated; if not, the
subject was asked to continue with the modification.
Group  Experiment 1a                         Experiment 1b
A      Program 1, inheritance version        Equivalent flat version of program 2
B      Equivalent flat version of program 1  Program 2, inheritance version
Table 1: Group allocations to tasks in the first experiment
After completing the maintenance task, subjects were asked to complete a debriefing
questionnaire before leaving. The questionnaire elicited personal details, programming experience,
and impressions of the maintenance task just attempted, e.g., the overall task difficulty, what
approach to the modification was taken, and what aspect caused the most difficulty.
3.1.2 Subjects
Thirty-one students enrolled in the object-oriented programming course, all of whom had
completed a ten week class in imperative programming using Turbo Pascal. Each subject sat two
multiple choice tests (counting for the other 40% of the class mark) which assessed the
object-oriented programming knowledge they had gained from the class. The subjects were
distributed into two groups (16 subjects in group A and 15 subjects in group B) by matching pairs
of subjects on the results of these two multiple choice tests and then randomly assigning one to
each group: this pre-screening matching was performed to reduce subject variability across the
groups.
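The matched-pairs allocation described above can be sketched in C++ as follows. This is only an
illustration under stated assumptions: the `Subject` and `Groups` types, the function name
`assignMatchedPairs`, the use of a single combined score, and the rule that an unpaired subject
joins group A are all hypothetical, since the paper does not say how the odd subject was placed.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical record: a subject identifier and a combined pre-test score.
struct Subject {
    int id;
    double score;
};

struct Groups {
    std::vector<Subject> a;
    std::vector<Subject> b;
};

// Matched-pairs allocation: rank subjects by score, pair neighbours in the
// ranking, then flip a fair coin to decide which member of each pair goes to
// group A. With an odd head count the unpaired subject is placed in group A
// (an assumption made for this sketch).
Groups assignMatchedPairs(std::vector<Subject> subjects, std::mt19937& rng) {
    std::sort(subjects.begin(), subjects.end(),
              [](const Subject& x, const Subject& y) { return x.score > y.score; });

    std::bernoulli_distribution coin(0.5);
    Groups groups;
    std::size_t i = 0;
    for (; i + 1 < subjects.size(); i += 2) {
        Subject first = subjects[i];
        Subject second = subjects[i + 1];
        if (coin(rng)) std::swap(first, second);
        groups.a.push_back(first);
        groups.b.push_back(second);
    }
    if (i < subjects.size()) groups.a.push_back(subjects[i]);
    return groups;
}
```

With 31 subjects this sketch yields groups of 16 and 15, the sizes reported for groups A and B.
Pairing neighbours in the ranking keeps the two groups close in measured ability while still
leaving the assignment within each pair to chance.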
The two groups were counter-balanced across the program versions with or without
inheritance as illustrated in Table 1. Allocation in this manner ensured that all subjects performed
a maintenance task to both a flat program version and an inheritance program version.
Subjects who did not complete the task could not be included in the statistical analysis because
the nature of the study prevented subjects from continuing after the allocated time period. The
efforts made by these subjects, however, have been taken into account.
There were two programs to be modified; each was designed in an object-oriented fashion and
then implemented in C++. Both programs were simple database systems which allowed records
to be created, displayed, modified, and deleted. The first system stored information on two
types of university staff and students via the classes Lecturer, Secretary, and Student. Figure 1
displays the inheritance hierarchy for this database system. The classes Staff and Student inherit
from the Univ_Community class, Lecturer and Secretary inherit from Staff, and Professor (to
be added) inherits from Lecturer. The classes Univ_Community and Staff are abstract classes:
there are no instances of these classes, they merely have the abstract features common to the
specialisation classes. Instances of Lecturer, Secretary and Professor can receive the message
staffId, and the member function in the superclass Staff will be executed. Member functions in
any of the subclasses can manipulate the instance variables firstName, lastName, and department
by means of the appropriate member functions in the superclass. Finally, each class overrides the
member function print to implement its own version. The second system stored information on
three types of written work via classes Book, Conference, and Thesis. The inheritance hierarchy
for this system was similar to that of the university database, as was the number of fields per
class.
[Figure 1 (class diagram): Univ_Community (firstName, lastName, department; print) is the root.
Staff (staffId; print) and Student (regNumber; print) inherit from Univ_Community. Lecturer
(annualSalary; print) and Secretary (hourlyWage; print) inherit from Staff. Professor
(researchGrant; print, setResearchGrant) inherits from Lecturer.]
Figure 1: Inheritance hierarchy of database system for university staff and students.
Two versions of each system were used, a flat and an inheritance program version. The
equivalent flat program versions were created by removing all the inheritance links between the
classes in the hierarchy and adding the data members and individual member functions to each
class which had previously inherited them. Any abstract classes were then deleted, leaving a
`flattened' but equivalent version of the inheritance hierarchy. The flat program versions were
each about 390 lines of code (simple line count, including approximately 25 comment lines,
used to identify C++ constructs not class relationships, e.g., `lecturer constructor', `assign initial
values'). The inheritance program versions were each about 360 lines of code (approximately 35
comment lines). The inheritance depth for each system was three.
To test the hypothesis about the maintainability of object-oriented software, maintenance
tasks were devised which introduced new requirements (in this case, increasing the amount of
information the database could store). The subjects' task was to add a single class to their
system. A Professor class had to be added to the university system, and a Phd_Thesis class to
the library system. The Professor class was to consist of seven fields, some of which are shown
in Figure 1, and was intended to be specialised from class Lecturer. The Phd_Thesis class was
also to consist of seven different fields and was intended to be specialised from class Thesis.
These two tasks were designed to be similar. In line with common programming practices each
class was expected to have: (i) its member variables declared as private, (ii) a constructor, (iii)
a destructor, and (iv) public member functions (although the required output could be obtained
without all of these practices being adhered to). Subjects had then to create an instance of
their new class with initial and default values, modify some of these values, and then display
the object. Regardless of the program version (inheritance or flat) the maintenance task was
the same.
3.1.4 Materials
The data was automatically collected by a highly controlled environment designed specifically for
this study. Each subject was required to start a shell script which provided a workstation prompt
with their login name and the time. This script was kept running throughout the experiment
and it recorded the process the subject adopted towards the modification; this allowed the reader
of the typescript to decipher, for example, how long was spent on a particular problem.
Another shell script was introduced which, while compiling the subject's files to generate the
executable, automatically copied each file with a time stamp to a backup directory. This meant
the number of compilations could be calculated and also allowed examination of each subject's
solution as it was written and compiled from one stage to the next.
In summary, the data collected from conducting each experiment for any given subject was:
(i) the time to complete the task, (ii) automatic file backups, (iii) a script of the subject's
experimental procedure, (iv) the final version of the subject's solution, and (v) answers to the
debriefing questionnaire.
A pilot study, using four academic staff, was conducted to: (i) find introduced assumptions in
the experimental materials, (ii) find mistakes in the experimental procedure, (iii) test that the
experimental instructions were clear, (iv) check that the tasks were of reasonable complexity,
but that they could be completed well within the allotted time, (v) ensure performance of
the automated data collection techniques, and (vi) attempt to identify any other unforeseen
circumstances.
No significant issues were encountered during the pilot study, but subjects did require
clarification on several points in the instructions, e.g., two subjects mentioned that the description
of the required program output was not specific enough. The instructions were subsequently
amended to make them clearer.
Group  Internal Replication     Second Experiment
A      Inheritance version      Equivalent flat version of 5 level hierarchy
B      Equivalent flat version  Inheritance version with 5 levels
Table 2: Group allocations to tasks for the replication and second experiment
Section 3.1.2, but were blocked across their average Computer Science exam marks. The groups
were counter-balanced across program versions with or without inheritance as illustrated in Table
2. Allocation in this manner again ensured that all subjects performed a maintenance task to
both a flat program version and an inheritance program version, i.e., those that performed with
the flat program version in the replication were given the inheritance program version in the
second experiment and vice versa. The procedures, materials, and environment were the same
as they were for the first experiment (see Section 3.1).
For the internal replication the null hypothesis was stated:
H0,rep: The use of a hierarchy of 3 levels of inheritance depth does not affect the
maintainability of object-oriented programs,
to be rejected in favour of the alternative hypothesis
H1,rep: The results of the internal replication will be in the same direction as the
first experiment.
The direction specified in the hypothesis indicates the results of the replication were expected
to be similar to the results of the first experiment.
[Figure 2 (class diagram, partially recovered): classes shown are Univ_Community (firstName,
lastName, department; print), Staff (staffId; print, taxableSalary), Student (regNumber; print),
Professor (print, setResearchGrant), Senior_Technician (print, taxableSalary), Supervisor (print,
taxableSalary), and Director (office; print, taxableSalary, setOffice).]
addition, member functions were introduced so that wages and salaries could be calculated for
the university employees. Figure 2 displays the inheritance hierarchy for this system. Again,
two versions of the system were constructed: a flat program version and an inheritance program
version. The inheritance depth for this system was 5. The inheritance program version was
approximately 800 lines of code (approximately 90 comment lines), distributed in 12 classes
(again, each class was distributed in a header and implementation file) and a main file. The flat
program version, constructed in the same manner detailed in Section 3.1.3, had 3 fewer classes
(the abstract classes which were deleted), but was around 300 lines longer (approximately 80
comment lines).
The maintenance task for this more complex system was again devised to meet new
requirements. The task involved adding a new class, Director, which was expected to be specialised
from class Supervisor (as detailed in Figure 2). Once more the task required member functions
to create, modify, display, and delete instances of the class. In addition, a member function had
to be written to calculate the taxable salary for Director. Each subject then had to create an
instance of their new class and send it messages to invoke actions to meet the required program
output. For the second experiment the null hypothesis was stated:
H0,exp2: The use of a hierarchy of 5 levels of inheritance depth does not affect the
maintainability of object-oriented programs,
to be rejected in favour of the alternative hypothesis
H1,exp2: The use of a hierarchy of 5 levels of inheritance depth does affect the
maintainability of object-oriented programs: subjects maintaining the inheritance
program version will take longer than those subjects maintaining the flat program
version.
For this hypothesis a direction was provided because the depth being empirically investigated
is within the range indicated most frequently by practitioners where difficulties begin to occur
[DMB+95].
4 Experimental results
This section details subjects' mean completion times for the maintenance tasks and provides an
interpretation of the discovered timing trends.
Statistical tests were then applied. Formal skewness and kurtosis tests were performed and
found several of the data distributions to be non-normal at the 95% confidence level
(confidence intervals are provided in [BCM94]). Consequently, to be conservative, non-parametric
statistical tests were applied (although for each non-parametric test a similar result was obtained
by an alternative parametric test). A Wilcoxon signed ranks (related) test, which takes account
of the difference (positive or negative) between paired values, i.e., the performance difference
between a subject's time to complete the inheritance program version and the flat program
version, was calculated. The statistical test, based upon the 20 subjects who completed both the
flat and inheritance program versions, produced a significant result with p = 0.05 (two-tailed,
W = 46.5, N = 19, z = -1.95): 13 subjects performed better on the inheritance than the flat,
6 did the opposite, and 1 achieved the same time for both versions (and so was discounted in
the test calculation, giving N = 19). Thus we reject the null hypothesis H0,exp1 in favour of the
alternative hypothesis H1,exp1.
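The reported z value can be checked from W and N with the usual large-sample normal
approximation to the signed-rank statistic. The short C++ sketch below does this; the function
name `wilcoxonZ` is hypothetical, and the absence of any tie or continuity correction is an
assumption about how the original figure was computed.

```cpp
#include <cmath>

// Normal approximation for the Wilcoxon signed-rank statistic W with n
// non-zero paired differences:
//   mean     = n(n + 1) / 4
//   variance = n(n + 1)(2n + 1) / 24
//   z        = (W - mean) / sqrt(variance)
// No correction for ties or continuity is applied (an assumption).
double wilcoxonZ(double w, int n) {
    const double mean = n * (n + 1) / 4.0;
    const double variance = n * (n + 1) * (2.0 * n + 1.0) / 24.0;
    return (w - mean) / std::sqrt(variance);
}
```

For the reported W = 46.5 and N = 19 the mean is 95 and the variance is 617.5, so
`wilcoxonZ(46.5, 19)` gives approximately -1.95, in agreement with the value quoted above.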
It was of some concern that 11 subjects failed to complete at least one of the tasks within
the allotted time. It is important to note the significant statistical result was based on paired
observations and thus did not include any of these subjects. Of these 11 subjects, one failed to
complete both program versions, 7 subjects completed an inheritance but not a flat version, and
3 completed a flat but not an inheritance program version. On studying the questionnaires and
subjects' source code it was found that the most common reason for incompletion was that 6 of
those working with a flat version attempted to develop a solution using inheritance. Most
students who successfully completed the task when working with the flat version appeared to
use an existing class as a template. Those who attempted to introduce inheritance into the flat
version had no such template within their existing code.
Combining those who did complete with those who failed to complete within time provides
a performance ratio of 20:9 (13 + 7 = 20 subjects who did better with the inheritance version
against 6 + 3 = 9 who did better with the flat version), i.e., approximately 2 out of 3 subjects
performed better when maintaining object-oriented software with inheritance.
                         Mean time  Std. dev.  Min.  Max.  N   Inc.
Replication Flat         46.1       20.0       22    97    14  0
Replication Inheritance  35.2       17.0       14    77    13  2
Table 4: Statistical summary of the replication times (minutes)
the 0.05 level to provide confirmatory power for the first experiment result. We reject the null
hypothesis H0,rep in favour of the alternative hypothesis H1,rep.
In this experiment there were 2 incompletions. Questionnaire data suggests that these 2
subjects suffered significant problems with C++ inheritance syntax.
null hypothesis, H0,exp2, cannot be rejected. When the null hypothesis is not rejected, it is
important to consider power levels. The design had a good chance, 0.7, of detecting a large
effect but only a 0.4 probability of detecting a medium-sized effect. These power estimates
are further weakened by the need that arose to use non-parametric statistics. So a small to
medium-sized effect may well exist: the second experiment simply did not have the necessary
statistical power to draw conclusions one way or the other regarding the existence of a small to
medium-sized effect. On the other hand, the direction of the mean times has reversed for this
experiment. This is an important finding and worth exploring.
There was one incompletion in this experiment. The collected source code shows that this
subject attempted to reconstruct a complete inheritance hierarchy from his flat code, so this
incompletion does not bias the results.
[Figure 3: boxplots of completion time (minutes, axis from 0 to 120) for the Inheritance and
Flat groups in each of the three panels]
Figure 3: Boxplots of completion times for the first experiment (left), the internal replication
(centre), and the second experiment using a deeper inheritance hierarchy (right)
the second experiment, however, the flat program version had 9 classes to consider, a doubling
from the first experiment, and the inheritance program version had 12 classes to consider, again
a doubling from the first experiment. So subjects were initially faced with a search problem
which could have been solved in either a satisfying or optimising way. This search problem was
exacerbated by the fact that subjects were not provided with a conceptual model of the domain
nor supplied with a strategy on which to base a selection of superclass or copy template.
So a possible interpretation is that experiment 1 and its internal replication simply revealed
the modifiability advantage of inheritance, which was then cancelled out in the second
experiment by the more demanding search problem generated through a greater number of classes
interconnected through the inheritance mechanisms. So it need not be depth per se that would
cause a performance deterioration: a shallow but broad inheritance hierarchy could just as easily
result in a demanding search problem.
Subjects' questionnaire responses provide evidence relating to this interpretation. Table 6
indicates the number of comments or indications received under several categories for the flat
and inheritance program groups. These categories are:
1. Problems choosing superclass or class to use as copy template
2. Problems tracing in the inheritance hierarchy
3. Problems with virtual functions
4. Problems with lack of provided conceptual model
[Figure 4: plot of average completion time (minutes) for the Flat and Inheritance groups across
the First Experiment, Replication, and Second Experiment]
Figure 4: Average completion times for the first experiment, the internal replication, and the
second experiment using a deeper inheritance hierarchy
as a problem with tracing and a problem with tracing could be categorised as a problem with
choosing superclass.
             (1)  (2)  (3)  (4)  (5)  (6)  (7)  (8)  (9)
flat          1    0    1    1    6    0    1    4    1
inheritance   4    5    4    6    7    3    1    1    7
Table 6: Frequency of comments and other indicators from questionnaires
Assuming the responses are representative, Table 6 suggests that most subjects adopted an
optimising selection strategy but that it took the inheritance program group typically at least
another 5 minutes to make the selection. Moreover, 3 subjects from the inheritance program
group settled on the wrong class despite claiming to have used an optimising strategy and 5
subjects from the inheritance program group essentially invented conceptual models of the
domain that meant they chose not to use the superclass with most in common. In contrast to
the first experiment and its internal replication, where there was considerable (but not complete)
agreement on choice of superclass, in the second experiment inheritance program group subjects
can be categorised into three groups: subjects inheriting from class Lecturer (2 subjects),
subjects inheriting from class Staff (9 subjects), and subjects inheriting from the `most in common'
class Supervisor (4 subjects). So there is a reasonably clear indication of how much more
demanding the search problem was for the inheritance program group. We believe our data also
supports Dvorak's ideas on conceptual entropy [Dvo94]: all systems that are frequently changed
characteristically tend towards disorder, a term recognised as entropy. In object-oriented systems
"conceptual entropy is manifested by increasing conceptual inconsistency as we travel
down the hierarchy. That is, the deeper the level of the hierarchy, the greater the
probability that a subclass will not consistently extend and/or specialise the concept
of its superclass." [Dvo94].
Dvorak identified this concept through an experiment where subjects were to construct a class
hierarchy from class specifications: the deeper the hierarchy got, the less agreement there was
between subjects about a class's placement in the hierarchy. Essentially, a similar effect has
been found here.
But this search problem alone is not enough to explain away all the worsening in performances
for the inheritance program group.
Table 6 indicates that most subjects in the inheritance program group reported that they
suffered from problems due to tracing or virtual functions. (The data for categories (2) and
(3) are for separate subjects.) These problems probably go some way toward explaining the
rest of the worsening in performances for the inheritance program group. As noted earlier,
one difficulty that affects program understanding, and hence maintenance, is the presence of
delocalised plans, where pieces of code that are conceptually related are physically located in
non-contiguous parts of the program. According to Wilde et al., the mechanism of inheritance
creates further opportunities for delocalisation [WMH93]. One such related difficulty is that
understanding a single line of code may require tracing a line of method invocations through
an inheritance hierarchy. In a shallow hierarchy this may not represent a large overhead, but
as the hierarchy becomes deeper the overhead is likely to increase. In the case of a maintainer
who wants to view the actual implementation of a method, tracing the line of invocations to its
source must be conducted. Such tracing may have affected some subjects' maintenance times.
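The tracing overhead described above can be illustrated with a small C++ sketch. The five-level
chain below is entirely hypothetical (the names `Level1` to `Level5`, `describe`, and `render`
are not taken from the study's programs); it only shows how the meaning of one call is spread
over several class bodies.

```cpp
#include <string>

// Hypothetical five-level chain; the names are illustrative only.
struct Level1 {
    virtual ~Level1() = default;
    virtual std::string describe() const { return "L1"; }
};

struct Level2 : Level1 {
    std::string describe() const override { return Level1::describe() + ">L2"; }
};

// Level3 inherits describe() unchanged, so a reader looking here finds
// nothing and has to move up to Level2.
struct Level3 : Level2 {};

struct Level4 : Level3 {
    std::string describe() const override { return Level3::describe() + ">L4"; }
};

struct Level5 : Level4 {
    std::string describe() const override { return Level4::describe() + ">L5"; }
};

// To know what this single line produces for a Level5 object, a maintainer
// must read Level5, Level4, Level3 (which defines nothing), Level2, and Level1.
std::string render(const Level1& object) { return object.describe(); }
```

In a three-level hierarchy the same call would involve at most three class bodies, which is one
way to picture why the overhead is expected to grow with depth.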
Note that no subject from the rst experiment or its internal replication commented on
problems choosing superclass/class to use as a copy template or problems tracing through the
hierarchy.
The subject who took by far the longest time in the inheritance program group made several
tries at different points in the hierarchy. If this subject's datum is excluded, the average time for
the inheritance program group would almost match the average time for the flat program group.
We have no reason to exclude this datum: the subject's behaviour is a particularly poignant
example of conceptual entropy.
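The sensitivity of a small group's mean to one very long completion time can be sketched as
follows. This is an illustration only: the helper names are hypothetical, and the figures
quoted afterwards are invented for the example and are not the study's data.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a set of completion times (0.0 for an empty set).
double mean(const std::vector<double>& times) {
    if (times.empty()) return 0.0;
    const double sum = std::accumulate(times.begin(), times.end(), 0.0);
    return sum / static_cast<double>(times.size());
}

// Mean after dropping one occurrence of the largest time.
// Takes its argument by value so the caller's data are left untouched.
double meanWithoutMax(std::vector<double> times) {
    if (times.empty()) return 0.0;
    times.erase(std::max_element(times.begin(), times.end()));
    return mean(times);
}
```

With invented times of 40, 45, 50 and 125 minutes, mean() gives 65 while meanWithoutMax()
gives 45: a single subject accounts for the whole difference. Such a comparison shows how
much one datum matters; it does not by itself justify excluding it.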
To summarise:
1. Program size is discounted as a predictor of performance.
2. The inheritance program group working with a deeper inheritance hierarchy had a more
demanding search problem when choosing a superclass, but this alone does not account
for the relative deterioration in performances.
3. Conceptual entropy will arise when programmers are forced to create their own conceptual
models of the domain or if they are given a free choice between satisfying and optimising
strategies when specialising in an inheritance hierarchy.
4. Problems with tracing and virtual functions go some way toward explaining the deterio-
rations in performances.
5. The more demanding search problem and the problems with tracing and virtual functions,
together, probably explain the general relative deterioration in performances.
5 Threats to validity
5.1 Threats to internal validity
A major concern within any empirical study is that an unobserved independent variable is
exerting control over the dependent variable(s), a possibility which must be minimised. Three such
threats have been identified: (i) selection effects, (ii) maturation effects, and (iii) instrumentation
effects.
1. Selection effects are due to natural variations in subject performance (see, e.g., [Bro80]).
An example of this is presented in [DBM+94] where the majority of `high ability' subjects
were randomly assigned to one of two groups, something which obviously biased the results
of the study. Such bias was catered for in this study by creating subject groups of equal
ability (as detailed in Sections 3.1.2 and 3.2).
2. Maturation or learning effects are caused by subjects learning as an experiment proceeds.
The threat here was that subjects would learn from the first run and that their performance
on the second run would be biased. The data were analysed for this and no significant effect
was found.
3. Instrumentation effects may result from differences in the experimental materials employed.
In this study such effects were likely to arise from differences in the presented software
systems and maintenance tasks. Although an explicit attempt was made to ensure as
much similarity as possible, such variation can be difficult to avoid. The collected timing
data for the first experiment are very similar across the two runs; the internal replication
repeated these results. This increases confidence that any such effect was minimised.
Instrumentation effects also appear minimal between the replication and second experiment:
the increase of mean time for the inheritance group would have been similar for the flat
group otherwise.
So there is no evidence suggesting that these threats to internal validity have impacted on the
results of the study.
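One common way of forming two groups of near-equal ability is to rank subjects on a prior
measure and deal them out in A, B, B, A order. The sketch below assumes a single numeric
ability score per subject; it is offered as an illustration and is not necessarily the
procedure described in Sections 3.1.2 and 3.2. The Subject type and both function names are
hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical subject record; 'ability' stands in for whatever prior
// measure (e.g. course marks) the experimenter has available.
struct Subject {
    int id;
    double ability;
};

// Rank subjects by ability (highest first) and assign them to two
// groups in A, B, B, A order so that neither group collects the top ranks.
std::pair<std::vector<Subject>, std::vector<Subject>>
balancedGroups(std::vector<Subject> subjects) {
    std::sort(subjects.begin(), subjects.end(),
              [](const Subject& a, const Subject& b) {
                  return a.ability > b.ability;
              });
    std::vector<Subject> groupA;
    std::vector<Subject> groupB;
    for (std::size_t i = 0; i < subjects.size(); ++i) {
        // Ranks 0 and 3 of each block of four go to A; ranks 1 and 2 go to B.
        const std::size_t pos = i % 4;
        if (pos == 0 || pos == 3) {
            groupA.push_back(subjects[i]);
        } else {
            groupB.push_back(subjects[i]);
        }
    }
    return {groupA, groupB};
}

// Mean ability of a group (0.0 for an empty group).
double meanAbility(const std::vector<Subject>& group) {
    if (group.empty()) return 0.0;
    double sum = 0.0;
    for (const Subject& s : group) sum += s.ability;
    return sum / static_cast<double>(group.size());
}
```

Comparing meanAbility() for the two returned groups gives a quick check that the split is
balanced. Purely random assignment, by contrast, can place most high-ability subjects in one
group, which is the selection effect reported in [DBM+94].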
6 Conclusions
This empirical study should be of interest to those designing and maintaining object-oriented
software. The results suggest that when it is obvious which class should be used to specialise
from and when little tracing up through hierarchies is demanded, then inheritance provides
gains in modifiability, i.e., object-oriented software is more maintainable than equivalent object-
based software. Conversely, when conceptual entropy exists (when either the conceptual
model of the domain has not been provided or other strategies for specialising have not
been specified, e.g., `most in common') and when tracing up through hierarchies is required for
sound comprehension, then modifiability gains are cancelled out, i.e., object-oriented software is
no more maintainable than object-based software. One of our subjects provided a particularly
poignant example of conceptual entropy by attempting specialisations at several points in the
hierarchy.
An interpretation based solely on our experimental hypotheses would be misleading. Deteriorating
performances were not simply down to depth and increased tracing difficulties: conceptual
entropy also played a part and one could imagine shallow and broad hierarchies suffering from
conceptual entropy as much as narrow and deep. So the inductive analysis was a vital component
of our research.
While threats to the external validity have been identified, it is argued that because the
results have been confirmed across a multi-method programme of research, these threats are
reduced. Subsequent experimentation, however, should make use of larger software systems
using professional programmers as subjects. Such experimentation might also consider other
categories of maintenance and other aspects of the overall maintenance process. It is not at all
obvious that object-oriented software is going to be more maintainable in the long run.
Acknowledgements
The authors wish to acknowledge the efforts of those who participated in the experiments.
Thanks are extended to Pete Hendry and Dave Lloyd for their technical assistance.
References
[BCM94] A. Brooks, D. Clarke, and P. McGale. Investigating stellar variability by normality
tests. Vistas in Astronomy, 38:377-399, 1994.
[Bro80] R. Brooks. Studying programmer behavior experimentally: The problems of proper
methodology. Communications of the ACM, 23(4):207-213, April 1980.
[Bur95] A. Burgess. Finding an experimental basis for software engineering. IEEE Software,
28(3):92-93, 1995.
[CCKT83] J. Chambers, W. Cleveland, B. Kleiner, and P. Tukey. Graphical methods for data
analysis. Wadsworth International Group, first edition, 1983.
[Cha88] A. Chapanis. Some generalisations about generalisation. Human Factors,
30(3):253-267, 1988.
[CK94] S. Chidamber and C. Kemerer. A metrics suite for object-oriented design. IEEE
Transactions on Software Engineering, 20(6):476-493, June 1994.
[Cur86] B. Curtis. By the way, did anyone study any real programmers? In E. Soloway
and S. Iyengar, editors, Empirical Studies of Programmers: First Workshop, pages
256-262. Ablex Publishing Corporation, 1986.
[CvM93] R. Crocker and A. von Mayrhauser. Maintenance support needs for object-oriented
software. In Proceedings of the International Computer Software and Applications
Conference, pages 63-69, November 1993.
[Dal96] J. Daly. Replication and a Multi-Method Approach to Empirical Software Engineering
Research. PhD thesis, Department of Computer Science, University of Strathclyde,
Glasgow, 1996.
[DBM+94] J. Daly, A. Brooks, J. Miller, M. Roper, and M. Wood. Verification of results
in software maintenance through external replication. In Proceedings of the IEEE
International Conference on Software Maintenance, pages 50-57, September 1994.
[DBM+95] J. Daly, A. Brooks, J. Miller, M. Roper, and M. Wood. A multi-method approach
to performing empirical research. Software Engineering Technical Council (TCSE)
Newsletter, 14(1):SPN10-12, Fall 1995.
[DMB+95] J. Daly, J. Miller, A. Brooks, M. Roper, and M. Wood. Issues on the object-oriented
paradigm: A questionnaire survey. Research report EFoCS-8-95, Department of
Computer Science, University of Strathclyde, Glasgow, 1995.
[Dvo94] J. Dvorak. Conceptual entropy and its effect on class hierarchies. IEEE Computer,
27(6):59-63, June 1994.
[DWB+95] J. Daly, M. Wood, A. Brooks, J. Miller, and M. Roper. Structured interviews on the
object-oriented paradigm. Research report EFoCS-7-95, Department of Computer
Science, University of Strathclyde, Glasgow, 1995.
[Fos91] J. Foster. Program lifetime: A vital statistic for maintenance. In Proceedings of the
IEEE Conference on Software Maintenance, pages 98-103, 1991.
[HHL90] S. Henry, M. Humphrey, and J. Lewis. Evaluation of the maintainability of
object-oriented software. In IEEE Conference on Computer and Communication Systems,
pages 404-409, September 1990.
[JKZ94] P. Juttner, S. Kolb, and P. Zimmerer. Integrating and testing of object-oriented
software. In Proceedings of the European Conference on Software Testing, Analysis,
and Review, pages 13/1-13/14. Siemens AG, 1994.
[Jon94] C. Jones. Gaps in the object-oriented paradigm. IEEE Computer, 27(6):90-91, June
1994.
[KGH+94] D. Kung, J. Gao, P. Hsia, F. Wen, Y. Toyoshima, and C. Chen. Change impact
identification in object-oriented software maintenance. In Proceedings of the IEEE
International Conference on Software Maintenance, pages 202-211, September 1994.
[LHKS92] J. Lewis, S. Henry, D. Kafura, and R. Schulman. On the relationship between the
object-oriented paradigm and software reuse: An empirical investigation. Journal
of Object-Oriented Programming, 5(4):35-41, 1992.
[Lip90] Mark W. Lipsey. Design Sensitivity, Statistical Power for Experimental Research.
SAGE Publications, 1990.
[LMR92] M. Lejter, S. Meyers, and S. Reiss. Support for maintaining object-oriented
programs. IEEE Transactions on Software Engineering, SE-18(12):1045-1052,
December 1992.
[PB94] C. Ponder and B. Bush. Polymorphism considered harmful. ACM SIGSOFT,
Software Engineering Notes, 19(2):35-37, April 1994.
[PVB95] A. Porter, L. Votta, and V. Basili. Comparing detection methods for software
requirements inspections: A replicated experiment. IEEE Transactions on Software
Engineering, 21(6):563-575, June 1995.
[Rom87] H. D. Rombach. A controlled experiment on the impact of software structure on
maintainability. IEEE Transactions on Software Engineering, 13(3):344-354, March
1987.
[Sch87] N. Schneidewind. The state of software maintenance. IEEE Transactions on
Software Engineering, SE-13(3):303-310, 1987.
[Sch95] S. Schneberger. Software maintenance in distributed computer environments:
System complexity versus component simplicity. In Proceedings of IEEE International
Conference on Software Maintenance, pages 304-313, 1995.
[Ski92] M. Skinner. The C++ primer: a gentle introduction to C++. Silicon Press and
Prentice Hall, first edition, 1992.
[SPL+88] E. Soloway, J. Pinto, S. Letovsky, D. Littman, and R. Lampert. Designing
documentation to compensate for delocalized plans. Communications of the ACM,
31(11):1259-1267, 1988.
[Til91] D. Tiller. Experimental design and analysis. In N. Fenton, editor, Software Metrics:
A Rigorous Approach, pages 63-78. Chapman and Hall, 1991.
[WH92] N. Wilde and R. Huitt. Maintenance support for object-oriented programs. IEEE
Transactions on Software Engineering, SE-18(12):1038-1044, December 1992.
[WMH93] N. Wilde, P. Matthews, and R. Huitt. Maintaining object-oriented software. IEEE
Software, 10(1):75-80, 1993.