Mills, Harlan D.; Dyer, M.; and Linger, R. C., "Cleanroom Software Engineering" (1987). The Harlan D. Mills Collection.
https://trace.tennessee.edu/utk_harlan/18
QUALITY ASSURANCE

Cleanroom Software Engineering
Harlan D. Mills, Information Systems Institute
Michael Dyer and Richard C. Linger, IBM Federal Systems Division
statistical quality can be measured. For even the simplest of products, there is no absolute best statistical measure of quality. For example, a statistical average can be computed many ways - an arithmetic average, a weighted average, a geometric average, and a reciprocal average can each be better than the others in various circumstances.
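To make the point concrete, here is a small worked example with invented numbers (not data from the article). For three failure-free execution intervals of 1, 4, and 16 hours,

    arithmetic average = (1 + 4 + 16) / 3 = 7 hours
    geometric average = (1 x 4 x 16)^(1/3) = 4 hours
    reciprocal (harmonic) average = 3 / (1/1 + 1/4 + 1/16) = 2.3 hours (approximately)

and a weighted average can land anywhere among these, depending on the weights chosen. Each summarizes the same data differently, and none is best in every circumstance.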
It finally comes down to a judgment of business and management - in every case. In most cases, the judgment is practically automatic from experience and precedent, but it is a judgment. In the case of software, that judgment has no precedent because the concept of producing software under statistical quality control is just at its inception.

A new basis for the certification of software quality, given in Currit, Dyer, and Mills,1 is based on a new software-engineering process.4 This basis requires a software specification and a probability distribution on scenarios of the software's use; it then defines a testing procedure and a prescribed computation from test data results to provide a certified statistical quality of delivered software.
This new basis represents scientific and engineering judgment of a fair and reasonable way to measure statistical quality of software. As for simpler products, there is no absolute best and no logical arguments for it beyond business and management judgment. But it can provide a basis for software statistical quality as a contractual item where no such reasonable item existed before.

The certification of software quality is given in terms of its measured reliability over a probability distribution of usage scenarios in statistical testing. Certification is an ordinary process in business - even in the certification of the net worth of a bank. As in software certification, there is a fact-finding process, followed by a prescribed computation.

In the case of a bank, the fact-finding produces assets and liabilities, and the computation subtracts the sum of the liabilities from the sum of the assets. For the bank, there are other measures of importance besides net worth - such as goodwill, growth, and security of assets - just as there are other measures for software than reliability - such as maintainability and performance. So a certification of software quality is a business measure, part of the overall consideration in producing and receiving software.

Once a basis for measuring statistical quality of delivered software is available, creating a management process for statistical quality control is relatively straightforward. In principle, the goal is to find ways to repeatedly rehearse the final measurement during software development and to modify the development process, where necessary, to achieve a desired level of statistical quality.

The Cleanroom process has been designed to carry out this principle. It calls for the software to be developed in increments that permit realistic measurements of statistical quality during development, with provision for improving the measured quality by additional testing, by process changes (such as increased inspections and configuration control), or by both methods.
Mathematical verification

Software engineering without mathematical verification is no more than a buzzword. When Dijkstra introduced the idea of structured programming at an early software-engineering conference,5 his principal motivation was to reduce the length of mathematical verifications of programs by using a few basic control structures and eliminating gotos.

Many popularizers of structured programming have cut out the rigorous part about mathematical verification in favor of the easy part about no gotos. But by cutting out the rigorous part, they have also cut out much of the real benefit of structured programming. As a result, a lot of people have become three-day wonders in having no gotos without acquiring the fundamental discipline of mathematical verification in engineering software - or even discovering that such a discipline exists.

In contrast, learning the rigor of mathematical verification leads to behavioral modification in both individuals and teams of programmers, whether programs are verified formally or not. Mathematical verification requires precise specifications and formal arguments about the correctness with respect to those specifications.

Two main behavioral effects are readily observable. First, communication among programmers (and managers) becomes much more precise, especially about program specifications. Second, a premium is placed on the simplest programs possible to achieve specified function and performance.

If a program looks hard to verify, it is the program that should be revised, not the verification. The result is high productivity in producing software that requires little or no debugging.

Cleanroom software engineering uses mathematical verification to replace program debugging before release to statistical testing. This mathematical verification is done by people, based on standard software-engineering practices such as those taught at the IBM Software Engineering Institute. We find that human verification is surprisingly synergistic with statistical testing - that mathematical fallibility is very different from debugging fallibility and that errors of mathematical fallibility are much easier to discover in statistical testing than are errors of debugging fallibility.

Perhaps one day automatic verification of software will be practical. But there is no need to wait for the engineering value and discipline of mathematical verification until that day.

Experimental data from projects where both Cleanroom verification and more traditional debugging techniques were used offers evidence that the Cleanroom-verified software exhibited higher quality. For the verified software, fewer errors were injected, and these errors were less severe and required less time to find and fix. The verified product also experienced
better field quality, all of which was due to the added care and attention paid during design.

Findings from an early Cleanroom project (where verified software accounted for approximately half the product's function) indicate that verified software accounted for only one fourth the error count. Moreover, the verified software was responsible for less than 10 percent of the severe failures. These findings substantiate that verified software contains fewer defects and that those defects that are present are simpler and have less effect on product execution.
The method of human mathematical verification used in Cleanroom development, called functional verification, is quite different than the method of axiomatic verification usually taught in universities. It is based on functional semantics and on the reduction of software verification to ordinary mathematical reasoning about sets and functions as directly as possible.

The motivation for functional verification and for the earliest possible reduction of verification reasoning to sets and functions is the problem of scaling up. A set or function can be described in three lines of ordinary mathematics notation or in 300 lines of English text. There is more human fallibility in 300 lines of English than in three lines of mathematical notation, but the verification paradigm is the same.

By introducing verification in terms of sets and functions, you establish a basis for reasoning that scales up. Large programs have many variables, but only one function. Mills and Linger6 gave an additional basis for verifying large programs by designing with sets, stacks, and queues rather than with arrays and pointers.

While initially harder to teach than axiomatic verification, functional verification scales up to reasoning for million-line systems in top-level design as well as for hundred-line programs at the bottom level. The evidence that such reasoning is effective is in the small amount of backtracking required in very large systems designed top-down with functional verification.7
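To suggest the flavor of functional verification, here is a minimal sketch of our own, written in Python for concreteness (Cleanroom designs were expressed in design languages, not in executable code, so everything below is an illustrative assumption). The intended function is stated as a rule on sets of values, and the loop is argued correct against that rule by reasoning rather than by testing:

    # Intended program function, stated as a set of ordered pairs:
    #   f = { ((x, y), (q, r)) : x = q*y + r and 0 <= r < y }, for x >= 0, y > 0
    def divide(x: int, y: int) -> tuple:
        assert x >= 0 and y > 0
        q, r = 0, x                # establishes the invariant x == q*y + r
        while r >= y:              # iteration primitive
            q, r = q + 1, r - y    # preserves x == q*y + r; r strictly decreases
        # At exit, x == q*y + r and 0 <= r < y, which is exactly the intended
        # function f - verified by reasoning about the loop, not by running it.
        return q, r

The intended function takes three lines of mathematics; an English description of the same behavior would run far longer, which is the scaling-up point made above.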
Cleanroom software engineering

While it may sound revolutionary at first glance, the Cleanroom software engineering process is an evolutionary step in software development. It is evolutionary in eliminating debugging because, over the past 20 years, more and more program design has been developed in design languages that must be verified rather than executed. So the relative effort for advanced teams in debugging, compared to verifying, is now quite small, even in non-Cleanroom development.

It is evolutionary in statistical testing because with higher quality programs at the outset, representative-user testing is correspondingly a greater and greater fraction of the total testing effort. And, as already noted, we have found a surprising synergism between human verification and statistical testing: People are fallible with human verification, but the errors they leave behind for system testing are much easier to find and fix than those left behind from debugging.

Results from an early Cleanroom project where verification and debugging were used to develop different parts of the software indicate that corrections to the verified software were accomplished in about one fifth the average time of corrections to the debugged software. In the verified software case, the developers essentially never resorted to debugging (less than 0.1 percent of the cases) to isolate and repair reported defects.

The feasibility of combining human verification with statistical testing makes it possible to define a new software-engineering process under statistical quality control.1 For that purpose, we define a new development life cycle of successive incremental releases to achieve a structured specification of function and statistical usage.
A structured specification is a formal specification (a relation or set of ordered pairs) for a decomposition into a nested set of subspecifications for successive product releases. A structured specification defines not only the final software but also a release plan for its incremental implementation and statistical testing.

A stepwise refinement or decomposition of requirements creates successive levels of software design. At each level of decomposition, mathematics-based correctness arguments ensure the accuracy of the evolving design and the continued integrity of the product requirements. The work strategy is to create specifications and the design to those specifications, as well as to check the correctness of that design before proceeding to the next decomposition.

The Cleanroom design methods use a limited set of design primitives to capture software logic (sequence, selection, and iteration). They use module and procedure primitives to package software designs into products. Decomposition of software data requirements is handled by a companion set of data-structuring primitives (sets, stacks, and queues) that ensure product designs with strongly typed data operations. Specially defined design languages document designs and provide a straightforward translation to standard programming forms.
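As a rough sketch of what a design restricted to these primitives might look like, consider the following illustration of ours in Python (the request names and the queue-based example are invented, not taken from the article):

    from collections import deque

    def serve(requests: deque) -> list:
        responses = []                   # sequence: initialize state
        while requests:                  # iteration: drain the queue
            req = requests.popleft()     # queue primitive: first in, first out
            if req.startswith("READ"):   # selection: branch on request kind
                responses.append("data:" + req)
            else:
                responses.append("ack:" + req)
        return responses                 # sequence: deliver the results

    print(serve(deque(["READ x", "WRITE y"])))   # ['data:READ x', 'ack:WRITE y']

Restricting logic to sequence, selection, and iteration, and data to disciplined structures like the queue above, is what keeps each design step small enough to verify.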
In the Cleanroom model, structural testing that requires knowledge of the design is replaced by formal verification, but functional testing is retained. In fact, this testing can be performed with the two goals of demonstrating that the product requirements are correctly implemented in the software and of providing a basis for product-reliability prediction. The latter is a unique Cleanroom capability that results from its statistical testing method, which supports statistical inference from the test environment to operating environments.

The Cleanroom life cycle of incremental product releases supports software testing throughout the product development rather than only when it is completed. This allows the continuous assessment of product quality from an execution perspective and permits any necessary adjustments in the process to improve observed product quality.

As each release becomes available,
statistical testing provides statistical estimates of its reliability. Software process analysis and feedback can be used to meet reliability goals (for example, by increased verification inspections and by more intermediate specification formality) for subsequent releases. As errors are found and fixed during system testing, the growth in reliability of the maturing system can also be estimated so a certified reliability estimate of the system-tested software can be provided at final release.

Cho8 has also proposed the development of software under statistical quality control, using as a measure the ratio of correct outputs to total outputs. He regards software as a factory for producing output, rather than for producing a product itself. The ratio of correct outputs to total outputs is directly related to the mean time between failures, where time is normalized to output production. Such a normalization is one possibility in the Cleanroom process, but other normalizations may be more meaningful in most system applications.
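The relation is easy to see with an invented example (our arithmetic, not Cho's): if a fraction r of all outputs is correct and failures strike outputs independently, then, with time measured in outputs produced,

    MTBF = 1 / (1 - r) outputs (approximately),

so r = 0.999 corresponds to a mean time between failures of about 1000 outputs.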
A principal difference between the Cleanroom and Cho's ideas is the use of a certification model to account for the growth in reliability during development. Another major difference is an insistence on human mathematical verification with no program debugging before representative-user testing at the system level. As Mills discussed,9 human mathematical verification is possible and practical at high production rates. The time spent on verification can be less than the time spent on debugging.

Statistical basis

Software people customarily talk about errors in the software, typically measured in errors per thousand lines of code. Current postdelivery levels in ordinary software are one to 10 errors per thousand lines. Good methodology produces postdelivery levels under one error per thousand lines. But such numbers are irrelevant and misleading when you consider software reliability. Users do not see errors in the software, they see failures in execution, so the measurement of times between failures is more relevant.

If each error had the same or similar failure rate, there would be a direct relationship between the number of errors in software and the time between failures in its execution. Half as many errors would mean half the failure rate and twice the mean time between failures. In this case, efforts to reduce errors would automatically increase reliability.
It turns out that every major IBM software product - without exception - has an extremely high error-failure rate variation. In stable released products, these failure rates run from 18 months between failures to more than 5000 years. More than half the errors have failure rates of more than 1500 years between failures. Fixing these errors will reduce the number of errors by more than half, but the decrease in the product failure rate will be imperceptible. More precisely, you could remove more than 60 percent of the errors but only decrease the failure rate by less than 3 percent.

These surprising refutations of conventional wisdom in software reliability are due to data painstakingly developed over many years by Adams.10
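A miniature of the arithmetic involved (with invented counts, chosen only to echo the flavor of the Adams data): a product's failure rate is the sum of its errors' individual rates. Suppose 100 errors each fail once per 1500 years and one error fails once per month:

    failure rate = 100 x (1/1500) + 12 = 12.07 failures per year (approximately).

Removing all 100 rare errors eliminates 99 percent of the error count but cuts the failure rate by only about half of one percent.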
To be more precise about software errors and failures, assume that a specification and its software exist. Then, when the software is executed, its behavior can be compared with its specification and any discrepancies (failures) noted. Such failures may be catastrophic and prevent further execution (for example, by abnormal termination). Other failures may be so serious that every response from then on is incorrect (for example, if a database is compromised). Less serious failures represent the case in which the software continues to execute with at least partially correct behavior beyond the failure.

These examples illustrate that failures represent different levels of severity, beginning with three major levels:

* terminating failures,
* permanent failures (but not terminating), and
* sporadic failures.

Even terminating or permanent failures may be followed by a restart of the software, so you can imagine a long history of execution and, in this history, the failures marked at each instant of time. Clearly, this history will depend on the software's initial conditions (and data) and on the subsequent inputs (as commands and data) to it. Such a history can be very arbitrary, but suppose for argument's sake that representative histories (scenarios of use) are conceivable.

The behavior of software is deterministic in that repeating an initial condition and history of use will reproduce the same outputs (with the same failures). But, in fact, if software is used in more than one history by more than one user, the histories of use will usually be different. For that reason, we consider as part of a structured specification a probability distribution of usage histories, typically defined as a stochastic process.

This probability distribution of usage histories will, in turn, induce a probability distribution of failure histories in which statistics about times between failures, failure-free intervals, and the like can be defined and estimated. So, even though software behavior is deterministic, its reliability can be defined relative to its statistical usage. Such a probability distribution of usage histories provides a statistical basis for software quality control.
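Here is a minimal sketch of testing against such a usage distribution - the command mix, the failure condition, and the software stand-in are all invented for illustration, in Python:

    import random

    USAGE = {"query": 0.70, "update": 0.25, "restart": 0.05}  # assumed usage distribution

    def execute(history):
        # Stand-in for running the software on one usage history; True
        # means its behavior matched the specification throughout.
        return history.count("restart") < 2   # hypothetical failure condition

    random.seed(1987)
    trials, failures = 2000, 0
    for _ in range(trials):
        history = random.choices(list(USAGE), weights=USAGE.values(), k=50)
        if not execute(history):
            failures += 1

    # Estimated probability that a 50-command history is failure-free:
    print(1 - failures / trials)

From the same runs, the intervals between failures can be averaged to estimate mean time to failure under representative usage, which is the certification measure discussed next.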
Certifying statistical quality

For software already released, it is simple to estimate its reliability in mean times to failure: Merely take the average of its times between failure in statistical testing. However, for software under Cleanroom development, the problem is more complicated, for two reasons:

1. In each Cleanroom increment, results of system testing may indicate software changes to correct failures found.

2. With each Cleanroom increment release, untested new software will be added to software already under test.

In fact, each change or set of changes to correct failures in a release creates a new
software product very much like its predecessor but with a different reliability (intended to be better, but possibly worse). However, each of these corrected software products, by itself, will be subject to a strictly limited amount of testing before it is superseded by its successor, and statistical estimates of reliability will be correspondingly limited in confidence.

Therefore, to aggregate the testing experience for an increment release, we define a model of reliability change with parameters M and R (as discussed in Currit, Dyer, and Mills1) for the mean time to failure after c software changes, of the form MTTF = M·R^c, where M is the initial mean time to failure of the release and where R is the observed effectiveness ratio for improving mean time to failure with software changes.
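A small numerical sketch of how the model is used (M and R below are invented values; the prescribed estimators for them are those of Currit, Dyer, and Mills1):

    import math

    M, R = 20.0, 1.3   # assumed: 20-hour initial MTTF, 30 percent gain per change

    def mttf(c):
        return M * R**c          # mean time to failure after c corrective changes

    # Smallest number of changes before a 500-hour MTTF goal can be certified:
    goal = 500.0
    c = math.ceil(math.log(goal / M) / math.log(R))
    print(c, round(mttf(c)))     # 13 changes -> about 606 hours

In practice, of course, M and R are not known in advance but are estimated from the test data, as described next.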
Although various technical rationales are given for this model by Currit, Dyer, and Mills,1 it should be considered a contractual basis for the eventual certification of the finally released software by the developer to the user. Moreover, because there is no way to know that the model parameters M and R are absolutely correct, we define statistical estimators for them in terms of the test data. The choice of these estimators is based on statistical analysis, but the choice should also be a contractual basis for certification.

The net result of these two contractual bases - a reliability change model and statistical estimators for its parameters - gives producer and receiver (seller and purchaser) a common, objective way to certify the reliability of the delivered software. The certification is a scientific, statistical inference obtained by a prescribed computation on test data warranted to be correct by the developer.

In principle, the estimators for software reliability are no more than a sophisticated way to average the times between failure, taking into account the change activity called for during statistical testing. As test data materializes, the reliability can be estimated, even change by change. And with successful corrections, the reliability estimates will improve with further testing, providing objective, quantitative evidence of the achievement of reliability goals.

This objective evidence is itself a basis for management control of the software development to meet reliability goals. For example, process analysis may reveal both unexpected sources of errors (such as poor understanding of the underlying hardware) and appropriate corrections in the process itself for later increments. Intermediate rehearsals of the final certification provide a basis for management feedback to meet final goals.

The treatment of separate increment releases should also be part of the contractual basis between the developer and user. Perhaps the simplest treatment is to treat separate increments independently. However, more statistical confidence in the final certification will result from aggregate testing experience across increments. A simple aggregation could complement separately treated increments with management judgment.

A more sophisticated treatment of separate releases would be to model the failure contribution of each newly released part of the software and to develop stratified estimators release by release. Earlier releases can be expected to mature while later releases come under test. This maturation rate in reliability improvement can be used to estimate the amount of test time required to reach prescribed reliability levels.

Mean time to failure and the rate of change in mean time to failure can be useful decision tools for project management. For software under test, which has both an estimated mean time to failure and a known rate of change in mean time to failure, decisions on releasability can be based on an evaluation of life-cycle costs rather than on just marketing effect.

When the software is delivered, the average cost for each failure must include both the direct costs of repair and the indirect costs to the users (which may be much larger). These postdelivery costs can be estimated from the number of expected failures and compared with the costs for additional predelivery testing. Judgments could then be made about the profitability of continuing tests to minimize lifetime costs.
Richard C. Linger is a senior programming manager of software engineering studies in the IBM Federal Systems Division. Linger received a BS in electrical engineering from Duke University. He is a member of the ACM and the Computer Society of the IEEE.

Mills can be reached at the Information Systems Institute, 2770 Indian River Blvd., Vero Beach, FL 32960. Dyer and Linger can be contacted at IBM Federal Systems Division, 6600 Rockledge Dr., Bethesda, MD 20817.