the original author of the jsfunfuzz tool, reports using a variety of heuristics to avoid looking at test cases that trigger known bugs, such as turning off features in the test-case generator and using tools like grep to filter out test cases triggering bugs that have predictable symptoms [33]. During testing of mission file systems at NASA [12], Groce et al. used hand-tuned rules to avoid repeatedly finding the same bugs during random testing—e.g., "ignore any test case with a reset within five operations of a rename."

We claim that much more sophisticated automation is feasible and useful. In this paper we describe and evaluate a system that does this for two fuzzers: one for JavaScript engines, the other for C compilers. We characterize the fuzzer taming problem:

    Given a potentially large collection of test cases, each of which triggers a bug, rank them in such a way that test cases triggering distinct bugs are early in the list.

    Sub-problem: If there are test cases that trigger bugs previously flagged as undesirable, place them late in the list.

Ideally, for a collection of test cases that triggers N distinct bugs (none of which have been flagged as undesirable), each of the first N test cases in the list would trigger a different bug. In practice, perfection is unattainable because the problem is hard and also because there is some subjectivity in what constitutes "distinct bugs." Thus, our goal is simply to improve as much as possible upon the default situation where test cases are presented to users in effectively random order.

Figure 2 shows the workflow for a fuzzer tamer. The "oracle" in the figure detects buggy executions, for example by watching for crashes and by running the compiler's output against a reference compiler's output. Our rank-ordering approach was suggested by the prevalence of ranking approaches for presenting alarms produced by static analyses to users [18, 19].

Figure 1. A fuzzer tends to hit some bugs thousands of times more frequently than others. (Plots the # of failures (logscale) per bug, 1–35, for GCC 4.3.0 crashes, GCC 4.3.0 wrong-code bugs, and Mozilla SpiderMonkey 1.6 bugs.)

Figure 2. Workflow for a fuzzer tamer. (Diagram components: fuzzer, test cases, compiler under test, code coverage, output, oracle, bug-triggering test cases, reducer, reduced bug-triggering test cases, tamer, ranked reduced bug-triggering test cases, user, feedback.)

Our contributions are as follows. First, we frame the fuzzer taming problem, which has not yet been addressed by the research community, as far as we are aware. Second, we make and exploit the observation that automatic triaging of test cases is strongly synergistic with automated test-case reduction. Third, based on the insight that bugs are highly diverse, we exploit diverse sources of information about bug-triggering test cases, including features of the test case itself, features from execution of the compiler on the test case, and features from the compiler's output. Fourth, we show that diverse test cases can be ranked highly by first placing test cases in a metric space and then using the furthest point first (FPF) technique from machine learning [10]. The more obvious approach to fuzzer taming is to use a clustering algorithm [27] to place tests into homogeneous groups, and then choose a representative test from each cluster. We show that FPF is both faster and more effective than clustering for all of our case studies. Finally, we show that our techniques can effectively solve the fuzzer taming problem for 2,603 test cases triggering 28 bugs in a JavaScript engine and 3,799 test cases triggering 46 bugs in a C compiler. Using our methods over this test suite, a developer who inspects the JavaScript engine's test cases in ranked order will more quickly find cases that trigger the 28 bugs found during the fuzzing run. In comparison to a developer who examines cases in a random order, the developer who inspects in ranked order will be 4.6× faster. For wrong-code bugs and crash bugs in the C compiler, the improvements are 2.6× and 32×, respectively. Even more importantly, users can find many more distinct bugs than would be found with a random ordering by examining only a few tens of test cases.

Taming a fuzzer differs from previous efforts in duplicate bug detection [37, 38, 42] because user-supplied metadata is not available: we must rely solely on information from failure-inducing test cases. Compared to previous work on dealing with software containing multiple bugs [15, 22, 28], our work differs in the methods used (ranking bugs as opposed to clustering), in the kinds of inputs to the machine learning algorithms (diverse, as opposed to just predicates or coverage information), and in its overall goal of taming a fuzzer.

2. Approach
This section describes our approach to taming compiler fuzzers and gives an overview of the tools implementing it.

2.1 Definitions
A fault or bug in a compiler is a flaw in its implementation. When the execution of a compiler is influenced by a fault—e.g., by wrong or missing code—the result may be an error that leads to a failure detected by a test oracle. In this paper, we are primarily concerned with two kinds of failures: (1) compilation or interpretation that fails to follow the semantics of the input source code, and (2) compiler crashes. The goal of a compiler fuzzer is to discover source programs—test cases—that lead to these failures. The goal of a fuzzer tamer is to rank failure-inducing test cases such that any prefix of the ranked list triggers as many different faults as possible. Faults are not directly observable, but a fuzzer tamer can estimate which test cases are related by a common fault by making an assumption: the more "similar" two test cases, or two executions of the compiler on those test cases, the more likely they are to stem from the same fault [23].

A distance function maps any pair of test cases to a real number that serves as a measure of similarity. This is useful because our goal is to present fuzzer users with a collection of highly dissimilar test cases. Because there are many ways in which two test cases can be similar to each other—e.g., they can be textually similar, cause similar failure output, or lead to similar executions of the compiler—our work is based on several distance functions.
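For illustration only, the information attached to each failure-inducing test case and the shape of a distance function over pairs of them can be sketched as follows in Python; the record fields and names here are illustrative, not those of our implementation:

from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Failure:
    """Information gathered for one failure-inducing test case (illustrative)."""
    test_text: str                      # reduced test case (C or JavaScript source)
    compiler_output: str = ""           # output of the compiler as it crashes, if any
    valgrind_output: str = ""           # Valgrind report for a failing execution, if any
    coverage: Dict[str, int] = field(default_factory=dict)  # e.g., function -> times executed

# A distance function maps any pair of failures to a non-negative real number;
# smaller values mean the two test cases are more likely to stem from the same fault.
DistanceFn = Callable[[Failure, Failure], float]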
2.2 Ranking Test Cases
Our approach to solving the fuzzer taming problem is based on the following idea.

    Hypothesis 1: If we (1) define a distance function between test cases that appropriately captures their static and dynamic characteristics and then (2) sort the list of test cases in furthest point first (FPF) order, then the resulting list will constitute a usefully approximate solution to the fuzzer taming problem.

If this hypothesis holds, the fuzzer taming problem is reduced to defining an appropriate distance function. The FPF ordering is one where each point in the list is the one that maximizes the distance to the nearest of all previously listed elements; it can be computed using a greedy algorithm [10]. We use FPF to ensure that diverse test cases appear early in the list. Conversely, collections of highly similar test cases will be found towards the end of the list.

Our approach to ignoring known bugs is based on the premise that fuzzer users will have labeled some test cases as exemplifying these bugs; this corresponds to the "feedback" edge in Figure 2.

    Hypothesis 2: We can lower the rank of test cases corresponding to bugs that are known to be uninteresting by "seeding" the FPF computation with the set of test cases that are labeled as uninteresting.

Thus, the most highly ranked test case will be the one maximizing its minimum distance from any labeled test case.

2.3 Distance Functions for Test Cases
The fundamental problem in defining a distance function that will produce good fuzzer taming results is that we do not know what the trigger for a generic compiler bug looks like. For example, one C compiler bug might be triggered by a struct with a certain sequence of bitfields; another bug might be triggered by a large number of local variables, which causes the register allocator to spill. Our solution to this fundamental ambiguity has been to define a variety of distance functions, each of which we believe will usefully capture some kinds of bug triggers. This section describes these distance functions.

Levenshtein Distance  Also known as edit distance, the Levenshtein distance [20] between two strings is the smallest number of character additions, deletions, and replacements that suffices to turn one string into the other. For every pair of test cases we compute the Levenshtein distance between the following, all of which can be treated as plain text strings:
• the test cases themselves;
• the output of the compiler as it crashes (if any); and
• the output of Valgrind [25] on a failing execution (if any).

Computing Levenshtein distance requires time proportional to the product of the string lengths, but the constant factor is small (a few tens of instructions), so it is reasonably efficient in practice.

Euclidean Distance  Many aspects of failure-inducing test cases, and of executions of compilers on these test cases, lend themselves to summarization in the form of feature vectors. For example, consider this reduced JavaScript test case, which triggers a bug in SpiderMonkey 1.6:

__proto__=__parent__
new Error(this)

Lexing this code gives eight tokens, and a feature vector based on these tokens contains eight nonzero elements. The overall vector contains one element for every token that occurs in at least one test case, but which does not occur in every test case, out of a batch of test cases that is being processed by the fuzzer tamer. The elements in the vector are based on the number of appearances of each token in the test case. We construct lexical feature vectors for both C and JavaScript.

Given two n-element vectors v1 and v2, the Euclidean distance between them is:

    $\sqrt{\sum_{i=1..n} (v_1[i] - v_2[i])^2}$

For C code, our intuition was that lexical analysis in some sense produced shallower results than it did for JavaScript. To compensate, we wrote a Clang-based detector for 45 additional features that we guessed might be associated with compiler bugs. These features include:
• common types, statement classes, and operator kinds;
• features specific to aggregate data types such as structs with bitfields and packed structs;
• obvious divide-by-zero operations; and
• some kinds of infinite loops that can be detected statically.

In addition to constructing vectors from test cases, we also constructed feature vectors from compiler executions. For example, the function coverage of a compiler is a list of the functions that it executes while compiling a test case. The overall feature vector for function coverage contains an element for every function executed while compiling at least one test case, but that is not executed while compiling all test cases. As with token-based vectors, the vector elements are based on how many times each function executed. We created vectors of:
• functions covered;
• lines covered;
• tokens in the compiler's output as it crashes (if any); and
• tokens in output from Valgrind (if any).

In the latter two cases, we use the same tokenization as with test cases (treating output from the execution as a text document), except that in the case of Valgrind we abstract some non-null memory addresses to a generic ADDRESS token. The overall hypothesis is that most bugs will exhibit some kind of dynamic signature that will reveal itself in one or more kinds of feature vector.

Normalization  Information retrieval tasks can often benefit from normalization, which serves to decrease the importance of terms that occur very commonly, and hence convey little information. Before computing distances over feature vectors, we normalized the value of each vector element using tf-idf [34]; this is a common practice in text clustering and classification. Given a count of a feature (token) in a test case or its execution (the "document"), the tf-idf is the product of the term-frequency (tf) and the inverse-document-frequency (idf) for the token. Term-frequency is the ratio of the count of the token in the document to the total number of tokens in the document. (For coverage we use the number of times the entity is executed.) Inverse-document-frequency is the logarithm of the ratio of the total number of documents to the total number of documents containing the token: this results in a uniformly zero value for tokens appearing in all documents, which are therefore not included in the vector. We normalize Levenshtein distances by the length of the larger of the two strings, which helps handle varying sizes for test cases or outputs.
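For illustration, the token-vector construction just described (token counts, tf-idf weighting, and Euclidean distance) can be sketched as follows, assuming a simple whitespace tokenizer in place of the real C and JavaScript lexers:

import math
from collections import Counter
from typing import Dict, List

def tokenize(text: str) -> List[str]:
    # Placeholder tokenizer; the real system lexes C or JavaScript.
    return text.split()

def tfidf_vectors(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Turn token lists into tf-idf weighted vectors.

    Tokens that appear in every document get idf = 0 and therefore
    drop out of the vectors, as described above."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        vec = {}
        for tok, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[tok])
            if idf > 0.0:   # token occurs in at least one, but not all, documents
                vec[tok] = tf * idf
        vectors.append(vec)
    return vectors

def euclidean(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """sqrt of the sum of squared differences over the union of features."""
    keys = set(v1) | set(v2)
    return math.sqrt(sum((v1.get(k, 0.0) - v2.get(k, 0.0)) ** 2 for k in keys))

The same construction applies to coverage-based vectors, with the number of times a function or line is executed taking the place of token counts.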
3. A Foundation for Experiments
To evaluate our work, we need a large collection of reduced versions of randomly generated test cases that trigger compiler bugs. Moreover, we require access to ground truth: the actual bug triggered by each test case. This section describes our approach to meeting these prerequisites.

3.1 Compilers Tested
We chose to test GCC 4.3.0 and SpiderMonkey 1.6, both running on Linux on x86-64. SpiderMonkey, best known as the JavaScript engine embedded in Firefox, is a descendant of the original JavaScript implementation; it contains an interpreter and several JIT compilers. Our selection of these particular versions was based on several considerations. First, the version that we fuzzed had to be buggy enough that we could generate useful statistics. Second, it was important that most of the bugs revealed by our fuzzer had been fixed by developers. This would not be the case for very recent compiler versions. Also, it turned out not to be the case for GCC 4.0.0, which we initially started using and had to abandon, since maintenance of its release branch—the 4.0.x series—terminated in 2007 with too many unfixed bugs.

3.2 Test Cases for C
We used the default configuration of Csmith [44] version 2.1.0, which over a period of a few days generated 2,501 test cases that crash GCC and 1,298 that trigger wrong-code bugs. The default configuration of Csmith uses swarm testing [13], which varies test features to improve fault detection and code coverage. Each program emitted by Csmith was compiled at -O0, -O1, -O2, -Os, and -O3. To detect crash bugs, we inspected the return code of the main compiler process; any nonzero value was considered to indicate a crash. To detect wrong-code bugs, we employed differential testing: we waited for the compiler's output to produce a result different from the result of executing the output of a reference compiler. Since no perfect reference compiler exists, we approximated one by running GCC 4.6.0 and Clang 3.1 at their lowest optimization levels and ensuring that both compilers produced executables that, when run, had the same output. (We observed no mismatches during our tests.)

Csmith's outputs tend to be large, often exceeding 100 KB. We reduced each failure-inducing test case using C-Reduce [29], a tool that uses a generalized version of Delta debugging to heuristically reduce C programs. After reduction, some previously different tests became textually equivalent; this happens because C-Reduce tries quite hard to reduce identifiers, constants, data types, and other constructs to canonical values. For crash bugs, reduction produced 1,797 duplicates, leaving only 704 different test cases. Reduction was less effective at canonicalizing wrong-code test cases, with only 23 duplicate tests removed, leaving 1,275 tests to examine. In both cases, the typical test case was reduced in size by two to three orders of magnitude, to an average size of 128 bytes for crash bugs and 243 bytes for wrong-code bugs.

3.3 Test Cases for JavaScript
We started with the last public release of jsfunfuzz [31], a tool that, over its lifetime, has led to the discovery of more than 1,700 faults in SpiderMonkey. We modified jsfunfuzz to support swarm testing and then ran it for several days, accumulating 2,603 failing test cases. Differential testing of JavaScript compilers is problematic due to their diverging implementations of many of the most bug-prone features of JavaScript. However, jsfunfuzz comes with a set of built-in test oracles, including semantic checks (e.g., ensuring that compiling then decompiling code is an identity function) and watchdog timers to ensure that infinite loops can only result from faults. For an ahead-of-time compiler like GCC, it is natural to divide bugs into those that manifest at compile time (crashes) and those that manifest at run time (wrong-code bugs). This distinction makes less sense for a just-in-time compiler such as SpiderMonkey; we did not attempt to make it, but rather lumped all bugs into a single category.

Test cases produced by jsfunfuzz were also large, over 100 KB on average. We reduced test cases using a custom reducer similar in spirit to C-Reduce, tuned for JavaScript. Reduction resulted in 854 duplicate test cases that we removed, leaving 1,749 test cases for input to the fuzzer taming tools. The typical failure-inducing test case for SpiderMonkey was reduced in size by more than three orders of magnitude, to an average size of 68 bytes.

3.4 Establishing Ground Truth
Perhaps the most onerous part of our work involved determining ground truth: the actual bug triggered by each test case. Doing this the hard way—examining the execution of the compiler for each of thousands of failure-inducing test cases—is obviously infeasible. Instead, our goal was to create, for each of the 74 total bugs that our fuzzing efforts revealed, a patched compiler fixing only that bug. At that point, ground-truth determination can be automated: for each failure-inducing test case, run it through every patched version of the compiler and see which one changes its behavior. We only partially accomplished our goal. For a collection of arbitrary bugs in a large application that is being actively developed, it turns out to be very hard to find a patch fixing each bug, and only that bug.

For each bug, we started by performing an automated forward search to find the patch that fixed the bug. In some cases this patch (1) was small; (2) clearly fixed the bug triggered by the test case, as opposed to masking it by suppressing execution of the buggy code; and (3) could be back-ported to the version of the compiler that we tested. In other cases, some or all of these conditions failed to hold. For example, some compiler patches were extraordinarily complex, changing tens of thousands of lines of code. Moreover, these patches were written for compiler versions that had evolved considerably since the GCC 4.3.0 and SpiderMonkey 1.6 versions that are the basis for our experiments.

Although we spent significant effort trying to create a minimal patch fixing each compiler bug triggered by our fuzzing effort, this was not always feasible. Our backup strategy for assessing ground truth was first to approximately classify each test case based on the revision of the compiler that fixed the bug that it triggered, and second to manually inspect each test case in order to determine a final classification for which bug it triggered, based on our understanding of the set of compiler bugs.

3.5 Bug Slippage
When the original and reduced versions of a test case trigger different bugs, we say that bug slippage has occurred. Slippage is not hard to avoid for bugs that have an unambiguous symptom (e.g., "assertion violation at line 512") but it can be difficult to avoid for silent bugs such as those that cause a compiler to emit incorrect code. Although slippage is normally difficult to recognize or quantify, these tasks are easy when ground truth is available, as it is here.

Of our 2,501 unreduced test cases that caused GCC 4.3.0 to crash, almost all triggered the same (single) bug that was triggered by the test case's reduced version. Thirteen of the unreduced test cases triggered two different bugs, and in all of these cases the reduced version triggered one of the two. Finally, we saw a single instance of actual slippage where the original test case triggered one bug in GCC leading to a segmentation fault and the reduced version triggered a different bug, also leading to a segmentation fault. For the 1,298 test cases triggering wrong-code bugs in GCC, slippage during reduction occurred fifteen times.

For JavaScript, bug slippage was a more serious problem: 23% of reduced JavaScript test cases triggered a different bug than the original test case. This problem was not mitigated (as we had
originally hoped) by re-reducing test cases using the slower "debug" version of SpiderMonkey.

In short, bug slippage was a problem for SpiderMonkey 1.6 but not for GCC 4.3.0. Although the dynamics of test-case reduction are complex, we have a hypothesis about why this might have been the case. Test-case reduction is a heuristic search that explores one particular path through the space of all possible programs. This path stays in the subset of programs that trigger a bug and also follows a gradient leading towards smaller test cases. Sometimes, the trajectory will pass through the space of programs triggering some completely different bug, causing the reduction to be "hijacked" by the second bug. We would expect this to happen more often for a compiler that is buggier. Our observation is that GCC 4.3.0 is basically a solid and mature implementation whereas SpiderMonkey 1.6 is not—it contains many bugs in fairly basic language features.

4. Results and Discussion
For 1,979 reduced test cases triggering 46 bugs in GCC 4.3.0 and 1,749 reduced test cases triggering 28 bugs in SpiderMonkey 1.6, our goal is to rank them for presentation to developers such that diverse faults are triggered by test cases early in the list.

4.1 Evaluating Effectiveness using Bug Discovery Curves
Figures 3–8 present the primary results of our work using bug discovery curves. A discovery curve shows how quickly a ranking of items allows a human examining the items one by one to view at least one representative of each different category of items [26, 40]. Thus, a curve that climbs rapidly is better than a curve that climbs more slowly. Here, the items are test cases and categories are the underlying compiler faults. The top of each graph represents the point at which all faults have been presented. As shown by the y-axes of the figures, there are 28 SpiderMonkey bugs, 11 GCC crash bugs, and 35 GCC wrong-code bugs in our study.

Each of Figures 3–8 includes a baseline: the expected bug discovery curve without any fuzzer taming. We computed it by looking at test cases in random order, averaged over 10,000 orders. We also show the theoretical best curve where for N faults each of the first N tests reveals a new fault.

In each graph, we show in solid black the first method to find all bugs (which, in all of our examples, is also the method with the best area under the full curve). For GCC crash bugs and SpiderMonkey, this method also has the best climb for the first 50 tests, and for GCC wrong-code bugs, it is almost the best for the first 50 tests (and, in fact, discovers one more bug than the curve with the best area). For this best curve, we also show points sized by the log of the frequency of the fault; our methods do not always find the most commonly triggered faults first. Finally, each graph additionally shows the best result that we could obtain by ranking test cases using clustering instead of FPF, using X-means to generate clusterings by various features, sorting all clusterings by isolation and compactness, and using the centermost test for each cluster. (See Section 4.6 for details.)

4.2 Are These Results Any Good?
Our efforts to tame fuzzers would have clearly failed had we been unable to significantly improve on the baseline. On the other hand, there is plenty of room for improvement: our bug discovery curves do not track the "theoretical best" lines in Figures 3 and 7 for very long. For GCC crash bugs, however, our results are almost perfect.

Perhaps the best way to interpret our results is in terms of the value proposition they create for compiler developers. For example, if a SpiderMonkey team member examines 15 randomly chosen reduced test cases, he or she can expect them to trigger five different bugs. In contrast, if the developer examines the first 15 of our ranked tests, he or she will see 11 distinct bugs: a noticeable improvement.

4.3 Selecting a Distance Function
In Section 2 we described a number of ways to compute distances between test cases. Since we did not know which of these would work, we tried all of them individually and together, with Figures 3–8 showing our best results. Since we did not consider enough case studies to be able to reach a strong conclusion such as "fuzzer taming should always use Levenshtein distance on test case text and compiler output," this section analyzes the detailed results from our different distance functions, in hopes of reaching some tentative conclusions about which functions are and are not useful.

SpiderMonkey and GCC Crash Bugs  For these faults, the best distance function to use as the basis for FPF, based on our case studies, is the normalized Levenshtein distance between test cases plus normalized Levenshtein distance between failure outputs. Our tentative recommendation for bugs that (1) reduce very well and (2) have compiler-failure outputs is: use normalized Levenshtein distance over test-case text plus compiler-output text, and do not bother with Valgrind output or coverage information.

Given that using Levenshtein distance on the test-case text plus compiler output worked so well for both of these bug sets, where all faults had meaningful failure symptoms, we might expect using output or test-case text alone to also perform acceptably. In fact, Levenshtein distance based on test-case text alone (not normalized) performed moderately well for SpiderMonkey, but otherwise the results for these distance functions were uniformly mediocre at best. For GCC, using compiler output plus C features (Section 2.3) performed nearly as well as the best distance function, suggesting that the essential requirement is compiler output combined with a good representation of the test case, which may not be satisfied by a simple vectorization: vectorizing test case plus output performed badly for both GCC and SpiderMonkey.

Coverage-based methods worked fairly well for SpiderMonkey, appearing in six of the top ten functions and only two of the worst ten. Interestingly, these best coverage methods for SpiderMonkey all included both line and function coverage. Both coverage-based functions were uniformly mediocre for GCC crashes (coverage did not appear in any of the best ten or worst ten methods). For GCC, Valgrind was of little value, as most failures did not produce any Valgrind output. Memory-safety errors were more common in SpiderMonkey, so most test cases did produce Valgrind output; however, for the most part, adding the information to a distance function still made the function perform worse in the long run. Valgrind output alone performed extremely poorly in the long run for both GCC crashes and SpiderMonkey bugs.

GCC Crash Bugs  For these bugs, every distance function increased the area under the curve for examining less than 50 tests by a factor of four or better, compared to the baseline. Clearly there is a significant amount of redundancy in the information provided by different functions. All but five of the 63 distance functions we used were able to discover all bugs within at most 90 tests: a dramatic improvement over the baseline's 491 tests. Only Valgrind output alone performed worse than the baseline. The other four poorly performing methods all involved using vectorization of the test case, with no additional information beyond Valgrind output and/or test-case output.

GCC crash bugs were, however, our easiest target: there are only 11 crash outputs and 11 faults. Even so, the problem is not trivial, as the faults and outputs do not correspond perfectly—two faults have two different outputs, and there are two outputs that are each produced by two different faults. Failure output alone provides a great deal of information about a majority of the faults, and test-case distance completes the story.
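For illustration, the distance recommended above for these bug classes, normalized Levenshtein distance over test-case text plus normalized Levenshtein distance over failure output, can be sketched as follows (the failure record fields are the illustrative ones from Section 2.1):

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # replacement
        prev = cur
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def crash_distance(f1, f2) -> float:
    # Normalized Levenshtein over test text plus normalized Levenshtein over failure output.
    return (normalized_levenshtein(f1.test_text, f2.test_text) +
            normalized_levenshtein(f1.compiler_output, f2.compiler_output))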
Figure 3. SpiderMonkey 1.6 bug discovery curves, first 50 tests.
Figure 4. SpiderMonkey 1.6 bug discovery curves, all tests. (Curves in Figures 3 and 4: theoretical best; baseline (examine in random order); best curve FPF(Lev(test)+Lev(output)), with points sized by fault frequency; best clustering curve C(funccov)+C(linecov).)
Figure 5. GCC 4.3.0 crash bug discovery curves, first 50 tests.
Figure 6. GCC 4.3.0 crash bug discovery curves, all tests.
Figure 7. GCC 4.3.0 wrong-code bug discovery curves, first 50 tests.
Figure 8. GCC 4.3.0 wrong-code bug discovery curves, all tests. (Curves in Figures 7 and 8: theoretical best; baseline (examine in random order); best curve FPF(funccov), with points sized by fault frequency; best clustering curve C(funccov)+C(C-features). All plots show # Faults Seen versus # Tests Examined.)
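For reference, a discovery curve like those in Figures 3–8 can be computed directly from a ranked list of test cases and their ground-truth bug labels; a minimal sketch, with made-up labels, is:

from typing import List, Sequence

def discovery_curve(ranked_bug_labels: Sequence[str]) -> List[int]:
    """Given the ground-truth bug label of each test case in ranked order,
    return, for each prefix length, how many distinct bugs that prefix triggers."""
    seen = set()
    curve = []
    for label in ranked_bug_labels:
        seen.add(label)
        curve.append(len(seen))
    return curve

# Example: a ranking whose first four tests trigger three distinct bugs.
print(discovery_curve(["bug-A", "bug-B", "bug-A", "bug-C"]))  # [1, 2, 2, 3]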
SpiderMonkey Bugs  The story was less simple for SpiderMonkey bugs, where many methods performed poorly and seven methods performed worse than the baseline. In this case, compiler output alone did not provide as much direct guidance: there were 300 different failure outputs, and only four of 28 faults had a unique identifying output. As a result, while compiler output alone performed very well over the first 50 tests (where it was one of the best five functions), it proved one of the worst functions for finding all faults, detecting no new faults between the 50th and 100th tests ranked. Test-case text by itself performed well for SpiderMonkey with Levenshtein distance, or when combined with line coverage, but performed badly as a vectorization without line coverage, appearing in six of the worst ten functions. As with GCC crashes, Valgrind output alone performed very badly, requiring a user to examine 1,506 tests to discover all bugs. Levenshtein-based approaches (whether over test case, compiler output, Valgrind output, or a combination thereof) performed very well over the first 50 tests examined.

GCC Wrong-Code Bugs  Wrong-code bugs in GCC were the trickiest bugs that we faced: their execution does not provide failure output and, in the expected case where the bug is in a "middle end" optimizer, the distance between execution of the fault and actual emission of code (and thus exposure of failure) can be quite long.

For these bugs, the best method to use for fuzzer taming was less clear. Figures 9 and 10 show the performance of all methods that we tried, including a table of results sorted by area under the curve up to 50 tests (Figure 9) and number of test cases to discover all faults (Figure 10). It is clear that code coverage (line or function) is much more valuable here than with crash bugs, though Levenshtein distance based on test case alone performs well in the long run (but badly initially). Line coverage is useful for early climb, but eventually function coverage is most useful for discovering all bugs. Perhaps most importantly, given the difficulty of handling GCC wrong-code bugs, all of our methods perform better than the baseline in terms of ability to find all bugs, and provide a clear advantage over the first 50 test cases. We do not wish to overgeneralize from a few case studies, but these results provide hope that for difficult bugs, if good reduction is possible, the exact choice of distance function used in FPF may not be critical.

We were disappointed to see that Figures 9 and 10 show no evidence that our domain-specific feature detector for C programs is useful for wrong-code bugs; in the tables it appears as "C-Feature."

Figure 9. All bug discovery curves for GCC 4.3.0 wrong-code bugs, sorted by increasing area under the curve if examining only the 50 top-ranked test cases. (Legend: FPF(test+linecov) 490; FPF(linecov) 490; FPF(linecov+C-features) 490; FPF(test+linecov+C-features) 489; FPF(test+funccov+linecov) 471; FPF(funccov+linecov+C-features) 471; FPF(funccov+linecov) 471; FPF(test+funccov+linecov+C-features) 469; FPF(funccov) 459; FPF(funccov+C-features) 450; FPF(test+funccov) 450; FPF(test+funccov+C-features) 431; FPF(test+C-features) 425; FPF(C-features) 425; FPF(test) 424; FPF(Lev(test) not normalized) 368; FPF(Lev(test)) 318; Baseline (examine in random order) 247.)

Figure 10. All bug discovery curves for GCC 4.3.0 wrong-code bugs, sorted by increasing number of test cases that must be examined to discover all faults. (Legend: FPF(funccov) 460; FPF(funccov+C-features) 460; FPF(test+funccov) 461; FPF(test+funccov+C-features) 462; FPF(Lev(test)) 500; FPF(Lev(test) not normalized) 605; FPF(linecov) 894; FPF(funccov+linecov) 894; FPF(test+linecov) 911; FPF(test+linecov+C-features) 911; FPF(linecov+C-features) 911; FPF(test+funccov+linecov) 911; FPF(funccov+linecov+C-features) 911; FPF(test+funccov+linecov+C-features) 911; FPF(C-features) 929; FPF(test+C-features) 932; FPF(test) 934; Baseline (examine in random order) 1204.)

4.4 Avoiding Known Faults
In Section 2.2 we hypothesized that FPF could be used to avoid reports about a set of known bugs; this is accomplished by lowering the rankings of test cases that appear to be caused by those bugs. Figure 11 shows, for our SpiderMonkey test cases, an averaged bug discovery curve for the case where half of the bugs were assumed to be already known, and five test cases (or fewer, if five were not available) triggering each of those bugs were used to seed FPF. This experiment models the situation where, in the days or weeks preceding the current fuzzing run, the user has flagged these test cases and does not want to see more test cases triggering the same bugs. The curve is the average of 100 discovery curves, each corresponding to a different randomly chosen set of known bugs.

The topmost bug discovery curve in Figure 11 represents an idealized best case where all test cases corresponding to known bugs are removed from the set of test cases to be ranked. The second curve from the top is our result. The third curve from the top is also the average of 100 discovery curves; each corresponds to the case where the five (or fewer) test cases for each known bug are discarded instead of being used to seed the FPF algorithm, and then the FPF algorithm proceeds normally. This serves as a baseline: our result would have to be considered poor if it could not improve on this. Finally, the bottom curve is the "basic baseline" where the labeled test cases are again discarded, but then the remaining test cases are examined in random order.

Figure 11. Avoiding known bugs in SpiderMonkey 1.6. (Curves, top to bottom: our approach with complete avoidance of known faults (theoretical best); our result (using examples of known bugs); our technique without examples (better baseline); baseline (examine in random order).)
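For illustration, the greedy FPF ranking of Section 2.2, including the seeding used here to push test cases resembling known bugs toward the end of the list, can be sketched as follows (it assumes a distance function such as the ones sketched earlier):

def fpf_rank(tests, distance, seeds=()):
    """Greedy furthest point first (FPF) ordering.

    Each remaining test is scored by its distance to the nearest
    already-placed item (labeled seed test cases count as already
    placed), and the highest-scoring test is emitted next."""
    remaining = list(tests)
    ranked = []
    if seeds:
        placed = list(seeds)
    elif remaining:
        first = remaining.pop(0)      # with no seeds, start from an arbitrary test
        ranked.append(first)
        placed = [first]
    else:
        return ranked
    # Distance from each remaining test to its nearest placed item.
    nearest = [min(distance(t, p) for p in placed) for t in remaining]
    while remaining:
        idx = max(range(len(remaining)), key=lambda i: nearest[i])
        pick = remaining.pop(idx)
        nearest.pop(idx)
        ranked.append(pick)
        # Placing pick can only lower each remaining test's nearest distance.
        nearest = [min(d, distance(t, pick)) for d, t in zip(nearest, remaining)]
    return ranked

With seeds drawn from test cases labeled as triggering known, uninteresting bugs, the first test emitted is the one whose minimum distance to any labeled test case is largest, matching Hypothesis 2.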
[…]sive, our view is that attempting to tame a fuzzer without the aid of a solid test-case reducer is inadvisable. The most informative sources of information about root causes are rendered useless by overwhelming noise. Although we did not create results for unreduced GCC test cases that are analogous to those shown in Figure 12 (the line coverage vectors were gigantic and caused problems by filling up disks), we have no reason to believe the results would have been any […]

[Figure 12 legend: FPF(Valgrind+output) 71408 2314; FPF(Lev(output)) 69943 2336; FPF(Lev(output) not normalized) 69804 2336; FPF(output) 68267 2311; FPF(Valgrind) 67935 2276; FPF(Lev(Valgrind+output)) 67207 2482; FPF(Valgrind+output+test) 64848 2390; FPF(output+test) 64819 2388; Baseline (examine in random order) 63482 2603; FPF(Valgrind+test) 62389 2393. Axes: # Faults Seen versus # Tests Examined.]
Program / Feature                        FPF time (s)   Clustering time (s)   Figures
SpiderMonkey / Valgrind                      8.27            23.68              –
SpiderMonkey / output                        8.38            46.71              –
SpiderMonkey / test                          8.12            94.26              –
SpiderMonkey / funccov                       9.56           227.78              3, 4
SpiderMonkey / linecov                      48.29         1,594.04              3, 4
SpiderMonkey / Lev. test+output            998.21             N/A               3, 4
GCC crash bugs / output                      0.08             0.71              –
GCC crash bugs / Valgrind                    0.09             0.75              5, 6
GCC crash bugs / C-Feature                   0.10             1.95              5, 6
GCC crash bugs / test                        0.14            15.12              5, 6
GCC crash bugs / funccov                     1.37           162.22              –
GCC crash bugs / linecov                    18.70         2,021.08              –
GCC crash bugs / Lev. test+output           75.07             N/A               5, 6
GCC wrong-code bugs / C-Feature              0.49             4.26              7, 8
GCC wrong-code bugs / test                   0.72            67.72              –
GCC wrong-code bugs / funccov                4.12         1,046.07              7, 8
GCC wrong-code bugs / linecov               60.60         7,127.42              –
GCC wrong-code bugs / Lev. test            667.21             N/A               –

Table 1. Runtimes (in seconds) for FPF versus clustering

[…] is a widely used tool written in C). Because the isolation and compactness computations require many pairwise distance results, an efficient implementation should be approximately equal in time to running FPF. The final column of the table lists the figures in this paper that show a curve based on the indicated results. If a curve relies on multiple clusterings, its generation time is (at least) the sum of the clustering times for each component. Note that because X-means expects inputs in vector form, we were unable to apply our direct Levenshtein-distance approach with clustering, but we include some runtimes for FPF Levenshtein to provide a comparison.

That clustering is more expensive and complex than FPF is not surprising; clustering has to perform the additional work of computing clusters, rather than simply ranking items by distance. That FPF produces considerably better discovery curves, as shown in Figures 3–8, is surprising. The comparative ineffectiveness of clustering is twofold: the discovery curves do not climb as quickly as with FPF, and (perhaps even more critically) clustering does not ever find all the faults in many cases. In general, for almost all feature sets, clustering over those same features was worse than applying FPF to those features. The bad performance of clustering was particularly clear for GCC wrong-code bugs: Figure 13 shows all discovery curves for GCC wrong-code, with clustering results shown in gray. Clustering at its "best" missed 15 or more bugs, and in many cases performed much worse than the baseline, generating a small number of clusters that were not represented by distinct faults. In fact, the few clustering results that manage to discover 20 faults also did so more slowly than the baseline curve. While GCC wrong-code clustering was particularly bad, clustering also always missed at least three bugs for SpiderMonkey. Our hypothesis as to why FPF performs so much better than clustering is that the nature of fuzzing results, with a long tail of outliers, is a mismatch for clustering algorithm assumptions. FPF is not forced to use any assumptions about the size of clusters, and so is not "confused" by the many single-instance clusters. A minor point supporting our hypothesis is that the rank ordering of clustering effectiveness matched that of the size of the tail for each set of faults: GCC crash results were good but not optimal, SpiderMonkey results were poor, and GCC wrong-code results were extremely bad.

Figure 13. GCC 4.3.0 wrong-code bug clustering comparison. (FPF and baseline discovery curves shown against clustering-based curves; axes: # Tests Examined versus # Faults Seen.)

5. Related Work
A great deal of research related to fuzzer taming exists, and some related areas such as fault localization are too large for us to do more than summarize the high points.

Relating Test Cases to Faults  Previous work focusing on the core problem of "taming" sets of redundant test cases differs from ours in a few key ways. The differences relate to our choice of primary algorithm, our reliance on unsupervised methods, and our focus on randomly generated test cases.

First, the primary method used was typically clustering, as in the work of Francis et al. [9] and Podgurski et al. [28], which at first appears to reflect the core problem of grouping test cases into equivalence classes by underlying fault. However, in practice the user of a fuzzer does not usually care about the tests in a cluster, but only about finding at least one example from each set with no particular desire that it is a perfectly "representative" example. The core problem we address is therefore better considered as one of multiple output identification [8] or rare category detection [8, 40], given that many faults will be found by a single test case out of thousands. This insight guides our decision to provide the first evaluation in terms of discovery curves (the most direct measure of fuzzer taming capability we know of) for this problem. Our results suggest that this difference in focus is also algorithmically useful, as clustering was less effective than our (novel, to our knowledge) choice of FPF.

One caveat is that, as in the work of Jones et al. on debugging in parallel [15], clusters may not be directly useful to users, but might assist fault localization algorithms. Jones et al. provide an evaluation in terms of a model of debugging effort, which combines clustering effectiveness with fault-localization effectiveness. This provides an interesting contrast to our discovery curves: it relies on more assumptions about users' workflow and debugging process and provides less direct information about the effectiveness of taming itself. In our experience, sufficiently reduced test cases make localization easy enough for many compiler bugs that discovery is the more important problem. Unfortunately, it is hard to compare results: cost-model results are only reported for SPACE, a program with only around 6,200 LOC, and their tests included not only random tests from a simple generator but 3,585 human-generated tests. In the event that clusters are needed, FPF results for any k can be transformed into k clusters with certain optimality bounds for the chosen distance function [10].

Second, our approach is completely unsupervised. There is no expectation that users will examine clusters, add rules, or intervene in the process. We therefore use test-case reduction for feature selection, rather than basing it on classifying test cases as successful or failing [9, 28]. Because fuzzing results follow a power law, many faults will be represented by far too few tests for a good classifier to
include their key features; this is a classic and extreme case of class imbalance in machine learning. While bug slippage is a problem, reduction remains highly effective for feature selection, in that the features selected are correct for the reduced test cases, essentially by the definition of test-case reduction.

Finally, our expected use case and experimental results are based on a large set of failures produced by large-scale random testing for complex programming languages implemented in large, complex, modern compilers. Most previous results in failure clustering used human-reported failures or human-created regression tests (e.g., GCC regression tests [9, 28]), which are essentially different in character from the failures produced by large-scale fuzzing, and/or concerned much smaller programs with much simpler input domains [15, 23], i.e., examples from the Siemens suite. Liblit et al. [22] in contrast directly addressed scalability by using 32,000 random inputs (though not from a pre-existing industrial-strength fuzzer for a complex language) and larger programs (up to 56 KLOC), and noted that they saw highly varying rates of failure for different bugs. Their work addresses a somewhat different problem than ours—that of isolating bugs via sampled predicate values, rather than straightforward ranking of test cases for user examination—and did not include any systems as large as GCC or SpiderMonkey.

Distance Functions for Executions and Programs  Liu and Han's work [23], like ours, focuses less on a particular clustering method and proposes that the core problem in taming large test suites is that of embedding test cases in a metric space that has good correlation with underlying fault causes. They propose to compute distance by first applying fault localization methods to executions, then using distance over localization results rather than over the traces themselves. We propose that the reduction of random test cases essentially "localizes" the test cases themselves, allowing us to directly compute proximity over test cases while exhibiting the good correlation with underlying cause that Liu and Han seek to achieve by applying a fault-localization technique. Reduction has advantages over localization in that reduction methods are more commonly employed and do not require storing—or even capturing or sampling—information about coverage, predicates, or other metrics for passing test cases. Liu and Han show that distance based on localization algorithms better captures fault cause than distance over raw traces, but they do not provide discovery curves or a detailed clustering evaluation. They provide correlation results only over the Siemens suite's small subjects and test case sets.

More generally, the problems of distance functions over executions and test cases [5, 11, 23, 30, 39] and programs themselves [4, 35, 41] have typically been seen as essentially different problems. While this is true for many investigations—generalized program understanding and fault localization on the one hand, and plagiarism detection, merging of program edits, code clone, or malware detection on the other—the difference collapses when we consider that every program compiled is an input to some other program. A program is therefore both a program and a test input, which induces an execution of another program. Distance between (compiled) programs, therefore, is a distance between executions. We are the first, to our knowledge, to essentially erase the distinction between a metric space for programs and a metric space for executions, mixing the two concepts as needed. Moreover, we believe that our work addresses some of the concerns noted in fault-localization efforts based on execution distances (e.g., poor results compared to other methods [16]), in that distance functions should perform much better on executions of reduced programs, due to the power of feature selection, and distances over programs (highly structured and potentially very informative inputs) can complement execution-based distance functions.

Fault Localization  Our work shares a common ultimate goal with fault localization work in general [5, 7, 11, 16, 17, 21, 22, 30] and specifically for compilers [43]: reducing the cost of manual debugging. We differ substantially in that we focus our methods and evaluation on the narrow problem of helping the users of fuzzers deal with the overwhelming amount of data that a modern fuzzer can produce when applied to a compiler. As suggested by Liu and Han [23], Jones et al. [15], and others, localization may support fuzzer taming and fuzzer taming may support localization. As part of our future work, we propose to make use of vectors based on localization information to determine if, even after reduction, localization can improve bug discovery. A central question is whether the payoff from keeping summaries of successful executions (a requirement for many fault localizations) provides sufficient improvement to pay for its overhead in reduced fuzzing throughput.

6. Conclusion
Random testing, or fuzzing, has emerged as an important way to test compilers and language runtimes. Despite their advantages, however, fuzzers create a unique set of challenges when compared to other testing methods. First, they indiscriminately and repeatedly find test cases triggering bugs that have already been found and that may not be economical to fix in the short term. Second, fuzzers tend to trigger some bugs far more often than others, creating needle-in-the-haystack problems for engineers who are triaging failure-inducing outputs generated by fuzzers.

Our contribution is to tame a fuzzer by adding a tool to the back end of the random-testing workflow; it uses techniques from machine learning to rank test cases in such a way that interesting tests are likely to be highly ranked. By analogy to the way people use ranked outputs from static analyses, we expect fuzzer users to inspect a small fraction of highly ranked outputs, trusting that lower-ranked test cases are not as interesting. If our rankings are good, fuzzer users will get most of the benefit of inspecting every failure-inducing test case discovered by the fuzzer for a fraction of the effort. For example, a user inspecting test cases for SpiderMonkey 1.6 in our ranked order will see all 28 bugs found during our fuzzing run 4.6× faster than will a user inspecting test cases in random order. A user inspecting test cases that cause GCC 4.3.0 to emit incorrect object code will see all 35 bugs 2.6× faster than one inspecting tests in random order. The improvement for test cases that cause GCC 4.3.0 to crash is even higher: 32×, with all 11 bugs exposed by only 15 test cases.

Acknowledgments
We thank Michael Hicks, Robby Findler, and the anonymous PLDI '13 reviewers for their comments on drafts of this paper; Suresh Venkatasubramanian for nudging us towards the furthest point first technique; James A. Jones for providing useful early feedback; and Google for a research award supporting Yang Chen. A portion of this work was funded by NSF grants CCF-1217824 and CCF-1054786.

References
[1] James H. Andrews, Alex Groce, Melissa Weston, and Ru-Gang Xu. Random test run length and effectiveness. In Proc. ASE, pages 19–28, September 2008.
[2] Abhishek Arya and Cris Neckar. Fuzzing for security, April 2012. http://blog.chromium.org/2012/04/fuzzing-for-security.html.
[3] Mariano Ceccato, Alessandro Marchetto, Leonardo Mariani, Cu D. Nguyen, and Paolo Tonella. An empirical study about the effectiveness of debugging when random test cases are used. In Proc. ICSE, pages 452–462, June 2012.
[4] Silvio Cesare and Yang Xiang. Malware variant detection using similarity search over sets of control flow graphs. In Proc. TRUSTCOM, pages 181–189, November 2011.
[5] Sagar Chaki, Alex Groce, and Ofer Strichman. Explaining abstract counterexamples. In Proc. FSE, pages 73–82, 2004.
[6] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proc. ICFP, pages 268–279, 2000.
[7] Holger Cleve and Andreas Zeller. Locating causes of program failures. In Proc. ICSE, pages 342–351, May 2005.
[8] Shai Fine and Yishay Mansour. Active sampling for multiple output identification. Machine Learning, 69(2–3):213–228, 2007.
[9] Patrick Francis, David Leon, Melinda Minch, and Andy Podgurski. Tree-based methods for classifying software failures. In Proc. ISSRE, pages 451–462, November 2004.
[10] Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.
[11] Alex Groce. Error explanation with distance metrics. In Proc. TACAS, pages 108–122, March 2004.
[12] Alex Groce, Gerard Holzmann, and Rajeev Joshi. Randomized differential testing as a prelude to formal verification. In Proc. ICSE, pages 621–631, May 2007.
[13] Alex Groce, Chaoqiang Zhang, Eric Eide, Yang Chen, and John Regehr. Swarm testing. In Proc. ISSTA, pages 78–88, July 2012.
[14] Christian Holler, Kim Herzig, and Andreas Zeller. Fuzzing with code fragments. In Proc. USENIX Security, pages 445–458, August 2012.
[15] James A. Jones, James F. Bowring, and Mary Jean Harrold. Debugging in parallel. In Proc. ISSTA, pages 16–26, July 2007.
[16] James A. Jones and Mary Jean Harrold. Empirical evaluation of the Tarantula automatic fault-localization technique. In Proc. ASE, pages 273–282, November 2005.
[17] James A. Jones, Mary Jean Harrold, and John Stasko. Visualization of test information to assist fault localization. In Proc. ICSE, pages 467–477, May 2002.
[18] Yungbum Jung, Jaehwang Kim, Jaeho Shin, and Kwangkeun Yi. Taming false alarms from a domain-unaware C analyzer by a Bayesian statistical post analysis. In Proc. SAS, pages 203–217, September 2005.
[19] Ted Kremenek and Dawson Engler. Z-ranking: using statistical analysis to counter the impact of static analysis approximations. In Proc. SAS, pages 295–315, June 2003.
[20] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710, 1966.
[21] Ben Liblit, Alex Aiken, Alice X. Zheng, and Michael I. Jordan. Bug isolation via remote program sampling. In Proc. PLDI, pages 141–154, June 2003.
[22] Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan. Scalable statistical bug isolation. In Proc. PLDI, pages 15–26, June 2005.
[23] Chao Liu and Jiawei Han. Failure proximity: a fault localization-based approach. In Proc. FSE, pages 46–56, November 2006.
[24] William M. McKeeman. Differential testing for software. Digital Technical Journal, 10(1):100–107, December 1998.
[25] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In Proc. PLDI, pages 89–100, June 2007.
[26] Dan Pelleg and Andrew Moore. Active learning for anomaly and rare-category detection. In Advances in Neural Information Processing Systems 18, December 2004.
[27] Dan Pelleg and Andrew W. Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proc. ICML, pages 727–734, June/July 2000.
[28] Andy Podgurski, David Leon, Patrick Francis, Wes Masri, Melinda Minch, Jiayang Sun, and Bin Wang. Automated support for classifying software failure reports. In Proc. ICSE, pages 465–475, May 2003.
[29] John Regehr, Yang Chen, Pascal Cuoq, Eric Eide, Chucky Ellison, and Xuejun Yang. Test-case reduction for C compiler bugs. In Proc. PLDI, pages 335–346, June 2012.
[30] Manos Renieris and Steven Reiss. Fault localization with nearest neighbor queries. In Proc. ASE, pages 30–39, October 2003.
[31] Jesse Ruderman. Introducing jsfunfuzz. http://www.squarefree.com/2007/08/02/introducing-jsfunfuzz/.
[32] Jesse Ruderman. Mozilla bug 349611. https://bugzilla.mozilla.org/show_bug.cgi?id=349611 (A meta-bug containing all bugs found using jsfunfuzz.).
[33] Jesse Ruderman. How my DOM fuzzer ignores known bugs, 2010. http://www.squarefree.com/2010/11/21/how-my-dom-fuzzer-ignores-known-bugs/.
[34] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. CACM, 18(11):613–620, November 1975.
[35] Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. Winnowing: local algorithms for document fingerprinting. In Proc. SIGMOD, pages 76–85, June 2003.
[36] Alexander Strehl and Joydeep Ghosh. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583–617, 2003.
[37] Chengnian Sun, David Lo, Siau-Cheng Khoo, and Jing Jiang. Towards more accurate retrieval of duplicate bug reports. In Proc. ASE, pages 253–262, November 2011.
[38] Chengnian Sun, David Lo, Xiaoyin Wang, Jing Jiang, and Siau-Cheng Khoo. A discriminative model approach for accurate duplicate bug report retrieval. In Proc. ICSE, pages 45–54, May 2010.
[39] Vipindeep Vangala, Jacek Czerwonka, and Phani Talluri. Test case comparison and clustering using program profiles and static execution. In Proc. ESEC/FSE, pages 293–294, August 2009.
[40] Pavan Vatturi and Weng-Keen Wong. Category detection using hierarchical mean shift. In Proc. KDD, pages 847–856, June/July 2009.
[41] Andrew Walenstein, Mohammad El-Ramly, James R. Cordy, William S. Evans, Kiarash Mahdavi, Markus Pizka, Ganesan Ramalingam, and Jürgen Wolff von Gudenberg. Similarity in programs. In Duplication, Redundancy, and Similarity in Software, Dagstuhl Seminar Proceedings, July 2006.
[42] Xiaoyin Wang, Lu Zhang, Tao Xie, John Anvik, and Jiasu Sun. An approach to detecting duplicate bug reports using natural language and execution information. In Proc. ICSE, pages 461–470, May 2008.
[43] David B. Whalley. Automatic isolation of compiler errors. TOPLAS, 16(5):1648–1659, September 1994.
[44] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. Finding and understanding bugs in C compilers. In Proc. PLDI, pages 283–294, June 2011.
[45] Andreas Zeller and Ralf Hildebrandt. Simplifying and isolating failure-inducing input. IEEE TSE, 28(2):183–200, February 2002.