the original author of the jsfunfuzz tool, reports using a variety of heuristics to avoid looking at test cases that trigger known bugs, such as turning off features in the test-case generator and using tools like grep to filter out test cases triggering bugs that have predictable symptoms [33]. During testing of mission file systems at NASA [12], Groce et al. used hand-tuned rules to avoid repeatedly finding the same bugs during random testing—e.g., "ignore any test case with a reset within five operations of a rename."

We claim that much more sophisticated automation is feasible and useful. In this paper we describe and evaluate a system that does this for two fuzzers: one for JavaScript engines, the other for C compilers. We characterize the fuzzer taming problem:

    Given a potentially large collection of test cases, each of which triggers a bug, rank them in such a way that test cases triggering distinct bugs are early in the list.

    Sub-problem: If there are test cases that trigger bugs previously flagged as undesirable, place them late in the list.

Ideally, for a collection of test cases that triggers N distinct bugs (none of which have been flagged as undesirable), each of the first N test cases in the list would trigger a different bug. In practice, perfection is unattainable because the problem is hard and also because there is some subjectivity in what constitutes "distinct bugs." Thus, our goal is simply to improve as much as possible upon the default situation where test cases are presented to users in effectively random order.

Figure 2 shows the workflow for a fuzzer tamer. The "oracle" in the figure detects buggy executions, for example by watching for crashes and by running the compiler's output against a reference compiler's output. Our rank-ordering approach was suggested by the prevalence of ranking approaches for presenting alarms produced by static analyses to users [18, 19].

Figure 1. A fuzzer tends to hit some bugs thousands of times more frequently than others. (Plots the # of failures (logscale) per bug, 1–35, for GCC 4.3.0 crashes, GCC 4.3.0 wrong-code bugs, and Mozilla SpiderMonkey 1.6 bugs.)

Figure 2. Workflow for a fuzzer tamer. (Diagram components: fuzzer, test cases, compiler under test, code coverage, output, oracle, bug-triggering test cases, reducer, reduced bug-triggering test cases, tamer, ranked reduced bug-triggering test cases, user, feedback.)

Our contributions are as follows. First, we frame the fuzzer taming problem, which has not yet been addressed by the research community, as far as we are aware. Second, we make and exploit the observation that automatic triaging of test cases is strongly synergistic with automated test-case reduction. Third, based on the insight that bugs are highly diverse, we exploit diverse sources of information about bug-triggering test cases, including features of the test case itself, features from execution of the compiler on the test case, and features from the compiler's output. Fourth, we show that diverse test cases can be ranked highly by first placing test cases in a metric space and then using the furthest point first (FPF) technique from machine learning [10]. The more obvious approach to fuzzer taming is to use a clustering algorithm [27] to place tests into homogeneous groups, and then choose a representative test from each cluster. We show that FPF is both faster and more effective than clustering for all of our case studies. Finally, we show that our techniques can effectively solve the fuzzer taming problem for 2,603 test cases triggering 28 bugs in a JavaScript engine and 3,799 test cases triggering 46 bugs in a C compiler. Using our methods over this test suite, a developer who inspects the JavaScript engine's test cases in ranked order will more quickly find cases that trigger the 28 bugs found during the fuzzing run. In comparison to a developer who examines cases in a random order, the developer who inspects in ranked order will be 4.6× faster. For wrong-code bugs and crash bugs in the C compiler, the improvements are 2.6× and 32×, respectively. Even more importantly, users can find many more distinct bugs than would be found with a random ordering by examining only a few tens of test cases.

Taming a fuzzer differs from previous efforts in duplicate bug detection [37, 38, 42] because user-supplied metadata is not available: we must rely solely on information from failure-inducing test cases. Compared to previous work on dealing with software containing multiple bugs [15, 22, 28], our work differs in the methods used (ranking bugs as opposed to clustering), in the kinds of inputs to the machine learning algorithms (diverse, as opposed to just predicates or coverage information), and in its overall goal of taming a fuzzer.

2. Approach
This section describes our approach to taming compiler fuzzers and gives an overview of the tools implementing it.

2.1 Definitions
A fault or bug in a compiler is a flaw in its implementation. When the execution of a compiler is influenced by a fault—e.g., by wrong or missing code—the result may be an error that leads to a failure detected by a test oracle. In this paper, we are primarily concerned with two kinds of failures: (1) compilation or interpretation that fails to follow the semantics of the input source code, and (2) compiler crashes. The goal of a compiler fuzzer is to discover source programs—test cases—that lead to these failures. The goal of a fuzzer tamer is to rank failure-inducing test cases such that any prefix of the ranked list triggers as many different faults as possible. Faults are not directly observable, but a fuzzer tamer can estimate which test cases are related by a common fault by making an assumption: the more "similar" two test cases, or two executions of the compiler on those test cases, the more likely they are to stem from the same fault [23].

A distance function maps any pair of test cases to a real number that serves as a measure of similarity. This is useful because our goal is to present fuzzer users with a collection of highly dissimilar test cases. Because there are many ways in which two test cases can be similar to each other—e.g., they can be textually similar, cause similar failure output, or lead to similar executions of the compiler—our work is based on several distance functions.
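For illustration only, the information attached to each failure-inducing test case and the shape of a distance function over pairs of them can be sketched as follows in Python; the record fields and names here are illustrative, not those of our implementation:

from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Failure:
    """Information gathered for one failure-inducing test case (illustrative)."""
    test_text: str                      # reduced test case (C or JavaScript source)
    compiler_output: str = ""           # output of the compiler as it crashes, if any
    valgrind_output: str = ""           # Valgrind report for a failing execution, if any
    coverage: Dict[str, int] = field(default_factory=dict)  # e.g., function -> times executed

# A distance function maps any pair of failures to a non-negative real number;
# smaller values mean the two test cases are more likely to stem from the same fault.
DistanceFn = Callable[[Failure, Failure], float]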
2.2 Ranking Test Cases
Our approach to solving the fuzzer taming problem is based on the following idea.

    Hypothesis 1: If we (1) define a distance function between test cases that appropriately captures their static and dynamic characteristics and then (2) sort the list of test cases in furthest point first (FPF) order, then the resulting list will constitute a usefully approximate solution to the fuzzer taming problem.

If this hypothesis holds, the fuzzer taming problem is reduced to defining an appropriate distance function. The FPF ordering is one where each point in the list is the one that maximizes the distance to the nearest of all previously listed elements; it can be computed using a greedy algorithm [10]. We use FPF to ensure that diverse test cases appear early in the list. Conversely, collections of highly similar test cases will be found towards the end of the list.

Our approach to ignoring known bugs is based on the premise that fuzzer users will have labeled some test cases as exemplifying these bugs; this corresponds to the "feedback" edge in Figure 2.

    Hypothesis 2: We can lower the rank of test cases corresponding to bugs that are known to be uninteresting by "seeding" the FPF computation with the set of test cases that are labeled as uninteresting.

Thus, the most highly ranked test case will be the one maximizing its minimum distance from any labeled test case.

2.3 Distance Functions for Test Cases
The fundamental problem in defining a distance function that will produce good fuzzer taming results is that we do not know what the trigger for a generic compiler bug looks like. For example, one C compiler bug might be triggered by a struct with a certain sequence of bitfields; another bug might be triggered by a large number of local variables, which causes the register allocator to spill. Our solution to this fundamental ambiguity has been to define a variety of distance functions, each of which we believe will usefully capture some kinds of bug triggers. This section describes these distance functions.

Levenshtein Distance  Also known as edit distance, the Levenshtein distance [20] between two strings is the smallest number of character additions, deletions, and replacements that suffices to turn one string into the other. For every pair of test cases we compute the Levenshtein distance between the following, all of which can be treated as plain text strings:
• the test cases themselves;
• the output of the compiler as it crashes (if any); and
• the output of Valgrind [25] on a failing execution (if any).

Computing Levenshtein distance requires time proportional to the product of the string lengths, but the constant factor is small (a few tens of instructions), so it is reasonably efficient in practice.

Euclidean Distance  Many aspects of failure-inducing test cases, and of executions of compilers on these test cases, lend themselves to summarization in the form of feature vectors. For example, consider this reduced JavaScript test case, which triggers a bug in SpiderMonkey 1.6:

__proto__=__parent__
new Error(this)

Lexing this code gives eight tokens, and a feature vector based on these tokens contains eight nonzero elements. The overall vector contains one element for every token that occurs in at least one test case, but which does not occur in every test case, out of a batch of test cases that is being processed by the fuzzer tamer. The elements in the vector are based on the number of appearances of each token in the test case. We construct lexical feature vectors for both C and JavaScript.

Given two n-element vectors v1 and v2, the Euclidean distance between them is:

    $\sqrt{\sum_{i=1..n} (v_1[i] - v_2[i])^2}$

For C code, our intuition was that lexical analysis in some sense produced shallower results than it did for JavaScript. To compensate, we wrote a Clang-based detector for 45 additional features that we guessed might be associated with compiler bugs. These features include:
• common types, statement classes, and operator kinds;
• features specific to aggregate data types such as structs with bitfields and packed structs;
• obvious divide-by-zero operations; and
• some kinds of infinite loops that can be detected statically.

In addition to constructing vectors from test cases, we also constructed feature vectors from compiler executions. For example, the function coverage of a compiler is a list of the functions that it executes while compiling a test case. The overall feature vector for function coverage contains an element for every function executed while compiling at least one test case, but that is not executed while compiling all test cases. As with token-based vectors, the vector elements are based on how many times each function executed. We created vectors of:
• functions covered;
• lines covered;
• tokens in the compiler's output as it crashes (if any); and
• tokens in output from Valgrind (if any).

In the latter two cases, we use the same tokenization as with test cases (treating output from the execution as a text document), except that in the case of Valgrind we abstract some non-null memory addresses to a generic ADDRESS token. The overall hypothesis is that most bugs will exhibit some kind of dynamic signature that will reveal itself in one or more kinds of feature vector.

Normalization  Information retrieval tasks can often benefit from normalization, which serves to decrease the importance of terms that occur very commonly, and hence convey little information. Before computing distances over feature vectors, we normalized the value of each vector element using tf-idf [34]; this is a common practice in text clustering and classification. Given a count of a feature (token) in a test case or its execution (the "document"), the tf-idf is the product of the term-frequency (tf) and the inverse-document-frequency (idf) for the token. Term-frequency is the ratio of the count of the token in the document to the total number of tokens in the document. (For coverage we use the number of times the entity is executed.) Inverse-document-frequency is the logarithm of the ratio of the total number of documents to the total number of documents containing the token: this results in a uniformly zero value for tokens appearing in all documents, which are therefore not included in the vector. We normalize Levenshtein distances by the length of the larger of the two strings, which helps handle varying sizes for test cases or outputs.
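For illustration, the token-vector construction just described (token counts, tf-idf weighting, and Euclidean distance) can be sketched as follows, assuming a simple whitespace tokenizer in place of the real C and JavaScript lexers:

import math
from collections import Counter
from typing import Dict, List

def tokenize(text: str) -> List[str]:
    # Placeholder tokenizer; the real system lexes C or JavaScript.
    return text.split()

def tfidf_vectors(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Turn token lists into tf-idf weighted vectors.

    Tokens that appear in every document get idf = 0 and therefore
    drop out of the vectors, as described above."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        vec = {}
        for tok, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[tok])
            if idf > 0.0:   # token occurs in at least one, but not all, documents
                vec[tok] = tf * idf
        vectors.append(vec)
    return vectors

def euclidean(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """sqrt of the sum of squared differences over the union of features."""
    keys = set(v1) | set(v2)
    return math.sqrt(sum((v1.get(k, 0.0) - v2.get(k, 0.0)) ** 2 for k in keys))

The same construction applies to coverage-based vectors, with the number of times a function or line is executed taking the place of token counts.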
3. A Foundation for Experiments
To evaluate our work, we need a large collection of reduced versions of randomly generated test cases that trigger compiler bugs. Moreover, we require access to ground truth: the actual bug triggered by each test case. This section describes our approach to meeting these prerequisites.

3.1 Compilers Tested
We chose to test GCC 4.3.0 and SpiderMonkey 1.6, both running on Linux on x86-64. SpiderMonkey, best known as the JavaScript engine embedded in Firefox, is a descendant of the original JavaScript implementation; it contains an interpreter and several JIT compilers. Our selection of these particular versions was based on several considerations. First, the version that we fuzzed had to be buggy enough that we could generate useful statistics. Second, it was important that most of the bugs revealed by our fuzzer had been fixed by developers. This would not be the case for very recent compiler versions. Also, it turned out not to be the case for GCC 4.0.0, which we initially started using and had to abandon, since maintenance of its release branch—the 4.0.x series—terminated in 2007 with too many unfixed bugs.

3.2 Test Cases for C
We used the default configuration of Csmith [44] version 2.1.0, which over a period of a few days generated 2,501 test cases that crash GCC and 1,298 that trigger wrong-code bugs. The default configuration of Csmith uses swarm testing [13], which varies test features to improve fault detection and code coverage. Each program emitted by Csmith was compiled at -O0, -O1, -O2, -Os, and -O3. To detect crash bugs, we inspected the return code of the main compiler process; any nonzero value was considered to indicate a crash. To detect wrong-code bugs, we employed differential testing: we waited for the compiler's output to produce a result different from the result of executing the output of a reference compiler. Since no perfect reference compiler exists, we approximated one by running GCC 4.6.0 and Clang 3.1 at their lowest optimization levels and ensuring that both compilers produced executables that, when run, had the same output. (We observed no mismatches during our tests.)

Csmith's outputs tend to be large, often exceeding 100 KB. We reduced each failure-inducing test case using C-Reduce [29], a tool that uses a generalized version of Delta debugging to heuristically reduce C programs. After reduction, some previously different tests became textually equivalent; this happens because C-Reduce tries quite hard to reduce identifiers, constants, data types, and other constructs to canonical values. For crash bugs, reduction produced 1,797 duplicates, leaving only 704 different test cases. Reduction was less effective at canonicalizing wrong-code test cases, with only 23 duplicate tests removed, leaving 1,275 tests to examine. In both cases, the typical test case was reduced in size by two to three orders of magnitude, to an average size of 128 bytes for crash bugs and 243 bytes for wrong-code bugs.

3.3 Test Cases for JavaScript
We started with the last public release of jsfunfuzz [31], a tool that, over its lifetime, has led to the discovery of more than 1,700 faults in SpiderMonkey. We modified jsfunfuzz to support swarm testing and then ran it for several days, accumulating 2,603 failing test cases. Differential testing of JavaScript compilers is problematic due to their diverging implementations of many of the most bug-prone features of JavaScript. However, jsfunfuzz comes with a set of built-in test oracles, including semantic checks (e.g., ensuring that compiling then decompiling code is an identity function) and watchdog timers to ensure that infinite loops can only result from faults. For an ahead-of-time compiler like GCC, it is natural to divide bugs into those that manifest at compile time (crashes) and those that manifest at run time (wrong-code bugs). This distinction makes less sense for a just-in-time compiler such as SpiderMonkey; we did not attempt to make it, but rather lumped all bugs into a single category.

Test cases produced by jsfunfuzz were also large, over 100 KB on average. We reduced test cases using a custom reducer similar in spirit to C-Reduce, tuned for JavaScript. Reduction resulted in 854 duplicate test cases that we removed, leaving 1,749 test cases for input to the fuzzer taming tools. The typical failure-inducing test case for SpiderMonkey was reduced in size by more than three orders of magnitude, to an average size of 68 bytes.

3.4 Establishing Ground Truth
Perhaps the most onerous part of our work involved determining ground truth: the actual bug triggered by each test case. Doing this the hard way—examining the execution of the compiler for each of thousands of failure-inducing test cases—is obviously infeasible. Instead, our goal was to create, for each of the 74 total bugs that our fuzzing efforts revealed, a patched compiler fixing only that bug. At that point, ground-truth determination can be automated: for each failure-inducing test case, run it through every patched version of the compiler and see which one changes its behavior. We only partially accomplished our goal. For a collection of arbitrary bugs in a large application that is being actively developed, it turns out to be very hard to find a patch fixing each bug, and only that bug.

For each bug, we started by performing an automated forward search to find the patch that fixed the bug. In some cases this patch (1) was small; (2) clearly fixed the bug triggered by the test case, as opposed to masking it by suppressing execution of the buggy code; and (3) could be back-ported to the version of the compiler that we tested. In other cases, some or all of these conditions failed to hold. For example, some compiler patches were extraordinarily complex, changing tens of thousands of lines of code. Moreover, these patches were written for compiler versions that had evolved considerably since the GCC 4.3.0 and SpiderMonkey 1.6 versions that are the basis for our experiments.

Although we spent significant effort trying to create a minimal patch fixing each compiler bug triggered by our fuzzing effort, this was not always feasible. Our backup strategy for assessing ground truth was first to approximately classify each test case based on the revision of the compiler that fixed the bug that it triggered, and second to manually inspect each test case in order to determine a final classification for which bug it triggered, based on our understanding of the set of compiler bugs.

3.5 Bug Slippage
When the original and reduced versions of a test case trigger different bugs, we say that bug slippage has occurred. Slippage is not hard to avoid for bugs that have an unambiguous symptom (e.g., "assertion violation at line 512") but it can be difficult to avoid for silent bugs such as those that cause a compiler to emit incorrect code. Although slippage is normally difficult to recognize or quantify, these tasks are easy when ground truth is available, as it is here.

Of our 2,501 unreduced test cases that caused GCC 4.3.0 to crash, almost all triggered the same (single) bug that was triggered by the test case's reduced version. Thirteen of the unreduced test cases triggered two different bugs, and in all of these cases the reduced version triggered one of the two. Finally, we saw a single instance of actual slippage where the original test case triggered one bug in GCC leading to a segmentation fault and the reduced version triggered a different bug, also leading to a segmentation fault. For the 1,298 test cases triggering wrong-code bugs in GCC, slippage during reduction occurred fifteen times.

For JavaScript, bug slippage was a more serious problem: 23% of reduced JavaScript test cases triggered a different bug than the original test case. This problem was not mitigated (as we had
originally hoped) by re-reducing test cases using the slower "debug" version of SpiderMonkey.

In short, bug slippage was a problem for SpiderMonkey 1.6 but not for GCC 4.3.0. Although the dynamics of test-case reduction are complex, we have a hypothesis about why this might have been the case. Test-case reduction is a heuristic search that explores one particular path through the space of all possible programs. This path stays in the subset of programs that trigger a bug and also follows a gradient leading towards smaller test cases. Sometimes, the trajectory will pass through the space of programs triggering some completely different bug, causing the reduction to be "hijacked" by the second bug. We would expect this to happen more often for a compiler that is buggier. Our observation is that GCC 4.3.0 is basically a solid and mature implementation whereas SpiderMonkey 1.6 is not—it contains many bugs in fairly basic language features.

4. Results and Discussion
For 1,979 reduced test cases triggering 46 bugs in GCC 4.3.0 and 1,749 reduced test cases triggering 28 bugs in SpiderMonkey 1.6, our goal is to rank them for presentation to developers such that diverse faults are triggered by test cases early in the list.

4.1 Evaluating Effectiveness using Bug Discovery Curves
Figures 3–8 present the primary results of our work using bug discovery curves. A discovery curve shows how quickly a ranking of items allows a human examining the items one by one to view at least one representative of each different category of items [26, 40]. Thus, a curve that climbs rapidly is better than a curve that climbs more slowly. Here, the items are test cases and categories are the underlying compiler faults. The top of each graph represents the point at which all faults have been presented. As shown by the y-axes of the figures, there are 28 SpiderMonkey bugs, 11 GCC crash bugs, and 35 GCC wrong-code bugs in our study.

Each of Figures 3–8 includes a baseline: the expected bug discovery curve without any fuzzer taming. We computed it by looking at test cases in random order, averaged over 10,000 orders. We also show the theoretical best curve where for N faults each of the first N tests reveals a new fault.

In each graph, we show in solid black the first method to find all bugs (which, in all of our examples, is also the method with the best area under the full curve). For GCC crash bugs and SpiderMonkey, this method also has the best climb for the first 50 tests, and for GCC wrong-code bugs, it is almost the best for the first 50 tests (and, in fact, discovers one more bug than the curve with the best area). For this best curve, we also show points sized by the log of the frequency of the fault; our methods do not always find the most commonly triggered faults first. Finally, each graph additionally shows the best result that we could obtain by ranking test cases using clustering instead of FPF, using X-means to generate clusterings by various features, sorting all clusterings by isolation and compactness, and using the centermost test for each cluster. (See Section 4.6 for details.)

4.2 Are These Results Any Good?
Our efforts to tame fuzzers would have clearly failed had we been unable to significantly improve on the baseline. On the other hand, there is plenty of room for improvement: our bug discovery curves do not track the "theoretical best" lines in Figures 3 and 7 for very long. For GCC crash bugs, however, our results are almost perfect.

Perhaps the best way to interpret our results is in terms of the value proposition they create for compiler developers. For example, if a SpiderMonkey team member examines 15 randomly chosen reduced test cases, he or she can expect them to trigger five different bugs. In contrast, if the developer examines the first 15 of our ranked tests, he or she will see 11 distinct bugs: a noticeable improvement.

4.3 Selecting a Distance Function
In Section 2 we described a number of ways to compute distances between test cases. Since we did not know which of these would work, we tried all of them individually and together, with Figures 3–8 showing our best results. Since we did not consider enough case studies to be able to reach a strong conclusion such as "fuzzer taming should always use Levenshtein distance on test case text and compiler output," this section analyzes the detailed results from our different distance functions, in hopes of reaching some tentative conclusions about which functions are and are not useful.

SpiderMonkey and GCC Crash Bugs  For these faults, the best distance function to use as the basis for FPF, based on our case studies, is the normalized Levenshtein distance between test cases plus normalized Levenshtein distance between failure outputs. Our tentative recommendation for bugs that (1) reduce very well and (2) have compiler-failure outputs is: use normalized Levenshtein distance over test-case text plus compiler-output text, and do not bother with Valgrind output or coverage information.

Given that using Levenshtein distance on the test-case text plus compiler output worked so well for both of these bug sets, where all faults had meaningful failure symptoms, we might expect using output or test-case text alone to also perform acceptably. In fact, Levenshtein distance based on test-case text alone (not normalized) performed moderately well for SpiderMonkey, but otherwise the results for these distance functions were uniformly mediocre at best. For GCC, using compiler output plus C features (Section 2.3) performed nearly as well as the best distance function, suggesting that the essential requirement is compiler output combined with a good representation of the test case, which may not be satisfied by a simple vectorization: vectorizing test case plus output performed badly for both GCC and SpiderMonkey.

Coverage-based methods worked fairly well for SpiderMonkey, appearing in six of the top ten functions and only two of the worst ten. Interestingly, these best coverage methods for SpiderMonkey all included both line and function coverage. Both coverage-based functions were uniformly mediocre for GCC crashes (coverage did not appear in any of the best ten or worst ten methods). For GCC, Valgrind was of little value, as most failures did not produce any Valgrind output. Memory-safety errors were more common in SpiderMonkey, so most test cases did produce Valgrind output; however, for the most part, adding the information to a distance function still made the function perform worse in the long run. Valgrind output alone performed extremely poorly in the long run for both GCC crashes and SpiderMonkey bugs.

GCC Crash Bugs  For these bugs, every distance function increased the area under the curve for examining less than 50 tests by a factor of four or better, compared to the baseline. Clearly there is a significant amount of redundancy in the information provided by different functions. All but five of the 63 distance functions we used were able to discover all bugs within at most 90 tests: a dramatic improvement over the baseline's 491 tests. Only Valgrind output alone performed worse than the baseline. The other four poorly performing methods all involved using vectorization of the test case, with no additional information beyond Valgrind output and/or test-case output.

GCC crash bugs were, however, our easiest target: there are only 11 crash outputs and 11 faults. Even so, the problem is not trivial, as the faults and outputs do not correspond perfectly—two faults have two different outputs, and there are two outputs that are each produced by two different faults. Failure output alone provides a great deal of information about a majority of the faults, and test-case distance completes the story.
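For illustration, the distance recommended above for these bug classes, normalized Levenshtein distance over test-case text plus normalized Levenshtein distance over failure output, can be sketched as follows (the failure record fields are the illustrative ones from Section 2.1):

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # replacement
        prev = cur
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def crash_distance(f1, f2) -> float:
    # Normalized Levenshtein over test text plus normalized Levenshtein over failure output.
    return (normalized_levenshtein(f1.test_text, f2.test_text) +
            normalized_levenshtein(f1.compiler_output, f2.compiler_output))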
Figure 3. SpiderMonkey 1.6 bug discovery curves, first 50 tests.
Figure 4. SpiderMonkey 1.6 bug discovery curves, all tests. (Curves in Figures 3 and 4: theoretical best; baseline (examine in random order); best curve FPF(Lev(test)+Lev(output)), with points sized by fault frequency; best clustering curve C(funccov)+C(linecov).)
Figure 5. GCC 4.3.0 crash bug discovery curves, first 50 tests.
Figure 6. GCC 4.3.0 crash bug discovery curves, all tests.
Figure 7. GCC 4.3.0 wrong-code bug discovery curves, first 50 tests.
Figure 8. GCC 4.3.0 wrong-code bug discovery curves, all tests. (Curves in Figures 7 and 8: theoretical best; baseline (examine in random order); best curve FPF(funccov), with points sized by fault frequency; best clustering curve C(funccov)+C(C-features). All plots show # Faults Seen versus # Tests Examined.)
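For reference, a discovery curve like those in Figures 3–8 can be computed directly from a ranked list of test cases and their ground-truth bug labels; a minimal sketch, with made-up labels, is:

from typing import List, Sequence

def discovery_curve(ranked_bug_labels: Sequence[str]) -> List[int]:
    """Given the ground-truth bug label of each test case in ranked order,
    return, for each prefix length, how many distinct bugs that prefix triggers."""
    seen = set()
    curve = []
    for label in ranked_bug_labels:
        seen.add(label)
        curve.append(len(seen))
    return curve

# Example: a ranking whose first four tests trigger three distinct bugs.
print(discovery_curve(["bug-A", "bug-B", "bug-A", "bug-C"]))  # [1, 2, 2, 3]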
SpiderMonkey Bugs  The story was less simple for SpiderMonkey bugs, where many methods performed poorly and seven methods performed worse than the baseline. In this case, compiler output alone did not provide as much direct guidance: there were 300 different failure outputs, and only four of 28 faults had a unique identifying output. As a result, while compiler output alone performed very well over the first 50 tests (where it was one of the best five functions), it proved one of the worst functions for finding all faults, detecting no new faults between the 50th and 100th tests ranked. Test-case text by itself performed well for SpiderMonkey with Levenshtein distance, or when combined with line coverage, but performed badly as a vectorization without line coverage, appearing in six of the worst ten functions. As with GCC crashes, Valgrind output alone performed very badly, requiring a user to examine 1,506 tests to discover all bugs. Levenshtein-based approaches (whether over test case, compiler output, Valgrind output, or a combination thereof) performed very well over the first 50 tests examined.

GCC Wrong-Code Bugs  Wrong-code bugs in GCC were the trickiest bugs that we faced: their execution does not provide failure output and, in the expected case where the bug is in a "middle end" optimizer, the distance between execution of the fault and actual emission of code (and thus exposure of failure) can be quite long.

For these bugs, the best method to use for fuzzer taming was less clear. Figures 9 and 10 show the performance of all methods that we tried, including a table of results sorted by area under the curve up to 50 tests (Figure 9) and number of test cases to discover all faults (Figure 10). It is clear that code coverage (line or function) is much more valuable here than with crash bugs, though Levenshtein distance based on test case alone performs well in the long run (but badly initially). Line coverage is useful for early climb, but eventually function coverage is most useful for discovering all bugs. Perhaps most importantly, given the difficulty of handling GCC wrong-code bugs, all of our methods perform better than the baseline in terms of ability to find all bugs, and provide a clear advantage over the first 50 test cases. We do not wish to overgeneralize from a few case studies, but these results provide hope that for difficult bugs, if good reduction is possible, the exact choice of distance function used in FPF may not be critical.

We were disappointed to see that Figures 9 and 10 show no evidence that our domain-specific feature detector for C programs is useful for wrong-code bugs; in the tables it appears as "C-Feature."

Figure 9. All bug discovery curves for GCC 4.3.0 wrong-code bugs, sorted by increasing area under the curve if examining only the 50 top-ranked test cases. (Legend: FPF(test+linecov) 490; FPF(linecov) 490; FPF(linecov+C-features) 490; FPF(test+linecov+C-features) 489; FPF(test+funccov+linecov) 471; FPF(funccov+linecov+C-features) 471; FPF(funccov+linecov) 471; FPF(test+funccov+linecov+C-features) 469; FPF(funccov) 459; FPF(funccov+C-features) 450; FPF(test+funccov) 450; FPF(test+funccov+C-features) 431; FPF(test+C-features) 425; FPF(C-features) 425; FPF(test) 424; FPF(Lev(test) not normalized) 368; FPF(Lev(test)) 318; Baseline (examine in random order) 247.)

Figure 10. All bug discovery curves for GCC 4.3.0 wrong-code bugs, sorted by increasing number of test cases that must be examined to discover all faults. (Legend: FPF(funccov) 460; FPF(funccov+C-features) 460; FPF(test+funccov) 461; FPF(test+funccov+C-features) 462; FPF(Lev(test)) 500; FPF(Lev(test) not normalized) 605; FPF(linecov) 894; FPF(funccov+linecov) 894; FPF(test+linecov) 911; FPF(test+linecov+C-features) 911; FPF(linecov+C-features) 911; FPF(test+funccov+linecov) 911; FPF(funccov+linecov+C-features) 911; FPF(test+funccov+linecov+C-features) 911; FPF(C-features) 929; FPF(test+C-features) 932; FPF(test) 934; Baseline (examine in random order) 1204.)

4.4 Avoiding Known Faults
In Section 2.2 we hypothesized that FPF could be used to avoid reports about a set of known bugs; this is accomplished by lowering the rankings of test cases that appear to be caused by those bugs. Figure 11 shows, for our SpiderMonkey test cases, an averaged bug discovery curve for the case where half of the bugs were assumed to be already known, and five test cases (or fewer, if five were not available) triggering each of those bugs were used to seed FPF. This experiment models the situation where, in the days or weeks preceding the current fuzzing run, the user has flagged these test cases and does not want to see more test cases triggering the same bugs. The curve is the average of 100 discovery curves, each corresponding to a different randomly chosen set of known bugs.

The topmost bug discovery curve in Figure 11 represents an idealized best case where all test cases corresponding to known bugs are removed from the set of test cases to be ranked. The second curve from the top is our result. The third curve from the top is also the average of 100 discovery curves; each corresponds to the case where the five (or fewer) test cases for each known bug are discarded instead of being used to seed the FPF algorithm, and then the FPF algorithm proceeds normally. This serves as a baseline: our result would have to be considered poor if it could not improve on this. Finally, the bottom curve is the "basic baseline" where the labeled test cases are again discarded, but then the remaining test cases are examined in random order.

Figure 11. Avoiding known bugs in SpiderMonkey 1.6. (Curves, top to bottom: our approach with complete avoidance of known faults (theoretical best); our result (using examples of known bugs); our technique without examples (better baseline); baseline (examine in random order).)
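For illustration, the greedy FPF ranking of Section 2.2, including the seeding used here to push test cases resembling known bugs toward the end of the list, can be sketched as follows (it assumes a distance function such as the ones sketched earlier):

def fpf_rank(tests, distance, seeds=()):
    """Greedy furthest point first (FPF) ordering.

    Each remaining test is scored by its distance to the nearest
    already-placed item (labeled seed test cases count as already
    placed), and the highest-scoring test is emitted next."""
    remaining = list(tests)
    ranked = []
    if seeds:
        placed = list(seeds)
    elif remaining:
        first = remaining.pop(0)      # with no seeds, start from an arbitrary test
        ranked.append(first)
        placed = [first]
    else:
        return ranked
    # Distance from each remaining test to its nearest placed item.
    nearest = [min(distance(t, p) for p in placed) for t in remaining]
    while remaining:
        idx = max(range(len(remaining)), key=lambda i: nearest[i])
        pick = remaining.pop(idx)
        nearest.pop(idx)
        ranked.append(pick)
        # Placing pick can only lower each remaining test's nearest distance.
        nearest = [min(d, distance(t, pick)) for d, t in zip(nearest, remaining)]
    return ranked

With seeds drawn from test cases labeled as triggering known, uninteresting bugs, the first test emitted is the one whose minimum distance to any labeled test case is largest, matching Hypothesis 2.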
[…]sive, our view is that attempting to tame a fuzzer without the aid of a solid test-case reducer is inadvisable. The most informative sources of information about root causes are rendered useless by overwhelming noise. Although we did not create results for unreduced GCC test cases that are analogous to those shown in Figure 12 (the line coverage vectors were gigantic and caused problems by filling up disks), we have no reason to believe the results would have been any […]

[Figure 12 legend: FPF(Valgrind+output) 71408 2314; FPF(Lev(output)) 69943 2336; FPF(Lev(output) not normalized) 69804 2336; FPF(output) 68267 2311; FPF(Valgrind) 67935 2276; FPF(Lev(Valgrind+output)) 67207 2482; FPF(Valgrind+output+test) 64848 2390; FPF(output+test) 64819 2388; Baseline (examine in random order) 63482 2603; FPF(Valgrind+test) 62389 2393. Axes: # Faults Seen versus # Tests Examined.]
Program / Feature                        FPF time (s)   Clustering time (s)   Figures
SpiderMonkey / Valgrind                      8.27            23.68              –
SpiderMonkey / output                        8.38            46.71              –
SpiderMonkey / test                          8.12            94.26              –
SpiderMonkey / funccov                       9.56           227.78              3, 4
SpiderMonkey / linecov                      48.29         1,594.04              3, 4
SpiderMonkey / Lev. test+output            998.21             N/A               3, 4
GCC crash bugs / output                      0.08             0.71              –
GCC crash bugs / Valgrind                    0.09             0.75              5, 6
GCC crash bugs / C-Feature                   0.10             1.95              5, 6
GCC crash bugs / test                        0.14            15.12              5, 6
GCC crash bugs / funccov                     1.37           162.22              –
GCC crash bugs / linecov                    18.70         2,021.08              –
GCC crash bugs / Lev. test+output           75.07             N/A               5, 6
GCC wrong-code bugs / C-Feature              0.49             4.26              7, 8
GCC wrong-code bugs / test                   0.72            67.72              –
GCC wrong-code bugs / funccov                4.12         1,046.07              7, 8
GCC wrong-code bugs / linecov               60.60         7,127.42              –
GCC wrong-code bugs / Lev. test            667.21             N/A               –

Table 1. Runtimes (in seconds) for FPF versus clustering

[…] is a widely used tool written in C). Because the isolation and compactness computations require many pairwise distance results, an efficient implementation should be approximately equal in time to running FPF. The final column of the table lists the figures in this paper that show a curve based on the indicated results. If a curve relies on multiple clusterings, its generation time is (at least) the sum of the clustering times for each component. Note that because X-means expects inputs in vector form, we were unable to apply our direct Levenshtein-distance approach with clustering, but we include some runtimes for FPF Levenshtein to provide a comparison.

That clustering is more expensive and complex than FPF is not surprising; clustering has to perform the additional work of computing clusters, rather than simply ranking items by distance. That FPF produces considerably better discovery curves, as shown in Figures 3–8, is surprising. The comparative ineffectiveness of clustering is twofold: the discovery curves do not climb as quickly as with FPF, and (perhaps even more critically) clustering does not ever find all the faults in many cases. In general, for almost all feature sets, clustering over those same features was worse than applying FPF to those features. The bad performance of clustering was particularly clear for GCC wrong-code bugs: Figure 13 shows all discovery curves for GCC wrong-code, with clustering results shown in gray. Clustering at its "best" missed 15 or more bugs, and in many cases performed much worse than the baseline, generating a small number of clusters that were not represented by distinct faults. In fact, the few clustering results that manage to discover 20 faults also did so more slowly than the baseline curve. While GCC wrong-code clustering was particularly bad, clustering also always missed at least three bugs for SpiderMonkey. Our hypothesis as to why FPF performs so much better than clustering is that the nature of fuzzing results, with a long tail of outliers, is a mismatch for clustering algorithm assumptions. FPF is not forced to use any assumptions about the size of clusters, and so is not "confused" by the many single-instance clusters. A minor point supporting our hypothesis is that the rank ordering of clustering effectiveness matched that of the size of the tail for each set of faults: GCC crash results were good but not optimal, SpiderMonkey results were poor, and GCC wrong-code results were extremely bad.

Figure 13. GCC 4.3.0 wrong-code bug clustering comparison. (FPF and baseline discovery curves shown against clustering-based curves; axes: # Tests Examined versus # Faults Seen.)

5. Related Work
A great deal of research related to fuzzer taming exists, and some related areas such as fault localization are too large for us to do more than summarize the high points.

Relating Test Cases to Faults  Previous work focusing on the core problem of "taming" sets of redundant test cases differs from ours in a few key ways. The differences relate to our choice of primary algorithm, our reliance on unsupervised methods, and our focus on randomly generated test cases.

First, the primary method used was typically clustering, as in the work of Francis et al. [9] and Podgurski et al. [28], which at first appears to reflect the core problem of grouping test cases into equivalence classes by underlying fault. However, in practice the user of a fuzzer does not usually care about the tests in a cluster, but only about finding at least one example from each set with no particular desire that it is a perfectly "representative" example. The core problem we address is therefore better considered as one of multiple output identification [8] or rare category detection [8, 40], given that many faults will be found by a single test case out of thousands. This insight guides our decision to provide the first evaluation in terms of discovery curves (the most direct measure of fuzzer taming capability we know of) for this problem. Our results suggest that this difference in focus is also algorithmically useful, as clustering was less effective than our (novel, to our knowledge) choice of FPF.

One caveat is that, as in the work of Jones et al. on debugging in parallel [15], clusters may not be directly useful to users, but might assist fault localization algorithms. Jones et al. provide an evaluation in terms of a model of debugging effort, which combines clustering effectiveness with fault-localization effectiveness. This provides an interesting contrast to our discovery curves: it relies on more assumptions about users' workflow and debugging process and provides less direct information about the effectiveness of taming itself. In our experience, sufficiently reduced test cases make localization easy enough for many compiler bugs that discovery is the more important problem. Unfortunately, it is hard to compare results: cost-model results are only reported for SPACE, a program with only around 6,200 LOC, and their tests included not only random tests from a simple generator but 3,585 human-generated tests. In the event that clusters are needed, FPF results for any k can be transformed into k clusters with certain optimality bounds for the chosen distance function [10].

Second, our approach is completely unsupervised. There is no expectation that users will examine clusters, add rules, or intervene in the process. We therefore use test-case reduction for feature selection, rather than basing it on classifying test cases as successful or failing [9, 28]. Because fuzzing results follow a power law, many faults will be represented by far too few tests for a good classifier to
include their key features; this is a classic and extreme case of class imbalance in machine learning. While bug slippage is a problem, reduction remains highly effective for feature selection, in that the features selected are correct for the reduced test cases, essentially by the definition of test-case reduction.

Finally, our expected use case and experimental results are based on a large set of failures produced by large-scale random testing for complex programming languages implemented in large, complex, modern compilers. Most previous results in failure clustering used human-reported failures or human-created regression tests (e.g., GCC regression tests [9, 28]), which are essentially different in character from the failures produced by large-scale fuzzing, and/or concerned much smaller programs with much simpler input domains [15, 23], i.e., examples from the Siemens suite. Liblit et al. [22] in contrast directly addressed scalability by using 32,000 random inputs (though not from a pre-existing industrial-strength fuzzer for a complex language) and larger programs (up to 56 KLOC), and noted that they saw highly varying rates of failure for different bugs. Their work addresses a somewhat different problem than ours—that of isolating bugs via sampled predicate values, rather than straightforward ranking of test cases for user examination—and did not include any systems as large as GCC or SpiderMonkey.

Distance Functions for Executions and Programs  Liu and Han's work [23], like ours, focuses less on a particular clustering method and proposes that the core problem in taming large test suites is that of embedding test cases in a metric space that has good correlation with underlying fault causes. They propose to compute distance by first applying fault localization methods to executions, then using distance over localization results rather than over the traces themselves. We propose that the reduction of random test cases essentially "localizes" the test cases themselves, allowing us to directly compute proximity over test cases while exhibiting the good correlation with underlying cause that Liu and Han seek to achieve by applying a fault-localization technique. Reduction has advantages over localization in that reduction methods are more commonly employed and do not require storing—or even capturing or sampling—information about coverage, predicates, or other metrics for passing test cases. Liu and Han show that distance based on localization algorithms better captures fault cause than distance over raw traces, but they do not provide discovery curves or a detailed clustering evaluation. They provide correlation results only over the Siemens suite's small subjects and test case sets.

More generally, the problems of distance functions over executions and test cases [5, 11, 23, 30, 39] and programs themselves [4, 35, 41] have typically been seen as essentially different problems. While this is true for many investigations—generalized program understanding and fault localization on the one hand, and plagiarism detection, merging of program edits, code clone, or malware detection on the other—the difference collapses when we consider that every program compiled is an input to some other program. A program is therefore both a program and a test input, which induces an execution of another program. Distance between (compiled) programs, therefore, is a distance between executions. We are the first, to our knowledge, to essentially erase the distinction between a metric space for programs and a metric space for executions, mixing the two concepts as needed. Moreover, we believe that our work addresses some of the concerns noted in fault-localization efforts based on execution distances (e.g., poor results compared to other methods [16]), in that distance functions should perform much better on executions of reduced programs, due to the power of feature selection, and distances over programs (highly structured and potentially very informative inputs) can complement execution-based distance functions.

Fault Localization  Our work shares a common ultimate goal with fault localization work in general [5, 7, 11, 16, 17, 21, 22, 30] and specifically for compilers [43]: reducing the cost of manual debugging. We differ substantially in that we focus our methods and evaluation on the narrow problem of helping the users of fuzzers deal with the overwhelming amount of data that a modern fuzzer can produce when applied to a compiler. As suggested by Liu and Han [23], Jones et al. [15], and others, localization may support fuzzer taming and fuzzer taming may support localization. As part of our future work, we propose to make use of vectors based on localization information to determine if, even after reduction, localization can improve bug discovery. A central question is whether the payoff from keeping summaries of successful executions (a requirement for many fault localizations) provides sufficient improvement to pay for its overhead in reduced fuzzing throughput.

6. Conclusion
Random testing, or fuzzing, has emerged as an important way to test compilers and language runtimes. Despite their advantages, however, fuzzers create a unique set of challenges when compared to other testing methods. First, they indiscriminately and repeatedly find test cases triggering bugs that have already been found and that may not be economical to fix in the short term. Second, fuzzers tend to trigger some bugs far more often than others, creating needle-in-the-haystack problems for engineers who are triaging failure-inducing outputs generated by fuzzers.

Our contribution is to tame a fuzzer by adding a tool to the back end of the random-testing workflow; it uses techniques from machine learning to rank test cases in such a way that interesting tests are likely to be highly ranked. By analogy to the way people use ranked outputs from static analyses, we expect fuzzer users to inspect a small fraction of highly ranked outputs, trusting that lower-ranked test cases are not as interesting. If our rankings are good, fuzzer users will get most of the benefit of inspecting every failure-inducing test case discovered by the fuzzer for a fraction of the effort. For example, a user inspecting test cases for SpiderMonkey 1.6 in our ranked order will see all 28 bugs found during our fuzzing run 4.6× faster than will a user inspecting test cases in random order. A user inspecting test cases that cause GCC 4.3.0 to emit incorrect object code will see all 35 bugs 2.6× faster than one inspecting tests in random order. The improvement for test cases that cause GCC 4.3.0 to crash is even higher: 32×, with all 11 bugs exposed by only 15 test cases.

Acknowledgments
We thank Michael Hicks, Robby Findler, and the anonymous PLDI '13 reviewers for their comments on drafts of this paper; Suresh Venkatasubramanian for nudging us towards the furthest point first technique; James A. Jones for providing useful early feedback; and Google for a research award supporting Yang Chen. A portion of this work was funded by NSF grants CCF-1217824 and CCF-1054786.

References
[1] James H. Andrews, Alex Groce, Melissa Weston, and Ru-Gang Xu. Random test run length and effectiveness. In Proc. ASE, pages 19–28, September 2008.
[2] Abhishek Arya and Cris Neckar. Fuzzing for security, April 2012. http://blog.chromium.org/2012/04/fuzzing-for-security.html.
[3] Mariano Ceccato, Alessandro Marchetto, Leonardo Mariani, Cu D. Nguyen, and Paolo Tonella. An empirical study about the effectiveness of debugging when random test cases are used. In Proc. ICSE, pages 452–462, June 2012.
[4] Silvio Cesare and Yang Xiang. Malware variant detection using similarity search over sets of control flow graphs. In Proc. TRUSTCOM, pages 181–189, November 2011.
[5] Sagar Chaki, Alex Groce, and Ofer Strichman. Explaining abstract counterexamples. In Proc. FSE, pages 73–82, 2004.
[6] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proc. ICFP, pages 268–279, 2000.
[7] Holger Cleve and Andreas Zeller. Locating causes of program failures. In Proc. ICSE, pages 342–351, May 2005.
[8] Shai Fine and Yishay Mansour. Active sampling for multiple output identification. Machine Learning, 69(2–3):213–228, 2007.
[9] Patrick Francis, David Leon, Melinda Minch, and Andy Podgurski. Tree-based methods for classifying software failures. In Proc. ISSRE, pages 451–462, November 2004.
[10] Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.
[11] Alex Groce. Error explanation with distance metrics. In Proc. TACAS, pages 108–122, March 2004.
[12] Alex Groce, Gerard Holzmann, and Rajeev Joshi. Randomized differential testing as a prelude to formal verification. In Proc. ICSE, pages 621–631, May 2007.
[13] Alex Groce, Chaoqiang Zhang, Eric Eide, Yang Chen, and John Regehr. Swarm testing. In Proc. ISSTA, pages 78–88, July 2012.
[14] Christian Holler, Kim Herzig, and Andreas Zeller. Fuzzing with code fragments. In Proc. USENIX Security, pages 445–458, August 2012.
[15] James A. Jones, James F. Bowring, and Mary Jean Harrold. Debugging in parallel. In Proc. ISSTA, pages 16–26, July 2007.
[16] James A. Jones and Mary Jean Harrold. Empirical evaluation of the Tarantula automatic fault-localization technique. In Proc. ASE, pages 273–282, November 2005.
[17] James A. Jones, Mary Jean Harrold, and John Stasko. Visualization of test information to assist fault localization. In Proc. ICSE, pages 467–477, May 2002.
[18] Yungbum Jung, Jaehwang Kim, Jaeho Shin, and Kwangkeun Yi. Taming false alarms from a domain-unaware C analyzer by a Bayesian statistical post analysis. In Proc. SAS, pages 203–217, September 2005.
[19] Ted Kremenek and Dawson Engler. Z-ranking: using statistical analysis to counter the impact of static analysis approximations. In Proc. SAS, pages 295–315, June 2003.
[20] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710, 1966.
[21] Ben Liblit, Alex Aiken, Alice X. Zheng, and Michael I. Jordan. Bug isolation via remote program sampling. In Proc. PLDI, pages 141–154, June 2003.
[22] Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan. Scalable statistical bug isolation. In Proc. PLDI, pages 15–26, June 2005.
[23] Chao Liu and Jiawei Han. Failure proximity: a fault localization-based approach. In Proc. FSE, pages 46–56, November 2006.
[24] William M. McKeeman. Differential testing for software. Digital Technical Journal, 10(1):100–107, December 1998.
[25] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In Proc. PLDI, pages 89–100, June 2007.
[26] Dan Pelleg and Andrew Moore. Active learning for anomaly and rare-category detection. In Advances in Neural Information Processing Systems 18, December 2004.
[27] Dan Pelleg and Andrew W. Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proc. ICML, pages 727–734, June/July 2000.
[28] Andy Podgurski, David Leon, Patrick Francis, Wes Masri, Melinda Minch, Jiayang Sun, and Bin Wang. Automated support for classifying software failure reports. In Proc. ICSE, pages 465–475, May 2003.
[29] John Regehr, Yang Chen, Pascal Cuoq, Eric Eide, Chucky Ellison, and Xuejun Yang. Test-case reduction for C compiler bugs. In Proc. PLDI, pages 335–346, June 2012.
[30] Manos Renieris and Steven Reiss. Fault localization with nearest neighbor queries. In Proc. ASE, pages 30–39, October 2003.
[31] Jesse Ruderman. Introducing jsfunfuzz. http://www.squarefree.com/2007/08/02/introducing-jsfunfuzz/.
[32] Jesse Ruderman. Mozilla bug 349611. https://bugzilla.mozilla.org/show_bug.cgi?id=349611 (A meta-bug containing all bugs found using jsfunfuzz.).
[33] Jesse Ruderman. How my DOM fuzzer ignores known bugs, 2010. http://www.squarefree.com/2010/11/21/how-my-dom-fuzzer-ignores-known-bugs/.
[34] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. CACM, 18(11):613–620, November 1975.
[35] Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. Winnowing: local algorithms for document fingerprinting. In Proc. SIGMOD, pages 76–85, June 2003.
[36] Alexander Strehl and Joydeep Ghosh. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583–617, 2003.
[37] Chengnian Sun, David Lo, Siau-Cheng Khoo, and Jing Jiang. Towards more accurate retrieval of duplicate bug reports. In Proc. ASE, pages 253–262, November 2011.
[38] Chengnian Sun, David Lo, Xiaoyin Wang, Jing Jiang, and Siau-Cheng Khoo. A discriminative model approach for accurate duplicate bug report retrieval. In Proc. ICSE, pages 45–54, May 2010.
[39] Vipindeep Vangala, Jacek Czerwonka, and Phani Talluri. Test case comparison and clustering using program profiles and static execution. In Proc. ESEC/FSE, pages 293–294, August 2009.
[40] Pavan Vatturi and Weng-Keen Wong. Category detection using hierarchical mean shift. In Proc. KDD, pages 847–856, June/July 2009.
[41] Andrew Walenstein, Mohammad El-Ramly, James R. Cordy, William S. Evans, Kiarash Mahdavi, Markus Pizka, Ganesan Ramalingam, and Jürgen Wolff von Gudenberg. Similarity in programs. In Duplication, Redundancy, and Similarity in Software, Dagstuhl Seminar Proceedings, July 2006.
[42] Xiaoyin Wang, Lu Zhang, Tao Xie, John Anvik, and Jiasu Sun. An approach to detecting duplicate bug reports using natural language and execution information. In Proc. ICSE, pages 461–470, May 2008.
[43] David B. Whalley. Automatic isolation of compiler errors. TOPLAS, 16(5):1648–1659, September 1994.
[44] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. Finding and understanding bugs in C compilers. In Proc. PLDI, pages 283–294, June 2011.
[45] Andreas Zeller and Ralf Hildebrandt. Simplifying and isolating failure-inducing input. IEEE TSE, 28(2):183–200, February 2002.