Fault Prophet: A Fault Injection Tool for Large Scale Computer Systems
by
Tal Tchwella
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
June 2014
© Massachusetts Institute of Technology 2014. All rights reserved.
Author: Signature redacted
Department of Electrical Engineering and Computer Science
May 23, 2014

Certified by: Signature redacted
Prof. Martin C. Rinard
Thesis Supervisor

Certified by: Signature redacted
Stelios Sidiroglou-Douskos, Research Scientist
Thesis Co-Supervisor

Accepted by: Signature redacted
Prof. Albert R. Meyer
Chairman, Masters of Engineering Thesis Committee
Fault Prophet: A Fault Injection Tool for Large Scale
Computer Systems
by
Tal Tchwella
Abstract
In this thesis, I designed and implemented a fault injection tool to study the impact of
soft errors on large scale systems. Fault injection is used as a mechanism to simulate
soft errors, measure the output variability, and provide analysis of the impact of soft
errors on the program. The underlying framework for the tool is based on LLFI,
an LLVM fault injection tool, which I modified to support an end-to-end scenario for
program testing purposes. The modifications and additions provide greater modularity
of the tool by abstracting the input and output of its different components,
support multiple fault scenarios and models, and supply an extensive visualizations
framework. I evaluated the effectiveness of the new tool on a set of benchmark
programs and showcased the impact of soft errors on those programs. The results
demonstrate that while the sensitivity of instructions is program dependent, certain
instruction opcodes, such as binary and memory operations, are generally more
sensitive than others; however, well-placed protection mechanisms can decrease the
sensitivity of those instructions.
Acknowledgments
This work is dedicated to Oded 'Akram' Tchwella and Ida Tzouk.
First and foremost, I would like to thank my parents, Yaron and Michal, and my
sisters, Shir and Ophir, for their love, support and their presence in times of difficulty.
I would like to thank my grandparents for their love and appreciation throughout my
life.
I would also like to thank my family abroad, the Aharoni family, the Youssefmir
and Youssefzadah families, and the Zinger family, as well as the people who provided
me with a home away from home, Joshua Blum, Orit Shamir, Itai Turbahn, Tom and
Anat Peleg, Naor Brown, Jason Black and Edan Krolewicz. I would like to seize the
opportunity to thank my friends in Israel for standing by me these past four years,
showing me what true friendship is all about.
It is with great pleasure that I acknowledge Professor Martin C. Rinard for in-
spiring and motivating me to do brilliant things. It is also with immense gratitude
that I acknowledge the support and help of Michael Carbin, Sasa Misailovic, and Dr.
Stelios Sidiroglou-Douskos throughout this process, for believing in me and providing
me with this opportunity to work with them.
Contents
1 Introduction 15
  1.2 Usage 17
  1.3 Contributions 18
2 Related Work 19
3 Motivating Example 23
  3.1 Motivations 23
  3.3 Walkthrough 27
  3.4 Analysis 28
  3.5 Results 32
4 Design 39
  4.1 System Overview 39
  4.2 Fault Selection Analysis 39
    4.2.1 Static Analysis 40
    4.2.2 Filters 41
  4.3 Profiler 43
    4.3.1 Dynamic Analysis 44
    4.3.2 Database 44
  4.4 Fault Injector 46
    4.4.1 Selecting Faults 46
    4.4.2 Fault Scenarios 47
    4.4.3 Fault Models 49
  4.5 Sensitivity Report Analysis 51
    4.5.1 Output Abstraction and Fault Classification 51
    4.5.2 Summary 52
    4.5.3 Vim Plugins 53
    4.5.4 Visualizations 54
5 Evaluation 57
  5.1 Benchmark Programs 57
    5.1.1 Jacobi 57
  5.2 Performance 58
    5.2.2 Storage 58
    5.4.1 Binary 60
    5.4.3 Memory 64
    5.4.4 All 65
6 Discussion 69
    6.1.1 Instructions Separation 69
    6.1.2 Conditionals 70
  6.2 Design Challenges 71
    6.2.1 Scalability 71
  6.3 Future Work 71
    6.3.1 Code Optimizations 71
    6.3.2 Fault Model Factory for Function Calls and Arrays 72
A Documentation 75
  A.1.1 Installation 75
  A.1.3 Configuration 76
B Figures 85
List of Figures
3-2 Pie chart example. 28
3-3 Bar graph example. 29
3-4 Treemap example showing a function level overview. 30
4-1 System overview diagram. 40
4-2 Code sample for an instruction counter of a function. 42
4-3 Database schema. 45
4-4 The function signature of a fault model function. 50
4-5 The fault factory function. 50
4-6 FltMark vim plugin example. 54
5-1 A sample of GasKinetics.cpp from the Cantera library, showing the different fault outcomes, using the Bit Flip fault model on Binary instructions. 63
5-2 A sample of GasKinetics.cpp from the Cantera library, showing the different fault outcomes, using the Bit Flip fault model on Memory instructions. 66
B-3 The output abstraction wrapper for the analysis of the Jacobi program. 92
B-4 The output abstraction algorithm used for the analysis of the Jacobi program. 104
B-5 The experiments script used to run the tool for the evaluation of both the tool and programs. 106
List of Tables
3.3 Statistical significance analysis for the fault models based on the binary instruction set. 34
5.3 Comparison of fault models based on the Hydrogen ignition case for binary instructions. 60
5.4 Comparison of fault models based on the Methane ignition case for binary instructions. 60
5.5 Comparison of fault models based on the Hydrogen ignition case for
5.6 Comparison of fault models based on the Hydrogen ignition case for memory instructions. 64
5.7 Results for the Random Bit Flips fault model for memory instructions
5.8 Comparison of fault models based on the Hydrogen ignition case for all instructions. 67
Chapter 1
Introduction
Errors may affect the result that a program produces. In particular, soft errors are
a type of error caused by uncontrollable events (cosmic rays hitting memory cells,
voltage droops, missed deadlines, etc. [1]), and as such they are hard to identify, they
occur non-deterministically, and they may silently corrupt computation. Understanding
these types of errors is especially important for systems consisting of hundreds of
thousands or millions of processors [2, 21]. In this thesis I present Fault Prophet,
a system for measuring, analyzing, and visualizing the impact of soft errors on the
behavior and output of a program.
Fault Prophet enables developers to reason about the impact of soft errors on a
program using three components: a fault instrumentor, a sensitivity measurement
tool, and an error impact visualizer.
The Fault Instrumentor component enables Fault Prophet to intelligently select and
inject faults (controlled errors) at various execution points in a program. The component
profiles user-defined instructions, which it then instruments in order to find the
potential execution points for fault injection. Based on the instrumentation output,
a histogram of the program is created to randomly select instructions that are on
the critical path of the program flow. Faults are then injected into those instructions,
and the output is saved to measure the sensitivity of those instructions to the faults.
Fault Prophet builds upon LLFI [24], an LLVM [9] fault injection tool, extending it with additional fault
scenarios and fault models, empowering developers to extensively test the fault
tolerance of a program.
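As a sketch of this selection strategy, histogram-weighted sampling of fault sites can be implemented as follows (the names `build_histogram` and `pick_fault_sites` are illustrative, not Fault Prophet's actual API):

```python
import random
from collections import Counter

def build_histogram(executed_instructions):
    """Count dynamic executions per (source line, opcode) pair."""
    return Counter(executed_instructions)

def pick_fault_sites(histogram, num_faults, rng=random):
    """Select injection sites at random, weighted by execution frequency,
    so that frequently executed instructions are favored."""
    sites = list(histogram.keys())
    weights = [histogram[site] for site in sites]
    return rng.choices(sites, weights=weights, k=num_faults)

# A tiny dynamic profile of (line, opcode) pairs: line 64 dominates,
# so most fault sites should land there.
profile = [(64, "fmul")] * 900 + [(66, "fsub")] * 90 + [(75, "add")] * 10
hist = build_histogram(profile)
sites = pick_fault_sites(hist, num_faults=5, rng=random.Random(0))
```

Weighting by execution count biases injection toward the instructions that dominate the program's runtime, which is what makes the sampled faults representative of the critical path.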
The sensitivity measurement tool measures the variation in output a fault
caused when compared to the baseline execution. The baseline execution is the control
group in the experiment, producing the standard output of the program under the
normal flow of operation. The amount of variation in output as a result of the injected
fault is defined as sensitivity.
There are two main methods to measure the sensitivity of the computation a fault
induces: comparing the program's final output against the baseline output, and comparing execution traces against the baseline trace.
While using both methods would provide a comprehensive level of detail on the
effect of a fault, for this fault injection tool the sensitivity is measured using the
output comparison method. The reason behind this choice is that the analysis and
comparison of traces is resource intensive, both for processing and storage. However,
the tool does provide a way to output traces for later analysis if needed.
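Under the output comparison method, a minimal sensitivity measure for a single numeric output might look like the following (a sketch; the thesis's actual metric is the distortion statistic described in Section 3.4):

```python
def sensitivity(baseline, faulty):
    """Relative deviation of the faulty output from the baseline output.
    Assumes the program's output is a single number, as in the Jacobi case."""
    if baseline == 0:
        return abs(faulty)
    return abs(faulty - baseline) / abs(baseline)

# A fault that moves the residual from 1.0e-5 to 1.1e-5 has sensitivity ~0.1.
```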
Fault Prophet visualizes the impacts of the soft-error simulations, enabling developers
to quickly and easily deduce the critical sections in a program. Critical sections
are portions of the program that have an impact on its integrity and reliability;
if modified, they would cause a deviation in the expected output. Fault Prophet
classifies a fault into one of the fault categories based on the sensitivity of the
instruction the fault was injected into. The developers can classify the faults according
to their own output abstraction methodology. Fault Prophet aggregates the results
of the various injected faults for each source line, function, and the entire program,
which it then displays using different visualizations. Fault Prophet maps the outcomes
of the fault injections to the source code, enabling the developer to view the
impacts in a popular source code editor, as well as inject faults directly from the editor.
1.2 Usage
The idea of a fault injection tool is motivated by the observation that not all parts of
a program are created equal: some parts are more critical to a program's reliability
and correctness than others. Protecting these critical sections is essential, as many
programs are resilient to errors: after experiencing soft errors they may continue
executing and producing acceptable exact or approximate results, often without any
guidance from recovery mechanisms. The existing recovery mechanisms for soft
errors (Checkpoint/Restart [3], triple module redundancy [12, 14], and instruction
level replication [7, 18, 16]) are often not used, due to the large energy, performance,
and cost overheads they add to the systems [3].
Fault Prophet would enable a new class of recovery and protection mechanisms
that are selective in nature, targeting the critical sections of the program instead of
protecting the entire program. The selective protection would reduce the overhead
costs associated with existing recovery mechanisms, thereby making them more feasible to deploy.
1.3 Contributions
Fault Prophet makes several contributions:
These contributions extend the capabilities of existing fault injection tools by helping
developers and testers alike to identify critical sections in a program under various
circumstances, via an interactive visualizations framework.
Chapter 2
Related Work
Multiple fault injection tools have been developed over the years, most with an emphasis
on software faults, while others simulated hardware faults as well. The earliest fault
injection tools, such as FERRARI [8], a flexible software-based fault and error
injection tool, were designed to emulate both transient errors and permanent faults in
software.
As system developers and researchers alike recognized the potential fault injection
tools have to validate the fault tolerance of a system, their features were extended
beyond the realm of fault tolerance to performance testing as well. FTAPE, the Fault
Tolerance And Performance Evaluator, is a tool that measures the performance of
a system while it injects faults. In that respect, BIFIT is limited to a specific set of fault scenarios, and to
testing and analyzing those specific kinds of faults.
While hardware-based fault injection has been around for a while [4], simulation of
hardware faults had not advanced as much until recently. Relyzer [5] was one of the
first tools in recent years to put an emphasis on hardware faults. Relyzer deviates
from the standard software fault detection tools, which detect software corruptions,
to the detection and monitoring of silent data corruption (SDC), enabling corruption
detection at a lower level.
LLFI [24], an LLVM fault injection tool, is a recent fault injection tool that injects
faults at the level of the LLVM intermediate representation.
LLFI can inject faults into instruction buffers as well as operands, while picking
the fault injection method via a fault factory. However, LLFI does not provide a
choice of fault scenarios, and injects faults randomly. If LLFI had detailed
knowledge of the frequencies of instructions, it could intelligently inject faults into random
execution points of prioritized instructions. Moreover, LLFI does not provide
a visualizations framework to analyze the outcomes of the fault injection pass; instead
it compares the fault output to a baseline output, and as such uses an internal
output analysis mechanism that might not be suitable for all applications.
Another recent tool is ASAC [20], Automatic Sensitivity Analysis for Approximate
Computing, which analyzes the fault tolerance of approximation-based programs.
Approximation-based programs compute an approximation to a precise value when the
deviation from the precise value is not critical, thereby allowing the system to save
energy by reducing the number of computations. The tool instruments these types of
programs to analyze which variables must produce exact calculations by employing a
technique using hyperboxes to test the range of allowable values a variable can store,
after which it measures the deviation from the baseline output.
Another recent tool utilizes both LLVM [9] and Pin [11]: LLVM is used as a compiler to embed approximations,
and Pin is used as a hardware simulator that handles the approximations.
Using LLVM and Pin, the tool tests application-level error tolerance, while providing
knowledge about the native support hardware could provide for approximate
computations. While the focus of the tool is specifically approximate computing,
some of the fault models written for it were found to be useful in understanding
the sensitivity of instructions to faults, and are used by Fault Prophet.
None of the tools above provides an extensive fault injection platform that supports
a better understanding of the impact faults have on the flow of programs. Combining
the benefits of existing fault injection tools would therefore provide a comprehensive
fault injection tool, one designed for the next generation of large scale systems.
Chapter 3
Motivating Example
Given that not all parts of a program are created equal, there is a need to identify
which parts of a program are more fault-tolerant than others, by automatically identifying
the critical parts of a program. The motivating example for this tool is a freely
distributed Jacobi kernel used in various scientific computations. Fault Prophet will thoroughly test
Jacobi to ensure its reliability and correctness even in situations of faulty operations.
3.1 Motivations
To analyze the effectiveness of the tool, and to understand its impact, the tool was
used to answer the following questions:
- Can Fault Prophet identify the critical sections of a program? What is their
sensitivity to faults?
- What computation patterns are critical, and what are the approximate critical
sections within a program? Across programs?
- How do different fault scenarios and models help identify the critical sections
of a program?
Those questions are motivated by the hypothesis that a program has critical sections
that are exposed to Single Event Upsets (SEUs), which can cause Silent Data
Corruptions (SDCs). Three main areas were explored as part of the
motivation to answer these questions: instruction set sensitivity to faults, fault model
analysis, and performance.
The Jacobi program¹ is a freely distributed program which solves a finite difference
discretization of the Helmholtz equation, ∂²u/∂x² + ∂²u/∂y² − αu = f, using the Jacobi iterative
method.
The program has four functions in addition to the main function: driver, initialize,
jacobi, and error_check. The main function calls the
driver function to run the program. The driver function calls the initialize
function, which allocates and initializes the two arrays used by the program to their
initial values: one for the solution values (u), and the other for the right-hand side values (f). The driver then
calls jacobi, shown in Figure 3-1, to solve the Helmholtz equation, after which it
calls the error_check function to measure the distortion of the value computed by
the program, checking the absolute relative error. Since error_check is comparing
the values of the relative error, in our evaluation we assume that it is error free,
i.e. running on reliable hardware, to ensure the check itself is reliable. We evaluate
the fault sensitivity of the other three functions.
void jacobi(void);
'The program was originally written by Joseph Robicheaux, and later modified by Sanjiv Shah
in 1998 [19] until translated to its C version by Chunhua Liao in 2005.
/*****************************************************************
 * Subroutine HelmholtzJ
 *****************************************************************/

void jacobi()
{
    double omega;
    int i, j, k;
    double error, resid, ax, ay, b;
    // double error_local;

    // float ta,tb,tc,td,te,ta1,ta2,tb1,tb2,tc1,tc2,td1,td2;
    // float te1,te2;
    // float second;

    omega = relax;
    /*
     * Initialize coefficients */
    ax = 1.0/(dx*dx); /* X-direction coef */
    k = 1;
    while ((k <= mits) && (error > tol))
    {
        error = 0.0;
        {
#pragma omp for private(j,i)
            for (i = 0; i < n; i++)
                for (j = 0; j < m; j++)
                {
                    resid = (ax*(uold[i-1][j] + uold[i+1][j])
                           + ay*(uold[i][j-1] + uold[i][j+1])
                           + b * uold[i][j] - f[i][j])/b;

                    u[i][j] = uold[i][j] - omega * resid;
                    error = error + resid*resid;
                }
        }
        /* Error check */

        k = k + 1;
        if (k % 500 == 0)
            printf("Finished %d iteration.\n", k);
        error = sqrt(error)/(n*m);

    } /* End iteration loop */

    printf("Residual: %E\n", error);
}
3.3 Walkthrough
The tool was used on the Jacobi program to analyze the program's sensitivity to
faults. The tool injected 100,000 faults of each fault model described in
Section 4.4.3 into each of the binary, floating point, and memory instruction sets. To
run the tool on the Jacobi program, the user must follow these steps:
2. Copy the contents of the fault-inject/scripts folder to the jacobi program folder.
3. Modify the settings file (settings.cfg) as outlined in Section A.1.3 to suit the
needs of the intended testing. A sample configuration for the jacobi program is
shown in Figure B-2.
4. Run the tool by typing 'python script.py settings.cfg' into the terminal.
4.1. The tool will filter all candidate instructions for fault injection.
4.2. The tool will instrument the program to provide dynamic analysis of the
program.
4.3. The tool will inject random faults according to the histogram of the program
execution, by line number and instruction opcode.
4.4. The tool will analyze the results and will display them to the user.
Once it has finished running, the tool produces the outputs described in Section 4.5.
As shown in Figures 3-2, 3-3, 3-4, the tool aggregates all the fault data, and based
on the given output abstraction (see Section 4.5.1), gives a visual representation of
the critical sections the tool identified.
3.4 Analysis
from the exact solution. The output abstraction class illustrated in Figure B-3 is
used by the tool to analyze the output of the Jacobi program. The class uses a
script, 'distortion.py', shown in Figure B-4, that calculates the distortion of the baseline
output from the fault-injected output, assuming both outputs are numbers, producing
a range of statistics: minimum, mean, median, max, standard deviation, range, and
zero elements. The output abstraction class specifically uses the mean to analyze the
impact of the fault on the output of the program, given certain boundaries.
The mean value the distortion algorithm outputs lies between 0 and infinity, and as
such the fault classifications were defined as follows, for a given output oi for
all i in the set of faults:
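A sketch of this pipeline is shown below. The statistics mirror those listed above, but the 1%/10% classification thresholds are assumptions chosen for illustration, not the thesis's actual boundaries:

```python
import statistics

def distortion_stats(baseline, faulty):
    """Per-element relative distortion statistics, in the spirit of the
    thesis's distortion.py (names and exact formulas here are assumptions)."""
    dists = [abs(f - b) / abs(b) if b != 0 else abs(f)
             for b, f in zip(baseline, faulty)]
    return {
        "min": min(dists), "mean": statistics.mean(dists),
        "median": statistics.median(dists), "max": max(dists),
        "stdev": statistics.pstdev(dists), "range": max(dists) - min(dists),
        "zero_elements": sum(1 for d in dists if d == 0),
    }

def classify(mean_distortion, moderate=0.01, critical=0.1):
    """Map the mean distortion to a fault class. The 1% / 10% boundaries
    are illustrative placeholders for the elided thresholds."""
    if mean_distortion < moderate:
        return "Acceptable"
    if mean_distortion < critical:
        return "Moderate"
    return "Critical"
```

Using only the mean keeps classification cheap, at the cost of hiding localized corruption that the max or zero-element counts would expose.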
Figure 3-4: Treemap example showing a function level overview.
The treemap aggregates the fault classifications for a function, where the color of each rectangle varies
from green to red, correlating to the scale of 'Acceptable' to 'Critical' classification
of faults, based on a score computed from the number of 'Acceptable,' 'Moderate,' and
'Critical' faults the function has.
The experimental setup is an EC2 instance hosted on the Amazon AWS service in region
US East (Virginia). The operating system installed on the virtual machine is Ubuntu
12.04 Precise LTS_20120612, from Canonical Group Limited. The specification of the
machine is reported in Table 3.1.
Resource             Allocation
Memory               3.75 GiB
CPU                  2 EC2 Compute Units (1 virtual core with 2 EC2 Compute Units)
Storage              1 x 410 GB
Platform             64-bit
Network performance  Moderate
API Name             m1.medium
AMI                  ami-967edcff

Table 3.1: Specification of the experimental machine.
To ensure that the experiments generated by the tool on the program are consistent
and reproducible, an experiments script was created. The experiments script runs
the tool with different parameters, pertaining to different settings such as profiled
instruction sets, fault models, quota, and line coverage tests. Figure B-5 presents
the script that was used for this program and the other benchmark programs explored in the evaluation.
3.5 Results
Figures 3-2, 3-3, and 3-4 were only a part of the results of the case study. The analysis of
the faults injected into the Jacobi program, based on the single fault scenario described
in Section 4.4.2, demonstrated that the fault outcomes were dependent on both the fault
model and the instruction set. Therefore, there is reason to believe that sections of the
program that contain specific types of instructions will be more critical than others.
While the error_check function had faults injected into it, those are not analyzed or
displayed as part of the report, as it is assumed that function runs on reliable hardware;
however, these results are available.
Binary Instructions
The binary instruction set encapsulates all instructions that deal with arithmetic
and logic operations. The results of the 100,000 faults that were injected into the binary
instruction set are shown in Table 3.2. In the table, the columns correspond to the
names of the fault models described in Section 4.4.3, and each column contains
the number of faults classified as 'Acceptable,' 'Moderate,' or 'Critical' as explained
in Section 3.4.2.
Fault Type           Random Bit Flips  Sticking Bits  XOR     Bit Flip  Byte Flip  Buffer Flip
Acceptable           82,963            88,303         69,798  67,835    62,564     18,001
Moderate             8,476             5,418          11,171  17,320    18,634     31,071
Critical             8,564             6,282          19,034  14,848    18,805     50,931
Segmentation Faults  7,427             5,433          16,501  10,812    13,219     20,005
Time Outs            0                 0              0       0         0          0

Table 3.2: Comparison of 100,000 generated faults by fault model for the binary
instruction set of the Jacobi program.
The data in Table 3.2 shows that there is a lot of variation in the outcome of a
fault injected into the binary instruction set, depending on the chosen fault model.
The number of 'Moderate' and 'Critical' faults combined varies from 17% to 82%,
which is high when compared to the results of faults that were injected into floating-point
operations and memory operations. This result can be attributed to the fact that
binary operations do most of the computations in a program, some of which compute
memory addresses. In addition, the bottom of Table 3.2 explores the reasons for the
'Critical' faults. As observed, a large portion of the 'Critical' faults were the result
of segmentation faults caused by corrupted memory addresses.
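As a sanity check, the endpoints of that range can be recomputed directly from the counts in Table 3.2:

```python
# ('Moderate', 'Critical') counts per fault model, from Table 3.2
# (binary instruction set, 100,000 injected faults per model).
table_3_2 = {
    "Random Bit Flips": (8476, 8564),
    "Sticking Bits":    (5418, 6282),
    "XOR":              (11171, 19034),
    "Bit Flip":         (17320, 14848),
    "Byte Flip":        (18634, 18805),
    "Buffer Flip":      (31071, 50931),
}

def impact_rate(moderate, critical, total=100_000):
    """Fraction of injected faults classified as 'Moderate' or 'Critical'."""
    return (moderate + critical) / total

rates = {model: impact_rate(*counts) for model, counts in table_3_2.items()}
# Random Bit Flips comes out at ~17% and Buffer Flip at ~82%,
# matching the endpoints quoted in the text.
```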
Results Analysis. Examining the distribution of the faults per source line, the
critical lines in the Jacobi program reside in the jacobi function, specifically the lines
pertaining to the calculation of a value in the solution. For those lines, the 'Critical'
faults were more than 50% of the total faults injected into each line.
for (i = 1; i < (n-1); i++)
    for (j = 1; j < (m-1); j++)
    {
        double u1 = uold[i-1][j];
        double u2 = uold[i+1][j];
        double u3 = uold[i][j-1];
        double u4 = uold[i][j+1];
The loaded values are then weighted, to create the new value in the solution for the
next iteration.
If the calculation of which value to load from memory (i.e., the array index) is
faulty, the error will be carried over the next iterations, skewing the calculation, or
even worse, the program may access a non-existent address.
In addition, the initialize function has multiple lines with over 10% 'Critical'
faults, pertaining to the calculation of values that are used later to calculate the solution
in each iteration. Again, the same reasoning applies: if there is a
deviation in the initialized values, it will be compounded with each iteration, causing
a skew in the end result.
yy = (int)(-1.0 + dy * (j-1));
...
f[i][j] = -1.0*alpha*(1.0-xx*xx)*(1.0-yy*yy)
        - 2.0*(1.0-xx*xx) - 2.0*(1.0-yy*yy);
Statistical Significance. Table 3.3 presents the results of the t-test calculations
for the average error of each pair of fault models, under the assumption that the
sample of fault injection outcomes is independent for each fault model. A 'False'
value in the table means that the p-value > 0.05, while 'True' means that the p-value < 0.05.
As observed from the table, the fault models have significantly different averages, as
represented by the 'True' values, except for the case of the Bit Flip model and the
Byte Flip model. The test was conducted using the ttest_ind function in the
stats package of SciPy, an open-source Python extension for mathematics, science,
and engineering.
Fault Model       Random Bit Flips  Sticking Bits  XOR   Bit Flip  Byte Flip  Buffer Flip
Random Bit Flips  -                 True           True  True      True       True
Sticking Bits                       -              True  True      True       True
XOR                                                -     True      True      True
Bit Flip                                                 -         False      True
Byte Flip                                                          -          True
Buffer Flip                                                                   -

Table 3.3: Statistical significance analysis of the fault models based on 100,000 generated
faults for the binary instruction set of the Jacobi program. True means the
significance level was lower than 0.05, and therefore there is statistical
significance, whereas False means it was higher than 0.05.
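The thesis ran this test via SciPy; as a self-contained sketch of what `scipy.stats.ttest_ind` computes by default, the pooled-variance two-sample t statistic is:

```python
import math
import statistics

def ttest_ind_statistic(a, b):
    """Two-sample t statistic with pooled variance, i.e. the default
    computation of scipy.stats.ttest_ind (the p-value lookup against the
    t distribution is omitted here)."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1/n1 + 1/n2))
    return t, n1 + n2 - 2  # t statistic and degrees of freedom

# Identical samples give t = 0; the larger |t| is relative to the degrees
# of freedom, the smaller the resulting p-value.
```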
Floating-Point Instructions
The floating-point instruction set encapsulates all instructions that deal with floating-point
arithmetic and logic operations. As such, it is reasonable to believe that faults
injected into those operations would cause deviations in output, but would not cause the
program to crash, as floating point operations do not control the flow of
a program. This hypothesis was verified by the results shown in Table 3.4: some
fault models caused no 'Critical' faults at all, and the deviation in the output was
only 'Moderate', ranging from 0%-4.3% for those fault models. The fault models
that did cause 'Critical' faults (Bit Flip, Byte Flip, and Buffer Flip) caused fewer 'Moderate'
and 'Critical' errors than in the binary instruction set. Moreover, none of the faults
that were classified as 'Critical' were the result of a segmentation fault, i.e. a program crash.
Fault Type  Random Bit Flips  Sticking Bits  XOR     Bit Flip  Byte Flip  Buffer Flip
Acceptable  95,734            99,999         97,277  85,324    81,087     22,579
Moderate    4,265             0              2,722   11,605    13,427     35,544
Critical    0                 0              0       3,070     5,485      41,876

Table 3.4: Comparison of 100,000 generated faults by fault model for the floating-point
instruction set of the Jacobi program.
Results Analysis. Mapping the results of the faults to the source lines, there were
two main lines that are 'Moderate' and somewhat 'Critical' in the program across all
fault models. Those lines are a subset of the lines that were error prone in the binary
instruction set, which is an expected result, as floating point operations are a subset
of the binary instruction set. In the initialize function, one such line initializes an array:
- 2.0*(1.0-xx*xx) - 2.0*(1.0-yy*yy);
In the jacobi function, the other is the line that aggregates the results of the different coefficients.
Memory Instructions
The memory instruction set used by the Jacobi program has only two instructions:
load and call. As such, it is expected to see variation in the output from modifying the
load instruction, as well as segmentation faults to some extent if the call instruction is
modified to call an illegal address. Table 3.5 aggregates the results of 100,000 faults
injected into memory operations in the Jacobi program, which show that all fault
models caused some variation in the output, unlike the floating point operations, where
one fault model caused no change. Moreover, the fault models that caused no
'Critical' faults for the floating point operations caused some 'Critical' faults here, though
the faults caused smaller deviations from the baseline output. Another important finding
is that of all the faults classified as 'Critical', except for at most one occurrence
per fault model, none were the result of a segmentation fault.
Fault Type  Random Bit Flips  Sticking Bits  XOR     Bit Flip  Byte Flip  Buffer Flip
Acceptable  97,480            99,901         99,587  87,249    81,626     25,479
Moderate    2,498             77             392     12,082    13,990     39,425
Critical    2                 2              1       649       4,364      35,076

Table 3.5: Comparison of 100,000 generated faults by fault model for the memory
instruction set of the Jacobi program.
Results Analysis. Inspecting the lines that produced deviation from the expected
output (pertaining to faults classified as 'Moderate' or 'Critical'), it becomes
clear those faults were injected into the lines that are on the critical path of calculating
the solution. Those lines are mostly found in the jacobi function, which loads
calculated values that are used in the next iterations of the function:

...
double u1 = uold[i-1][j];
double u2 = uold[i+1][j];
double u5 = uold[i][j];
double u6 = f[i][j];
...
error = sqrt(error)/(n*m);
Since these values are used across the different iterations, their impact is com-
pounded, and therefore they produce a much larger deviation from the expected
output than a variable that is used only a handful of times in the program.
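The compounding effect can be illustrated with a small, hypothetical sketch (not the Jacobi benchmark itself): a value that is reloaded on every iteration multiplies the deviation introduced by a single fault, while a value used once contributes its deviation only once.

```python
def accumulate(x, iterations):
    # Sum a value that is reloaded on every iteration, the way the
    # Jacobi kernel reloads uold[i][j]; any deviation in x compounds.
    total = 0.0
    for _ in range(iterations):
        total += x
    return total

fault = 1e-3                          # deviation injected into one loaded value
baseline = accumulate(1.0, 1000)
faulty = accumulate(1.0 + fault, 1000)

deviation = faulty - baseline         # ~1.0: the fault is amplified 1000x,
                                      # versus 1e-3 for a value used once
```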
All Instructions
The instructions that run in the Jacobi program and were profiled include: bitcast,
load, sub, sitofp, fdiv, icmp, fmul, fsub, fptosi, sext, getelementptr, mul, fadd, add,
call, alloca, fcmp, srem. Of these instructions, getelementptr, the instruction that ex-
plicitly performs address computations but does not access memory, and sext, the sign
extension instruction, together account for about 50% of all instruction executions.
The next instruction is load, which accounts for roughly 14% of instruction execu-
tions. This distribution helps to explain the results shown in Table 3.6, in which more
than 50% of faults, on average, were classified as 'Moderate' or 'Critical', as address
calculations and sign extension instructions were the most prominent instructions.
This is also clear from the bottom of Table 3.6, which shows that most 'Critical'
faults were the result of a segmentation fault, supporting the conclusion that they
stem from injecting faults into critical instructions and operations.
Fault Type           Random Bit Flips  Sticking Bits  XOR     Bit Flip  Byte Flip  Buffer Flip
Acceptable           52,552            65,894         35,915  45,410    36,350     9,589
Moderate             6,922             9,424          6,195   14,364    14,717     17,891
Critical             40,567            24,723         57,931  40,267    48,974     72,561
Segmentation Faults  38,796            18,611         56,914  38,149    45,009     58,027
Time Outs            0                 0              0       0         0          _

Table 3.6: Comparison of 100,000 generated faults by fault model for all instruc-
tions of the Jacobi program.
Chapter 4
Design
Fault Prophet allows a user to analyze the impact soft-errors have on specific in-
structions within a program. For example, a user might want to examine the impact
of soft-errors on floating point operations, while another user might want to under-
stand the impact on memory operations.

Figure 4-1: The system overview diagram, showing the dependencies of each
component.

To support this functionality, the Fault
Selection Analysis module performs a static analysis of the program, to capture the
instructions the user is interested in. After aggregating all the instructions, the Fault
Selection Analysis module filters out unwanted opcodes from an instruction set, as
well as unwanted source lines from specific source files.
The static analysis performed by the Fault Selection Analysis module is done using LLVM
passes on a program. An LLVM pass makes it possible to programmatically go through every
instruction in LLVM bitcode (the middle layer between source code and machine code)
and analyze it as needed. The pass that was written for the tool profiles instructions if
they belong to a specific instruction set: floating point operations, binary operations
(including floating point), memory operations, use-define chains, function calls, arrays
and all instructions, saving the output of each instruction set that the user selected
to be instrumented to a separate file. At least one instruction set has to be selected
for the tool to work, which will be defined as the white-listed instruction set(s); faults
will be injected only to the white-listed instruction sets. The output of the static
The function name, opcode, line number and file name are data points about the
instruction that are provided either by the instruction itself, or by the debug infor-
mation of the instruction. The rest of the data points are a bit more complicated to
gather. Since the LLVM bitcode does not change unless the original source code is
modified and compiled again, the count of every instruction traversed in a function
is used to identify an instruction in the LLVM bitcode across the other modules of the
tool. Because the instructions are traversed in a deterministic order set forth by the
compiler implementation, this algorithm might be dropped in future iterations of the
tool, given that the identifiers of the instructions will be cached during compilation.
Figure 4-2 presents an example of code that provides the instruction count of an
instruction in a function. Likewise, the is-an-array parameter is set to 1 if the
instruction has an array in its operands, and 0 otherwise; to determine whether an
instruction has an array in its operands, the meta-data of the instruction must be
traversed completely, to find a specific tag LLVM generates that is associated with
arrays. Those two parameters allow much greater control over the fault injection
capabilities.
4.2.2 Filters
Once the instructions of interest have been selected, filters can provide a finer gran-
ularity to select the instructions of choice. There are three filtering options that are
unsigned int getProfileInstr(Instruction *I) {

    Function* F = I->getParent()->getParent();
    string funcname = F->getNameStr();

    unsigned int funccounter = 0;

    for (inst_iterator In = inst_begin(F), E = inst_end(F); In != E; ++In) {
        Instruction *Instr = dyn_cast<Instruction>(&*In);
        funccounter++;

        if (I == Instr)
            break;
    }
    return funccounter;
}
employed by the tool: blacklisting, opcode filter, and source line filter. The instruc-
tions that remain after running the filters are the only instructions that will be
potential candidates for fault injection by the tool.
A user can choose whether they would like to blacklist an instruction set or white-list
it, in the same manner that is described in Section 4.2.1. The difference between
blacklisting and white-listing is that blacklisting an instruction set excludes the
instructions that fall into that set from the fault injection process, while
white-listing an instruction set includes those instructions in the fault injection
process. Whenever there is a conflict, blacklisting has priority. The blacklisting
feature can come in handy whenever specific instructions overlap in at least two
instruction sets.
Example 1 A user wants to include only binary operations for integers and not
floats. In such a case, binary operations will be white-listed, while floating point
operations will be blacklisted.
Example 2 A user wants to include binary operations, such as add but leave out
any instructions that involve use-define chains, such as getelementptr or sext, which
include operations to calculate pointer addresses. In such a case, binary operations
would be white-listed, while use-define chains will be blacklisted.
The user has the option to filter specific opcodes and source lines from specific source
files.
Opcode Filter The user has the option to exclude specific opcodes on top of the
instruction sets that are white-listed. This filter might come in handy if the user
wants to inject faults into an instruction set but leave out specific opcodes. For
example, a user might want to profile binary operations but leave out specific binary
operations such as the srem or xor instructions, which calculate the remainder of
integer division and perform an xor of two operands, respectively.
Source Line Filter The user might want to inject faults only into specific lines of
code, by either including specific lines and excluding everything else, or excluding
specific lines and including everything else. While only one line filter is strictly
needed, as each is the negation of the other, in some scenarios one is easier to use
than the other, and therefore both options are provided.
4.3 Profiler
The Profiler module has two responsibilities: providing a baseline output to be used
for comparison analysis, and profiling the program to understand the instructions
frequencies and flow of the program. The Profiler depends on the output of the Fault
Selection Analysis module, to decide which instructions to instrument, and saves the
profiling data only for those instructions.
4.3.1 Dynamic Analysis
The Profiler augments the original program in order to trace the execution of the
program, and to keep a running count of the number of instructions that are executed.
The running count allows the program to map an execution of an instruction, an
execution point, to a unique number. The count helps determine how many times
each instruction has been executed, as the same instruction could be executed multiple
times, for example, if it is in a function or a loop. This data enables the Fault Injector
module to inject a fault to a specific execution point in the program.
The tool augments the program by adding a function call before each instruction.
If the instruction is an instruction of interest (as provided by the Fault Selection
Analysis module), the called function saves the instruction execution data to a log file.
If the instruction is not an instruction of interest, then the called function only
increases the running count of instruction executions in the program.
4.3.2 Database
The tool uses a SQLite3 database to store all the dynamic analysis data, in order to
deal with programs that generate large amounts of data during the dynamic analysis
process. The tool chunks the gathered data into many log files, each containing 100,000
profiled instructions, keeping the log files small. The dynamic analysis and log file
processing occur simultaneously, allowing the tool to store the data in the database
while the program is instrumented, parallelizing the process. The tool employs the
watchdog module for Python [13] to monitor the directory to which all the log files are
saved, and processes a file after all the instructions have been written to it. After a
log file has been processed it is deleted to free up space. The database layout is shown
in Figure 4-3.
[Figure 4-3: The database layout, showing tables for files, opcodes, function names,
line numbers, and instructions, linked through id columns.]
In addition, while the tool processes the log files, it saves a histogram of the instruction
frequencies by source file, line number and opcode. This data is later used to pick
the faults according to the instruction distribution presented by the histogram.
Debugging Information
If needed, the debugging information can be outputted during the Profiler stage. The
debugging information maps an instruction in LLVM bitcode to a source line number
and source file. This data is later used by the Vim plugin, FltMark, in order to show
the number of instructions in each source line, and the type of instructions, making
it critical to output this data if the user wants to view the data in the plugin.
Dynamic Trace
The user can output the trace of the original program: saving the entire execution
path of the program. However, it is important to realize that the overhead is high:
the number of executed instructions in the program is at least doubled, and each of
those added instructions is an I/O operation used to record the data of each
instruction in the program, meaning the run-time of the program is about 100x-
1000x the regular run-time. The dynamic trace of the original program can later
be compared to traces of programs that had faults injected to them, in order to
understand how the fault modified the flow of the program.
4.4 Fault Injector
The Fault Injector module is the core of the tool, as it is the part that injects the
actual faults into the program. The module is responsible for selecting the instructions
that will have faults injected into them, given the dynamic analysis output of the
Profiler module, and for injecting the faults into those instructions. There are several
fault models the user can choose from, or they can write their own.
There are two ways faults are picked by the module: either by a fixed quota set by
the user, or a quota set by a line coverage percentage.
Fixed Quota
In the fixed quota model, the tool selects the number of faults based on the histogram
of the executed instructions in the program. The histogram provides the weight of each
source line in the overall flow of the program, W_source line=i, based on the number of
instruction executions mapped to that source line i over the total number of profiled
instruction executions. The tool then calculates the weight of each opcode in a source
line, R_opcode=o, as the fraction of the executions in source line i that have opcode o.
Then f_opcode=o in source line=i = W_source line=i x R_opcode=o is the fraction of the
overall faults that will be allocated to an opcode in a given source line.
The tool samples Q x f_opcode=o in source line=i random execution points[1] that have the
opcode o and are in source line i.
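A minimal sketch of this allocation, using a hypothetical histogram (the variable names are illustrative, not the tool's actual data structures):

```python
# Hypothetical per-(source line, opcode) execution counts, mirroring the
# histogram the Profiler stores.
histogram = {
    (10, "load"): 600,
    (10, "fadd"): 200,
    (25, "fmul"): 200,
}
Q = 100                                    # fixed fault quota set by the user
total = sum(histogram.values())

# Each (line, opcode) pair receives a share of the quota proportional to
# its execution count: Q * W_line * R_opcode = Q * count / total.
quota = {key: round(Q * count / total) for key, count in histogram.items()}
# quota == {(10, 'load'): 60, (10, 'fadd'): 20, (25, 'fmul'): 20}
```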
Line Coverage
In the line coverage model, the user specifies the source line coverage percentage, p,
the percentage of lines of code that must have a fault injected into them before halting.
The tool retrieves the total number of lines that have instructions of interest,
count_lines, and calculates N = p x count_lines, the number of lines that must
[1] Q is the fixed quota of faults set by the user.
have faults injected into them. The tool randomly selects N source lines, and randomly
selects C, a configurable number of execution points for each of those source lines that
will have faults injected into them, for a total of C x N injected faults. The
default value of C is 3.
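A sketch of the line-coverage computation under these definitions (the candidate line identifiers here are made up):

```python
import random

def pick_line_coverage_faults(candidate_lines, p, C=3, rng=None):
    # N = p * (count of lines with instructions of interest); C execution
    # points (default 3 in the tool) are then chosen per selected line.
    rng = rng or random.Random(0)
    N = int(p * len(candidate_lines))
    chosen = rng.sample(candidate_lines, N)
    return [(line, k) for line in chosen for k in range(C)]

faults = pick_line_coverage_faults(list(range(1, 41)), p=0.25)
# 0.25 coverage * 40 lines * 3 execution points per line = 30 faults
```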
Errors are unexpected by nature: their frequencies, their consequences, and their ori-
gins. Errors can also occur in different locations, whether within the same buffer or
within consecutive instructions, and they may be independent or correlated. To
support this myriad of fault scenarios, the tool allows a user to use four different fault
injection scenarios: single faults, bursts of faults, function call faults and array faults.
The tool allows the fault injected program to run for a specified amount of time;
if this time is exceeded, the program is terminated and the fault is deemed one that
caused a timeout.
Single Fault
Under the single fault scenario, the tool injects a single fault to a specific execution
of an instruction. Before the fault is injected, the tool will verify that the given
execution point matches the opcode, as depicted by the Profiler, and will inject a
fault only if that condition is true. The tool also allows the user to inject repeated
faults into the same execution point as many times as they specify.
With this fault scenario, the user can specify a fault model they would like to use to
inject a fault to the buffer of the chosen execution point, as described in Section 4.4.3.
Regardless of the fault model that is specified, the tool saves the fault information
for a later analysis if needed, including the original instruction buffer as well as the
new one, along with the execution point identifier of the injected fault.
Burst Mode
The burst mode scenario allows the injection of multiple faults at a time. It em-
ploys the same basis as the single fault model described in Single Fault under
Section 4.4.2, injecting a fault at a specific execution of an instruction, after which it
continues to inject faults at a specified interval of instructions until a quota of faults
has been injected. For example, the tool may inject a fault every 2 consecutive in-
structions for a total of 5 faults, after instruction I_i, where i is the initial execution
point that had a fault injected into it. Under this configuration, the tool will inject
faults into I_i+2, I_i+4, I_i+6, I_i+8, I_i+10, regardless of whether they were profiled
or not. If a fault causes a segmentation fault, no further faults will be injected.
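The execution points targeted by a burst can be computed as in this sketch, reproducing the example above:

```python
def burst_points(i, interval, quota):
    # After the initial execution point I_i, inject a fault every
    # `interval` instructions until `quota` faults have been injected.
    return [i + interval * k for k in range(1, quota + 1)]

points = burst_points(100, interval=2, quota=5)
# with i = 100: [102, 104, 106, 108, 110], i.e. I_{i+2} ... I_{i+10}
```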
This fault scenario allows the simulation of a burst of soft-errors that might occur from
cosmic rays. Since the basis is the same as the single fault model, it uses the same
fault models as described in Section 4.4.3. However, unlike the single fault model,
the analysis of these faults performed by the Sensitivity Report Analysis module
described in Section 4.5 is not deterministic, as the effects on the output can arise
from any of the faults injected. As such, the analysis of this fault scenario requires
the comparison of the faulty trace to the baseline trace, which is not implemented
due to reasons mentioned in Section 1.1.2.
Function Call Faults

Under this scenario, which was inspired by Rely and Chisel, a function call instruction
is modified to call a new function. This new function will replace the call instruction
buffer, with a user defined probability p, with a randomly generated value, and with
probability 1 - p the original function will be called. To utilize this functionality, the
user must specify a certain flag in the tool configuration.
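A sketch of the probabilistic replacement (the wrapper name and the 32-bit value width are illustrative assumptions, not the tool's actual implementation):

```python
import random

def faulty_call(original_fn, p, rng):
    # With probability p, return a randomly generated value in place of
    # the call's result; otherwise call the original function.
    if rng.random() < p:
        return rng.getrandbits(32)
    return original_fn()

rng = random.Random(42)
results = [faulty_call(lambda: 7, p=0.3, rng=rng) for _ in range(1000)]
corrupted = sum(r != 7 for r in results)   # roughly 300 of the 1000 calls
```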
Array Faults
Much like the function call faults described in Function Call Faults under Sec-
tion 4.4.2, faults can be injected specifically into instructions that use arrays, by mod-
ifying the instruction that uses an array into a call instruction. The call instruction
calls a function that, with a user defined probability p, will replace the array pointer
of the original instruction with a new random value, and with probability 1 - p will
keep the original array pointer. Since pointers cannot be differentiated at run-time,
any pointer of an instruction that has been marked for fault injection, i.e., has an
array pointer, will be modified with probability p. If the user would like to inject
only a single fault, rather than inject faults with some probability, the user can use
the single fault scenario. To utilize this functionality, the array instruction set must
be profiled by the Fault Selection Analysis module along with the appropriate flag in
the tool configuration.
4.4.3 Fault Models

The tool supports a robust and dynamic configuration of fault models that can be
utilized across the different fault scenarios. As such, fault models can be added, removed,
modified, and used quite easily. To achieve these characteristics, the tool uses
a function factory, which selects the user specified fault function. To add a new
fault model, the user needs to write a function that adheres to the function signature
shown in Figure 4-4. The function signature provides the user with the original buffer,
which they will need to modify, the size of the buffer, which is 4 or 8 bytes for 32-bit or
64-bit values respectively, and the opcode of the instruction.
To add the fault function,
the user will need to add it to the factory, as shown in Figure 4-5. There are six
fault models by default: random bit flips, sticking random bits, XOR mantissa bits,
random bit flip, random byte flip, and buffer flip. Random bit flips and XOR
mantissa bits were borrowed from Sara Achour, based on fault models created by the
Accelerator Architecture Lab, Intel Labs. The sticking random bits fault model is Sara
Achour's own fault model. Random bit flip, byte flip and buffer flip are taken from
LLFI [24].
void fi_func(int size, char* buf, char* opcode);
Figure 4-4: The function signature of a fault model function. Size is the size of the
buffer, buf is the original buffer, which the function must modify, and opcode is the
instruction opcode.
#include "fi_func.c"

void init_factory() {

    create_factory();
    add_fi_func(&fi_func);
}
Figure 4-5: The factory function found in fi_factory.c. The file where the fault model
is located must be included in the file, and the function must be added to the factory.
Please note the position of the function in the factory, as it serves as the
fault model id number later used by the tool.
Random Bit Flips

The random bit flips fault model randomly flips bits in the buffer. At most 8 bits
can be flipped by this model, regardless of the buffer size. However, due to the way
the bits are chosen, the same bit could be flipped multiple times, returning it to its
original state.
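A sketch of this model on a raw byte buffer (a stand-in for the instruction buffer the tool mutates):

```python
import random

def random_bit_flips(buf, rng):
    # Flip 8 randomly chosen bit positions; positions may repeat, so a
    # bit flipped twice returns to its original value.
    out = bytearray(buf)
    for _ in range(8):
        pos = rng.randrange(len(out) * 8)
        out[pos // 8] ^= 1 << (pos % 8)
    return bytes(out)

flipped = random_bit_flips(bytes(8), random.Random(1))
changed = sum(bin(b).count("1") for b in flipped)  # at most 8 bits differ
```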
Sticking Random Bits

The sticking random bits fault model randomly generates a mask that is used to pick
a value from one of two buffers. The first buffer is the original instruction buffer,
while the second buffer is generated by applying the XOR fault model to the original
instruction buffer. The mask is set for each byte individually, for a total of 8 bit
flips across all the masks.
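A sketch of the mask-based selection; the XOR-generated second buffer is stood in here by a fixed pattern, since only the selection mechanism is being illustrated:

```python
import random

def sticking_bits(original, xored, rng):
    # Pick 8 bit positions; at each one, take the bit from the second
    # (XOR-generated) buffer instead of from the original.
    out = bytearray(original)
    for _ in range(8):
        pos = rng.randrange(len(out) * 8)
        byte, mask = pos // 8, 1 << (pos % 8)
        out[byte] = (out[byte] & ~mask) | (xored[byte] & mask)
    return bytes(out)

stuck = sticking_bits(bytes(8), bytes([0xFF] * 8), random.Random(3))
diff = sum(bin(b).count("1") for b in stuck)   # at most 8 bits changed
```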
XOR Mantissa Bits

The XOR mantissa bits fault model flips the mantissa bits, which are the lower 7 bits
of a float (32 bits) and the lower 44 bits of a double (64 bits). If used on an integer
(32 bits) or a long (64 bits), the same bits will be flipped.
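For a double, the described flip can be sketched with Python's struct module:

```python
import struct

def xor_mantissa_double(value):
    # Flip the lower 44 bits of the 64-bit representation; the sign and
    # exponent bits are left untouched.
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    bits ^= (1 << 44) - 1
    (faulty,) = struct.unpack("<d", struct.pack("<Q", bits))
    return faulty

y = xor_mantissa_double(1.0)   # 1.0 becomes 1.0 + (2**44 - 1) / 2**52
```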
Random Bit Flip
The random bit flip fault model generates a random number that corresponds to a
byte within the given buffer, and a random number that corresponds to a bit within
the selected byte, and flips the value of that bit by xoring it with 1.
Random Byte Flip

The random byte flip fault model generates a random number that corresponds to a
byte within the given buffer, and sticks a byte sized random number in that position.
Buffer Flip
The buffer flip fault model generates a byte sized random number for each byte in
the original buffer.
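These three LLFI-derived models can be sketched as follows (the buffer handling is illustrative):

```python
import random

def random_bit_flip(buf, rng):
    # xor one random bit in one random byte with 1
    out = bytearray(buf)
    out[rng.randrange(len(out))] ^= 1 << rng.randrange(8)
    return bytes(out)

def random_byte_flip(buf, rng):
    # replace one random byte with a byte-sized random number
    out = bytearray(buf)
    out[rng.randrange(len(out))] = rng.randrange(256)
    return bytes(out)

def buffer_flip(buf, rng):
    # replace every byte with a byte-sized random number
    return bytes(rng.randrange(256) for _ in range(len(buf)))

rng = random.Random(7)
one_bit = random_bit_flip(bytes(8), rng)
bits_changed = sum(bin(b).count("1") for b in one_bit)   # exactly 1
```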
4.5 Sensitivity Report Analysis

The Sensitivity Report Analysis module looks at the data produced by the Fault
Injector module and analyzes it to determine the effect the fault(s) had on the pro-
gram. A fault can be classified into 3 categories: Acceptable, Moderate, and Critical.
The classification of each fault is later used to visually show the impact of the faults
by the vim plugin or the other visualizations.
The Sensitivity Report Analysis module determines the effect a fault had on a pro-
gram by comparing the output of the fault injected program to the baseline
output of the program. The tool allows the user to write their own output abstraction
by using a template class, Output_Abs. The user has to write two functions:
* calculate_score - The function analyzes the aggregate score of a line in a source
file, given the number of acceptable, moderate, and critical faults that were
observed for that line, by the user's classification, and outputs the score and
classification of the source line based on the score.
faults are the same regardless of the variation they caused, the output that is saved in
the summary described in Section 4.5.2 allows the user to see that information. This
is useful if, for example, the user wants to know how many faults were caused by
segmentation faults and/or timeouts.
Acceptable
The Acceptable category is used to classify faults that cause acceptable or preferably
no variation in the output of the fault injected program when compared to the original
output.
Moderate
The Moderate category is used to classify faults that cause a moderate variation in
output. An example of such a case would be a valid output that within multiple
Critical
The Critical category pertains to any fault that causes an extreme variation in the
output; it is recommended that segmentation faults be considered part of this
category, as well as timeouts.
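An illustrative classification rule (the thresholds are hypothetical; the tool delegates the actual rule to the user's output abstraction):

```python
def classify(baseline, observed, crashed=False, timed_out=False,
             moderate_tol=0.01, critical_tol=0.10):
    # Segmentation faults and timeouts are treated as Critical, as the
    # text recommends; otherwise classify by relative deviation.
    if crashed or timed_out:
        return "Critical"
    deviation = abs(observed - baseline) / max(abs(baseline), 1e-12)
    if deviation <= moderate_tol:
        return "Acceptable"
    if deviation <= critical_tol:
        return "Moderate"
    return "Critical"

labels = (classify(100.0, 100.5),                # 0.5% -> Acceptable
          classify(100.0, 105.0),                # 5%   -> Moderate
          classify(100.0, 100.0, crashed=True))  # crash -> Critical
```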
4.5.2 Summary
The tool will record the output abstraction results into a summary file for each in-
jected fault, as well as the source file, line number, opcode and the execution point
identifier of the injected fault. This file is useful for a quick analysis of the output
abstraction script over all the injected faults, and serves as a future reference point.
4.5.3 Vim Plugins

There are two vim plugins, FltMark and InstShow, that were written as a part of the
tool to help visualize the data in a text editor, on top of the source code.
FltMark
FltMark, as the name suggests, shows fault markers based on the classifications, for
lines of code that had faults injected to them. Since each line of code can have
multiple instructions, and multiple points of execution, the aggregated score for all
those data points is used to classify the line as explained in Section 4.5.1. Thus, this
vim plugin allows the developer to easily see the sensitivity of each source line to the
injected faults.
On top of the visual functionality, the plugin can run Fault Prophet from within vim
itself. In order to run the tool, the user must tag lines, and only the tagged lines
will have faults injected into them. Moreover, the user must generate a configuration
file that will be used by the plugin, from an existing configuration file, by running the
'config.py' script. While most settings will be used, some settings will be changed
automatically to suit running the tool from within vim, such as the
quota. Please see Section A.2 for full documentation of the plugin.
InstShow
InstShow is a plugin for vim that shows the instructions that are encapsulated in each
source line from the LLVM Intermediate Representation, IR. The plugin itself is based on
PyFlakes [6] and requires a vim that has been compiled with Python. As a part of the
Fault Prophet installation, such a vim is compiled and installed on the machine as
'vimpy'.
[Screenshot: the InstShow plugin displaying a fragment of the Jacobi source (the
uold copy loop and error check) inside vim.]
4.5.4 Visualizations
There are currently three supported visualizations: pie chart, bar graph and treemap
that visually show the categorization of faults in different ways.
Pie Chart
The pie chart visualizes the overall distribution of the different fault classifications as
described in Section 4.5.1 for the whole program. There are at most three parts of
the chart, green, yellow and red, which represent the fault classifications 'Acceptable',
'Moderate', and 'Critical', respectively, as shown in Figure 3-2. The goal of the
visualization is to quickly understand how sensitive the program is, for the specific
fault scenario and fault model.
Bar Graph
The bar graph visualizes the distribution of the faults for each function in the source
file(s) of the program, as well as the categorization of the faults as described in
Section 4.5.1. The height of each bar represents the total number of faults injected
to each function, and the bar itself is divided into three parts, green, yellow and
red, which represent the fault classifications 'Acceptable', 'Moderate', and 'Critical',
respectively, as shown in Figure 3-3. The goal of the visualization is for the user to
understand how sensitive a function is, for the specific fault scenario and fault model.
Using this visualization, the user can quickly understand which functions must be
explored in depth, using the treemap visualization.
Treemap
The treemap graph is used to show two things: the weight each entity, whether
function or source line, has in terms of instruction executions, and the aggregated
sensitivity of that entity. To exhibit those two qualities, the size of each rectangle in
the graph symbolizes the weight of each entity in relation to other entities, by using
the number of instructions executions that are mapped to that entity; the color of the
rectangle is the average score of a function based on the source lines in that function,
on a scale from 0 to 1, where 0 is not sensitive at all, and 1 is the most sensitive, as
shown in Figures 3-4, B-1. The tiling algorithm, which is used to determine the size of
each rectangle in the visualization, was taken from an existing Treemap library [22].
The treemap is the only interactive graph from the current visualizations. The initial
view shows the sensitivity of functions by source code file. Upon clicking on a rectangle
in the view, a new view will appear, showing the sensitivity of lines in that function.
Clicking on a rectangle in this view, will open gvim, a graphical vim interface, to the
source line in the source code file, the user clicked on. Using this view, the user can
quickly drill down, from a function, to a source line, to instructions, using the vim
plugins described in Section 4.5.3.
Chapter 5
Evaluation
As mentioned in Chapter 3, the motivation for the experiments was to evaluate the
tool in three different areas: instruction set sensitivity to faults, fault model analysis,
We used several programs to test the effectiveness of the tool, all of which were
5.1.1 Jacobi
There are three programs reporting the observations made when simulating the igni-
tion times of three chemical compounds: hydrogen, ic8 and methane, in tubes, under
different temperature conditions. Each program uses Cantera, a suite of object-
oriented software tools for problems involving chemical kinetics, thermodynamics,
and transport processes. Both the inputs and the outputs of the programs have been
modified to run for shorter intervals, still employing the same functionality, and to
produce a single number that sums up all the observations made by the program on
the data, respectively.
5.2 Performance
There are overhead costs to augmenting the program, in order to either profile the
program as described in Section 4.3 or to inject faults into it as described in Section 4.4.
This overhead cost has several factors, including processing time, as well as the storage
needed to run the tool on programs.
Table 5.1 compares the runtime of a program without Fault Prophet to the time it
takes Fault Prophet to reach the point of the first fault injection into the program for
binary operations, in seconds.
5.2.2 Storage
Table 5.2 shows the number of profiled instructions for binary operations by Fault
Prophet, and the amount of storage it required on the machine. The data that is
stored for all those profiled instructions, includes the execution id, as well as function
name, source file name, line number and opcode. In addition visualization data
will get aggregated for each source file name, and line number in different tables in
the database. However, since this data pertains only for instruction that had faults
injected to them it is insignificant compared to the rest of the data stored.
Program                  Number of Instructions Profiled  Sqlite3 Database Size
Jacobi                   557,375                          11
Ignition Case: Hydrogen  13,996,793                       253
Ignition Case: ic8       155,294,211                      2,944
Ignition Case: Methane   193,650,619                      3,701
The output of the benchmark programs was modified in order to produce a single
number as an output, which is the aggregation of the results from the observed simula-
tions. Therefore, the same output abstraction class and fault classification techniques
that were used for the Jacobi program, described in Sections 3.4.1 and 3.4.2, were used to
analyze the rest of the benchmark programs.
Based on the results of the different fault models of the case study outlined in Sec-
tion 3.5.1, as well as additional results of the ignition cases shown in the tables in this
chapter, including but not limited to Tables 5.3, 5.4, 5.5, the 'Buffer Flip' fault model
is the one that causes the most disruption in the output of the program, with the
output varying about 50% of the time. However, the likelihood that a whole
buffer will be flipped as a result of a SEU is rather low, making the fault model
an unlikely candidate for understanding the behavior of the program. The fault model
that causes the least disruption in the output when a fault is injected is
'Sticking Bits', which on average causes variation in 1% of outputs, with the exception
of the data displayed in Table 3.2.
Fault Type  Random Bit Flips  Sticking Bits  XOR  Bit Flip  Byte Flip  Buffer Flip
Acceptable  851               886            755  768       777        649
Moderate    50                27             105  74        63         66
Critical    38                26             79   97        99         224

Table 5.3: Comparison of generated fault types by fault model, based on the Hydrogen
ignition case, with profiling binary operations, and 1,000 fault quota.
Fault Type           Random Bit Flips  Sticking Bits  XOR  Bit Flip  Byte Flip  Buffer Flip
Acceptable           841               895            789  789       799        685
Moderate             74                28             109  99        76         83
Critical             32                14             49   59        72         179
Segmentation Faults  16                9              128  52        60         156
Timeouts             12                1              4    118       7          6

Table 5.4: Comparison of generated fault types by fault model, based on the Methane
ignition case, with profiling binary operations, and 1,000 fault quota.
This section analyzes the sensitivity of the different instruction sets based on the
results of the case study outlined in Section 3.5.1, as well as of the ignition cases.
5.4.1 Binary
Tables 5.3, 5.4 provide additional insight into the fault outcomes that occur as a result
of injecting faults into the binary instruction set. As observed for the hydrogen and
methane ignition cases, faults injected to the binary instruction set produced similar
outcomes based on the number of fault classifications for each fault model. The
'Moderate' and 'Critical' faults, which combined make up 9% - 30% of the faults,
are fewer than the 17% - 82% that were observed in the Jacobi program as shown
in Table 3.2. The reason for this could be attributed to the fact that the ignition
cases perform many calculations that are not on the critical path of the program,
pertaining to calculations of species in the reaction. This strengthens the hypothesis
that the sensitivity of an instruction set depends on the program itself, rather than
the instruction set.
Results Analysis of Ignition Cases
The results of the hydrogen, ic8 and methane cases are not only similar in terms of
displaying close numbers of fault classifications, but also in the outcomes of the faults
themselves when mapped back to source lines. As the three programs use the same
libraries, and most of the instruction executions happen in those libraries, there is a
commonality in the faulty lines. A deeper inspection of the 'Moderate' and 'Critical'
faults shows that there is a greater likelihood for those faults to occur in the processes
that compute the reactions defined by the input parameters. An example is the
GasKinetics module, which implements a reaction mechanism, and in which
all of the reaction update functions the module provides are fault prone.
void GasKinetics::_update_rates_T() {
    ...
}
...
void GasKinetics::updateKc() {
    ...
    m_rkc[irxn] = exp(m_rkc[irxn]*rrt - m_dn[irxn]*logStandConc);
    ...
}
...
void GasKinetics::updateROP() {
    ...
    for (int j = 0; j != m_ii; ++j) {
        ropnet[j] = ropf[j] - ropr[j];
    }
    ...
}
The lines shown had more than 50% of the faults classified as 'Moderate' or 'Critical'
across the different fault models and programs, based on the aggregate faults that were
injected in those lines, suggesting that the reaction update process, wherever it takes
place, is a sensitive one. In addition, Figure 5-1 shows that the lines in another function,
which is used to update the reactants of a reaction with the data, dealing with the
actual update process, are sensitive to faults. The update processes had the same
outcomes in other modules of the libraries, such as IdealGasPhase, in which the
function that updates thermodynamic properties is also prone to 'Moderate' and
'Critical' faults. Therefore, the update functions in general, and the lines pertaining
to the actual updating in particular, can cause faulty outcomes, as they deal directly
with the output data.
5.4.2 Floating Point
As hypothesized in Section 3.5.1, faults that are injected into floating point operations
cause variations in the output; however, those variations will not cause a segmentation
fault, as those operations are not on the critical path of the program control.
A fault injected into floating point operations results in 3% - 13% 'Moderate' faults,
and 0% - 5% 'Critical' faults, as observed from Tables 3.4 and 5.5, with the exception
of Buffer Flip, which caused higher variations in output.
Fault Type            Random Bit Flips   Sticking Bits   XOR   Bit Flip   Byte Flip   Buffer Flip
Acceptable                  908               931         832      854        857          709
Moderate                     57                31         126       91         81          101
Critical                      2                 0           9       22         29          157

Table 5.5: Comparison of generated fault types by fault model, based on the Hydrogen
ignition case, with profiling floating point operations, and a 1,000 fault quota.
Figure 5-1: A sample of GasKinetics.cpp from the Cantera library, showing the
different fault outcomes, using the Bit Flip fault model on Binary instructions.
As the floating point operations are a subset of the binary instructions, much like in the
Jacobi program, the same source lines that were classified as 'Moderate' or 'Critical'
in the binary instruction set were classified the same way in the floating point operations
instruction set. For example, in the GasKinetics module, the lines in the updateKc
function that update the species, which directly influence the output, were found to be
'Moderate' and 'Critical' amongst the different fault models.
m_rkc[irxn] = exp(m_rkc[irxn]*rrt - m_dn[irxn]*logStandConc);
Moreover, much like the hypothesis made in the Jacobi case study, faults injected
solely into floating point operations in the hydrogen ignition case caused almost no
segmentation faults and/or timeouts, due to the calculations they execute, which are
by and large not on the critical path of the program flow.
5.4.3 Memory
Table 5.6 shows the results of injecting at most 1,000 faults into the memory instruction
set of the hydrogen ignition case. In addition, Table 5.7 presents the results of
the Random Bit Flips fault model for the methane ignition case, injecting 2 faults
into each profiled source line to achieve 100% line coverage. As shown in both tables,
the Random Bit Flips model generates roughly 66% 'Acceptable' faults, with the
remaining 33% classified as 'Moderate' or 'Critical'. In both experiments only three
instructions were profiled: call, invoke and load. Comparing the combined results
to the results from Table 3.5, which shows the analysis of the Jacobi program,
substantial differences emerge. First, there are many more 'Critical' faults in the
hydrogen ignition case, while the 'Moderate' faults stay in the same range of around
1% - 10%. Moreover, in the hydrogen ignition case, almost all of the 'Critical' faults
were the result of a segmentation fault, again unlike the Jacobi program as analyzed
in Section 3.5.1. Last but not least, every fault model, except for the XOR model,
caused between 1 - 2 timeouts in the hydrogen ignition case. These striking differences
between the two programs show that the impact of injecting faults into memory
operations is program dependent, and make it apparent that memory operations are
critical for the ignition cases.
Fault Type            Random Bit Flips   Sticking Bits   XOR   Bit Flip   Byte Flip   Buffer Flip
Acceptable                  636               753         529      562        549          425
Moderate                     41                23          73       48         50           40
Critical                    183                84         258      250        281          395

Table 5.6: Comparison of generated fault types by fault model, based on the Hydrogen
ignition case, with profiling memory operations, and a 1,000 fault quota.
Fault Type            Random Bit Flips
Acceptable                  1645
Moderate                      74
Critical                     822
Segmentation Faults          816
Timeouts                       3

Table 5.7: Results for the Random Bit Flips fault model when profiling memory
instructions for the Methane ignition case with 100% line coverage.
The results of the memory instructions were consistent across the ignition cases:
many of the faults that were injected into memory operations caused 'Critical'
faults, while almost none caused 'Moderate' faults. Figure 5-2 shows an example of how
the _update_rates_C function in the GasKinetics module, which is mainly composed
of memory instructions, produced all 'Critical' faults. A possible explanation
is that multiple memory instructions are encapsulated within each line, and a
deviation in each one could potentially cause a fault. For example, one such line
accesses an array location in the function call itself, which could cause an error if the
array index is invalid. The reason for the disparity with the Jacobi program could be
that there are many more memory accesses, and therefore more potential for deviation
in the memory calculations, due to the larger memory size allocated to the ignition
cases. However, it remains the case that memory instructions cause little to no timeouts,
as suggested in Section 3.5.1.
5.4.4 All
Tables 3.6 and 5.8 present the results of injecting faults into all instructions. As observed,
without differentiating which instruction sets faults are injected into, the programs are
error prone, averaging 50% 'Moderate' and 'Critical' faults. The ignition
cases have many more instructions that are profiled and have faults injected into them:
call, invoke, getelementptr, bitcast, icmp, load, sext, add, zext, shl, ptrtoint, sub,
ashr, extractvalue, and, trunc, ret, inttoptr, insertvalue, fcmp, sitofp, fmul, fadd,
fsub, fptosi, sdiv, mul, fdiv, store, xor, or, br, select, lshr. The additional opcodes,
such as ptrtoint, extractvalue, inttoptr and insertvalue, amongst the many others,
are critical to the proper flow of the program and the output, as they convert a pointer
to an integer, extract a value from an array or a struct, convert an integer to a pointer,
or insert a value into an array or a struct, respectively. The two major opcodes that are
utilized in the ignition cases are getelementptr and load, which pertain to memory
operations, collectively making up 66% of executed opcodes. As observed from the
analysis of the memory operations, those are error prone, causing roughly 33%
'Critical' faults on average. The third most executed opcode, icmp, is a logical
comparison opcode, which makes up about 12% of total executions. As icmp
pertains to logical operations, such as conditionals, it is very likely that faults
injected into those operations would produce a deviation in the output. The other 25%
of the executions are distributed amongst the other 31 opcodes listed above, with the
majority (over 50%) of them belonging to the binary instruction set, which is known
to be fault prone, possibly explaining the makeup of the rest of the 'Moderate' and
'Critical' faults.

Figure 5-2: A sample of GasKinetics.cpp from the Cantera library, showing the
different fault outcomes, using the Bit Flip fault model on Memory instructions.
66
Fault Type            Random Bit Flips   Sticking Bits   XOR   Bit Flip   Byte Flip   Buffer Flip
Acceptable                  559               822         401      485        450          309
Moderate                     39                27          41       45         57           37
Critical                    450               199         606      518        541          702
Segmentation Faults         420               157         564      486        509          656
Timeouts                     20                40          38       27         28           38

Table 5.8: Comparison of generated fault types by fault model, based on the Hydrogen
ignition case, with profiling all instructions, and a 1,000 fault quota.
Chapter 6
Discussion
The Fault Prophet fault injection tool provides the ability to test the resilience of
programs to faults caused by SEUs (Single Event Upsets), by identifying the critical
sections in the program and visualizing the findings. The additional value on top of
existing fault injection tools includes: greater modularity by abstracting the input and
output of the different components of the tool, a choice of fault scenarios, and a
visualization framework. Several commonalities were found amongst the results of the
experiments, enabling general suggestions for mitigating the impact of soft errors.
For code cleanliness and optimization, multiple lines of code can be combined into
a single line. For example, when loading a value from an array, the index can be
calculated in the memory access code, as in the Jacobi program. While combining the
index calculation with the memory access is typical, separating the two can provide a
means to mitigate errors, as the index calculation can be verified for correctness before
the memory access. While this creates an overhead, it applies only to memory index
and address calculations, and could salvage an execution of the program.
6.1.2 Conditionals
As observed from the results analysis, timeouts do occur in programs due to soft
errors in binary instructions, usually due to a change in the execution flow extending
the execution time. Moreover, if variables are calculated in an iterative manner,
additional iterations can cause deviations in the output of the program, which are
more critical in loops, as such deviations are compounded. An example of such a case
is in the main function of the ignition cases, in which a loop checks for Cantera errors
in the reaction. Additional iterations would change the calculated error deviation,
producing a different result.
while (errFlag)
{
    /* Repeat to resolve errors. */
    ctrErr++;
    ...
}
To mitigate the consequences of soft errors on the execution flow and output, and to
allow normal operation of the program, an additional conditional check in conditional
iterations could serve as a fail-safe mechanism to stop extraneous iterations from
taking place. While this will cause additional overhead, it applies only to conditionals,
and would protect critical calculations.
6.2 Design Challenges
6.2.1 Scalability
As the development of the tool progressed, larger programs were experimented with, as
described in Section 5.1. Those larger programs exhibited a behavior that had gone
unnoticed beforehand: the size of the log outputted by the Profiler module grew from
MBs to GBs. As a consequence, objects that were once held in memory grew too big,
impeding the system, and at times causing errors in the interpreter itself. To cope with
these problems, the tool uses sqlite3 to store the data, allowing the tool unlimited
space to hold the objects it needs. However, this comes at a cost to performance, as sqlite3,
and SQL in general, is not optimized to perform the queries the tool needs; moreover,
pulling data from a database is much slower than from in-memory objects. Therefore, the
tool uses a hybrid approach, storing the minimal data that is essential for most fault
selection calculations in memory, while keeping the log outputs in a database. This
hybrid approach is able to reduce the size of the log by 75% by normalizing the data,
as well as to select faults within a reasonable amount of time, ranging from seconds
to a few hours, depending on the program size. Additional optimizations could be
performed to reduce the time for fault selection even further.
There are multiple code optimizations that could reduce the overall runtime of the
tool.
SQL queries
The transition to a sqlite3 database to hold and retain the data of the profiled instructions
and the injected faults has a performance cost, as mentioned in Section 6.2.1.
A properly normalized and optimized database choice, structure and set of queries can
reduce the average runtime of the tool.
Under ideal settings, the Fault Selection Analysis module would have to run only once
per profile option, i.e. instruction set, and would save that data for future iterations
of the tool that use the same profiling option. While the current architecture of the
tool does not support this feature, profiling a fairly sized program execution creates a
bearable overhead, as discussed in Section 6.2.1.
The current architecture of the tool supports different fault models only for two fault
scenarios: single fault and burst mode (see Section 4.4.3). While the current fault
model for function calls and array faults is to replace the existing buffer, whether a call
instruction or a pointer, with a generated random value, as shown in Section 5.3,
the fault model has a great impact on the outcomes. As such, providing support for
multiple fault models is essential to fully support the identification of critical sections
via a myriad of fault scenarios.
The tool was designed to inject faults into instruction buffers. However, there are
times in which there is a need to test the behavior of an instruction when an operand
changes. Therefore, there is a need to inject faults into a more granular object
than instructions. As demonstrated by the array fault scenario, which modifies only
pointers in an instruction, this is possible, and should be exploited whenever possible.
6.3.4 Profiling Options
Loops
Loops are a critical asset of any program, and there could be variations in the sensitivity
of the program in general, and of critical sections in particular, if a fault is injected in
a loop, depending on which iteration the fault is injected into. A loops profiling option
would measure the number of times a loop runs, and would allow faults to be injected
into specific iterations.
As mentioned in Section 1.1.2, to measure the sensitivity of a fault, the tool compares
the faulty output to the baseline output, since it is the cheapest analysis in terms of
resources. However, comparing a faulty trace to the baseline trace would provide a
more accurate analysis; this requires optimizations of the trace and of the storage
mechanism of the traces, as they can take up a lot of space, as well as of the fault
models. Given faults injected using multiple fault models, a visualization, much like a
heat map, based on a composite score over all fault models, could provide a more
comprehensive view of the sensitivity of the program.
Appendix A
Documentation
This is the documentation of how to use the Fault Prophet fault injection tool and
the vim plugins designed for it.
After the tool has been installed, the main resource users will have to deal with is
the configuration file of the tool. The configuration file allows the user to set up all
the options pertaining to the program they want to run the tool on. Once the
configuration file has been set, and the program compiled, all that is left to do is run
the tool.
A.1.1 Installation
A virtual machine has been preconfigured with the tool, and can be obtained by
contacting the Computer Architecture research group at CSAIL, the Computer Science
and Artificial Intelligence Lab. The user name of the machine is 'neo' and the password
is '1234'. The tool itself can be found in the fault-inject folder under the home folder.
A.1.2 Compilation
In order for the tool to provide the best possible data on the tested program, it is
suggested to compile the program with the '-O0 -g' flags. The '-O0' flag prevents the
compiler from optimizing the program, allowing instructions to be mapped to source
lines with utmost accuracy. The '-g' flag must be used in order to map each instruction
to its source code file and source line. This is especially necessary if the user would like
to make use of the visualization and vim plugins.
A.1.3 Configuration
There are seven sections in the settings file: general, scripts, filter, modes, profiler,
faultinjection and analysis. Each section has one or more options that can be set.
While not all options are mandatory, it is recommended that the user fill in all the
options, in order to make sure the output is to their liking. Please see Figure B-2 for
an example of a configuration file. Settings marked with an asterisk (*) are
mandatory.
General section
" Program Name * (string) : the name of the executable without the .be extension.
The tool expects the executable to reside in the same directory as the tool itself.
" Program Input * (string) : the path of the file that has the program inputs. If
no inputs are needed or required, a path to file must still be specified, and the
file must exist. In such a case, use the touch command.
" SO-Libs * (string) : the path to the Fault Prophet LLVM libraries. In
accordance with the installation script, those will exist in <specified home
path> /llvm/llvm-2.9/Relase+Asserts/lib
Scripts section
This section specifies the paths of the scripts that the tool runs: the profiler, the fault
injection, the fault injector module compilation, and the fault analysis. These settings
should not be modified unless there is a reason to do so.
For example, the profiler script setting defaults to './run-profiler.sh'.
Filter section
This section pertains to the settings of the Fault Selection Analysis module.
- Note: The tool allows a user to create a filtering tool themselves and run
it through the tool. As such, the tool allows the user to specify a custom
filtering script.
The tool includes a filter tool that can be used with the following parameters:
[./run-filters.sh,<profiling option>]
* a : all instructions
* c : function calls
* v : arrays
* m : memory operations
* u : use-define chains
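As an illustration, assuming the semicolon-separated format shown in Figure B-2, a white-list that profiles both memory operations and function calls could be specified as:

```
white-list-filters = [./run-filters.sh,m];[./run-filters.sh,c]
```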
" black-list-filters (special format) : the instructions sets that will be filtered out.
This option takes precedence over the white-list-filters options.
" exclude-lines : Source lines from specific source files that will be filtered out, if
instructions on those lines are being profiled.
- Note: the tool allows a user to specify any source file name, and corre-
sponding line numbers. As such the format is:
The starting line number must be less than or equal to the ending line
number, for those lines to be included.
- Note: uses the same formatting as exclude-lines option. Please refer to the
note under that setting to obtain information about the input format for
this setting.
" is-skip * (boolean) : allows the user to specify whether lines will be excluded or
included, as only one option can be used.The default value is True, meaning to
use the exclude lines option, rather than include lines.
Modes Section
* run-native (boolean) : the mode in which the program needs to run: native, i.e.
the program will be compiled to an executable, or not, i.e. the LLVM interpreter,
lli, will be used. Some programs can run in only one of the two modes. The
default value is False.
Profiler Section
" trace (boolean) : an option to dynamically trace instructions for each fault
injection. The default value is False.
" static-trace (boolean) : output a static trace of the program. The default value
is False.
- Note: This must be True in order to use the InstShow vim plugin described
in Section 4.5.3.
Fault-Injection Section
* burst-mode (boolean) : use the burst of faults scenario. The default value is
False.
* burst-num-faults (integer) : if in burst mode, this option specifies the total
number of faults to inject.
* repeat-fault (integer) : specifies the number of faults to inject into the same buffer.
This does not count against the quota, meaning quota * repeat-fault faults
will be injected in total. The default value is 1.
- Note: This option is useful if the fault model is based on random values,
allowing the given buffer to be tested exhaustively.
* test-line-coverage (boolean) : if True, given a percentage, the tool will stop
injecting faults into the program once that percentage of source lines has been
tested. The tool will inject two faults into each source line in each source file,
regardless of the number of instructions it encapsulates or the number of instruction
executions it has, making the quota parameter extraneous in this case.
* func-faults (boolean) : use the function call and array fault scenarios described
in Section 4.4.2.
- Note: If the intended use is for array faults, then 'v' must be specified as a
profiled instruction set in the white-list-filter setting.
* fault-model * (integer) : the choice of fault model. The fault model is assigned
a number by the fault factory as described in Section 4.4.3.
- Note: Given the current fault factory settings, the fault models described
in Section 4.4.3 are mapped to the following numbers:
* 3 Flip a Random Bit
* 4 Flip a Random Byte
* 5 New Buffer
Analysis Section
" output-file * (string) : name of the file where to save the results of the output
abstraction script.
The FltMark vim plugin allows the user to inject faults directly into the program
itself from vim, as described in Section 4.5.3. However, preliminary steps must take
place before this functionality can be used.
In order to utilize the functionality of injecting faults directly from vim, the user must
first create a configuration file that is suitable for use by the plugin. The configuration
script, cfgCreator.py, lets the user generate a suitable configuration file, copying the
settings it needs from an existing configuration file, when run from the terminal window.
This command generates a new configuration file that will be used by the vim plugin
whenever faults are injected through vim. Some settings are modified from the original
settings, whatever they might be. The plugin configuration file will profile all
instruction sets, filter only the selected lines, and will inject 2 faults into instructions
executed in each of those lines (see Section A.2.3). Moreover, the configuration will be
modified to save no traces, static or dynamic, of those fault injections, to optimize the
fault injection time.
There are three commands to control whether to show the sensitivity of source lines in
a source file that is opened with vim, given that Fault Prophet was used beforehand:
FaultMarkersEnable
FaultMarkersDisable
FaultMarkersOnOff
The FaultMarkersOnOff command enables or disables the sensitivity view in vim,
depending on the current view.
There are two classes of commands that are supported by the plugin to inject faults:
TagLine(s) and InjectFault(s).
TagLine(s)
The TagLine command tags the current line the user is on for fault injection. Therefore,
the user can tag multiple lines and inject faults into all those lines at once. The
TagLines command allows the user to tag multiple lines at once; however, the user
must enter the command twice: once for the start line, and again for the end line.
The order of the start line and end line does not matter.
The following shows the Jacobi program opened in vim, with the :TagLines command
being entered to tag a range of lines for fault injection:

for(i=0;i<n;i++)
    for(j=0;j<m;j++)
        uold[i][j] = u[i][j];
...
error = error + resid*resid;
k = k + 1;
if (k%500 == 0) {
    //printf("Finished %d iteration.\n",k);
...
:TagLines
InjectFault(s)
The InjectFaults command injects faults into tagged lines. If no lines are tagged, the
command will not run, and will alert the user that lines need to be tagged first. The
InjectFault command, on the other hand, injects faults into the current line the user
is on, regardless of whether any lines are tagged. The tagged lines will be cleared once
the faults have been injected. In order to inject faults, the tool must reside in the same
directory as the source file, as well as the binary on which it will execute.
Appendix B
Figures
[general]
# program name without .bc extension
program-name = jacobi

# must specify an input file and must exist
program-input = input.jacobi

# fault injection libs
so-libs = /home/neo/llvm/llvm-2.9/Release+Asserts/lib

[scripts]
run-profiler = ./run-profiler.sh
run-inject = ./run-inject.sh
run-inject-compile = ./faultinject-compile.pl

[filter]

# profiled instructions to include (white) exclude (black),
# include script and profile option
# separate scripts and options using ;
# example: [./run-filters.sh,b];[./run-filters.sh,f]
# profile options:
# a - all instructions
# b - binary operations (include float)
# f - float operations
# c - function calls
# v - arrays
# m - memory operations
# u - usedef chains
# s - backward slice = all except invoke and cast

white-list-filters = [./run-filters.sh,f]
black-list-filters =

# profiled lines to include or exclude
# include the source file name and starting and ending line numbers,
# both inclusive; the format is [src file name, start line, end line]
# include additional files and line numbers by separating them with ;
# is-skip = True if skipping, else would include

exclude-lines = [jacobi.c,229,249] # errorcheck function
include-lines =
is-skip = True

[modes]
# Whether to run the program in native mode, i.e. create an executable
run-native = True

# Runtime in secs
run-time = 5

[profiler]
# True to dynamically trace instructions for each fault injection
trace = False

# Static trace of program.
static-trace = True

[faultinjection]
# number of faults to inject
quota = 1000

# True for burst mode, otherwise False
burst-mode = False

# number of instructions to jump before injecting a fault in burst mode
burst-repeat = 2

# number of total faults to inject in burst mode
burst-num-faults = 5

# number of faults to inject in the same buffer
# (useful if flipping a random bit or byte)
repeat-fault = 1

# line coverage percentage
# if test-line-coverage is set to true, then won't reach quota,
# but will stop when reaching line coverage percentage
test-line-coverage = False
line-coverage-percent = 90

# fault injection noise
noise-max = 11000
noise-control = 0.002

# True if want to inject faults to function calls, false otherwise
# if True, must specify "c", function calls to be profiled in filter section
func-faults = False

...
# 5 - flip the entire buffer with random bytes
# for 6, you can set the noise level under fault-injection noise

fault-model = 5

[analysis]
...

Figure B-2: A sample configuration file of Fault-Prophet for the Jacobi program.
import os.path
import subprocess

settings = {'mean-green-lower'   : 0.0,
            'mean-green-upper'   : 0.0,
            'mean-yellow-lower'  : 0.0,
            'mean-yellow-upper'  : 1.0,
            'mean-red-lower'     : 1.0,
            'mean-red-upper'     : float('inf'),
            'mean-segfault'      : -1,
            'mean-default'       : 'critical',
            'output-abstraction' : 'distortion.py'}

class OutputAbs:
    def __init__(self):
        pass

    ...

    def output_analysis(self, fault_output, fault_path, baseline_path):
        output = ""
        mean_ = 0.0
        if fault_output == "Segmentation Fault":
            mean_ = settings["mean-segfault"]
            output = "Segmentation Fault"
        elif fault_output == "TimeOut":
            mean_ = settings["mean-segfault"]
            output = "Time Out"
        else:
            proc = subprocess.Popen(['python', settings["output-abstraction"],
                                     baseline_path, fault_path],
                                    stdout=subprocess.PIPE)
            output = proc.stdout.read()
            _meanindx = output.find("Mean")
            if _meanindx != -1:
                _meanindx += 5
                mean_ = output[_meanindx:output.find("\n", _meanindx)]
                mean_ = float(mean_)

        return output, mean_

    def classify_fault(self, mean_):
        classification = settings["mean-default"]

        if mean_ == settings["mean-segfault"]:
            classification = "critical"
        elif settings["mean-green-lower"] <= mean_ <= settings["mean-green-upper"]:
            classification = "acceptable"
        elif mean_ <= settings["mean-yellow-upper"]:
            classification = "moderate"
        else:
            classification = "critical"

        return classification

    def analyze(self, fault_path, baseline_path):
        if not os.path.exists(fault_path):
            return
        fault_file = open(fault_path)
        fault_output = fault_file.read().strip()
        fault_file.close()

        output, mean_ = self.output_analysis(fault_output,
                                             fault_path, baseline_path)
        classification = self.classify_fault(mean_)

        return output, classification

Figure B-3: The output abstraction wrapper for the analysis of the Jacobi program.
#!/usr/bin/env python

import sys
import os
import re
import math
import numpy
from pprint import pprint

STATS = {
    "min"    : numpy.min,
    "max"    : numpy.max,
    "mean"   : numpy.mean,
    "median" : numpy.median,
    "stdev"  : numpy.std,
    "range"  : lambda x: numpy.max(x) - numpy.min(x),
    ...
}

STATS_DESCRIPTION = [
    ("min", "Minimum"),
    ("mean", "Mean"),
    ("median", "Median"),
    ("max", "Maximum"),
    ("stdev", "Standard Deviation"),
    ("range", "Range"),
    ("zeros", "Zero elements")
]

ROUND_PLACES = 10

## For all errors, the first argument is the reference value
ERR_FUNCTION = {
    "abserr" : lambda x, y: math.fabs(x - y),
    ...
}

ERR_FUNCTION_DEFAULT = "absrelerr"
#ERR_FUNCTION_DEFAULT = "relerr"

WORD_DELIMITERS = " \t"
COMMENT_DELIMITERS = "#="
DIGIT_PRECISION = 10

##
## Calculates the distortion. Measures the whole contents of
## both files.
##

def readLines(FileInput, comments):
    ListOutput = []
    ...
    for line0 in FileInput:
        line = line0.strip()
        if len(line) == 0: continue

        continueLoop = True
        for lcomment in lcomments:
            if re.match(lcomment, line):
                continueLoop = False
                break
        ...
        ListOutput.append(line)

    return ListOutput


def convertLineToNumbers(strLine, delimiters):
    r = re.compile('[%s]' % delimiters)
    ...


def convertToValues(ListStringInput, delimiters):
    ListIntOutput = []
    ListIntPerLineOutput = []

    for pl in ListStringInput:
        ...
        ListIntOutput.extend(apl)
        ListIntPerLineOutput.append(len(apl))
    ...


def isNan(x):
    ...


def calculateStats(ListDistortions, statsWhitelist = STATS.keys()):
    ...


def readFromFile(FileInput, delimiters = WORD_DELIMITERS,
                 comments = COMMENT_DELIMITERS):
    ...


def computeDistortions(refValues, cmpValues):
    distortions = []
    for rv, cv in map(None, refValues, cmpValues):
        ...


def printDistortions(distortions, elementsStart = 0, elementsPerLine = [],
                     minimumPrint = False):
    ...
    if elementsPerLine and not minimumPrint:
        start = 0
        for linecnt in elementsPerLine:
            print "%4d: %s" % (elementsStart + start, " ".join(
                [str(d) for d in distortions[start:start + linecnt]]))
            start += linecnt
    else:
        print " ".join([str(d) for d in distortions])


def constructOptions():
    ...
    parser.add_option("-c", "--comments",
                      action="store", dest="comments",
                      default=COMMENT_DELIMITERS,
                      help="Characters in input files that begin comments")
    ...
    # parser.add_option("-m", "--mode",
    #                   action="store", dest="mode",
    #                   default=MODE_DEFAULT,
    #                   help="Select mode of operation (one of %s)" % ", ".join(MODES))
    ...
    parser.add_option("-p", "--precision",
                      action="store", type="int", dest="precision",
                      default=DIGIT_PRECISION,
                      help="Number of significant digits while printing the output")
    ...
    parser.add_option("-i", "--individual",
                      ...
                      help="Print individual distortions")
    ...
    parser.add_option("-s", "--summary",
                      ...
                      help="Print summary statistics")
    parser.add_option("-S", "--statistics",
                      ...
                      default=",".join(STATS.keys()),
                      help="Summary statistics components (including the following: %s, or 'all')"
                           % ", ".join(STATS.keys()))

    return parser


def prepareData(args, options):
    ...
    if len(args) != 2:
        print "There should be exactly two arguments: <reference-src> and <comparison-src>"
        exit(1)
    ...
    if options.printSummary and len(options.statistics) == 0:
        print "When printing summary at least one statistic must be chosen (check parameter -S)"
        exit(1)
    ...
    refValues, elementsPerLine = readFromFile(refInput, delimiters = options.delimiters,
                                              comments = options.comments)
    cmpValues, _ = readFromFile(cmpInput, delimiters = options.delimiters,
                                comments = options.comments)

    return refValues, cmpValues, elementsPerLine


def filterValues(refValues, cmpValues, options):
    ...


def computeDistortion(refValues, cmpValues, options):
    refValues, cmpValues = filterValues(refValues, cmpValues, options)
    ...
    try:
        errFunction = ERR_FUNCTION[options.errorFunction]
    except KeyError:
        print "Error function %s unknown!" % options.errorFunction
        exit(1)
    ...


def main():
    ...
    refValues, cmpValues, elementsPerLine = prepareData(args,
options)
316
317 rl, cl = len(refValues), len(cmpValues)
318
319 if rl == 0:
321 exit(1)
322
323 if options.cmpMultiple:
326 exit(1)
327 else:
328 if cl != rl:
329 print "Theunumberuofacompari sonvalues,(%d),shouldjbe
cl, rl)
330 exit (1)
331
332 start = 0
333 totaldistortions = []
334 while start < cl:
335 distortion, summary = computeDistortion( refValues,
336 totaldistortions.extend(distortion)
103
337 presentResults(distortion, None if options.cmpMultiple
else summary, options, elementsStart = start,
elementsPerLine = elementsPerLine)
338 start += rl
339
340 if options.cmpMultiple:
341 summary = calculateStats(totaldistortions , statsWhitelist
= options.statistics.split(','))
342 presentResults(None, summary, options)
343
344
346 main()
Figure B-4: The output abstraction algorithm used for the analysis of the Jacobi
program.
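Because parts of the listing above were lost in extraction, the core comparison step can be summarized in a small self-contained sketch. Only "abserr" appears verbatim in the recovered text; the "relerr" and "absrelerr" forms below are plausible reconstructions based on the function names, not the thesis originals, and the `distortion` helper is hypothetical.

```python
import math

# Error functions keyed as in the listing above; for each, the first
# argument is the reference value and the second the compared value.
ERRFUNCTION = {
    # absolute error |x - y| (verbatim from the listing)
    "abserr": lambda x, y: math.fabs(x - y),
    # relative error |x - y| / |x| (reconstruction; assumes x != 0)
    "relerr": lambda x, y: math.fabs(x - y) / math.fabs(x),
    # absolute-or-relative error (reconstruction): falls back to the
    # absolute error when the reference value is zero, avoiding a
    # division by zero
    "absrelerr": lambda x, y: math.fabs(x - y) if x == 0
                 else math.fabs(x - y) / math.fabs(x),
}

def distortion(refValues, cmpValues, errname="absrelerr"):
    """Per-element distortion between a reference and a comparison run."""
    err = ERRFUNCTION[errname]
    return [err(rv, cv) for rv, cv in zip(refValues, cmpValues)]
```

For example, `distortion([2.0, 0.0], [1.0, 0.5])` yields `[0.5, 0.5]`: the first element is a relative error, the second falls back to an absolute error because the reference is zero.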
#!/bin/bash

export PRGNAME="main.x"
export PRGINPUT="input.main.x"
export PRGQUOTA="1000"

export FILTERBLACKLIST=""
export FILTERINSTRS=""
export FILTEREXCLUDELINES=""
export FILTERINCLUDELINES=""
export FILTERISSKIP="True"

export PRGRUNNATIVE="False"
export PRGRUNTIME="20"

export FIREPEATFAULT="1"
export FIFUNCFAULTS="True"

export FILINECOVERAGE="True"
export FILINECOVERAGEFAULTS="2"
export FILINECOVERAGEPERCENT="100"

export FINOISEMAX="1000"
export FINOISECONTROL="0.123"

for option in 'b' 'f' 'm' 'a'
do
    export OPTIONOUTPUTDIR="$option-$(date +_%H%M%S_%m%d%y)"
    mkdir $OPTIONOUTPUTDIR

    for i in $(seq 0 5)
    do
        export PRGFAULTMODEL=$i
        export PRGSETTINGSPATH="settings-$PRGNAME$(date +_%H%M_%m%d%y).cfg"

        python configCreator.py
        python script.py $PRGSETTINGSPATH

        export OUTPUTDIR="$PRGNAME-$PRGFAULTMODEL$(date +_%H%M%S_%m%d%y)"

        mkdir $OUTPUTDIR

        mv $PRGNAME\_output $OUTPUTDIR
        mv $PRGNAME\_baseline $OUTPUTDIR
        mv faults $OUTPUTDIR
        mv analysis $OUTPUTDIR
        mv trace $OUTPUTDIR
        mv logs $OUTPUTDIR
        mv summary.txt $OUTPUTDIR
        mv $PRGSETTINGSPATH $OUTPUTDIR
        mv $PRGNAME.mem2reg.bc.analysis.log $OUTPUTDIR
        mv $PRGNAME.mem2reg.bc.log $OUTPUTDIR
        mv $PRGNAME.db $OUTPUTDIR
        mv fltmark* $OUTPUTDIR
        tar -zcf "$OUTPUTDIR.tar.gz" $OUTPUTDIR
        rm -rf $OUTPUTDIR
        mv "$OUTPUTDIR.tar.gz" $OPTIONOUTPUTDIR
    done
done
Figure B-5: The experiments script used to run the tool for the evaluation of both
the tool and programs.
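The inner loop of the script above runs the tool once per fault model and archives each run's outputs. A minimal self-contained sketch of that flow, with the actual tool invocations (configCreator.py, script.py) and archiving replaced by echoed placeholders since those tools are not part of this listing:

```shell
#!/bin/bash
# Sketch of the per-fault-model experiment loop from Figure B-5.
# The directory naming here is simplified (no timestamp suffix).
PRGNAME="main.x"

for PRGFAULTMODEL in $(seq 0 5)
do
    # One settings file and one output archive per fault model
    PRGSETTINGSPATH="settings-$PRGNAME-model$PRGFAULTMODEL.cfg"
    OUTPUTDIR="$PRGNAME-$PRGFAULTMODEL"

    echo "run: python configCreator.py && python script.py $PRGSETTINGSPATH"
    echo "archive: tar -zcf $OUTPUTDIR.tar.gz $OUTPUTDIR"
done
```

Deleting each run directory after `tar` (as the real script does) keeps only the compressed archives, which bounds disk usage across the 4 options x 6 fault models = 24 experiment runs.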
Bibliography
[1] Robert Baumann. Soft errors in advanced computer systems. Design & Test of
Computers, IEEE, 22(3):258-266, 2005.
[3] Franck Cappello, Al Geist, Bill Gropp, Laxmikant Kale, Bill Kramer, and Marc
Snir. Toward exascale resilience. International Journal of High Performance
Computing Applications, 23(4):374-388, 2009.
[4] Seungjae Han, Kang G Shin, and Harold A Rosenberg. DOCTOR: An integrated
software fault injection environment for distributed real-time systems. In
Computer Performance and Dependability Symposium, 1995. Proceedings.,
International, pages 204-213. IEEE, 1995.
[5] Siva Kumar Sastry Hari, Sarita V Adve, Helia Naeimi, and Pradeep
Ramachandran. Relyzer: exploiting application-level fault equivalence to analyze
application resiliency to transient faults. In ACM SIGARCH Computer
Architecture News, volume 40, pages 123-134. ACM, 2012.
[7] John G. Holm and Prithviraj Banerjee. Low cost concurrent error detection in a
VLIW architecture using replicated instructions. In Proceedings of the
International Conference on Parallel Processing, pages 192-195, 1992.
[9] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong
program analysis & transformation. In Code Generation and Optimization, 2004.
CGO 2004. International Symposium on, pages 75-86. IEEE, 2004.
[10] Dong Li, Jeffrey S Vetter, and Weikuan Yu. Classifying soft error vulnerabilities
in extreme-scale scientific applications using a binary instrumentation tool. In
Proceedings of the International Conference on High Performance Computing,
Networking, Storage and Analysis, SC '12, pages 57:1-57:11, Los Alamitos, CA,
USA, 2012. IEEE Computer Society Press.
[11] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff
Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building
customized program analysis tools with dynamic instrumentation. ACM SIGPLAN
Notices, 40(6):190-200, 2005.
[12] Robert E. Lyons and Wouter Vanderkulk. The use of triple-modular redundancy
to improve computer reliability. IBM Journal of Research and Development,
6(2):200-209, 1962.
[15] Asit K Mishra, Rajkishore Barik, and Somnath Paul. iACT: A software-hardware
framework for understanding the scope of approximate computing. 2014.
[16] Nahmsuk Oh, Philip P. Shirvani, and Edward J. McCluskey. Error detection by
duplicated instructions in super-scalar processors. Reliability, IEEE Transactions
on, 51(1):63-75, 2002.
[18] George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I.
August, and Shubhendu S. Mukherjee. Software-controlled fault tolerance. ACM
Trans. Archit. Code Optim., 2(4):366-396, December 2005.
[20] Pooja Roy, Rajarshi Ray, Chundong Wang, and Weng-Fai Wong. ASAC:
Automatic sensitivity analysis for approximate computing. 2014.
[23] Anna Thomas and Karthik Pattabiraman. Error detector placement for soft
computation. In Dependable Systems and Networks (DSN), 2013 43rd Annual
IEEE/IFIP International Conference on, pages 1-12. IEEE, 2013.
[24] Anna Thomas and Karthik Pattabiraman. LLFI: An intermediate code level fault
injector for soft computing applications. ser. SELSE, 2013.
[25] Timothy K Tsai and Ravishankar K Iyer. FTAPE: A fault injection tool to
measure fault tolerance. In AIAA Computing in Aerospace Conference, 10th,
San Antonio, TX, pages 339-346, 1995.