Computer Architecture Performance Evaluation Methods
Copyright © 2010 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in
printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00273ED1V01Y201006CAC010
Lecture #10
Series Editor: Mark D. Hill, University of Wisconsin
Series ISSN
Synthesis Lectures on Computer Architecture
Print 1935-3235 Electronic 1935-3243
Synthesis Lectures on Computer Architecture
Editor
Mark D. Hill, University of Wisconsin
Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics
pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals. The scope will
largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA,
MICRO, and ASPLOS.
On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
2009
The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It
Bruce Jacob
2009
Transactional Memory
James R. Larus and Ravi Rajwar
2006
Lieven Eeckhout
Ghent University
Morgan & Claypool Publishers
ABSTRACT
Performance evaluation is at the foundation of computer architecture research and development.
Contemporary microprocessors are so complex that architects cannot design systems based on in-
tuition and simple models only. Adequate performance evaluation methods are absolutely crucial to
steer the research and development process in the right direction. However, rigorous performance
evaluation is non-trivial as there are multiple aspects to performance evaluation, such as picking
workloads, selecting an appropriate modeling or simulation approach, running the model and in-
terpreting the results using meaningful metrics. Each of these aspects is equally important and a
performance evaluation method that lacks rigor in any of these crucial aspects may lead to inaccurate
performance data and may drive research and development in a wrong direction.
The goal of this book is to present an overview of the current state-of-the-art in computer
architecture performance evaluation, with a special emphasis on methods for exploring processor
architectures. The book focuses on fundamental concepts and ideas for obtaining accurate perfor-
mance data. The book covers various topics in performance evaluation, ranging from performance
metrics, to workload selection, to various modeling approaches including mechanistic and empirical
modeling. And because simulation is by far the most prevalent modeling technique, more than half
the book’s content is devoted to simulation. The book provides an overview of the simulation tech-
niques in the computer designer’s toolbox, followed by various simulation acceleration techniques
including sampled simulation, statistical simulation, parallel simulation and hardware-accelerated
simulation.
KEYWORDS
computer architecture, performance evaluation, performance metrics, workload charac-
terization, analytical modeling, architectural simulation, sampled simulation, statistical
simulation, parallel simulation, FPGA-accelerated simulation
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Structure of computer architecture (r)evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Importance of performance evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Book outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Single-threaded workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5
2.2 Multi-threaded workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Multiprogram workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 System throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Average normalized turnaround time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.3 Comparison to prevalent metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.4 STP versus ANTT performance evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Average performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.1 Harmonic and arithmetic average: Mathematical viewpoint . . . . . . . . . . . . . .12
2.4.2 Geometric average: Statistical viewpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.3 Final thought on averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Partial metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Workload Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 From workload space to representative workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Reference workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16
3.1.2 Towards a reduced workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 PCA-based workload design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 General framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.2 Workload characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.3 Principal component analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.4 Cluster analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Plackett and Burman based workload design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Limitations and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .49
5.1 The computer architect’s toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Functional simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2.1 Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .51
5.2.2 Operating system effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Full-system simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 Specialized trace-driven simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.5 Trace-driven simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.6 Execution-driven simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.6.1 Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.6.2 Dealing with non-determinism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.7 Modular simulation infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.8 Need for simulation acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6 Sampled Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.1 What sampling units to select? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.1.1 Statistical sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.1.2 Targeted Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1.3 Comparing design alternatives through sampled simulation . . . . . . . . . . . . . .70
6.2 How to initialize architecture state? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2.1 Fast-forwarding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2.2 Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.3 How to initialize microarchitecture state? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3.1 Cache state warmup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3.2 Predictor warmup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.3.3 Processor core state . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.4 Sampled multiprocessor and multi-threaded processor simulation . . . . . . . . . . . . . . 78
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
BOOK ORGANIZATION
This book is organized as follows, see also Figure 1.
Chapter 2 describes ways to quantify performance and revisits performance metrics for single-
threaded workloads, multi-threaded workloads and multi-program workloads. Whereas quantifying
performance for single-threaded workloads is straightforward and well understood, some may still
be confused about how to quantify multi-threaded workload performance. This is especially true for
quantifying multiprogram performance. This book sheds light on how to do a meaningful multipro-
gram performance characterization by focusing on both system throughput and job turnaround time.
We also discuss ways for computing the average performance number across a set of benchmarks
and clarify the opposing views on computing averages, which have fueled debate over the past two
decades.
Chapter 3 talks about how to select a representative set of benchmarks from a larger set
of specified benchmarks. The chapter covers two methodologies for doing so, namely Principal
Component Analysis and the Plackett and Burman design of experiment. The idea behind both
methodologies is that benchmarks that are similar in their inherent behavior and/or in their interaction
with the microarchitecture should not both be part of the benchmark suite; only dissimilar
benchmarks should be retained. By retaining only the dissimilar benchmarks, one can reduce the
number of benchmarks in the suite and thus reduce overall experimentation time while not sacrificing
too much accuracy.
Analytical performance modeling is the topic of Chapter 4. Although simulation is the preva-
lent computer architecture performance evaluation method, analytical modeling clearly has its place
in the architect’s toolbox. Analytical models are typically simple and, therefore, very fast. This allows
for using analytical models to quickly explore large design spaces and narrow down on a region
of interest, which can later be explored in more detail through simulation. Moreover, analytical
models can provide valuable insight, which is harder and more time-consuming to obtain through
simulation. We discuss three major flavors of analytical models. Mechanistic modeling or white-
box modeling builds a model based on first principles, along with a good understanding of the
system under study. Empirical modeling or black-box modeling builds a model through training
based on simulation results; the model is typically a regression model or a neural network. Finally, hybrid
mechanistic-empirical modeling aims at combining the best of both worlds: it provides insight (which
it inherits from mechanistic modeling) while easing model construction (which it inherits from
empirical modeling).
Chapter 5 gives an overview of the computer designer’s toolbox while focusing on simulation
methods. We revisit different flavors of simulation, ranging from functional simulation, (specialized)
trace-driven simulation, execution-driven simulation, full-system simulation, to modular simulation
infrastructures. We describe a taxonomy of execution-driven simulation, and we detail ways to deal
with non-determinism during simulation.
The next three chapters, Chapters 6, 7, and 8, cover three approaches to accelerating simulation,
namely sampled simulation, statistical simulation, and exploiting parallelism. Sampled simulation
(Chapter 6) simulates only a small fraction of a program’s execution. This is done by selecting
a number of so called sampling units and only simulating those sampling units. There are three
challenges for sampled simulation: (i) what sampling units to select; (ii) how to initialize a sampling
unit’s architecture starting image (register and memory state); (iii) how to estimate a sampling unit’s
microarchitecture starting image, i.e., the state of the caches, branch predictor, and processor core
at the beginning of the sampling unit. Sampled simulation has been an active area of research over
the past few decades, and this chapter covers the most significant problems and solutions.
Statistical simulation (Chapter 7) takes a different approach. It first profiles a program exe-
cution and collects some program metrics that characterize the program’s execution behavior in a
statistical way. A synthetic workload is generated from this profile; by construction, the synthetic
workload exhibits the same characteristics as the original program. Simulating the synthetic work-
load on a simple statistical simulator then yields performance estimates for the original workload.
The key benefit is that the synthetic workload is much shorter than the real workload, and as a
result, simulation is done quickly. Statistical simulation is not meant to be a replacement for detailed
cycle-accurate simulation but rather as a useful complement to quickly explore a large design space.
Chapter 8 covers three ways to accelerate simulation by exploiting parallelism. The first ap-
proach leverages multiple machines in a simulation cluster to simulate multiple fragments of the
entire program execution in a distributed way. The simulator itself may still be a single-threaded
program. The second approach is to parallelize the simulator itself in order to leverage the available
parallelism in existing computer systems, e.g., multicore processors. A parallelized simulator typi-
cally exploits coarse-grain parallelism in the target machine to efficiently distribute the simulation
work across multiple threads that run in parallel on the host machine. The third approach aims at
exploiting fine-grain parallelism by mapping (parts of) the simulator on reconfigurable hardware,
e.g., Field Programmable Gate Arrays (FPGAs).
Finally, in Chapter 9, we briefly discuss topics that were not (yet) covered in the book, namely
measurement bias, design space exploration and simulator validation, and we look ahead to the
challenges in computer performance evaluation.
Lieven Eeckhout
June 2010
Acknowledgments
First and foremost, I would like to thank Mark Hill and Michael Morgan for having invited
me to write a synthesis lecture on computer architecture performance evaluation methods. I was
really honored when I received the invitation from Mark, and I really enjoyed working on the book.
Special thanks also to my reviewers who have read early versions of the book and who gave
me valuable feedback, which greatly helped me improve the text. Many thanks to: Joel Emer, Lizy
John, Babak Falsafi, Jim Smith, Mark Hill, Brad Calder, Benjamin Lee, David Brooks, Amer Diwan,
Joshua Yi and Olivier Temam.
I’m also indebted to my collaborators over the past years who have given me the opportunity
to learn more and more about computer architecture in general and performance evaluation methods
in particular. This book comprises many of their contributions.
Last but not least, I would like to thank my wife, Hannelore Van der Beken, for her endless
support throughout this process, and our kids, Zoë, Jules, Lea and Jeanne for supporting me —
indirectly — through their love and laughter.
Lieven Eeckhout
June, 2010
CHAPTER 1
Introduction
Performance evaluation is at the foundation of computer architecture research and development.
Contemporary microprocessors are so complex that architects cannot design systems based on in-
tuition and simple models only. Adequate performance evaluation methodologies are absolutely
crucial to steer the development and research process in the right direction. In order to illustrate
the importance of performance evaluation in computer architecture research and development, let’s
take a closer look at how the field of computer architecture makes progress.
Figure 1.1: Structure of (a) scientific research versus (b) systems research.
while sweeping through the design space. The key question then is how to navigate through this wealth
of data and draw meaningful conclusions. Interpretation of the results is crucial to make correct
design decisions. The insights obtained from the experiment may or may not support the hypothesis
made by the architect. If the experiment supports the hypothesis, the experimental design is im-
proved and additional experimentation is done, i.e., the design is incrementally refined (inner loop).
If the experiment does not support the hypothesis, i.e., the results are completely surprising, then the
architect needs to re-examine the hypothesis (outer loop), which may lead the architect to change
the design or propose a new design.
Although there are clear similarities, there are also important differences that separate systems
research from scientific research. Out of practical necessity, the step of picking a baseline design and
workloads is typically based on the experimenter's judgment and experience, rather than objectively
drawing a scientific sample from a given population. This means that the architect should be well
aware of the subjective human aspect of experiment design when interpreting and analyzing the
results. Trusting the results produced through the experiment without a clear understanding of its
design may lead to misleading or incorrect conclusions. In this book, we will focus on the scientific
approach suggested by Kuhn, but we will also pay attention to making the important task of workload
selection less subjective.
CHAPTER 2
Performance Metrics
Performance metrics are at the foundation of experimental research and development. When eval-
uating a new design feature or a novel research idea, the need for adequate performance metrics is
paramount. Inadequate metrics may be misleading, lead to incorrect conclusions, and steer research
and development in the wrong direction. This chapter discusses metrics for evalu-
ating computer system performance. This is done in a number of steps. We consider metrics for
single-threaded workloads, multi-threaded workloads and multi-program workloads. Subsequently,
we will discuss ways of summarizing performance in a single number by averaging across multiple
benchmarks. Finally, we will briefly discuss the utility of partial metrics.
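The discussion in this section revolves around the Iron Law of Performance which, in its standard textbook form, expresses the total execution time T of a single-threaded program as

$$T = N \times CPI \times \frac{1}{f},$$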
with N the number of dynamically executed useful instructions, CPI the average number of cycles per useful instruction, and f the processor's clock frequency.
Note the wording ‘useful instruction’. This is to exclude the instructions executed along mispredicted
paths — contemporary processors employ branch prediction and speculative execution, and in case of
a branch misprediction, speculatively executed instructions are squashed from the processor pipeline
and should, therefore, not be accounted for as they don’t contribute to the amount of work done.
The utility of the Iron Law of Performance is that the terms correspond to the sources of
performance. The number of instructions N is a function of the instruction-set architecture (ISA)
and compiler; CP I is a function of the micro-architecture and circuit-level implementation; and f
is a function of circuit-level implementation and technology. Improving one of these three sources
of performance improves overall performance. Justin Rattner [161], in his PACT 2001 keynote
presentation, reported that x86 processor performance improved by over 75× over a ten-year
period, between the 1.0μ technology node (early 1990s) and the 0.18μ technology node (around
2000): 13× comes from improvements in clock frequency and 6× from micro-architecture enhancements;
the 13× improvement in frequency, in its turn, is due to improvements in both process technology and
micro-architecture (4×).
Assuming that the amount of work that needs to be done is constant, i.e., the number of
dynamically executed instructions N is fixed, and the processor clock frequency f is constant, one
can express the performance of a processor in terms of the CPI that it achieves. The lower the
CPI, the lower the total execution time, and the higher the performance. Computer architects frequently
use its reciprocal, or IPC, the average number of (useful) instructions executed per cycle. IPC is a
higher-is-better metric. The reason why IPC (and CPI) are popular performance metrics is that
they are easily quantified through architectural simulation. Assuming that the clock frequency does
not change across design alternatives, one can compare microarchitectures based on IPC.
Although IPC seems to be more widely used than CPI in architecture studies — presumably,
because it is a higher-is-better metric — CPI provides more insight. CPI is additive: one can
break up the overall CPI into so-called CPI adders [60] and display the CPI adders in a stacked bar
called the CPI stack. The base CPI is typically shown at the bottom of the CPI stack and represents
useful work done. The other CPI adders, which reflect ‘lost’ cycle opportunities due to miss events
such as branch mispredictions and cache and TLB misses, are stacked on top of each other.
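As a concrete illustration, the sketch below builds a CPI stack from hypothetical per-event cycle counts; the event categories and all numbers are made up for the example.

```python
# Minimal sketch of a CPI stack. All numbers are hypothetical; real CPI
# adders would come from a simulator or counter-based CPI accounting.

insn_count = 100_000_000  # dynamically executed (useful) instructions

cycle_components = {           # total cycles attributed to each category
    "base (useful work)":    95_000_000,
    "branch mispredictions": 20_000_000,
    "cache misses":          40_000_000,
    "TLB misses":             5_000_000,
}

total_cycles = sum(cycle_components.values())
print(f"overall CPI = {total_cycles / insn_count:.2f}, "
      f"IPC = {insn_count / total_cycles:.2f}")

# Each CPI adder is its component's cycles divided by the instruction count;
# stacked on top of each other, the adders form the CPI stack.
for name, cycles in cycle_components.items():
    print(f"  {name:22s} CPI adder = {cycles / insn_count:.2f}")
```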
The proliferation of multi-threaded and multicore processors in the last decade has intensified the need
for adequate performance metrics for multiprogram workloads. Not only do multi-threaded and
multicore processors execute multi-threaded workloads, they also execute multiple independent
programs concurrently. For example, a simultaneous multi-threading (SMT) processor [182] may
co-execute multiple independent jobs on a single processor core. Likewise, a multicore processor or
a chip-multiprocessor [151] may co-execute multiple jobs, with each job running on a separate core.
A chip-multithreading processor may co-execute multiple jobs across different cores and hardware
threads per core, e.g., Intel Core i7, IBM POWER7, Sun Niagara [108].
As the number of cores per chip increases exponentially, according to Moore’s law, it is to
be expected that more and more multiprogram workloads will run on future hardware. This is true
across the entire compute range. Users browse, access email, and process messages and calls on their
cell phones while listening to music. At the other end of the spectrum, servers and datacenters
leverage multi-threaded and multicore processors to achieve greater consolidation.
The fundamental problem that multiprogram workloads pose to performance evaluation
and analysis is that the independent co-executing programs affect each other’s performance. The
amount of performance interaction depends on the amount of resource sharing. A multicore pro-
cessor typically shares the last-level cache across the cores as well as the on-chip interconnection
network and the off-chip bandwidth to memory. Chandra et al. [24] present a simulation-based
experiment that shows that the performance of individual programs can be affected by as much as
65% due to resource sharing in the memory hierarchy of a multicore processor when co-executing
two independent programs. Tuck and Tullsen [181] present performance data measured on the Intel
Pentium 4 processor, which is an SMT processor with two hardware threads. They report that for
some programs per-program performance can be as low as 71% of the per-program performance
observed when run in isolation, whereas for other programs, per-program performance may be
comparable (within 98%) to isolated execution.
Eyerman and Eeckhout [61] take a top-down approach to come up with performance metrics
for multiprogram workloads, namely system throughput (STP) and average normalized turnaround
time (ANTT). They start from the observation that there are two major perspectives to multiprogram
performance: a user’s perspective and a system’s perspective. A user’s perspective cares about the
turnaround time for an individual job or the time it takes between submitting the job and its
completion. A system’s perspective cares about the overall system throughput or the number of
jobs completed per unit of time. Of course, both perspectives are not independent of each other. If
one optimizes for job turnaround time, one will likely also improve system throughput. Similarly,
improving a system’s throughput will also likely improve a job’s turnaround time. However, there are
cases where optimizing for one perspective may adversely impact the other perspective. For example,
optimizing system throughput by prioritizing short-running jobs over long-running jobs will have a
detrimental impact on job turnaround time, and it may even lead to starvation of long-running jobs.
We now discuss the STP and ANTT metrics in more detail, followed by a discussion on how
they compare against prevalent metrics.
$$\mathrm{ANTT} = \frac{1}{n}\sum_{i=1}^{n} NTT_i = \frac{1}{n}\sum_{i=1}^{n}\frac{T_i^{MP}}{T_i^{SP}}, \qquad (2.5)$$
ANTT is a lower-is-better metric. For the above example, the one program achieves an NTT of
1/0.75 = 1.33 and the other 1/0.5 = 2, and thus ANTT equals (1.33 + 2)/2 = 1.67, which means
that the average slowdown per program equals 1.67.
Assuming each program executes a fixed number of instructions, STP can also be expressed in terms of CPI:

$$\mathrm{STP} = \sum_{i=1}^{n}\frac{CPI_i^{SP}}{CPI_i^{MP}}, \qquad (2.6)$$

with CPI_i^SP and CPI_i^MP the CPI under single-program and multiprogram execution, respectively.
ANTT can be computed as

$$\mathrm{ANTT} = \frac{1}{n}\sum_{i=1}^{n}\frac{CPI_i^{MP}}{CPI_i^{SP}}. \qquad (2.7)$$
IPC throughput. IPC throughput is defined as the sum of the IPCs of the co-executing programs:
$$IPC\_throughput = \sum_{i=1}^{n} IPC_i. \qquad (2.8)$$
IPC throughput naively reflects a computer architect’s view on throughput; however, it doesn’t have a
meaning in terms of performance from either a user perspective or a system perspective. In particular,
one could optimize a system’s IPC throughput by favoring high-IPC programs; however, this may not
necessarily reflect improvements in system-level performance (job turnaround time and/or system
throughput). Therefore, it should not be used as a multiprogram performance metric.
Weighted speedup. Snavely and Tullsen [174] propose weighted speedup to evaluate how well jobs
co-execute on a multi-threaded processor. Weighted speedup is defined as
$$\mathrm{weighted\_speedup} = \sum_{i=1}^{n}\frac{IPC_i^{MP}}{IPC_i^{SP}}. \qquad (2.9)$$
The motivation by Snavely and Tullsen for using IPC as the basis for the speedup metric is that
if one job schedule executes more instructions than another in the same time interval, it is more
symbiotic and, therefore, yields better performance; the weighted speedup metric then equalizes
the contribution of each program in the job mix by normalizing its multiprogram IPC with its
single-program IPC.
From the above, it follows that weighted speedup equals system throughput (STP) and, in fact,
has a physical meaning — it relates to the number of jobs completed per unit of time — although
this may not be immediately obvious from weighted speedup’s definition and its original motivation.
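A minimal sketch of both metrics, computed from per-program execution times; the inputs reproduce the hypothetical 75%/50% progress example used above, and STP is computed in its weighted-speedup (sum of normalized progress) form.

```python
# Hypothetical per-program execution times (same unit) when run alone (SP)
# and when co-running in a two-program mix (MP).
T_sp = [1.0, 1.0]
T_mp = [1.0 / 0.75, 1.0 / 0.5]   # programs progress at 75% and 50% of their isolated rate

n = len(T_sp)

# STP (weighted speedup): sum of normalized progress rates, higher is better.
stp = sum(sp / mp for sp, mp in zip(T_sp, T_mp))

# ANTT: average normalized turnaround time (average slowdown), lower is better.
antt = sum(mp / sp for sp, mp in zip(T_sp, T_mp)) / n

print(f"STP  = {stp:.2f}")    # 0.75 + 0.50 = 1.25 jobs' worth of work per unit time
print(f"ANTT = {antt:.2f}")   # (1.33 + 2.00) / 2 = 1.67 average slowdown per program
```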
Harmonic mean. Luo et al. [129] propose the harmonic mean metric, or hmean for short, which
computes the harmonic mean rather than an arithmetic mean (as done by weighted speedup) across
the IPC speedup numbers:
$$\mathrm{hmean} = \frac{n}{\sum_{i=1}^{n}\frac{IPC_i^{SP}}{IPC_i^{MP}}}. \qquad (2.10)$$
The motivation by Luo et al. for computing the harmonic mean is that it tends to result in lower
values than the arithmetic average if one or more programs have a lower IPC speedup, which they
argue better captures the notion of fairness than weighted speedup. The motivation is based solely
on properties of the harmonic and arithmetic means and does not reflect any system-level meaning.
It follows from the above that the hmean metric is the reciprocal of the ANTT metric, and hence
it has a system-level meaning, namely, it relates to (the reciprocal of) the average job's normalized
turnaround time.
If on the other hand, B is weighed equally among the benchmarks, then the arithmetic mean is
meaningful:
$$\frac{\sum_{i=1}^{n} A_i}{\sum_{i=1}^{n} B_i} = \frac{\sum_{i=1}^{n} A_i}{n \cdot B} = \frac{1}{n}\sum_{i=1}^{n}\frac{A_i}{B} = \frac{1}{n}\sum_{i=1}^{n}\frac{A_i}{B_i} = AM(A_i/B_i). \qquad (2.12)$$
We refer to John [93] for a more extensive description, including a discussion on how to weigh the
different benchmarks.
Hence, depending on the performance metric and how the metric was computed, one has to
choose either the harmonic or the arithmetic mean. For example, assume we have selected a 100M-
instruction sample for each benchmark in our benchmark suite. The average IPC (instructions
executed per cycle) needs to be computed as the harmonic mean across the IPC numbers for the
individual benchmarks because the instruction count is constant across the benchmarks. The same
applies to computing MIPS, or millions of instructions executed per second. Conversely, the average CPI
(cycles per instruction) needs to be computed as the arithmetic mean across the individual CPI
numbers. Similarly, the arithmetic average also applies for TPI (time per instruction).
The choice for harmonic versus arithmetic mean also depends on the experimenter’s perspec-
tive. For example, when computing average speedup (execution time on original system divided by
execution time on improved system) over a set of benchmarks, one needs to use the harmonic mean
if the relative duration of the benchmarks is irrelevant, or, more precisely, if the experimenter weighs
the time spent in the original system for each of the benchmarks equally. If on the other hand, the
experimenter weighs the duration for the individual benchmarks on the enhanced system equally,
or if one expects a workload in which each program will run for an equal amount of time on the
enhanced system, then the arithmetic mean is appropriate.
Weighted harmonic and weighted arithmetic means can be used if one knows a priori which
applications will be run on the target system and in what percentages — this may be the case in
some embedded systems. Assigning weights to the applications proportional to their percentage of
execution will provide an accurate performance assessment.
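Taking a statistical viewpoint, the geometric mean over a set of values x_i (per-benchmark speedups, say) is defined as

$$GM(x) = \left(\prod_{i=1}^{n} x_i\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n}\ln(x_i)\right).$$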
In this formula, it is assumed that x is log-normally distributed, or, in other words, ln(x) is normally
distributed. The average of a normal distribution is the arithmetic mean; hence, the exponential of
the arithmetic mean over ln(xi ) computes the average for x (see the right-hand side of the above
formula). This equals the definition of the geometric mean (see the left-hand side of the formula).
The geometric mean has an appealing property. One can compute the average speedup be-
tween two machines by dividing the average speedups for these two machines relative to some
reference machine. In particular, having the average speedup numbers of machines A and B relative
to some reference machine R, one can compute the average relative speedup between A and B by
simply dividing the former speedup numbers. SPEC CPU uses the geometric mean for computing
average SPEC rates, and the speedups are computed against some reference machine, namely a Sun
Ultra5_10 workstation with a 300MHz SPARC processor and 256MB main memory.
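A small sketch of this ratio property, with made-up speedup numbers relative to a reference machine R:

```python
from math import prod

def gmean(xs):
    """Geometric mean: the n-th root of the product of the values."""
    return prod(xs) ** (1.0 / len(xs))

# Hypothetical speedups of machines A and B over a reference machine R,
# measured on the same four benchmarks.
speedup_A_over_R = [2.0, 3.0, 1.5, 4.0]
speedup_B_over_R = [1.0, 2.0, 3.0, 2.0]

# Geometric mean of the per-benchmark speedups of A over B ...
gm_direct = gmean([a / b for a, b in zip(speedup_A_over_R, speedup_B_over_R)])
# ... equals the ratio of the two geometric means relative to R.
gm_ratio = gmean(speedup_A_over_R) / gmean(speedup_B_over_R)

print(f"{gm_direct:.4f} == {gm_ratio:.4f}")
```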
The geometric mean builds on two assumptions. For one, it assumes that the benchmarks
are representative for the much broader workload space. A representative set of benchmarks can
be obtained by randomly choosing benchmarks from the population, provided a sufficiently large
number of benchmarks are taken. Unfortunately, the set of benchmarks is typically not chosen
randomly from a well-defined workload space; instead, a benchmark suite is a collection of interesting
benchmarks covering an application domain of interest picked by a committee, an individual or a
marketing organization. In other words, and as argued in the introduction, picking a set of workloads
is subject to the experimenter’s judgment and experience. Second, the geometric mean assumes that
the speedups are distributed following a log-normal distribution. These assumptions have never been
validated (and it is not clear how they can ever be validated), so it is unclear whether these assumptions
hold true. Hence, given the high degree of uncertainty regarding the required assumptions, using
the geometric mean for computing average performance numbers across a set of benchmarks is of
questionable value.
CHAPTER 3
Workload Design
Workload design is an important step in computer architecture research, as already described in
the introduction: subsequent steps in the design process are subject to the selection process of
benchmarks, and choosing a non-representative set of benchmarks may lead to biased observations,
incorrect conclusions, and, eventually, designs that poorly match their target workloads.
For example, Maynard et al. [137] as well as Keeton et al. [105] compare the behavior of
commercial applications, including database servers, against the SPEC CPU benchmarks that are
widely used in computer architecture. They find that commercial workloads typically exhibit more
complex branch behavior, larger code and data footprints, and more OS as well as I/O activity.
In particular, the instruction cache footprint of the SPEC CPU benchmarks is small compared
to commercial workloads; also, memory access patterns for footprints that do not fit in on-chip
caches are typically regular or strided. Hence, SPEC CPU benchmarks are well suited for pipeline
studies, but they should be used with care for memory performance studies. Guiding processor
design decisions based on the SPEC CPU benchmarks only may lead to suboptimal performance
for commercial workloads.
Figure 3.1: Introducing terminology: workload space, reference workload and reduced workload.
[Figure: the general PCA-based framework — (1) characterize the reference workload of n programs using p program characteristics; (2) apply PCA to reduce the p characteristics to q principal components per program.]
Hardware performance monitors. One way of characterizing workload behavior is to employ hard-
ware performance monitors — in fact, hardware performance monitors are widely used (if not prevail-
ing) in workload characterization because they are available on virtually all modern microprocessors,
can measure a wide range of events, are easy to use, and allow for characterizing long-running com-
plex workloads that are not easily simulated (i.e., the overhead is virtually zero). The events measured
using hardware performance monitors are typically instruction mix (e.g., percentage loads, branches,
floating-point operations, etc.), IPC (number of instructions retired per cycle), cache miss rates,
branch mispredict rates, etc.
In spite of its widespread use, there is a pitfall in using hardware performance monitors: they
can be misleading in the sense that they can conceal the workload’s inherent behavior. This is to say
that different inherent workload behavior can lead to similar behavior when measured using hardware
performance monitors. As a result, based on a characterization study using hardware performance
monitors one may conclude that different benchmarks exhibit similar behavior because they show
similar cache miss rates, IPC, branch mispredict rates, etc.; however, a more detailed analysis based on
a microarchitecture-independent characterization (as described next) may show that these benchmarks
in fact exhibit different inherent behavior.
Hoste and Eeckhout [85] present data that illustrates exactly this pitfall, see also Table 3.1 for
an excerpt. The two benchmarks, gzip with the graphic input from the SPEC CPU2000 benchmark
suite and fasta from the BioPerf benchmark suite, exhibit similar behavior in terms of CPI and cache
miss rates, as measured using hardware performance monitors. However, the working set size and
memory access patterns are shown to be very different. The data working set size is an order of
magnitude bigger for gzip compared to fasta, and the memory access patterns seem to be fairly
different as well between these workloads.
The notion of microarchitecture-independent versus microarchitecture-dependent character-
ization also appears in sampled simulation, see Chapter 6.
Hardware performance monitor data across multiple machines. One way for alleviating this pitfall
is to characterize the workload on a multitude of machines and architectures. Rather than collecting
hardware performance monitor data on a single machine, collecting data across many different
machines is likely to yield a more comprehensive and more informative workload characterization
because different machines and architectures are likely to stress the workload behavior (slightly)
differently. As a result, an inherent behavioral difference between benchmarks is likely to show up
on at least one of a few different machines.
Phansalkar et al. [160] describe an experiment in which they characterize the SPEC CPU2006
benchmark suite on five different machines with four different ISAs and compilers (IBM Power,
Sun UltraSPARC, Itanium and x86). They use this multi-machine characterization as input for
the PCA-based workload analysis method, and then study the diversity among the benchmarks in
the SPEC CPU2006 benchmark suite. This approach was used by SPEC for the development of
the CPU2006 benchmark suite [162]: the multi-machine workload characterization approach was
used to understand the diversity and similarity among the benchmarks for potential inclusion in the
CPU2006 benchmark suite.
Detailed simulation. One could also rely on detailed cycle-accurate simulation for collecting pro-
gram characteristics in a way similar to using hardware performance monitors. The main disadvantage
is that it is extremely time-consuming to simulate industry-standard benchmarks in a cycle-accurate
manner — cycle-accurate simulation is at least five orders of magnitude slower than native hardware
execution. The benefit though is that simulation enables collecting characteristics on a range of
machine configurations that are possibly not (yet) available.
3.2.5 APPLICATIONS
The PCA-based methodology enables various applications in workload characterization.
Workload analysis. Given the limited number of principal components, one can visualize the work-
load space, as illustrated in Figure 3.4, which shows the PCA space for a set of SPEC CPU95
benchmarks along with TPC-D running on the postgres DBMS. (This graph is based on the data
presented by Eeckhout et al. [53] and represents old and obsolete data — both CPU95 and TPC-D
are obsolete — nevertheless, it illustrates various aspects of the methodology.) The graphs show
the various benchmarks as dots in the space spanned by the first and second principal components,
and third and fourth principal components, respectively. Collectively, these principal components
capture close to 90% of the total variance, and thus they provide an accurate picture of the workload
space. The different colors denote different benchmarks; the different dots per benchmark denote
different inputs.
By interpreting the principal components, one can reason about how benchmarks differ from
each other in terms of their execution behavior. The first principal component primarily quantifies
a benchmark’s control flow behavior: benchmarks with relatively few dynamically executed branch
instructions and relatively low I-cache miss rates show up with a high first principal component. One
example benchmark with a high first principal component is ijpeg. Benchmarks with high levels of
ILP and poor branch predictability have a high second principal component, see for example go
and compress. The third and fourth principal components primarily capture D-cache behavior and the instruction mix,
respectively.
Several interesting observations can be made from these plots. Some benchmarks exhibit
execution behavior that is different from the other benchmarks in the workload. For example, ijpeg,
go and compress seem to be isolated in the workload space and seem to be relatively dissimilar from
the other benchmarks. Also, the inputs given to the benchmark may have a big impact for some
benchmarks, e.g., TPC-D, whereas for other benchmarks the execution behavior is barely affected by
its input, e.g., ijpeg. The different inputs (queries) are scattered around for TPC-D; hence, different
inputs seem to lead to fairly dissimilar behavior; for ijpeg, on the other hand, the inputs seem to be
clustered and seem to have limited effect on the program's execution behavior. Finally, it
also suggests that this set of benchmarks only partially covers the workload space. In particular, a
significant part of the workload space does not seem to be covered by the set of benchmarks, as
illustrated in Figure 3.5.
Workload reduction. By applying cluster analysis after PCA, one can group the various benchmarks
into a limited number of clusters based on their behavior, i.e., benchmarks with similar execution
behavior are grouped in the same cluster. The benchmark closest to the cluster’s centroid can then
serve as the representative for the cluster. By doing so, one can reduce the workload to a limited set
of representative benchmarks — also referred to as benchmark subsetting. Phansalkar et al. [160]
present results for SPEC CPU2006, and they report average prediction errors of 3.8% and 7%
for the subset compared to the full integer and floating-point benchmark suite, respectively, across
five commercial processors. Table 3.3 summarizes the subsets for the integer and floating-point
benchmarks.
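A minimal sketch of this PCA-plus-clustering flow using scikit-learn, assuming a matrix with one row of (microarchitecture-independent) characteristics per benchmark; the data here are random stand-ins, and K-means stands in for the cluster analysis step.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))   # 30 benchmarks x 20 program characteristics (stand-in data)

# (1) standardize the characteristics, (2) PCA down to a handful of components.
Xs = StandardScaler().fit_transform(X)
pcs = PCA(n_components=4).fit_transform(Xs)

# (3) cluster benchmarks in the PCA space and (4) pick, per cluster, the
# benchmark closest to the cluster centroid as its representative.
k = 8
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pcs)
representatives = []
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(pcs[members] - km.cluster_centers_[c], axis=1)
    representatives.append(int(members[np.argmin(dists)]))

print("representative benchmark indices:", sorted(representatives))
```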
Figure 3.4: Example PCA space as a function of the first four principal components: the first and second
principal components are shown in the top graph, and the third and fourth principal components are
shown in the bottom graph.
Figure 3.5: The PCA-based workload analysis method allows for finding regions (weak spots) in the
workload space that are not covered by a benchmark suite. Weak spots are shaded in the graph.
Other applications. The PCA-based methodology has been used for various other or related pur-
poses, including evaluating the DaCapo benchmark suite [18], analyzing workload behavior over
time [51], studying the interaction between the Java application, its input and the Java virtual machine
( JVM) [48], and evaluating the representativeness of (reduced) inputs [52].
(with foldover) involves 2c simulations to quantify the effect of c microarchitecture parameters and
all pairwise interactions. The outcome of the PB experiment is a ranking of the most significant
microarchitecture performance bottlenecks. Although the primary motivation for Yi et al. to propose
the Plackett and Burman design was to explore the microarchitecture design space in a simulation-
friendly manner, it also has important applications in workload characterization. The ranking of
performance bottlenecks provides a unique signature that characterizes a benchmark in terms of
how it stresses the microarchitecture. By comparing bottleneck rankings across benchmarks one can
derive how (dis)similar the benchmarks are.
The Plackett and Burman design uses a design matrix — Table 3.4 shows an example design
matrix with foldover — Yi et al. and the original paper by Plackett and Burman provide design
matrices of various dimensions. A row in the design matrix corresponds to a microarchitecture con-
figuration that needs to be simulated; each column denotes a different microarchitecture parameter.
A ‘+1’ and ‘−1’ value represents a high and low — or on and off — value for a parameter. For
example, a high and low value could be a processor width of 8 and 2, respectively; or with aggressive
hardware prefetching and without prefetching, respectively. It is advised that the high and low val-
ues be just outside of the normal or expected range of values in order to take into account the full
potential impact of a parameter. The way the parameter high and low values are chosen may lead to
microarchitecture configurations that are technically unrealistic or even infeasible. In other words,
the various microarchitecture configurations in a Plackett and Burman experiment are corner cases
in the microarchitecture design space.
The next step in the procedure is to simulate these microarchitecture configurations, collect
performance numbers, and calculate the effect that each parameter has on the variation observed
in the performance numbers. The latter is done by multiplying the performance number for each
configuration with that parameter's value (+1 or −1) and by, subsequently, adding these products across all
configurations. For example, the effect of parameter A is computed as

$$\mathrm{effect}_A = \sum_{j} a_j \cdot P_j,$$

with a_j the value (+1 or −1) of parameter A in configuration j and P_j the performance measured for configuration j.
Similarly, one can compute pairwise interaction effects by multiplying the performance number for
each configuration with the product of the two parameters' values, and adding these products across all
configurations. For example, the interaction effect between A and C is computed as

$$\mathrm{effect}_{A \cdot C} = \sum_{j} (a_j \cdot c_j) \cdot P_j.$$
After having computed the effect of each parameter, the effects (including the interaction effects)
can be ordered to determine their relative impact. The sign of the effect is meaningless, only the
magnitude is. An effect with a higher ranking is more of a performance bottleneck than a lower
ranked effect. For the example data in Table 3.4, the most significant parameter is B (with an effect
of -129).
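A sketch of this effect computation is shown below, using a small made-up design matrix and performance numbers; the matrix is for illustration only and is not an actual Plackett and Burman design, nor the data behind Table 3.4.

```python
import numpy as np

# Rows = microarchitecture configurations, columns = parameters A..D,
# entries are the +1/-1 (high/low) settings. Made-up 8-run example.
design = np.array([
    [+1, +1, +1, -1],
    [-1, +1, +1, +1],
    [-1, -1, +1, +1],
    [+1, -1, -1, +1],
    [-1, +1, -1, -1],
    [+1, -1, +1, -1],
    [+1, +1, -1, +1],
    [-1, -1, -1, -1],
])
perf = np.array([9.2, 4.1, 3.9, 8.8, 4.4, 9.0, 8.6, 4.0])  # e.g., IPC per configuration

params = "ABCD"
# Main effect of each parameter: sum over configurations of (setting x performance).
effects = {p: float(design[:, i] @ perf) for i, p in enumerate(params)}
# Pairwise interaction effect: use the element-wise product of the two columns.
effects["A*C"] = float((design[:, 0] * design[:, 2]) @ perf)

# Rank bottlenecks by the magnitude of their effect (the sign carries no meaning).
for name, e in sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:3s} effect = {e:+.2f}")
```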
By running a Plackett and Burman experiment on a variety of benchmarks, one can com-
pare the benchmarks against each other. In particular, the Plackett and Burman experiment yields a
ranking of the most significant performance bottlenecks. Comparing these rankings across bench-
marks provides a way to assess whether the benchmarks stress similar performance bottlenecks, i.e.,
if the top N ranked performance bottlenecks and their relative ranking is about the same for two
benchmarks, one can conclude that both benchmarks exhibit similar behavior.
Yi et al. [197] compare the PCA-based and PB-based methodologies against each other.
The end conclusion is that both methodologies are equally accurate in terms of how well they can
identify a reduced workload. Both methods can reduce the size of the workload by a factor of 3 while
incurring an error (difference in IPC for the reduced workload compared to the reference workload
across a number of processor architectures) of less than 5%. In terms of computational efficiency,
the PCA-based method was found to be more efficient than the PB-based approach. Collecting the
PCA program characteristics was done more efficiently than running the detailed cycle-accurate
simulations needed for the PB method.
CHAPTER 4
Analytical Performance
Modeling
Analytical performance modeling is an important performance evaluation method that has gained
increased interest over the past few years. In comparison to the prevalent approach of simulation
(which we will discuss in subsequent chapters), analytical modeling may be less accurate, yet it
is multiple orders of magnitude faster than simulation: a performance estimate is obtained almost
instantaneously — it is a matter of computing a limited number of formulas which is done in seconds
or minutes at most. Simulation, on the other hand, can easily take hours, days, or even weeks.
Because of its great speed advantage, analytical modeling enables exploring large design spaces
very quickly, which makes it a useful tool in early stages of the design cycle and even allows for
exploring very large design spaces that are intractable to explore through simulation. In other words,
analytical modeling can be used to quickly identify a region of interest that is later explored in
more detail through simulation. One example that illustrates the power of analytical modeling for
exploring large design spaces is a study done by Lee and Brooks [120], which explores the potential
of adaptive microarchitectures while varying both the adaptability of the microarchitecture and the
time granularity for adaptation — this is a study that would have been infeasible through detailed
cycle-accurate simulation.
In addition, analytical modeling provides more fundamental insight. Although simulation
provides valuable insight as well, it requires many simulations to understand performance sensitivity
to design parameters. In contrast, the sensitivity may be apparent from the formula itself in analytical
modeling. As an example, Hill and Marty extend Amdahl’s law towards multicore processors [82].
They augment Amdahl’s law with a simple hardware cost model, and they explore the impact of
symmetric (homogeneous), asymmetric (heterogeneous) and dynamic multicore processing. In spite
of its simplicity, it provides fundamental insight and reveals various important consequences for the
multicore era.
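To give a flavor of this kind of model, the sketch below evaluates the symmetric-multicore case of Hill and Marty's extension of Amdahl's law, assuming a chip budget of n base-core equivalents (BCEs), r BCEs per core, and single-core performance perf(r) = sqrt(r); the parameter values swept here are purely illustrative.

```python
from math import sqrt

def symmetric_speedup(f, n, r):
    """Symmetric multicore speedup under an Amdahl's-law style model.

    f: parallel fraction of the workload
    n: total chip budget in base-core equivalents (BCEs)
    r: BCEs spent per core (so the chip has n // r cores)
    Assumes perf(r) = sqrt(r) for a single core built from r BCEs.
    """
    perf_r = sqrt(r)
    cores = n // r
    sequential = (1.0 - f) / perf_r          # serial part runs on one big core
    parallel = f / (perf_r * cores)          # parallel part spread over all cores
    return 1.0 / (sequential + parallel)

# Sweep the core size for a 256-BCE chip and two parallel fractions.
for f in (0.9, 0.99):
    best = max(range(1, 257),
               key=lambda r: symmetric_speedup(f, 256, r) if 256 % r == 0 else 0)
    print(f"f={f}: best core size r={best}, speedup={symmetric_speedup(f, 256, best):.1f}")
```

Even this few-line model exposes the trade-off between a few large cores and many small ones as the parallel fraction changes.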
Figure 4.1: Linear regression: finding the best linear fit through a set of data points.
$$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \epsilon, \qquad (4.1)$$

with y the dependent variable (also called the response variable), xi the independent variables (also
called the input variables), and ε the error term due to lack of fit. β0 is the intercept with the y-axis and
the βi coefficients are the regression coefficients. The βi coefficients represent the expected change
in the response variable y per unit of change in the input variable xi ; in other words, a regression
coefficient represents the significance of its respective input variable. A linear regression model
could potentially relate performance (response variable) to a set of microarchitecture parameters
(input variables); the latter could be processor width, pipeline depth, cache size, cache latency, etc.
In other words, linear regression tries to find the best possible linear fit for a number of data points,
as illustrated in Figure 4.1.
This simple linear regression model assumes that the input variables are independent of each
other, i.e., the effect of variable xi on the response y does not depend on the value of xj, j ≠ i. In
many cases, this is not an accurate assumption, especially in computer architecture. For example,
the effect on performance of making the processor pipeline deeper depends on the configuration
of the memory hierarchy. A more aggressive memory hierarchy reduces cache miss rates, which
reduces average memory access times and increases pipelining advantages. Therefore, it is possible
to consider interaction terms in the regression model:
$$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \beta_{i,j}\, x_i x_j + \epsilon. \qquad (4.2)$$
This particular regression model only includes so-called two-factor interactions, i.e., pairwise in-
teractions between two input variables only; however, this can be trivially extended towards higher
order interactions.
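A minimal sketch of fitting such a model with ordinary least squares in numpy; the "simulation results" here are synthetic stand-ins rather than output of a real simulator.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Synthetic "simulation" data: 3 normalized microarchitecture parameters
# (e.g., pipeline depth, ROB size, L2 size) and an IPC-like response.
X = rng.uniform(-1, 1, size=(200, 3))
y = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.4 * X[:, 0] * X[:, 1] + rng.normal(0, 0.05, 200)

# Design matrix: intercept, main effects, and all two-factor interactions.
cols = [np.ones(len(X))] + [X[:, i] for i in range(3)]
cols += [X[:, i] * X[:, j] for i, j in combinations(range(3), 2)]
A = np.column_stack(cols)

# Least-squares fit; beta holds the estimated regression coefficients.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
names = ["intercept", "x1", "x2", "x3", "x1*x2", "x1*x3", "x2*x3"]
for name, b in zip(names, beta):
    print(f"{name:10s} {b:+.3f}")
```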
The goal for applying regression modeling to performance modeling is to understand the
effect of the important microarchitecture parameters and their interactions on overall processor
performance. Joseph et al. [97] present such an approach and select a number of microarchitecture
parameters such as pipeline depth, processor width, reorder buffer size, cache sizes and latencies,
etc., along with a selected number of interactions. They then run a number of simulations while
varying the microarchitecture parameters and fit the simulation results to the regression model. The
method of least squares is commonly used to find the best fitting model that minimizes the sum
of squared deviations between the predicted response variable (through the model) and observed
response variable (through simulation). The fitting is done such that the error term is as small as
possible. The end result is an estimate for each of the regression coefficients. The magnitude and
sign of the regression coefficients represent the relative importance and impact of the respective
microarchitecture parameters on overall performance.
There are a number of issues one has to deal with when building a regression model. For one,
the architect needs to select the set of microarchitecture input parameters, which has an impact on
both accuracy and the number of simulations needed to build the model. Insignificant parameters
only increase model building time without contributing to accuracy. On the other hand, crucial
parameters that are not included in the model will likely lead to an inaccurate model. Second, the
value ranges that need to be set for each variable during model construction depend on the purpose
of the experiment. Typically, for design space exploration purposes, it is advised to take values that
are slightly outside the expected parameter range — this is to cover the design space well [196].
Table 4.1 shows the most significant variables and interactions obtained in one of the experi-
ments done by Joseph et al. They consider six microarchitecture parameters and their interactions as
the input variables, and they consider IPC as their response performance metric. Some parameters
and interactions are clearly more significant than others, i.e., their respective regression coefficients
have a large magnitude: pipeline depth and reorder buffer size as well as its interaction are significant,
much more significant than L2 cache size. As illustrated in this case study, the regression coeffi-
cients can be both positive and negative, which complicates gaining insight. In particular, Table 4.1
suggests that IPC decreases with increasing reorder buffer size because the regression coefficient is
negative. This obviously makes no sense. The negative regression coefficient is compensated for by
the positive interaction terms between reorder buffer size and pipeline depth, and reorder buffer size
and issue buffer size. In other words, increasing the reorder buffer size will increase the interaction
terms more than the individual variable so that IPC would indeed increase with reorder buffer size.
Figure 4.2: Spline-based regression: an example spline-based fit through a set of data points.
with each S_i(x) a polynomial. Higher-order polynomials typically lead to better fits. Lee and Brooks
use cubic splines which have the nice property that the resulting curve is smooth because the first and
second derivatives of the function are forced to agree at the knots. Restricted cubic splines constrain
the function to be linear in the tails; see Figure 4.2 for an example restricted cubic spline. Lee and
Brooks successfully leverage spline-based regression modeling to build multiprocessor performance
models [122], characterize the roughness of the architecture design space [123], and explore the
huge design space of adaptive processors [120].
Figure 4.3: Neural networks: (a) architecture of a fully connected feed-forward network, and (b) archi-
tecture for an individual node.
taking a small step in the direction of steepest decrease in error. This is typically done through a
well-known procedure called backpropagation.
A limitation of empirical modeling, for both neural networks and regression modeling, is that it requires a number of simulations to infer the model. This number typically ranges from a couple of hundred to a few thousand simulations. Although running these simulations is time consuming, it is a one-time cost. Once the simulations are run and the model is built, performance predictions can be made almost instantaneously.
Figure 4.4: Interval behavior: (a) overall execution can be split up in intervals; (b) an interval consists of
a base part where useful work gets done and a penalty part.
and dispatch is resumed. The front-end pipeline re-fill time is the same as the drain time — they
offset each other. Hence, the penalty for an I-cache (and I-TLB) miss is its miss delay.
performance penalty equals the branch resolution time, i.e., the time between the mispredicted
branch entering the window and the branch being resolved, plus the front-end pipeline depth.
Eyerman et al. [65] found that the mispredicted branch often is the last instruction to be executed;
and hence, the branch resolution time can be approximated by the ‘window drain time’, or the number
of cycles needed to empty a reorder buffer with a given number of instructions. For many programs,
the branch resolution time is the main contributor to the overall branch misprediction penalty (not
the pipeline re-fill time). And this branch resolution time is a function of the dependence structure
of the instructions in the window, i.e., the longer the dependence chain and the execution latency
of the instructions leading to the mispredicted branch, the longer the branch resolution time [65].
Figure 4.7: ILP model: window cannot slide any faster than determined by the critical path.
[Figure: timeline of an isolated long-latency load miss: after the load miss is dispatched and issued, the reorder buffer and/or issue queue fills up, the effective dispatch rate drops to zero, and the resulting penalty approximately equals the miss delay.]
The ILP model is only one example illustrating the utility of Little’s law — Little’s law is
widely applicable in systems research and computer architecture. It can be applied as long as the
three parameters (throughput, number of elements in the system, latency of each element) are long-
term (steady-state) averages of a stable system. There are multiple examples of how one could use
Little’s law in computer architecture. One such example relates to computing the number of physical
registers needed in an out-of-order processor. Knowing the target IPC and the average time between
acquiring and releasing a physical register, one can compute the required number of physical registers.
Another example relates to computing the average latency of a packet in a network. Tracking the
latency for each packet may be complex to implement in an FPGA-based simulator in an efficient
way. However, Little’s law offers an easy solution: it suffices to count the number of packets in the
network and the injection rate during steady-state, and compute the average packet latency using
Little’s law.
[Figure 4.9: two independent long-latency load misses dispatched within the same reorder buffer window: their miss delays overlap, so the combined penalty is close to that of a single isolated miss.]
instructions are dispatched under the long-latency miss. These useful instructions are dispatched
between the time the long-latency load dispatches and the time the ROB blocks after the long-
latency load reaches its head; this is the time it takes to fill the entire ROB minus the time it takes
for the load to issue after it has been dispatched — this is the amount of useful work done underneath
the memory access. Given that this is typically much smaller than the memory access latency, the
penalty for an isolated miss is assumed to equal the memory access latency.
For multiple long back-end misses that are independent of each other and that make it into the reorder buffer at the same time, the penalties overlap [31; 76; 102; 103]; this is referred to as memory-level parallelism (MLP). This is illustrated in Figure 4.9. After the first load receives its data
and unblocks the ROB, S more instructions dispatch before the ROB blocks for the second load, and
the time to do so, S/D, offsets an equal amount of the second load’s miss penalty. This generalizes to
any number of overlapping misses, so the penalty for a burst of independent long-latency back-end
misses equals the penalty for an isolated long-latency load.
C = \sum_{k} \frac{N_k}{D}
  + m_{iL1} \cdot c_{iL1} + m_{iL2} \cdot c_{L2}
  + m_{br} \cdot (c_{dr} + c_{fe})
  + m^{*}_{dL2}(W) \cdot c_{L2}.     (4.5)
The various parameters in the model are summarized in Table 4.2. The first line of the model
computes the total number of cycles needed to dispatch all the intervals. Note there is an inherent
dispatch inefficiency because the interval length N_k is not always an integer multiple of the processor dispatch width D, i.e., fewer instructions may be dispatched at the tail end of an interval than the designed dispatch width, simply because there are too few instructions left in the interval to fill the entire width of the processor's front-end pipeline. The subsequent lines in Equation 4.5 represent
I-cache misses, branch mispredictions and long back-end misses, respectively. (The TLB misses
are not shown here to increase the formula’s readability.) The I-cache miss cycle component is the
number of I-cache misses times their penalty. The branch misprediction cycle component equals
the number of mispredicted branches times their penalty, the window drain time plus the front-end
pipeline depth. Finally, the long back-end miss cycle component is computed as the number of
non-overlapping misses times the memory access latency.
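As a rough, purely illustrative evaluation of Equation 4.5 (the instruction count, miss counts and penalties below are invented, and the first term approximates the per-interval sum of N_k/D by N/D, i.e., it ignores the dispatch inefficiency at interval boundaries), a small C sketch:

#include <stdio.h>

int main(void) {
    double N     = 100e6;               /* total dynamic instruction count            */
    double D     = 4;                   /* dispatch width                             */
    double m_iL1 = 1.0e6, c_iL1 = 10;   /* L1 I-cache misses and their penalty        */
    double m_iL2 = 0.1e6, c_L2  = 200;  /* L2 instruction misses; memory access time  */
    double m_br  = 0.5e6, c_dr  = 12, c_fe = 5;  /* mispredictions: drain + front-end */
    double m_dL2 = 0.8e6;               /* non-overlapping long-latency data misses   */

    double C = N / D                            /* dispatch of all intervals (approx.) */
             + m_iL1 * c_iL1 + m_iL2 * c_L2     /* I-cache miss component              */
             + m_br * (c_dr + c_fe)             /* branch misprediction component      */
             + m_dL2 * c_L2;                    /* long back-end miss component        */
    printf("estimated cycles: %.0f, CPI: %.3f\n", C, C / N);
    return 0;
}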
CHAPTER 5
Simulation
Simulation is the prevalent and de facto performance evaluation method in computer architecture.
There are several reasons for its widespread use. Analytical models, in spite of the fact that they are
extremely fast to evaluate and in spite of the deep insight that they provide, incur too much inaccuracy
for many of the design decisions that an architect needs to make. One could argue that analytical
modeling is valuable for making high-level design decisions and identifying regions of interest in the
huge design space. However, small performance variations across design alternatives are harder to
evaluate using analytical models. At the other end of the spectrum, hardware prototypes, although
they are extremely accurate, are too time-consuming and costly to develop.
A simulator is a software performance model of a processor architecture. The processor ar-
chitecture that is modeled in the simulator is called the target architecture; running the simulator
on a host architecture, i.e., a physical machine, then yields performance results. Simulation has the
important advantage that development is relatively cheap compared to building hardware proto-
types, and it is typically much more accurate than analytical models. Moreover, the simulator is
flexible and easily parameterizable which allows for exploring the architecture design space — a
property of primary importance to computer architects designing a microprocessor and researchers
evaluating a novel idea. For example, evaluating the impact of cache size, latency, processor width,
or branch predictor configuration is easily done through parameterization: by changing some of the simulator's parameters and running simulations with a variety of benchmarks, one can evaluate the impact of an architecture feature. Simulation even enables evaluating (very) different
architectures than the ones in use today.
Figure 5.1: Simulation diamond illustrates the trade-offs in simulator accuracy, coverage, development
time and evaluation time.
general) can be characterized along these four dimensions. These dimensions are not independent
of each other, and, in fact, are contradictory. For example, more faithful modeling with respect to
real hardware by modeling additional features, i.e., increasing the simulator’s coverage, is going to
increase accuracy, but it is also likely to increase the simulator’s development and evaluation time —
the simulator will be more complex to build, and because of its increased complexity, it will also run
slower, and thus simulation will take longer. In contrast, a simulator that only models a component
of the entire system, e.g., a branch predictor or cache, has limited coverage with respect to the entire
system; nevertheless, it is extremely valuable because its accuracy is good for the component under
study while being relatively simple (limited development time) and fast (short evaluation time).
The following sections describe several commonly used simulation techniques in the computer
architect’s toolbox, each representing a different trade-off in accuracy, coverage, development time
and evaluation time. We will refer to Table 5.1 throughout the remainder of this chapter; it summarizes
the different simulation techniques along the four dimensions.
sign rather than evaluating its performance characteristics. Consequently, the accuracy and coverage
with respect to performance and implementation detail are not applicable. However, development
time is rated as excellent because a functional simulator is usually already present at the time a hard-
ware development project is undertaken (unless the processor implements a brand new instruction
set). Functional simulators have a very long lifetime that can span many development projects. Eval-
uation time is good because no microarchitecture features need to be modeled. Example functional
simulator are SimpleScalar’s sim-safe and sim-fast [7].
From a computer architect’s perspective, functional simulation is most useful because it can
generate instruction and address traces. A trace is the functionally correct sequence of instructions
and/or addresses that a benchmark program produces. These traces can be used as inputs to other
simulation tools — so called (specialized) trace-driven simulators.
5.2.1 ALTERNATIVES
An alternative to functional simulation is instrumentation, also called direct execution. Instrumen-
tation takes a binary and adds code to it so that when running the instrumented binary on real
hardware the property of interest is collected. For example, if the goal is to generate a trace of mem-
ory addresses, it suffices to instrument (i.e., add code to) each instruction referencing memory in the
binary to compute and print the memory address; running the instrumented binary on native hard-
ware then provides a trace of memory addresses. The key advantage of instrumentation compared
to functional simulation is that it incurs less overhead. Instrumentation executes all the instructions
natively on real hardware; in contrast, functional simulation emulates all the instructions and hence
executes more host instructions per target instruction. There exist two flavors of instrumentation,
static instrumentation, which instruments the binary statically, and dynamic instrumentation, which
instruments the binary at run time. Example tools for static instrumentation are Atom [176] and
EEL [114] (used in the Wisconsin Wind Tunnel II simulator [142]); Embra [191], Shade [34] and
Pin [128] support dynamic instrumentation. A limitation of instrumentation compared to functional
simulation is that the target ISA is typically the same as the host ISA, and thus an instrumentation
framework is not easily portable. A dynamic binary translator that translates target ISA instructions to host ISA instructions can address this concern (as is done in the Shade framework); however, the simulator can then only run on a machine that implements the host ISA.
[Figure content: C definitions of instructions (add: D=A+B; sub: D=A-B; bne: if (A) goto B;) that are synthesized into a simulator in C code.]
Figure 5.2: Functional simulator synthesizer proposed by Burtscher and Ganusov [21].
Zippy, a static instrumentation system developed at Digital in the late 1980s, reads in an Alpha binary and adds ISA emulation and modeling code in MIPS code.
An approach that combines the speed of instrumentation with the portability of functional
simulation was proposed by Burtscher and Ganusov [21], see also Figure 5.2. They propose a func-
tional simulator synthesizer which takes as input a binary executable as well as a file containing
C definitions (code snippets) of all the supported instructions. The synthesizer then translates in-
structions in the binary to C statements. If desirable, the user can add simulation code to collect
for example a trace of instructions or addresses. Compiling the synthesized C code generates the
customized functional simulator.
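As a hypothetical illustration of what such synthesized C code might look like for a three-instruction sequence (the register array R, the trace() hook and the program counters are invented, not the actual tool's output):

#include <stdio.h>

static long R[32];                       /* simulated architectural registers */

static void trace(long pc) {             /* user-added simulation code        */
    printf("executed pc=0x%lx\n", pc);
}

int main(void) {
    R[1] = 3; R[2] = 4;
    /* add r3, r1, r2 */  trace(0x1000);  R[3] = R[1] + R[2];
    /* sub r4, r3, r1 */  trace(0x1004);  R[4] = R[3] - R[1];
    /* bne r4, done   */  trace(0x1008);  if (R[4] != 0) goto done;
    /* ... fall-through code would be synthesized here ... */
done:
    printf("r4 = %ld\n", R[4]);
    return 0;
}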
5.6.1 TAXONOMY
Mauer et al. [136] present a useful taxonomy of execution-driven simulators, see also Figure 5.3.
The taxonomy reflects four different ways of how to couple the functional and timing components
in order to manage simulator complexity and development time. An execution-driven simulator that
tightly integrates the functional and timing components, hence called an integrated execution-driven
simulator (see Figure 5.3(a)), is obviously harder to develop and maintain. An integrated simulator
is not flexible, is harder to extend (e.g., when evaluating a new architectural feature), and there is
a potential risk that modifying the timing component may accidentally introduce an error in the
functional component. In addition, the functional model tends to change very little, as mentioned
before; however, the timing model may change a lot during architecture exploration. Hence, it is
desirable from a simulator complexity and development point of view to decouple the functional part
[Figure 5.3: four ways of coupling the timing and functional components: (a) integrated, (b) timing-directed, (c) functional-first, and (d) timing-first simulation.]
from the timing part. There are a number of ways to do the decoupling, which we discuss next.
Timing-directed simulation. A timing-directed simulator lets the timing simulator direct the func-
tional simulator to fetch instructions along mispredicted paths and select a particular thread inter-
leaving (Figure 5.3(b)). The Asim simulator [57] is a timing-directed simulator. The functional
model keeps track of the architecture state such as register and memory values. The timing model
has no notion of values; instead, it gets the effective addresses from the functional model, which it
uses to determine cache hits and misses, access the branch predictor, etc. The functional model can
be viewed as a set of function calls that the timing model invokes to perform specific functional tasks
at precisely the correct simulated time. The functional model needs to be organized such that it can
partially simulate instructions. In particular, the functional simulator needs the ability to decode,
execute, perform memory operations, kill, and commit instructions. The timing model then calls the
functional model to perform specific tasks at the correct time in the correct order. For example, when
simulating the execution of a load instruction on a load unit, the timing model asks the functional
model to compute the load’s effective address. The address is then sent back to the timing model,
which subsequently determines whether this load incurs a cache miss. Only when the cache access
returns or when a cache miss returns from memory, according to the timing model, will the functional
simulator read the value from memory. This ensures that the load reads the exact same data as the
target architecture would. When the load commits in the target architecture, the instruction is also
committed in the functional model. The functional model also keeps track of enough internal state
so that an instruction can be killed in the functional model when it turns out that the instruction
was executed along a mispredicted path.
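A minimal sketch (hypothetical interfaces, not Asim's actual API) of the timing-directed coupling described above: the timing model asks the functional model to partially execute a load, decides hit or miss itself, and then lets the functional model perform the memory read. In a real simulator the read would be deferred until the simulated access completes; this sketch simply performs it after the latency decision.

#include <stdio.h>

/* functional-model side: partial execution of one load */
static long fm_exec_addr(int id)  { return 0x1000 + 64L * id; }  /* effective address              */
static long fm_do_memory(long a)  { return a ^ 0x5a; }           /* stand-in for reading target memory */

/* timing-model side: a toy cache that misses when the low address bits are zero */
static int cache_hit(long addr)   { return (addr & 0xff) != 0; }

static void simulate_load(int id) {
    long addr   = fm_exec_addr(id);           /* timing model asks functional model for the address */
    int latency = cache_hit(addr) ? 3 : 200;  /* timing model determines hit/miss and latency       */
    long value  = fm_do_memory(addr);         /* functional model performs the actual read          */
    printf("load %d: addr=0x%lx latency=%d value=%ld\n", id, addr, latency, value);
}

int main(void) { for (int i = 0; i < 4; i++) simulate_load(i); return 0; }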
Timing-first simulation. Timing-first simulation lets the timing model run ahead of the func-
tional model [136], see Figure 5.3(d). The timing simulator models architecture state (register and memory values), mostly correctly, in addition to microarchitecture state. This allows for accurately
(though not perfectly) modeling speculative execution along mispredicted branches as well as the
ordering of inter-thread events. When the timing model commits an instruction, i.e., when the
instruction becomes non-speculative, the functional model verifies whether the timing simulator
has deviated from the functional model. On a deviation, the timing simulator is repaired by the
functional simulator. This means that the architecture state must be reloaded and the microarchitecture state must be reset before restarting the timing simulation. In other words, timing-first simulation consists of
an almost correct, integrated execution-driven simulator (the timing simulator), which is checked
by a functionally correct functional simulator.
A timing-first simulator is easier to develop than a fully integrated simulator because the timing
simulator does not need to implement all the instructions. A subset of instructions that is important
to performance and covers the dynamically executed instructions well is sufficient. Compared to
timing-directed simulation, timing-first simulation requires less features in the functional simulator
while requiring more features in the timing simulator.
Long-running simulations. One solution is to run the simulation for a long enough period, e.g.,
simulate minutes of simulated time rather than seconds. Non-determinism is likely to largely vanish
for long simulation experiments; however, given that architecture simulation is extremely slow, this
is not a viable solution in practice.
Statistical methods. A third approach is to use (classical) statistical methods to draw valid con-
clusions. Alameldeen and Wood [1] propose to artificially inject small timing variations during
simulation. More specifically, they inject small changes in the memory system timing by adding
a uniformly distributed random number between 0 and 4 ns to the DRAM access latency. These
randomly injected perturbations create a range of possible executions starting from the same initial
condition — note that the simulator is deterministic and will always produce the same timing, hence
the need for introducing random perturbations. They then run the simulation multiple times and
compute the mean across these runs along with its confidence interval.
An obvious drawback of this approach is that it requires multiple simulation runs, which
prolongs total simulation time. This makes this approach more time-consuming compared to the
approaches that eliminate non-determinism. However, this is the best one can do to obtain reliable
performance numbers through simulation. Moreover, multiple (small) simulation runs are likely to be
more time-efficient than one very long-running simulation.
can easily exchange components while leaving the rest of the performance model unchanged). All
of these benefits lead to a shorter overall development time.
Several simulation infrastructures implement the modularity principle, see for example
Asim [57] by Digital/Compaq/Intel, Liberty [183] at Princeton University, MicroLib [159] at
INRIA, UNISIM [6] at INRIA/Princeton, and M5 [16] at the University of Michigan. A modular
simulation infrastructure typically provides a simulator infrastructure for creating many performance
models rather than having a single performance model. In particular, Asim [57] considers modules,
which are the basic software components. A module represents either a physical component of the
target design (e.g., a cache, branch predictor, etc.) or a hardware algorithm’s operation (e.g., cache
replacement policy). Each module provides a well-defined interface, which enables module reuse.
Developers can contribute new modules to the simulation infrastructure as long as they implement
the module interface, e.g., a branch predictor should implement the three methods for a branch
predictor: get a branch prediction, update the branch predictor and handle a mispredicted branch.
Asim comes with the Architect’s Workbench [58] which allows for assembling a performance model
by selecting and connecting modules.
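A minimal sketch of such a module interface for a branch predictor, with the three methods mentioned above; the struct layout and the 2-bit bimodal module are illustrative and not Asim's actual interface:

#include <stdio.h>

typedef struct branch_predictor {
    int  (*predict)(struct branch_predictor *bp, unsigned long pc);
    void (*update)(struct branch_predictor *bp, unsigned long pc, int taken);
    void (*mispredict)(struct branch_predictor *bp, unsigned long pc);
    unsigned char counters[1024];              /* module-private state */
} branch_predictor_t;

static int bimodal_predict(branch_predictor_t *bp, unsigned long pc) {
    return bp->counters[pc % 1024] >= 2;       /* predict taken in the upper half */
}
static void bimodal_update(branch_predictor_t *bp, unsigned long pc, int taken) {
    unsigned char *c = &bp->counters[pc % 1024];
    if (taken  && *c < 3) (*c)++;              /* saturating 2-bit counter */
    if (!taken && *c > 0) (*c)--;
}
static void bimodal_mispredict(branch_predictor_t *bp, unsigned long pc) {
    (void)bp; (void)pc;                        /* e.g., repair speculative history */
}

int main(void) {
    branch_predictor_t bp = { bimodal_predict, bimodal_update, bimodal_mispredict, {0} };
    printf("prediction: %d\n", bp.predict(&bp, 0x400123));
    bp.update(&bp, 0x400123, 1);
    return 0;
}

Any module implementing these three methods can be plugged into the same simulator skeleton, which is the reuse property described above.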
CHAPTER 6
Sampled Simulation
The most prevalent method for speeding up simulation is sampled simulation. The idea of sampled
simulation is to simulate the execution of only a small fraction of a benchmark’s dynamic instruc-
tion stream, rather than its entire stream. By simulating only a small fraction, dramatic simulation
speedups can be achieved.
Figure 6.1 illustrates the concept of sampled simulation. Sampled simulation simulates one or
more so called sampling units selected at various places from the benchmark execution. The collection
of sampling units is called the sample. We refer to the pre-sampling units as the parts between two
sampling units. Sampled simulation only reports performance metrics of interest for the instructions
in the sampling units and discards the instructions in the pre-sampling units. And this is where the
dramatic performance improvement comes from: only the sampling units, which constitute a small fraction
of the total dynamic instruction count, are simulated in a cycle-accurate manner.
There are three major challenges for sampled simulation to be accurate and fast:
[Figure 6.1: sampled simulation selects a number of sampling units, which together form the sample, from the benchmark's dynamic instruction stream; the instructions between two sampling units form a pre-sampling unit.]
during sampled simulation, the sampling unit’s MSI is unknown. This is well known in the
literature as the cold-start problem.
Note the subtle but important difference between getting the architecture state correct and
getting the microarchitecture state accurate. Getting the architecture state correct is absolutely
required in order to enable a functionally correct execution of the sampling unit. Getting the
microarchitecture state as accurate as possible compared to the case where the entire dynamic
instruction stream would have been executed up to the sampling unit is desirable if one wants
accurate performance estimates.
The following subsections describe each of these three challenges in more detail.
Figure 6.2: Three ways for selecting sampling units: (a) random sampling, (b) periodic sampling, and
(c) representative sampling.
periodically across the entire program execution, see also Figure 6.2(b): the pre-sampling unit size is
fixed, as opposed to random sampling, e.g., a sampling unit of 10,000 instructions is selected every
1 million instructions.
The key advantage of statistical sampling is that it builds on statistics theory, and allows for
computing confidence bounds on the performance estimates through the central limit theorem [126].
Assume we have a performance metric of interest (e.g., CPI) for n sampling units: xi , 1 ≤ i ≤ n.
The mean of these measurements x̄ is computed as

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .
The central limit theorem then states that, for large values of n (typically n ≥ 30), x̄ is approximately
Gaussian distributed provided that the samples xi are (i) independent,1 and (ii) come from the same
population with a finite variance.2 Statistics then states that we can compute the confidence interval
[c_1, c_2] for the mean as

\left[ \bar{x} - z_{1-\alpha/2} \frac{s}{\sqrt{n}} \; ; \; \bar{x} + z_{1-\alpha/2} \frac{s}{\sqrt{n}} \right],
with s the sample’s standard deviation. The value z1−α/2 is typically obtained from a precomputed
table; z1−α/2 equals 1.96 for a 95% confidence interval. A 95% confidence interval [c1 , c2 ] basically
means that the probability for the true mean μ to lie between c1 and c2 equals 95%. In other words,
a confidence interval gives the user some confidence that the true mean (e.g., the average CPI across
the entire program execution) can be approximated by the sample mean x̄.
SMARTS [193; 194] goes one step further and leverages the above statistics to determine how
many sampling units are required to achieve a desired confidence interval at a given confidence level.
In particular, the user first determines a particular confidence interval size (e.g., a 95% confidence
interval within 3% of the sample mean). The benchmark is then simulated and n sampling units are
collected, n being some initial guess for the number of sampling units. The mean and its confidence
interval is computed for the sample, and, if it satisfies the above 3% rule, this estimate is considered
to be good. If not, more sampling units (> n) must be collected, and the mean and its confidence
interval must be recomputed for each collected sample until the accuracy threshold is satisfied.
This strategy yields bounded confidence interval sizes at the cost of requiring multiple (sampled)
simulation runs.
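The following sketch, with invented per-sampling-unit CPI values, computes the sample mean, its 95% confidence interval, and a SMARTS-style check of whether the interval lies within 3% of the mean:

#include <math.h>
#include <stdio.h>

int main(void) {
    /* hypothetical CPI measured for n = 30 sampling units */
    double cpi[] = { 1.21, 1.35, 1.18, 1.40, 1.29, 1.33, 1.26, 1.38, 1.22, 1.31,
                     1.27, 1.36, 1.24, 1.30, 1.34, 1.28, 1.25, 1.37, 1.23, 1.32,
                     1.29, 1.31, 1.26, 1.35, 1.28, 1.30, 1.27, 1.33, 1.24, 1.36 };
    int n = sizeof(cpi) / sizeof(cpi[0]);

    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += cpi[i];
    double mean = sum / n;

    double ss = 0.0;
    for (int i = 0; i < n; i++) ss += (cpi[i] - mean) * (cpi[i] - mean);
    double s = sqrt(ss / (n - 1));             /* sample standard deviation */

    double half = 1.96 * s / sqrt(n);          /* 95% confidence half-width */
    printf("mean CPI = %.4f, 95%% CI = [%.4f, %.4f]\n", mean, mean - half, mean + half);
    printf(half <= 0.03 * mean ? "accuracy target met\n"
                               : "collect more sampling units\n");
    return 0;
}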
SMARTS [193; 194] uses a fairly small sampling unit size of 1,000 instructions for SPEC
CPU workloads; Flexus [190] uses sampling units of a few tens of thousands of instructions for full-system server workloads. The reason for choosing a small sampling unit size is to minimize the measurement cost (reduce the number of instructions simulated in detail) while taking into account measurement practicality (i.e., measuring IPC or CPI over a long enough time period) and bias (i.e., making sure the microarchitecture state is warmed up, as we will discuss in Section 6.3). The use of small sampling units implies that we
need lots of them, typically on the order of 1000 sampling units. The large number of sampling units
implies in turn that sampled simulation becomes embarrassingly parallel, i.e., one can distribute
the sampling units across a cluster of machines, as we will discuss in Section 8.1. In addition, it
allows for throttling simulation turnaround time on-the-fly based on a desired error and confidence.
1 One could argue whether the condition of independent measurements is met because the sampling units are selected from a single
program execution, and thus these measurements are not independent. This is even more true for periodic sampling because the
measurements are selected at fixed intervals.
2 Note that the central limit theorem does not impose a particular distribution for the population from which the sample is taken.
The population may not be Gaussian distributed (which is most likely to be the case for computer programs), yet the sample
mean is Gaussian distributed.
The potential pitfall of systematic sampling compared to random sampling is that the sam-
pling units may give a skewed view in case the periodicity present in the program execution under
measurement equals the sampling periodicity or its higher harmonics. For populations with low ho-
mogeneity though, periodic sampling is a good approximation of random sampling. Wunderlich et
al. [193; 194] showed this to be the case for their workloads. This also agrees with the intuition that
the workloads do not have sufficiently regular cyclic behavior at the periodicity relevant to sampled
simulation (tens of millions of instructions).
[Figure 6.3: representative sampling: a basic block vector (BBV) is collected for each interval of the program execution, the BBVs are reduced through random projection, the intervals are grouped into clusters (A through E), and a representative sampling unit is chosen per cluster.]
6.2.1 FAST-FORWARDING
The principle of fast-forwarding between sampling units is illustrated in Figure 6.4(a). Starting
from either the beginning of the program or the prior sampling unit, fast-forwarding constructs the
architecture starting image through functional simulation. When it reaches the beginning of the
next sampling unit, the simulator switches to detailed execution-driven simulation. When detailed
simulation reaches the end of the sampling unit, the simulator switches back to functional simulation
to get to the next sampling unit, or in case the last sampling unit is executed, the simulator quits.
The main advantage is that it is relatively straightforward to implement in an execution-driven
simulator — an execution-driven simulator comes with a functional simulator, and switching between
functional simulation and detailed execution-driven simulation is not that hard to implement. The
disadvantage is that fast-forwarding can be time-consuming for sampling units that are located deep
in the dynamic instruction stream. In addition, it also serializes the simulation of all of the sampling
units, i.e., one needs to simulate all prior sampling units and fast-forward between the sampling
units in order to construct the ASI for the next sampling unit. Because fast-forwarding can be fairly
time-consuming, researchers have proposed various techniques to speed up fast-forwarding.
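A sketch of the control loop for sampled simulation with fast-forwarding; functional_step() and detailed_step() are placeholders for a real simulator's per-instruction routines, and the sizes are illustrative.

#include <stdio.h>

static void functional_step(void) { /* update architecture state only (stub) */ }
static void detailed_step(void)   { /* cycle-accurate simulation (stub)      */ }

int main(void) {
    const long pre_sampling_unit = 1000000;  /* instructions fast-forwarded        */
    const long sampling_unit     = 10000;    /* instructions simulated in detail   */
    const int  num_units         = 50;

    for (int u = 0; u < num_units; u++) {
        for (long i = 0; i < pre_sampling_unit; i++)
            functional_step();               /* fast-forward: construct the ASI    */
        for (long i = 0; i < sampling_unit; i++)
            detailed_step();                 /* measure performance in the unit    */
    }
    printf("simulated %d sampling units\n", num_units);
    return 0;
}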
Szwed et al. [179] propose to fast-forward between sampling units through native hardware
execution, called direct execution, rather than through functional simulation, see also Figure 6.4(b).
Because native hardware execution is much faster than functional simulation, substantial speedups
can be achieved. Direct execution is employed to quickly go from one sampling unit to the next.
When the next sampling unit is reached, checkpointing is used to communicate the architecture state
from the real hardware to the simulator. Detailed execution-driven simulation of the sampling unit
is done starting from this checkpoint. When the end of the sampling unit is reached, the simulator
switches back to native hardware execution to quickly reach the next sampling unit. Many ways
to incorporate direct hardware execution into simulators for speeding up simulation and emulation
systems have been proposed, see for example [43; 70; 109; 163; 168].
One requirement for fast-forwarding through direct execution is that the simulation needs to
be done on a host machine with the same instruction-set architecture (ISA) as the target machine.
Fast-forwarding on a host machine with a different ISA than the target machine cannot be sped
up through direct execution. This is a serious concern for studies that explore ISA extensions, let
alone an entirely novel ISA. This would imply that such studies would need to fall back to relatively
slow functional simulation. One possibility to overcome this limitation is to employ techniques
from dynamic binary translation methods such as just-in-time ( JIT) compilation and caching of
translated code, as is done in Embra [191]. A limitation with dynamic binary translation though
is that it makes the simulator less portable to host machines with different ISAs. An alternative
Figure 6.4: Three approaches to initialize the architecture starting image: (a) fast-forwarding through
functional simulation, (b) fast-forwarding through direct execution, and (c) checkpointing.
approach is to resort to so called compiled instruction-set simulation as proposed by [21; 147; 164].
The idea of compiled instruction-set simulation is to translate each instruction in the benchmark
into C code that implements the instruction. Compiling the C code yields a functional simulator. Given
that the generated functional simulator is written in C, it is easily portable across platforms. (We
already discussed these approaches in Section 5.2.1.)
6.2.2 CHECKPOINTING
Checkpointing takes a different approach and stores the ASI before a sampling unit. Taking a
checkpoint is similar to storing a core dump of a program so that it can be replayed at that point
in execution. A checkpoint stores the register contents and the memory state prior to a sampling
unit. During sampled simulation, getting the architecture starting image initialized is just a matter
of loading the checkpoint from disk and updating the register and memory state in the simulator,
see Figure 6.4(c). The advantage of checkpointing is that it allows for parallel simulation, in contrast
to fast-forwarding, i.e., checkpoints are independent of each other, which enables simulating multiple sampling units in parallel.
There is one major disadvantage to checkpointing compared to fast-forwarding and direct
execution, namely, large checkpoint files need to be stored on disk. Van Biesbrouck et al. [184]
report checkpoint files up to 28 GB for a single benchmark. Using many sampling units could be
prohibitively costly in terms of disk space. In addition, the large checkpoint file size also affects
total simulation time due to loading the checkpoint file from disk when starting the simulation of a
sampling unit and to transferring it over a network during parallel simulation.
Reduced checkpointing addresses the large checkpoint concern by limiting the amount of
information stored in the checkpoint. The main idea behind reduced checkpointing is to only
store the registers along with the memory words that are read in the sampling unit — a naive
checkpointing approach would store the entire memory state. The Touched Memory Image (TMI)
approach [184] and the live-points approach in TurboSMARTS [189] implement this principle.
The checkpoint only stores the chunks of memory that are read during the sampling unit. This is a
substantial optimization compared to full checkpointing which stores the entire memory state for
each sampling unit. An additional optimization is to store only the chunks of memory that are read
before they are written — there is no need to store a chunk of memory in the checkpoint in case that
chunk of memory is written prior to being read in the sampling unit. At simulation time, prior to
simulating the given sampling unit, the checkpoint is loaded from disk and the chunks of memory
in the checkpoint are written to their corresponding memory addresses. This guarantees a correct
ASI when starting the simulation of the sampling unit. A small file size is further achieved by using
a sparse image representation, so regions of memory that consist of consecutive zeros are not stored
in the checkpoint.
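The sketch below illustrates the touched-memory idea under simplifying assumptions (a tiny flat memory, one word per access, no sparse-image compression): a word enters the checkpoint only on the first read that is not preceded by a write within the sampling unit.

#include <stdio.h>

#define MEM_WORDS 1024                  /* toy memory for illustration */

static long mem[MEM_WORDS];
static char written[MEM_WORDS];         /* written earlier in this sampling unit?      */
static char in_ckpt[MEM_WORDS];         /* already recorded in the reduced checkpoint? */

static void on_read(int addr) {
    if (!written[addr] && !in_ckpt[addr]) {
        in_ckpt[addr] = 1;              /* value must be supplied by the checkpoint */
        printf("checkpoint: addr %d value %ld\n", addr, mem[addr]);
    }
}

static void on_write(int addr, long value) {
    written[addr] = 1;                  /* later reads are produced inside the unit */
    mem[addr] = value;
}

int main(void) {
    mem[7] = 42;                        /* state left behind by earlier execution */
    on_write(3, 5);                     /* written before read: not checkpointed  */
    on_read(3);
    on_read(7);                         /* read before write: checkpointed        */
    return 0;
}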
Van Biesbrouck et al. [184] and Wenisch et al. [189] provide a comprehensive evaluation of the
impact of reduced ASI checkpointing on simulation accuracy, storage requirements, and simulation
time. These studies conclude that the impact on error is marginal (less than 0.2%) — the reason
for the inaccuracy due to ASI checkpointing is that the data values for loads along mispredicted
paths may be incorrect. Reduced ASI checkpointing reduces storage requirements by two orders
of magnitude compared to full ASI checkpointing. For example, for SimPoint using one-million
instruction sampling units, an average (compressed) full ASI checkpoint takes 49.3 MB whereas
a reduced ASI checkpoint takes only 365 KB. Finally, reduced ASI checkpointing reduces the
simulation time by an order of magnitude (20×) compared to fast-forwarding and by a factor 4×
compared to full checkpointing.
Ringenberg and Mudge [165] present intrinsic checkpointing which basically stores the
checkpoint in the binary itself. In other words, intrinsic checkpointing brings the ASI up to date by
providing fix-up checkpointing code consisting of store instructions to put the correct data values in
memory — again, only memory locations that are read in the sampling unit need to be updated; it
also executes instructions to put the correct data values in registers. Intrinsic checkpointing has the
limitation that it requires binary modification for including the checkpoint code in the benchmark
binary. On the other hand, it does not require modifying the simulator, and it even allows for running
sampling units on real hardware. Note though that the checkpoint code may skew the performance
metrics somewhat; this can be mitigated by considering large sampling units.
No warmup. The cold or no warmup scheme [38; 39; 106] assumes an empty cache at the beginning
of each sampling unit. Obviously, this scheme will overestimate the cache miss rate. However, the
bias can be small for large sampling unit sizes. Intel’s PinPoint approach [153], for example, considers
a fairly large sampling unit size, namely 250 million instructions, and does not employ any warmup
approach because the bias due to an inaccurate MSI is small.
Continuous warmup. Continuous warmup, as the name says, continuously keeps the cache state
warm between sampling units. This means that the functional simulation between sampling units
needs to be augmented to also access the caches. This is a very accurate approach but increases the
time spent between sampling units. This approach is implemented in SMARTS [193; 194]: the
tiny sampling units of 1,000 instructions used in SMARTS require a very accurate MSI, which is
achieved through continuous warmup; this is called functional warming in the SMARTS approach.
Stitch. Stitch or stale state [106] approximates the microarchitecture state at the beginning of a
sampling unit with the hardware state at the end of the previous sampling unit. An important
disadvantage of the stitch approach is that it cannot be employed for parallel sampled simulation.
Cache miss rate estimation. Another approach is to assume an empty cache at the beginning of
each sampling unit and to estimate which cold-start misses would also have been misses had the cache state at the beginning of the sampling unit been known. This is the so called cache miss rate estimator
approach [106; 192]. A simple example cache miss estimation approach is hit-on-cold or assume-hit.
Hit-on-cold assumes that the first access to a cache line is always a hit. This is an easy-to-implement
technique which is fairly accurate for programs with a low cache miss rate.
Self-monitored adaptive (SMA) warmup. Luo et al. [131] propose a self-monitored adaptive
(SMA) cache warmup scheme in which the simulator monitors the warmup process of the caches
and decides when the caches are warmed up. This warmup scheme is adaptive to the program being
simulated as well as to the cache being simulated — the smaller the application’s working set size or
the smaller the cache, the shorter the warmup phase. One limitation of SMA is that it is unknown
a priori when the caches will be warmed up and when detailed simulation should get started. This
may be less of an issue for random statistical sampling (although the sampling units are not selected
in a random fashion anymore), but it is a problem for periodic sampling and targeted sampling.
Memory Reference Reuse Latency (MRRL). Haskins and Skadron [79] propose the MRRL
warmup strategy. The memory reference reuse latency is defined as the number of instructions
between two consecutive references to the same memory location. The MRRL warmup approach
computes the MRRL for each memory reference in the sampling unit, and collects these MRRLs
in a distribution. A given percentile, e.g., 99%, then determines when cache warmup should start
prior to the sampling unit. The intuition is that a sampling unit with large memory reference reuse
latencies also needs a long warmup period.
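A simplified sketch of the MRRL computation (the reference trace is invented, and the handling of the boundary between pre-sampling unit and sampling unit is ignored for brevity): reuse latencies are collected into a histogram, and the chosen percentile determines how many instructions before the sampling unit warmup should start.

#include <stdio.h>

#define ADDRS   16
#define MAX_LAT 64

int main(void) {
    /* toy trace: instruction count and memory address of each reference */
    int icount[] = { 3, 10, 12, 20, 25, 31, 40, 44, 50, 58 };
    int addr[]   = { 1,  2,  1,  3,  2,  1,  4,  3,  1,  2 };
    int n = 10;

    int last[ADDRS];
    for (int i = 0; i < ADDRS; i++) last[i] = -1;
    int hist[MAX_LAT] = { 0 }, total = 0;

    for (int i = 0; i < n; i++) {
        if (last[addr[i]] >= 0) {                      /* reuse latency in instructions */
            hist[icount[i] - last[addr[i]]]++;
            total++;
        }
        last[addr[i]] = icount[i];
    }
    int seen = 0, p99 = 0;                             /* 99th percentile of the histogram */
    for (int lat = 0; lat < MAX_LAT; lat++) {
        seen += hist[lat];
        if (seen * 100 >= 99 * total) { p99 = lat; break; }
    }
    printf("start cache warmup %d instructions before the sampling unit\n", p99);
    return 0;
}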
Boundary Line Reuse Latency (BLRL). Eeckhout et al. [49] only look at reuse latencies that ‘cross’
the boundary line between the pre-sampling unit and the sampling unit, hence the name boundary
line reuse latency (BLRL). In contrast, MRRL considers all the reuse latencies which may not be
an accurate picture for the cache warmup required for the sampling unit. Relative to BLRL, MRRL
may result in a warmup period that is either too short to be accurate or too long for the attained
level of accuracy.
Checkpointing. Another approach to the cold-start problem is to checkpoint or to store the MSI at
the beginning of each sampling unit. Checkpointing yields perfectly warmed up microarchitecture
state. On the flipside, it is specific to a particular microarchitecture, and it may require excessive disk
space for storing checkpoints for a large number of sampling units and different microarchitectures.
Since this is infeasible to do in practice, researchers have proposed more efficient approaches to MSI
checkpointing.
One approach is the No-State-Loss (NSL) approach [35; 118]. NSL scans the pre-sampling
unit and records the last reference to each unique memory location. This is called the least recently
used (LRU) stream. For example, the LRU stream of the following reference stream ‘ABAACDABA’
is ‘CDBA’.The LRU stream can be computed by building the LRU stack: it is easily done by pushing
an address on top of the stack when it is referenced. NSL yields a perfect warmup for caches with
an LRU replacement policy.
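The sketch below reproduces this example, computing the LRU stream 'CDBA' for the reference stream 'ABAACDABA' with a simple LRU stack:

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *trace = "ABAACDABA";
    char stack[64];
    int depth = 0;

    for (int i = 0; trace[i] != '\0'; i++) {
        char addr = trace[i];
        for (int j = 0; j < depth; j++) {      /* remove the address if already on the stack */
            if (stack[j] == addr) {
                memmove(&stack[j], &stack[j + 1], depth - j - 1);
                depth--;
                break;
            }
        }
        stack[depth++] = addr;                 /* push the reference on top (most recent)    */
    }
    /* bottom-to-top order is the LRU stream: least to most recently used */
    printf("LRU stream: %.*s\n", depth, stack);
    return 0;
}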
Barr and Asanovic [10] extended this approach for reconstructing the cache and directory
state during sampled multiprocessor simulation. In order to do so, they keep track of a timestamp
per unique memory location that is referenced. In addition, they keep track of whether a memory
location is read or written. This information allows them to quickly reconstruct the cache and
directory state at the beginning of a sampling unit.
Van Biesbrouck et al. [185] and Wenisch et al. [189] proposed a checkpointing approach in
which the largest cache of interest is simulated once for the entire program execution. The SimPoint
project refers to this technique as ‘memory hierarchy state’; the TurboSMARTS project proposes
the term ‘live points’. At the beginning of each sampling unit, the cache content is stored on disk
as a checkpoint. The content of smaller sized caches can then be derived from the checkpoint.
Constructing the content of a cache with a smaller associativity from the checkpoint is trivial:
the most recently accessed cache lines need to be retained per set, see Figure 6.5(a). Reducing the
number of sets in the cache is slightly more complicated: the new cache set retains the most recently
used cache lines from the merging cache sets — this requires keeping track of access times to cache
lines during checkpoint construction, see Figure 6.5(b).
[Figure 6.5 content: an example cache with lines A through P, each tagged with the timestamp of its latest access, together with the resulting smaller caches for panels (a) and (b).]
Figure 6.5: Constructing the content of a smaller sized cache from a checkpoint, when (a) reducing
associativity and (b) reducing the number of sets. Each cache line in the checkpoint is tagged with a
timestamp that represents the latest access to the cacheline.
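The sketch below reproduces the set-merging case of Figure 6.5(b) for the figure's 4-set, 4-way example: two sets whose indices differ only in the dropped index bit merge, and the lines with the largest timestamps are retained.

#include <stdio.h>

#define SETS_BIG 4
#define WAYS     4

typedef struct { char tag; int t; } line_t;    /* cache line tag + last-access timestamp */

int main(void) {
    /* checkpointed content of the large cache, as in Figure 6.5 */
    line_t big[SETS_BIG][WAYS] = {
        { {'A',19}, {'B', 3}, {'C',10}, {'D', 4} },
        { {'E', 1}, {'F',32}, {'G',45}, {'H',22} },
        { {'I', 2}, {'J',14}, {'K', 5}, {'L',33} },
        { {'M', 6}, {'N', 7}, {'O', 8}, {'P',17} },
    };

    for (int s = 0; s < SETS_BIG / 2; s++) {
        line_t cand[2 * WAYS];                 /* candidates from the two merging sets */
        for (int w = 0; w < WAYS; w++) {
            cand[w]        = big[s][w];
            cand[WAYS + w] = big[s + SETS_BIG / 2][w];
        }
        for (int k = 0; k < WAYS; k++) {       /* retain the WAYS most recently used lines */
            int best = 0;
            for (int c = 1; c < 2 * WAYS; c++)
                if (cand[c].t > cand[best].t) best = c;
            printf("reduced set %d keeps line %c (t=%d)\n", s, cand[best].tag, cand[best].t);
            cand[best].t = -1;                 /* mark as already selected */
        }
    }
    return 0;
}

For the first reduced set this retains A, C, J and L, and for the second it retains F, G, H and P, matching the figure.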
CHAPTER 7
Statistical Simulation
Statistical modeling has a long history. Researchers typically employ statistical modeling to generate
synthetic workloads that serve as proxies for realistic workloads that are hard to capture. For example,
collecting traces of wide area networks (WAN) or even local area networks (LAN) (e.g., a cluster of
machines) is non-trivial, and it requires a large number of disks to store these huge traces. Hence,
researchers often resort to synthetic workloads (that are easy to generate) which exhibit the same characteristics (in a statistical sense) as the real network traffic. As another example, researchers studying
commercial server systems may employ statistical workload generators to generate load for the server
systems under study. Likewise, researchers in the area of interconnection networks frequently use
synthetic workloads in order to evaluate a network topology and/or router design across a range of
network loads.
Synthetic workload generation can also be used to evaluate processor architectures. For exam-
ple, Kumar and Davidson [111] used synthetically generated workloads to evaluate the performance
of the memory subsystem of the IBM 360/91; the motivation for using synthetic workloads is that
they enable investigating the performance of a computer system as a function of the workload char-
acteristics. Likewise, Archibald and Baer [3] use synthetically generated address streams to evaluate
cache coherence protocols. The paper by Carl and Smith [23] renewed recent interest in synthetic
workloads for evaluating modern processors and coined the term ‘statistical simulation’. The basic
idea of statistical simulation is to collect a number of program characteristics in the form of dis-
tributions and then generate a synthetic trace from it that serves as a proxy for the real program.
Simulating the synthetic trace then yields a performance estimate for the real program. Because
the synthetic trace is much shorter than the real program trace, simulation is much faster. Several
research projects explored this idea over the past decade [13; 50; 87; 149; 152].
[Figure: the statistical simulation flow: specialized simulation of the workload produces a statistical profile, synthetic trace generation turns the profile into a synthetic trace, and the synthetic trace is simulated on a statistical simulator.]
generates a synthetic trace based on this statistical profile. The characteristics of the trace reflect the
properties in the statistical profile and thus the original workload, by construction. Finally, simulating
the synthetic trace on a simple trace-driven statistical simulator yields performance numbers. The
hope/goal is that, if the statistical profile captures the workload’s behavior well and if the synthetic
trace generation algorithm is able to translate these characteristics into a synthetic trace, then the
performance numbers obtained through statistical simulation should be accurate estimates for the
performance numbers of the original workload.
The key idea behind statistical simulation is that capturing a workload’s execution behavior
in the form of distributions enables generating short synthetic traces that are representative for
long-running real-life applications and benchmarks. Several researchers have found this property
to hold true: it is possible to generate short synthetic traces on the order of a few million instructions that resemble workloads that run for tens to hundreds of billions of instructions;
this implies a simulation speedup of at least four orders of magnitude. Because the synthetic trace is
generated based on distributions, its performance characteristics quickly converge, typically after one
million (or at most a few million) instructions. And this is obviously where the key advantage lies
for statistical simulation: it enables predicting performance for long-running workloads using short
running synthetic traces. This is likely to speed up processor architecture design space exploration
substantially.
7.2 APPLICATIONS
Statistical simulation has a number of potential applications.
Design space exploration. The most obvious application for statistical simulation is processor design
space exploration. Statistical simulation does not aim at replacing detailed cycle-accurate simulation,
primarily because it is less accurate — e.g., it does not model cache accesses along mispredicted
paths, it simulates an abstract representation of a real workload, etc., as we will describe later. Rather,
statistical simulation aims at providing a tool that enables a computer architect to quickly make
high-level design decisions, and it quickly steers the design exploration towards a region of interest,
which can then be explored through more detailed (and thus slower) simulations. Steering the design
process in the right direction early on in the design cycle is likely to shorten the overall design process and time to market. In other words, statistical simulation is to be viewed as a useful complement
to the computer architect’s toolbox to quickly make high-level design decisions early in the design
cycle.
Workload space exploration. Statistical simulation can also be used to explore how program charac-
teristics affect performance. In particular, one can explore the workload space by varying the various
characteristics in the statistical profile in order to understand how these characteristics relate to
performance. The program characteristics that are part of the statistical profile are typically hard to
vary using real benchmarks and workloads, if at all possible. Statistical simulation, on the other hand,
allows for easily exploring this space. Oskin et al. [152] provide such a case study in which they vary
basic block size, cache miss rate, branch misprediction rate, etc., and study their effect on performance.
They also study the potential of value prediction.
Stresstesting. Taking this one step further, one can use statistical simulation for constructing stress-
marks, or synthetic benchmarks that stress the processor for a particular metric, e.g., max power
consumption, max temperature, max peak power, etc. Current practice is to manually construct
stressmarks which is both tedious and time-consuming. An automated stressmark building frame-
work can reduce this overhead and cost. This can be done by integrating statistical simulation in a
tuning framework that explores the workload space (by changing the statistical profile) while search-
ing for the stressmark of interest. Joshi et al. [101] describe such a framework that uses a genetic
algorithm to search the workload space and automatically generate stressmarks for various stress
conditions.
Workload characterization. Given that the statistical profile captures the most significant program
characteristics, it can be viewed as an abstract workload model or a concise signature of the
workload’s execution behavior [46]. In other words, one can compare workloads by comparing their
respective statistical profiles.
Large system evaluation. Finally, current state-of-the-art in statistical simulation addresses the
time-consuming simulation problem of single-core and multi-core processors. However, for larger
systems containing several processors, such as multi-chip servers, clusters of computers, datacenters,
etc., simulation time is an even bigger challenge. Statistical simulation may be an important and
interesting approach for such large systems.
[Figure: statistical profiling: microarchitecture-independent characteristics are collected from the program binary and input, while specialized simulation with cache and branch predictor models supplies microarchitecture-dependent characteristics; together they form the statistical profile from which a synthetic trace of instructions with cache/branch hit and miss labels is generated and fed to the statistical processor simulator.]
[Figure 7.3: first-order statistical flow graph for the basic block sequence AABAABCABC, with transition probabilities (e.g., 33.3% and 100%) annotated on the edges.]
Eeckhout et al. [45] propose the statistical flow graph (SFG) which models the control flow
using a Markov chain; various program characteristics are then correlated to the SFG. The SFG is
illustrated in Figure 7.3 for the AABAABCABC example basic block sequence. In fact, this example
shows a first-order SFG because it shows transition probabilities between nodes that represent a
basic block along with the basic block executed immediately before it. (Extending towards higher-
order SFGs is trivial.) The nodes here are A|A, B|A, A|B, etc.: A|B refers to the execution of basic
block A given that basic block B was executed immediately before basic block A. The percentages at
the edges represent the transition probabilities between the nodes. For example, there is a 33.3% and
66.6% probability to execute basic block A and C, respectively, after having executed basic block A
and then B (see the outgoing edges from B|A node). The idea behind the SFG and the reason why it
improves accuracy is that, by correlating program characteristics along the SFG, it models execution
path dependent program behavior.
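A small sketch of how the first-order SFG edge probabilities can be derived from the AABAABCABC sequence; the bookkeeping is simplified to the three basic blocks A, B and C:

#include <stdio.h>

int main(void) {
    const char *seq = "AABAABCABC";
    int counts[3][3][3] = { { { 0 } } };       /* [prev][cur][next] transition counts */
    int totals[3][3]    = { { 0 } };           /* outgoing edges of node cur|prev     */

    for (int i = 1; seq[i + 1] != '\0'; i++) {
        int prev = seq[i - 1] - 'A';
        int cur  = seq[i]     - 'A';
        int next = seq[i + 1] - 'A';
        counts[prev][cur][next]++;
        totals[prev][cur]++;
    }
    for (int p = 0; p < 3; p++)
        for (int c = 0; c < 3; c++)
            for (int n = 0; n < 3; n++)
                if (counts[p][c][n])
                    printf("%c|%c -> %c: %.1f%%\n", 'A' + c, 'A' + p, 'A' + n,
                           100.0 * counts[p][c][n] / totals[p][c]);
    return 0;
}

Among other edges, this prints the two outgoing probabilities of the B|A node discussed above.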
All of the characteristics discussed so far are independent of any microarchitecture-specific
organization. In other words, these characteristics do not rely on assumptions related to processor
issue width, window size, number of ALUs, instruction execution latencies, etc. They are, therefore,
called microarchitecture-independent characteristics.
[Figure content: a random number between 0 and 1 indexes the cumulative distribution of a program characteristic, e.g., the inter-instruction dependence distance, to select a value d.]
Figure 7.4: Synthetic trace generation: determining a program characteristic through random number
generation.
synthetic trace is a linear sequence of synthetically generated instructions. Each instruction has an
instruction type, a number of source operands, an inter-instruction dependence for each register input
(which denotes the producer for the given register dependence in case of downstream dependence
distance modeling), I-cache miss info, D-cache miss info (in case of a load), and branch miss info
(in case of a branch). The locality miss events are just labels in the synthetic trace describing whether
the load is an L1 D-cache hit, L2 hit or L2 miss and whether the load generates a TLB miss. Similar
labels are assigned for the I-cache and branch miss events.
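A sketch of the mechanism of Figure 7.4, drawing inter-instruction dependence distances from an invented cumulative distribution with a simple inverse-transform lookup:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* hypothetical cumulative distribution of dependence distances 1..5 */
    double cdf[]  = { 0.40, 0.65, 0.80, 0.92, 1.00 };
    int    dist[] = { 1, 2, 3, 4, 5 };
    int n = 5;

    srand(42);
    for (int i = 0; i < 10; i++) {
        double r = (double)rand() / RAND_MAX;  /* random number in [0,1] */
        int j = 0;
        while (j < n - 1 && r > cdf[j]) j++;   /* look up r in the CDF   */
        printf("synthetic dependence distance: %d\n", dist[j]);
    }
    return 0;
}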
Nussbaum and Smith [150] extended the statistical simulation methodology towards symmetric
multiprocessor (SMP) systems running multi-threaded workloads. This requires modeling inter-
thread synchronization and communication. More specifically, they model cache coherence events,
sequential consistency events, lock accesses and barrier distributions. For modeling cache coherence
events and sequential consistency effects, they model whether a store writes to a cache line that it
does not own, in which case it will not complete until the bus invalidation has reached the address
bus. Also, they model whether a sequence of consecutive stores accesses private versus shared memory
pages. Consecutive stores to private pages can be sent to memory when their input registers are
available; consecutive stores to shared pages can only be sent to memory if the invalidation of the
previous store has reached the address bus in order to preserve sequential consistency.
Lock accesses are modeled through acquire and release instructions in the statistical profile
and synthetic trace. More specifically, for an architecture that implements critical sections through
load-linked and store-conditional instructions, the load-linked instruction is retried until it finds
the lock variable is clear. It then acquires the lock through a store-conditional to the same lock
variable. If the lock variable has been invalidated since the load-linked instruction, this indicates
that another thread entered the critical section first. Statistical simulation models all instructions
executed between the first load-linked — multiple load-linked instructions may need to be executed
before it sees the lock variable is clear — and the successful store-conditional as a single acquire
instruction. When a thread exits the critical section, it releases the lock by storing a zero in the
lock variable through a conventional store instruction; this is modeled through a single release
instruction in statistical simulation. A distribution of lock variables is also maintained in order to
be able to discern different critical sections and have different probabilities for entering each critical
section. During statistical simulation, a random number along with the lock variable distribution
then determines which critical section a thread is entering. Separate statistical profiles are computed
for code executed outside versus inside critical sections.
Finally, modeling barrier synchronization is done by counting the number of instructions
executed per thread between two consecutive barriers. During synthetic trace generation, the number
of instructions between barriers is then scaled down proportionally to the number of instructions
executed during detailed execution relative to the number of instructions during statistical simulation.
[Figure 7.5: example SSFG with a main thread T0 (nodes A through D) and child threads T1 (nodes E through J) and T2 (nodes K through P); critical-section nodes are annotated with their lock variables (L1, L2).]
The idea is to scale down the amount of work done between barriers in proportion to the length of
the statistical simulation.
A limitation of the Nussbaum and Smith approach is that many of the characteristics are
microarchitecture-dependent; hence, the method requires detailed simulation instead of specialized
functional simulation during statistical profiling. Therefore, Hughes and Li [87] propose the concept
of a synchronized statistical flow graph (SSFG), which is a function only of the program under study
and not the microarchitecture. The SSFG is illustrated in Figure 7.5. Thread T 0 is the main thread,
and T 1 and T 2 are two child threads. There is a separate SFG for each thread, and thread spawning
is marked explicitly in the SSFG, e.g., T 1 is spawned in node B of T 0 and T 2 is spawned in node
C of T 0. In addition, the SSFG also models critical sections and for each critical section which lock
variable it accesses. For example, node F in T 1 accesses the same critical section as node N in T 2;
they both access the same lock variable L1.
1 SMART should not be confused with SMARTS [193; 194], which is a statistical sampling approach, as described in Chapter 6.
CHAPTER 8
[Figure: chunk-based parallel simulation: the dynamic instruction stream is split into chunks, each machine fast-forwards to the start of its chunk and simulates it in detail, continuing into an overlapping sub-chunk of the next chunk.]
There are as many chunks as there are machines, and each chunk consists of 1/Nth of the total dynamic instruction stream, with N the number of machines. This is different from sampled simulation: sampled simulation simulates sampling units only, whereas the Girbal et al. approach eventually simulates
the entire benchmark. Each machine executes the benchmark from the start. The first machine
starts detailed simulation immediately; the other machines employ fast-forwarding (or, alternatively,
employ checkpointing), and when the beginning of the chunk is reached, they run detailed simulation.
The idea now is to continue the simulation of each chunk past its end point (and thus simulate
instructions for the next chunk). In other words, there is overlap in simulation load between adjacent
machines. By comparing the post-chunk performance numbers against the performance numbers
for the next chunk (simulated on the next machine), one can verify whether microarchitecture state
has been warmed up on the next machine. (Because detailed simulation of a chunk starts from a cold state on each machine, the performance metrics will initially differ from those of the post-chunk on the previous machine; this is the cold-start problem described in Chapter 6.) Once the performance numbers are sufficiently similar, simulation on the former machine stops. The
performance numbers at the beginning of each chunk are then discarded and replaced by post-chunk
performance numbers from the previous chunk. The motivation for doing so is to compute an overall
performance score from performance numbers that were collected from a warmed up state.
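The following Python sketch, under the assumption of a simple relative-difference similarity test and illustrative names (it is not the DiST implementation), shows how chunks could be assigned and how the post-chunk overlap could be used to decide when the former machine may stop simulating:

    # Hypothetical sketch of chunk assignment and warm-up verification.
    def chunk_bounds(total_instructions, n_machines):
        # Each chunk covers 1/Nth of the dynamic instruction stream.
        size = total_instructions // n_machines
        return [(i * size, (i + 1) * size) for i in range(n_machines)]

    def warmed_up(post_chunk_ipc, next_machine_ipc, threshold=0.02):
        # Machine i keeps simulating past its chunk end until its post-chunk
        # performance matches what machine i+1 measured for the same instructions
        # from a cold start; good similarity means machine i+1's microarchitecture
        # state is effectively warmed up, so machine i may stop.
        return abs(post_chunk_ipc - next_machine_ipc) <= threshold * next_machine_ipc

    # Example: 1.2 billion instructions spread over 3 machines.
    print(chunk_bounds(1_200_000_000, 3))
    print(warmed_up(1.51, 1.50))   # True: within 2%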
Figure 8.2: Three approaches for synchronizing a parallel simulator that simulates a parallel machine:
(a) barrier synchronization at every cycle, (b) relaxed or no synchronization, and (c) quantum-based
synchronization.
Figure 8.3: Violation of temporal causality due to lack of synchronization: the load sees the old value at memory location A.
The performance of parallel cycle-by-cycle simulation may not be that great because it requires barrier synchronization at every simulated cycle. If the number of simulator instructions per simulated cycle is low, parallel cycle-by-cycle simulation will not yield substantial simulation speed benefits, and scalability will be poor.
In order to achieve better simulation performance and scalability, one can relax the cycle-by-
cycle condition. In other words, the simulated cores do not synchronize every simulated cycle, which
greatly improves simulation speed, see Figure 8.2(b). The downside is that relaxing the synchro-
nization may introduce simulation error. The fundamental reason is that relaxing the cycle-by-cycle
condition may lead to situations in which a future event may affect state in the past or a past event
does not affect the future — a violation of temporal causality. Figure 8.3 illustrates this. Assume a
load instruction to a shared memory location A is executed at simulated time x + 1 and a store instruction to that same location is executed at simulated time x; obviously, the load should see the value written by the store. However, if the load instruction at cycle x + 1 happens to be simulated before the store at cycle x, the load sees the old value at memory location A.
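A toy Python illustration (not from the original text) of the same hazard, in which a relaxed simulator processes events in arrival order rather than in simulated-time order:

    # Illustrative only: processing events out of timestamp order breaks causality.
    memory = {"A": "old"}
    # The load is timestamped cycle x+1 and the store cycle x, but the relaxed
    # simulator happens to process the load first.
    events = [("load", "A", "x+1"), ("store", "A", "x")]
    for op, addr, cycle in events:
        if op == "store":
            memory[addr] = "new"
        else:
            print(f"load at cycle {cycle} observes {memory[addr]}")  # prints "old"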
There exist two approaches to relax the synchronization imposed by cycle-by-cycle simula-
tion [69]. The optimistic approach takes periodic checkpoints and detects timing violations. When a timing violation is detected, the simulation is rolled back, resumes in a cycle-by-cycle manner until past the timing violation, and then switches back to relaxed synchronization. The conservative approach avoids timing violations by processing an event only when no other event could possibly affect it. A popular and effective conservative approach is based on barrier synchronization, while
relaxing the cycle-by-cycle simulation. The entire simulation is divided into quanta, and each quan-
tum comprises multiple simulated cycles. Quanta are separated through barrier synchronization,
see Figure 8.2(c). In other words, simulation threads can advance independently from each other
between barriers, and the simulated events become visible to all threads at each barrier. Provided that
the time intervals are smaller than the latency for an inter-thread dependence (e.g., to propagate an
event from one core to another), temporal causality will be preserved. Hence, quantum-based syn-
chronization achieves cycle-by-cycle simulation accuracy while greatly improving simulation speed
compared to cycle-by-cycle simulation. The Wisconsin Wind Tunnel projects [142; 163] implement
this approach; the quantum is 100 cycles when simulating shared memory multiprocessors. Chan-
drasekaran and Hill [25] aim at overcoming the quantum overhead through speculation; Falsafi and
Wood [67] leverage multiprogramming to hide the quantum overhead. Falcón et al. [66] propose an adaptive quantum-based synchronization scheme for simulating clusters of machines, in which the quantum can be as large as 1,000 cycles. When simulating multicore processors, the quantum needs
to be smaller because of the relatively small communication latencies between cores: for example,
Chidester and George [28] employ a quantum of 12 cycles. A small quantum obviously limits
simulation speed. Therefore, researchers are looking into relaxing synchronization even further, thereby potentially introducing simulation inaccuracy. Chen et al. [26] study both unbounded slack and bounded slack
schemes; Miller et al. [140] study similar approaches. Unbounded slack implies that the slack, or the
cycle count difference between two target cores in the simulation, can be as large as the entire simu-
lated execution time. Bounded slack limits the slack to a preset number of cycles, without incurring
barrier synchronization.
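The following Python sketch shows the basic shape of quantum-based synchronization, assuming one simulation thread per simulated core and a barrier between quanta; the names, the 100-cycle quantum, and the termination condition are illustrative only, and in a real simulator the quantum must remain smaller than the shortest inter-core communication latency:

    import threading

    # Hypothetical sketch: each simulated core advances independently for one
    # quantum, then all simulation threads synchronize at a barrier so that the
    # quantum's events become visible to every thread before anyone proceeds.
    QUANTUM = 100        # simulated cycles per quantum (illustrative value)
    N_CORES = 4
    END_CYCLE = 1_000    # keep the example short
    barrier = threading.Barrier(N_CORES)

    def simulate_core(core_id):
        cycle = 0
        while cycle < END_CYCLE:
            for _ in range(QUANTUM):
                # ... advance this core's timing model by one simulated cycle ...
                cycle += 1
            barrier.wait()   # quantum boundary: exchange cross-core events here

    threads = [threading.Thread(target=simulate_core, args=(i,)) for i in range(N_CORES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()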
8.3.1 TAXONOMY
Many different FPGA-accelerated simulation approaches have been proposed by various research
groups both in industry and academia. In order to understand how these approaches differ from
each other, it is important to classify them in terms of how they operate and yield performance
numbers. Joel Emer presents a useful taxonomy of FPGA-accelerated simulation [55] and distinguishes three flavors:
• A functional emulator is a circuit that is functionally equivalent to a target design, but does
not provide any insight on any specific design metric. A functional emulator is similar to a
functional simulator, except that it is implemented in FPGA hardware instead of software. The
key advantage of an FPGA-based functional emulator is that it can execute code at hardware
speed (several orders of magnitude faster than software simulation), which allows architects
and software developers to run commercial software in a reasonable amount of time.
• A prototype (or structural emulator) is a functionally equivalent and logically isomorphic rep-
resentation of the target design. Logically isomorphic means that the prototype implements
the same structures as in the target design, and its timing may be scaled with respect to the
target system. Hence, a prototype or structural emulator can be used to project performance.
For example, a prototype may be a useful vehicle for making high-level design decisions, e.g., to study the scalability of software code and/or an architecture proposal.
• A model is a representation that is functionally equivalent and logically isomorphic with the target design, such that a design metric of interest (of the target design), e.g., performance, power and/or reliability, can be faithfully quantified. The advantage of a model compared to a prototype is that it allows for some abstraction, which simplifies model development and enables modularity and easier evaluation of target design alternatives. Representing time faithfully then requires that a distinction be made between a simulated cycle (of the target system) and an FPGA cycle.
CHAPTER 9
Concluding Remarks
This book covered a wide spectrum of performance evaluation topics, ranging from performance
metrics to workload design, to analytical modeling and various simulation acceleration techniques
such as sampled simulation, statistical simulation, and parallel and hardware-accelerated simulation.
However, there are a number of topics that this book did not cover. We will now briefly discuss a
couple of these.
Bibliography
[1] A. Alameldeen and D. Wood. Variability in architectural simulations of multi-threaded
workloads. In Proceedings of the Ninth International Symposium on High-Performance Computer
Architecture (HPCA), pages 7–18, February 2003. DOI: 10.1109/HPCA.2003.1183520 58,
59, 60
[2] A. R. Alameldeen and D. A. Wood. IPC considered harmful for multiprocessor workloads.
IEEE Micro, 26(4):8–17, July 2006. DOI: 10.1109/MM.2006.73 6
[3] J. Archibald and J.-L. Baer. Cache coherence protocols: Evaluation using a multiprocessor
simulation model. ACM Transactions on Computer Systems (TOCS), 4(4):273–298, November
1986. DOI: 10.1145/6513.6514 81
[5] Arvind, K. Asanovic, D. Chiou, J. C. Hoe, C. Kozyrakis, S.-L. Lu, M. Oskin, D. Patterson,
J. Rabaey, and J. Wawrzynek. RAMP: Research accelerator for multiple processors — a
community vision for a shared experimental parallel HW/SW platform. Technical report,
University of California, Berkeley, 2005. 102
[7] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system
modeling. IEEE Computer, 35(2):59–67, February 2002. 51, 55
[8] D. A. Bader, Y. Li, T. Li, and V. Sachdeva. BioPerf: A benchmark suite to evaluate high-
performance computer architecture on bioinformatics applications. In Proceedings of the 2005
IEEE International Symposium on Workload Characterization (IISWC), pages 163–173, Octo-
ber 2005. DOI: 10.1109/IISWC.2005.1526013 16
[9] K. C. Barr and K. Asanovic. Branch trace compression for snapshot-based simulation. In
Proceedings of the International Symposium on Performance Analysis of Systems and Software
(ISPASS), pages 25–36, March 2006. 76
[10] K. C. Barr, H. Pan, M. Zhang, and K. Asanovic. Accelerating multiprocessor simulation
with a memory timestamp record. In Proceedings of the 2005 IEEE International Sympo-
sium on Performance Analysis of Systems and Software (ISPASS), pages 66–77, March 2005.
DOI: 10.1109/ISPASS.2005.1430560 76, 78
[11] C. Bechem, J. Combs, N. Utamaphetai, B. Black, R. D. Shawn Blanton, and J. P. Shen. An
integrated functional performance simulator. IEEE Micro, 19(3):26–35, May/June 1999.
DOI: 10.1109/40.768499 55
[12] R. Bedichek. SimNow: Fast platform simulation purely in software. In Proceedings of the
Symposium on High Performance Chips (HOT CHIPS), August 2004. 54
[13] R. Bell, Jr. and L. K. John. Improved automatic testcase synthesis for performance model
validation. In Proceedings of the 19th ACM International Conference on Supercomputing (ICS),
pages 111–120, June 2005. DOI: 10.1145/1088149.1088164 81, 93
[14] E. Berg and E. Hagersten. Fast data-locality profiling of native execution. In Proceedings of the
International Conference on Measurements and Modeling of Computer Systems (SIGMETRICS),
pages 169–180, June 2005. DOI: 10.1145/1064212.1064232 88, 93
[15] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: Char-
acterization and architectural implications. In Proceedings of the International Conference
on Parallel Architectures and Compilation Techniques (PACT), pages 72–81, October 2008.
DOI: 10.1145/1454115.1454128 16
[16] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Rein-
hardt. The M5 simulator: Modeling networked systems. IEEE Micro, 26(4):52–60, 2006.
DOI: 10.1109/MM.2006.82 54, 55, 61
[17] S. M. Blackburn, P. Cheng, and K. S. McKinley. Myths and realities: The performance
impact of garbage collection. In Proceedings of the International Conference on Measurements
and Modeling of Computer Systems (SIGMETRICS), pages 25–36, June 2004. 105
[18] S. M. Blackburn, R. Garner, C. Hoffmann, A. M. Khan, K. S. McKinley, R. Bentzur, A. Diwan,
D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. L. Hosking, M. Jump, H. B. Lee, J. Eliot
B. Moss, A. Phansalkar, D. Stefanovic, T. VanDrunen, D. von Dincklage, and B. Wiedermann.
The DaCapo benchmarks: Java benchmarking development and analysis. In Proceedings of the
Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and
Applications (OOPSLA), pages 169–190, October 2006. 16, 27, 105
[19] P. Bohrer, J. Peterson, M. Elnozahy, R. Rajamony, A. Gheith, R. Rockhold, C. Lefurgy,
H. Shafi, T. Nakra, R. Simpson, E. Speight, K. Sudeep, E. Van Hensbergen, and L. Zhang.
Mambo: a full system simulator for the PowerPC architecture. ACM SIGMETRICS Perfor-
mance Evaluation Review, 31(4):8–12, March 2004. DOI: 10.1145/1054907.1054910 54
[20] D. C. Burger and T. M. Austin. The SimpleScalar Tool Set. Computer Architec-
ture News, 1997. See also http://www.simplescalar.com for more information.
DOI: 10.1145/268806.268810 52, 61
[23] R. Carl and J. E. Smith. Modeling superscalar processors via statistical simulation. In
Workshop on Performance Analysis and its Impact on Design (PAID), held in conjunction with the
25th Annual International Symposium on Computer Architecture (ISCA), June 1998. 81, 86, 88
[24] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a
chip-multiprocessor architecture. In Proceedings of the Eleventh International Symposium on
High Performance Computer Architecture (HPCA), pages 340–351, February 2005. 7, 91, 93
[26] J. Chen, M. Annavaram, and M. Dubois. SlackSim: A platform for parallel simulation of
CMPs on CMPs. ACM SIGARCH Computer Architecture News, 37(2):20–29, May 2009.
DOI: 10.1145/1577129.1577134 97, 100
[27] X. E. Chen and T. M. Aamodt. Hybrid analytical modeling of pending cache hits, data
prefetching, and MSHRs. In Proceedings of the International Symposium on Microarchitecture
(MICRO), pages 59–70, December 2008. 46
[33] D. Citron. MisSPECulation: Partial and misleading use of SPEC CPU2000 in computer ar-
chitecture conferences. In Proceedings of the 30th Annual International Symposium on Computer
Architecture (ISCA), pages 52–59, June 2003. DOI: 10.1109/ISCA.2003.1206988 17
[34] B. Cmelik and D. Keppel. SHADE: A fast instruction-set simulator for execution profiling.
In Proceedings of the 1994 ACM SIGMETRICS Conference on Measurement and Modeling of
Computer Systems, pages 128–137, May 1994. DOI: 10.1145/183018.183032 51
[35] T. M. Conte, M. A. Hirsch, and W. W. Hwu. Combining trace sampling with single pass
methods for efficient cache simulation. IEEE Transactions on Computers, 47(6):714–720, June
1998. DOI: 10.1109/12.689650 54, 76
[36] T. M. Conte, M. A. Hirsch, and K. N. Menezes. Reducing state loss for effective trace
sampling of superscalar processors. In Proceedings of the International Conference on Computer
Design (ICCD), pages 468–477, October 1996. DOI: 10.1109/ICCD.1996.563595 64, 76
[37] H. G. Cragon. Computer Architecture and Implementation. Cambridge University Press, 2000.
11
[38] P. Crowley and J.-L. Baer. Trace sampling for desktop applications on Windows NT. In
Proceedings of the First Workshop on Workload Characterization (WWC) held in conjunction
with the 31st ACM/IEEE Annual International Symposium on Microarchitecture (MICRO),
November 1998. 74
[39] P. Crowley and J.-L. Baer. On the use of trace sampling for architectural studies of
desktop applications. In Proceedings of the 1999 ACM SIGMETRICS International Con-
ference on Measurement and Modeling of Computer Systems, pages 208–209, June 1999.
DOI: 10.1145/301453.301573 74
[43] M. Durbhakula, V. S. Pai, and S. V. Adve. Improving the accuracy vs. speed tradeoff for
simulating shared-memory multiprocessors with ILP processors. In Proceedings of the Fifth
International Symposium on High-Performance Computer Architecture (HPCA), pages 23–32,
January 1999. DOI: 10.1109/HPCA.1999.744317 71
[44] J. Edler and M. D. Hill. Dinero IV trace-driven uniprocessor cache simulator. Available
through http://www.cs.wisc.edu/~markhill/DineroIV, 1998. 54
[45] L. Eeckhout, R. H. Bell Jr., B. Stougie, K. De Bosschere, and L. K. John. Control flow
modeling in statistical simulation for accurate and efficient processor design studies. In
Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA), pages
350–361, June 2004. 87
[46] L. Eeckhout and K. De Bosschere. Hybrid analytical-statistical modeling for efficiently ex-
ploring architecture and workload design spaces. In Proceedings of the 2001 International
Conference on Parallel Architectures and Compilation Techniques (PACT), pages 25–34, Septem-
ber 2001. DOI: 10.1109/PACT.2001.953285 84
[47] L. Eeckhout, K. De Bosschere, and H. Neefs. Performance analysis through synthetic trace
generation. In The IEEE International Symposium on Performance Analysis of Systems and
Software (ISPASS), pages 1–6, April 2000. 86, 88
[48] L. Eeckhout, A. Georges, and K. De Bosschere. How Java programs interact with virtual
machines at the microarchitectural level. In Proceedings of the 18th Annual ACM SIGPLAN
Conference on Object-Oriented Programming, Languages, Applications and Systems (OOPSLA),
pages 169–186, October 2003. 27, 105
[49] L. Eeckhout, Y. Luo, K. De Bosschere, and L. K. John. BLRL: Accurate and efficient
warmup for sampled processor simulation. The Computer Journal, 48(4):451–459, May 2005.
DOI: 10.1093/comjnl/bxh103 75
[57] J. Emer, P. Ahuja, E. Borch, A. Klauser, C.-K. Luk, S. Manne, S. S. Mukherjee, H. Patil,
S. Wallace, N. Binkert, R. Espasa, and T. Juan. Asim: A performance model framework.
IEEE Computer, 35(2):68–76, February 2002. 55, 56, 61
[58] J. Emer, C. Beckmann, and M. Pellauer. AWB:The Asim architect’s workbench. In Proceedings
of the Third Annual Workshop on Modeling, Benchmarking and Simulation (MoBS), held in
conjunction with ISCA, June 2007. 61
[61] S. Eyerman and L. Eeckhout. System-level performance metrics for multi-program work-
loads. IEEE Micro, 28(3):42–53, May/June 2008. DOI: 10.1109/MM.2008.44 8, 11
[62] S. Eyerman, L. Eeckhout, and K. De Bosschere. Efficient design space exploration of high
performance embedded out-of-order processors. In Proceedings of the 2006 Conference on
Design Automation and Test in Europe (DATE), pages 351–356, March 2006. 104
[65] S. Eyerman, J. E. Smith, and L. Eeckhout. Characterizing the branch misprediction
penalty. In IEEE International Symposium on Performance Analysis of Systems and Software
(ISPASS), pages 48–58, March 2006. 40
[66] A. Falcón, P. Faraboschi, and D. Ortega. An adaptive synchronization technique for par-
allel simulation of networked clusters. In Proceedings of the IEEE International Sympo-
sium on Performance Analysis of Systems and Software (ISPASS), pages 22–31, April 2008.
DOI: 10.1109/ISPASS.2008.4510735 100
[68] P. J. Fleming and J. J. Wallace. How not to lie with statistics: The correct way to sum-
marize benchmark results. Communications of the ACM, 29(3):218–221, March 1986.
DOI: 10.1145/5666.5673 11
[69] R. M. Fujimoto. Parallel discrete event simulation. Communications of the ACM, 33(10):30–53,
October 1990. DOI: 10.1145/84537.84545 99
[70] R. M. Fujimoto and W. B. Campbell. Direct execution models of processor behavior and per-
formance. In Proceedings of the 19th Winter Simulation Conference, pages 751–758, December
1987. 71
[71] D. Genbrugge and L. Eeckhout. Memory data flow modeling in statistical simulation for
the efficient exploration of microprocessor design spaces. IEEE Transactions on Computers,
57(10):41–54, January 2007. 88
[72] D. Genbrugge and L. Eeckhout. Chip multiprocessor design space exploration through
statistical simulation. IEEE Transactions on Computers, 58(12):1668–1681, December 2009.
DOI: 10.1109/TC.2009.77 90, 91
[73] D. Genbrugge, S. Eyerman, and L. Eeckhout. Interval simulation: Raising the level of
abstraction in architectural simulation. In Proceedings of the International Symposium on High-
Performance Computer Architecture (HPCA), pages 307–318, January 2010. 45
[74] A. Georges, D. Buytaert, and L. Eeckhout. Statistically rigorous Java performance evaluation.
In Proceedings of the Annual ACM SIGPLAN Conference on Object-Oriented Programming,
Languages, Applications and Systems (OOPSLA), pages 57–76, October 2007. 105
[75] S. Girbal, G. Mouchard, A. Cohen, and O. Temam. DiST: A simple, reliable and scalable
method to significantly reduce processor architecture simulation time. In Proceedings of the
2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer
Systems, pages 1–12, June 2003. DOI: 10.1145/781027.781029 95
[76] A. Glew. MLP yes! ILP no! In ASPLOS Wild and Crazy Idea Session, October 1998. 43
[77] G. Hamerly, E. Perelman, J. Lau, and B. Calder. SimPoint 3.0: Faster and more flexible
program analysis. Journal of Instruction-Level Parallelism, 7, September 2005. 70
[78] A. Hartstein and T. R. Puzak. The optimal pipeline depth for a microprocessor. In Proceedings
of the 29th Annual International Symposium on Computer Architecture (ISCA), pages 7–13, May
2002. DOI: 10.1109/ISCA.2002.1003557 46
[79] J. W. Haskins Jr. and K. Skadron. Accelerated warmup for sampled microarchitecture simu-
lation. ACM Transactions on Architecture and Code Optimization (TACO), 2(1):78–108, March
2005. DOI: 10.1145/1061267.1061272 75
[81] J. L. Henning. SPEC CPU2000: Measuring CPU performance in the new millennium.
IEEE Computer, 33(7):28–35, July 2000. 17
[82] M. D. Hill and M. R. Marty. Amdahl’s law in the multicore era. IEEE Computer, 41(7):33–38,
July 2008. 31
[83] M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Transactions on
Computers, 38(12):1612–1630, December 1989. DOI: 10.1109/12.40842 54, 87
[84] S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and
thread-level parallelism awareness. In Proceedings of the International Symposium on Computer
Architecture (ISCA), pages 152–163, June 2008. 46
[86] C. Hsieh and M. Pedram. Micro-processor power estimation using profile-driven program
synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
17(11):1080–1089, November 1998. DOI: 10.1109/43.736182 93
[87] C. Hughes and T. Li. Accelerating multi-core processor design space evaluation us-
ing automatic multi-threaded workload synthesis. In Proceedings of the IEEE Interna-
tional Symposium on Workload Characterization (IISWC), pages 163–172, September 2008.
DOI: 10.1109/IISWC.2008.4636101 81, 92
[88] C. J. Hughes, V. S. Pai, P. Ranganathan, and S. V. Adve. Rsim: Simulating shared-memory
multiprocessors with ILP processors. IEEE Computer, 35(2):40–49, February 2002. 55
[90] V. S. Iyengar and L. H. Trevillyan. Evaluation and generation of reduced traces for bench-
marks. Technical Report RC 20610, IBM Research Division, T. J. Watson Research Center,
October 1996. 93
[91] V. S. Iyengar, L. H. Trevillyan, and P. Bose. Representative traces for processor models with
infinite cache. In Proceedings of the Second International Symposium on High-Performance Com-
puter Architecture (HPCA), pages 62–73, February 1996. DOI: 10.1109/HPCA.1996.501174
70, 93
[92] R. K. Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental
Design, Measurement, Simulation, and Modeling. Wiley, 1991. xi
[93] L. K. John. More on finding a single number to indicate overall performance of a benchmark
suite. ACM SIGARCH Computer Architecture News, 32(4):1–14, September 2004. 11, 12
[94] L. K. John and L. Eeckhout, editors. Performance Evaluation and Benchmarking. CRC Press,
Taylor and Francis, 2006. xi
[95] E. E. Johnson, J. Ha, and M. B. Zaidi. Lossless trace compression. IEEE Transactions on
Computers, 50(2):158–173, February 2001. DOI: 10.1109/12.908991 54
[96] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall,
fifth edition, 2002. 18, 23
[97] P. J. Joseph, K. Vaswani, and M. J. Thazhuthaveetil. Construction and use of linear regression
models for processor performance analysis. In Proceedings of the 12th International Symposium
on High-Performance Computer Architecture (HPCA), pages 99–108, February 2006. 32, 34,
35
[99] A. Joshi, A. Phansalkar, L. Eeckhout, and L. K. John. Measuring benchmark similarity using
inherent program characteristics. IEEE Transactions on Computers, 55(6):769–782, June 2006.
DOI: 10.1109/TC.2006.85 23, 62
[100] A. M. Joshi, L. Eeckhout, R. Bell, Jr., and L. K. John. Distilling the essence of proprietary
workloads into miniature benchmarks. ACM Transactions on Architecture and Code Optimiza-
tion (TACO), 5(2), August 2008. 93
[101] A. M. Joshi, L. Eeckhout, L. K. John, and C. Isen. Automated microprocessor stressmark
generation. In Proceedings of the International Symposium on High-Performance Computer
Architecture (HPCA), pages 229–239, February 2008. 84, 93
[102] T. Karkhanis and J. E. Smith. A day in the life of a data cache miss. In Proceedings of the 2nd
Annual Workshop on Memory Performance Issues (WMPI) held in conjunction with ISCA, May
2002. 42, 43, 88
[103] T. Karkhanis and J. E. Smith. A first-order superscalar processor model. In Proceedings of the
31st Annual International Symposium on Computer Architecture (ISCA), pages 338–349, June
2004. 43, 45
[104] T. Karkhanis and J. E. Smith. Automated design of application specific superscalar processors:
An analytical approach. In Proceedings of the 34th Annual International Symposium on Computer
Architecture (ISCA), pages 402–411, June 2007. DOI: 10.1145/1250662.1250712 45, 104
[105] K. Keeton, D. A. Patterson, Y. Q. He, R. C. Raphael, and W. E. Baker. Performance character-
ization of a quad Pentium Pro SMP using OLTP workloads. In Proceedings of the International
Symposium on Computer Architecture (ISCA), pages 15–26, June 1998. 15
[106] R. E. Kessler, M. D. Hill, and D. A. Wood. A comparison of trace-sampling techniques
for multi-megabyte caches. IEEE Transactions on Computers, 43(6):664–675, June 1994.
DOI: 10.1109/12.286300 74, 75
[107] S. Kluyskens and L. Eeckhout. Branch predictor warmup for sampled simulation through
branch history matching. Transactions on High-Performance Embedded Architectures and Com-
pilers (HiPEAC), 2(1):42–61, January 2007. 76
[108] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded SPARC
processor. IEEE Micro, 25(2):21–29, March/April 2005. DOI: 10.1109/MM.2005.35 7
[109] V. Krishnan and J. Torrellas. A direct-execution framework for fast and accurate sim-
ulation of superscalar processors. In Proceedings of the 1998 International Conference on
Parallel Architectures and Compilation Techniques (PACT), pages 286–293, October 1998.
DOI: 10.1109/PACT.1998.727263 71
[110] T. Kuhn. The Structure of Scientific Revolutions. University Of Chicago Press, 1962. 1
[111] B. Kumar and E. S. Davidson. Performance evaluation of highly concurrent computers by
deterministic simulation. Communications of the ACM, 21(11):904–913, November 1978.
DOI: 10.1145/359642.359646 81
[112] T. Lafage and A. Seznec. Choosing representative slices of program execution for microar-
chitecture simulations: A preliminary application to the data stream. In IEEE 3rd Annual
Workshop on Workload Characterization (WWC-2000) held in conjunction with the International
Conference on Computer Design (ICCD), September 2000. 67
[113] S. Laha, J. H. Patel, and R. K. Iyer. Accurate low-cost methods for performance evaluation
of cache memory systems. IEEE Transactions on Computers, 37(11):1325–1336, November
1988. DOI: 10.1109/12.8699 64
[114] J. R. Larus and E. Schnarr. EEL: Machine-independent executable editing. In Proceedings of
the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI),
pages 291–300, June 1995. 51
[115] J. Lau, E. Perelman, and B. Calder. Selecting software phase markers with code structure
analysis. In Proceedings of the International Symposium on Code Generation and Optimization
(CGO), pages 135–146, March 2006. DOI: 10.1109/CGO.2006.32 68
[116] J. Lau, J. Sampson, E. Perelman, G. Hamerly, and B. Calder. The strong correlation be-
tween code signatures and performance. In Proceedings of the International Symposium
on Performance Analysis of Systems and Software (ISPASS), pages 236–247, March 2005.
DOI: 10.1109/ISPASS.2005.1430578 68
[117] J. Lau, S. Schoenmackers, and B. Calder. Structures for phase classification. In Proceedings of
the 2004 International Symposium on Performance Analysis of Systems and Software (ISPASS),
pages 57–67, March 2004. DOI: 10.1109/ISPASS.2004.1291356 68
[118] G. Lauterbach. Accelerating architectural simulation by parallel execution of trace samples.
Technical Report SMLI TR-93-22, Sun Microsystems Laboratories Inc., December 1993.
70, 76, 95
[119] B. Lee and D. Brooks. Accurate and efficient regression modeling for microarchitectural
performance and power prediction. In Proceedings of the Twelfth International Conference
on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages
185–194, October 2006. 35
[120] B. Lee and D. Brooks. Efficiency trends and limits from comprehensive microarchitec-
tural adaptivity. In Proceedings of the 13th International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS), pages 36–47, March 2008.
DOI: 10.1145/1346281.1346288 31, 36, 104
[121] B. Lee, D. Brooks, B. R. de Supinski, M. Schulz, K. Singh, and S. A. McKee. Methods of
inference and learning for performance modeling of parallel applications. In Proceedings of the
12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP),
pages 249–258, March 2007. 36
[122] B. Lee, J. Collins, H. Wang, and D. Brooks. CPR: Composable performance regression for
scalable multiprocessor models. In Proceedings of the 41st Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO), pages 270–281, November 2008. 36
[123] B. C. Lee and D. M. Brooks. Illustrative design space studies with microarchitectural re-
gression models. In Proceedings of the International Symposium on High Performance Computer
Architecture (HPCA), pages 340–351, February 2007. 36
[125] K. M. Lepak, H. W. Cain, and M. H. Lipasti. Redeeming IPC as a performance metric for
multithreaded programs. In Proceedings of the International Conference on Parallel Architectures
and Compilation Techniques (PACT), pages 232–243, September 2003. 59
[127] M. H. Lipasti, C. B. Wilkerson, and J. P. Shen. Value locality and load value prediction. In
Proceedings of the International Conference on Architectural Support for Programming Languages
and Operating Systems (ASPLOS), pages 138–147, October 1996. 30
[128] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and
K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumenta-
tion. In Proceedings of the ACM SIGPLAN Conference on Programming Languages Design and
Implementation (PLDI), pages 190–200, June 2005. DOI: 10.1145/1065010.1065034 22, 51
[129] K. Luo, J. Gummaraju, and M. Franklin. Balancing throughput and fairness in SMT pro-
cessors. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems
and Software (ISPASS), pages 164–171, November 2001. 9, 10
[130] Y. Luo and L. K. John. Efficiently evaluating speedup using sampled processor simulation.
Computer Architecture Letters, 4, September 2004. 70
[131] Y. Luo, L. K. John, and L. Eeckhout. SMA: A self-monitored adaptive warmup scheme
for microprocessor simulation. International Journal on Parallel Programming, 33(5):561–581,
October 2005. DOI: 10.1007/s10766-005-7305-9 75
[134] J. R. Mashey. War of the benchmark means: Time for a truce. ACM SIGARCH Computer
Architecture News, 32(4):1–14, September 2004. DOI: 10.1145/1040136.1040137 11, 13
[135] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage
hierarchies. IBM Systems Journal, 9(2):78–117, June 1970. DOI: 10.1147/sj.92.0078 54, 87
[139] D. Mihocka and S. Schwartsman. Virtualization without direct execution or jitting: Designing
a portable virtual machine infrastructure. In Proceedings of the Workshop on Architectural and
Microarchitectural Support for Binary Translation, held in conjunction with ISCA, June 2008. 54
[140] J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald III, N. Beckmann, C. Celio, J. Eastep, and
A. Agarwal. Graphite: A distributed parallel simulator for multicores. In Proceedings of the
International Symposium on High Performance Computer Architecture (HPCA), pages 295–306,
January 2010. 97, 100
[144] S. Narayanasamy, C. Pereira, H. Patil, R. Cohn, and B. Calder. Automatic logging of oper-
ating system effects to guide application level architecture simulation. In Proceedings of the
ACM Sigmetrics International Conference on Measurement and Modeling of Computer Systems
(SIGMETRICS), pages 216–227, June 2006. 53
[146] A.-T. Nguyen, P. Bose, K. Ekanadham, A. Nanda, and M. Michael. Accuracy and speed-up of
parallel trace-driven architectural simulation. In Proceedings of the 11th International Parallel
Processing Symposium (IPPS), pages 39–44, April 1997. DOI: 10.1109/IPPS.1997.580842 95
[147] A. Nohl, G. Braun, O. Schliebusch, R. Leupers, and H. Meyr. A universal technique for
fast and flexible instruction-set architecture simulation. In Proceedings of the 39th Design
Automation Conference (DAC), pages 22–27, June 2002. DOI: 10.1145/513918.513927 72
[149] S. Nussbaum and J. E. Smith. Modeling superscalar processors via statistical simulation.
In Proceedings of the 2001 International Conference on Parallel Architectures and Compilation
Techniques (PACT), pages 15–24, September 2001. DOI: 10.1109/PACT.2001.953284 81,
86, 88
[151] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K.-Y. Chang. The case for a single-
chip multiprocessor. In Proceedings of the International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), pages 2–11, October 1996. 7
[152] M. Oskin, F. T. Chong, and M. Farrens. HLS: Combining statistical and symbolic simulation
to guide microprocessor design. In Proceedings of the 27th Annual International Symposium
on Computer Architecture (ISCA), pages 71–82, June 2000. DOI: 10.1145/339647.339656 81,
83, 86, 88
[153] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi. Pinpointing repre-
sentative portions of large Intel Itanium programs with dynamic instrumentation. In Proceed-
ings of the 37th Annual International Symposium on Microarchitecture (MICRO), pages 81–93,
December 2004. 70, 74
[154] M. Pellauer, M. Vijayaraghavan, M. Adler, Arvind, and J. S. Emer. Quick performance
models quickly: Closely-coupled partitioned simulation on FPGAs. In Proceedings of the
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS),
pages 1–10, April 2008. DOI: 10.1109/ISPASS.2008.4510733 102
[155] D. A. Penry, D. Fay, D. Hodgdon, R. Wells, G. Schelle, D. I. August, and D. Connors.
Exploiting parallelism and structure to accelerate the simulation of chip multi-processors. In
Proceedings of the Twelfth International Symposium on High Performance Computer Architecture
(HPCA), pages 27–38, February 2006. 97, 102
[156] C. Pereira, H. Patil, and B. Calder. Reproducible simulation of multi-threaded work-
loads for architecture design space exploration. In Proceedings of the IEEE Interna-
tional Symposium on Workload Characterization (IISWC), pages 173–182, September 2008.
DOI: 10.1109/IISWC.2008.4636102 59
[157] E. Perelman, G. Hamerly, and B. Calder. Picking statistically valid and early simulation points.
In Proceedings of the 12th International Conference on Parallel Architectures and Compilation
Techniques (PACT), pages 244–256, September 2003. 68
[158] E. Perelman, J. Lau, H. Patil, A. Jaleel, G. Hamerly, and B. Calder. Cross binary simulation
points. In Proceedings of the Annual International Symposium on Performance Analysis of Systems
and Software (ISPASS), March 2007. 70
[159] D. G. Perez, G. Mouchard, and O. Temam. MicroLib: A case for the quantitative comparison
of micro-architecture mechanisms. In Proceedings of the 37th Annual International Symposium
on Microarchitecture (MICRO), pages 43–54, December 2004. 61
[160] A. Phansalkar, A. Joshi, and L. K. John. Analysis of redundancy and application balance in
the SPEC CPU2006 benchmark suite. In Proceedings of the Annual International Symposium
on Computer Architecture (ISCA), pages 412–423, June 2007. 21, 25, 27, 62, 103, 104
[161] J. Rattner. Electronics in the internet age. Keynote at the International Conference on Parallel
Architectures and Compilation Techniques (PACT), September 2001. 5
[162] J. Reilly. Evolve or die: Making SPEC's CPU suite relevant today and tomorrow. IEEE
International Symposium on Workload Characterization (IISWC), October 2006. Invited
presentation. DOI: 10.1109/IISWC.2006.302735 21
[163] S. K. Reinhardt, M. D. Hill, J. R. Larus, A. R. Lebeck, J. C. Lewis, and D. A. Wood. The
Wisconsin Wind Tunnel: Virtual prototyping of parallel computers. In Proceedings of the ACM
SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 48–60,
May 1993. DOI: 10.1145/166955.166979 71, 97, 100
[164] M. Reshadi, P. Mishra, and N. D. Dutt. Instruction set compiled simulation: a technique
for fast and flexible instruction set simulation. In Proceedings of the 40th Design Automation
Conference (DAC), pages 758–763, June 2003. DOI: 10.1145/775832.776026 72
[167] M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod. Using the SimOS machine simulator
to study complex computer systems. ACM Transactions on Modeling and Computer Simulation
(TOMACS), 7(1):78–103, January 1997. DOI: 10.1145/244804.244807 53
[168] E. Schnarr and J. R. Larus. Fast out-of-order processor simulation using memoiza-
tion. In Proceedings of the Eighth International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), pages 283–294, October 1998.
DOI: 10.1145/291069.291063 71
[169] J. P. Shen and M. H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors.
McGraw-Hill, 2007. 5
[170] T. Sherwood, E. Perelman, and B. Calder. Basic block distribution analysis to find periodic
behavior and simulation points in applications. In Proceedings of the International Conference
on Parallel Architectures and Compilation Techniques (PACT), pages 3–14, September 2001.
DOI: 10.1109/PACT.2001.953283 68
[186] M. Van Biesbrouck, L. Eeckhout, and B. Calder. Considering all starting points for simulta-
neous multithreading simulation. In Proceedings of the International Symposium on Performance
Analysis of Systems and Software (ISPASS), pages 143–153, March 2006. 78
[188] M. Van Biesbrouck, T. Sherwood, and B. Calder. A co-phase matrix to guide si-
multaneous multithreading simulation. In Proceedings of the International Symposium
on Performance Analysis of Systems and Software (ISPASS), pages 45–56, March 2004.
DOI: 10.1109/ISPASS.2004.1291355 78, 91
[189] T. F. Wenisch, R. E. Wunderlich, B. Falsafi, and J. C. Hoe. Simulation sampling with live-
points. In Proceedings of the Annual International Symposium on Performance Analysis of Systems
and Software (ISPASS), pages 2–12, March 2006. 73, 76
[191] E. Witchell and M. Rosenblum. Embra: Fast and flexible machine simulation. In Proceedings
of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems,
pages 68–79, June 1996. 51, 54, 71
[192] D. A. Wood, M. D. Hill, and R. E. Kessler. A model for estimating trace-sample miss ratios.
In Proceedings of the 1991 SIGMETRICS Conference on Measurement and Modeling of Computer
Systems, pages 79–89, May 1991. DOI: 10.1145/107971.107981 75
[196] J. J. Yi, D. J. Lilja, and D. M. Hawkins. A statistically rigorous approach for im-
proving simulation methodology. In Proceedings of the Ninth International Symposium
on High Performance Computer Architecture (HPCA), pages 281–291, February 2003.
DOI: 10.1109/HPCA.2003.1183546 27, 34
[197] J. J. Yi, R. Sendag, L. Eeckhout, A. Joshi, D. J. Lilja, and L. K. John. Evaluating benchmark
subsetting approaches. In Proceedings of the 2006 IEEE International Symposium on Workload
Characterization (IISWC), pages 93–104, October 2006. DOI: 10.1109/IISWC.2006.302733
29
[198] J. J. Yi, H. Vandierendonck, L. Eeckhout, and D. J. Lilja. The exigency of benchmark and
compiler drift: Designing tomorrow’s processors with yesterday’s tools. In Proceedings of
the 20th ACM International Conference on Supercomputing (ICS), pages 75–86, June 2006.
DOI: 10.1145/1183401.1183414 17
[199] M. T. Yourst. PTLsim: A cycle accurate full system x86-64 microarchitectural simulator.
In Proceedings of the International Symposium on Performance Analysis of Systems and Software
(ISPASS), pages 23–34, April 2007. 55, 61
Author’s Biography
LIEVEN EECKHOUT
Lieven Eeckhout is an Associate Professor at Ghent University, Belgium. His main research interests
include computer architecture and the hardware/software interface in general, and performance
modeling and analysis, simulation methodology, and workload characterization in particular. His
work was twice selected as an IEEE Micro Top Pick (in 2007 and 2010) as one of the previous year's "most significant research publications in computer architecture based on novelty, industry relevance and long-term impact". He has served on a couple dozen program committees; he was the program
chair for ISPASS 2009 and general chair for ISPASS 2010, and he serves as an associate editor
for ACM Transactions on Architecture and Code Optimization. He obtained his Master’s degree
and Ph.D. degree in computer science and engineering from Ghent University in 1998 and 2002,
respectively.