ACA Lec2 New
Measures of reliability and failure tolerance
[Figure: modules B, C, and D with Compare (Agree/Disagree) blocks and a Switch feeding the Output]
Which of these airplanes has the best performance?
Computer Performance: TIME, TIME, TIME
Book's Definition of Performance
o For some program running on machine X,
Execution Time
o There are different measures of execution time in
computer performance.
n Elapsed Time
o counts everything (disk and memory accesses, I/O, etc.)
o a useful number, but often not good for comparison
purposes
n CPU time
o doesn't count I/O or time spent running other programs
o can be broken up into system time, and user time
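The difference between elapsed time and CPU time can be observed directly. The sketch below uses Python's standard `time.perf_counter` (wall-clock time) and `time.process_time` (user plus system CPU time of the current process). The helper names `measure` and `mostly_waiting` are illustrative assumptions, and the sleep only stands in for time spent waiting on I/O.

```python
import time


def measure(fn):
    """Return (elapsed_s, cpu_s) for a single call of fn."""
    wall_start = time.perf_counter()   # elapsed (wall-clock) time
    cpu_start = time.process_time()    # user + system CPU time of this process
    fn()
    cpu_s = time.process_time() - cpu_start
    elapsed_s = time.perf_counter() - wall_start
    return elapsed_s, cpu_s


def mostly_waiting():
    time.sleep(0.2)        # stands in for I/O wait; consumes almost no CPU
    sum(range(100_000))    # a small amount of real computation


elapsed_s, cpu_s = measure(mostly_waiting)
print(f"elapsed time: {elapsed_s:.3f} s")
print(f"CPU time:     {cpu_s:.3f} s")
```

The elapsed time includes the wait, while the CPU time should cover little more than the short computation.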
How to Improve Performance
Instruction Cycles
o Can we assume that
n The number of cycles = number of instructions?
n The number of cycles is proportional to number of instructions?
[Figure: the 1st through 6th instructions laid out along the clock]
Instruction Cycles
n For example:
q Multiply instruction may take more cycles than an Add
instruction.
q Floating-point operations take longer than integer
operations.
q Accessing memory takes more time than accessing
registers.
Example 1
o Our favorite program runs in 10 seconds on computer A,
which has a 400 MHz clock. We are trying to help a
computer designer build a new machine B, that will run this
program in 6 seconds. The designer can use new (or
perhaps more expensive) technology to substantially
increase the clock rate, but has informed us that this
increase will affect the rest of the CPU design, causing
machine B to require 1.2 times as many clock cycles as
machine A for the same program. What clock rate should we
tell the designer to target?
n ANSWER:
Let C be the number of clock cycles required for that
program on machine A.
For A: Ex. time = 10 sec = C × 1/(400 MHz)
For B: Ex. time = 6 sec = (1.2 × C) × 1/clock_rateB
Therefore, clock_rateB = 1.2 × 400 MHz × 10/6 = 800 MHz
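The arithmetic of Example 1 can be checked with a short Python sketch. The function name `required_clock_rate_hz` and its parameter names are illustrative assumptions, not part of the lecture.

```python
def required_clock_rate_hz(time_a_s, clock_a_hz, cycle_ratio, target_time_s):
    """Clock rate machine B needs in order to finish in target_time_s.

    cycle_ratio is how many times more clock cycles B needs than A.
    """
    cycles_a = time_a_s * clock_a_hz     # C = execution time x clock rate
    cycles_b = cycle_ratio * cycles_a    # B needs 1.2 x C cycles
    return cycles_b / target_time_s      # clock rate = cycles / time


rate_b = required_clock_rate_hz(
    time_a_s=10, clock_a_hz=400e6, cycle_ratio=1.2, target_time_s=6
)
print(f"{rate_b / 1e6:.0f} MHz")  # 800 MHz
```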
Cycles Per Instruction
o A given program will require
(some number of machine instructions) × (average CPI) clock cycles
Cycles Per Instruction
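o Average cycles per instruction (CPI)
$$\text{CPI} = \frac{\text{CPU time} \times \text{Clock rate}}{\text{Instruction count}} = \frac{\text{Clock cycles}}{\text{Instruction count}}$$
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
$$\text{CPI} = \sum_{k=1}^{n} \text{CPI}_k \times F_k, \quad \text{where } F_k = \frac{I_k}{\text{Instruction count}}$$
$I_k$ = instruction frequency (the number of class-$k$ instructions executed), so $F_k$ is the fraction of all instructions that belong to class $k$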
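A minimal Python sketch of the two formulas above. The function names and the instruction mix are hypothetical, chosen only to exercise the equations; they do not come from the lecture.

```python
def average_cpi(cpi_by_class, count_by_class):
    """CPI = sum over k of CPI_k * F_k, where F_k = I_k / instruction count."""
    instruction_count = sum(count_by_class.values())
    return sum(
        cpi_by_class[k] * (count_by_class[k] / instruction_count)
        for k in count_by_class
    )


def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = instructions x (cycles / instruction) x (seconds / cycle)."""
    return instruction_count * cpi * (1.0 / clock_rate_hz)


# Hypothetical instruction mix (assumption, for illustration only).
cpi_by_class = {"alu": 1, "load": 2, "branch": 3}
count_by_class = {"alu": 50, "load": 30, "branch": 20}

cpi = average_cpi(cpi_by_class, count_by_class)
cpu_time = cpu_time_seconds(1_000_000, cpi, 100e6)
print(f"average CPI = {cpi:.2f}")
print(f"CPU time    = {cpu_time:.4f} s")
```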
Example 2
o A compiler designer is deciding between 2 codes for a particular
machine. Based on the hardware implementation, there are 3
classes of instructions: Class A, Class B, and Class C, and they
require 1, 2, and 3 cycles respectively.
o First code has 5 instructions: 2 of A, 1 of B, and 2 of C.
Second code has 6 instructions: 4 of A, 1 of B, and 1 of C.
o Which code is faster? By how much?
o What is the (average) CPI for each code?
n ANSWER:
Let T be the cycle time.
Execution time(code1) = (2×1 + 1×2 + 2×3) × T = 10T
Execution time(code2) = (4×1 + 1×2 + 1×3) × T = 9T
Execution time(code1) / Execution time(code2) = 10/9 ≈ 1.11, so code 2 is faster
CPI(code1) = 10/5 = 2
CPI(code2) = 9/6 = 1.5
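Example 2 can be reproduced in a few lines of Python. The dictionary layout and the names `total_cycles`, `code1`, and `code2` are illustrative assumptions; the cycle counts per class and the instruction mixes are the ones given in the example.

```python
# Cycles required by each instruction class (from the example).
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def total_cycles(mix):
    """Sum of (cycles per class) x (instructions of that class)."""
    return sum(CYCLES_PER_CLASS[cls] * n for cls, n in mix.items())


code1 = {"A": 2, "B": 1, "C": 2}   # 5 instructions
code2 = {"A": 4, "B": 1, "C": 1}   # 6 instructions

cycles1, cycles2 = total_cycles(code1), total_cycles(code2)
cpi1 = cycles1 / sum(code1.values())
cpi2 = cycles2 / sum(code2.values())

print(cycles1, cycles2)             # 10 9
print(cpi1, cpi2)                   # 2.0 1.5
print(f"{cycles1 / cycles2:.2f}")   # 1.11
```

The code with more instructions is the faster one here, because its instructions are cheaper on average.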
Example 3
o Suppose we have 2 implementations of the same ISA, and a
program is run on these 2 machines.
o Machine A has a clock cycle time of 10 ns and a CPI of 2.0.
Machine B has a clock cycle time of 20 ns and a CPI of 1.2.
o Which machine is faster for this program? By how much?
n ANSWER:
Let N be the number of instructions.
Machine A: Execution time = N × 2.0 × 10 ns = 20N ns
Machine B: Execution time = N × 1.2 × 20 ns = 24N ns
Machine A is faster, by a factor of 24N / 20N = 1.2.
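Because both machines execute the same N instructions, only the time per instruction (CPI × clock cycle time) matters. A small sketch, with the function name `time_per_instruction_ns` as an illustrative assumption:

```python
def time_per_instruction_ns(cpi, cycle_time_ns):
    """Average time per instruction = CPI x clock cycle time."""
    return cpi * cycle_time_ns


t_a = time_per_instruction_ns(cpi=2.0, cycle_time_ns=10)
t_b = time_per_instruction_ns(cpi=1.2, cycle_time_ns=20)

faster = "A" if t_a < t_b else "B"
ratio = max(t_a, t_b) / min(t_a, t_b)
print(f"A: {t_a:.0f} ns, B: {t_b:.0f} ns per instruction")
print(f"Machine {faster} is faster by a factor of {ratio:.1f}")
```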
Example 4
o You are given 2 machine designs M1 and M2 for performance
benchmarking. Both M1 and M2 have the same ISA, but different
hardware implementations and compilers. Assuming that the clock
cycle times for M1 and M2 are the same, performance study gives
the following measurements for the 2 designs.
Instruction   For M1                          For M2
class         CPI   No. of instructions       CPI   No. of instructions
                    executed                        executed
A             1     3,000,000,000,000         2     2,700,000,000,000
B             2     2,000,000,000,000         3     1,800,000,000,000
C             3     2,000,000,000,000         3     1,800,000,000,000
D             4     1,000,000,000,000         2       900,000,000,000
Example 4
a) What is the CPI for each machine?
Let Y = 1,000,000,000,000
CPI(M1) = (3Y*1 + 2Y*2 + 2Y*3 + Y*4) / (3Y + 2Y + 2Y + Y)
= 17Y / 8Y = 2.125
CPI(M2) = (2.7Y*2 + 1.8Y*3 + 1.8Y*3 + 0.9Y*2) /
(2.7Y+1.8Y+1.8Y+0.9Y)
= 18Y / 7.2Y = 2.5
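The CPI calculation in part a) can be checked against the table with a short sketch. The data layout and function names are illustrative assumptions; the numbers are those in the table.

```python
# Per instruction class: (CPI, number of instructions executed), from the table.
M1 = {"A": (1, 3_000_000_000_000), "B": (2, 2_000_000_000_000),
      "C": (3, 2_000_000_000_000), "D": (4, 1_000_000_000_000)}
M2 = {"A": (2, 2_700_000_000_000), "B": (3, 1_800_000_000_000),
      "C": (3, 1_800_000_000_000), "D": (2, 900_000_000_000)}


def total_cycles(machine):
    return sum(cpi * count for cpi, count in machine.values())


def instruction_count(machine):
    return sum(count for _, count in machine.values())


def average_cpi(machine):
    return total_cycles(machine) / instruction_count(machine)


print(average_cpi(M1))  # 2.125
print(average_cpi(M2))  # 2.5
```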
Example 4
c) To further improve the performance of the machines, a new
compiler technique is introduced. The compiler can simply
eliminate all class D instructions from the benchmark program
without any side effects. (That is, there is no change to the number
of class A, B and C instructions executed in the 2 machines.) With
this new technique, which machine is faster? By how much?
Example 4
d) Alternatively, to further improve the performance of the machines,
a new hardware technique is introduced. The hardware can simply
execute all class D instructions in zero time without any side
effects. (Class D instructions are still executed.) With this
new technique, which machine is faster? By how much?
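No worked answer appears for parts c) and d), so the following is only a sketch of how they could be computed from the table, assuming (as stated) equal clock cycle times so that the ratio of cycle counts equals the ratio of execution times. The function names and the `skip`/`free` parameters are hypothetical.

```python
# Per instruction class: (CPI, number of instructions executed), from the table.
M1 = {"A": (1, 3_000_000_000_000), "B": (2, 2_000_000_000_000),
      "C": (3, 2_000_000_000_000), "D": (4, 1_000_000_000_000)}
M2 = {"A": (2, 2_700_000_000_000), "B": (3, 1_800_000_000_000),
      "C": (3, 1_800_000_000_000), "D": (2, 900_000_000_000)}


def cycles(machine, skip=(), free=()):
    """Total cycles; classes in `skip` are removed, classes in `free` cost 0."""
    return sum(
        (0 if cls in free else cpi) * count
        for cls, (cpi, count) in machine.items()
        if cls not in skip
    )


def instructions(machine, skip=()):
    return sum(count for cls, (_, count) in machine.items() if cls not in skip)


# c) The compiler eliminates all class D instructions.
c1, c2 = cycles(M1, skip=("D",)), cycles(M2, skip=("D",))
cpi_c1 = c1 / instructions(M1, skip=("D",))
cpi_c2 = c2 / instructions(M2, skip=("D",))

# d) The hardware executes class D instructions in zero cycles.
d1, d2 = cycles(M1, free=("D",)), cycles(M2, free=("D",))
cpi_d1 = d1 / instructions(M1)
cpi_d2 = d2 / instructions(M2)

print(f"c) M1 is faster by {c2 / c1:.3f}; CPI {cpi_c1:.3f} vs {cpi_c2:.3f}")
print(f"d) M1 is faster by {d2 / d1:.3f}; CPI {cpi_d1:.3f} vs {cpi_d2:.3f}")
```

Under these assumptions both techniques leave the same cycle counts (13 × 10^12 for M1 and 16.2 × 10^12 for M2), so the execution-time ratio is identical; only the CPI differs, because in d) the class D instructions still count as executed.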
Aspects of CPU Performance
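$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$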
Instr. Set X X X
Organization X X
Circuit Design, VLSI X X
Technology X
Now that we understand cycles
o A given program will require
n some number of instructions (machine instructions)
n some number of cycles
n some number of seconds
Performance
o Performance is determined by execution time
o Do any of the other variables equal performance?
n # of cycles to execute program?
n # of instructions in program?
n # of cycles per second?
n average # of cycles per instruction?
n average # of instructions per second?
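MIPS (million instructions per second) example
o Two different compilers are being tested for a 100 MHz machine
with three different classes of instructions: Class A, Class B, and
Class C, which require one, two, and three cycles (respectively).
Both compilers are used to produce code for a large piece of
software.
o The first compiler's code uses 5 million Class A, 1 million Class B,
and 1 million Class C instructions; the second compiler's code uses
10 million Class A, 1 million Class B, and 1 million Class C
instructions (these are the counts used in the answer).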
MIPS example
o Answer:
n CPU clock cycles (Compiler1) = (5×1 + 1×2 + 1×3) × 10^6 = 10 × 10^6
n CPU clock cycles (Compiler2) = (10×1 + 1×2 + 1×3) × 10^6 = 15 × 10^6
n Ex. time (Compiler1) = CPU clock cycles 1 / clock rate
= (10 × 10^6) / (100 × 10^6) = 0.10 sec
n Ex. time (Compiler2) = CPU clock cycles 2 / clock rate
= (15 × 10^6) / (100 × 10^6) = 0.15 sec
Performance(Compiler1) / Performance(Compiler2) = 0.15 / 0.10 = 1.5
The 1st compiler generates the faster program.
n MIPS1 = (5 + 1 + 1) × 10^6 / (0.10 × 10^6) = 70
n MIPS2 = (10 + 1 + 1) × 10^6 / (0.15 × 10^6) = 80 > MIPS1
The 2nd compiler has the higher MIPS rating! (Pitfall)
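The MIPS pitfall can be reproduced numerically. In this sketch the instruction counts are the ones that appear in the answer's arithmetic (5, 1, 1 million and 10, 1, 1 million); the function and variable names are illustrative assumptions.

```python
CLOCK_RATE_HZ = 100e6                        # 100 MHz machine
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def exec_time_and_mips(counts):
    """Return (execution time in seconds, MIPS rating) for an instruction mix."""
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    exec_time_s = cycles / CLOCK_RATE_HZ
    mips = sum(counts.values()) / (exec_time_s * 1e6)
    return exec_time_s, mips


compiler1 = {"A": 5_000_000, "B": 1_000_000, "C": 1_000_000}
compiler2 = {"A": 10_000_000, "B": 1_000_000, "C": 1_000_000}

time1, mips1 = exec_time_and_mips(compiler1)
time2, mips2 = exec_time_and_mips(compiler2)

print(f"Compiler 1: {time1:.2f} s, {mips1:.0f} MIPS")   # 0.10 s, 70 MIPS
print(f"Compiler 2: {time2:.2f} s, {mips2:.0f} MIPS")   # 0.15 s, 80 MIPS
```

The second compiler's code has the higher MIPS rating but the longer execution time, which is why MIPS alone is not a reliable measure of performance.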
Benchmarks
o Performance best determined by running a real
application
n Use programs typical of expected workload
n Or, typical of expected class of applications
e.g., compilers/editors, scientific applications, graphics, etc.
o Small benchmarks
n nice for architects and designers
n easy to standardize
n can be abused
o SPEC (System Performance Evaluation Cooperative)
n companies have agreed on a set of real programs and inputs
n can still be abused (Intel’s “other” bug)
n valuable indicator of performance (and compiler technology)
SPEC ‘89
o Compiler “enhancements” and performance
[Figure: bar chart of SPEC performance ratio (0 to 800) for each benchmark (gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv), comparing the compiler with the enhanced compiler]
SPEC ‘95
Benchmark   Description
go          Artificial intelligence; plays the game of Go
m88ksim     Motorola 88k chip simulator; runs test program
gcc         The Gnu C compiler generating SPARC code
compress    Compresses and decompresses file in memory
li          Lisp interpreter
ijpeg       Graphic compression and decompression
perl        Manipulates strings and prime numbers in the special-purpose programming language Perl
vortex      A database program
tomcatv     A mesh generation program
swim        Shallow water model with 513 x 513 grid
su2cor      Quantum physics; Monte Carlo simulation
hydro2d     Astrophysics; hydrodynamic Navier-Stokes equations
mgrid       Multigrid solver in 3-D potential field
applu       Parabolic/elliptic partial differential equations
turb3d      Simulates isotropic, homogeneous turbulence in a cube
apsi        Solves problems regarding temperature, wind velocity, and distribution of pollutants
fpppp       Quantum chemistry
wave5       Plasma physics; electromagnetic particle simulation
SPEC ‘95
Does doubling the clock rate double the performance?
Can a machine with a slower clock rate have better
performance?
[Figure: two plots, SPECint and SPECfp (0 to 10) versus clock rate (50 to 250 MHz), each comparing the Pentium and the Pentium Pro]
Amdahl’s Law
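o Make the common case faster
o $$\text{Speedup} = \frac{\text{Perf}_{new}}{\text{Perf}_{old}} = \frac{T_{old}}{T_{new}} = \frac{1}{(1 - f) + \frac{f}{P}}$$
o Performance improvement from using a faster mode is limited by the
fraction of time the faster mode can be applied.
[Figure: T_old divided into (1 - f) and f; T_new divided into (1 - f) and f/P]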
Amdahl’s Law Example
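o New CPU 10X faster
o I/O-bound server, so 60% of the time is spent waiting for I/O (only 40% of the time benefits from the faster CPU)
$$\text{Speedup}_{overall} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}} = \frac{1}{(1 - 0.4) + \frac{0.4}{10}} = \frac{1}{0.64} = 1.56$$
o Apparently, it's human nature to be attracted by 10X
faster, vs. keeping in perspective that it's just 1.6X faster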
Amdahl’s Law Example
o Overall speedup if we make 90% of a program run 10 times
faster.
$$\text{Speedup}_{overall} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}} = \frac{1}{(1 - 0.9) + \frac{0.9}{10}} = \frac{1}{0.19} = 5.26$$
o Overall speedup if we make 80% of a program run 20% faster.
$$\text{Speedup}_{overall} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}} = \frac{1}{(1 - 0.8) + \frac{0.8}{1.2}} = \frac{1}{0.867} = 1.154$$
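The three worked examples above all apply the same formula, so they can be checked with one small function. The name `amdahl_speedup` and its parameter names are illustrative assumptions.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - f) + f / P)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# 40% of the time enhanced by 10X (the I/O-bound server).
print(round(amdahl_speedup(0.4, 10), 2))    # 1.56
# 90% of the program made 10 times faster.
print(round(amdahl_speedup(0.9, 10), 2))    # 5.26
# 80% of the program made 20% faster (speedup of 1.2).
print(round(amdahl_speedup(0.8, 1.2), 2))   # 1.15
```

Even an unbounded speedup of the enhanced part cannot push the overall speedup past 1 / (1 - f); for f = 0.4 that ceiling is about 1.67.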
Amdahl's Law
§ Pitfall: Expecting the improvement of one aspect of a
machine to increase performance by an amount
proportional to the size of the improvement.
§ Example:
§ Suppose a program runs in 100 seconds on a machine, with
multiply operations responsible for 80 seconds of this time.
How much do we have to improve the speed of multiplication if
we want the program to run 4 times faster?
Amdahl's Law
§ Example (continued):
§ How about making it 5 times faster?
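The multiplication example asks the inverse question: given a target overall speedup, how much faster must the affected part become? A sketch under the example's numbers (100 seconds total, 80 seconds of multiplies); the function name `required_speedup` and the choice to return `None` for an unreachable target are assumptions made for illustration.

```python
def required_speedup(total_s, affected_s, overall_speedup):
    """Speedup needed on the affected part to reach the overall speedup.

    Returns None when the target cannot be reached, i.e. when the
    unaffected part alone already uses up the whole target time.
    """
    unaffected_s = total_s - affected_s
    target_s = total_s / overall_speedup
    budget_s = target_s - unaffected_s   # time left for the improved part
    if budget_s <= 0:
        return None
    return affected_s / budget_s


print(required_speedup(100, 80, 4))   # 16.0
print(required_speedup(100, 80, 5))   # None
```

Running 4 times faster requires multiplication to be 16 times faster. Running 5 times faster would need the program to finish in 20 seconds, which is exactly the time taken by the unaffected part, so no finite improvement of multiplication is enough.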
Amdahl's Law
§ This concept is Amdahl’s law: performance is limited by the
portion of the program that is not sped up.
§ Execution time after improvement = Execution time of
unaffected part + (execution time of affected part /
speedup)
§ Corollary of Amdahl’s law: Make the common case fast.
Example 6
§ Suppose we enhance a machine making all floating-
point instructions run five times faster. If the execution
time of some benchmark before the floating-point
enhancement is 12 seconds, what will the speedup be
if half of the 12 seconds is spent executing floating-
point instructions?
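Example 6 follows directly from the rule that execution time after improvement equals the unaffected time plus the affected time divided by its speedup. A minimal sketch, with `time_after_improvement` as a hypothetical helper name:

```python
def time_after_improvement(unaffected_s, affected_s, speedup):
    """Execution time after improvement = unaffected + affected / speedup."""
    return unaffected_s + affected_s / speedup


old_time_s = 12.0
fp_time_s = old_time_s / 2   # half of the time is floating-point work
new_time_s = time_after_improvement(old_time_s - fp_time_s, fp_time_s, 5)
overall_speedup = old_time_s / new_time_s

print(f"new time = {new_time_s:.1f} s")          # 7.2 s
print(f"speedup  = {overall_speedup:.2f}")       # 1.67
```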
Example 7
§ We are looking for a benchmark to show off the new
floating-point unit described in the previous example,
and we want the overall benchmark to show a speedup
of 3. One benchmark we are considering runs for 100
seconds with the old floating-point hardware. How
much of the execution time would floating-point
instructions have to account for in this program in order
to yield our desired speedup on this benchmark?
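Example 7 can be approached by solving Amdahl's law for the enhanced fraction: from S = 1 / ((1 - f) + f / P) it follows that f = (1 - 1/S) / (1 - 1/P). The sketch below assumes the five-times-faster floating-point unit of the previous example; the function name `required_fraction` is hypothetical.

```python
def required_fraction(overall_speedup, speedup_enhanced):
    """Fraction of the original time that must be enhanced.

    Obtained by solving S = 1 / ((1 - f) + f / P) for f.
    """
    return (1 - 1 / overall_speedup) / (1 - 1 / speedup_enhanced)


fraction = required_fraction(overall_speedup=3, speedup_enhanced=5)
fp_seconds = fraction * 100   # the benchmark runs for 100 seconds

print(f"fraction   = {fraction:.3f}")        # 0.833
print(f"FP seconds = {fp_seconds:.1f} s")    # 83.3 s
```

So floating-point instructions would have to account for roughly 83.3 of the 100 seconds.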
Enjoy !!!
Q&A