L3 Performance Measures
Lecture 3
Performance measures
M S Bhat
msbhat@nitk.edu.in
Performance Measures
❑ Benchmark suites (SPEC: Standard Performance Evaluation
Corporation; SPEC CPU2006: 26 programs give one SPEC rating;
ESPRESSO….)
❑ Performance is the result of executing a workload on a configuration
❑ Workload = program + input
❑ Configuration = CPU + cache + memory + I/O + OS + compiler +
optimizations
◼ Compiler optimizations can make a huge difference
❑ Queuing Theory
Benchmark Suites
❑ Performance is measured with benchmark suites: a collection of
programs that are likely relevant to the user
◼ SPEC CPU 2006: cpu-oriented programs (for desktops)
$$\frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution\_time}_Y}{\text{execution\_time}_X} = n$$
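$$\text{CPU time} = \text{Instruction Count} \times \text{Cycles Per Instruction} \times \text{Clock cycle time}$$
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock Cycle}}$$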
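As a minimal sketch of the two relations above, the Python functions below multiply the three CPU-time factors and form the performance ratio n. The function names and the sample figures (1e9 instructions, CPI of 1.6 and 2.2, a 1 ns clock cycle) are illustrative assumptions, not values from the lecture.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time_s):
    """CPU time in seconds = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time_s


def relative_performance(exec_time_x, exec_time_y):
    """n = performance_X / performance_Y = execution_time_Y / execution_time_X."""
    return exec_time_y / exec_time_x


# Hypothetical machines running the same program at the same clock rate.
t_x = cpu_time(1e9, 1.6, 1e-9)  # machine X: CPI 1.6
t_y = cpu_time(1e9, 2.2, 1e-9)  # machine Y: CPI 2.2

# X is n times faster than Y; here n is about 1.375.
print(t_x, t_y, relative_performance(t_x, t_y))
```

Because instruction count and cycle time are equal in this made-up case, the ratio reduces to the ratio of the two CPIs.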
Determinants of CPU Performance
CPU time = Instruction count x CPI x cycle time

                         Instruction count   CPI   Cycle time
Programming language             X            X
Compiler                         X            X
ISA                              X            X        X
Processor organization                        X        X
Technology                                             X
A Simple Example
Op       Freq   CPI_i   Freq x CPI_i
ALU      50%    1       0.5
Load     20%    5       1.0
Store    10%    3       0.3
Branch   20%    2       0.4
Overall effective CPI = 2.2
A Simple Example
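Op       Freq   CPI_i   Freq x CPI_i   Q1
ALU      50%    1       0.5            0.5
Load     20%    5       1.0            0.4
Store    10%    3       0.3            0.3
Branch   20%    2       0.4            0.4
Overall effective CPI = 2.2 (Q1: 1.6)
In the Q1 column the Load contribution falls from 1.0 to 0.4, which corresponds to a load CPI of 2 instead of 5 (0.2 x 2 = 0.4).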
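The effective CPI in the table is a frequency-weighted average, which can be sketched in a few lines of Python. The dictionary layout and function name are illustrative; reading the Q1 column as "load CPI reduced from 5 to 2" is an assumption inferred from the 0.4 entry.

```python
# Instruction mix from the table: class -> (frequency, CPI).
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def effective_cpi(mix):
    """Weighted average CPI: sum of frequency x CPI over instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


base = effective_cpi(mix)

# Q1 column (assumed scenario): load CPI drops from 5 to 2.
q1_mix = dict(mix, Load=(0.20, 2))
q1 = effective_cpi(q1_mix)

# With instruction count and cycle time unchanged, speedup = CPI_old / CPI_new.
print(round(base, 2), round(q1, 2), round(base / q1, 3))  # prints 2.2 1.6 1.375
```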
$$\text{CPU time} = \left( \sum_{i=1}^{n} IC_i \times CPI_i \right) \times \text{Clock cycle time}$$
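The summation above works on per-class instruction counts rather than frequencies. A short sketch, assuming a hypothetical run of 1,000,000 instructions split 50/20/10/20 as in the example mix and a 1 ns clock cycle (both figures are made up for illustration):

```python
def cpu_time_by_class(classes, clock_cycle_time_s):
    """classes: iterable of (instruction_count_i, cpi_i) pairs, one per class."""
    total_cycles = sum(ic * cpi for ic, cpi in classes)
    return total_cycles * clock_cycle_time_s


# Hypothetical counts for ALU, Load, Store, Branch with CPIs 1, 5, 3, 2.
classes = [(500_000, 1), (200_000, 5), (100_000, 3), (200_000, 2)]

# Total cycles = 2,200,000, so the CPU time is about 2.2 ms at 1 ns per cycle.
print(cpu_time_by_class(classes, 1e-9))
```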
• The exact same workload (the four programs execute the same
number of instructions as they did on machine X) is run on a new
machine Y, and the execution times for the programs are 0.8, 1.1,
0.5, and 2 s (total time = 4.4 s)
• For (i) to be true for AM too, P1 must occur 100 times for every
occurrence of P2, i.e., consider the workload P1 x 100 + P2. Now take
the GM for this scenario
• With the above assumption, (ii) is no longer true for AM. Hence, GM
can lead to inconsistencies
GM Example
Computer-A Computer-B Computer-C
P1 1 sec 10 secs 20 secs
P2 1000 secs 100 secs 20 secs
GM 31.62 31.62 20
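The GM row can be reproduced with a short Python sketch; it also prints the arithmetic mean for comparison. The function names are illustrative, and the execution times are the ones in the table.

```python
import math

# Execution times in seconds for P1 and P2, from the GM example table.
times = {
    "Computer-A": [1, 1000],
    "Computer-B": [10, 100],
    "Computer-C": [20, 20],
}


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


for name, t in times.items():
    print(name, round(geometric_mean(t), 2), arithmetic_mean(t), sum(t))
```

Computers A and B get the same GM (about 31.62) even though running P1 and P2 once each takes 1001 s on A and 110 s on B, which is why the GM alone says little about any particular workload.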
$$\frac{\sqrt{Spec_{B1} \times Spec_{B2}}}{\sqrt{Spec_{C1} \times Spec_{C2}}} = \frac{GM_B}{GM_C}$$
Summarizing Performance
GM: Does not require a reference machine, but does not predict
performance very well. So we multiplied execution times and
determined that sys-C is 1.6x faster…but on what workload?
• CPI (cycles per instruction) or IPC (instructions per cycle) cannot
be accurately estimated analytically
An Alternative Perspective - I
• Each program is assumed to run for an equal number of cycles, so
we’re fair to each program
This measure implicitly assumes that 1 instr in prog-A has the same
importance as 1 instr in prog-B
An Alternative Perspective - II
• Each program is assumed to run for an equal number of
instructions, so we’re fair to each program
❑ Limits of improvement
◼ Improvement is limited by how frequent the frequent case is!
Make the Common Case Fast!
Most pervasive principle in design
Common case
◼ Need to validate what is common or uncommon
◼ H/W that isn’t used still costs you
◼ S/W done right that isn’t used probably doesn’t cost you
Exec-time = 1/performance
Depends on 2 factors
◼ Fraction of original computation time that can take advantage
of the enhancement
◼ Level of improvement of the enhancement
Amdahl’s Law
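$$\text{Speedup} = \frac{\text{Execution Time without Enhancement}}{\text{Execution Time with Enhancement}} = \frac{\text{Execution Time}_{old}}{\text{Execution Time}_{new}}$$
$$\text{Execution Time}_{new} = \text{Execution Time}_{old} \times \left[ (1 - \text{Fraction}_{Enhanced}) + \frac{\text{Fraction}_{Enhanced}}{\text{Speedup}_{Enhanced}} \right]$$
$$\text{Overall Speedup} = \frac{1}{(1 - \text{Fraction}_{Enhanced}) + \dfrac{\text{Fraction}_{Enhanced}}{\text{Speedup}_{Enhanced}}}$$
Caution: fraction of what?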
Amdahl’s Law
$S = \text{Speedup}_{Enhanced}$
$F = \text{Fraction}_{Enhanced}$
Amdahl’s Law
❑ Make the Common Case Fast
$$\text{Overall Speedup} = \frac{1}{(1 - \text{Fraction}_{Enhanced}) + \dfrac{\text{Fraction}_{Enhanced}}{\text{Speedup}_{Enhanced}}}$$
Speedup_Enhanced = 20, Fraction_Enhanced = 0.1:
$$\text{Speedup} = \frac{1}{(1 - 0.1) + \dfrac{0.1}{20}} = 1.105$$
VS
Speedup_Enhanced = 1.2, Fraction_Enhanced = 0.9:
$$\text{Speedup} = \frac{1}{(1 - 0.9) + \dfrac{0.9}{1.2}} = 1.176$$
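The comparison above can be checked with a direct transcription of Amdahl's Law into Python. The function and variable names are illustrative; the inputs are the two cases from the slide.

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law: 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# A large speedup on a rare case vs. a small speedup on the common case.
rare = overall_speedup(0.1, 20)     # F = 0.1, S = 20
common = overall_speedup(0.9, 1.2)  # F = 0.9, S = 1.2

print(round(rare, 3), round(common, 3))  # prints 1.105 1.176
```

The modest 1.2x improvement wins because it applies to 90% of the execution time.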
[Figure: total execution time across three generations, split into a Green phase and a Blue phase]
Generation 1: Speedup_Green = 2, Fraction_Green = 1/2
Generation 2: Speedup_Overall = 1.33 over Generation 1; Speedup_Green = 2, Fraction_Green = 1/3
Generation 3: Speedup_Overall = 1.2 over Generation 2
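The diminishing returns across generations can be reproduced by tracking the two phase times directly. This is a sketch under the assumption that Generation 1 spends equal time in each phase (normalized to 0.5 + 0.5) and that only the green phase is sped up, by 2x per generation; the function name is hypothetical.

```python
def generation_speedups(green, blue, green_speedup, steps):
    """Return (fraction_green, overall_speedup) for each successive generation."""
    results = []
    total = green + blue
    for _ in range(steps):
        fraction_green = green / total   # share of time the enhancement touches
        green = green / green_speedup    # only the green phase gets faster
        new_total = green + blue
        results.append((fraction_green, total / new_total))
        total = new_total
    return results


for fraction, speedup in generation_speedups(0.5, 0.5, 2, 2):
    print(round(fraction, 3), round(speedup, 2))
# prints:
# 0.5 1.33
# 0.333 1.2
```

Each 2x improvement of the green phase buys less, because the untouched blue phase makes up a growing share of the total time.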
Simple Example
In an application:
◼ FPSQRT is 20%
◼ Other FP instructions account for 50% of instructions (all FP,
including FPSQRT: 70%)
◼ Other 30%
The following are possible (at the same cost)
◼ Improve FPSQRT by 40x
◼ Improve all FP by 2x
◼ Improve other (other than FP instructions) by 8x
Which one should you do?
And the answer for Speedup is…
$$\text{FPSQRT} = \frac{1}{(1 - 0.2) + \dfrac{0.2}{40}} = 1.242$$
$$\text{FP} = \frac{1}{(1 - 0.7) + \dfrac{0.7}{2}} = 1.538$$
$$\text{OTHER} = \frac{1}{(1 - 0.3) + \dfrac{0.3}{8}} = 1.356$$
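Choosing among equally priced options amounts to evaluating Amdahl's Law for each and taking the maximum. A minimal sketch, using the fractions and speedups from the example (the option labels and names are illustrative):

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law: 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# option -> (fraction of time affected, speedup of that fraction)
options = {
    "FPSQRT 40x": (0.2, 40),
    "all FP 2x": (0.7, 2),   # FPSQRT 20% + other FP 50%
    "other 8x": (0.3, 8),
}

results = {name: overall_speedup(f, s) for name, (f, s) in options.items()}
best = max(results, key=results.get)

for name, value in results.items():
    print(name, round(value, 3))
print("best:", best)  # best: all FP 2x
```

The smallest per-part speedup (2x) gives the largest overall gain because it covers the largest fraction of the execution time.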
Another Amdahl’s Example
Suppose application is “almost all” parallel: 90%
◼ What is the speedup using 10, 100, and 1000
processors?
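# of processors = P, Fraction_enhanced = 0.9
$$\text{Speedup}_P = \frac{1}{0.1 + \dfrac{0.9}{P}}$$
$$\text{Speedup}_{10} = \frac{1}{0.1 + \dfrac{0.9}{10}} = \frac{1}{0.19} = 5.3$$
$$\text{Speedup}_{100} = \frac{1}{0.1 + \dfrac{0.9}{100}} = \frac{1}{0.109} = 9.2$$
$$\text{Speedup}_{1000} = \frac{1}{0.1 + \dfrac{0.9}{1000}} = \frac{1}{0.1009} = 9.9$$
With 100 processors and the serial 10% also sped up by 2x:
$$\text{Speedup}_{100+2} = \frac{1}{\dfrac{0.1}{2} + \dfrac{0.9}{100}} = \frac{1}{0.05 + 0.009} = 16.95$$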
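The processor-count example can be explored with a small Python sketch. The `serial_speedup` parameter is an assumption added here to cover the "100+2" case, in which the serial 10% is itself sped up by 2x; the function name is illustrative.

```python
def parallel_speedup(parallel_fraction, processors, serial_speedup=1.0):
    """Amdahl's Law with the parallel part split across `processors`.

    serial_speedup optionally speeds up the remaining serial part as well.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction / serial_speedup + parallel_fraction / processors)


for p in (10, 100, 1000):
    print(p, round(parallel_speedup(0.9, p), 2))
# prints:
# 10 5.26
# 100 9.17
# 1000 9.91

# 100 processors plus a 2x faster serial part.
print(round(parallel_speedup(0.9, 100, serial_speedup=2), 2))  # prints 16.95
```

With a 10% serial part the speedup can never exceed 1 / 0.1 = 10, however many processors are added; halving the serial part does more than going from 100 to 1000 processors.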
Return on Investment
❑ What does it cost me to add some performance
enhancement?
❑ How much effective performance do I get out of it?
◼ 100% speedup for small fraction of time wasn’t a big win
❑ Static Metrics:
◼ How many bytes does the program occupy in memory?
❑ Dynamic Metrics:
◼ How many instructions are executed? How many bytes does the
processor fetch to execute the program?
◼ How many clocks are required per instruction? CPI
◼ How "lean" a clock is practical?
Best Metric: Time to execute the program!
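Since execution time is the metric that matters in the end, here is a minimal sketch of measuring it in Python with the standard `time.perf_counter` clock. The helper name, the repeat count, and the sum-of-squares workload are hypothetical stand-ins for a real program and input.

```python
import time


def time_workload(workload, *args, repeats=5):
    """Return the best wall-clock time in seconds over several runs of workload(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        workload(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical workload: program = sum of squares, input = n.
def sum_of_squares(n):
    return sum(i * i for i in range(n))


print(time_workload(sum_of_squares, 100_000))
```

Taking the minimum over a few runs is one common way to reduce noise from other activity on the machine; the measured value still depends on the whole configuration (CPU, memory, OS, interpreter).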