Lecture 2: Review of Performance/Cost/Power Metrics and Architectural Basics
Review Lecture 1
• Class Organization
– Class Projects
• Trends in the Industry and Driving Forces
Computer Architecture Topics

[Figure: layered map of computer architecture topics]
• Input/Output and Storage: emerging technologies, DRAM, interleaving, bus protocols
• Memory Hierarchy (L1 cache, L2 cache): coherence, bandwidth, latency, addressing, protection, exception handling
• Instruction Set Architecture, VLSI
• Multiprocessors (processor-memory pairs): shared memory, message passing, data parallelism
• Networks and Interconnections: processor-memory-switch topologies, routing, bandwidth, latency, reliability
The Secret of Architecture Design:
Measurement and Evaluation

Architecture design is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems

[Figure: design cycle: creativity feeds design; cost/performance analysis sorts the results into good ideas, mediocre ideas, and bad ideas]
Computer Engineering Methodology

[Figure: methodology cycle. Evaluate existing systems for bottlenecks (analysis; benchmarks, measurement tools), simulate new designs and organizations (design; workloads, technology trends), and implement the next-generation system (implementation complexity)]
Review:
Performance, Cost, Power
Metric 1: Performance

Performance can be a throughput rate (e.g., passenger-miles/hour in the airplane analogy) or the inverse of execution time:

    ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)
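As a minimal illustration of this ratio (the execution times below are made-up values):

    ex_time_x = 12.0   # seconds on machine X (assumed value)
    ex_time_y = 18.0   # seconds on machine Y (assumed value)

    n = ex_time_y / ex_time_x   # = Performance(X) / Performance(Y)
    print(f"X is {n:.1f} times faster than Y")   # X is 1.5 times faster than Y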
Amdahl's Law

Speedup due to enhancement E:

    Speedup(E) = ExTime without E / ExTime with E = Performance with E / Performance without E
Amdahl's Law

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Amdahl's Law

Example: if the enhanced fraction is 10% of execution time and it runs 2 times faster, then

    Speedup_overall = 1 / [(1 - 0.1) + 0.1/2] = 1 / 0.95 = 1.053
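A minimal Python sketch of this calculation (the function name is just illustrative):

    def amdahl_speedup(fraction_enhanced, speedup_enhanced):
        """Overall speedup when only part of the execution time is improved."""
        return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

    # 10% of execution time sped up by 2x gives only ~1.053 overall
    print(amdahl_speedup(0.10, 2.0))   # 1.0526...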
Aspects of CPU Performance

    CPU time = Seconds/Program
             = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

                    Inst. Count    CPI    Clock Rate
    Compiler             X         (X)
    Inst. Set            X          X
    Organization                    X         X
    Technology                                X
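A tiny sketch of this relationship; all three inputs are made-up values:

    instructions = 2_000_000        # instruction count of the program (assumed)
    cpi = 1.5                       # average clock cycles per instruction (assumed)
    clock_rate_hz = 500_000_000     # 500 MHz clock (assumed)

    cpu_time = instructions * cpi / clock_rate_hz
    print(f"CPU time = {cpu_time * 1e3:.2f} ms")   # 6.00 ms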
Cycles Per Instruction

"Average cycles per instruction":

    CPI = Cycles / Instruction Count
        = (CPU Time x Clock Rate) / Instruction Count

    CPU time = Cycle Time x Σ (i = 1..n) CPI_i x I_i

"Instruction frequency":

    CPI = Σ (i = 1..n) CPI_i x F_i,   where F_i = I_i / Instruction Count
Example: Calculating CPI

Base machine (Reg/Reg), typical mix:

    Op       Freq   CPI_i   CPI_i x F_i   (% Time)
    ALU      50%    1       0.5           (33%)
    Load     20%    2       0.4           (27%)
    Store    10%    2       0.2           (13%)
    Branch   20%    2       0.4           (27%)
                    CPI =   1.5
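A short sketch that reproduces the table above from the instruction mix:

    # (frequency, cycles) per instruction class, taken from the table above
    mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

    cpi = sum(freq * cycles for freq, cycles in mix.values())
    for op, (freq, cycles) in mix.items():
        contribution = freq * cycles
        print(f"{op:7s} {contribution:.1f} cycles/instr ({contribution / cpi:.0%} of time)")
    print(f"Overall CPI = {cpi}")   # 1.5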
Creating Benchmark Sets
• Real programs
• Kernels
• Toy benchmarks
• Synthetic benchmarks
– e.g. Whetstones and Dhrystones
SPEC: System Performance Evaluation Cooperative

• First Round, 1989
  – 10 programs yielding a single number ("SPECmarks")
• Second Round, 1992
  – SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
    » Compiler flags unlimited. March 1993 flags for a DEC 4000 Model 610:
      spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
      wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
      nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third Round, 1995
  – New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating-point programs)
  – "Benchmarks useful for 3 years"
  – Single flag setting for all programs: SPECint_base95, SPECfp_base95
How to Summarize Performance

• Arithmetic mean (or weighted arithmetic mean) of execution times tracks total execution time:
      Σ T_i / n   or   Σ W_i x T_i
• Harmonic mean (or weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time:
      n / Σ (1/R_i)   or   1 / Σ (W_i / R_i), with Σ W_i = 1
• Normalized execution time is handy for scaling performance (e.g., "X times faster than a SPARCstation 10")
  – The arithmetic mean of normalized times is impacted by the choice of reference machine
• Use the geometric mean for comparing normalized results:
      (Π T_i)^(1/n)
  – Independent of the chosen reference machine
  – But not a good metric for total execution time
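A short sketch, assuming made-up execution times and rates, that computes the three means discussed above:

    import math

    times = [2.0, 10.0, 4.0]        # execution times in seconds (made-up values)
    rates = [500.0, 100.0, 250.0]   # rates, e.g. MFLOPS, for the same runs (made-up values)

    arithmetic_mean = sum(times) / len(times)                  # tracks total execution time
    harmonic_mean = len(rates) / sum(1.0 / r for r in rates)   # the right way to average rates
    geometric_mean = math.prod(times) ** (1.0 / len(times))    # for normalized ratios

    print(arithmetic_mean, harmonic_mean, geometric_mean)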
SPEC First Round

• One program: 99% of its time in a single line of code
• A new front-end compiler could improve it dramatically

[Figure: SPEC89 performance ratios (0 to 800) for gcc, doduc, li, spice, nasa7, fpppp, eqntott, tomcatv, espresso, and matrix300 on an IBM Powerstation 550 with two different compilers; matrix300 is the outlier]
Impact of Means on SPECmark89 for IBM 550
(without and with special compiler option)

                 Ratio to VAX      Time             Weighted Time
    Program      Before  After    Before  After     Before  After
    gcc              30     29        49     51       8.91   9.22
    espresso         35     34        65     67       7.64   7.86
    spice            47     47       510    510       5.69   5.69
    doduc            46     49        41     38       5.81   5.45
    nasa7            78    144       258    140       3.43   1.86
    li               34     34       183    183       7.86   7.86
    eqntott          40     40        28     28       6.68   6.68
    matrix300        78    730        58      6       3.43   0.37
    fpppp            90     87        34     35       2.97   3.07
    tomcatv          33    138        20     19       2.01   1.94
    Mean             54     72       124    108      54.42  49.99
                 Geometric         Arithmetic       Weighted Arith.
                 Ratio 1.33        Ratio 1.16       Ratio 1.09
Performance Evaluation

• "For better or worse, benchmarks shape a field"
• Good products are created when you have:
  – Good benchmarks
  – Good ways to summarize performance
• Since sales are partly a function of performance relative to the competition, companies invest in improving the product as reported by the performance summary
• If the benchmarks or summary are inadequate, a company must choose between improving its product for real programs and improving it to get more sales; sales almost always win!
• Execution time is the measure of computer performance!
Integrated Circuit Costs

    IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

    Die cost = Wafer cost / (Dies per wafer x Die yield)

    Dies per wafer = [π x (Wafer_diam / 2)^2] / Die_area  -  [π x Wafer_diam] / sqrt(2 x Die_area)  -  Test dies

    Die yield = Wafer_yield x [1 + (Defect_density x Die_area) / α]^(-α)
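A small Python sketch of these formulas; the wafer cost, diameter, die area, defect density, and alpha used below are illustrative assumptions, not values from the lecture:

    import math

    def dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies=0):
        """Gross dies per wafer, minus edge loss and test dies."""
        return (math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
                - math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
                - test_dies)

    def die_yield(defect_density_per_cm2, die_area_cm2, alpha=3.0, wafer_yield=1.0):
        """Yield model from the slide (negative-binomial form)."""
        return wafer_yield * (1 + defect_density_per_cm2 * die_area_cm2 / alpha) ** -alpha

    def die_cost(wafer_cost, wafer_diam_cm, die_area_cm2, defect_density, alpha=3.0):
        return wafer_cost / (dies_per_wafer(wafer_diam_cm, die_area_cm2)
                             * die_yield(defect_density, die_area_cm2, alpha))

    # Example: $1000 wafer, 20 cm diameter, 1 cm^2 die, 0.8 defects/cm^2
    print(f"${die_cost(1000.0, 20.0, 1.0, 0.8):.2f} per good die")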
Cost/Performance: What Is the Relationship of Cost to Price?

• Recurring costs
  – Component costs
  – Direct costs (add 25% to 40%): labor, purchasing, scrap, warranty
Summary: Price vs. Cost

[Figure: two bar charts comparing minicomputers, workstations, and PCs: the first shows the average discount as a share of list price, the second shows the price-to-cost ratio of roughly 4.7 for minis, 3.8 for workstations, and 3.5 for PCs]
Power/Energy

[Figure: maximum power (watts, log scale from 1 to 100) of Intel processors (386, 486, Pentium, Pentium MMX, Pentium Pro, Pentium II) plotted against process generation from 1.5µ down to 0.13µ; power rises with each generation. Source: Intel]
Energy/Power
The Energy-Flexibility Gap

[Figure: energy efficiency (MOPS/mW or MIPS/mW, log scale up to 1000) versus flexibility; dedicated hardware is the most energy-efficient, reconfigurable processors such as Pleiades sit in the middle (around 100 MOPS/mW), and general-purpose processors trade efficiency for flexibility]
Summary

    CPU time = Seconds/Program
             = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

• Execution time is the REAL measure of computer performance!
• Good products are created when you have:
  – Good benchmarks and good ways to summarize performance
• A different set of metrics applies to embedded systems
Review:
Instruction Sets, Pipelines, and Caches
Computer Architecture Is ...

"... the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."

    Amdahl, Blaauw, and Brooks, 1964
Computer Architecture’s Changing
Definition
• 1950s to 1960s:
Computer Architecture Course = Computer Arithmetic
• 1970s to mid 1980s:
Computer Architecture Course = Instruction Set
Design, especially ISA appropriate for compilers
• 1990s:
Computer Architecture Course = Design of CPU,
memory system, I/O system, Multiprocessors
Computer Architecture is ...

[Figure: computer architecture encompasses instruction set architecture, machine organization, and hardware]
Instruction Set Architecture (ISA)

[Figure: the instruction set as the interface between software and hardware]
Interface Design

A good interface:
• Lasts through many implementations (portability, compatibility)
• Is used in many different ways (generality)
• Provides convenient functionality to higher levels
• Permits an efficient implementation at lower levels
Evolution of Instruction Sets

[Figure: instruction-set family tree, from the single-accumulator EDSAC (1950) through RISC (MIPS, SPARC, HP-PA, IBM RS6000, PowerPC, ca. 1987) to LIW/"EPIC" (IA-64, ca. 1999)]
Example: MIPS (≈ DLX) Instruction Formats

    Register-Register:   Op (31-26) | Rs1 (25-21) | Rs2 (20-16) | Rd (15-11) | Opx (10-0)
    Register-Immediate:  Op (31-26) | Rs1 (25-21) | Rd (20-16) | immediate (15-0)
    Branch:              Op (31-26) | Rs1 (25-21) | Rs2/Opx (20-16) | displacement (15-0)
    Jump / Call:         Op (31-26) | target (25-0)
Pipelining: It's Natural!

• Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• Folding takes 20 minutes
Sequential Laundry

[Figure: task-order timeline from 6 PM to midnight; each of the four loads takes 30 + 40 + 20 minutes, done one after another]

• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start Work ASAP

[Figure: the same four loads overlapped; after the first 30-minute wash, the 40-minute dryer paces the pipeline, so all four loads finish in 3.5 hours]
5 Steps of DLX Datapath
Figure 3.1, Page 130

[Figure: unpipelined DLX datapath with instruction memory, PC adder and next-sequential-PC mux, register file, sign-extended immediate, ALU with zero test, data memory, and write-back mux]
5 Steps of DLX Datapath
Figure 3.4, Page 134

Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back

[Figure: pipelined DLX datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, followed by a pipeline timing diagram in which successive instructions (Ifetch, Reg, ALU, DMem, Reg) proceed one stage apart in instruction order]
It's Not That Easy for Computers
One Memory Port / Structural Hazards
Figure 3.6, Page 142

[Figure: pipeline diagram of a load followed by Instr 1 through Instr 4; with a single memory port, the load's data-memory access and a later instruction's instruction fetch need the memory in the same cycle, creating a structural hazard]
One Memory Port / Structural Hazards
Figure 3.7, Page 143

[Figure: the same sequence with the hazard resolved by stalling; Instr 3 waits one cycle and a bubble propagates down the pipeline]
Speedup Equation for Pipelining

    Speedup = [Pipeline depth / (1 + Pipeline stall cycles per instruction)] x (Cycle Time_unpipelined / Cycle Time_pipelined)
Example: Dual-port vs. Single-port

• Machine A: dual-ported memory ("Harvard architecture")
• Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of the instructions executed

    SpeedupA = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe)
             = Pipeline Depth
    SpeedupB = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05))
             = (Pipeline Depth / 1.4) x 1.05
             = 0.75 x Pipeline Depth
    SpeedupA / SpeedupB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33

• Machine A is 1.33 times faster
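A minimal sketch that reproduces this comparison (the pipeline depth cancels out of the ratio; the value used below is arbitrary):

    depth = 5                      # pipeline depth; any value gives the same ratio
    load_fraction = 0.4            # loads are 40% of instructions

    speedup_a = depth / (1 + 0.0)                        # dual-ported memory: no structural stalls
    speedup_b = depth / (1 + load_fraction * 1) * 1.05   # single port: 1 stall per load, but 1.05x clock

    print(speedup_a / speedup_b)   # ~1.33, so Machine A is 1.33 times faster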
Data Hazard on R1
Figure 3.9, Page 147

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) in instruction order for the sequence below; the instructions after the add read r1 before, or in the same cycle as, the add writes it back]

    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11
Three Generic Data Hazards

Read After Write (RAW): instruction J tries to read an operand before instruction I writes it.

    I: add r1,r2,r3
    J: sub r4,r1,r3
Three Generic Data Hazards

Write After Write (WAW): instruction J tries to write an operand before instruction I writes it.

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• Called an "output dependence" by compiler writers; it also results from the reuse of the name "r1"
• Can't happen in the DLX 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
• We will see WAR and WAW hazards in later, more complicated pipelines
Forwarding to Avoid Data Hazard
Figure 3.10, Page 149

[Figure: the same add/sub/and/or/xor sequence, with forwarding paths from the add's EX/MEM and MEM/WB pipeline registers to the ALU inputs of the following instructions, removing the stalls]
HW Change for Forwarding
Figure 3.20, Page 161

[Figure: forwarding hardware; multiplexers at the ALU inputs choose among the register file, the EX/MEM register, and the MEM/WB register, with the immediate and data-memory paths feeding the same muxes]
Data Hazard Even with Forwarding
Figure 3.12, Page 153

[Figure: pipeline diagram for the sequence below; the lw produces r1 only at the end of its MEM stage, too late to forward to the sub's EX stage in the next cycle]

    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9
Data Hazard Even with Forwarding
Figure 3.13, Page 154

[Figure: the same lw/sub/and/or sequence with the hazard resolved; the sub and the instructions behind it stall one cycle (a bubble), after which forwarding from MEM/WB supplies r1]
Software Scheduling to Avoid Load Hazards

Try producing fast code for

    a = b + c;
    d = e - f;

assuming a, b, c, d, e, and f are in memory.

    Slow code:            Fast code:
    LW  Rb,b              LW  Rb,b
    LW  Rc,c              LW  Rc,c
    ADD Ra,Rb,Rc          LW  Re,e
    SW  a,Ra              ADD Ra,Rb,Rc
    LW  Re,e              LW  Rf,f
    LW  Rf,f              SW  a,Ra
    SUB Rd,Re,Rf          SUB Rd,Re,Rf
    SW  d,Rd              SW  d,Rd
Control Hazard on Branches: Three-Stage Stall

[Figure: pipeline diagram; the branch outcome is not resolved until late in the pipeline, so the three instructions fetched after the beq are affected before the target at 36 can be fetched]

    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10,r1,r11
Branch Stall Impact
Pipelined DLX Datapath
Figure 3.22, page 163

Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc. | Memory Access | Write Back

[Figure: revised pipelined DLX datapath; this is the correct 1-cycle branch latency implementation]
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% DLX branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% DLX branches taken on average
– But haven’t calculated branch target address in DLX
» DLX still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome
Four Branch Hazard Alternatives

#4: Delayed Branch: the branch takes effect only after n following instructions (the branch delay slots):

    branch instruction
    sequential successor_1
    sequential successor_2
    ........                        (branch delay of length n)
    sequential successor_n
    branch target if taken
Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
– Cancelling branches allow more slots to be filled
Evaluating Branch Alternatives

    Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)
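A short sketch of this formula; the 20% branch frequency and per-scheme penalties below are illustrative assumptions, not figures from the lecture:

    def pipeline_speedup(depth, branch_freq, branch_penalty):
        """Speedup over the unpipelined machine when branch stalls are the only stalls."""
        return depth / (1 + branch_freq * branch_penalty)

    # 5-stage pipeline, 20% branches, assumed average penalties for three schemes
    for scheme, penalty in [("stall always", 3), ("predict not taken", 1), ("delayed branch", 0.5)]:
        print(f"{scheme:18s}: speedup = {pipeline_speedup(5, 0.20, penalty):.2f}")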
Summary: Control and Pipelining

• Just overlap tasks; easy if tasks are independent
• Speedup ≤ Pipeline depth; if the ideal CPI is 1, then:

    Speedup = [Pipeline depth / (1 + Pipeline stall CPI)] x (Cycle Time_unpipelined / Cycle Time_pipelined)