Digital Design and Computer Architecture, 2nd Edition
Chapter 7 <1>
MICROARCHITECTURE Chapter 7 :: Topics
• Introduction
• Performance Analysis
• Single-Cycle Processor
• Multicycle Processor
• Pipelined Processor
• Exceptions
• Advanced Microarchitecture
MICROARCHITECTURE Introduction
• Microarchitecture: how to implement an architecture in hardware
• Processor:
– Datapath: functional blocks
– Control: control signals
[Figure: levels of abstraction, top to bottom: Application Software (programs), Architecture (instructions, registers), Micro-architecture (datapaths), Logic (adders, memories), Analog Circuits (amplifiers, filters), Devices (transistors, diodes), Physics (electrons).]
MICROARCHITECTURE Microarchitecture
• Multiple implementations for a single
architecture:
– Single-cycle: Each instruction executes in a
single cycle
– Multicycle: Each instruction is broken into a series
of shorter steps
– Pipelined: Each instruction is broken into a series
of steps, and multiple instructions execute at once
MICROARCHITECTURE Processor Performance
• Program execution time
Execution Time = (#instructions)(cycles/instruction)(seconds/cycle)
• Definitions:
– CPI: Cycles/instruction
– clock period: seconds/cycle
– IPC: instructions/cycle = 1/CPI
• Challenge is to satisfy constraints of:
– Cost
– Power
– Performance
MICROARCHITECTURE MIPS Processor
• Consider subset of MIPS instructions:
– R-type instructions: and, or, add, sub, slt
– Memory instructions: lw, sw
– Branch instructions: beq
MICROARCHITECTURE Architectural State
• Determines everything about a processor:
– PC
– 32 registers
– Memory
MICROARCHITECTURE MIPS State Elements
MICROARCHITECTURE Single-Cycle MIPS Processor
• Datapath
• Control
MICROARCHITECTURE Single-Cycle Datapath: lw fetch
STEP 1: Fetch instruction
[Figure: single-cycle datapath with PC, instruction memory, register file, and data memory; the PC addresses the instruction memory, which outputs Instr.]
MICROARCHITECTURE Single-Cycle Datapath: lw Register Read
[Figure: single-cycle datapath; Instr bits 25:21 drive register file address A1, and the operand is read on RD1.]
MICROARCHITECTURE Single-Cycle Datapath: lw Immediate
[Figure: single-cycle datapath; Instr bits 15:0 pass through the Sign Extend unit to produce SignImm.]
MICROARCHITECTURE Single-Cycle Datapath: lw address
STEP 4: Compute the memory address
[Figure: single-cycle datapath; with ALUControl2:0 = 010 the ALU adds SrcA (RD1) and SrcB (SignImm) to produce ALUResult.]
MICROARCHITECTURE Single-Cycle Datapath: lw Memory Read
[Figure: single-cycle datapath; ALUResult addresses the data memory, which outputs ReadData, and Instr bits 20:16 drive register file address A3.]
MICROARCHITECTURE Single-Cycle Datapath: lw PC Increment
[Figure: single-cycle datapath; an adder computes PCPlus4 = PC + 4.]
MICROARCHITECTURE Single-Cycle Datapath: sw
Write data in rt to memory
[Figure: single-cycle datapath for sw, with RegWrite = 0, ALUControl2:0 = 010, and MemWrite = 1; RD2 (register rt, Instr bits 20:16) supplies WriteData to the data memory.]
MICROARCHITECTURE Single-Cycle Datapath: R-Type
• Read from rs and rt
• Write ALUResult to register file
• Write to rd (instead of rt)
[Figure: single-cycle datapath for R-type instructions, with RegWrite = 1, RegDst = 1, ALUSrc = 0, ALUControl2:0 varies, MemWrite = 0, and MemtoReg = 0; a multiplexer selects WriteReg4:0 from Instr bits 20:16 or 15:11.]
MICROARCHITECTURE Single-Cycle Datapath: beq
• Determine whether values in rs and rt are equal
• Calculate branch target address:
BTA = (sign-extended immediate << 2) + (PC+4)
[Figure: single-cycle datapath for beq; SignImm << 2 is added to PCPlus4 to form PCBranch, and PCSrc selects between PCPlus4 and PCBranch for PC'.]
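The branch target calculation can be checked with a small software model. This is a minimal sketch, not part of the slides: the function names `sign_extend16` and `branch_target` are illustrative, and the model assumes 32-bit wrap-around arithmetic.

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to a Python integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm16):
    """BTA = (sign-extended immediate << 2) + (PC + 4), wrapped to 32 bits."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


# A beq at address 0x20 with immediate 0x10 branches to 0x64.
print(hex(branch_target(0x20, 0x0010)))
```

A negative immediate branches backward: an immediate of 0xFFFF (that is, -1) targets the branch instruction itself.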
MICROARCHITECTURE Single-Cycle Processor
[Figure: complete single-cycle processor; the control unit decodes Op (Instr bits 31:26) and Funct (Instr bits 5:0) into MemtoReg, MemWrite, Branch, ALUControl2:0, ALUSrc, RegDst, and RegWrite, and PCSrc selects between PCPlus4 and PCBranch.]
MICROARCHITECTURE Single-Cycle Control
[Figure: control unit. The main decoder takes Opcode5:0 and produces MemtoReg, MemWrite, Branch, ALUSrc, RegDst, RegWrite, and ALUOp1:0; the ALU decoder takes ALUOp1:0 and Funct5:0 and produces ALUControl2:0.]
MICROARCHITECTURE Review: ALU
F2:0  Function
000   A & B
001   A | B
010   A + B
011   not used
100   A & ~B
101   A | ~B
110   A - B
111   SLT
[Figure: ALU symbol with N-bit inputs A and B, 3-bit control F, and N-bit output Y.]
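The function table above can be exercised with a small behavioral model. This is a sketch under stated assumptions rather than the slides' implementation: it fixes N = 32, the names `alu` and `MASK32` are illustrative, and SLT is modeled as the sign bit of A - B, ignoring overflow.

```python
MASK32 = 0xFFFFFFFF  # assumption: N = 32


def alu(a, b, f):
    """Return (Y, Zero) for operands a, b and 3-bit function code f."""
    a &= MASK32
    b &= MASK32
    not_b = ~b & MASK32
    if f == 0b000:
        y = a & b
    elif f == 0b001:
        y = a | b
    elif f == 0b010:
        y = (a + b) & MASK32
    elif f == 0b100:
        y = a & not_b
    elif f == 0b101:
        y = a | not_b
    elif f == 0b110:
        y = (a - b) & MASK32
    elif f == 0b111:
        # SLT: sign bit of A - B, zero-extended (overflow is ignored)
        y = ((a - b) & MASK32) >> 31
    else:
        raise ValueError("F = 011 is not used")
    return y, int(y == 0)


print(alu(5, 3, 0b110))  # subtract
print(alu(3, 5, 0b111))  # set less than
```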
MICROARCHITECTURE Review: ALU
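[Figure: N-bit ALU implementation built from a multiplexer on B controlled by F2, an adder with carry out, a zero-extend unit on sum bit [N-1], and a 4:1 output multiplexer controlled by F1:0 that drives Y.]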
MICROARCHITECTURE Control Unit: ALU Decoder
ALUOp1:0 Meaning
00 Add
01 Subtract
10 Look at Funct
11 Not Used
Instruction  Op5:0
R-type       000000
lw           100011
sw           101011
beq          000100
[Figure: complete single-cycle processor with control unit.]
MICROARCHITECTURE Control Unit: Main Decoder
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000  1         1       0       0       0         0         10
lw           100011  1         0       1       0       0         1         00
sw           101011  0         X       1       0       1         X         00
beq          000100  0         X       0       1       0         X         01
MICROARCHITECTURE Single-Cycle Datapath: or
[Figure: single-cycle processor executing or; ALUControl2:0 = 001 and Branch = 0, with the other control signals set as for R-type instructions.]
MICROARCHITECTURE Extended Functionality: addi
[Figure: complete single-cycle processor with control unit.]
No change to datapath
MICROARCHITECTURE Control Unit: addi
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000  1         1       0       0       0         0         10
lw           100011  1         0       1       0       0         1         00
sw           101011  0         X       1       0       1         X         00
beq          000100  0         X       0       1       0         X         01
addi         001000  1         0       1       0       0         0         00
MICROARCHITECTURE Extended Functionality: j
[Figure: single-cycle processor extended for j; PCJump is formed from PCPlus4 bits 31:28 and Instr bits 25:0 shifted left by 2 (bits 27:0), and a multiplexer controlled by Jump selects PCJump for PC'.]
MICROARCHITECTURE Control Unit: Main Decoder
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0  Jump
R-type       000000  1         1       0       0       0         0         10        0
lw           100011  1         0       1       0       0         1         00        0
sw           101011  0         X       1       0       1         X         00        0
beq          000100  0         X       0       1       0         X         01        0
j            000010  0         X       X       X       0         X         XX        1
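The main decoder truth table maps directly onto a lookup table. The sketch below is one possible software rendering, not the hardware description: `MAIN_DECODER` and `decode` are illustrative names, don't-care entries (X) are stored as `None`, and the addi row takes Jump = 0 as an assumption, since the table with the Jump column does not list addi.

```python
FIELDS = ("RegWrite", "RegDst", "ALUSrc", "Branch",
          "MemWrite", "MemtoReg", "ALUOp", "Jump")

X = None  # don't care

# One row per opcode, in the column order of the truth table.
MAIN_DECODER = {
    0b000000: (1, 1, 0, 0, 0, 0, 0b10, 0),  # R-type
    0b100011: (1, 0, 1, 0, 0, 1, 0b00, 0),  # lw
    0b101011: (0, X, 1, 0, 1, X, 0b00, 0),  # sw
    0b000100: (0, X, 0, 1, 0, X, 0b01, 0),  # beq
    0b001000: (1, 0, 1, 0, 0, 0, 0b00, 0),  # addi (Jump = 0 is an assumption)
    0b000010: (0, X, X, X, 0, X, X, 1),     # j
}


def decode(opcode):
    """Return the control signals for a 6-bit opcode as a dict."""
    return dict(zip(FIELDS, MAIN_DECODER[opcode]))


print(decode(0b100011))  # control signals for lw
```

An unsupported opcode raises `KeyError`, which stands in for the undefined behavior of the hardware decoder on instructions outside this subset.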
MICROARCHITECTURE Review: Processor Performance
Program Execution Time
= (#instructions)(cycles/instruction)(seconds/cycle)
= #instructions × CPI × Tc
MICROARCHITECTURE Single-Cycle Performance
[Figure: single-cycle processor with the control signals for lw (ALUControl2:0 = 010); the cycle time Tc is limited by the longest instruction, lw.]
MICROARCHITECTURE Single-Cycle Performance Example
Element Parameter Delay (ps)
Register clock-to-Q tpcq_PC 30
Register setup tsetup 20
Multiplexer tmux 25
ALU tALU 200
Memory read tmem 250
Register file read tRFread 150
Register file setup tRFsetup 20
Tc = ?
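One way to answer the question is to add up the delays along the lw path. The path used below is an assumption, since this section does not spell it out: PC clock-to-Q, instruction memory read, register file read, ALU, data memory read, result multiplexer, and register file setup. The names `DELAYS_PS` and `single_cycle_tc` are illustrative.

```python
# Element delays from the table above, in picoseconds.
DELAYS_PS = {
    "tpcq_PC": 30,
    "tsetup": 20,
    "tmux": 25,
    "tALU": 200,
    "tmem": 250,
    "tRFread": 150,
    "tRFsetup": 20,
}


def single_cycle_tc(d):
    """Cycle time along the assumed lw critical path, in ps."""
    return (d["tpcq_PC"] + d["tmem"] + d["tRFread"] + d["tALU"]
            + d["tmem"] + d["tmux"] + d["tRFsetup"])


print(single_cycle_tc(DELAYS_PS), "ps")  # 925 ps
```

Under this assumption the two memory accesses account for more than half of the cycle time.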
MICROARCHITECTURE Multicycle MIPS Processor
• Single-cycle:
+ simple
- cycle time limited by longest instruction (lw)
- 2 adders/ALUs & 2 memories
• Multicycle:
+ higher clock speed
+ simpler instructions run faster
+ reuse expensive hardware on multiple cycles
- sequencing overhead paid many times
• Same design steps: datapath & control
MICROARCHITECTURE Multicycle State Elements
• Replace Instruction and Data memories with
a single unified memory – more realistic
[Figure: multicycle state elements: PC, unified instruction/data memory, and register file.]
MICROARCHITECTURE Multicycle Datapath: Instruction Fetch
STEP 1: Fetch instruction
[Figure: multicycle datapath; the PC addresses the memory, and the instruction read is stored in the instruction register (Instr), enabled by IRWrite.]
MICROARCHITECTURE Multicycle Datapath: lw Register Read
STEP 2a: Read source operands from RF
MICROARCHITECTURE Multicycle Datapath: lw Immediate
STEP 2b: Sign-extend the immediate
[Figure: multicycle datapath; Instr bits 15:0 are sign-extended to SignImm.]
MICROARCHITECTURE Multicycle Datapath: lw Address
STEP 3: Compute the memory address
[Figure: multicycle datapath; the ALU, controlled by ALUControl2:0, combines the register operand with SignImm (SrcB), and ALUResult is stored in the ALUOut register.]
MICROARCHITECTURE Multicycle Datapath: lw Memory Read
[Figure: multicycle datapath; ALUOut addresses the memory, and the data read is stored in the Data register.]
MICROARCHITECTURE Multicycle Datapath: lw Write Register
[Figure: multicycle datapath; Data is written to the register file through WD3, with Instr bits 20:16 driving A3.]
MICROARCHITECTURE Multicycle Datapath: Increment PC
STEP 6: Increment PC
[Figure: multicycle datapath; a 4:1 multiplexer on SrcB (inputs 00, 01, 10, 11) supplies the constant 4 on input 01 so that the ALU can compute PC + 4.]
MICROARCHITECTURE Multicycle Datapath: sw
Write data in rt to memory
[Figure: multicycle datapath for sw; RD2 (register rt, Instr bits 20:16) supplies the memory write data WD.]
MICROARCHITECTURE Multicycle Datapath: R-Type
• Read from rs and rt
• Write ALUResult to register file
• Write to rd (instead of rt)
[Figure: multicycle datapath for R-type instructions, with control signals PCWrite, IorD, MemWrite, IRWrite, RegDst, MemtoReg, RegWrite, ALUSrcA, ALUSrcB1:0, and ALUControl2:0; RegDst selects A3 from Instr bits 20:16 or 15:11, and MemtoReg selects WD3 from ALUOut or Data.]
MICROARCHITECTURE Multicycle Datapath: beq
• rs == rt?
• BTA = (sign-extended immediate << 2) + (PC+4)
[Figure: multicycle datapath for beq, adding the control signals Branch, PCEn, and PCSrc; SignImm << 2 feeds the SrcB multiplexer for computing the branch target.]
MICROARCHITECTURE Multicycle Processor
[Figure: complete multicycle processor; the control unit takes Op (Instr bits 31:26) and Funct (Instr bits 5:0) and produces PCWrite, Branch, IorD, MemWrite, IRWrite, PCSrc, ALUControl2:0, ALUSrcB1:0, ALUSrcA, RegWrite, MemtoReg, and RegDst; PCEn enables the PC register.]
MICROARCHITECTURE Multicycle Control
[Figure: multicycle control unit. The main controller (FSM) takes Opcode5:0 and produces the multiplexer selects (MemtoReg, RegDst, IorD, PCSrc, ALUSrcB1:0, ALUSrcA), the register enables (IRWrite, MemWrite, PCWrite, Branch, RegWrite), and ALUOp1:0; the ALU decoder takes ALUOp1:0 and Funct5:0 and produces ALUControl2:0.]
MICROARCHITECTURE Main Controller FSM: Fetch
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite
[Figure: multicycle processor with the control signal values for the Fetch state.]
MICROARCHITECTURE Main Controller FSM: Decode
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite. Next: S1
S1: Decode
[Figure: multicycle processor with the control signal values for the Decode state.]
MICROARCHITECTURE Main Controller FSM: Address
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite. Next: S1
S1: Decode. Next: S2 if Op = LW or Op = SW
S2: MemAdr: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
[Figure: multicycle processor with the control signal values for the MemAdr state.]
MICROARCHITECTURE Main Controller FSM: lw
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite. Next: S1
S1: Decode. Next: S2 if Op = LW or Op = SW
S2: MemAdr: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00. Next: S3 if Op = LW
S3: MemRead: IorD = 1. Next: S4
S4: Mem Writeback: RegDst = 0, MemtoReg = 1, RegWrite. Next: S0
MICROARCHITECTURE Main Controller FSM: sw
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite. Next: S1
S1: Decode. Next: S2 if Op = LW or Op = SW
S2: MemAdr: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00. Next: S3 if Op = LW; S5 if Op = SW
S3: MemRead: IorD = 1. Next: S4
S4: Mem Writeback: RegDst = 0, MemtoReg = 1, RegWrite. Next: S0
S5: MemWrite: IorD = 1, MemWrite. Next: S0
MICROARCHITECTURE Main Controller FSM: R-Type
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite. Next: S1
S1: Decode. Next: S2 if Op = LW or Op = SW; S6 if Op = R-type
S2: MemAdr: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00. Next: S3 if Op = LW; S5 if Op = SW
S3: MemRead: IorD = 1. Next: S4
S4: Mem Writeback: RegDst = 0, MemtoReg = 1, RegWrite. Next: S0
S5: MemWrite: IorD = 1, MemWrite. Next: S0
S6: Execute: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10. Next: S7
S7: ALU Writeback: RegDst = 1, MemtoReg = 0, RegWrite. Next: S0
MICROARCHITECTURE Main Controller FSM: beq
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite. Next: S1
S1: Decode: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00. Next: S2 if Op = LW or Op = SW; S6 if Op = R-type; S8 if Op = BEQ
S2: MemAdr: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00. Next: S3 if Op = LW; S5 if Op = SW
S3: MemRead: IorD = 1. Next: S4
S4: Mem Writeback: RegDst = 0, MemtoReg = 1, RegWrite. Next: S0
S5: MemWrite: IorD = 1, MemWrite. Next: S0
S6: Execute: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10. Next: S7
S7: ALU Writeback: RegDst = 1, MemtoReg = 0, RegWrite. Next: S0
S8: Branch: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCSrc = 1, Branch. Next: S0
MICROARCHITECTURE Multicycle Controller FSM
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite. Next: S1
S1: Decode: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00. Next: S2 if Op = LW or Op = SW; S6 if Op = R-type; S8 if Op = BEQ
S2: MemAdr: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00. Next: S3 if Op = LW; S5 if Op = SW
S3: MemRead: IorD = 1. Next: S4
S4: Mem Writeback: RegDst = 0, MemtoReg = 1, RegWrite. Next: S0
S5: MemWrite: IorD = 1, MemWrite. Next: S0
S6: Execute: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10. Next: S7
S7: ALU Writeback: RegDst = 1, MemtoReg = 0, RegWrite. Next: S0
S8: Branch: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCSrc = 1, Branch. Next: S0
MICROARCHITECTURE Extended Functionality: addi
[Figure: the multicycle controller FSM (states S0 to S8) extended with two new states, S9: ADDI Execute and S10: ADDI Writeback, reached from S1: Decode when Op = ADDI.]
MICROARCHITECTURE Main Controller FSM: addi
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite. Next: S1
S1: Decode: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00. Next: S2 if Op = LW or Op = SW; S6 if Op = R-type; S8 if Op = BEQ; S9 if Op = ADDI
S2: MemAdr: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00. Next: S3 if Op = LW; S5 if Op = SW
S3: MemRead: IorD = 1. Next: S4
S4: Mem Writeback: RegDst = 0, MemtoReg = 1, RegWrite. Next: S0
S5: MemWrite: IorD = 1, MemWrite. Next: S0
S6: Execute: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10. Next: S7
S7: ALU Writeback: RegDst = 1, MemtoReg = 0, RegWrite. Next: S0
S8: Branch: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCSrc = 1, Branch. Next: S0
S9: ADDI Execute: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00. Next: S10
S10: ADDI Writeback: RegDst = 0, MemtoReg = 0, RegWrite. Next: S0
MICROARCHITECTURE Extended Functionality: j
[Figure: multicycle datapath extended for j; PCSrc widens to PCSrc1:0, and the PC multiplexer gains input 10, PCJump, formed from Instr bits 25:0 shifted left by 2 (bits 27:0).]
MICROARCHITECTURE Main Controller FSM: j
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 00, IRWrite, PCWrite. Next: S1
S1: Decode: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00. Next: S2 if Op = LW or Op = SW; S6 if Op = R-type; S8 if Op = BEQ; S9 if Op = ADDI; S11 if Op = J
S2: MemAdr: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00. Next: S3 if Op = LW; S5 if Op = SW
S3: MemRead: IorD = 1. Next: S4
S4: Mem Writeback: RegDst = 0, MemtoReg = 1, RegWrite. Next: S0
S5: MemWrite: IorD = 1, MemWrite. Next: S0
S6: Execute: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10. Next: S7
S7: ALU Writeback: RegDst = 1, MemtoReg = 0, RegWrite. Next: S0
S8: Branch: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCSrc = 01, Branch. Next: S0
S9: ADDI Execute: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00. Next: S10
S10: ADDI Writeback: RegDst = 0, MemtoReg = 0, RegWrite. Next: S0
S11: Jump: PCSrc = 10, PCWrite. Next: S0
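The state transitions of the complete controller can be traced in software to count how many cycles each instruction takes. This is a minimal sketch of the next-state logic only (no control outputs); the names `DECODE_NEXT`, `next_state`, and `cycles` are illustrative, and it assumes every final state returns to S0: Fetch.

```python
# Next state out of S1: Decode, selected by the instruction type.
DECODE_NEXT = {
    "lw": "S2", "sw": "S2", "R-type": "S6",
    "beq": "S8", "addi": "S9", "j": "S11",
}


def next_state(state, op):
    """Return the next FSM state for the current state and instruction."""
    if state == "S0":
        return "S1"
    if state == "S1":
        return DECODE_NEXT[op]
    if state == "S2":
        return "S3" if op == "lw" else "S5"
    if state == "S3":
        return "S4"
    if state == "S6":
        return "S7"
    if state == "S9":
        return "S10"
    return "S0"  # S4, S5, S7, S8, S10, S11 go back to Fetch


def cycles(op):
    """Count the states visited from Fetch until the FSM returns to Fetch."""
    state, count = "S0", 0
    while True:
        count += 1
        state = next_state(state, op)
        if state == "S0":
            return count


for op in DECODE_NEXT:
    print(op, cycles(op))
```

The counts come out as 3 cycles for beq and j, 4 for R-type, sw, and addi, and 5 for lw.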
MICROARCHITECTURE Multicycle Processor Performance
• Instructions take different number of cycles:
– 3 cycles: beq, j
– 4 cycles: R-Type, sw, addi
– 5 cycles: lw
• CPI is weighted average
• SPECINT2000 benchmark:
– 25% loads
– 10% stores
– 11% branches
– 2% jumps
– 52% R-type
Average CPI = (0.11 + 0.02)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
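The weighted average can be reproduced in a few lines. This is an illustrative sketch: the dictionary and function names are not from the slides, and addi is left out because the benchmark mix above does not list it separately.

```python
# Instruction mix (SPECINT2000 figures from the slide) and cycles per type.
MIX = {"lw": 0.25, "sw": 0.10, "beq": 0.11, "j": 0.02, "R-type": 0.52}
CYCLES = {"lw": 5, "sw": 4, "beq": 3, "j": 3, "R-type": 4}


def average_cpi(mix, cycles):
    """Weighted average of cycles per instruction."""
    return sum(mix[kind] * cycles[kind] for kind in mix)


print(round(average_cpi(MIX, CYCLES), 2))  # 4.12
```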
MICROARCHITECTURE Multicycle Processor Performance
Multicycle critical path:
Tc = tpcq + tmux + max(tALU + tmux, tmem) + tsetup
[Figure: complete multicycle processor.]
MICROARCHITECTURE Multicycle Performance Example
Element Parameter Delay (ps)
Register clock-to-Q tpcq_PC 30
Register setup tsetup 20
Multiplexer tmux 25
ALU tALU 200
Memory read tmem 250
Register file read tRFread 150
Register file setup tRFsetup 20
Tc = ?
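Plugging the table into the multicycle critical path, Tc = tpcq + tmux + max(tALU + tmux, tmem) + tsetup, gives the cycle time. The sketch below uses illustrative names and assumes tpcq is the register clock-to-Q delay listed as tpcq_PC.

```python
# Element delays from the table above, in picoseconds.
DELAYS_PS = {
    "tpcq": 30,   # listed in the table as tpcq_PC
    "tsetup": 20,
    "tmux": 25,
    "tALU": 200,
    "tmem": 250,
}


def multicycle_tc(d):
    """Tc = tpcq + tmux + max(tALU + tmux, tmem) + tsetup, in ps."""
    return (d["tpcq"] + d["tmux"]
            + max(d["tALU"] + d["tmux"], d["tmem"])
            + d["tsetup"])


print(multicycle_tc(DELAYS_PS), "ps")  # 325 ps
```

With these delays the memory read (250 ps) is slower than the ALU followed by a multiplexer (225 ps), so memory sets the cycle time.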
MICROARCHITECTURE Multicycle Performance Example
Program with 100 billion instructions
Execution Time = (# instructions) × CPI × Tc
= (100 × 10^9)(4.12)(325 × 10^-12)
= 133.9 seconds
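The same arithmetic as a short script, for checking or for trying other parameters. The function name `execution_time` is illustrative.

```python
def execution_time(instructions, cpi, tc_seconds):
    """Execution time = (# instructions) x CPI x Tc, in seconds."""
    return instructions * cpi * tc_seconds


# 100 billion instructions, CPI = 4.12, Tc = 325 ps
t = execution_time(100e9, 4.12, 325e-12)
print(round(t, 1), "seconds")  # 133.9 seconds
```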
MICROARCHITECTURE Review: Single-Cycle Processor
[Figure: complete single-cycle processor, including the Jump multiplexer that selects PCJump.]
MICROARCHITECTURE Review: Multicycle Processor
[Figure: complete multicycle processor, including the PCJump input to the PC multiplexer.]
MICROARCHITECTURE Pipelined MIPS Processor
• Temporal parallelism
• Divide single-cycle processor into 5 stages:
– Fetch
– Decode
– Execute
– Memory
– Writeback
• Add pipeline registers between stages
MICROARCHITECTURE Single-Cycle vs. Pipelined
[Figure: timing diagram, Time (ps) from 0 to 1900. Single-cycle: instructions 1 and 2 run one after the other, each going through Fetch Instruction, Decode / Read Reg, Execute ALU, Memory Read / Write, and Write Reg. Pipelined: instructions 1, 2, and 3 go through the same five stages but overlap in time.]
MICROARCHITECTURE Pipelined Processor Abstraction
[Figure: pipeline diagram over cycles 1 to 10; each instruction passes through IM, RF, ALU, DM, and RF, starting one cycle after the previous instruction:]
lw $s2, 40($0)
add $s3, $t1, $t2
sub $s4, $s1, $s5
and $s5, $t5, $t6
sw $s6, 20($s1)
or $s7, $t3, $t4
MICROARCHITECTURE Single-Cycle & Pipelined Datapath
[Figure: single-cycle datapath (top) and pipelined datapath (bottom); pipeline registers separate the stages, and signal names carry a stage suffix (F, D, E, M, W), for example PCF, InstrD, SrcAE, ALUOutM, and ResultW.]
MICROARCHITECTURE Corrected Pipelined Datapath
[Figure: corrected pipelined datapath; WriteReg is carried through the pipeline registers (WriteRegE4:0, WriteRegM4:0, WriteRegW4:0) so that it reaches the register file together with ResultW.]
MICROARCHITECTURE Pipelined Processor Control
[Figure: pipelined processor with control unit.]
MICROARCHITECTURE Data Hazard
[Figure: pipeline diagram over cycles 1 to 8; add writes $s0, which the next three instructions read:]
add $s0, $s2, $s3
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
MICROARCHITECTURE Handling Data Hazards
• Insert nops in code at compile time
• Rearrange code at compile time
• Forward data at run time
• Stall the processor at run time
MICROARCHITECTURE Compile-Time Hazard Elimination
• Insert enough nops for result to be ready
• Or move independent useful instructions forward
[Figure: pipeline diagram over cycles 1 to 10; two nops separate add from the instructions that read $s0:]
add $s0, $s2, $s3
nop
nop
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
MICROARCHITECTURE Data Forwarding
[Figure: pipeline diagram over cycles 1 to 8; the $s0 result of add is forwarded to the instructions that read it:]
add $s0, $s2, $s3
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
MICROARCHITECTURE Data Forwarding
[Figure: pipelined datapath with forwarding; the Hazard Unit takes RsE, RtE, WriteRegM4:0, WriteRegW4:0, RegWriteM, and RegWriteW and drives the ForwardAE and ForwardBE multiplexer selects in the Execute stage.]
MICROARCHITECTURE Data Forwarding
• Forward to Execute stage from either:
– Memory stage or
– Writeback stage
• Forwarding logic for ForwardAE:
if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM)
then ForwardAE = 10
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW)
then ForwardAE = 01
else ForwardAE = 00
Forwarding logic for ForwardBE same, but replace rsE with rtE
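The forwarding logic above translates directly into a function. This is an illustrative sketch, not the hardware description: the name `forward_select` is hypothetical, and one function serves both ForwardAE (pass rsE) and ForwardBE (pass rtE).

```python
def forward_select(src_e, write_reg_m, reg_write_m, write_reg_w, reg_write_w):
    """Return the 2-bit forwarding select for one Execute-stage source.

    src_e is rsE for ForwardAE or rtE for ForwardBE.
    """
    if src_e != 0 and src_e == write_reg_m and reg_write_m:
        return 0b10  # forward from the Memory stage
    if src_e != 0 and src_e == write_reg_w and reg_write_w:
        return 0b01  # forward from the Writeback stage
    return 0b00      # no forwarding: use the register file value


# Source register 16 ($s0) matches the destination in the Memory stage.
print(forward_select(16, 16, True, 8, True))
```

Checking the Memory stage first means the most recent result wins when both stages hold a write to the same register, and register 0 is never forwarded.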
MICROARCHITECTURE Stalling
[Figure: pipeline diagram over cycles 1 to 8; lw loads $s0, and the and that follows needs it before the memory read has finished (marked "Trouble!"):]
lw $s0, 40($0)
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
MICROARCHITECTURE Stalling
[Figure: pipeline diagram over cycles 1 to 9; and repeats its register-read stage and or repeats its fetch stage for one cycle (Stall), so the lw result is available when and executes:]
lw $s0, 40($0)
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
MICROARCHITECTURE Stalling Hardware
[Figure: pipelined datapath with stalling hardware; the Hazard Unit additionally takes MemtoRegE and drives StallF and StallD (register enables, EN) and FlushE (register clear, CLR), alongside ForwardAE and ForwardBE.]
MICROARCHITECTURE Stalling Logic
lwstall =
((rsD==rtE) OR (rtD==rtE)) AND MemtoRegE
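The stall condition can be written as a one-line predicate. This is a sketch with illustrative names; it also assumes that the Hazard Unit outputs StallF, StallD, and FlushE all follow lwstall, which this section does not state explicitly.

```python
def lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e):
    """lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE."""
    return bool((rs_d == rt_e or rt_d == rt_e) and mem_to_reg_e)


def hazard_outputs(rs_d, rt_d, rt_e, mem_to_reg_e):
    """Assumption: StallF, StallD, and FlushE are all driven by lwstall."""
    stall = lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e)
    return {"StallF": stall, "StallD": stall, "FlushE": stall}


# lw $s0, 40($0) in Execute (rtE = 16); and $t0, $s0, $s1 in Decode.
print(hazard_outputs(rs_d=16, rt_d=17, rt_e=16, mem_to_reg_e=True))
```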
MICROARCHITECTURE Control Hazards
• beq:
– branch not determined until 4th stage of pipeline
– Instructions after branch fetched before branch occurs
– These instructions must be flushed if branch happens
• Branch misprediction penalty
– number of instructions flushed when branch is taken
– May be reduced by determining branch earlier
MICROARCHITECTURE Control Hazards: Original Pipeline
[Figure: pipelined datapath with forwarding and stalling hardware; the branch target PCBranchM is available in the Memory stage.]
MICROARCHITECTURE Control Hazards
[Figure: pipeline diagram over cycles 1 to 9; the three instructions fetched after beq (and, or, sub) are flushed when the branch is taken to address 64:]
20 beq $t1, $t2, 40
24 and $t0, $s0, $s1
28 or $t1, $s4, $s0
2C sub $t2, $s0, $s5
30 ...
...
64 slt $t3, $s2, $s3
MICROARCHITECTURE Early Branch Resolution
[Figure: pipelined datapath with early branch resolution; an equality comparator in the Decode stage produces EqualD, PCSrcD selects PCBranchD, and the Decode-stage pipeline register gains a CLR input.]
[Figure: pipeline diagram; with the branch resolved in Decode, only the instruction after beq is flushed:]
20 beq $t1, $t2, 40
24 and $t0, $s0, $s1
30 ...
...
64 slt $t3, $s2, $s3
MICROARCHITECTURE Handling Data & Control Hazards
[Figure: pipelined datapath handling data and control hazards; the Hazard Unit adds the inputs BranchD and RegWriteE and the outputs ForwardAD and ForwardBD, which select forwarded operands for the Decode-stage equality comparator.]
MICROARCHITECTURE Control Forwarding & Stalling Logic
• Forwarding logic:
ForwardAD = (rsD != 0) AND (rsD == WriteRegM) AND RegWriteM
ForwardBD = (rtD != 0) AND (rtD == WriteRegM) AND RegWriteM
• Stalling logic:
branchstall = BranchD AND
( [RegWriteE AND ((WriteRegE == rsD) OR (WriteRegE == rtD))]
OR
[MemtoRegM AND ((WriteRegM == rsD) OR (WriteRegM == rtD))] )
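The forwarding and stalling equations on this slide can be sketched as plain Boolean functions. This is only an illustrative model, not the hardware description: the function and parameter names (`forward_d`, `branch_stall`, `rs_d`, and so on) are assumptions that mirror the signal names above, and register numbers are modeled as Python integers.

```python
def forward_d(src_reg, write_reg_m, reg_write_m):
    """ForwardAD/ForwardBD: forward the Memory-stage result to the Decode-stage comparator."""
    return bool(src_reg != 0 and src_reg == write_reg_m and reg_write_m)


def branch_stall(branch_d, rs_d, rt_d,
                 reg_write_e, write_reg_e,
                 mem_to_reg_m, write_reg_m):
    """Stall a branch in Decode while one of its source registers is not yet available."""
    # An instruction in Execute will write a branch source register.
    pending_in_execute = reg_write_e and write_reg_e in (rs_d, rt_d)
    # A load in Memory will write a branch source register.
    pending_load_in_memory = mem_to_reg_m and write_reg_m in (rs_d, rt_d)
    return bool(branch_d and (pending_in_execute or pending_load_in_memory))


# Example: beq $t1, $t2, ... in Decode while an instruction in Execute writes $t1 (register 9).
print(branch_stall(True, 9, 10, True, 9, False, 0))
```

Writing the two stall causes as separate named terms keeps the bracket structure of the equation visible.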
Chapter 7 <100>
MICROARCHITECTURE Branch Prediction
• Guess whether branch will be taken
– Backward branches are usually taken (loops)
– Consider history to improve guess
• Good prediction reduces fraction of branches
requiring a flush
Chapter 7 <101>
MICROARCHITECTURE Pipelined Performance Example
• SPECINT2000 benchmark:
– 25% loads
– 10% stores
– 11% branches
– 2% jumps
– 52% R-type
• Suppose:
– 40% of loads used by next instruction
– 25% of branches mispredicted
– All jumps flush next instruction
• What is the average CPI?
– Load/Branch CPI = 1 when no stalling, 2 when stalling
– CPIlw = 1(0.6) + 2(0.4) = 1.4
– CPIbeq = 1(0.75) + 2(0.25) = 1.25
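– CPIj = 2 (every jump flushes the next instruction)
– Average CPI = (0.25)(1.4) + (0.10)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1) = 1.15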
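The weighted average can be checked with a few lines of Python. This is a sketch using only the instruction mix and assumptions stated on this slide; the dictionary names are illustrative.

```python
# Instruction mix from the SPECINT2000 example (fractions of all instructions).
mix = {"load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02, "rtype": 0.52}

# Per-class CPI: loads and branches cost 1 cycle without a stall, 2 with one.
cpi = {
    "load": 1 * 0.60 + 2 * 0.40,    # 40% of loads are used by the next instruction
    "store": 1.0,
    "branch": 1 * 0.75 + 2 * 0.25,  # 25% of branches are mispredicted
    "jump": 2.0,                    # every jump flushes the next instruction
    "rtype": 1.0,
}

average_cpi = sum(mix[kind] * cpi[kind] for kind in mix)
print(f"Average CPI = {average_cpi:.4f}")  # 1.1475, which rounds to 1.15
```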
Chapter 7 <103>
MICROARCHITECTURE Pipelined Performance
• Pipelined processor critical path:
Tc = max {
tpcq + tmem + tsetup,                             (Fetch)
2(tRFread + tmux + teq + tAND + tmux + tsetup),   (Decode)
tpcq + tmux + tmux + tALU + tsetup,               (Execute)
tpcq + tmemwrite + tsetup,                        (Memory)
2(tpcq + tmux + tRFwrite) }                       (Writeback)
Chapter 7 <104>
MICROARCHITECTURE Pipelined Performance Example
Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq_PC     30
Register setup        tsetup      20
Multiplexer           tmux        25
ALU                   tALU        200
Memory read           tmem        250
Register file read    tRFread     150
Register file setup   tRFsetup    20
Equality comparator   teq         40
AND gate              tAND        15
Memory write          tmemwrite   220
Register file write   tRFwrite    100
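Plugging these delays into the five-term critical path expression for the pipelined processor gives the cycle time. The sketch below assumes the stage order Fetch, Decode, Execute, Memory, Writeback for the five terms; the variable names are illustrative.

```python
# Element delays in picoseconds, taken from the table above.
t_pcq, t_setup, t_mux = 30, 20, 25
t_alu, t_mem, t_memwrite = 200, 250, 220
t_rfread, t_rfwrite = 150, 100
t_eq, t_and = 40, 15

# One entry per term of the Tc = max{...} expression.
stage_delay = {
    "fetch": t_pcq + t_mem + t_setup,
    "decode": 2 * (t_rfread + t_mux + t_eq + t_and + t_mux + t_setup),
    "execute": t_pcq + t_mux + t_mux + t_alu + t_setup,
    "memory": t_pcq + t_memwrite + t_setup,
    "writeback": 2 * (t_pcq + t_mux + t_rfwrite),
}

critical_stage = max(stage_delay, key=stage_delay.get)
tc = stage_delay[critical_stage]
print(f"Tc = {tc} ps, limited by the {critical_stage} stage")  # Tc = 550 ps, decode
```

The Decode and Writeback terms are doubled because the register file is used in half of a cycle in this design, so each half-cycle path must fit twice into Tc.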
Chapter 7 <106>
MICROARCHITECTURE Processor Performance Comparison
Processor      Execution Time (seconds)   Speedup (single-cycle as baseline)
Single-cycle   92.5                       1
Multicycle     133                        0.70
Pipelined      63                         1.47
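The speedup column follows directly from the execution times: speedup = (single-cycle time) / (processor time). A minimal check, using only the numbers in the table:

```python
# Execution times in seconds, from the comparison table.
execution_time = {"single-cycle": 92.5, "multicycle": 133.0, "pipelined": 63.0}

baseline = execution_time["single-cycle"]
speedup = {name: baseline / seconds for name, seconds in execution_time.items()}

for name, value in speedup.items():
    print(f"{name:13s} {value:.2f}")
```

A speedup below 1 (the multicycle design here) means the processor is slower than the baseline.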
Chapter 7 <107>
MICROARCHITECTURE Review: Exceptions
• Unscheduled function call to exception handler
• Caused by:
– Hardware, also called an interrupt, e.g. keyboard
– Software, also called traps, e.g. undefined instruction
• When exception occurs, the processor:
– Records cause of exception (Cause register)
– Jumps to exception handler (0x80000180)
– Returns to program (EPC register)
Chapter 7 <108>
MICROARCHITECTURE Example Exception
Chapter 7 <109>
MICROARCHITECTURE Exception Registers
• Not part of register file
– Cause
• Records cause of exception
• Coprocessor 0 register 13
– EPC (Exception PC)
• Records PC where exception occurred
• Coprocessor 0 register 14
• Move from Coprocessor 0
– mfc0 $t0, Cause
– Moves contents of Cause into $t0
mfc0 $t0, Cause machine code:
Bits:    31:26    25:21   20:16     15:11        10:0
Field:   010000   00000   $t0 (8)   Cause (13)   00000000000
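The field layout above can be assembled into a 32-bit instruction word. The helper below is a hypothetical illustration (the name `encode_mfc0` is not from the slides); it simply packs the opcode 010000, the destination register in bits 20:16, and the coprocessor 0 register number in bits 15:11.

```python
def encode_mfc0(rt, c0_reg):
    """Pack an mfc0 instruction: opcode 010000, bits 25:21 zero, rt, c0 register, zeros."""
    opcode = 0b010000
    return (opcode << 26) | (0 << 21) | (rt << 16) | (c0_reg << 11)


T0 = 8        # $t0 is register 8
CAUSE = 13    # Cause is coprocessor 0 register 13

word = encode_mfc0(T0, CAUSE)
print(f"0x{word:08X}")  # 0x40086800
```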
Chapter 7 <111>
MICROARCHITECTURE Exception Hardware: EPC & Cause
[Figure: multicycle datapath with exception hardware. New control signals EPCWrite, IntCause, and CauseWrite enable the EPC and Cause registers; the ALU produces an Overflow flag; the PCSrc1:0 = 11 multiplexer input supplies the exception handler address 0x80000180.]
Chapter 7 <112>
MICROARCHITECTURE Control FSM with Exceptions
[Figure: multicycle control FSM extended with exception states. Recoverable state outputs:]
S3: MemRead: IorD = 1
S4: Mem Writeback: RegDst = 0, MemtoReg = 01, RegWrite
S5: MemWrite: IorD = 1, MemWrite
S7: ALU Writeback: RegDst = 1, MemtoReg = 00, RegWrite
S10: ADDI Writeback: RegDst = 0, MemtoReg = 00, RegWrite
S12: Undefined (entered when Op = others): PCSrc = 11, PCWrite, IntCause = 1, CauseWrite, EPCWrite
S13: Overflow (entered on Overflow): PCSrc = 11, PCWrite, IntCause = 0, CauseWrite, EPCWrite
S14: MFC0: RegDst = 0, MemtoReg = 10, RegWrite
Chapter 7 <113>
MICROARCHITECTURE Exception Hardware: mfc0
[Figure: multicycle datapath extended for mfc0. A multiplexer controlled by IntCause selects the cause code written to the Cause register (0x30 when IntCause = 0, 0x28 when IntCause = 1). Instr15:11 selects which coprocessor 0 register is read (01101 = Cause, 01110 = EPC), and the result C0 reaches the register file through the MemtoReg1:0 = 10 multiplexer input.]
Chapter 7 <114>
MICROARCHITECTURE Advanced Microarchitecture
• Deep Pipelining
• Branch Prediction
• Superscalar Processors
• Out of Order Processors
• Register Renaming
• SIMD
• Multithreading
• Multiprocessors
Chapter 7 <115>
MICROARCHITECTURE Deep Pipelining
• 10-20 stages typical
• Number of stages limited by:
– Pipeline hazards
– Sequencing overhead
– Power
– Cost
Chapter 7 <116>
MICROARCHITECTURE Branch Prediction
• Ideal pipelined processor: CPI = 1
• Branch misprediction increases CPI
• Static branch prediction:
– Check direction of branch (forward or backward)
– If backward, predict taken
– Else, predict not taken
• Dynamic branch prediction:
– Keep history of last (several hundred) branches in branch target
buffer, record:
• Branch destination
• Whether branch was taken
Chapter 7 <117>
MICROARCHITECTURE Branch Prediction Example
add $s1, $0, $0 # sum = 0
add $s0, $0, $0 # i = 0
addi $t0, $0, 10 # $t0 = 10
for:
beq $s0, $t0, done # if i == 10, branch
add $s1, $s1, $s0 # sum = sum + i
addi $s0, $s0, 1 # increment i
j for
done:
Chapter 7 <118>
MICROARCHITECTURE 1-Bit Branch Predictor
• Remembers whether branch was taken the
last time and does the same thing
• Mispredicts first and last branch of loop
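A short simulation shows why a 1-bit predictor misses twice per loop. This sketch assumes the loop from the Branch Prediction Example slide, where the `beq` is not taken for i = 0..9 and taken once at i = 10, and assumes the predictor starts out remembering "taken" from the previous exit of the same loop. The function name is illustrative.

```python
def mispredictions_1bit(outcomes, last_taken):
    """Return the indices of branches a 1-bit predictor gets wrong."""
    missed = []
    for index, taken in enumerate(outcomes):
        if taken != last_taken:   # predictor repeats the previous outcome
            missed.append(index)
        last_taken = taken        # remember what the branch actually did
    return missed


# beq $s0, $t0, done: not taken ten times, then taken when i == 10.
loop_outcomes = [False] * 10 + [True]

missed = mispredictions_1bit(loop_outcomes, last_taken=True)
print(missed)  # [0, 10]: the first and the last branch of the loop
```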
Chapter 7 <119>
MICROARCHITECTURE 2-Bit Branch Predictor
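[Figure: 2-bit branch predictor state transition diagram]
State:        strongly taken   weakly taken   weakly not taken   strongly not taken
Prediction:   taken            taken          not taken          not taken
• A taken branch moves the state one step toward strongly taken
• A not-taken branch moves the state one step toward strongly not taken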
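The four-state predictor can be modeled as a saturating counter. The sketch below assumes the usual encoding (0 = strongly not taken up to 3 = strongly taken) and reuses the loop branch from the Branch Prediction Example slide, which is not taken ten times and then taken once; the names are illustrative.

```python
STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, STRONGLY_TAKEN = range(4)


def run_2bit(outcomes, state):
    """Simulate a 2-bit predictor; return (mispredicted indices, final state)."""
    missed = []
    for index, taken in enumerate(outcomes):
        predicted_taken = state >= WEAKLY_TAKEN
        if predicted_taken != taken:
            missed.append(index)
        # Move one step toward the observed outcome, saturating at the ends.
        if taken:
            state = min(state + 1, STRONGLY_TAKEN)
        else:
            state = max(state - 1, STRONGLY_NOT_TAKEN)
    return missed, state


loop_outcomes = [False] * 10 + [True]

# After a previous pass through the loop the predictor sits at weakly not taken.
missed, final_state = run_2bit(loop_outcomes, WEAKLY_NOT_TAKEN)
print(missed, final_state)  # [10] 1
```

Under these assumptions only the final (loop-exit) branch is mispredicted, and the predictor ends in the same state it started in, so every later pass through the loop behaves the same way.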
Chapter 7 <120>
MICROARCHITECTURE Superscalar
• Multiple copies of datapath execute multiple
instructions at once
• Dependencies make it tricky to issue multiple
instructions at once
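[Figure: two-way superscalar datapath. The instruction memory fetches two instructions per cycle; the register file has six address ports (A1-A6), four read ports (RD1, RD2, RD4, RD5), and two write ports (WD3, WD6); two ALUs feed a data memory with two ports.]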
Chapter 7 <121>
MICROARCHITECTURE Superscalar Example
lw $t0, 40($s0)
add $t1, $s1, $s2
sub $t2, $s1, $s3      Ideal IPC: 2
and $t3, $s3, $s4      Actual IPC: 2
or $t4, $s1, $s5
sw $s5, 80($s0)
[Figure: two-way superscalar timing diagram, cycles 1-8. The instructions issue in pairs (lw/add, sub/and, or/sw) on consecutive cycles; there are no dependencies among them, so no stalls occur.]
Chapter 7 <122>
MICROARCHITECTURE Superscalar with Dependencies
lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3 Ideal IPC: 2
and $t2, $s4, $t0 Actual IPC: 6/5 = 1.2
or $t3, $s5, $s6
sw $s7, 80($t3)
[Figure: two-way superscalar timing diagram, cycles 1-9. lw issues first; add and sub are fetched together but stall one cycle while add waits for $t0 from the load; and and or follow as a pair, also delayed by the stall; sw issues last. Six instructions issue over five cycles.]
Chapter 7 <123>
MICROARCHITECTURE Out of Order Processor
• Looks ahead across multiple instructions
• Issues as many instructions as possible at once
• Issues instructions out of order (as long as no
dependencies)
• Dependencies:
– RAW (read after write): one instruction writes, later
instruction reads a register
– WAR (write after read): one instruction reads, later
instruction writes a register
– WAW (write after write): one instruction writes, later
instruction writes a register
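The three dependency types can be detected mechanically from the registers each instruction reads and writes. The sketch below is an illustrative model, not a description of scoreboard hardware: the `Instr` tuple and `hazards` function are hypothetical names, the register lists are written out by hand, and the special case of register $0 is ignored.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]      # register written, or None (e.g. sw)
    srcs: Tuple[str, ...]    # registers read


def hazards(earlier, later):
    """Return the set of dependencies that 'later' has on 'earlier'."""
    found = set()
    if earlier.dest is not None and earlier.dest in later.srcs:
        found.add("RAW")     # later reads what earlier writes
    if later.dest is not None and later.dest in earlier.srcs:
        found.add("WAR")     # later writes what earlier reads
    if earlier.dest is not None and earlier.dest == later.dest:
        found.add("WAW")     # both write the same register
    return found


lw = Instr("lw $t0, 40($s0)", "$t0", ("$s0",))
add = Instr("add $t1, $t0, $s1", "$t1", ("$t0", "$s1"))
sub = Instr("sub $t0, $s2, $s3", "$t0", ("$s2", "$s3"))
or_ = Instr("or $t3, $s5, $s6", "$t3", ("$s5", "$s6"))
sw = Instr("sw $s7, 80($t3)", None, ("$s7", "$t3"))

print(hazards(lw, add))   # {'RAW'}
print(hazards(add, sub))  # {'WAR'}
print(hazards(lw, sub))   # {'WAW'}
```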
Chapter 7 <124>
MICROARCHITECTURE Out of Order Processor
• Instruction-level parallelism (ILP): number
of instructions that can be issued
simultaneously (average < 3)
• Scoreboard: table that keeps track of:
– Instructions waiting to issue
– Available functional units
– Dependencies
Chapter 7 <125>
MICROARCHITECTURE Out of Order Processor Example
lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3 Ideal IPC: 2
and $t2, $s4, $t0 Actual IPC: 6/4 = 1.5
or $t3, $s5, $s6
sw $s7, 80($t3)
[Figure: out-of-order timing diagram, cycles 1-8. Issue order:]
Cycle 1: lw $t0, 40($s0) and or $t3, $s5, $s6
Cycle 2: sw $s7, 80($t3) (RAW on $t3 from or)
Cycle 3: add $t1, $t0, $s1 (RAW on $t0; two-cycle latency between load and use of $t0) and sub $t0, $s2, $s3 (WAR on $t0 with add)
Cycle 4: and $t2, $s4, $t0 (RAW on $t0 from sub)
Chapter 7 <126>
MICROARCHITECTURE Register Renaming
lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3 Ideal IPC: 2
and $t2, $s4, $t0 Actual IPC: 6/3 = 2
or $t3, $s5, $s6
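sw $s7, 80($t3)
[Figure: out-of-order timing diagram with register renaming, cycles 1-7. sub is renamed to write $r0 instead of $t0, which removes the WAR dependency on add. Issue order:]
Cycle 1: lw $t0, 40($s0) and sub $r0, $s2, $s3
Cycle 2: and and or (and now reads the renamed register $r0)
Cycle 3: add $t1, $t0, $s1 (RAW on $t0 from lw) and sw $s7, 80($t3)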
Chapter 7 <127>
MICROARCHITECTURE SIMD
• Single Instruction Multiple Data (SIMD)
– Single instruction acts on multiple pieces of data at once
– Common application: graphics
– Perform short arithmetic operations (also called packed
arithmetic)
• For example, add four 8-bit elements
padd8 $s2, $s0, $s1
Bit position:   31:24     23:16     15:8      7:0
$s0:            a3        a2        a1        a0
$s1:          + b3        b2        b1        b0
$s2:            a3 + b3   a2 + b2   a1 + b1   a0 + b0
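The packed add can be emulated on ordinary 32-bit integers. This is a sketch under one assumption that the slide does not state: each 8-bit lane wraps around on overflow, and no carry propagates into the neighboring lane. The function name mirrors the `padd8` mnemonic but is otherwise illustrative.

```python
def padd8(a, b):
    """Add four packed 8-bit lanes of two 32-bit words, lane by lane."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        # Keep only the low 8 bits so a carry never spills into the next lane.
        result |= (lane_sum & 0xFF) << shift
    return result


s0 = 0x01020304   # a3 a2 a1 a0
s1 = 0x10203040   # b3 b2 b1 b0
print(f"0x{padd8(s0, s1):08X}")  # 0x11223344
```

An ordinary 32-bit add would give the same answer here, but the results differ as soon as one lane overflows: `padd8(0x000000FF, 0x00000001)` is 0 under the wraparound assumption, while a plain add yields 0x100.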
Chapter 7 <128>
MICROARCHITECTURE Advanced Architecture Techniques
• Multithreading
– Word processor: threads for typing, spell checking,
printing
• Multiprocessors
– Multiple processors (cores) on a single chip
Chapter 7 <129>
MICROARCHITECTURE Threading: Definitions
• Process: program running on a computer
– Multiple processes can run at once: e.g., surfing
the Web, playing music, writing a paper
• Thread: part of a program
– Each process has multiple threads: e.g., a word
processor may have threads for typing, spell
checking, printing
Chapter 7 <130>
MICROARCHITECTURE Threads in Conventional Processor
• Only one thread runs at a time
• When one thread stalls (for example, waiting
for memory):
– Architectural state of that thread stored
– Architectural state of waiting thread loaded into
processor and it runs
– Called context switching
• Appears to user like all threads running
simultaneously
Chapter 7 <131>
MICROARCHITECTURE Multithreading
• Multiple copies of architectural state
• Multiple threads active at once:
– When one thread stalls, another runs immediately
– If one thread can’t keep all execution units busy,
another thread can use them
• Does not increase instruction-level parallelism
(ILP) of single thread, but increases
throughput
• Intel calls this “hyperthreading”
Chapter 7 <132>
MICROARCHITECTURE Multiprocessors
• Multiple processors (cores) with a method of
communication between them
• Types:
– Homogeneous: multiple cores with shared memory
– Heterogeneous: separate cores for different tasks (for
example, DSP and CPU in cell phone)
– Clusters: each core has own memory system
Chapter 7 <133>
MICROARCHITECTURE Other Resources
• Hennessy & Patterson: Computer
Architecture: A Quantitative Approach
• Conferences:
– www.cs.wisc.edu/~arch/www/
– ISCA (International Symposium on Computer
Architecture)
– HPCA (International Symposium on High Performance
Computer Architecture)
Chapter 7 <134>