lecture08_RISCV_Impl_2
Implementation
Acknowledgements
• The notes cover Appendix C of the textbook, but we use
RISC-V instead of the MIPS ISA
– Slides for general RISC ISA implementation are adapted from
the lecture slides for “Computer Organization and Design, Fifth
Edition: The Hardware/Software Interface”
– Slides for RISC-V single-cycle implementation are adapted
from Computer Science 152: Computer Architecture and
Engineering, Spring 2016, by Dr. George Michelogiannakis,
UC Berkeley
Introduction
• CPU performance factors
CPU Time = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)
– Instruction count
• Determined by ISA and compiler
– CPI and cycle time
• Determined by CPU hardware
• Simple subset, shows most aspects
– Memory reference: lw, sw
– Arithmetic/logical: add, sub, and, or, slt
– Control transfer: beq, j
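The CPU time product can be checked numerically. The sketch below is only an illustration: the function name and the example workload (one million instructions, CPI of 1, 1 ns cycle) are assumptions, not figures from the lecture.

```python
def cpu_time_seconds(instruction_count, cpi, cycle_time_s):
    """CPU time = instructions/program * cycles/instruction * seconds/cycle."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical workload: 1 million instructions, CPI = 1, 1 ns clock period.
single_cycle = cpu_time_seconds(1_000_000, 1.0, 1e-9)

# Same program on a hypothetical machine with CPI = 2 takes twice as long.
slower = cpu_time_seconds(1_000_000, 2.0, 1e-9)
```

The ISA and compiler set the first factor; the CPU hardware sets the other two.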
Components of a Computer
[Figure: components of a computer. Labels: Processor (Control, Datapath, PC, Registers), Memory (Program, Data, Bytes), Input, and the processor-memory signals Enable?, Read/Write, Address, Write Data.]
Datapath and Control
[Figure: datapath and control overview. PC, +4 adder, instruction memory, registers (rs, rt, rd), imm, ALU, and data memory, with a Controller driven by opcode and funct.]
Logic Design Basics
• Information encoded in binary
– Low voltage = 0, High voltage = 1
– One wire per bit
– Multi-bit data encoded on multi-wire buses
• Combinational circuit
– Operate on data
– Output is a function of input
• State (sequential) circuit
– Store information
Combinational Circuits
• AND gate
– Y = A & B
• Adder
– Y = A + B
• Multiplexer
– Y = S ? I1 : I0
• Arithmetic/Logic Unit
– Y = F(A, B)
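The four combinational elements can be modelled as pure functions, since each output depends only on the current inputs. This is a minimal sketch: the 32-bit bus width and the set of ALU function names are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumed bus width of 32 bits


def and_gate(a, b):
    """Y = A & B"""
    return a & b


def adder(a, b):
    """Y = A + B, wrapped to the bus width."""
    return (a + b) & MASK32


def mux(s, i0, i1):
    """Y = S ? I1 : I0"""
    return i1 if s else i0


def alu(f, a, b):
    """Y = F(A, B); the function names here are illustrative."""
    if f == "add":
        return (a + b) & MASK32
    if f == "sub":
        return (a - b) & MASK32
    if f == "and":
        return a & b
    if f == "or":
        return a | b
    raise ValueError(f"unknown ALU function: {f}")
```

None of these functions keeps any state between calls, which is what separates them from the sequential elements that follow.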
Edge-Triggered D Flip Flops
• Value of D is sampled on the positive clock edge.
[Figure: D flip-flop symbol (D, CLK, Q) and timing diagram.]
Sequential Circuits
• Register with write control
– Only updates on clock edge when write control input is 1
– Used when stored value is required later
[Figure: register with write control (D, Clk, Write, Q) and timing diagram.]
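A register with write control can be sketched as an object whose stored value changes only when a clock edge arrives with the write input asserted. The class and method names below are hypothetical.

```python
class Register:
    """Edge-triggered register with a write-control input (illustrative model)."""

    def __init__(self, reset_value=0):
        self.q = reset_value  # currently stored value (output Q)

    def clock_edge(self, d, write):
        """Model one positive clock edge: capture D only when write is 1."""
        if write:
            self.q = d
        return self.q


r = Register()
r.clock_edge(d=7, write=0)  # write is 0, so Q keeps its old value
r.clock_edge(d=7, write=1)  # write is 1, so Q becomes 7
```

Between calls to `clock_edge` the output holds steady, which is why the stored value is available later.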
Clocking Methodology
• Combinational logic transforms data during clock
cycles
– Between clock edges
– Input from state elements, output to state element
– Longest delay determines clock period
Single cycle data paths
Processor uses synchronous logic design (a “clock”).
f          T
1 MHz      1 μs
10 MHz     100 ns
100 MHz    10 ns
1 GHz      1 ns
All state elements act like positive edge-triggered flip-flops.
[Figure: D flip-flop with D, Q, Reset? and clk.]
Hardware Elements of CPU
• Combinational circuits
– Mux, Decoder, ALU, ...
– ALU operations chosen by OpSelect: Add, Sub, ...; And, Or, Xor, Not, ...; GT, LT, EQ, Zero, ...
[Figure: mux (inputs A0 ... An-1, lg(n)-bit Sel, output O), decoder (lg(n)-bit input A, outputs O0 ... On-1), and ALU (inputs A and B, OpSelect, outputs Result and Comp?).]
[Figure: flip-flop with enable (D, Clk, En, Q), register built from flip-flops (Q0 ... Qn-1), and a 2R+1W register file with inputs Clock, WE, ReadSel1, ReadSel2, WriteSel, WriteData and outputs ReadData1, ReadData2.]
Register File Implementation
• RISC-V integer instructions have at most 2 register source
operands
[Figure: register file of 32 registers (reg 0 ... reg 31). Inputs: clk, we, 5-bit rs1, rs2 and rd selectors, 32-bit wdata. Outputs: 32-bit rdata1 and rdata2.]
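A register file with two read ports and one write port (2R+1W) might be modelled as below. The class name is hypothetical, and hard-wiring register 0 to zero follows the RISC-V convention for x0 rather than anything drawn on the slide.

```python
class RegisterFile:
    """32 x 32-bit register file with two read ports and one write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, rs1, rs2):
        """Combinational read of both source operands."""
        return self.regs[rs1], self.regs[rs2]

    def clock_edge(self, we, rd, wdata):
        """Write wdata into rd on the clock edge if we is set.

        Register 0 is kept at zero (RISC-V x0 convention, an assumption here).
        """
        if we and rd != 0:
            self.regs[rd] = wdata & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(we=1, rd=5, wdata=42)
rdata1, rdata2 = rf.read(5, 0)
```

Two read selectors are enough because no integer instruction needs more than two source registers.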
A Simple Memory Model
[Figure: “magic” RAM with WriteEnable, Clock, Address and WriteData inputs and a ReadData output.]
Five Stages of Instruction Execution
Stages of Execution on Datapath
[Figure: datapath with PC, +4 adder, instruction memory, registers (rs, rt, rd), imm, ALU, and data memory.]
Stages of Execution (1/5)
• There is a wide variety of instructions, so what general
steps do they have in common?
• Stage 1: Instruction Fetch
– The 32-bit instruction word must first be fetched from
memory
• in practice, from the cache-memory hierarchy
– This is also where we increment the PC
• PC = PC + 4, to point to the next instruction: memory is byte
addressed and each instruction is 4 bytes, so + 4
Stages of Execution (2/5)
Stages of Execution (3/5)
Stages of Execution (4/5)
• Stage 4: Memory Access: only load and store instructions do
anything during this stage
– the others remain idle during this stage or skip it altogether
– since these instructions have a unique step, we need this extra
stage to account for them
– as a result of the cache system, this stage is expected to be
fast
Stages of Execution (5/5)
• Stage 5: Register Write
– most instructions write the result of some computation into a
register
– examples: arithmetic, logical, shifts, loads, slt
– what about stores, branches, jumps?
• they don’t write anything into a register at the end
• these remain idle during this fifth stage or skip it altogether
Instruction Execution
• PC → instruction memory, fetch instruction
• Register numbers → register file, read registers
• Depending on instruction class
– Use ALU to calculate
• Arithmetic result
• Memory address for load/store
• Branch condition and target address
– Access data memory for load/store
– PC ← target address or PC + 4
CPU Components
Multiplexers
• Can’t just join wires together
• Use multiplexers
Control Signals
Building a Datapath
• Datapath
– Elements that process data and addresses
in the CPU
• Registers, ALUs, mux’s, memories, …
• We will build a RISC-V datapath incrementally
– Refining the overview design
Instruction Fetch
[Figure: instruction fetch. The PC, a 32-bit register, addresses instruction memory; an adder increments the PC by 4 for the next instruction.]
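Instruction fetch can be sketched as a function that reads one 32-bit word at the PC and returns it with PC + 4. The function name and the use of a little-endian `bytes` object as instruction memory are assumptions for illustration.

```python
def fetch(pc, imem):
    """Return (instruction word, next PC) for a byte-addressed instruction memory.

    imem is a bytes object holding little-endian 32-bit instruction words.
    """
    inst = int.from_bytes(imem[pc:pc + 4], "little")
    next_pc = (pc + 4) & 0xFFFFFFFF  # each instruction occupies 4 bytes
    return inst, next_pc


# Two example words placed back to back in instruction memory.
imem = (0x00000013).to_bytes(4, "little") + (0x00000033).to_bytes(4, "little")
inst0, pc1 = fetch(0, imem)
inst1, pc2 = fetch(pc1, imem)
```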
R-Format Instructions
• Read two register operands
• Perform arithmetic/logical operation
• Write register result
Load/Store Instructions
• Read register operands
• Calculate address using 12-bit offset
– Use ALU, but sign-extend offset
• Load: Read memory and update register
• Store: Write register value to memory
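The address calculation for loads and stores can be sketched as a sign extension followed by an add. The helper names are hypothetical and a 32-bit address space is assumed.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def effective_address(base, offset12):
    """Base register value plus sign-extended 12-bit offset, wrapped to 32 bits."""
    return (base + sign_extend(offset12, 12)) & 0xFFFFFFFF


# An offset field of 0xFFC encodes -4, so the access lands 4 bytes below the base.
addr = effective_address(0x1000, 0xFFC)
```

Without the sign extension, the offset 0xFFC would be read as +4092 and negative displacements could not be expressed.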
Branch Instructions
• Read register operands
• Compare operands
– Use ALU, subtract and check Zero output
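• Calculate target address
– Sign-extend displacement
– Shift left 2 places (word displacement)
– Add to PC + 4
• Already calculated by instruction fetch
– Note: this is the MIPS-style recipe; a RISC-V branch shifts its
immediate left by 1 place and adds it to the PC of the branch itself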
Branch Instructions
[Figure: branch datapath. Annotations: the shift-left unit just re-routes wires; sign extension replicates the sign-bit wire.]
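The branch steps listed on the slide (compare by subtraction, sign-extend, shift left 2, add to PC + 4) can be sketched as follows. The function names and the 12-bit displacement width are assumptions; RISC-V itself scales the immediate by 2 and adds it to the branch's own PC, so the constants would differ there.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def beq_next_pc(pc, rs1_val, rs2_val, disp, disp_bits=12):
    """Next PC for a branch-if-equal, following the slide's recipe."""
    zero = ((rs1_val - rs2_val) & MASK32) == 0            # ALU subtract, check Zero
    pc_plus_4 = (pc + 4) & MASK32                         # already computed by fetch
    offset = sign_extend(disp, disp_bits) << 2            # word displacement -> bytes
    target = (pc_plus_4 + offset) & MASK32
    return target if zero else pc_plus_4
```

In hardware the shift costs nothing: the wires are simply connected two positions over.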
Composing the Elements
• First-cut data path does an instruction in one clock
cycle
– Each datapath element can only do one function at a time
– Hence, we need separate instruction and data memories
• Use multiplexers where alternate data sources are
used for different instructions
R-Type/Load/Store Datapath
Full Datapath
ALU Control
• ALU used for
– Load/Store: F = add
– Branch: F = subtract
– R-type: F depends on funct field
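The three cases above amount to a small lookup. In this sketch the instruction-class strings and function name are hypothetical; the funct3/funct7 values are the standard RV32I encodings for add, sub, and, or and slt.

```python
# (funct3, funct7) -> ALU function for the R-type subset used in these notes.
R_TYPE_FUNCT = {
    (0b000, 0b0000000): "add",
    (0b000, 0b0100000): "sub",
    (0b111, 0b0000000): "and",
    (0b110, 0b0000000): "or",
    (0b010, 0b0000000): "slt",
}


def alu_control(inst_class, funct3=0, funct7=0):
    """Pick the ALU function F from the instruction class and funct fields."""
    if inst_class in ("load", "store"):
        return "add"       # address = base + offset
    if inst_class == "branch":
        return "sub"       # compare operands via the Zero output
    if inst_class == "r-type":
        return R_TYPE_FUNCT[(funct3, funct7)]
    raise ValueError(f"unknown instruction class: {inst_class}")
```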
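[Figure: ALU instruction datapath. The PC (incremented by 0x4 each clk) addresses instruction memory; Inst<19:15> and Inst<24:20> select rs1 and rs2 in the GPRs, Inst<11:7> is the write address wa, rd1 and rd2 feed the ALU, and the ALU result returns on wd. Inst<14:12> feeds ALU Control; RegWriteEn drives we.]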
[Figure: register-immediate ALU datapath. Inst<19:15> selects rs1, Inst<11:7> is the write address wa, and Inst<31:20> passes through Imm Select (controlled by ImmSel) to the second ALU input. OpCode and Inst<14:12> feed ALU Control; RegWriteEn drives we.]
I-type format: rd ← (rs1) op immediate
Field   immediate12  rs1    func3  rd    opcode
Width   12           5      3      5     7
Bits    31:20        19:15  14:12  11:7  6:0
Conflicts in Merging Datapath
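[Figure: the register-register and register-immediate datapaths overlaid. Inst<24:20> (rs2) and Inst<31:20> (through Imm Select) both need a path to the ALU, so the annotation reads: introduce muxes. Control signals shown: RegWrite, ImmSel, OpCode, ALU Control from Inst<14:12>.]
R-type format: rd ← (rs1) func (rs2)
Field   func7  rs2    rs1    func3  rd    opcode
Width   7      5      5      3      5     7
Bits    31:25  24:20  19:15  14:12  11:7  6:0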
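[Figure: merged datapath for ALU instructions. An Op2Sel mux (Reg / Imm) picks either rd2 or the Imm Select output as the second ALU operand. <6:0> is the OpCode, which generates ImmSel and Op2Sel; <14:12> feeds ALU Control; <19:15>, <24:20> and <11:7> select rs1, rs2 and the write address; RegWriteEn drives we.]
R-type format (func7, rs2, rs1, func3, rd, opcode; widths 7, 5, 5, 3, 5, 7): rd ← (rs1) func (rs2)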
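The bit ranges in the instruction formats map directly onto shift-and-mask field extraction. The sketch below is illustrative: the function names are hypothetical, and the two example words are the RV32I encodings of add x3, x1, x2 and addi x1, x0, -1.

```python
def bits(inst, hi, lo):
    """Extract inst<hi:lo> as an unsigned value."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret the low `width` bits of value as two's complement."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def decode(inst):
    """Split a 32-bit instruction into the R-type and I-type fields."""
    return {
        "opcode": bits(inst, 6, 0),
        "rd": bits(inst, 11, 7),
        "func3": bits(inst, 14, 12),
        "rs1": bits(inst, 19, 15),
        "rs2": bits(inst, 24, 20),      # meaningful for R-type only
        "func7": bits(inst, 31, 25),    # meaningful for R-type only
        "imm12": sign_extend(bits(inst, 31, 20), 12),  # meaningful for I-type only
    }


r_type = decode(0x002081B3)  # add x3, x1, x2
i_type = decode(0xFFF00093)  # addi x1, x0, -1
```

Bits 31:20 hold either func7 and rs2 or the 12-bit immediate, which is why the merged datapath needs a mux on the second ALU operand.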
Conditional Branches (BEQ/BNE/BLT/BGE/BLTU/BGEU)
[Figure: datapath extended for conditional branches. A second adder computes the branch target (br); Br Logic uses the Bcomp? result and drives PCSel, which chooses between pc+4 and br. Data memory (we, addr, wdata, rdata) follows the ALU. Control signals shown: PCSel, RegWrEn, MemWrite, WBSel, Imm Select, ALU Control.]
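[Figure: datapath with decoder and control signals. The Decoder produces the control signals rf_wen, wb_sel, Op1Sel, Op2Sel, AluFun, mem_val and mem_rw. ir[19:15] and ir[24:20] select rs1 and rs2 in the Reg File and ir[11:7] is the write address; immediates come from ir[31:20] (sign-extended), the SType fields (sign-extended) and ir[31:12] (UType). The ALU output addresses Data Mem (addr, wdata, rdata).]
Note: for simplicity, the CSR File (control and status registers) and
associated datapath are not shown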
Execute Stage
Hardwired Control is pure Combinational Logic
• Inputs: op code, Equal?
• Outputs: ImmSel, Op2Sel, FuncSel, MemWrite, WBSel, WASel, RegWriteEn, PCSel
ALU Control & Immediate Extension
• Decode Map inputs: Inst<14:12> (Func3), Inst<6:0> (Opcode)
• FuncSel ( Func, Op, +, 0? ) selects the ALUop
• ImmSel ( IType12, SType12, UType20 ) selects the immediate format
Hardwired Control Table
Opcode    ImmSel    Op2Sel  FuncSel  MemWr  RFWen  WBSel  WASel  PCSel
ALU       *         Reg     Func     no     yes    ALU    rd     pc+4
ALUi      IType12   Imm     Op       no     yes    ALU    rd     pc+4
LW        IType12   Imm     +        no     yes    Mem    rd     pc+4
SW        SType12   Imm     +        yes    no     *      *      pc+4
BEQtrue   SBType12  *       *        no     no     *      *      br
BEQfalse  SBType12  *       *        no     no     *      *      pc+4
J         *         *       *        no     no     *      *      jabs
JAL       *         *       *        no     yes    PC     X1     jabs
JALR      *         *       *        no     yes    PC     rd     rind
(* = don't care)
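Because hardwired control is purely combinational, the table can be read as a lookup from opcode class (plus the Equal? input for branches) to control signals. The sketch below copies the table rows; the Python names and the use of True/False for yes/no are assumptions for illustration.

```python
FIELDS = ("ImmSel", "Op2Sel", "FuncSel", "MemWr", "RFWen", "WBSel", "WASel", "PCSel")

# "*" marks a don't-care entry, as in the table.
ROWS = {
    "ALU":      ("*",        "Reg", "Func", False, True,  "ALU", "rd", "pc+4"),
    "ALUi":     ("IType12",  "Imm", "Op",   False, True,  "ALU", "rd", "pc+4"),
    "LW":       ("IType12",  "Imm", "+",    False, True,  "Mem", "rd", "pc+4"),
    "SW":       ("SType12",  "Imm", "+",    True,  False, "*",   "*",  "pc+4"),
    "BEQtrue":  ("SBType12", "*",   "*",    False, False, "*",   "*",  "br"),
    "BEQfalse": ("SBType12", "*",   "*",    False, False, "*",   "*",  "pc+4"),
    "J":        ("*",        "*",   "*",    False, False, "*",   "*",  "jabs"),
    "JAL":      ("*",        "*",   "*",    False, True,  "PC",  "X1", "jabs"),
    "JALR":     ("*",        "*",   "*",    False, True,  "PC",  "rd", "rind"),
}


def control_signals(op, equal=False):
    """Return the control signals for an opcode class; BEQ also uses Equal?."""
    if op == "BEQ":
        op = "BEQtrue" if equal else "BEQfalse"
    return dict(zip(FIELDS, ROWS[op]))
```

For example, `control_signals("BEQ", equal=True)["PCSel"]` gives `"br"`, while the same call with `equal=False` gives `"pc+4"`.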
Single-Cycle Hardwired Control
The clock period must be long enough for all of the following steps to
be “completed”:
1. Instruction fetch
2. Decode and register fetch
3. ALU operation
4. Data fetch if required
5. Register write-back setup time
=> t_C > t_IFetch + t_RFetch + t_ALU + t_DMem + t_RWB
At the rising edge of the following clock, the PC, register file
and memory are updated
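The clock-period bound is a plain sum of the stage delays. In the sketch below the delay values are made-up picosecond figures chosen only to show the arithmetic; they are not measurements from the lecture.

```python
def min_clock_period(t_ifetch, t_rfetch, t_alu, t_dmem, t_rwb):
    """Lower bound on t_C for a single-cycle design: all steps in one cycle."""
    return t_ifetch + t_rfetch + t_alu + t_dmem + t_rwb


# Hypothetical stage delays in picoseconds.
t_c_ps = min_clock_period(t_ifetch=200, t_rfetch=100, t_alu=200, t_dmem=200, t_rwb=100)

# Upper bound on clock frequency in Hz for that period (1 ps = 1e-12 s).
f_max_hz = 1e12 / t_c_ps
```

Every instruction pays for the slowest path, even one that never touches data memory.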
Implementation in Practice
• Load-store RISC ISAs are designed for efficient pipelined
implementations
– Inspired by earlier Cray machines (CDC 6600/7600)
• RISC-V ISA implemented using the Chisel hardware
construction language
– Chisel: https://chisel.eecs.berkeley.edu/
– Getting started:
• https://chisel.eecs.berkeley.edu/2.2.0/getting-started.html
– Check resource page for slides and other info
Chisel in one slide
• Module
• IO
• Wire
• Reg
• Mem
UCB RISC-V Sodor
• https://github.com/ucb-bar/riscv-sodor
– Single-cycle:
• https://github.com/ucb-bar/riscv-sodor/tree/master/src/rv32_1stage