CA Chap5 Memory
Memory Hierarchy
1
5.1. Introduction: Memory Technology
❑ Static RAM (SRAM)
l 0.5ns – 2.5ns, $2000 – $5000 per GB
❑ Dynamic RAM (DRAM)
l 50ns – 70ns, $20 – $75 per GB
❑ Magnetic disk
l 5ms – 20ms, $0.20 – $2 per GB
❑ Ideal memory
l Access time of SRAM
l Capacity and cost/GB of disk
2
Principle of Locality
❑ Programs access a small proportion of their address space at any time
l Temporal locality: items accessed recently are likely to be accessed again soon
l Spatial locality: items near those accessed recently are likely to be accessed soon
4
Memory Hierarchy Levels
5
5.2. Cache Memory
❑ Cache memory
l The level of the memory hierarchy closest to the CPU
◼ How do we know if the data is present?
◼ Where do we look?
6
Direct Mapped Cache
◼ Location determined by address: cache index = (block address) modulo (#blocks in cache)
◼ #Blocks is a power of 2
◼ Use low-order address bits
7
Tags and Valid Bits
❑ How do we know which particular block is stored in a cache location?
l Store block address as well as the data
l Actually, only need the high-order bits
l Called the tag
❑ What if there is no data in a location?
l Valid bit: 1 = present, 0 = not present
l Initially 0 (V=0); set to 1 (V=1) once a block is stored in the cache after its initial miss
8
Cache Example
(Slides 9–14: a worked example stepping through a sequence of accesses to the direct-mapped cache, showing the cache contents after each access; the per-step tables are not reproduced here. A simulation sketch follows.)
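To make the stepping concrete, here is a minimal sketch (not from the slides) of a direct-mapped lookup in C; the 8-block geometry and the word-address sequence are assumptions for illustration.

/* A minimal sketch of a direct-mapped lookup:
 * 8 one-word blocks and the word-address sequence are assumptions. */
#include <stdio.h>

#define NBLOCKS 8   /* power of 2, so index = low-order address bits */

struct line { int valid; unsigned tag; };

int main(void) {
    struct line cache[NBLOCKS] = {{0, 0}};     /* valid bits start at 0 */
    unsigned seq[] = {22, 26, 22, 26, 16, 3, 16, 18};  /* assumed accesses */
    int n = sizeof seq / sizeof seq[0];
    for (int i = 0; i < n; i++) {
        unsigned index = seq[i] % NBLOCKS;     /* low-order bits */
        unsigned tag   = seq[i] / NBLOCKS;     /* high-order bits */
        if (cache[index].valid && cache[index].tag == tag) {
            printf("addr %2u -> index %u: hit\n", seq[i], index);
        } else {
            printf("addr %2u -> index %u: miss\n", seq[i], index);
            cache[index].valid = 1;            /* block fetched: mark valid */
            cache[index].tag   = tag;
        }
    }
    return 0;
}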
Address Subdivision
15
Example: Larger Block Size
❑ 64 blocks, 16 bytes/block
l To what cache block does byte address 1200 map?
❑ Block address = ⌊1200 / 16⌋ = 75
❑ Cache block index = 75 modulo 64 = 11

31 … 10 | 9 … 4 | 3 … 0
Tag       Index   Offset
22 bits   6 bits  4 bits

Block address No. 75 in memory maps to block number 11 in the cache.
16
Block Size Considerations
17
Cache Misses
18
Write-Through
❑ On data-write hit, could just update the
block in cache fast
CPU Cache
l But then cache and memory would be
slow
inconsistent
Mem
❑ Write through: also update memory
❑ But makes writes take longer
l e.g., if base CPI = 1, 10% of instructions
are stores, write to memory takes 100
cycles
- Effective CPI = 1 + 10% × 100 = 11 CPU Cache
Buffer
❑ Solution: write buffer
l Holds data waiting to be written to memory
l CPU continues immediately Mem
- Only stalls on write if write buffer is already
full 19
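A sketch of the write buffer idea above, assuming a 4-entry FIFO and illustrative names; the CPU's store returns immediately unless the buffer is full.

/* A sketch of the write buffer: a 4-entry FIFO (depth assumed).
 * The CPU's store returns immediately unless the buffer is full. */
#define WB_DEPTH 4

struct wb_entry { unsigned addr, data; };
struct write_buffer {
    struct wb_entry e[WB_DEPTH];
    int head, tail, count;
};

/* Memory system drains one queued write when DRAM completes it. */
void wb_drain_one(struct write_buffer *wb) {
    if (wb->count > 0) {
        wb->head = (wb->head + 1) % WB_DEPTH;
        wb->count--;
    }
}

/* Returns the number of cycles the CPU stalled on this store. */
int cpu_store(struct write_buffer *wb, unsigned addr, unsigned data) {
    int stalls = 0;
    while (wb->count == WB_DEPTH) {   /* full: wait for an entry to drain */
        wb_drain_one(wb);             /* toy model: one drain per stall cycle */
        stalls++;
    }
    wb->e[wb->tail].addr = addr;      /* enqueue; CPU continues immediately */
    wb->e[wb->tail].data = data;
    wb->tail = (wb->tail + 1) % WB_DEPTH;
    wb->count++;
    return stalls;                    /* usually 0 */
}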
Write-Back
❑ Alternative: on a data-write hit, just update the block in cache
l Keep track of whether each block is dirty
❑ When a dirty block is replaced
l Write it back to memory
l Can use a write buffer to allow the replacing block to be read first
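A sketch of the dirty-bit bookkeeping described above, with placeholder memory-access functions; on a write hit only the cache and the dirty bit are updated, and the block is written back only when replaced.

/* A sketch of write-back bookkeeping with a dirty bit per block;
 * mem_read_block/mem_write_block are placeholders, not a real API. */
struct wb_line { int valid, dirty; unsigned tag, data; };

extern void     mem_write_block(unsigned addr, unsigned data);
extern unsigned mem_read_block(unsigned addr);

/* Write hit: update the cached copy only, and remember it is dirty. */
void cache_write_hit(struct wb_line *line, unsigned data) {
    line->data  = data;
    line->dirty = 1;          /* memory is now stale until replacement */
}

/* Replacement: a dirty block must be written back before being evicted. */
void cache_replace(struct wb_line *line, unsigned old_addr, unsigned new_addr,
                   unsigned new_tag) {
    if (line->valid && line->dirty)
        mem_write_block(old_addr, line->data);   /* write dirty block back */
    line->data  = mem_read_block(new_addr);      /* fetch the new block */
    line->tag   = new_tag;
    line->valid = 1;
    line->dirty = 0;
}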
20
Write Allocation
❑ What should happen on a write miss?
❑ Alternatives for write-through
l Allocate on miss: fetch the block
l No write allocate (aka write around): don't fetch the block
- Since programs often write a whole block before reading it (e.g., initialization)
❑ For write-back
l Usually fetch the block
21
Example: Intrinsity FastMATH
❑ Embedded MIPS processor
l 12-stage pipeline
l Instruction and data access on each cycle
22
Example: Intrinsity FastMATH
23
Example: Asus K43SJ
24
Main Memory Supporting Caches
❑ Use DRAMs for main memory
l Fixed width (e.g., 1 word)
l Connected by fixed-width clocked bus
- Bus clock is typically slower than CPU clock
26
4-word block, 1-word wide memory (a)
❑ Cache connected to a 1-word-wide RAM
❑ Per miss (bus cycles, "bc"): send address = 1 bc; then for each of the 4 words, DRAM access ("Get Data") = 15 bc plus 1 bc to transfer the word
❑ Miss penalty = 1 + 4 × (15 + 1) = 65 bus cycles
27
4-word block, 4-word wide memory (b)
❑ Cache connected to a 4-word-wide RAM
❑ Send address = 1 bc; one DRAM access ("Get Data") fetches words 1–4 = 15 bc; transfer the 4-word block = 1 bc
❑ Miss penalty = 1 + 15 + 1 = 17 bus cycles
28
4-word block, 4-bank 1-word memory (c)
❑ Four interleaved 1-word banks accessed in parallel
❑ Send address = 1 bc; all four banks access simultaneously ("Get Data") = 15 bc; transfer words 1–4 one at a time = 4 × 1 bc
❑ Miss penalty = 1 + 15 + 4 = 20 bus cycles
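The three miss penalties fall out of the figure's timings (1 bus cycle for the address, 15 per DRAM access, 1 per word transferred); a small check in C:

/* Checking the three miss penalties with the figure's timings:
 * 1 bus cycle (bc) to send the address, 15 bc per DRAM access,
 * 1 bc per word transferred over the bus. */
#include <stdio.h>

int main(void) {
    int addr = 1, access = 15, xfer = 1, words = 4;

    int narrow      = addr + words * (access + xfer); /* (a) 1-word wide  */
    int wide        = addr + access + xfer;           /* (b) 4-word wide  */
    int interleaved = addr + access + words * xfer;   /* (c) 4 banks      */

    printf("(a) %d bc  (b) %d bc  (c) %d bc\n",
           narrow, wide, interleaved);                /* 65, 17, 20 */
    return 0;
}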
29
Advanced DRAM Organization
30
DRAM Generations
31
Measuring Cache Performance
32
Cache Performance Example
❑ Given
l I-cache miss rate = 2%
l D-cache miss rate = 4%
l Miss penalty (memory-access cycles) = 100
❑ Miss cycles per instruction
l I-cache: 2% × 100 = 2
l D-cache: (load/store fraction) × 4% × 100
33
Average Access Time
❑ Hit time is also important for performance
❑ Average memory access time (AMAT)
l AMAT = Hit time + Miss rate × Miss penalty
❑ Example
l CPU with 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
l AMAT = (1 + 5% × 20) × 1 ns = 2 ns
- 2 cycles per instruction
34
Performance Summary
❑ When CPU performance increases
l Miss penalty becomes more significant
35
Associative Caches
❑ Fully associative
l Allow a given block to go in any cache entry
l Requires all entries to be searched at once
l Comparator per entry (expensive)
❑ n-way set associative
l Each set contains n entries
l Block number determines the set: (block number) modulo (#sets in cache)
l Search all entries in a given set at once
l n comparators (less expensive)
36
Associative Cache Example
(Figure: where a memory block may be placed — direct mapped: exactly one cache block (Block# 0–7); 2-way set associative: either entry of one set (Set# 0–3); fully associative: any entry. Each cache entry holds a Tag and Data.)
37
Spectrum of Associativity
❑ For a cache with 8 entries
l 1-way (direct mapped): 8 sets of 1 entry
l 2-way set associative: 4 sets of 2 entries
l 4-way set associative: 2 sets of 4 entries
l 8-way set associative (fully associative): 1 set of 8 entries
38
Associativity Example
❑ Compare 4-block caches: direct mapped, 2-way set associative, fully associative
❑ Block access sequence: 0, 8, 0, 6, 8
39
Associativity Example
◼ Fully associative

Access    Block    Hit/   Cache content after access
sequence  address  miss
1         0        miss   Mem[0]
2         8        miss   Mem[0]  Mem[8]
3         0        hit    Mem[0]  Mem[8]
4         6        miss   Mem[0]  Mem[8]  Mem[6]
5         8        hit    Mem[0]  Mem[8]  Mem[6]
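A sketch reproducing the hit/miss column above: a fully associative cache with LRU replacement, assuming a 4-block capacity (so no entry is actually evicted by this short sequence).

/* Fully associative cache, LRU replacement; 4-block capacity assumed. */
#include <stdio.h>

#define WAYS 4

struct entry { int valid; unsigned tag; int last_used; };

int main(void) {
    struct entry c[WAYS] = {{0, 0, 0}};
    unsigned seq[] = {0, 8, 0, 6, 8};         /* block-address sequence */
    int n = sizeof seq / sizeof seq[0];
    for (int t = 0; t < n; t++) {
        int hit = -1, victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (c[w].valid && c[w].tag == seq[t])
                hit = w;
            /* prefer a non-valid entry; otherwise the least recently used */
            if (!c[w].valid || (c[victim].valid && c[w].last_used < c[victim].last_used))
                victim = w;
        }
        if (hit >= 0) {
            c[hit].last_used = t;             /* refresh LRU timestamp */
        } else {
            c[victim].valid = 1;              /* place Mem[seq[t]] here */
            c[victim].tag = seq[t];
            c[victim].last_used = t;
        }
        printf("block %u: %s\n", seq[t], hit >= 0 ? "hit" : "miss");
    }
    return 0;
}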
40
How Much Associativity
❑ Increased associativity decreases miss rate
l But with diminishing returns
41
Set Associative Cache Organization
42
Replacement Policy
❑ Direct mapped: no choice
❑ Set associative
l Prefer a non-valid entry (V=0), if there is one
l Otherwise, choose among the entries in the set
❑ Least-recently used (LRU)
l Choose the entry unused for the longest time
❑ Random
l Gives approximately the same performance as LRU for high
associativity
43
Multilevel Caches
44
Multilevel Cache Example
❑ Given
l CPU base CPI = 1, clock rate = 4 GHz (0.25 ns/cycle)
l Miss rate/instruction = 2%
l Main memory access time = 100 ns
❑ With just the primary cache
l Miss penalty = 100 ns / 0.25 ns = 400 cycles
l Effective CPI = 1 + 2% × 400 = 9
45
Example (cont.)
❑ Now add L-2 cache
l Access time = 5 ns
l Global miss rate to main memory = 0.5%
❑ Primary miss with L-2 hit: penalty = 5 ns / 0.25 ns = 20 cycles
❑ CPI = 1 + 2% × 20 + 0.5% × 400 = 3.4
l Performance ratio = 9 / 3.4 = 2.6
46
Multilevel Cache Considerations
❑ Primary cache
l Focus on minimal hit time
❑ L-2 cache
l Focus on low miss rate to avoid main memory access
l Hit time has less overall impact
❑ Results
l L-1 cache usually smaller than a single cache
l L-1 block size smaller than L-2 block size
47
Interactions with Advanced CPUs
❑ Out-of-order CPUs can execute instructions during a cache miss
l Pending store stays in the load/store unit
l Dependent instructions wait in reservation stations
- Independent instructions continue
❑ Effect of miss depends on program data flow
l Much harder to analyse
l Use system simulation
48
Interactions with Software
❑ Misses depend on memory access patterns
l Algorithm behavior
l Compiler optimization for memory access
49
5.3. Virtual Memory
❑ Use main memory as a “cache” for secondary (disk) storage
l Managed jointly by CPU hardware and the operating system (OS)
l A VM “block” is called a page; a VM translation “miss” is called a page fault
51
Page Fault Penalty
❑ On page fault, the page must be fetched from disk
l Takes millions of clock cycles
l Handled by OS code
52
Page Tables
❑ Stores placement information
l Array of page table entries (PTEs), indexed by virtual page number
l Page table register in the CPU points to the page table in physical memory
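A sketch of translation through a single-level page table; the 4 KiB page size, field names, and fault handling are assumptions for illustration.

/* A sketch of virtual-to-physical translation through a single-level
 * page table; page size and names are assumptions. */
#include <stdint.h>

#define PAGE_SHIFT 12                        /* 4 KiB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

struct pte { unsigned valid : 1; uint32_t ppn; };   /* page table entry */

extern struct pte *page_table;  /* located by the page table register */

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;       /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* unchanged by translation */
    struct pte e = page_table[vpn];              /* array indexed by VPN */
    if (!e.valid) {
        /* page fault: the OS fetches the page from disk, updates the PTE,
         * and restarts the access */
    }
    return (e.ppn << PAGE_SHIFT) | offset;       /* physical address */
}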
53
Translation Using a Page Table
54
Mapping Pages to Storage
55
Replacement and Writes
❑ To reduce the page fault rate, prefer least-recently used (LRU) replacement
l Reference bit (aka use bit) in the PTE set to 1 on access to the page
l Periodically cleared to 0 by the OS
l A page with reference bit = 0 has not been used recently
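A sketch of this reference-bit approximation to LRU; the table size and field names are assumptions.

/* Reference-bit approximation to LRU; sizes and names are assumptions. */
#define NPAGES 1024

struct vm_pte { int valid, ref; };       /* illustrative PTE fields */

/* Run periodically by the OS: forget old references. */
void clear_reference_bits(struct vm_pte pt[]) {
    for (int i = 0; i < NPAGES; i++)
        pt[i].ref = 0;
}

/* On replacement: a page whose bit is still 0 was not used recently. */
int choose_victim(const struct vm_pte pt[]) {
    for (int i = 0; i < NPAGES; i++)
        if (pt[i].valid && pt[i].ref == 0)
            return i;
    return 0;                            /* all recently used: pick any */
}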
56
Fast Translation Using a TLB
57
Fast Translation Using a TLB
58
TLB Misses
❑ If page is in memory
l Load the PTE from memory and retry
l Could be handled in hardware
- Can get complex for more complicated page table structures
l Or in software
- Raise a special exception, with optimized handler
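A sketch of the lookup order, assuming a small fully associative TLB and a trivial refill policy; on a miss for a resident page the PTE is copied into the TLB and the access retried.

/* TLB first, then the page table. The fully associative 16-entry TLB
 * and the refill-into-slot-0 policy are assumptions for illustration. */
#include <stdint.h>

#define TLB_ENTRIES 16
#define PG_SHIFT    12

struct tlb_entry { int valid; uint32_t vpn, ppn; };
static struct tlb_entry tlb[TLB_ENTRIES];

extern struct pt_entry { unsigned valid : 1; uint32_t ppn; } *page_table;

/* Returns 1 with *paddr set on success, 0 on a page fault. */
int tlb_translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> PG_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++)     /* hardware searches in parallel */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].ppn << PG_SHIFT) | (vaddr & ((1u << PG_SHIFT) - 1));
            return 1;                         /* TLB hit */
        }
    if (!page_table[vpn].valid)
        return 0;                             /* page fault: OS takes over */
    tlb[0].valid = 1;                         /* TLB miss, page resident: */
    tlb[0].vpn   = vpn;                       /* load the PTE into the TLB */
    tlb[0].ppn   = page_table[vpn].ppn;
    return tlb_translate(vaddr, paddr);       /* ...and retry the access */
}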
59
TLB Miss Handler
❑ A TLB miss indicates either
l Page present, but PTE not in TLB, or
l Page not present
60
Page Fault Handler
❑ Use faulting virtual address to find PTE
❑ Locate page on disk
❑ Choose page to replace
l If dirty, write it to disk first
61
TLB and Cache Interaction
❑ If the cache tag uses the physical address
l Need to translate the address before the cache lookup
❑ Alternative: use a virtual address tag
l Complications due to aliasing
- Different virtual addresses for the same shared physical address
62
Memory Protection
❑ Different tasks can share parts of their virtual address spaces
l But need to protect against errant access
l Requires OS assistance
l Requires OS assistance
63
The Memory Hierarchy
64
Block Placement
❑ Determined by associativity
l Direct mapped (1-way associative)
- One choice for placement
l n-way set associative
- n choices within a set
l Fully associative
- Any location
65
Finding a Block
❑ Hardware caches
l Reduce comparisons to reduce cost
❑ Virtual memory
l Full table lookup makes full associativity feasible
l Benefit in reduced miss rate
66
Replacement
❑ Choice of entry to replace on a miss
l Least recently used (LRU)
- Complex and costly hardware for high associativity
l Random
- Close to LRU, easier to implement
❑ Virtual memory
l LRU approximation with hardware support
67
Write Policy
❑ Write-through
l Update both upper and lower levels
l Simplifies replacement, but may require write buffer
❑ Write-back
l Update upper level only
l Update lower level when block is replaced
l Need to keep more state
❑ Virtual memory
l Only write-back is feasible, given disk write latency
68
Sources of Misses
❑ Compulsory misses (aka cold start misses)
l First access to a block
❑ Capacity misses
l Due to finite cache size
l A replaced block is later accessed again
❑ Conflict misses (aka collision misses)
l In a non-fully-associative cache
l Due to competition for entries in a set
l Would not occur in a fully associative cache of the same total size
69
Cache Design Trade-offs
70
5.4. Virtual Machines
❑ Host computer emulates guest operating system and machine resources
l Improved isolation of multiple guests
l Avoids security and reliability problems
l Aids sharing of resources
❑ Examples
l IBM VM/370 (1970s technology!)
l VMWare
l Microsoft Virtual PC (Hyper-V)
71
Virtual Machine Monitor
❑ Maps virtual resources to physical resources
l Memory, I/O devices, CPUs
72
Example: Timer Virtualization
73
Instruction Set Support
74
Cache Control
❑ Example cache fields for a 32-bit byte address (4-bit offset ⇒ 16-byte, 4-word blocks; 10-bit index ⇒ 1024 blocks ⇒ 16 KB of data):

31 … 14 | 13 … 4 | 3 … 0
Tag       Index    Offset
18 bits   10 bits  4 bits
75
Interface Signals
Signals between the CPU and the cache, and between the cache and memory:

Signal       CPU <-> Cache   Cache <-> Memory
Read/Write   1 bit           1 bit
Valid        1 bit           1 bit
Address      32 bits         32 bits
Write Data   32 bits         128 bits
Read Data    32 bits         128 bits
Ready        1 bit           1 bit

The cache moves whole 128-bit (4-word) blocks to and from memory, but single 32-bit words to and from the CPU.
76
Finite State Machines
❑ Use an FSM to sequence control steps
❑ Set of states, transition on each clock edge
l State values are binary encoded
l Current state stored in a register
l Next state = fn(current state, current inputs)
❑ Could partition into separate states to reduce clock cycle time
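A skeleton of such a cache controller in C, following the classic Idle / Compare Tag / Write-Back / Allocate organization; the input signal names are assumptions, and only the next-state function is shown.

/* Next-state function of a cache-controller FSM (signal names assumed). */
enum cc_state { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE };

struct cc_inputs { int cpu_req, hit, dirty, mem_ready; };

enum cc_state next_state(enum cc_state s, struct cc_inputs in) {
    switch (s) {
    case IDLE:                            /* wait for a valid CPU request */
        return in.cpu_req ? COMPARE_TAG : IDLE;
    case COMPARE_TAG:
        if (in.hit)   return IDLE;        /* hit: access completes */
        if (in.dirty) return WRITE_BACK;  /* miss, old block dirty */
        return ALLOCATE;                  /* miss, old block clean */
    case WRITE_BACK:                      /* write old block to memory */
        return in.mem_ready ? ALLOCATE : WRITE_BACK;
    case ALLOCATE:                        /* fetch new block from memory */
        return in.mem_ready ? COMPARE_TAG : ALLOCATE;
    }
    return IDLE;
}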
78
Cache Coherence Problem
❑ Suppose two CPU cores share a physical address space, with write-through caches

Time step  Event                 CPU A's cache  CPU B's cache  Memory
0                                                              0
1          CPU A reads X         0                             0
2          CPU B reads X         0              0              0
3          CPU A writes 1 to X   1              0              1
79
Coherence Defined
❑ Informally: reads return the most recently written value
❑ Formally:
l P writes X; P reads X (no intervening writes) ⇒ read returns written value
l P1 writes X; P2 reads X (sufficiently later) ⇒ read returns written value
- c.f. CPU B reading X after step 3 in the example
l P1 writes X, P2 writes X ⇒ all processors see the writes in the same order
- End up with the same final value for X
80
Cache Coherence Protocols
❑ Operations performed by caches in multiprocessors to ensure coherence
l Migration of data to local caches
- Reduces bandwidth for shared memory
l Replication of read-shared data
- Reduces contention for access
❑ Snooping protocols
l Each cache monitors bus reads/writes
❑ Directory-based protocols
l Caches and memory record sharing status of blocks in a directory
81
Invalidating Snooping Protocols
❑ Cache gets exclusive access to a block when it is to be written
l Broadcasts an invalidate message on the bus
l Subsequent read in another cache misses
- Owning cache supplies the updated value
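A simplified sketch of the invalidation step, assuming write-through caches holding single words; real protocols track more state per block.

/* Simplified invalidating snooping protocol (write-through, one word
 * per cache line); a real protocol tracks more state. */
struct snoop_line { int valid; unsigned addr, data; };

/* Every cache snoops the bus: a write by another CPU to an address
 * we hold makes our copy stale. */
void snoop_write(struct snoop_line *mine, unsigned addr) {
    if (mine->valid && mine->addr == addr)
        mine->valid = 0;                       /* invalidate stale copy */
}

/* Local write: broadcast an invalidate so this cache gains exclusive
 * access, then update the local copy. */
void local_write(struct snoop_line *self, struct snoop_line *others[],
                 int n, unsigned addr, unsigned data) {
    for (int i = 0; i < n; i++)
        snoop_write(others[i], addr);          /* invalidate message on bus */
    self->valid = 1;
    self->addr  = addr;
    self->data  = data;
}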
82
Memory Consistency
❑ Assumptions
l A write completes only when all processors have seen it
l A processor does not reorder writes with other accesses
❑ Consequence
l P writes X then writes Y ⇒ all processors that see new Y also see new X
l Processors can reorder reads, but not writes
83
Multilevel On-Chip Caches
84
2-Level TLB Organization
85
3-Level Cache Organization
86
Miss Penalty Reduction
❑ Return requested word first
l Then back-fill rest of block
87
Pitfalls
❑ Byte vs. word addressing
l Example: 32-byte direct-mapped cache, 4-byte blocks
- Byte 36 maps to block 1: ⌊36 / 4⌋ mod 8 = 9 mod 8 = 1
- Word 36 maps to block 4: 36 mod 8 = 4
88
Pitfalls
❑ In a multiprocessor with a shared L2 or L3 cache
l Less associativity than cores results in conflict misses
l More cores ⇒ need to increase associativity
89
Pitfalls
❑ Extending address range using segments
l E.g., Intel 80286
l But a segment is not always big enough
l Makes address arithmetic complicated
90
Concluding Remarks
❑ Principle of locality
l Programs use a small part of their memory space frequently
❑ Memory hierarchy
l L1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk