Cache Basics and Operation
Cache
Generically, any structure that “memoizes” frequently used results to avoid
repeating the long-latency operations required to reproduce the results from
scratch (e.g., a web cache)
Caching Basics
Block (line): Unit of storage in the cache
Memory is logically divided into cache blocks that map to
locations in the cache
On a reference:
HIT: If in cache, use cached data instead of accessing memory
MISS: If not in cache, bring block into cache
Maybe have to kick something else out to do it
[Figure: cache lookup. The address is presented to the tag store and the data store; the tag store produces the hit/miss result and the data store supplies the data]
A Basic Hardware Cache Design
We will start with a basic hardware cache design
Blocks and Addressing the Cache
Memory is logically divided into fixed-size blocks
Example: 8-bit address
Cache access:
1) index into the tag and data stores with index bits in address
2) check valid bit in tag store
3) compare tag bits in address with the stored tag in tag store
Direct-Mapped Cache: Placement and Access
Assume byte-addressable main memory: 256 bytes, 8-byte blocks => 32 blocks (Block 00000 through Block 11111)
Assume cache: 64 bytes, 8 blocks
Direct-mapped: A block can go to only one location
Address breakdown: tag (2 bits) | index (3 bits) | byte in block (3 bits)
[Figure: the index bits select one tag store entry (V + tag) and one data store block; the stored tag is compared (=?) with the tag bits in the address to produce Hit?, and a MUX uses the byte-in-block bits to select the requested byte (Data)]
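A minimal C sketch of this direct-mapped lookup, using the slide's parameters (8-bit addresses, 8-byte blocks, 8 cache blocks). The structure, field names, and miss handling are illustrative assumptions, not a description of a specific hardware design.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_BLOCKS  8      /* 64-byte cache with 8-byte blocks     */
    #define OFFSET_BITS 3      /* 8-byte block -> 3 byte-in-block bits */
    #define INDEX_BITS  3      /* 8 blocks -> 3 index bits             */

    typedef struct {
        bool    valid;
        uint8_t tag;                        /* 2 tag bits of an 8-bit address */
        uint8_t data[1 << OFFSET_BITS];
    } CacheBlock;

    static CacheBlock cache[NUM_BLOCKS];

    /* Returns true on a hit and copies the requested byte into *out. */
    bool cache_lookup(uint8_t addr, uint8_t *out)
    {
        uint8_t offset = addr & ((1 << OFFSET_BITS) - 1);
        uint8_t index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1);
        uint8_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        CacheBlock *b = &cache[index];       /* 1) index into tag/data store   */
        if (b->valid && b->tag == tag) {     /* 2) check valid, 3) compare tag */
            *out = b->data[offset];          /* MUX selects the byte in block  */
            return true;                     /* hit                            */
        }
        return false;                        /* miss: block must be brought in */
    }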
Direct-Mapped Caches
Direct-mapped cache: Two blocks in memory that map to
the same index in the cache cannot be present in the cache
at the same time
One index → one entry
Set Associativity
Block addresses 0 and 8 always conflict in a direct-mapped cache with 8 blocks (both map to index 0)
Instead of having one column of 8 blocks, have 2 columns of 4 blocks (i.e., 2-way set associative)
[Figure: set-associative tag store and data store. Each set holds multiple (V, tag) entries; all stored tags in the selected set are compared (=?) with the address tag in parallel, hit logic combines the comparison results into Hit?, and MUXes select the matching block and then the byte in block. Higher associativity needs more parallel comparators.]
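A minimal C sketch of the 2-way lookup described above (8 blocks organized as 4 sets of 2 ways, still with 8-byte blocks). As before, the names and structure are illustrative assumptions; in hardware the tag comparisons happen in parallel rather than in a loop.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS    4      /* 8 blocks = 4 sets x 2 ways */
    #define NUM_WAYS    2
    #define OFFSET_BITS 3
    #define INDEX_BITS  2      /* 4 sets -> 2 index bits     */

    typedef struct {
        bool    valid;
        uint8_t tag;
        uint8_t data[1 << OFFSET_BITS];
    } Way;

    static Way cache[NUM_SETS][NUM_WAYS];

    bool cache_lookup_2way(uint8_t addr, uint8_t *out)
    {
        uint8_t offset = addr & ((1 << OFFSET_BITS) - 1);
        uint8_t index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1);
        uint8_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        for (int w = 0; w < NUM_WAYS; w++) {        /* check both ways of the set */
            Way *way = &cache[index][w];
            if (way->valid && way->tag == tag) {
                *out = way->data[offset];           /* select hitting way, then byte */
                return true;
            }
        }
        return false;                               /* miss in both ways */
    }

With this organization, blocks 0 and 8 still map to the same set, but the two ways can hold them simultaneously, so they no longer evict each other.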
Associativity (and Tradeoffs)
Degree of associativity: How many blocks can map to the
same index (or set)?
Higher associativity
++ Higher hit rate
-- Slower cache access time (hit latency and data access latency)
-- More expensive hardware (more comparators)
[Plot: hit rate vs. associativity. Hit rate increases with associativity, but with diminishing returns]
Issues in Set-Associative Caches
Think of each block in a set as having a “priority”
Indicating how important it is to keep the block in the cache
Key issue: How do you determine/adjust block priorities?
There are three key decisions in a set:
Insertion, promotion, eviction (replacement)
Implementing LRU
Idea: Evict the least recently accessed block
Problem: Need to keep track of access ordering of blocks
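One straightforward (if hardware-expensive) way to track the ordering is a per-block “last used” counter kept in the replacement-policy bits. A minimal sketch, with the field names and counter scheme as illustrative assumptions:

    #include <stdint.h>

    #define NUM_WAYS 4

    typedef struct {
        uint32_t tag;
        uint64_t last_used;   /* replacement-policy state: larger = more recent */
    } Block;

    /* On every access to `way` in this set, record the current access time. */
    void lru_touch(Block set[NUM_WAYS], int way, uint64_t now)
    {
        set[way].last_used = now;
    }

    /* On a miss, evict the block whose last access is oldest. */
    int lru_victim(const Block set[NUM_WAYS])
    {
        int victim = 0;
        for (int w = 1; w < NUM_WAYS; w++)
            if (set[w].last_used < set[victim].last_used)
                victim = w;
        return victim;
    }

Real hardware avoids wide counters; true LRU for an N-way set needs enough state to encode a full ordering of the N blocks (on the order of log2(N!) bits per set), which is one reason approximations are used.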
Approximations of LRU
Most modern processors do not implement “true LRU” (also
called “perfect LRU”) in highly-associative caches
Why?
True LRU is complex
LRU is an approximation to predict locality anyway (i.e., not
the best possible cache management policy)
Examples:
Not MRU (not most recently used)
Victim-NextVictim Replacement: Only keep track of the victim
and the next victim
Cache Replacement Policy: LRU or Random
LRU vs. Random: Which one is better?
Example: 4-way cache, cyclic references to A, B, C, D, E
0% hit rate with LRU policy
Set thrashing: When the “program working set” in a set is
larger than set associativity
Random replacement policy is better when thrashing occurs
In practice:
Depends on workload
Average hit rates of LRU and Random are similar
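A small, self-contained simulation of the cyclic A, B, C, D, E reference pattern hitting one 4-way set under LRU; it is only meant to reproduce the 0% hit rate claimed above (the loop count and block encoding are arbitrary).

    #include <stdio.h>

    #define WAYS 4

    /* One set, kept in LRU order: slot 0 = least recent, slot WAYS-1 = most recent. */
    static char set_lru[WAYS] = {0};

    static int access_lru(char block)
    {
        for (int i = 0; i < WAYS; i++) {
            if (set_lru[i] == block) {              /* hit: move to MRU position */
                for (int j = i; j < WAYS - 1; j++)
                    set_lru[j] = set_lru[j + 1];
                set_lru[WAYS - 1] = block;
                return 1;
            }
        }
        for (int j = 0; j < WAYS - 1; j++)          /* miss: evict LRU (slot 0) */
            set_lru[j] = set_lru[j + 1];
        set_lru[WAYS - 1] = block;
        return 0;
    }

    int main(void)
    {
        const char refs[] = "ABCDE";
        int hits = 0, accesses = 0;

        for (int round = 0; round < 100; round++)
            for (int i = 0; refs[i]; i++) {
                hits += access_lru(refs[i]);
                accesses++;
            }
        printf("LRU hit rate: %d/%d\n", hits, accesses);   /* prints 0/500 */
        return 0;
    }

Every reference evicts exactly the block that will be needed next, which is why a policy with some randomness can do better on this pattern.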
What Is the Optimal Replacement Policy?
Belady’s OPT
Replace the block that is going to be referenced furthest in the
future by the program
Belady, “A study of replacement algorithms for a virtual-
storage computer,” IBM Systems Journal, 1966.
How do we implement this? Simulate?
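Implementing it directly requires knowing the future reference stream, which is why Belady's OPT is normally computed only offline, e.g., in a simulator over a recorded trace. A hedged sketch of the victim-selection step (names are illustrative):

    /* Given the blocks currently in a set and the remaining reference trace,
     * pick the victim that is reused furthest in the future (or never again).
     * This needs the future, so it is only usable offline / in simulation. */
    int opt_victim(const int set[], int num_ways,
                   const int future_refs[], int num_future)
    {
        int victim = 0;
        int victim_dist = -1;

        for (int w = 0; w < num_ways; w++) {
            int dist = num_future;              /* "never referenced again" */
            for (int t = 0; t < num_future; t++) {
                if (future_refs[t] == set[w]) { dist = t; break; }
            }
            if (dist > victim_dist) {
                victim_dist = dist;
                victim = w;
            }
        }
        return victim;
    }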
What’s In A Tag Store Entry?
Valid bit
Tag
Replacement policy bits
Dirty bit?
To reduce the frequency of writing back blocks on eviction
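Putting the list above together, a tag store entry might look like the following C struct; the field widths and the name TagEntry are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid;     /* block contains valid data                     */
        uint32_t tag;       /* upper address bits identifying the block      */
        uint8_t  repl_bits; /* replacement policy state (e.g., LRU ordering) */
        bool     dirty;     /* set on a write; write back only if set        */
    } TagEntry;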
Handling Writes (I)
When do we write the modified data in a cache to the next level?
Write through: At the time the write happens
Write back: When the block is evicted
Write-back
+ Can combine multiple writes to the same block before eviction
Potentially saves bandwidth between cache levels + saves energy
-- Need a bit in the tag store indicating the block is “dirty/modified”
Write-through
+ Simpler
+ All levels are up to date. Consistency: Simpler cache coherence because
no need to check close-to-processor caches’ tag stores for presence
-- More bandwidth intensive; no combining of writes
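A hedged sketch of the two write-hit policies; the Block layout, write_to_next_level stub, and 64-byte block size are assumptions made for illustration, not a particular design.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid, dirty;
        uint32_t tag;
        uint8_t  data[64];
    } Block;

    /* Stub standing in for the next cache level (or memory). */
    static void write_to_next_level(uint32_t addr, const uint8_t *data, int len)
    {
        (void)addr; (void)data; (void)len;
    }

    /* Write-through: update the cached block AND the next level immediately. */
    void write_hit_through(Block *b, uint32_t addr, int offset, uint8_t value)
    {
        b->data[offset] = value;
        write_to_next_level(addr, &value, 1);     /* no combining of writes */
    }

    /* Write-back: update the cached block, mark it dirty, and defer the
     * next-level write until eviction (combines multiple writes to the block). */
    void write_hit_back(Block *b, int offset, uint8_t value)
    {
        b->data[offset] = value;
        b->dirty = true;          /* requires the dirty bit in the tag store */
    }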
Write-through is easier to implement than write-back. The cache is
always clean, so unlike write-back read misses never result in writes to
the lower level.
Write-through also has the advantage that the next lower level has the
most current copy of the data, which simplifies data coherency.
There are two options on a write miss:
■ Write allocate—The block is allocated on a write miss,
followed by the write hit actions above. In this natural option,
write misses act like read misses.
■ No-write allocate—This apparently unusual alternative is that write
misses do not affect the cache. Instead, the block is modified only in the
lower-level memory.
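Continuing the write-hit sketch above, the two write-miss options might be handled as follows; fill_block_from_next_level is a hypothetical helper that fetches the missing block, and showing the policy as a runtime flag is purely for illustration.

    /* Hypothetical helper: fetch the block containing `addr` into `victim`. */
    void fill_block_from_next_level(Block *victim, uint32_t addr);

    void handle_write_miss(Block *victim, uint32_t addr, int offset,
                           uint8_t value, bool write_allocate)
    {
        if (write_allocate) {
            /* Write allocate: bring the block in, then act like a write hit. */
            fill_block_from_next_level(victim, addr);
            write_hit_back(victim, offset, value);
        } else {
            /* No-write allocate: modify only the lower level; cache unchanged. */
            write_to_next_level(addr, &value, 1);
        }
    }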
Cache Parameters vs. Miss/Hit Rate
Cache size
Block size
Associativity
Replacement policy
Insertion/Placement policy
Cache Size
Cache size: total data (not including tag) capacity
bigger can exploit temporal locality better
not ALWAYS better
Too large a cache adversely affects hit and miss latency
smaller is faster => bigger is slower
access time may degrade critical path
Too small a cache
doesn’t exploit temporal locality well
useful data replaced often
[Plot: hit rate vs. cache size. Hit rate improves with cache size until the cache holds the “working set”, then flattens out]
Associativity
How many blocks can be present in the same index (i.e., set)?
Larger associativity
lower miss rate (reduced conflicts)
higher hit latency and area cost (plus diminishing returns)
Smaller associativity
lower cost
lower hit latency
Especially important for L1 caches
Is power of 2 associativity required?
Classification of Cache Misses
Compulsory miss
first reference to an address (block) always results in a miss
subsequent references should hit unless the cache block is
displaced for the reasons below
Capacity miss
cache is too small to hold everything needed
defined as the misses that would occur even in a fully-
associative cache (with optimal replacement) of the same
capacity
Conflict miss
defined as any miss that is neither a compulsory nor a
capacity miss
How to Reduce Each Miss Type
Compulsory
Caching cannot help
Prefetching can: Anticipate which blocks will be needed soon
Conflict
More associativity
Other ways to get more associativity without making the cache
associative
Victim cache
Better, randomized indexing
Software hints?
Capacity
Utilize cache space better: keep blocks that will be referenced
Software management: divide working set and computation
such that each “computation phase” fits in cache
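One classic instance of the software-management bullet above is loop blocking (tiling): the computation is split into phases whose data fit in the cache. A sketch for matrix multiply, with N and TILE as illustrative values (TILE would be chosen so the tiles of A, B, and C fit in the target cache level); C is assumed to be zero-initialized.

    #define N    1024
    #define TILE 32

    /* Tiled matrix multiply: each TILE x TILE "computation phase" reuses
     * sub-blocks of A, B, and C that fit in the cache, instead of streaming
     * through whole rows/columns and evicting data before it is reused. */
    void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N])
    {
        for (int ii = 0; ii < N; ii += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int kk = 0; kk < N; kk += TILE)
                    for (int i = ii; i < ii + TILE; i++)
                        for (int j = jj; j < jj + TILE; j++) {
                            double sum = C[i][j];
                            for (int k = kk; k < kk + TILE; k++)
                                sum += A[i][k] * B[k][j];
                            C[i][j] = sum;
                        }
    }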
How to Improve Cache Performance
Three fundamental goals: reducing the miss rate, reducing the miss latency/cost (miss penalty), and reducing the hit latency/cost
Restructuring Data Access Patterns (I)
Idea: Restructure data layout or data access patterns
Example: If column-major
x[i+1,j] follows x[i,j] in memory
x[i,j+1] is far away from x[i,j]
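The slide's example assumes a column-major layout. C arrays are row-major (x[i][j+1] follows x[i][j]), so an equivalent illustration in C is to make the innermost loop walk the contiguous dimension; the array name and sizes below are arbitrary.

    #define ROWS 1024
    #define COLS 1024

    static double x[ROWS][COLS];   /* row-major: x[i][j+1] is adjacent to x[i][j] */

    /* Cache-friendly: the inner loop touches consecutive addresses, so each
     * fetched cache block is fully used (good spatial locality). */
    double sum_rowwise(void)
    {
        double sum = 0.0;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += x[i][j];
        return sum;
    }

    /* Cache-hostile: consecutive inner-loop accesses are COLS * sizeof(double)
     * bytes apart, so nearly every access touches a different cache block. */
    double sum_columnwise(void)
    {
        double sum = 0.0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += x[i][j];
        return sum;
    }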
Restructuring Data Layout (I)
Pointer-based traversal (e.g., of a linked list)
Assume a huge linked list (1B nodes) and unique keys
Why does the code below have a poor cache hit rate?
“Other fields” occupy most of the cache line even though they are rarely accessed!

    struct Node {
        struct Node* next;
        int key;
        char name[256];
        char school[256];
    };

    while (node) {
        if (node->key == input_key) {
            // access other fields of node
        }
        node = node->next;
    }
Restructuring Data Layout (II)
Idea: separate the frequently used fields of a data structure and pack them into a separate data structure

    struct Node {
        struct Node* next;
        int key;
        struct Node_data* node_data;
    };

    struct Node_data {
        char name[256];
        char school[256];
    };

    while (node) {
        if (node->key == input_key) {
            // access node->node_data
        }
        node = node->next;
    }

Who should do this?
Programmer
Compiler (profiling vs. dynamic)
Hardware? Who can determine what is frequently used?
Cache Optimization
Six basic cache optimizations
Larger block size to reduce miss rate
The simplest way to reduce the miss rate is to take advantage
of spatial locality and increase the block size.
Larger blocks reduce compulsory misses, but they also
increase the miss penalty.
Because larger blocks lower the number of tags, they can
slightly reduce static power.
Larger block sizes can also increase capacity or conflict misses,
especially in smaller caches.
Choosing the right block size is a complex trade-off that
depends on the size of cache and the miss penalty.
Bigger caches to reduce miss rate
The obvious way to reduce capacity misses is to increase
cache capacity.
Drawbacks include potentially longer hit time of the larger
cache memory and higher cost and power.
Larger caches increase both static and dynamic power.
Higher associativity to reduce miss rate
Obviously, increasing associativity reduces conflict misses.
Greater associativity can come at the cost of increased hit
time. As we will see shortly, associativity also increases
power consumption.
Multilevel caches to reduce miss penalty
A difficult decision is whether to make the cache hit time fast, to
keep pace with the high clock rate of processors, or to make the
cache large to reduce the gap between the processor accesses and
main memory accesses. Adding another level of cache between the
original cache and memory simplifies the decision.
The first-level cache can be small enough to match a fast clock cycle
time, yet the second-level (or third-level) cache can be large enough
to capture many accesses that would go to main memory.
The focus on misses in second-level caches leads to larger blocks, bigger capacity, and higher associativity.
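The benefit of the second level can be expressed with the standard average memory access time decomposition (shown here in LaTeX; this two-level form is the usual textbook identity, not something specific to this slide deck):

    \text{AMAT} = \text{HitTime}_{L1} + \text{MissRate}_{L1} \times
                  \left( \text{HitTime}_{L2} + \text{MissRate}_{L2} \times \text{MissPenalty}_{L2} \right)

The first level can therefore stay small and fast, because the cost of its misses is paid mostly in the comparatively cheap L2 hit time rather than in full main-memory latency.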
Giving priority to read misses over writes to reduce
miss penalty
A write buffer is a good place to implement this
optimization. Write buffers create hazards because they
hold the updated value of a location needed on a read miss
—that is, a read-after-write hazard through memory.
One solution is to check the contents of the write buffer on a
read miss. If there are no conflicts, and if the memory system
is available, sending the read before the writes reduces the
miss penalty. Most processors give reads priority over writes.
This choice has little effect on power consumption.
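A hedged sketch of the write-buffer check described above; the buffer size, entry layout, and block granularity are illustrative assumptions.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define WB_ENTRIES 8

    typedef struct {
        bool     valid;
        uint32_t block_addr;
        uint8_t  data[64];
    } WriteBufferEntry;

    static WriteBufferEntry write_buffer[WB_ENTRIES];

    /* On a read miss, scan the write buffer for the missing block. A match is
     * a read-after-write hazard through memory: the buffered (newer) data must
     * be used or drained first. No match means the read can safely be sent
     * ahead of the pending writes, reducing the miss penalty. */
    const WriteBufferEntry *check_write_buffer(uint32_t block_addr)
    {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (write_buffer[i].valid && write_buffer[i].block_addr == block_addr)
                return &write_buffer[i];
        return NULL;
    }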
Avoiding address translation during indexing of the
cache to reduce hit time
Caches must cope with the translation of a virtual address
from the processor to a physical address to access
memory.
A common optimization is to use the page offset—the part
that is identical in both virtual and physical addresses—to
index the cache
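This works only if the index and block-offset bits fit entirely within the page offset, which bounds the size of each way. Assuming 4 KiB pages (a common but not universal size), the constraint is:

    \frac{\text{CacheSize}}{\text{Associativity}} \le \text{PageSize}
    \qquad \text{e.g. } \frac{32\,\text{KiB}}{8\ \text{ways}} = 4\,\text{KiB} \le 4\,\text{KiB}

Within that bound, the cache can be indexed with untranslated page-offset bits in parallel with the TLB lookup, while the tag comparison still uses the translated physical address. (The 32 KiB, 8-way figures are just an example.)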
Ten Advanced Optimizations of Cache
Performance
We can classify the ten advanced cache optimizations we
examine into five categories based on these metrics:
Reducing the hit time—Small and simple first-level caches and way
prediction. Both techniques also generally decrease power consumption.
Increasing cache bandwidth—Pipelined caches, multibanked caches, and
nonblocking caches. These techniques have varying impacts on power
consumption.
Reducing the miss penalty—Critical word first and merging write buffers.
These optimizations have little impact on power.
Reducing the miss rate—Compiler optimizations. Obviously any
improvement at compile time improves power consumption.
Reducing the miss penalty or miss rate via parallelism—Hardware
prefetching and compiler prefetching. These optimizations generally
increase power consumption, primarily due to prefetched data that are
unused.
Advanced Cache Optimization
First Optimization: Small and Simple First-Level Caches to Reduce Hit Time
and Power
Second Optimization: Way Prediction to Reduce Hit Time
Third Optimization: Pipelined Cache Access to Increase Cache Bandwidth
Fourth Optimization: Nonblocking Caches to Increase Cache Bandwidth
Fifth Optimization: Multibanked Caches to Increase Cache Bandwidth
Sixth Optimization: Critical Word First and Early Restart to Reduce Miss
Penalty
Seventh Optimization: Merging Write Buffer to Reduce Miss Penalty
Eighth Optimization: Compiler Optimizations to
Reduce Miss Rate
Ninth Optimization: Hardware Prefetching of Instructions and Data to
Reduce Miss Penalty or Miss Rate
Tenth Optimization: Compiler-Controlled Prefetching to Reduce Miss Penalty
or Miss Rate