Cache Basics and Operation


Cache
 Generically, any structure that “memoizes” frequently used results to avoid repeating the long-latency operations required to reproduce the results from scratch, e.g., a web cache

 Most commonly in the on-die context: an automatically-managed memory hierarchy based on SRAM
 memoize in SRAM the most frequently accessed DRAM memory locations to avoid repeatedly paying for the DRAM access latency

2
Caching Basics
 Block (line): Unit of storage in the cache
 Memory is logically divided into cache blocks that map to
locations in the cache
 On a reference:
 HIT: If in cache, use cached data instead of accessing memory
 MISS: If not in cache, bring block into cache
 Maybe have to kick something else out to do it

 Some important cache design decisions
 Placement: where and how to place/find a block in cache?
 Replacement: what data to remove to make room in cache?
 Granularity of management: large or small blocks? Subblocks?
 Write policy: what do we do about writes?
 Instructions/data: do we treat them separately?
3
Cache Abstraction and Metrics

[Diagram: an address is presented to two structures, the Tag Store (answers “is the address in the cache?” and holds bookkeeping state) and the Data Store (holds the cached memory blocks); the outputs are a hit/miss indication and the data]

4
A Basic Hardware Cache Design
 We will start with a basic hardware cache design

 Then, we will examine a multitude of ideas to make it better

5
Blocks and Addressing the Cache
 Memory is logically divided into fixed-size blocks

 Each block maps to a location in the cache, determined by the index bits in the address
 used to index into the tag and data stores

[Address breakdown: 8-bit address = tag (2 bits) | index (3 bits) | byte in block (3 bits)]

 Cache access:
1) index into the tag and data stores with index bits in address
2) check valid bit in tag store
3) compare tag bits in address with the stored tag in tag store

 If a block is in the cache (cache hit), the stored tag should be valid and match the tag of the block

6
Direct-Mapped Cache: Placement and Access
 Assume byte-addressable memory: 256 bytes, 8-byte blocks → 32 blocks
 Assume cache: 64 bytes, 8 blocks
 Direct-mapped: A block can go to only one location

[Address breakdown: tag (2 bits) | index (3 bits) | byte in block (3 bits)]
[Diagram: the 32 main-memory blocks (00000–11111) map onto an 8-entry tag store and data store; the index selects one entry, the valid bit and stored tag are checked against the address tag (=?), and a MUX selects the byte in block, producing Hit? and Data. A software sketch of this lookup follows below.]
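The following is a minimal C sketch of the direct-mapped lookup described above (8 entries, 8-byte blocks, 2-bit tag). The structure and names (tag_store, data_store, cache_lookup) are illustrative, not a description of real hardware, and the miss path (fetching the block and filling the entry) is left to the caller.

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS  8     /* 64-byte cache with 8-byte blocks  */
#define OFFSET_BITS 3     /* byte in block                     */
#define INDEX_BITS  3     /* selects one of the 8 cache blocks */

struct tag_entry { bool valid; uint8_t tag; };

static struct tag_entry tag_store[NUM_BLOCKS];
static uint8_t data_store[NUM_BLOCKS][1 << OFFSET_BITS];

/* Returns true on a hit and writes the requested byte to *out. */
bool cache_lookup(uint8_t addr, uint8_t *out) {
    uint8_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint8_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint8_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    if (tag_store[index].valid && tag_store[index].tag == tag) {
        *out = data_store[index][offset];   /* MUX selects the byte in block */
        return true;                        /* hit */
    }
    return false;                           /* miss: bring the block in, possibly evicting */
}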
7
Direct-Mapped Caches
 Direct-mapped cache: Two blocks in memory that map to
the same index in the cache cannot be present in the cache
at the same time
 One index → one entry

 Can lead to 0% hit rate if more than one block accessed in an interleaved manner maps to the same index
 Assume addresses A and B have the same index bits but different tag bits
 A, B, A, B, A, B, A, B, … → conflict in the cache index
 All accesses are conflict misses

8
Set Associativity
 Addresses 0 and 8 always conflict in direct mapped cache
 Instead of having one column of 8, have 2 columns of 4 blocks

[Address breakdown: tag (3 bits) | index (2 bits) | byte in block (3 bits)]
[Diagram: the index selects a SET; the tag store holds two (V, tag) entries per set, both compared (=?) against the address tag in parallel; hit logic and MUXes select the matching block and the byte in block. A software sketch of this lookup follows below.]

Key idea: Associative memory within the set
+ Accommodates conflicts better (fewer conflict misses)
-- More complex, slower access, larger tag store
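Below is a minimal C sketch of the 2-way set-associative lookup above (4 sets × 2 ways, 8-byte blocks, 3-bit tag). Names and sizing are illustrative; in hardware the two tag comparisons happen in parallel rather than in a loop.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS    4
#define NUM_WAYS    2
#define OFFSET_BITS 3
#define INDEX_BITS  2

struct way { bool valid; uint8_t tag; uint8_t data[1 << OFFSET_BITS]; };
static struct way sets[NUM_SETS][NUM_WAYS];

bool cache_lookup_2way(uint8_t addr, uint8_t *out) {
    uint8_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint8_t set    = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint8_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    for (int w = 0; w < NUM_WAYS; w++) {            /* parallel =? comparators in hardware */
        if (sets[set][w].valid && sets[set][w].tag == tag) {
            *out = sets[set][w].data[offset];       /* MUX selects way, then byte in block */
            return true;                            /* hit in way w */
        }
    }
    return false;                                   /* miss: replacement policy picks a way */
}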
9
Higher Associativity
 4-way

[Diagram: the tag store has four (V, tag) entries per set, compared in parallel by four =? comparators feeding the hit logic; MUXes select the matching block from the data store and then the byte in block]

+ Likelihood of conflict misses even lower
-- More tag comparators and wider data mux; larger tags
10
Full Associativity
 Fully associative cache
 A block can be placed in any cache location

[Diagram: every tag in the tag store is compared in parallel (eight =? comparators here) and the results feed the hit logic; MUXes select the matching block from the data store and then the byte in block]
11
Associativity (and Tradeoffs)
 Degree of associativity: How many blocks can map to the
same index (or set)?

 Higher associativity
++ Higher hit rate
-- Slower cache access time (hit latency and data access latency)
-- More expensive hardware (more comparators)

 Diminishing returns from higher associativity

[Plot: hit rate vs. associativity, showing diminishing returns as associativity grows]
12
Issues in Set-Associative Caches
 Think of each block in a set having a “priority”
 Indicating how important it is to keep the block in the cache
 Key issue: How do you determine/adjust block priorities?
 There are three key decisions in a set:
 Insertion, promotion, eviction (replacement)

 Insertion: What happens to priorities on a cache fill?
 Where to insert the incoming block, whether or not to insert the block
 Promotion: What happens to priorities on a cache hit?
 Whether and how to change block priority
 Eviction/replacement: What happens to priorities on a cache
miss?
 Which block to evict and how to adjust priorities
13
Eviction/Replacement Policy
 Which block in the set to replace on a cache miss?
 Any invalid block first
 If all are valid, consult the replacement policy
 Random
 FIFO
 Least recently used (how to implement?)
 Not most recently used
 Least frequently used?
 Least costly to re-fetch?
 Why would memory accesses have different cost?
 Hybrid replacement policies
 Optimal replacement policy?

14
Implementing LRU
 Idea: Evict the least recently accessed block
 Problem: Need to keep track of access ordering of blocks

 Question: 2-way set associative cache:
 What do you need to implement LRU perfectly?

 Question: 4-way set associative cache:
 What do you need to implement LRU perfectly?
 How many different orderings possible for the 4 blocks in the
set?
 How many bits needed to encode the LRU order of a block?
 What is the logic needed to determine the LRU victim?
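As one possible answer to the questions above: the 4 blocks in a set have 4! = 24 possible orderings, so ⌈log2 24⌉ = 5 bits are enough to encode the full LRU order of the set; a simpler (slightly redundant) encoding keeps a 2-bit “age” per way, as in the hedged C sketch below. Names and structure are illustrative.

#include <stdint.h>

#define NUM_WAYS 4

/* age 0 = most recently used, NUM_WAYS-1 = least recently used.
 * Initialized to a valid permutation of 0..3. */
static uint8_t age[NUM_WAYS] = {0, 1, 2, 3};

/* Called on every hit to (or fill of) 'way': it becomes MRU and every way
 * that was more recent than it ages by one, so ages stay a permutation. */
void lru_update(int way) {
    uint8_t old = age[way];
    for (int w = 0; w < NUM_WAYS; w++)
        if (age[w] < old)
            age[w]++;
    age[way] = 0;
}

/* The LRU victim is the way whose age is NUM_WAYS - 1. */
int lru_victim(void) {
    for (int w = 0; w < NUM_WAYS; w++)
        if (age[w] == NUM_WAYS - 1)
            return w;
    return 0;   /* not reached while ages remain a permutation */
}

For a 2-way set, this degenerates to a single bit per set indicating which way was used more recently.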

15
Approximations of LRU
 Most modern processors do not implement “true LRU” (also
called “perfect LRU”) in highly-associative caches

 Why?
 True LRU is complex
 LRU is an approximation to predict locality anyway (i.e., not
the best possible cache management policy)

 Examples:
 Not MRU (not most recently used)
 Victim-NextVictim Replacement: Only keep track of the victim
and the next victim

16
Cache Replacement Policy: LRU or Random
 LRU vs. Random: Which one is better?
 Example: 4-way cache, cyclic references to A, B, C, D, E
 0% hit rate with LRU policy
 Set thrashing: When the “program working set” in a set is
larger than set associativity
 Random replacement policy is better when thrashing occurs
 In practice:
 Depends on workload
 Average hit rate of LRU and Random are similar

 Best of both worlds: Hybrid of LRU and Random
 How to choose between the two? Set sampling (a sketch follows below)
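The following is a hedged C sketch of one way set sampling could work (often called set dueling): a few “leader” sets always use LRU, a few always use Random, and a saturating counter tracks which group misses less; all other “follower” sets adopt the currently winning policy. The constants, leader-selection rule, and names are assumptions for illustration, not a specific published design.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 1024
#define PSEL_MAX 1023                   /* 10-bit saturating counter */

static uint16_t psel = PSEL_MAX / 2;    /* starts neutral */

static bool is_lru_leader(int set)    { return (set % 32) == 0; }
static bool is_random_leader(int set) { return (set % 32) == 1; }

/* Called on every miss; leader sets steer the policy-selection counter. */
void on_miss(int set) {
    if (is_lru_leader(set) && psel < PSEL_MAX)
        psel++;                         /* LRU leader missed: lean toward Random */
    else if (is_random_leader(set) && psel > 0)
        psel--;                         /* Random leader missed: lean toward LRU */
}

/* Which policy should this set use for its next eviction? */
bool use_lru(int set) {
    if (is_lru_leader(set))    return true;
    if (is_random_leader(set)) return false;
    return psel < PSEL_MAX / 2;         /* followers copy the policy missing less */
}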

17
What Is the Optimal Replacement Policy?
Belady’s OPT
 Replace the block that is going to be referenced furthest in the
future by the program
 Belady, “A study of replacement algorithms for a virtual-
storage computer,” IBM Systems Journal, 1966.
 How do we implement this? Simulate? (a simulation sketch follows after this list)

 Is this optimal for minimizing miss rate?

 Is this optimal for minimizing execution time?
 No. Cache miss latency/cost varies from block to block!
 Two reasons: Remote vs. local caches and miss overlapping
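Here is a minimal trace-driven C sketch of Belady's OPT for a small fully-associative cache, in the spirit of the “Simulate?” bullet: because it needs the complete future access trace, it is an offline upper bound for comparison, not an implementable hardware policy. The capacity, the example trace, and the names are illustrative.

#include <stdio.h>

#define CAPACITY  4
#define TRACE_LEN 12

static int trace[TRACE_LEN] = {1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2};

/* Position of the next reference to 'block' after 'pos' (large if none). */
static int next_use(int pos, int block) {
    for (int i = pos + 1; i < TRACE_LEN; i++)
        if (trace[i] == block) return i;
    return TRACE_LEN + 1;                       /* never referenced again */
}

int main(void) {
    int cache[CAPACITY];
    int filled = 0, misses = 0;
    for (int pos = 0; pos < TRACE_LEN; pos++) {
        int hit = 0;
        for (int i = 0; i < filled; i++)
            if (cache[i] == trace[pos]) { hit = 1; break; }
        if (hit) continue;
        misses++;
        if (filled < CAPACITY) { cache[filled++] = trace[pos]; continue; }
        int victim = 0;                         /* evict the block used furthest in the future */
        for (int i = 1; i < filled; i++)
            if (next_use(pos, cache[i]) > next_use(pos, cache[victim])) victim = i;
        cache[victim] = trace[pos];
    }
    printf("OPT misses: %d out of %d accesses\n", misses, TRACE_LEN);
    return 0;
}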

18
What’s In A Tag Store Entry?
 Valid bit
 Tag
 Replacement policy bits
 Dirty bit?
 To reduce the frequency of writing back blocks on replacement, a feature called the dirty bit is commonly used.
 This status bit indicates whether the block is dirty (modified while in the cache) or clean (not modified).
 If it is clean, the block is not written back on a miss, since identical information to the cache is found in lower levels.

19
Handling Writes (I)
 When do we write the modified data in a cache to the next level?
 Write through: At the time the write happens
 Write back: When the block is evicted

 Write-back
+ Can combine multiple writes to the same block before eviction
 Potentially saves bandwidth between cache levels + saves energy
-- Need a bit in the tag store indicating the block is “dirty/modified”

 Write-through
+ Simpler
+ All levels are up to date. Consistency: Simpler cache coherence because
no need to check close-to-processor caches’ tag stores for presence
-- More bandwidth intensive; no combining of writes
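To make the contrast above concrete, here is a minimal C sketch of a write hit and an eviction under the two policies. The block structure and the write_next_level() helper are placeholders assumed for illustration, not a real interface.

#include <stdbool.h>
#include <stdint.h>

struct block { bool valid; bool dirty; uint8_t data[64]; };

/* Placeholder: stands in for sending the block's data to the next level. */
static void write_next_level(const struct block *b) { (void)b; }

/* Write-through: update the block and immediately propagate the write,
 * so the next level is always up to date and no dirty bit is needed. */
void write_hit_write_through(struct block *b, int offset, uint8_t value) {
    b->data[offset] = value;
    write_next_level(b);
}

/* Write-back: update the block and just mark it dirty;
 * the next level sees the new data only when the block is evicted. */
void write_hit_write_back(struct block *b, int offset, uint8_t value) {
    b->data[offset] = value;
    b->dirty = true;                 /* requires a dirty bit in the tag store */
}

void evict_write_back(struct block *b) {
    if (b->dirty)
        write_next_level(b);         /* one write-back combines all prior writes */
    b->valid = false;
    b->dirty = false;
}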

20
 Write-through is easier to implement than write-back. The cache is always clean, so, unlike write-back, read misses never result in writes to the lower level.

 Write-through also has the advantage that the next lower level has the
most current copy of the data, which simplifies data coherency.

21
There are two options on a write miss:
■ Write allocate—The block is allocated on a write miss, followed by the write hit actions above. In this natural option, write misses act like read misses.
■ No-write allocate—In this apparently unusual alternative, write misses do not affect the cache. Instead, the block is modified only in the lower-level memory.

22
Cache Parameters vs. Miss/Hit Rate
 Cache size

 Block size

 Associativity

 Replacement policy

 Insertion/Placement policy

23
Cache Size
 Cache size: total data (not including tag) capacity
 bigger can exploit temporal locality better
 not ALWAYS better

 Too large a cache adversely affects hit and miss latency
 smaller is faster => bigger is slower
 access time may degrade critical path

 Too small a cache
 doesn’t exploit temporal locality well
 useful data replaced often

 Working set: the whole set of data the executing application references within a time interval

[Plot: hit rate vs. cache size; hit rate climbs steeply until the cache reaches the “working set” size, then flattens]
24
Block Size
 Block size is the data that is associated with an address tag
 not necessarily the unit of transfer between hierarchies
 Sub-blocking: A block divided into multiple pieces (each w/ V/D bits)

 Too small blocks
 don’t exploit spatial locality well
 have larger tag overhead

 Too large blocks
 too few total # of blocks → less temporal locality exploitation
 waste of cache space and bandwidth/energy if spatial locality is not high

[Plot: hit rate vs. block size; hit rate peaks at an intermediate block size]

25
Associativity
 How many blocks can be present in the same index (i.e., set)?

 Larger associativity
 lower miss rate (reduced conflicts)
 higher hit latency and area cost (plus diminishing returns)

 Smaller associativity
 lower cost
 lower hit latency
 Especially important for L1 caches

 Is power of 2 associativity required?

[Plot: hit rate vs. associativity, with diminishing returns]
26
Classification of Cache Misses
 Compulsory miss
 first reference to an address (block) always results in a miss
 subsequent references should hit unless the cache block is
displaced for the reasons below

 Capacity miss
 cache is too small to hold everything needed
 defined as the misses that would occur even in a fully-
associative cache (with optimal replacement) of the same
capacity

 Conflict miss
 defined as any miss that is neither a compulsory nor a
capacity miss
27
How to Reduce Each Miss Type
 Compulsory
 Caching cannot help
 Prefetching can: Anticipate which blocks will be needed soon
 Conflict
 More associativity
 Other ways to get more associativity without making the cache
associative
 Victim cache
 Better, randomized indexing
 Software hints?
 Capacity
 Utilize cache space better: keep blocks that will be referenced
 Software management: divide working set and computation
such that each “computation phase” fits in cache
28
How to Improve Cache Performance
 Three fundamental goals

 Reducing miss rate
 Caveat: reducing miss rate can reduce performance if more costly-to-refetch blocks are evicted

 Reducing miss latency or miss cost

 Reducing hit latency or hit cost

 The above three together affect performance

29
Restructuring Data Access Patterns (I)
 Idea: Restructure data layout or data access patterns
 Example: If column-major
 x[i+1,j] follows x[i,j] in memory
 x[i,j+1] is far away from x[i,j]

Poor code:
  for i = 1, rows
    for j = 1, columns
      sum = sum + x[i,j]

Better code:
  for j = 1, columns
    for i = 1, rows
      sum = sum + x[i,j]

 This is called loop interchange
 Other optimizations can also increase hit rate
 Loop fusion, array merging, …
 What if multiple arrays? Unknown array size at compile time?
30
Restructuring Data Access Patterns (II)
 Blocking
 Divide loops operating on arrays into computation chunks so
that each chunk can hold its data in the cache
 Avoids cache conflicts between different chunks of
computation
 Essentially: Divide the working set so that each piece fits in
the cache

 But, there are still self-conflicts in a block:
1. there can be conflicts among different arrays
2. array sizes may be unknown at compile/programming time
(a tiling sketch follows below)
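As a hedged illustration of blocking, the C sketch below tiles a square matrix multiply; N and BLOCK are made-up values, with BLOCK chosen so that the three BLOCK × BLOCK tiles being worked on fit in the cache. It is a sketch of the technique, not a tuned kernel.

#define N     512
#define BLOCK 32          /* assumed tile size; 3 * 32*32 doubles is about 24 KB */

void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N]) {
    /* Outer loops walk over BLOCK x BLOCK chunks of the computation. */
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                /* Inner loops reuse the three resident tiles before moving on. */
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int j = jj; j < jj + BLOCK; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + BLOCK; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
}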

31
Restructuring Data Layout (I)
 Pointer based traversal (e.g., of a linked list)
 Assume a huge linked list (1B nodes) and unique keys
 Why does the code below have a poor cache hit rate?
 “Other fields” occupy most of the cache line even though rarely accessed!

struct Node {
  struct Node* next;
  int key;
  char name[256];
  char school[256];
};

while (node) {
  if (node->key == input_key) {
    // access other fields of node
  }
  node = node->next;
}
32
Restructuring Data Layout (II)
 Idea: separate frequently-used fields of a data structure and pack them into a separate data structure

struct Node {
  struct Node* next;
  int key;
  struct Node_data* node_data;
};

struct Node_data {
  char name[256];
  char school[256];
};

while (node) {
  if (node->key == input_key) {
    // access node->node_data
  }
  node = node->next;
}

 Who should do this?
 Programmer
 Compiler
 Profiling vs. dynamic
 Hardware?
 Who can determine what is frequently used?
33
Cache Optimization

34
Six basic cache optimizations
 Larger block size to reduce miss rate
 The simplest way to reduce the miss rate is to take advantage
of spatial locality and increase the block size.
 Larger blocks reduce compulsory misses, but they also
increase the miss penalty.
 Because larger blocks lower the number of tags, they can
slightly reduce static power.
 Larger block sizes can also increase capacity or conflict misses,
especially in smaller caches.
 Choosing the right block size is a complex trade-off that
depends on the size of cache and the miss penalty.

35
Bigger caches to reduce miss rate
 The obvious way to reduce capacity misses is to increase
cache capacity.
 Drawbacks include potentially longer hit time of the larger
cache memory and higher cost and power.
 Larger caches increase both static and dynamic power.

36
Higher associativity to reduce miss rate
 Obviously, increasing associativity reduces conflict misses.
Greater associativity can come at the cost of increased hit
time. As we will see shortly, associativity also increases
power consumption.

37
Multilevel caches to reduce miss penalty
 A difficult decision is whether to make the cache hit time fast, to
keep pace with the high clock rate of processors, or to make the
cache large to reduce the gap between the processor accesses and
main memory accesses. Adding another level of cache between the
original cache and memory simplifies the decision.
 The first-level cache can be small enough to match a fast clock cycle time, yet the second-level (or third-level) cache can be large enough to capture many accesses that would go to main memory.
 The focus on misses in second-level caches leads to larger blocks, bigger capacity, and higher associativity. Multilevel caches are more power efficient than a single aggregate cache. If L1 and L2 refer, respectively, to first- and second-level caches, we can redefine the average memory access time:

Average memory access time = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
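As a worked example with assumed (made-up) numbers rather than figures from the text: suppose the L1 hit time is 1 cycle, the L1 miss rate is 5%, the L2 hit time is 10 cycles, the local L2 miss rate is 20%, and the L2 miss penalty is 100 cycles. Then:

Average memory access time = 1 + 0.05 × (10 + 0.20 × 100)
                           = 1 + 0.05 × 30
                           = 2.5 cycles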

38
Giving priority to read misses over writes to reduce
miss penalty
 A write buffer is a good place to implement this
optimization. Write buffers create hazards because they
hold the updated value of a location needed on a read miss
—that is, a read-after-write hazard through memory.
 One solution is to check the contents of the write buffer on a
read miss. If there are no conflicts, and if the memory system
is available, sending the read before the writes reduces the
miss penalty. Most processors give reads priority over writes.
 This choice has little effect on power consumption.

39
Avoiding address translation during indexing of the
cache to reduce hit time
 Caches must cope with the translation of a virtual address
from the processor to a physical address to access
memory.
 A common optimization is to use the page offset—the part
that is identical in both virtual and physical addresses—to
index the cache
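As a hedged numeric illustration (the parameters are assumptions, not from the text): with 4 KB pages the page offset is 12 bits, so the cache index and block offset can come entirely from untranslated bits as long as (block-offset bits + index bits) ≤ 12. For example, a 32 KB cache with 64-byte blocks has 6 offset bits; organized 8-way it has 32 KB / (64 B × 8) = 64 sets, i.e., 6 index bits, and 6 + 6 = 12, so the cache can be indexed in parallel with address translation.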

40
Ten Advanced Optimizations of Cache
Performance
We can classify the ten advanced cache optimizations we
examine into five categories based on these metrics:
 Reducing the hit time—Small and simple first-level caches and way
prediction. Both techniques also generally decrease power consumption.
 Increasing cache bandwidth—Pipelined caches, multibanked caches, and
nonblocking caches. These techniques have varying impacts on power
consumption.
 Reducing the miss penalty—Critical word first and merging write buffers.
These optimizations have little impact on power.
 Reducing the miss rate—Compiler optimizations. Obviously any
improvement at compile time improves power consumption.
 Reducing the miss penalty or miss rate via parallelism—Hardware
prefetching and compiler prefetching. These optimizations generally
increase power consumption, primarily due to prefetched data that are
unused.

41
Advanced Cache Optimization
 First Optimization: Small and Simple First-Level Caches to Reduce Hit Time
and Power
 Second Optimization: Way Prediction to Reduce Hit Time
 Third Optimization: Pipelined Cache Access to Increase Cache Bandwidth
 Fourth Optimization: Nonblocking Caches to Increase Cache Bandwidth
 Fifth Optimization: Multibanked Caches to Increase Cache Bandwidth
 Sixth Optimization: Critical Word First and Early Restart to Reduce Miss
Penalty
 Seventh Optimization: Merging Write Buffer to Reduce Miss Penalty
 Eighth Optimization: Compiler Optimizations to Reduce Miss Rate
 Ninth Optimization: Hardware Prefetching of Instructions and Data to
Reduce Miss Penalty or Miss Rate
 Tenth Optimization: Compiler-Controlled Prefetching to Reduce Miss Penalty
or Miss Rate

42
