CA11_2023S1_new
❑ Solution:
➢ misses/instruction = 1% + 30% × 5% = 0.025
➢ memory stall cycles/instruction = 0.025 × 100 = 2.5 cycles
➢ total memory stall cycles = 2.5 × 10⁶ = 2,500,000 cycles
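The arithmetic above can be checked with a short script. The interpretation assumed here: a 1% instruction-cache miss rate, 30% of instructions are loads/stores with a 5% data-cache miss rate, a 100-cycle miss penalty, and (implied by the total) 10⁶ instructions:

```python
# Memory-stall calculation for the example above (assumed parameters).
i_miss = 0.01            # instruction-cache miss rate
mem_frac = 0.30          # fraction of instructions that are loads/stores
d_miss = 0.05            # data-cache miss rate
penalty = 100            # miss penalty in cycles
instructions = 1_000_000 # implied by the 2,500,000-cycle total

misses_per_instr = i_miss + mem_frac * d_miss    # 0.025 per the slide
stall_per_instr = misses_per_instr * penalty     # 2.5 cycles per the slide
total_stalls = stall_per_instr * instructions    # 2,500,000 cycles

print(misses_per_instr, stall_per_instr, total_stalls)
```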
Impacts of Cache Performance
❑ Relative cache penalty increases as processor performance
improves (faster clock rate and/or lower CPI)
➢ When calculating CPIstall, the cache miss penalty is measured in processor
clock cycles needed to handle a miss.
➢ The lower the CPIideal, the more pronounced the impact of stalls
❑ Example: Given
▪ I-cache miss rate = 2%, D-cache miss rate = 4%
▪ Miss penalty = 100 cycles
▪ Base CPI (ideal cache) = 2
▪ Load & stores are 36% of instructions
Questions:
➢ What is CPIstall? 2 + (2% + 36% × 4%) × 100 = 5.44; % time on memory stall = 63%
➢ What if CPIideal is reduced to 1? % time on memory stall = 77%
➢ What if the processor clock rate is doubled? Miss penalty = 200 cycles, CPIstall = 8.88
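These three answers follow from the same stall formula; a short script verifying them:

```python
# Stall cycles/instruction = (I-miss rate + load-store fraction × D-miss rate) × penalty
def cpi_stall(base_cpi, penalty):
    memstall = (0.02 + 0.36 * 0.04) * penalty   # stall cycles per instruction
    return base_cpi + memstall, memstall

cpi, stall = cpi_stall(2, 100)     # CPI = 5.44
frac = stall / cpi                 # ≈ 0.63: stalls dominate even at CPIideal = 2

cpi1, stall1 = cpi_stall(1, 100)   # CPIideal lowered to 1
frac1 = stall1 / cpi1              # ≈ 0.77: lower CPIideal → stalls hurt more

cpi2, _ = cpi_stall(2, 200)        # clock doubled → penalty doubles in cycles
print(cpi, frac, frac1, cpi2)      # cpi2 = 8.88
```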
Average Memory Access Time (AMAT)
❑ Hit time is also important for performance
➢ A larger cache will have a longer access time → an increase in hit time will
likely add another stage to the pipeline.
➢ At some point, the increase in hit time for a larger cache will overcome the
improvement in hit rate leading to a decrease in performance.
❑ Solution:
➢ AMAT = 1 + 0.05 × 20 = 2 cycles = 4 ns
➢ Without the cache, AMAT would equal the miss penalty = 20 cycles = 40 ns
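The worked numbers imply a 1-cycle hit time, a 5% miss rate, a 20-cycle miss penalty, and a 2 ns clock (the question itself is on an earlier slide); under those assumptions:

```python
# AMAT = hit time + miss rate × miss penalty (all measured in cycles here).
hit_time = 1       # cycles (assumed from the worked solution)
miss_rate = 0.05
penalty = 20       # cycles
clock_ns = 2.0     # ns per cycle, so 2 cycles = 4 ns

amat_cycles = hit_time + miss_rate * penalty   # 2 cycles
amat_ns = amat_cycles * clock_ns               # 4 ns
no_cache_ns = penalty * clock_ns               # 40 ns without the cache
print(amat_cycles, amat_ns, no_cache_ns)
```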
Reducing cache miss rates #1: cache
associativity
❑ Allow more flexible block placement
➢ In a direct mapped cache a memory block maps to exactly one cache block
➢ At the other extreme, could allow a memory block to be mapped to any cache
block → fully associative cache (no indexing)
❑ Example: 8 requests, 2 misses
➢ Solves the ping-pong effect of a direct mapped cache, since two memory
locations that map into the same cache set can now co-exist!
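The "8 requests, 2 misses" result can be reproduced with a tiny cache simulator (a hypothetical helper, not from the slides). Both configurations hold four blocks in total, and block addresses 0 and 4 map to the same set in each, so the direct mapped cache ping-pongs while the 2-way cache keeps both blocks:

```python
def count_misses(addresses, num_sets, ways):
    """Count misses for a block-addressed trace with LRU replacement."""
    sets = [[] for _ in range(num_sets)]   # each set: list of tags, MRU last
    misses = 0
    for addr in addresses:
        index, tag = addr % num_sets, addr // num_sets
        s = sets[index]
        if tag in s:
            s.remove(tag)                  # hit: refresh LRU order
        else:
            misses += 1
            if len(s) == ways:             # set full: evict the LRU block
                s.pop(0)
        s.append(tag)
    return misses

trace = [0, 4, 0, 4, 0, 4, 0, 4]           # two blocks sharing one set
print(count_misses(trace, num_sets=4, ways=1))   # direct mapped: 8 misses
print(count_misses(trace, num_sets=2, ways=2))   # 2-way set associative: 2 misses
```

Note that the comparison keeps the total capacity fixed (4 blocks); only the placement flexibility changes.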
Four-Way Set Associative Cache
Organization
[Figure: four-way set associative cache with 2⁸ = 256 sets, each with four ways (each way holding one block): Way 0, Way 1, Way 2, Way 3]
➢ Increasing associativity → fully associative (only one set): the tag is all
the bits except the block and byte offset
➢ Decreasing associativity → direct mapped (only one way): smaller tags, only
a single comparator
❑ Least Recently Used (LRU): replace the one that has been
unused for the longest time
➢ Requires hardware to keep track of when each way’s block was used relative
to the other blocks in the set. For 2-way set associative, takes one bit per set
→ set the bit when a block is referenced (and reset the other way’s bit)
➢ Manageable for 4-way, too hard beyond that.
❑ Random
➢ Gives approximately the same performance as LRU for high associativity.
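The one-bit-per-set bookkeeping for 2-way LRU can be sketched as follows (a hypothetical class, assuming the bit records which way is the current LRU victim; referencing one way makes the other way the victim, matching the set/reset behavior described above):

```python
class TwoWayLRU:
    """One LRU bit per set for a 2-way set associative cache."""
    def __init__(self, num_sets):
        self.lru_bit = [0] * num_sets   # lru_bit[s] = least recently used way in set s

    def touch(self, s, way):
        # Referencing a way makes the *other* way the replacement victim.
        self.lru_bit[s] = 1 - way

    def victim(self, s):
        # The way to replace on a miss in set s.
        return self.lru_bit[s]

lru = TwoWayLRU(num_sets=4)
lru.touch(0, 1)          # way 1 of set 0 was just referenced
print(lru.victim(0))     # way 0 is now the LRU victim
```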
Caching Example: Write-Back Fully
Associative with LRU
❑ Matrix multiplication, cold start
➢ B has been transposed into Bt to optimize efficiency
➢ Lower LRU number = more recently used
❑ Compute C[0]
➢ Access A[0] = 0x1000:
▪ Miss, copy block A[0:3] to cache, set Tag, V, LRU bits
➢ Access Bt[0] = 0x2000:
▪ Miss, copy block Bt[0:3] to cache, set new Tag, V, LRU bits; update existing block's LRU bits
[Figure: cache contents (Tag, V, Dirty, LRU, Data columns) after the two misses, shown alongside the matrices A, Bt, and C]
Caching Example: Write-Back Fully
Associative with LRU
❑ Compute C[0] (cont.)
➢ Access A[1] = 0x1004: hit, Access Bt[1] = 0x2004: hit
➢ Access A[2] = 0x1008: hit, Access Bt[2] = 0x2008: hit
➢ Access A[3] = 0x100C: hit, Access Bt[3] = 0x200C: hit
[Figure: miss rate (%) vs. associativity (1-way, 2-way, 4-way, 8-way) for 32KB, 64KB, and 128KB caches]
❑ The choice of direct mapped or set associative depends on the cost of a
miss versus the cost of implementing associativity.
❑ Example: Given
▪ CPU base CPI = 1, clock rate = 4GHz
▪ Miss rate/instruction = 2%
▪ Main memory access time = 100ns
Questions:
➢ Compute the actual CPI with just primary cache.
➢ Compute the performance gain if we add L2 cache with
▪ Access time = 5ns
▪ Global miss rate to main memory = 0.5%
Multi-level cache: example solution
❑ With just primary cache
➢ Miss penalty = 100ns/0.25ns = 400 cycles
➢ CPIstall = 1 + 0.02 × 400 = 9
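The same arithmetic can be extended to the two-level question using the standard multilevel-CPI decomposition (every L1 miss pays the L2 access time; the global misses additionally pay the main-memory penalty). The L2 numbers below are a sketch derived from the given data, not taken from this slide:

```python
clock_ghz = 4.0
cycle_ns = 1 / clock_ghz                  # 0.25 ns per cycle
base_cpi = 1
l1_miss = 0.02                            # primary-cache miss rate/instruction
mem_ns, l2_ns = 100, 5                    # main memory and L2 access times
global_miss = 0.005                       # global miss rate to main memory

mem_penalty = mem_ns / cycle_ns           # 400 cycles
cpi_l1_only = base_cpi + l1_miss * mem_penalty        # 1 + 0.02 × 400 = 9

l2_penalty = l2_ns / cycle_ns             # 20 cycles
cpi_l2 = base_cpi + l1_miss * l2_penalty + global_miss * mem_penalty  # 3.4
speedup = cpi_l1_only / cpi_l2            # ≈ 2.6× performance gain
print(cpi_l1_only, cpi_l2, speedup)
```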