
ELT3047 Computer Architecture

Lecture 11: Associative cache

Hoang Gia Hung


Faculty of Electronics and Telecommunications
University of Engineering and Technology, VNU Hanoi
Last lecture review
❑ Memory hierarchy: have multiple levels of storage & ensure the
data the processor needs is kept in the fast(er) level(s).
➢ Temporal Locality: if address 𝑋 is accessed, it is likely to be accessed again in the near future.
➢ Spatial Locality: if address 𝑋 is accessed, data stored in nearby locations are likely to be accessed in the near future.

❑ Three major categories of cache misses:
1. Compulsory misses: sad facts of life. Example: cold start misses.
2. Conflict misses: multiple memory locations mapping to the same cache location. Nightmare scenario: the ping-pong effect.
3. Capacity misses: the cache is not big enough to contain all the blocks required by the program. Solution: increase cache size.

❑ Cache design space:


➢ total size, block size
➢ write-hit policy (write-through, write-back)
➢ write-miss policy (write allocate, write buffers)
Measuring Cache Performance
❑ The processor stalls on a cache miss
➢ When fetching instructions from the Instruction Cache (I-cache)
➢ When loading or storing data into the Data Cache (D-cache)
➢ Miss penalty is assumed equal for I-cache & D-cache
➢ Miss penalty is assumed equal for Load and Store

❑ Components of CPU time:


➢ Program execution cycles (includes cache hit time)
➢ Memory stall cycles (mainly from cache misses)
➢ CPU time = IC × CPI × CC = IC × (CPIideal + Memory-stall cycles per instruction) × CC
▪ CPIideal = CPI for an ideal cache (no cache misses)
▪ CPIstall = CPIideal + Memory-stall cycles per instruction, i.e. the CPI in the presence of memory stalls
▪ Memory stall cycles increase the CPI!
Memory Stall Cycles
❑ Sum of read-stalls and write-stalls (due to cache misses)
➢ Read-stall cycles = reads/program × read miss rate × read miss penalty
➢ Write-stall cycles = (writes/program × write miss rate × write miss penalty) + write buffer stalls

❑ Memory stall cycles = (I-Cache Misses + D-Cache Misses) × Miss Penalty
➢ I-Cache Misses = I-Count × I-Cache Miss Rate
➢ D-Cache Misses = LS-Count × D-Cache Miss Rate
▪ LS-Count (Loads & Stores) = I-Count × LS Frequency

❑ With simplifying assumptions:
Memory stall cycles = I-Count × misses/instruction × miss penalty
➢ misses/instruction = I-Cache Miss Rate + LS Frequency × D-Cache Miss Rate
➢ Memory stall cycles/instruction = I-Cache Miss Rate × Miss Penalty + LS Frequency × D-Cache Miss Rate × Miss Penalty
➢ For write-through caches (negligible write buffer stalls): Memory-stall cycles = memory accesses/program × miss rate × miss penalty
Memory Stall Cycles: example
❑ Example: Compute misses/instruction and memory stall cycles
for a program with the given characteristics
▪ Instruction count (I-Count) = 10^6 instructions
▪ 30% of instructions are loads and stores
▪ D-cache miss rate is 5% and I-cache miss rate is 1%
▪ Miss penalty is 100 clock cycles for instruction and data caches

❑ Solution:
➢ misses/instruction = 1% + 30% × 5% = 0.025
➢ memory stall cycles/instruction = 0.025 × 100 = 2.5 cycles
➢ total memory stall cycles = 2.5 × 10^6 = 2,500,000 cycles
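This arithmetic can be reproduced with a short Python sketch (a minimal check; the variable names are ours, not from the lecture):

i_count      = 10**6     # instructions
ls_frequency = 0.30      # fraction of loads & stores
i_miss_rate  = 0.01
d_miss_rate  = 0.05
miss_penalty = 100       # clock cycles

# misses/instruction = I-cache miss rate + LS frequency * D-cache miss rate
misses_per_instr = i_miss_rate + ls_frequency * d_miss_rate   # 0.025
stalls_per_instr = misses_per_instr * miss_penalty            # 2.5 cycles
total_stalls     = i_count * stalls_per_instr                 # 2,500,000 cycles
print(misses_per_instr, stalls_per_instr, total_stalls)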
Impacts of Cache Performance
❑ Relative cache penalty increases as processor performance
improves (faster clock rate and/or lower CPI)
➢ When calculating CPIstall, the cache miss penalty is measured in processor
clock cycles needed to handle a miss.
➢ The lower the CPIideal, the more pronounced the impact of stalls

❑ Example: Given
▪ I-cache miss rate = 2%, D-cache miss rate = 4%
▪ Miss penalty = 100 cycles
▪ Base CPI (ideal cache) = 2
▪ Loads & stores are 36% of instructions
Questions:
➢ What is CPIstall? 2 + (2% + 36% × 4%) × 100 = 5.44; % time on memory stalls = 3.44/5.44 ≈ 63%
➢ What if CPIideal is reduced to 1? CPIstall = 4.44; % time on memory stalls ≈ 77%
➢ What if the processor clock rate is doubled? Miss penalty = 200 cycles, CPIstall = 8.88
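The same calculations, written as a small Python sketch (function and variable names are ours):

def cpi_with_stalls(cpi_ideal, i_miss, d_miss, ls_freq, penalty):
    # memory-stall cycles per instruction from I-cache and D-cache misses
    stalls = (i_miss + ls_freq * d_miss) * penalty
    return cpi_ideal + stalls, stalls

cpi, stalls = cpi_with_stalls(2, 0.02, 0.04, 0.36, 100)
print(cpi, stalls / cpi)                            # 5.44, ~63% of time stalled
print(cpi_with_stalls(1, 0.02, 0.04, 0.36, 100))    # CPI 4.44 -> ~77% stalled
print(cpi_with_stalls(2, 0.02, 0.04, 0.36, 200))    # doubled clock: CPI 8.88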
Average Memory Access Time (AMAT)
❑ Hit time is also important for performance
➢ A larger cache will have a longer access time → an increase in hit time will
likely add another stage to the pipeline.
➢ At some point, the increase in hit time for a larger cache will outweigh the improvement in hit rate, leading to a decrease in performance.

❑ Average Memory Access Time (AMAT) is the average time to access memory, considering both hits and misses:
AMAT = Hit time + Miss rate × Miss penalty
❑ Example: Find the AMAT for a cache with
▪ Cache access time (Hit time) of 1 cycle = 2 ns
▪ Miss penalty of 20 clock cycles
▪ Miss rate of 0.05 per access

❑ Solution:
➢ AMAT = 1 + 0.05 × 20 = 2 cycles = 4 ns
➢ Without the cache, AMAT will be equal to miss penalty = 20 cycles = 40 ns
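A small helper reproducing this example (a sketch; the names are ours and all quantities use the numbers above):

def amat(hit_time, miss_rate, miss_penalty):
    # all arguments in cycles (or all in ns) -> result in the same unit
    return hit_time + miss_rate * miss_penalty

cycle_ns = 2.0
print(amat(1, 0.05, 20))             # 2.0 cycles
print(amat(1, 0.05, 20) * cycle_ns)  # 4.0 ns
print(20 * cycle_ns)                 # without the cache: 40 ns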
Reducing cache miss rates #1: cache associativity
❑ Allow more flexible block placement
➢ In a direct mapped cache a memory block maps to exactly one cache block
➢ At the other extreme, could allow a memory block to be mapped to any cache
block → fully associative cache (no indexing)

❑ A compromise is to divide the cache into sets, each of which consists of n “ways” (n-way set associative).
➢ A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices):
Set index = (block address) modulo (# sets in the cache)
❑ Example: consider the main memory word reference for the
following string
0 4 0 4 0 4 0 4
➢ Start with an empty cache - all blocks initially marked as not valid
Set Associative Cache: Example
[Figure: a 2-way set-associative cache (2 sets, Way 0 and Way 1, one-word blocks) next to a 16-word main memory with addresses 0000xx–1111xx; the two low-order bits select the byte in the 32-bit word.]
➢ Q1: Is it there? Compare all the cache tags in the set against the high-order 3 bits of the memory address to tell if the memory block is in the cache.
➢ Q2: How do we find it? Use the next low-order memory address bit (above the byte offset) to determine which cache set the block maps to (i.e., block address modulo the number of sets in the cache).
Set associative cache example: reference string mapping
0 4 0 4 0 4 0 4

➢ 0 miss: set 0, way 0 ← Mem(0) (tag 000)
➢ 4 miss: set 0, way 1 ← Mem(4) (tag 010)
➢ 0 hit, 4 hit, 0 hit, 4 hit, 0 hit, 4 hit – both blocks stay resident in set 0
❑ 8 requests, 2 misses
➢ Solves the ping pong effect in a direct mapped cache since now 2 memory
locations that map into the same cache set can co-exist!
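This trace can be replayed with a toy LRU cache simulator (a minimal sketch, not from the lecture; addresses are word/block addresses as in the figure, and the direct-mapped run assumes the same 4-block capacity):

from collections import OrderedDict

def count_misses(trace, num_sets, ways):
    # each set is an OrderedDict used as an LRU list: oldest entry first
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in trace:
        s = sets[block % num_sets]      # set index = block address mod #sets
        if block in s:
            s.move_to_end(block)        # hit: mark as most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)   # evict the least recently used block
            s[block] = None
    return misses

trace = [0, 4, 0, 4, 0, 4, 0, 4]
print(count_misses(trace, num_sets=4, ways=1))  # direct mapped, 4 blocks: 8 misses (ping-pong)
print(count_misses(trace, num_sets=2, ways=2))  # 2-way set associative: 2 misses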
Four-Way Set Associative Cache Organization
[Figure: a four-way set-associative cache organization with 2^8 = 256 sets, each with four ways (one block per way).]

❑ Content Addressable Memory (CAM): a circuit that combines comparison and storage in a single device – supply the data and it looks for a copy and returns the index of the matching row → CAM allows much higher set associativity (8-way and above) than the standard HW of SRAMs + comparators.
Range of Set Associative Caches
[Figure: the address is split into Tag | Index | Block offset | Byte offset – the tag is used for the tag compare, the index selects the set, the block offset selects the word in the block, and the byte offset selects the byte in the word.]
➢ Increasing associativity: fewer index bits, larger tags. Fully associative (only one set): the tag is all the address bits except the block and byte offsets.
➢ Decreasing associativity: more index bits, smaller tags. Direct mapped (only one way): smallest tags, only a single comparator.

❑ For a fixed-size cache, each increase by a factor of two in associativity doubles the number of blocks per set (= the number of ways) and halves the number of sets – it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.
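A small sketch of this bit accounting (the 4KB cache / 16B block parameters are our own illustration; 32-bit addresses and 32-bit words are assumed):

import math

def address_fields(cache_bytes, block_bytes, ways, addr_bits=32):
    num_sets     = (cache_bytes // block_bytes) // ways
    byte_offset  = 2                                   # byte within a 32-bit word
    block_offset = int(math.log2(block_bytes // 4))    # word within the block
    index        = int(math.log2(num_sets)) if num_sets > 1 else 0
    tag          = addr_bits - index - block_offset - byte_offset
    return tag, index, block_offset, byte_offset

# Fixed 4KB cache with 16B blocks: each doubling of associativity
# moves one bit from the index into the tag.
for ways in (1, 2, 4, 8):
    print(ways, address_fields(4096, 16, ways))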
Replacement Policies
❑ A miss occurred – which way’s block do we pick for replacement?
➢ Direct mapped: no choice.
➢ Set associative: prefer a non-valid entry; otherwise choose among the entries in the set.

❑ First In First Out (FIFO): replace the oldest block in the set
➢ Use one counter per set to specify the oldest block. On a cache miss, replace the block specified by the counter & increment the counter.

❑ Least Recently Used (LRU): replace the one that has been unused for the longest time
➢ Requires hardware to keep track of when each way’s block was used relative to the other blocks in the set. For 2-way set associative, this takes one bit per set → set the bit when a block is referenced (and reset the other way’s bit).
➢ Manageable for 4-way, too hard beyond that.

❑ Random
➢ Gives approximately the same performance as LRU for high associativity.
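The one-bit-per-set LRU bookkeeping for a 2-way set-associative cache mentioned above might look like this (a hedged sketch; the class and method names are ours):

class TwoWayLRU:
    """One LRU bit per set: the bit records which way was referenced last."""
    def __init__(self, num_sets):
        self.mru_way = [0] * num_sets     # most-recently-used way of each set

    def touch(self, set_idx, way):
        # call on every hit or fill
        self.mru_way[set_idx] = way

    def victim(self, set_idx):
        # on a miss, replace the way that was NOT used most recently
        return 1 - self.mru_way[set_idx]

lru = TwoWayLRU(num_sets=2)
lru.touch(0, way=0)       # block brought into way 0 of set 0
lru.touch(0, way=1)       # later, way 1 of set 0 is referenced
print(lru.victim(0))      # -> 0: way 0 is the LRU candidate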
Caching Example: Write-Back Fully Associative with LRU
❑ Matrix multiplication, cold start
➢ B has been transposed into Bt to optimize efficiency
➢ Lower LRU number = more recently used

❑ Compute C[0]
➢ Access A[0] = 0x1000:
▪ Miss, copy block A[0:3] to cache, set Tag, V, LRU bits
➢ Access Bt[0] = 0x2000:
▪ Miss, copy block Bt[0:3] to cache, set new Tag, V, LRU bits, update existing LRU bit
[Figure: the 4×4 operand matrices A, Bt and result C, and the cache contents table (Tag, V, Dirty, LRU, Data): the ways holding tags 0x100 (block A[0:3]) and 0x200 (block Bt[0:3]) are now valid; the other two ways (tags 0x156, 0x4E9) are still invalid.]
Caching Example: Write-Back Fully Associative with LRU
❑ Compute C[0] (cont.)
➢ Access A[1] = 0x1004: hit; Access Bt[1] = 0x2004: hit
➢ Access A[2] = 0x1008: hit; Access Bt[2] = 0x2008: hit
➢ Access A[3] = 0x100C: hit; Access Bt[3] = 0x200C: hit

❑ Write to C[0] = 0x3000:
➢ Miss + Write, so set the Dirty bit as well as the Tag, V, & LRU bits
➢ Update existing LRU bits
➢ Main memory not updated yet
[Figure: cache contents – a third way (previously tag 0x156) now holds tag 0x300 (block C[0:3]) with V and Dirty set; C[0] = 0x28 exists only in the cache.]
Caching Example: Write-Back Fully Associative with LRU
❑ Compute C[1]
➢ Access A[0] = 0x1000: hit; Access Bt[4] = 0x2010:
▪ Miss, copy block Bt[4:7] to cache
➢ Access A[1] = 0x1004: hit; Access Bt[5] = 0x2014: hit
➢ Access A[2] = 0x1008: hit; Access Bt[6] = 0x2018: hit
➢ Access A[3] = 0x100C: hit; Access Bt[7] = 0x201C: hit

❑ Write to C[1] = 0x3004:
➢ Hit, update LRU bits
➢ Main memory not updated yet
[Figure: cache contents – the last way (previously tag 0x4E9) now holds tag 0x201 (block Bt[4:7]); the dirty block 0x300 holds C[0] = 0x28 and C[1] = 0x1E.]
Caching Example: Write-Back Fully Associative with LRU
❑ Compute C[2]
➢ Access A[0] = 0x1000: hit; Access Bt[8] = 0x2020:
▪ Miss, cache full → evict block 0x200
▪ Copy block Bt[8:11] to cache, update necessary bits
➢ Access A[1] = 0x1004: hit; Access Bt[9] = 0x2024: hit
➢ Access A[2] = 0x1008: hit; Access Bt[10] = 0x2028: hit
➢ Access A[3] = 0x100C: hit; Access Bt[11] = 0x202C: hit

❑ Write to C[2] = 0x3008:
➢ Hit, update LRU bits; main memory not updated yet.
[Figure: cache contents – tag 0x200 has been replaced by tag 0x202 (block Bt[8:11]); the dirty block 0x300 now also holds C[2] = 0x14.]
Caching Example: Write-Back Fully Associative with LRU
❑ Compute C[3]
➢ Access A[0] = 0x1000: hit; Access Bt[12] = 0x2030:
▪ Miss, cache full → evict block 0x201
▪ Copy block Bt[12:15] to cache, update necessary bits
➢ Access A[1] = 0x1004: hit; Access Bt[13] = 0x2034: hit
➢ Access A[2] = 0x1008: hit; Access Bt[14] = 0x2038: hit
➢ Access A[3] = 0x100C: hit; Access Bt[15] = 0x203C: hit

❑ Write to C[3] = 0x300C:
➢ Hit, update LRU bits; main memory not updated yet.
[Figure: cache contents – tag 0x201 has been replaced by tag 0x203 (block Bt[12:15]); the dirty block 0x300 now holds C[0:3] = 0x28, 0x1E, 0x14, 0x0A.]
Caching Example: Write-Back Fully Associative with LRU
❑ Compute C[4]
➢ Access A[4] = 0x1010:
▪ Miss, cache full → evict block 0x202
▪ Copy block A[4:7] to cache, update necessary bits
➢ Access Bt[0] = 0x2000:
▪ Miss, cache full → evict block 0x100
▪ Copy block Bt[0:3] to cache, update necessary bits
➢ Other accesses: hits
➢ Write to C[4] = 0x3010:
▪ Miss → evict block 0x203; copy block C[4:7] to cache (tag 0x301, Dirty set); main memory not updated yet.
[Figure: cache contents – the four ways now hold tags 0x101 (A[4:7]), 0x200 (Bt[0:3]), 0x300 (dirty, C[0:3]) and 0x301 (dirty, C[4] = 0x68).]
Caching Example: Write-Back Fully Associative with LRU
❑ Compute C[5]
➢ Access A[4] = 0x1010: hit; Access Bt[4] = 0x2010:
▪ Miss, cache full → evict block 0x300: only at this point does main memory get updated (the dirty block C[0:3] is written back).
▪ Copy block Bt[4:7] to cache, update necessary bits
➢ Access A[5] = 0x1014: hit; Access Bt[5] = 0x2014: hit
➢ Access A[6] = 0x1018: hit; Access Bt[6] = 0x2018: hit
➢ Access A[7] = 0x101C: hit; Access Bt[7] = 0x201C: hit
➢ Write to C[5] = 0x3014:
▪ Hit, update LRU bits; main memory not updated yet.
[Figure: cache contents – tag 0x300 has been written back and replaced by tag 0x201 (block Bt[4:7]); main memory now holds C[0:3] = 40, 30, 20, 10, and the dirty block 0x301 holds C[4] = 0x68 and C[5] = 0x4E.]
How Much Associativity?
❑ Increased associativity decreases miss rate
➢ But with diminishing returns
[Figure: miss rate (%) versus associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB.]

❑ The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.

❑ N-way set associative cache costs
➢ N comparators (delay and area)
➢ MUX delay (set selection) before data is available
➢ Data available only after set selection and the Hit/Miss decision (c.f. direct mapped cache: the cache block is available before the Hit/Miss decision) → can be an important consideration (why?).
Reducing Cache Miss Rates #2: multi-level caches
❑ Use multiple levels of caches
➢ Primary (L1) cache attached to the CPU
➢ A larger, slower L2 cache services misses from the primary cache. With advancing technology there is more than enough room on the die for an L2, normally a unified cache (i.e., it holds both instructions and data), and in some cases even a unified L3 cache.

❑ Example: Given
▪ CPU base CPI = 1, clock rate = 4GHz
▪ Miss rate/instruction = 2%
▪ Main memory access time = 100ns
Questions:
➢ Compute the actual CPI with just primary cache.
➢ Compute the performance gain if we add L2 cache with
▪ Access time = 5ns
▪ Global miss rate to main memory = 0.5%
Multi-level cache: example solution
❑ With just primary cache
➢ Miss penalty = 100ns / 0.25ns = 400 cycles
➢ CPIstall = 1 + 0.02 × 400 = 9

❑ With added L2 cache
➢ Primary miss with L2 hit: penalty = 5ns / 0.25ns = 20 cycles
➢ Primary miss with L2 miss: penalty = L2 access stall + main memory stall = 20 + 400 = 420 cycles
➢ CPIstall = 1 + (0.02 − 0.005) × 20 + 0.005 × 420 = 3.4
➢ [Alternatively, CPIstall = 1 + L1 stalls/instruction + L2 stalls/instruction = 1 + 0.02 × 20 + 0.005 × 400 = 3.4]
➢ Performance gain = 9 / 3.4 ≈ 2.6 times.
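The same solution as a short Python sketch (names are ours):

cycle_ns   = 0.25                      # 4 GHz clock
l2_hit_pen = round(5 / cycle_ns)       # 20 cycles
mem_pen    = round(100 / cycle_ns)     # 400 cycles
l1_miss    = 0.02                      # misses per instruction reaching L2
l2_global  = 0.005                     # misses per instruction reaching memory

cpi_l1_only = 1 + l1_miss * mem_pen                            # 9.0
cpi_l1_l2   = 1 + l1_miss * l2_hit_pen + l2_global * mem_pen   # 3.4
print(cpi_l1_only, cpi_l1_l2, cpi_l1_only / cpi_l1_l2)         # speedup ~2.6x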
Multilevel Cache Design Considerations
❑ Design considerations for L1 and L2 caches are very different
➢ Primary cache should focus on minimizing hit time in support of a shorter
clock cycle → smaller with smaller block sizes.
➢ Secondary cache(s) should focus on reducing miss rate to reduce the
penalty of long main memory access times → larger with larger block sizes &
higher levels of associativity.

❑ The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache – so L1 can be smaller (i.e., faster) but have a higher miss rate.
❑ For the L2 cache, hit time is less important than miss rate
➢ The L2$ hit time determines the L1$’s miss penalty
➢ The L2$ local miss rate >> the global miss rate
▪ Local miss rate = fraction of references to one level of a cache that miss in that level
▪ Global miss rate = fraction of references that miss in all levels of a multi-level cache → dictates how often we must access main memory.
Multi-level cache parameters: two real-life examples

Intel Nehalem:
➢ L1 organization & size: split I$ and D$; 32KB each per core; 64B blocks
➢ L1 associativity: 4-way (I), 8-way (D) set assoc.; ~LRU replacement
➢ L1 write policy: write-back, write-allocate
➢ L2 organization & size: unified; 256KB (0.25MB) per core; 64B blocks
➢ L2 associativity: 8-way set assoc.; ~LRU
➢ L2 write policy: write-back, write-allocate
➢ L3 organization & size: unified; 8192KB (8MB) shared by all cores; 64B blocks
➢ L3 associativity: 16-way set assoc.
➢ L3 write policy: write-back, write-allocate

AMD Barcelona:
➢ L1 organization & size: split I$ and D$; 64KB each per core; 64B blocks
➢ L1 associativity: 2-way set assoc.; LRU replacement
➢ L1 write policy: write-back, write-allocate
➢ L2 organization & size: unified; 512KB (0.5MB) per core; 64B blocks
➢ L2 associativity: 16-way set assoc.; ~LRU
➢ L2 write policy: write-back, write-allocate
➢ L3 organization & size: unified; 2048KB (2MB) shared by all cores; 64B blocks
➢ L3 associativity: 32-way set assoc.; evict the block shared by the fewest cores
➢ L3 write policy: write-back, write-allocate
The Cache Design Space
❑ Several interacting dimensions
➢ cache size
➢ block size
➢ associativity
➢ replacement policy
➢ write-through vs write-back
➢ write allocation
[Figure: the cache design space sketched along the cache size, associativity and block size axes, plus a generic good/bad trade-off curve between two competing factors.]

❑ The optimal choice is a compromise
➢ depends on access characteristics
▪ workload
▪ use (I-cache, D-cache, TLB)
➢ depends on technology / cost

❑ Simplicity often wins