
Improving Cache Performance: Reducing Misses
CSE 240A — Dean Tullsen

How To Measure

Average memory-access time = Hit time + Miss rate × Miss penalty (ns or clocks)

Improving cache performance means one of three things:
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Classifying Misses: 3 Cs

• Compulsory—The first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses. (They are the misses that would occur even in an infinite cache.)
• Capacity—If C is the size of the cache in blocks, a miss is a capacity miss when more than C unique blocks have been accessed since this block was last referenced. (They are the non-compulsory misses that would still occur in a fully associative cache of the same size.)
• Conflict—Any miss that is neither a compulsory miss nor a capacity miss is a byproduct of the cache mapping algorithm: a conflict miss occurs because too many active blocks are mapped to the same cache set. (They are the non-compulsory, non-capacity misses.)


How To Reduce Misses?

• Compulsory Misses?
• Capacity Misses?
• Conflict Misses?
• What can the compiler do?

Reduce Misses via Larger Block Size

• Example: a 16K cache; the miss penalty is 42 for a 16-byte block, 44 for a 32-byte block, and 48 for a 64-byte block. The corresponding miss rates are 3.94%, 2.87%, and 2.64%. Which block size gives the best performance (lowest AMAT)? (A worked sketch follows below.)
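A minimal sketch of the arithmetic, assuming the miss penalties are in clock cycles and a 1-cycle hit time (the hit time is not stated on the slide), using the miss rates given above:

#include <stdio.h>

int main(void) {
    /* Data from the example: block size, miss penalty, miss rate.
       The 1-clock hit time is an assumption, not stated on the slide. */
    const double hit_time       = 1.0;
    const int    block_size[]   = {16, 32, 64};
    const double miss_penalty[] = {42.0, 44.0, 48.0};
    const double miss_rate[]    = {0.0394, 0.0287, 0.0264};

    for (int i = 0; i < 3; i++) {
        /* AMAT = Hit time + Miss rate x Miss penalty */
        double amat = hit_time + miss_rate[i] * miss_penalty[i];
        printf("%2d-byte block: AMAT = %.3f clocks\n", block_size[i], amat);
    }
    return 0;
}

With these numbers the 32-byte block wins narrowly (about 2.26 clocks, vs. 2.27 for the 64-byte block and 2.65 for the 16-byte block); the assumed hit time does not change the ordering, since it is added equally to all three.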
Example: Avg. Memory Access Time vs. Miss Rate

• Beware: execution time is the only final measure!
  – Will clock cycle time increase?
  – Hill [1988] suggested hit time grows by about +10% for an external cache and +2% for an internal cache for 2-way vs. 1-way.

Reduce Misses via Higher Associativity

• Example: assume the cycle time (CT) is 1.10× the direct-mapped CT for 2-way, 1.12× for 4-way, and 1.14× for 8-way.

  AMAT
  Cache Size (KB)   1-way   2-way   4-way   8-way
        1           7.65    6.60    6.22    5.44
        2           5.90    4.90    4.62    4.09
        4           4.60    3.95    3.57    3.19
        8           3.30    3.00    2.87    2.59
       16           2.45    2.20    2.12    2.04
       32           2.00    1.80    1.77    1.79
       64           1.70    1.60    1.57    1.59
      128           1.50    1.45    1.42    1.44


Reducing Misses by Emulating Associativity: Victim Cache

• Can we get the hit rate of an associative cache with the access time of a direct-mapped cache?
• Add a small buffer to hold data recently discarded from the cache. (A software-level sketch appears after the prefetching slide below.)
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflict misses for a 4 KB direct-mapped data cache.

Reducing Misses by HW Prefetching of Instructions & Data

• E.g., instruction prefetching
  – Alpha 21064 fetches 2 blocks on a miss
  – The extra block is placed in a stream buffer
  – On a miss, check the stream buffer
• Works with data blocks too:
  – Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 streams caught 43%
  – Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set associative caches
• Prefetching relies on extra memory bandwidth that can be used without penalty
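A minimal, software-level sketch of the victim-cache idea (illustrative only; a real victim cache is a small fully associative hardware structure probed on a main-cache miss). The entry count follows Jouppi's 4-entry example; the field names and FIFO replacement here are assumptions:

#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4                 /* 4-entry victim cache, as in Jouppi [1990] */

struct vc_entry { bool valid; uint64_t block_addr; };
static struct vc_entry victim[VC_ENTRIES];

/* On a main-cache miss, probe the victim cache.  A hit means the block was
   recently discarded; it can be swapped back into the main cache instead of
   paying the full miss penalty.  Returns true on a victim-cache hit. */
bool victim_lookup(uint64_t block_addr) {
    for (int i = 0; i < VC_ENTRIES; i++)
        if (victim[i].valid && victim[i].block_addr == block_addr)
            return true;
    return false;
}

/* When the main cache evicts a block, place it in the victim cache (FIFO). */
void victim_insert(uint64_t evicted_block_addr) {
    static int next = 0;
    victim[next].valid = true;
    victim[next].block_addr = evicted_block_addr;
    next = (next + 1) % VC_ENTRIES;
}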



Reducing Misses by SW Prefetching Data

• Data prefetch
  – Load data into a register (HP PA-RISC, IA-64, Tera)
  – Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC)
  – Special prefetching instructions cannot cause faults; this is a form of speculative execution
• Issuing prefetch instructions (including the address calculation) takes time
  – Is the cost of issuing prefetches < the savings in reduced misses? (See the prefetch sketch after the compiler-optimization list below.)

Reducing Misses by Various Compiler Optimizations

• Instructions
  – Reorder procedures in memory so as to reduce misses
  – Use profiling to look at conflicts
  – McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks
• Data
  – Merging Arrays: improve spatial locality with a single array of compound elements instead of 2 arrays
  – Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
  – Loop Fusion: combine 2 independent loops that have the same looping and some variables in common
  – Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of going down whole columns or rows
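The slide names register and cache prefetch instructions on specific ISAs; as a portable stand-in, here is a minimal sketch using the GCC/Clang __builtin_prefetch hint, which (like the instructions above) cannot fault. The prefetch distance of 16 elements is an arbitrary assumption that would be tuned per machine:

/* Software prefetching sketch: prefetch a[i + DIST] while working on a[i]. */
#define DIST 16                      /* assumed prefetch distance */

double sum_with_prefetch(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/1);
        sum += a[i];                 /* the prefetch itself costs an issue slot */
    }
    return sum;
}

Whether this pays off is exactly the question on the slide: the extra issued instructions (the prefetch and its address calculation) must cost less than the misses they remove.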


Merging Arrays Example

/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key and improves spatial locality.

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words.



Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j]; }

Before fusion: 2 misses per access to a & c; after: one miss per access.

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  };

• Two inner loops:
  – Read all N×N elements of z[]
  – Read N elements of 1 row of y[] repeatedly
  – Write N elements of 1 row of x[]
• Capacity misses are a function of N & cache size:
  – worst case => 2N³ + N² misses.
• Idea: compute on a B×B submatrix that fits in the cache.

Blocking Example, cont.

/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1)
      { r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      };

• Capacity misses fall from 2N³ + N² to 2N³/B + N²
• B is called the Blocking Factor
• Conflict misses are not as easy...

Key Points

• CPU time = IC × (CPI_Execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time (a worked sketch follows below)
• 3 Cs: Compulsory, Capacity, Conflict Misses
• Reducing Miss Rate
  – 1. Reduce Misses via Larger Block Size
  – 2. Reduce Misses via Higher Associativity
  – 3. Reducing Misses via Victim Cache
  – 4. Reducing Misses by HW Prefetching of Instructions & Data
  – 5. Reducing Misses by SW Prefetching Data
  – 6. Reducing Misses by Compiler Optimizations
• Remember the danger of concentrating on just one parameter when evaluating performance
• Next: reducing miss penalty
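A minimal sketch applying the Key Points equation; every parameter value below (instruction count, base CPI, accesses per instruction, miss rate, miss penalty, clock) is hypothetical, chosen only to show the shape of the calculation:

#include <stdio.h>

int main(void) {
    /* All values are hypothetical, for illustration only. */
    double IC            = 1e9;     /* instruction count */
    double CPI_exec      = 1.0;     /* base CPI, ignoring cache misses */
    double mem_per_instr = 1.3;     /* memory accesses per instruction */
    double miss_rate     = 0.03;
    double miss_penalty  = 50.0;    /* clocks */
    double clock_cycle   = 1e-9;    /* seconds (1 GHz clock) */

    /* CPU time = IC x (CPI_Execution + mem/instr x miss rate x miss penalty)
                     x clock cycle time */
    double cpu_time = IC * (CPI_exec + mem_per_instr * miss_rate * miss_penalty)
                         * clock_cycle;
    printf("CPU time = %.2f s\n", cpu_time);   /* about 2.95 s with these numbers */
    return 0;
}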



Improving Cache Performance

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Reducing Miss Penalty: Read Priority over Write on Miss

• The easiest way to resolve RAW hazards (and other ordering issues) between loads and stores is to send them all to memory in instruction order.
• If we always wait for the write buffer to empty, the read miss penalty might increase by 50%.
• Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue. (A sketch of the check follows below.)
• Write-back caches?
  – A read miss may require writing back a dirty block.
  – Normal: write the dirty block to memory, and then do the read.
  – Instead: copy the dirty block to a write buffer, then do the read, and then do the write.
  – The CPU stalls less since it can restart as soon as the read completes.
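A minimal software-level sketch of "check the write buffer before the read". In hardware this is an associative comparison done in parallel; the buffer depth and field names here are assumptions:

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 8                 /* assumed write-buffer depth */

struct wb_entry { bool valid; uint64_t block_addr; };
static struct wb_entry write_buf[WB_ENTRIES];

/* Before sending a read miss to memory, scan the write buffer.  If the
   missing block matches a pending write, the read must forward from the
   buffer or wait; otherwise it may bypass the buffered writes and go to
   memory immediately, giving reads priority over writes. */
bool read_may_bypass_writes(uint64_t read_block_addr) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buf[i].valid && write_buf[i].block_addr == read_block_addr)
            return false;            /* conflict: forward from buffer or stall */
    return true;                     /* no conflict: service the read first */
}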


Early Restart and Critical Word First

• Don't wait for the full block to be loaded before restarting the CPU
  – Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
  – Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first. (A small sketch of the wrapped fill order follows below.)
• Most useful with large blocks.
• Spatial locality is a problem: we often want the next sequential word soon, so early restart is not always a benefit.

Non-blocking Caches to Reduce Stalls on Misses

• A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss.
• "Hit under miss" reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU.
• "Hit under multiple miss" or "miss under miss" can further lower the effective miss penalty by overlapping multiple misses.
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses.
• Assumes "stall on use" rather than "stall on miss", which works naturally with dynamic scheduling but can also work with static scheduling.
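A minimal sketch of the wrapped-fetch fill order used by critical word first: the block is filled starting at the requested word and wrapping around, so with early restart the CPU can resume after the first word arrives. The 8-word block size is an assumption:

#include <stdint.h>

#define WORDS_PER_BLOCK 8            /* assumed block size in words */

/* Fill 'block' from 'mem_block', delivering the requested word first and
   then wrapping around the block (critical word first / wrapped fetch).
   With early restart, the CPU may resume as soon as iteration 0 completes. */
void wrapped_fetch(uint64_t block[WORDS_PER_BLOCK],
                   const uint64_t mem_block[WORDS_PER_BLOCK],
                   unsigned requested_word) {
    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++) {
        unsigned w = (requested_word + i) % WORDS_PER_BLOCK;
        block[w] = mem_block[w];     /* word 'requested_word' arrives first */
    }
}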



But…

• The primary way to reduce miss penalty…

[Figure: a CPU connected to a single cache backed by memory, vs. a CPU connected to a lowest-level cache backed by a next-level cache and then memory.]

Miss Penalty Reduction: Second Level Cache

• L2 Equations
  AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
• Definitions:
  – Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2).
  – Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2).

Multi-level Caches, cont.

• L1 cache local miss rate 10%, L2 local miss rate 40%. What are the global miss rates? (Worked in the sketch below.)
• L1's highest priority is a fast hit time; L2 typically targets a low miss rate.
• Design the L1 and L2 caches in concert.
• Property of inclusion—if a block is in the L1 cache, it is guaranteed to be in the L2 cache—simplifies the design of consistent caches.
• The L2 cache can have a different associativity (good idea?) or block size (good idea?) than the L1 cache.

Reducing Miss Penalty Summary

• Four techniques
  – Read priority over write on miss
  – Early Restart and Critical Word First on a miss
  – Non-blocking Caches (Hit Under Miss)
  – Multi-level Caches
• These principles can continue to be applied recursively to multilevel caches
  – The danger is that the time to DRAM will grow with multiple levels in between
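A worked answer to the question above, as a minimal sketch that just applies the local/global definitions from the previous slide:

#include <stdio.h>

int main(void) {
    /* Local miss rates from the slide. */
    double l1_local = 0.10;
    double l2_local = 0.40;

    /* Every CPU access goes to L1, so L1's global miss rate equals its local
       miss rate.  L2's global miss rate is the fraction of all CPU accesses
       that miss in both levels. */
    double l1_global = l1_local;
    double l2_global = l1_local * l2_local;

    printf("L1 global miss rate = %.0f%%\n", l1_global * 100);   /* 10% */
    printf("L2 global miss rate = %.0f%%\n", l2_global * 100);   /*  4% */
    return 0;
}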



Review: Improving Cache Performance

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Fast Hit Times via Small and Simple Caches

• This is why the Alpha 21164 has an 8KB instruction cache and an 8KB data cache plus a 96KB second-level cache.
• I and D caches used to be typically direct mapped, on chip.


DM Hit Time + Associative Hit Rate -> Way Prediction

• Add bits (?) to each cache line to predict which way is going to hit.
• How is that going to help?
  – Read one tag & compare
  – Speculatively read data from that one block
• Next cycle
  – Read the other tags and compare
• Used on the Pentium 4
  [Each set holds tag/data for each way, plus lru and way-prediction (wp) bits.]

Fast Hits by Avoiding Address Translation: Virtual Cache

• Send the virtual address to the cache? Called a Virtually Addressed Cache, or just Virtual Cache, vs. a Physical Cache.
  – Every time the process is switched, the cache logically must be flushed; otherwise we get false hits.
    – The cost is the time to flush plus the "compulsory" misses from the empty cache.
  – Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address.
  – I/O must interact with the cache…
• Solutions to aliases
  – HW that guarantees that every cache block has a unique physical address.
  – SW guarantee: the lower n bits of the two addresses must be the same; as long as n covers the index field and the cache is direct mapped, the blocks must be unique. This is called page coloring. (See the sketch below.)
• Solution to the cache flush
  – Add a process-identifier tag that identifies the process as well as the address within the process: a wrong-process entry can't produce a hit.
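A small sketch of the indexing constraint behind page coloring: if the cache index and block-offset bits all fall within the page offset, a virtually indexed cache behaves like a physically indexed one, so aliases cannot land in different sets. The example numbers in the comment are assumptions:

#include <stdbool.h>

/* True if a virtually indexed cache with this geometry has no aliasing
   problem: the bits used for index + offset must lie inside the page offset,
   i.e. (cache size / associativity) <= page size. */
bool virtually_indexed_is_alias_free(unsigned cache_size_bytes,
                                     unsigned associativity,
                                     unsigned page_size_bytes) {
    return (cache_size_bytes / associativity) <= page_size_bytes;
}

/* Example (assumed numbers): a 32KB, 8-way cache with 4KB pages gives
   32K/8 = 4K per way, so it is alias-free; a 32KB, 2-way cache (16K per way)
   is not, and needs page coloring or a hardware alias check. */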



Virtual Cache

[Figure: in a physical cache, the CPU's virtual address goes through the TLB and the resulting physical address indexes the cache; in a virtual cache, the virtual address indexes the cache directly, and the TLB produces the physical address only on the way to memory.]

Cache Bandwidth: Trace Caches

• Fetch bottleneck—you cannot execute instructions faster than you can fetch them into the processor.
• You cannot typically fetch more than about one taken branch per cycle, at best (why? why one taken branch?).
• A trace cache is an instruction cache that stores instructions in dynamic execution order rather than program/address order.
• Implemented on the Pentium 4.


Trace Cache

[Figure: a program containing A, B, C, beq J:, D, E, F, …, J: G, H, jsr W, …, and W, X, ret. A conventional cache stores the lines in address order (A B C beq D E F G; H jsr I J K L M N; …), while the trace cache stores the executed path (A B C beq G H jsr W X ret I; …).]

Cache Optimization Summary

Each technique is rated by its effect on miss rate (MR), miss penalty (MP), hit time (HT), and complexity:
• Larger Block Size
• Higher Associativity
• Victim Caches
• HW Prefetching of Instr/Data
• Compiler Controlled Prefetching
• Compiler Reduce Misses
• Priority to Read Misses
• Early Restart & Critical Word 1st
• Non-Blocking Caches
• Second Level Caches
• Small & Simple Caches
• Way Prediction
• Avoiding Address Translation
• Trace Cache?



Cache Research at UCSD

• Hardware prefetching of complex data structures (e.g., pointer chasing)
• Fetch Target Buffer
  – Let the branch predictor run ahead of the fetch engine
• Runtime identification of cache conflict misses
• Speculative Precomputation (helper-thread prefetching)
  – Spawn threads at runtime to calculate the addresses of delinquent (problematic) loads and prefetch → creates a prefetcher from application code.
• Code layout to reduce Icache conflict misses
  – Also, for multithreaded processors
• Code layout to reduce Dcache conflict misses
  – Also, for multithreaded processors

Cache Research at UCSD, cont.

• Event-driven compilation—while the main thread runs, hardware monitors identify problematic loads, then fork a new compilation thread (on an SMT or CMP) to alter the code.
  – Dynamic value specialization
  – Inline software prefetching
  – Helper-thread prefetching (speculative precomputation)
• Software Data Spreading
  – Insert migration calls in loops with large data sets, spreading the data over multiple private caches.
• Inter-core Prefetching
  – A prefetch thread runs ahead of the main thread, but in another core. After an interval, they swap cores. The main thread finds all of its data preloaded into the new cache, and the prefetcher starts prefilling the next cache.

