Improving Cache Performance: Reducing Misses
How To Measure
• Average memory-access time (AMAT) = Hit time + Miss rate × Miss penalty (ns or clocks)
• 1. Reduce the miss rate,
• 2. Reduce the miss penalty, or
• 3. Reduce the time to hit in the cache.

Classifying Misses: 3 Cs
• Compulsory — The first access to a block cannot find it in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses. (These are the misses that would occur even in an infinite cache.)
• Capacity — If C is the size of the cache (in blocks), a capacity miss is one where more than C unique cache blocks have been accessed since this block was last accessed. (Non-compulsory misses that would still occur in a fully associative cache of size X.)
• Conflict — Any miss that is not a compulsory miss or capacity miss must be a byproduct of the cache mapping algorithm. A conflict miss occurs because too many active blocks are mapped to the same cache set. (Non-compulsory, non-capacity misses.)
• Compulsory Misses?
• Capacity Misses?
• Conflict Misses?
• Example: 16K cache; the miss penalty for a 16-byte block is 42 cycles, for a 32-byte block 44, and for a 64-byte block 48. The miss rates are 3.94%, 2.87%, and 2.64%, respectively. Which gives the best performance (lowest AMAT)?
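A worked answer, assuming a 1-cycle hit time (the slide does not state one):

    AMAT(16-byte) = 1 + 0.0394 × 42 ≈ 2.65 cycles
    AMAT(32-byte) = 1 + 0.0287 × 44 ≈ 2.26 cycles
    AMAT(64-byte) = 1 + 0.0264 × 48 ≈ 2.27 cycles

Under that assumption the 32-byte block gives the lowest AMAT, just ahead of the 64-byte block: the larger block cuts the miss rate enough to pay for its slightly larger miss penalty, but the 64-byte block does not.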
Example: Avg. Memory Access Time vs. Miss Rate
Reduce Misses via Higher Associativity
• Beware: Execution time is the only final measure!
  – Will clock cycle time increase?
  – Hill [1988] suggested hit time grows +10% for an external cache, +2% for an internal cache, for 2-way vs. 1-way
• Example: assume CT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way, vs. the CT of direct mapped

AMAT vs. cache size and associativity:

    Cache Size (KB)   1-way   2-way   4-way   8-way
          1           7.65    6.60    6.22    5.44
          2           5.90    4.90    4.62    4.09
          4           4.60    3.95    3.57    3.19
          8           3.30    3.00    2.87    2.59
         16           2.45    2.20    2.12    2.04
         32           2.00    1.80    1.77    1.79
         64           1.70    1.60    1.57    1.59
        128           1.50    1.45    1.42    1.44

Note that at 32 KB and 64 KB the 8-way AMAT is worse than the 4-way (1.79 vs. 1.77, 1.59 vs. 1.57): the longer cycle time outweighs the small miss-rate gain.
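A minimal sketch of how entries like these combine cycle time and miss rate. The 1-clock hit time and 50-clock miss penalty are assumptions that happen to reproduce the table (e.g., 32 KB 1-way: 1 + 0.020 × 50 = 2.00); the miss rates below are the ones implied by the 32 KB row.

    #include <stdio.h>

    /* AMAT (in clocks of the direct-mapped design) =
     *   hit_time * CT_scale + miss_rate * miss_penalty          */
    int main(void) {
        const double hit_time = 1.0, miss_penalty = 50.0;   /* assumed   */
        const double ct[]   = {1.00, 1.10, 1.12, 1.14};     /* per slide */
        const double rate[] = {0.020, 0.014, 0.013, 0.013}; /* 32 KB row */
        for (int i = 0; i < 4; i++)
            printf("%d-way: %.2f\n", 1 << i,
                   hit_time * ct[i] + rate[i] * miss_penalty);
        return 0;   /* prints 2.00, 1.80, 1.77, 1.79 as in the table */
    }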
• Sequential accesses instead of striding through memory every 100 words (loop interchange)
• Reducing conflicts between val & key (merging arrays)
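Sketches of the two optimizations in their classic textbook form; the array bounds and names (x, val, key, SIZE) are assumptions reconstructed from the fragments above.

    #define SIZE 10000
    int x[5000][100];

    /* Loop interchange, before: the inner loop strides through
     * memory 100 words at a time. */
    void scale_before(void) {
        for (int j = 0; j < 100; j++)
            for (int i = 0; i < 5000; i++)
                x[i][j] = 2 * x[i][j];
    }

    /* After: accesses within each row are sequential. */
    void scale_after(void) {
        for (int i = 0; i < 5000; i++)
            for (int j = 0; j < 100; j++)
                x[i][j] = 2 * x[i][j];
    }

    /* Merging arrays, before: val[i] and key[i] are used together
     * but live in two arrays that can conflict in the cache. */
    int val[SIZE];
    int key[SIZE];

    /* After: one struct per element, so val and key arrive in the
     * same cache block and cannot conflict with each other. */
    struct merge { int val; int key; };
    struct merge merged_array[SIZE];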
Physical Cache vs. Virtual Cache
[Figure: physical cache — the CPU issues a virtual address, the TLB translates it, and the cache is accessed with the physical address; virtual cache — the cache is accessed directly with the virtual address, and the TLB translates to a physical address afterward.]

Trace Cache
• Fetch Bottleneck — Cannot execute instructions faster than you can fetch them into the processor.
• Cannot typically fetch more than about one taken branch per cycle, at best (why? why one taken branch?)
• A trace cache is an instruction cache that stores instructions in dynamic execution order rather than program/address order.
• Implemented on the Pentium 4.
Research Directions
• Fetch Target Buffer
  – Let the branch predictor run ahead of the fetch engine
• Runtime identification of cache conflict misses
• Hardware prefetching of complex data structures (e.g., pointer chasing)
• Event-driven compilation
  – While the main thread runs, hardware monitors identify problematic loads, then fork a new compilation thread (on an SMT or CMP) to alter the code:
    – Dynamic value specialization
    – Inline software prefetching (see the sketch below)
    – Helper thread prefetching (speculative precomputation)
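A minimal sketch of inline software prefetching for pointer chasing, using GCC/Clang's __builtin_prefetch; the node type and one-node lookahead distance are illustrative assumptions.

    struct node { struct node *next; int payload; };

    /* Walk a linked list, prefetching the next node while working on
     * the current one. __builtin_prefetch(addr, 0, 1) hints a read
     * with low temporal locality; prefetching a NULL tail is harmless
     * because prefetch instructions do not fault. */
    int sum_list(struct node *p) {
        int sum = 0;
        while (p) {
            __builtin_prefetch(p->next, 0, 1);
            sum += p->payload;   /* real work overlaps the prefetch */
            p = p->next;
        }
        return sum;
    }

With only one node of lookahead the overlap is small; practical versions prefetch several nodes ahead or maintain jump pointers.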
• Speculative Precomputation (helper thread prefetching)
  – Spawn threads at runtime to calculate the addresses of delinquent (problematic) loads and prefetch them → creates a prefetcher from the application's own code (a sketch follows below).
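A minimal pthread sketch of the helper-thread idea, assuming the main and helper threads share a cache (SMT, or a shared cache level on a CMP); throttling the helper so it stays a bounded distance ahead is omitted.

    #include <pthread.h>

    struct node { struct node *next; int payload; };

    /* Distilled copy of the main loop: it chases only the address
     * chain and prefetches, doing none of the real work. */
    static void *helper(void *arg) {
        for (struct node *p = arg; p != NULL; p = p->next)
            __builtin_prefetch(p->next, 0, 1);
        return NULL;
    }

    int run(struct node *head) {
        pthread_t t;
        pthread_create(&t, NULL, helper, head);   /* fork the prefetcher    */
        int sum = 0;
        for (struct node *p = head; p != NULL; p = p->next)
            sum += p->payload;                    /* main thread: real work */
        pthread_join(t, NULL);
        return sum;
    }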
• Code Layout to Reduce Icache Conflict Misses
  – Also, for multithreaded processors
• Code Layout to Reduce Dcache Conflict Misses
  – Also, for multithreaded processors
• Software Data Spreading
  – Insert migration calls in loops with large data sets, spreading the data over multiple private caches (see the sketch after this list).
• Inter-core Prefetching
  – The prefetch thread runs ahead of the main thread, but on another core. After an interval, they swap cores. The main thread finds all of its data preloaded into the new cache, and the prefetcher starts prefilling the next cache.
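A minimal sketch of the migration-call idea behind software data spreading, using Linux's sched_setaffinity; the chunk size and core count are illustrative parameters. Inter-core prefetching relies on the same migration mechanic, with the prefetch thread always one cache ahead.

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling thread to one core (on Linux, pid 0 means
     * the calling thread). */
    static void migrate_to(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        sched_setaffinity(0, sizeof(set), &set);
    }

    /* Touch a large array in chunks, migrating between cores so each
     * chunk stays resident in a different core's private cache; later
     * passes over the same chunks then hit in those caches. */
    void spread_scan(int *a, long n, int ncores, long chunk) {
        int core = 0;
        for (long i = 0; i < n; ) {
            migrate_to(core);
            for (long end = i + chunk; i < end && i < n; i++)
                a[i] += 1;
            core = (core + 1) % ncores;
        }
    }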