Memory Hierarchy Design
5.1 Introduction
The five classic components of a computer: input, output, memory, datapath, and control (the datapath and control together form the processor).
We build a memory hierarchy that includes main memory and caches (internal memory) and the hard disk (external memory). Instructions are first fetched from external storage such as the hard disk and are kept in main memory; before they go to the CPU, they are typically copied into the caches.
Memory Hierarchy
Levels of the Memory Hierarchy
Level           Capacity    Access time
CPU registers   500 bytes   0.25 ns
Cache           64 KB       1 ns
Main memory     512 MB      100 ns
Disk            100 GB      5 ms

[Figure: moving down the hierarchy, capacity grows and speed falls; data moves between levels in blocks (e.g., block Y is brought from main memory into the cache on demand from the processor).]
Cache Measures
CPU execution time incorporating cache performance:
CPU execution time = (CPU clock cycles + Memory stall cycles) * Clock cycle time
Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access.
Memory stall clock cycles = Number of misses * Miss penalty
= IC * (Misses/Instruction) * Miss penalty
= IC * (Memory accesses/Instruction) * Miss rate * Miss penalty
= IC * Reads per instruction * Read miss rate * Read miss penalty
  + IC * Writes per instruction * Write miss rate * Write miss penalty
Memory accesses consist of fetching instructions and reading/writing data.
Example (p. 395)
Assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all accesses hit in the cache?
Answer:
(A) If all accesses hit in the cache, CPI = 1.0 and there are no memory stalls:
CPU time(A) = (IC * CPI + 0) * clock cycle time = IC * clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we must add the memory stall cycles:
Memory stall cycles = IC * (Memory accesses/Instruction) * Miss rate * Miss penalty
= IC * (1 + 50%) * 2% * 25 = 0.75 * IC
CPU time(B) = (IC + 0.75 * IC) * clock cycle time = 1.75 * IC * clock cycle time
The performance ratio is the inverse of the CPU execution times:
CPU time(B) / CPU time(A) = 1.75
The computer with no cache misses is 1.75 times faster.
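A minimal check of this arithmetic (not from the text; the 1.5 accesses per instruction are the instruction fetch plus the 50% data accesses):

/* sketch: verifying the speedup computed above */
#include <stdio.h>

int main(void) {
    double base_cpi = 1.0;
    double accesses_per_instr = 1.0 + 0.5;  /* 1 fetch + 50% loads/stores */
    double miss_rate = 0.02;
    double miss_penalty = 25.0;             /* clock cycles */

    double stall_cpi = accesses_per_instr * miss_rate * miss_penalty;
    printf("stall CPI = %.2f\n", stall_cpi);                        /* 0.75 */
    printf("speedup   = %.2f\n", (base_cpi + stall_cpi) / base_cpi); /* 1.75 */
    return 0;
}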
[Figure: memory blocks 0 through F mapping into a small direct-mapped cache; each memory block has exactly one location in the cache.]
The block has only one place it can appear in the cache. The mapping is usually (Block address) MOD (Number of blocks in cache).
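As a sketch (the values are illustrative, not from the text), the mapping is a modulus, which reduces to a bit mask when the number of cache blocks is a power of two:

/* sketch: direct-mapped placement */
#include <stdio.h>

int main(void) {
    unsigned num_blocks = 8;        /* assumed cache size in blocks */
    unsigned block_address = 0xC;   /* example memory block address */
    unsigned index = block_address % num_blocks;
    unsigned same  = block_address & (num_blocks - 1); /* equivalent for powers of two */
    printf("memory block 0x%X maps to cache block %u (%u)\n",
           block_address, index, same);
    return 0;
}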
The block offset selects the desired data from the block, the index field selects the set, and the tag field is compared against the CPU address for a hit:
Use the cache index to select the cache set.
Check the tag on each block in that set (there is no need to check the index or block offset).
A valid bit is added to the tag to indicate whether or not the entry contains a valid address.
Select the desired bytes using the block offset.
Increasing associativity shrinks the index and expands the tag.
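A minimal sketch of this address breakdown (the field widths are illustrative, matching the 64-byte blocks and 512 sets of the Opteron example later in this section):

/* sketch: splitting an address into tag, index, and block offset */
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* 64-byte blocks */
#define INDEX_BITS  9   /* 512 sets       */

int main(void) {
    uint64_t addr   = 0x12345678;
    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%llx index=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}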
[Figure: a two-way set-associative cache. The cache index selects a set; the address tag is compared in parallel with the stored tags of both blocks in the set (qualified by their valid bits), the two compare results are ORed to form the hit signal, and a mux selects the data from the matching cache block.]
Random: the block to replace is selected at random.
LRU: the Least Recently Used block is removed.
FIFO (First In, First Out): the oldest block is removed.
Data cache misses per 1000 instructions for various replacement strategies:

           2-way                  4-way                  8-way
Size     LRU    Random FIFO     LRU    Random FIFO     LRU    Random FIFO
16 KB    114.1  117.3  115.5    111.7  115.1  113.3    109.0  111.8  110.4
64 KB    103.4  104.3  103.9    102.4  102.3  103.1     99.7  100.5  100.3
256 KB    92.2   92.1   92.5     92.1   92.1   92.5     92.1   92.1   92.5
There is little difference between LRU and random for the largest cache size, with LRU outperforming the others for smaller caches. FIFO generally outperforms random for the smaller cache sizes.
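For small associativity, LRU is often tracked with an age counter per block in the set; a minimal sketch (the counter scheme and all names are assumptions, not from the text):

/* sketch: LRU bookkeeping for one 4-way set using age counters */
#include <stdio.h>

#define WAYS 4

static unsigned age[WAYS];   /* 0 = most recently used; higher = older */

static void touch(int way) {             /* call on every hit to `way` */
    for (int w = 0; w < WAYS; w++)
        if (age[w] < age[way]) age[w]++; /* everyone younger gets older */
    age[way] = 0;                        /* touched block becomes MRU */
}

static int victim(void) {                /* block to replace on a miss */
    int v = 0;
    for (int w = 1; w < WAYS; w++)
        if (age[w] > age[v]) v = w;
    return v;
}

int main(void) {
    for (int w = 0; w < WAYS; w++) age[w] = w; /* initial LRU order */
    touch(2);                                  /* hit in way 2 */
    printf("replace way %d\n", victim());      /* way 3 is now the LRU */
    return 0;
}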
A write buffer is needed between the cache and memory.
Processor: writes data into the cache and the write buffer.
Memory controller: writes the contents of the buffer to memory.
The write buffer is just a FIFO; a typical number of entries is 4.
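A minimal sketch of such a buffer (the 4-entry size is the typical value above; the fields and names are otherwise assumptions):

/* sketch: a 4-entry FIFO write buffer */
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

struct write_buffer {
    uint64_t addr[WB_ENTRIES];
    uint64_t data[WB_ENTRIES];
    int head, tail, count;
};

/* processor side: enqueue a write; if full, the processor must stall */
static bool wb_push(struct write_buffer *wb, uint64_t addr, uint64_t data) {
    if (wb->count == WB_ENTRIES) return false;
    wb->addr[wb->tail] = addr;
    wb->data[wb->tail] = data;
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* memory-controller side: drain entries from `head` to memory in FIFO order */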
Write allocate: the block is allocated on a write miss, followed by the write-hit actions.
No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory.
With no-write allocate, blocks stay out of the cache until the program tries to read them; with write allocate, even blocks that are only written will be in the cache.
Example: assume a fully associative write-back cache that starts empty, with the following sequence of operations:
Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100]
No-write allocate: 4 misses and 1 hit (only the read allocates Mem[200], so every write to Mem[100] misses).
Write allocate: 2 misses and 3 hits (1 write miss, 1 write hit, 1 read miss, 1 write hit, 1 write hit).
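A toy simulation of this sequence under both policies (assumed setup: fully associative, starts empty; it only tracks which block addresses are present):

/* sketch: counting hits/misses for the sequence above under both policies */
#include <stdbool.h>
#include <stdio.h>

static int cached[16], ncached;

static bool present(int blk) {
    for (int i = 0; i < ncached; i++)
        if (cached[i] == blk) return true;
    return false;
}

static void touch_block(int blk, bool is_write, bool write_allocate,
                        int *miss, int *hit) {
    if (present(blk)) { (*hit)++; return; }
    (*miss)++;
    if (!is_write || write_allocate)      /* reads always allocate */
        cached[ncached++] = blk;
}

int main(void) {
    int  seq_blk[] = {100, 100, 200, 200, 100};
    bool seq_wr[]  = {true, true, false, true, true};
    for (int wa = 0; wa <= 1; wa++) {
        int miss = 0, hit = 0;
        ncached = 0;
        for (int i = 0; i < 5; i++)
            touch_block(seq_blk[i], seq_wr[i], wa, &miss, &hit);
        printf("%s: %d misses, %d hits\n",
               wa ? "write allocate" : "no-write allocate", miss, hit);
    }
    return 0;
}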
The organization of the data cache in the Opteron microprocessor. The 64 KB cache is two-way set associative with 64-byte blocks. The 9-bit index selects among 512 sets. The four steps of a read hit, shown as circled numbers in order of occurrence, label this organization. Three bits of the block offset join the index to supply the RAM address to select the proper 8 bytes. Thus, the cache holds two groups of 4096 64-bit words, with each group containing half of the 512 sets. Although not exercised in this example, the line from lower-level memory to the cache is used on a miss to load the cache. The address leaving the processor is 40 bits because it is a physical address, not a virtual address. The figure on page C-45 explains how the Opteron maps from virtual (48-bit) to physical (40-bit) addresses for a cache access.
Example 2: What is the impact of two different cache organizations (direct-mapped vs. 2-way set-associative) on the performance of a CPU?
Ideal CPI = 1.6 (ignoring memory stalls)
Clock cycle time is 0.35 ns
Average memory references per instruction: 1.4
Cache size: 128 KB; block size: 64 bytes
For the set-associative cache, assume the clock cycle time is stretched 1.35 times to accommodate the selection multiplexer
Cache miss penalty is 65 ns
Hit time is 1 clock cycle
Miss rates: direct-mapped 2.1%; 2-way set-associative 1.9%
Calculate the AMAT and then processor performance.
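Since the answer is not worked out here, a sketch of the calculation (using AMAT = hit time + miss rate * miss penalty, and CPU time per instruction = CPI * cycle time + memory references per instruction * miss rate * miss penalty):

/* sketch: AMAT and CPU time per instruction for the two organizations */
#include <stdio.h>

static void evaluate(const char *name, double cycle_ns, double miss_rate) {
    double miss_penalty_ns = 65.0;
    double amat = cycle_ns + miss_rate * miss_penalty_ns;  /* hit time = 1 cycle */
    double cpu_time_per_instr =                            /* ns per instruction */
        1.6 * cycle_ns + 1.4 * miss_rate * miss_penalty_ns;
    printf("%-18s AMAT = %.4f ns, CPU time = %.4f ns/instr\n",
           name, amat, cpu_time_per_instr);
}

int main(void) {
    evaluate("direct mapped",    0.35,        0.021);
    evaluate("2-way set assoc.", 0.35 * 1.35, 0.019);
    return 0;
}

Running the numbers shows the classic result: the 2-way cache has the lower AMAT, but the stretched clock cycle makes the direct-mapped design slightly faster in total CPU time.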
Reducing miss penalty:
4. Multilevel caches
5. Giving reads priority over writes (e.g., a read completes before earlier writes in the write buffer)
Reducing hit time:
6. Avoiding address translation when indexing the cache
1: Larger Block Size
The miss rate actually goes up if the block size is too large relative to the cache size.
[Figure: miss rate vs. block size (16 to 256 bytes) for several cache sizes.]
Take advantage of spatial locality: the larger the block, the greater the chance that parts of it will be used again.
The number of blocks is reduced for a cache of the same size => increased miss penalty.
It may increase conflict misses, and even capacity misses if the cache is small.
Usually, the high latency and high bandwidth of lower-level memory encourage large block sizes.
2: Larger Caches
[Figure: miss rate per type vs. cache size (1 to 128 KB), showing conflict misses for 1-way through 8-way associativity stacked on top of the capacity and compulsory components.]
Increasing the capacity of the cache reduces capacity misses.
May result in longer hit time and higher cost.
Trend: larger L2 or L3 off-chip caches.
3: Higher Associativity
The previous figure shows how miss rates improve with higher associativity.
8-way set associative is as effective as fully associative for practical purposes.
2:1 cache rule: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.
Beware: execution time is the only final measure! Will the clock cycle time increase as a result of a more complicated cache?
4: Multilevel Caches
Approaches:
Make the cache faster to keep pace with the speed of CPUs.
Make the cache larger to overcome the widening gap.
L1: fast hits; L2: fewer misses.
L2 equations:
Average memory access time = Hit time(L1) + Miss rate(L1) * Miss penalty(L1)
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) * Miss penalty(L2)
Average memory access time = Hit time(L1) + Miss rate(L1) * (Hit time(L2) + Miss rate(L2) * Miss penalty(L2))
Hit time(L1) << Hit time(L2) << Hit time(Mem); Miss rate(L1) < Miss rate(L2)
Definitions:
Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (1st-level cache: Miss rate(L1); 2nd-level cache: Miss rate(L2)).
Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (L1: Miss rate(L1); L2: Miss rate(L1) * Miss rate(L2)). It indicates what fraction of the memory accesses that leave the CPU go all the way to memory.
Design of L2 Cache
Size
Since everything in L1 cache is likely to be in L2 cache, L2 cache should be much bigger than L1
Whether data in L1 is in L2
Novice approach: design L1 and L2 independently.
Multilevel inclusion: L1 data are always present in L2.
Advantage: consistency between I/O and the cache is easy to maintain (check L2 only).
Drawback: L2 must invalidate all L1 blocks that map onto a 2nd-level block being replaced => slightly higher 1st-level miss rate.
E.g., the Intel Pentium 4 uses 64-byte blocks in L1 and 128-byte blocks in L2.
Example: Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache. What are the various miss rates? Assume the miss penalty from the L2 cache to memory is 200 clock cycles, the hit time of the L2 cache is 10 clock cycles, the hit time of L1 is 1 clock cycle, and there are 1.5 memory references per instruction. What are the average memory access time and the average stall cycles per instruction?
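A sketch of the solution using the L2 equations above (this is my arithmetic, not the text's worked answer; note the AMAT uses the local L2 miss rate):

/* sketch: local/global miss rates and AMAT for the example above */
#include <stdio.h>

int main(void) {
    double refs = 1000.0, l1_misses = 40.0, l2_misses = 20.0;

    double l1_rate        = l1_misses / refs;      /* 4%: local = global for L1 */
    double l2_local_rate  = l2_misses / l1_misses; /* 50% */
    double l2_global_rate = l2_misses / refs;      /* 2%  */

    double amat = 1.0 + l1_rate * (10.0 + l2_local_rate * 200.0); /* 5.4 cycles */
    double stalls_per_instr = 1.5 * (amat - 1.0);                 /* 6.6 cycles */

    printf("L1 %.0f%%, L2 local %.0f%%, L2 global %.0f%%\n",
           100 * l1_rate, 100 * l2_local_rate, 100 * l2_global_rate);
    printf("AMAT = %.1f cycles, stalls/instruction = %.1f cycles\n",
           amat, stalls_per_instr);
    return 0;
}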
5: Giving Reads Priority over Writes
Problem: write through with write buffers can cause RAW conflicts with main memory reads on cache misses.
If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000).
Instead, check the write buffer contents before a read; if there are no conflicts, let the memory access continue.
Write back: suppose a read miss will replace a dirty block.
Normal: write the dirty block to memory, and then do the read.
Instead: copy the dirty block to a write buffer, do the read, and then do the write.
The CPU stalls less, since it restarts as soon as the read is done.
6: Avoiding Address Translation when Indexing the Cache
2. Context switching: the same VAs of different processes refer to different PAs, requiring the cache to be flushed.
Solution: widen the cache address tag with a process-identifier tag (PID).
4. I/O: typically uses PAs, so it must interact with the cache (see Section 5.12).
Overlap cache access with VA translation: requires the cache index to remain invariant across translation.
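The usual way to satisfy this constraint is to draw the index entirely from the page offset, which translation leaves unchanged; a sketch of the check (all sizes illustrative):

/* sketch: can the cache be indexed before translation finishes? */
#include <stdio.h>

int main(void) {
    unsigned page_offset_bits  = 12;  /* 4 KB pages     */
    unsigned block_offset_bits = 6;   /* 64-byte blocks */
    unsigned index_bits        = 6;   /* 64 sets        */

    if (block_offset_bits + index_bits <= page_offset_bits)
        printf("index bits are untranslated: cache access can overlap translation\n");
    else
        printf("index depends on translated bits: need more associativity or page coloring\n");
    return 0;
}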
Virtual Memory
Virtual memory (VM) allows programs to have the illusion of a very large memory that is not limited by physical memory size
Make main memory (DRAM) act like a cache for secondary storage (magnetic disk).
Otherwise, application programmers would have to move data in and out of main memory themselves; that is how virtual memory was first proposed.
VM Example
Address mapping: cache: 25-45 bit physical address => 13-21 bit cache address; virtual memory: 32-64 bit virtual address => 25-45 bit physical address.
Virtual-Physical Translation
A virtual address consists of a virtual page number and a page offset. The virtual page number gets translated to a physical page number. The page offset is not changed
[Figure: a virtual address = virtual page number (36 bits) + page offset (12 bits). Translation maps the virtual page number to a physical page number, giving a physical address = physical page number (33 bits) + page offset (12 bits); the page offset passes through unchanged.]
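A sketch of the split using the field widths in the figure (the page-table lookup is a stand-in; lookup_ppn is hypothetical):

/* sketch: virtual-to-physical translation preserves the page offset */
#include <stdint.h>
#include <stdio.h>

#define PAGE_OFFSET_BITS 12

/* hypothetical lookup: a real system would consult the TLB or page table */
static uint64_t lookup_ppn(uint64_t vpn) { return vpn + 7; }

int main(void) {
    uint64_t va     = 0x0000123456789ABCULL;
    uint64_t vpn    = va >> PAGE_OFFSET_BITS;
    uint64_t offset = va & ((1ULL << PAGE_OFFSET_BITS) - 1);
    uint64_t pa     = (lookup_ppn(vpn) << PAGE_OFFSET_BITS) | offset;
    printf("VA=0x%llx -> PA=0x%llx (offset 0x%llx unchanged)\n",
           (unsigned long long)va, (unsigned long long)pa,
           (unsigned long long)offset);
    return 0;
}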
TLB Characteristics
The following are typical characteristics of TLBs:
TLB size: 32 to 4,096 entries
Block size: 1 or 2 page table entries (4 or 8 bytes each)
Hit time: 0.5 to 1 clock cycle
Miss penalty: 10 to 30 clock cycles (go to the page table)
Miss rate: 0.01% to 0.1%
Associativity: fully associative or set-associative
Write policy: write back (replacement is infrequent)
General design:
Small and simple caches for the 1st level.
Keep the tags on chip and the data off chip for 2nd-level caches.
The recent emphasis is on fast clock time, while hiding L1 misses with dynamic execution and using L2 caches to avoid going to memory.
Note that pipelining the cache access increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit.
05: Non-blocking Caches
For processors with out-of-order completion.
Blocking caches
o When a miss occurs, the CPU stalls until the data cache finds the missing data.
Non-blocking caches
o Allow the CPU to keep doing useful work (such as fetching instructions) while the miss resolves.
o Use special registers called Miss Status/Information Holding Registers (MSHRs), which hold information about unresolved misses, with one entry per miss depending on the implementation.
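A sketch of what one MSHR entry might record (the field names and the merged-target limit are assumptions, not a real design):

/* sketch: one MSHR entry tracking an outstanding cache miss */
#include <stdbool.h>
#include <stdint.h>

#define MAX_TARGETS 4  /* assumed: later accesses to the same block merge here */

struct mshr_entry {
    bool     valid;              /* entry in use */
    uint64_t block_addr;         /* address of the missing cache block */
    int      num_targets;        /* instructions waiting on this block */
    struct {
        uint8_t dest_reg;        /* register to receive the loaded data */
        uint8_t offset_in_block; /* which bytes of the block are needed */
    } targets[MAX_TARGETS];
};

/* on a new miss the cache allocates a free entry; if every entry is
   busy, the processor must stall until one retires */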
06: Multibanked Caches
A single-bank cache can service only one access at a time; dividing the cache into independent banks allows simultaneous accesses.
Sequential interleaving spreads block addresses across the banks (bank = block address mod number of banks); with four banks:
Bank 0: blocks 0, 4, 8, 12
Bank 1: blocks 1, 5, 9, 13
Bank 2: blocks 2, 6, 10, 14
Bank 3: blocks 3, 7, 11, 15
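A sketch of the bank-selection rule behind this table (with a power-of-two bank count, as in real designs, the modulus is just a mask):

/* sketch: sequential interleaving of block addresses across banks */
#include <stdio.h>

#define NUM_BANKS 4

int main(void) {
    for (unsigned blk = 0; blk < 16; blk++)
        printf("block %2u -> bank %u\n", blk, blk % NUM_BANKS);
    return 0;
}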
[Figure: servicing a miss word by word. An unoptimized cache makes the CPU wait until the entire block has been fetched; with early start, the CPU resumes as soon as the requested word arrives.]
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Loop fusion example: 2 misses per access to a & c vs. one miss per access; improved spatial locality (the code is reconstructed below).
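The fusion code itself did not survive extraction; this is a reconstruction of the standard example the note refers to (the array names a, b, c, d and the bound N are assumptions), in the same style as the loop interchange code above:

/* Before: two separate loop nests, so a[i][j] and c[i][j] are each
   brought into the cache twice */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After: one fused loop nest, so each a[i][j] and c[i][j] is reused
   while it is still in the cache */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }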