
Memory Hierarchy Design

5.1 Introduction
The five classic components of a computer: Control and Datapath (together forming the Processor), Memory, Input, and Output.

Where do we fetch instructions to execute?

Build a memory hierarchy that includes main memory and caches (internal memory) and the hard disk (external memory). Instructions are first fetched from external storage such as the hard disk and kept in main memory; before they go to the CPU, they are typically copied into the caches.

Memory Hierarchy
Levels of the Memory Hierarchy

Level           Capacity     Access Time
CPU Registers   500 bytes    0.25 ns
Cache           64 KB        1 ns
Main Memory     512 MB       100 ns
Disk            100 GB       5 ms
I/O Devices     ???          ???

Moving toward the upper levels, the hierarchy gets faster and smaller; moving toward the lower levels, it gets larger and slower. Data moves between levels in blocks (cache to/from main memory), pages (main memory to/from disk), and files (disk to/from I/O devices).

5.2 ABCs of Caches


Cache: In this textbook it mainly means the first level of the memory hierarchy encountered once the address leaves the CPU. The term is applied whenever buffering is employed to reuse commonly occurring items, e.g., file caches, name caches, and so on.
Principle of Locality: programs access a relatively small portion of the address space at any instant of time. There are two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array accesses).

Memory Hierarchy: Terminology


Hit: the data appears in some block in the cache (example: Block X).
Hit rate: the fraction of cache accesses found in the cache.
Hit time: the time to access the upper level, consisting of the RAM access time plus the time to determine hit/miss.
Miss: the data must be retrieved from a block in main memory (Block Y).
Miss rate = 1 - (hit rate).
Miss penalty: the time to replace a block in the cache plus the time to deliver the block to the processor.
Hit time << miss penalty (e.g., 1 clock cycle vs. 40 clock cycles).

(Figure: on a hit, Block X moves between the cache and the processor; on a miss, Block Y is brought from main memory into the cache.)

Cache Measures
CPU execution time incorporating cache performance:
CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time
Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access.
Memory stall clock cycles = Number of misses x Miss penalty
= IC x (Misses/Instruction) x Miss penalty
= IC x (Memory accesses/Instruction) x Miss rate x Miss penalty
= IC x Reads per instruction x Read miss rate x Read miss penalty + IC x Writes per instruction x Write miss rate x Write miss penalty
Memory accesses consist of fetching instructions and reading/writing data.

P.395 Example
Example: Assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all accesses hit in the cache?

Answer:
(A) If accesses always hit in the cache, CPI = 1.0 and there are no memory stalls, so
CPU(A) = (IC x CPI + 0) x Clock cycle time = IC x Clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we need the memory stall cycles:
Memory stalls = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty = IC x (1 + 50%) x 2% x 25 = 0.75 x IC
CPU(B) = (IC + 0.75 x IC) x Clock cycle time = 1.75 x IC x Clock cycle time
The performance ratio is the inverse of the CPU execution times: CPU(B)/CPU(A) = 1.75. The computer with no cache misses is 1.75 times faster.
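This calculation can also be checked in code. A minimal sketch in C, with illustrative names (not from the text), reproducing the 1.75x ratio above:

#include <stdio.h>

/* CPI including memory stalls, following
   CPI_total = CPI + (Memory accesses/Instruction) x Miss rate x Miss penalty */
static double cpi_with_stalls(double cpi, double accesses_per_instr,
                              double miss_rate, double miss_penalty)
{
    return cpi + accesses_per_instr * miss_rate * miss_penalty;
}

int main(void)
{
    double ideal = cpi_with_stalls(1.0, 1.5, 0.00, 25.0); /* all hits: 1.0   */
    double real  = cpi_with_stalls(1.0, 1.5, 0.02, 25.0); /* 2% misses: 1.75 */
    printf("CPU(B)/CPU(A) = %.2f\n", real / ideal);       /* prints 1.75 */
    return 0;
}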

Four Memory Hierarchy Questions


Q1 (block placement): Where can a block be placed in the upper level?
Q2 (block identification): How is a block found if it is in the upper level?
Q3 (block replacement): Which block should be replaced on a miss?
Q4 (write strategy): What happens on a write?

Q1 (block placement): Where can a block be placed?


Direct mapped: (Block number) mod (Number of blocks in cache)
Set associative: (Block number) mod (Number of sets in cache), where the number of sets = the number of blocks / n for an n-way cache (n blocks per set); 1-way = direct mapped
Fully associative: the number of sets = 1 (a block can go anywhere)

Example: block 12 placed in an 8-block cache

Simplest Cache: Direct Mapped (1-way)


(Figure: a 16-block memory (block numbers 0-F) mapped onto a 4-block direct-mapped cache; memory block N goes to cache index N mod 4.)

Each block has only one place it can appear in the cache. The mapping is usually
(Block address) MOD (Number of blocks in cache)

Example: 1 KB Direct Mapped Cache, 32B Blocks


For a 2^N byte direct-mapped cache: the uppermost (32 - N) bits are the Cache Tag, and the lowest M bits are the Byte Select (Block Size = 2^M).

Address breakdown for this example: bits 31-10 are the Cache Tag (e.g., 0x50), bits 9-5 the Cache Index (e.g., 0x01), and bits 4-0 the Byte Select (e.g., 0x00). The tag is stored as part of the cache state, along with a valid bit.

(Figure: the cache holds 32 entries of one 32-byte block each; the index selects the entry, the stored tag (e.g., 0x50) is compared against the address tag, and the byte select picks one of bytes 0-31 within the block.)
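The bit slicing above can be sketched in C for this specific configuration (1 KB direct-mapped cache, 32-byte blocks); the helper names are illustrative:

#include <stdint.h>

/* 1 KB / 32-byte blocks = 32 entries: 5 byte-select bits, 5 index bits, 22 tag bits. */
static uint32_t byte_select(uint32_t addr) { return addr & 0x1f; }        /* bits 4..0   */
static uint32_t cache_index(uint32_t addr) { return (addr >> 5) & 0x1f; } /* bits 9..5   */
static uint32_t cache_tag(uint32_t addr)   { return addr >> 10; }         /* bits 31..10 */

/* For the example address 0x14020: tag = 0x50, index = 0x01, byte select = 0x00. */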

Q2 (block identification): How is a block found?


Three portions of an address in a set-associative or direct-mapped cache:

Block address = Tag + Cache/Set Index, followed by the Block Offset (determined by the block size)

The block offset selects the desired data from the block, the index field selects the set, and the tag field is compared against the CPU address for a hit:
Use the cache index to select the cache set.
Check the tag on each block in that set (there is no need to check the index or block offset).
A valid bit is added to the tag to indicate whether the entry contains a valid address.
Select the desired bytes using the block offset.
Increasing associativity shrinks the index and expands the tag.

Example: Two-way set associative cache


The cache index selects a set from the cache; the two tags in the set are compared in parallel; data is selected based on the tag comparison result.

(Figure: for the address breakdown bits 31-10 tag / 9-5 index / 4-0 byte select, the index selects one block from each of the two ways; both stored tags are compared against the address tag in parallel, the compare results are ORed to form Hit, and a multiplexer selects the cache block from the matching way.)

Disadvantage of Set Associative Cache


N-way set associative cache vs. direct mapped cache:
N comparators vs. 1
Extra MUX delay for the data
Data arrives AFTER the hit/miss decision
In a direct mapped cache, the cache block is available BEFORE the hit/miss decision: it is possible to assume a hit and continue, recovering later on a miss.

(Figure: the same two-way organization as on the previous slide, highlighting the multiplexer on the data path that adds the extra delay.)

Q3 (block replacement): Which block should be replaced on a cache miss?


Easy for direct mapped: hardware decisions are simplified; only one block frame is checked, and only that block can be replaced.
Set associative or fully associative: there are many blocks to choose from on a miss.
Three primary strategies for selecting a block to be replaced:
Random: the victim is randomly selected
LRU: the least recently used block is removed
FIFO (first in, first out)

Data cache misses per 1000 instructions for various replacement strategies:

          2-way                  4-way                  8-way
Size      LRU    Random  FIFO    LRU    Random  FIFO    LRU    Random  FIFO
16 KB     114.1  117.3   115.5   111.7  115.1   113.3   109.0  111.8   110.4
64 KB     103.4  104.3   103.9   102.4  102.3   103.1   99.7   100.5   100.3
256 KB    92.2   92.1    92.5    92.1   92.1    92.5    92.1   92.1    92.5

There is little difference between LRU and random for the largest cache size, with LRU outperforming the others for smaller caches. FIFO generally outperforms random in the smaller cache sizes.
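As an illustration of LRU bookkeeping, a minimal sketch in C for one set of a 4-way cache, using a last-used counter per way (a software simplification; hardware typically uses approximate LRU bits):

#define WAYS 4

struct way {
    unsigned long tag;
    int valid;
    unsigned long last_used;   /* timestamp of the most recent access */
};

/* Pick the victim for replacement: an invalid way if one exists,
   otherwise the way with the oldest timestamp (least recently used). */
static int pick_victim(const struct way set[WAYS])
{
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid)
            return i;                                  /* free slot first */
        if (set[i].last_used < set[victim].last_used)
            victim = i;                                /* older than current victim */
    }
    return victim;
}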

Q4 (write strategy): What happens on a write?


Reads dominate processor cache accesses: e.g., writes are 7% of overall memory traffic and 21% of data cache accesses.
Two options when writing to the cache:
Write through: the information is written both to the block in the cache and to the block in the lower-level memory.
Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. To reduce the frequency of writing back blocks on replacement, a dirty bit indicates whether the block was modified in the cache (dirty) or not (clean). If clean, no write back is needed, since the lower level already holds identical information.
Pros and cons:
WT: simple to implement, and the cache is always clean, so read misses cannot result in writes.
WB: writes occur at the speed of the cache, and multiple writes within a block require only one write to the lower-level memory.
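A minimal sketch of the write-back bookkeeping in C, assuming a simplified block structure (the names and the 64-byte block size are illustrative):

struct cache_block {
    unsigned long tag;
    int valid;
    int dirty;                 /* set on any write hit; cleared on (re)fill */
    unsigned char data[64];
};

/* Write hit under write-back: update only the cache and mark the block dirty. */
static void write_hit_wb(struct cache_block *b, int offset, unsigned char v)
{
    b->data[offset] = v;
    b->dirty = 1;              /* lower-level memory is now stale */
}

/* On replacement, only a dirty block needs to be written back. */
static void evict(struct cache_block *b)
{
    if (b->valid && b->dirty) {
        /* write b->data back to the lower-level memory here (omitted) */
    }
    b->valid = 0;
}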

Write Stall and Write Buffer


When the CPU must wait for writes to complete during write through, the CPU is said to write stall.
A common optimization to reduce write stalls is a write buffer, which allows the processor to continue as soon as the data are written to the buffer, thereby overlapping processor execution with memory updating.

(Figure: Processor -> Cache -> DRAM, with a Write Buffer between the cache and memory.)

A write buffer is needed between the cache and memory:
Processor: writes data into the cache and the write buffer.
Memory controller: writes the contents of the buffer to memory.
The write buffer is just a FIFO; a typical number of entries is 4.

Write-Miss Policy: Write Allocate vs. Not Allocate


Two options on a write miss:

Write allocate: the block is allocated on a write miss, followed by the write hit actions. Write misses act like read misses.

No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory.

Blocks stay out of the cache under no-write allocate until the program tries to read them, but with write allocate even blocks that are only written will be in the cache.

Write-Miss Policy Example


Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations:
Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100].
What are the numbers of hits and misses (reads and writes combined) when using no-write allocate versus write allocate?

Answer:
No-write allocate:
Write Mem[100]: 1 write miss
Write Mem[100]: 1 write miss
Read Mem[200]: 1 read miss
Write Mem[200]: 1 write hit
Write Mem[100]: 1 write miss
Total: 4 misses, 1 hit

Write allocate:
Write Mem[100]: 1 write miss
Write Mem[100]: 1 write hit
Read Mem[200]: 1 read miss
Write Mem[200]: 1 write hit
Write Mem[100]: 1 write hit
Total: 2 misses, 3 hits

The Opteron Data Cache

64 KB cache, 2-way set associative, 64-byte blocks
Block offset: 64 bytes/block = 2^6, i.e., <6> bits
Index: 64 KB / 64 bytes = 1024 blocks; 1024 blocks / 2 ways = 512 sets = 2^9, i.e., <9> bits
Tag: the remaining <25> bits
25 + 9 + 6 = 40-bit physical address

(Figure: the organization of the data cache in the Opteron microprocessor. The 64 KB cache is two-way set associative with 64-byte blocks. The 9-bit index selects among 512 sets. The four steps of a read hit, shown as circled numbers in order of occurrence, label this organization. Three bits of the block offset join the index to supply the RAM address to select the proper 8 bytes. Thus, the cache holds two groups of 4096 64-bit words, with each group containing half of the 512 sets. Although not exercised in this example, the line from lower-level memory to the cache is used on a miss to load the cache. The size of the address leaving the processor is 40 bits because it is a physical address, not a virtual address; the figure on page C-45 explains how the Opteron maps from virtual (48-bit) to physical (40-bit) addresses for a cache access.)

Impact of Memory Access on CPU Performance


Example: Let's use an in-order execution computer. Assume the cache miss penalty is 200 clock cycles, and all instructions normally take 1.0 clock cycles (ignoring memory stalls). Assume the average miss rate is 2%, there is an average of 1.5 memory references per instruction, and the average number of cache misses per 1000 instructions is 30. What is the impact on performance when the behavior of the cache is included? Calculate the impact using both misses per instruction and miss rate.
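Answer (a worked solution, filling in the calculation the slide leaves open): Using misses per instruction, memory stall cycles per instruction = (30/1000) x 200 = 6.0, so total CPI = 1.0 + 6.0 = 7.0. Using the miss rate, misses per instruction = 1.5 x 2% = 0.030, which gives the same 6.0 stall cycles per instruction and the same CPI of 7.0. Including cache behavior therefore makes the computer 7 times slower than the ideal machine with no misses.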


Impact of Cache Organizations on CPU Performance


Example 1: What is the impact of two different cache organizations (direct mapped vs. 2-way set associative) on the performance of a CPU?
Ideal CPI = 2.0 (ignoring memory stalls)
Clock cycle time is 1.0 ns
Avg. memory references per instruction is 1.5
Cache size: 64 KB, block size: 64 bytes
For the set-associative cache, assume the clock cycle time is stretched 1.25 times to accommodate the selection multiplexer
Cache miss penalty is 75 ns
Hit time is 1 clock cycle
Miss rate: direct mapped 1.4%; 2-way set associative 1.0%
Calculate AMAT and then processor performance.

Answer:
Avg. memory access time(1-way) = 1.0 + (0.014 x 75) = 2.05 ns
Avg. memory access time(2-way) = 1.0 x 1.25 + (0.01 x 75) = 2.00 ns
CPU time = IC x (CPI(execution) x Clock cycle time + Miss rate x Memory accesses per instruction x Miss penalty)
CPU time(1-way) = IC x (2.0 x 1.0 + (1.5 x 0.014 x 75)) = 3.58 x IC ns
CPU time(2-way) = IC x (2.0 x 1.0 x 1.25 + (1.5 x 0.01 x 75)) = 3.63 x IC ns
Although the 2-way cache has the better AMAT, the stretched clock cycle makes the direct-mapped organization slightly faster overall.

Example 2: What is the impact of two different cache organizations (direct mapped vs. 2-way set associative) on the performance of a CPU?
Ideal CPI = 1.6 (ignoring memory stalls)
Clock cycle time is 0.35 ns
Avg. memory references per instruction is 1.4
Cache size: 128 KB, block size: 64 bytes
For the set-associative cache, assume the clock cycle time is stretched 1.35 times to accommodate the selection multiplexer
Cache miss penalty is 65 ns
Hit time is 1 clock cycle
Miss rate: direct mapped 2.1%; 2-way set associative 1.9%
Calculate AMAT and then processor performance.
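Answer (worked the same way as Example 1):
Avg. memory access time(1-way) = 0.35 + (0.021 x 65) = 1.72 ns
Avg. memory access time(2-way) = 0.35 x 1.35 + (0.019 x 65) = 1.71 ns
CPU time(1-way) = IC x (1.6 x 0.35 + (1.4 x 0.021 x 65)) = 2.47 x IC ns
CPU time(2-way) = IC x (1.6 x 0.35 x 1.35 + (1.4 x 0.019 x 65)) = 2.49 x IC ns
As in Example 1, the 2-way cache has the better average memory access time, but the stretched clock cycle makes the direct-mapped cache slightly faster overall.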


Summary of Performance Equations

CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time
Memory stall cycles = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty
Average memory access time = Hit time + Miss rate x Miss penalty
Average memory access time (two levels) = Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Miss rate(L2) x Miss penalty(L2))

Types of Cache Misses


Compulsory (cold start, process migration, first reference): the very first access to a block cannot be in the cache, so the block must be brought into the cache. Compulsory misses occur even in an infinite cache.
Capacity: the cache cannot contain all the blocks accessed by the program, so blocks must be discarded and later retrieved. Capacity misses occur even in a fully associative cache. Solution: increase the cache size (if the upper-level memory is too small, it thrashes).
Conflict (collision): multiple memory locations map to the same cache location when the block placement strategy is set associative or direct mapped. Solution 1: increase the cache size. Solution 2: increase associativity.

6 Basic Cache Optimizations


AMAT = Hit time + Miss rate x Miss penalty

Reducing miss rate:
1. Larger block size (compulsory misses)
2. Larger cache size (capacity misses)
3. Higher associativity (conflict misses)
Reducing miss penalty:
4. Multilevel caches
5. Giving reads priority over writes (e.g., letting a read complete before earlier writes still in the write buffer)
Reducing hit time:
6. Avoiding address translation when indexing the cache

1: Larger Block Size


(Figure: miss rate vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K. The miss rate actually goes up if the block size is too large relative to the cache size.)

Larger blocks take advantage of spatial locality: the larger the block, the greater the chance that more of it will be used.
But the number of blocks is reduced for a cache of the same size, which increases the miss penalty.
It may also increase conflict misses, and even capacity misses if the cache is small.
High latency and high bandwidth in the lower-level memory encourage large block sizes.

2: Larger Caches
(Figure: miss rate per type (compulsory, capacity, conflict) vs. cache size from 1 KB to 128 KB, for 1-way through 8-way associativity.)

Increasing the capacity of the cache reduces capacity misses.
It may result in a longer hit time and higher cost.
Trend: larger L2 or L3 off-chip caches.

3: Higher Associativity
The previous figure shows how miss rates improve with higher associativity:
8-way set associative is as effective as fully associative for practical purposes.
2:1 cache rule: the miss rate of a direct-mapped cache of size N roughly equals the miss rate of a 2-way set-associative cache of size N/2.

Tradeoff: a more highly associative cache complicates the circuit and may have a longer clock cycle.

Beware: execution time is the only final measure! Will the clock cycle time increase as a result of having a more complicated cache?

Reducing Cache Miss Penalty


The time to handle a miss is increasingly the controlling factor, because processor speed has improved far more than memory speed.
Average memory access time = Hit time + Miss rate x Miss penalty
Two of the basic optimizations target the miss penalty:
1. Multilevel caches
2. Giving priority to read misses over writes

4: Multilevel Caches
Approaches:
Make the cache faster, to keep pace with the speed of CPUs.
Make the cache larger, to overcome the widening gap.
L1: fast hits; L2: fewer misses.

L2 equations:
Average memory access time = Hit time(L1) + Miss rate(L1) x Miss penalty(L1)
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) x Miss penalty(L2)
Average memory access time = Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Miss rate(L2) x Miss penalty(L2))
Hit time(L1) << Hit time(L2) << ... << Hit time(Mem); Miss rate(L1) < Miss rate(L2) < ...

Definitions:
Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate(L1) for the 1st-level cache, Miss rate(L2) for the 2nd-level cache).
Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate(L1) for L1; Miss rate(L1) x Miss rate(L2) for L2). It indicates what fraction of the memory accesses that leave the CPU go all the way to memory.

Design of L2 Cache
Size: since everything in the L1 cache is likely to be in the L2 cache, L2 should be much bigger than L1.

Whether data in L1 is also in L2:
Novice approach: design L1 and L2 independently.
Multilevel inclusion: L1 data are always present in L2.
Advantage: consistency between I/O and the cache is easy to maintain (check L2 only).
Drawback: L2 must invalidate all L1 blocks that map onto a replaced 2nd-level block, giving a slightly higher 1st-level miss rate (e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2).
Multilevel exclusion: L1 data is never found in L2.
A cache miss in L1 results in a swap (not a replacement) of blocks between L1 and L2.
Advantage: prevents wasting space in L2 (e.g., AMD Athlon: 64 KB L1 and 256 KB L2).

Example: Suppose that for 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache. What are the various miss rates? Assume the miss penalty from the L2 cache to memory is 200 clock cycles, the hit time of the L2 cache is 10 clock cycles, the hit time of L1 is 1 clock cycle, and there are 1.5 memory references per instruction. What are the average memory access time and the average stall cycles per instruction?
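Answer (a worked solution using the L2 equations above): The L1 miss rate is 40/1000 = 4% (both local and global). The local miss rate of L2 is 20/40 = 50%, and its global miss rate is 20/1000 = 2%.
Average memory access time = 1 + 4% x (10 + 50% x 200) = 1 + 4% x 110 = 5.4 clock cycles.
With 1.5 memory references per instruction there are 1.5 x 40/1000 = 0.060 L1 misses and 1.5 x 20/1000 = 0.030 L2 misses per instruction, so average stall cycles per instruction = 0.060 x 10 + 0.030 x 200 = 6.6.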


5: Giving Priority to Read Misses over Writes


Serve reads before earlier writes have completed.
Write through with write buffers:
  SW R3, 512(R0)   ; M[512] <- R3   (cache index 0)
  LW R1, 1024(R0)  ; R1 <- M[1024]  (cache index 0)
  LW R2, 512(R0)   ; R2 <- M[512]   (cache index 0)
Problem: write through with write buffers creates read-after-write (RAW) conflicts with main memory reads on cache misses.
Simply waiting for the write buffer to empty might increase the read miss penalty (by 50% on the old MIPS 1000).
Instead, check the write buffer contents before a read; if there are no conflicts, let the memory access continue.
Write back:
Suppose a read miss will replace a dirty block.
Normal: write the dirty block to memory, and then do the read.
Instead: copy the dirty block to a write buffer, do the read, and then do the write.
The CPU stalls less, since it can restart as soon as the read is done.

6: Avoiding address translation during cache indexing


Two tasks: indexing the cache and comparing addresses.
Virtually vs. physically addressed caches:
Virtual cache: uses the virtual address (VA) for the cache.
Physical cache: uses the physical address (PA) obtained after translating the virtual address.

Challenges for virtual caches:

1. Protection: page-level protection (read-write/read-only/invalid) must be checked; it is normally checked as part of the virtual-to-physical address translation. Solution: an additional field that copies the protection information from the TLB and is checked on every access to the cache.

2. Context switching: the same VA in different processes refers to different PAs, requiring the cache to be flushed. Solution: increase the width of the cache address tag with a process-identifier tag (PID).

3. Synonyms or aliases: two different VAs for the same PA cause an inconsistency problem, since there can be two copies of the same data in a virtual cache. Hardware antialiasing solution: guarantee every cache block a unique PA (the Alpha 21264 checks all possible locations; if a duplicate is found, it is invalidated). Software page-coloring solution: force aliases to share some address bits (Sun's Solaris requires all aliases to be identical in the last 18 bits, so no duplicate PAs can occur).

4. I/O: typically uses PAs, so it needs to interact with the cache (see Section 5.12).

Virtually indexed, physically tagged cache


(Figure: three organizations. Conventional: CPU -> VA -> TLB -> PA -> cache -> memory. Virtually addressed cache: the cache is indexed and tagged with VAs and translation happens only on a miss, which raises the synonym problem. Virtually indexed, physically tagged: the cache access is overlapped with the VA translation, which requires the cache index to remain invariant across translation.)

Virtual Memory
Virtual memory (VM) allows programs to have the illusion of a very large memory that is not limited by physical memory size.
It makes main memory (DRAM) act like a cache for secondary storage (magnetic disk); otherwise, application programmers would have to move data in and out of main memory themselves. That is how virtual memory was first proposed.

Virtual memory also provides the following functions:
Allowing multiple processes to share physical memory in a multiprogramming environment.
Providing protection for processes (compare the Intel 8086: without VM, applications can overwrite the OS kernel).
Facilitating program relocation in physical memory space.

Virtual Memory and Cache


VM address translation provides a mapping from the virtual address of the processor to the physical address in main memory and secondary storage.
Cache terms vs. VM terms: cache block => page; cache miss => page fault.

Tasks of hardware and OS:
The TLB does fast address translations.
The OS handles the less frequent events: page faults, and TLB misses (when a software approach is used).

Virtual Memory and Cache


Parameter          L1 Cache                      Main Memory (VM)
Block (page) size  16-128 bytes                  4 KB - 64 KB
Hit time           1-3 cycles                    50-150 cycles
Miss penalty       8-300 cycles                  1M to 10M cycles
Miss rate          0.1-10%                       0.00001-0.001%
Address mapping    25-45 bit physical address    32-64 bit virtual address
                   to 13-21 bit cache address    to 25-45 bit physical address

4 Qs for Virtual Memory


Q1: Where can a block be placed in main memory?
The miss penalty for virtual memory is very high, so full associativity is desirable (blocks may be placed anywhere in memory).
Software can determine the location while accessing the disk (10M cycles is enough time to do sophisticated replacement).

Q2: How is a block found if it is in main memory?
The address is divided into a page number and a page offset.
A page table and a translation lookaside buffer are used for address translation.

4 Qs for Virtual Memory


Q3: Which block should be replaced on a miss?
We want to reduce the miss rate, and replacement can be handled in software. Least recently used (LRU) is typically used, via an approximation:
Hardware sets reference bits.
The OS records the reference bits and clears them periodically.
The OS selects a page among the least recently referenced for replacement.

Q4: What happens on a write?
Writing to disk is very expensive, so a write-back strategy is used.

Virtual-Physical Translation
A virtual address consists of a virtual page number and a page offset. The virtual page number is translated into a physical page number; the page offset is not changed.

Virtual address:  36-bit Virtual Page Number + 12-bit Page Offset
                        | translation
Physical address: 33-bit Physical Page Number + 12-bit Page Offset

Address Translation Via Page Table

(Figure: address translation via the page table, assuming the access hits in main memory.)

TLB: Improving Page Table Access


We cannot afford to access the page table for every access, including cache hits (the cache itself would then make no sense). So, again, use a cache to speed up accesses to the page table (a cache for the cache?): the TLB (translation lookaside buffer) stores frequently accessed page table entries.
A TLB entry is like a cache entry:
The tag holds portions of the virtual address.
The data portion holds the physical page number, protection field, valid bit, use bit, and dirty bit (as in a page table entry).
Usually fully associative or highly set associative, with 64 or 128 entries.
The page table is accessed only on TLB misses.
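A minimal sketch of such a lookup in C, modeling a fully associative TLB (the structure and names are illustrative; real hardware searches all entries in parallel rather than in a loop):

#define TLB_ENTRIES 64

struct tlb_entry {
    unsigned long vpn;   /* virtual page number (the tag) */
    unsigned long ppn;   /* physical page number          */
    int valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns 1 and fills *ppn on a TLB hit; 0 means the page table must be walked. */
static int tlb_lookup(unsigned long vpn, unsigned long *ppn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return 1;        /* hit: translation without touching the page table */
        }
    }
    return 0;                /* miss: hardware or OS walks the page table */
}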

TLB Characteristics
The following are typical characteristics of TLBs:
TLB size: 32 to 4096 entries
Block size: 1 or 2 page table entries (4 or 8 bytes each)
Hit time: 0.5 to 1 clock cycle
Miss penalty: 10 to 30 clock cycles (to walk the page table)
Miss rate: 0.01% to 0.1%
Associativity: fully associative or set associative
Write policy: write back (replacement is infrequent)


11 Advanced Cache Optimizations


Reducing hit time:
1. Small and simple caches
2. Way prediction
3. Trace caches
Increasing cache bandwidth:
4. Pipelined caches
5. Multibanked caches
6. Nonblocking caches
Reducing miss penalty:
7. Critical word first
8. Merging write buffers
Reducing miss rate:
9. Compiler optimizations
Reducing miss penalty or miss rate via parallelism:
10. Hardware prefetching
11. Compiler prefetching

O1: Small and Simple Caches


A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address.

Guideline: smaller hardware is faster.
Why does the Alpha 21164 have an 8 KB instruction cache and an 8 KB data cache plus a 96 KB second-level cache? A small data cache permits a fast clock rate.
Guideline: simpler hardware is faster.
E.g., a direct-mapped, on-chip cache.

General design: a small and simple cache for the 1st level; keep the tags on chip and the data off chip for 2nd-level caches. The recent emphasis is on a fast clock time while hiding L1 misses with dynamic execution and using L2 caches to avoid going to memory.

O2: Fast Hit Times Via Way Prediction

How can we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
Way prediction: keep extra bits in the cache to predict the way (block within the set) of the next access.
The multiplexer is set early to select the desired block, and only one tag comparison is done that cycle (in parallel with reading the data).
On a way miss, the other blocks are checked for matches in the next cycle, so hit time < way-miss hit time < miss penalty.
Prediction accuracy is around 85%.
Drawback: the CPU pipeline is harder to design if the hit time is variable-length.

O3: Trace Caches


Trace cache for instructions: find a dynamic sequence of instructions, including taken branches, to load into a cache block.
The cache blocks contain dynamic traces of executed instructions as determined by the CPU, rather than static sequences of instructions as laid out in memory.
Branch prediction is folded into the cache: predictions are validated along with the addresses to ensure a valid fetch (e.g., the Intel NetBurst microarchitecture).

Advantage: better utilization.
Trace caches store instructions only from the branch entry point to the exit of the trace.
In a conventional I-cache, the unused part of a long block entered or exited by a taken branch may never be fetched usefully.

Downside: the same instructions may be stored multiple times.

O4: Pipelined Cache Access


Simply pipeline the cache access, so that a 1st-level cache hit takes multiple clock cycles.

Advantage: a fast cycle time (at the cost of slow hits).
Example: accessing instructions from the I-cache takes 1 clock cycle on the Pentium, 2 clocks on the Pentium Pro through Pentium III, and 4 clocks on the Pentium 4.

Drawback: increasing the number of pipeline stages leads to a greater penalty on mispredicted branches and more clock cycles between the issue of a load and the use of its data.

Note that pipelining increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit.

O5: Non-blocking Caches
For processors with out-of-order completion.
Blocking caches: when a miss occurs, the CPU stalls until the data cache resolves the missing data.
Non-blocking caches: allow the CPU to continue doing useful work (such as fetching instructions) while the miss resolves. They use special registers called miss status/information holding registers (MSHRs), which hold information about unresolved misses, typically one entry per outstanding miss (depending on the implementation).

(Figures: blocking cache flow vs. non-blocking cache flow.)

O6: Multibanked Caches

Increasing Cache Bandwidth Via Multiple Banks


Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses (e.g., the T1 (Niagara) L2 has 4 banks).
Banking works best when the accesses naturally spread across the banks, so the mapping of addresses to banks affects the behavior of the memory system.
A simple mapping that works well is sequential interleaving: spread the block addresses sequentially across the banks, so that bank i holds all blocks with address i modulo n.

(Figure: a single-bank cache vs. a four-bank cache with simultaneous access. With sequential interleaving:
Bank #0 handles block addresses where (block_address) mod 4 = 0, i.e., 0, 4, 8, 12
Bank #1 handles block addresses where (block_address) mod 4 = 1, i.e., 1, 5, 9, 13
Bank #2 handles block addresses where (block_address) mod 4 = 2, i.e., 2, 6, 10, 14
Bank #3 handles block addresses where (block_address) mod 4 = 3, i.e., 3, 7, 11, 15)
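Sequential interleaving is just a modulo computation. A one-function sketch in C (names illustrative):

#define NBANKS 4

/* Sequential interleaving: bank number = block address mod number of banks,
   so consecutive blocks land in consecutive banks. */
static int bank_of(unsigned long block_addr)
{
    return (int)(block_addr % NBANKS);
}
/* bank_of(0) == 0, bank_of(1) == 1, ..., bank_of(13) == 1, matching the figure. */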

O7: Critical Word First / Early Restart

Reduce the miss penalty by not waiting for the full block before restarting the CPU:
Early restart: as soon as the requested word of the block arrives, send it to the CPU and continue execution.
Critical word first: request the missed word from memory first and send it to the CPU right away; let the CPU continue while the rest of the block fills.
Spatial locality means the CPU tends to want the next sequential word soon, so it may still pay to fetch the rest of the block.
These techniques matter more with today's longer blocks; critical word first is widely used.

(Figure: animation comparing the three policies after a miss. Unoptimized: the CPU waits for the entire block to arrive. Early restart: the CPU resumes as soon as the requested word arrives. Critical word first: the requested word is fetched first, then the rest of the block.)

O8: Merging Write Buffer

Merging Write Buffer to Reduce Miss Penalty


The write buffer lets the processor continue while waiting for a write to complete.
If the buffer contains modified blocks, the addresses can be checked to see whether the new data matches an existing write buffer entry; if so, the new data are combined with that entry.
For sequential writes in write-through caches, merging increases the effective block size of a write, which is more efficient.
The Sun T1 (Niagara) and many other processors use write merging.

(Figure: a four-entry write buffer with four words per entry. Without merging, four sequential one-word writes occupy four entries, wasting 75% of the space. With merging, all four writes combine into a single entry.)
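A minimal sketch of the merge check in C, assuming four entries of four words each as in the figure (the names and the word-addressed interface are illustrative):

#define ENTRIES 4
#define WORDS_PER_ENTRY 4

struct wb_entry {
    unsigned long block_addr;           /* aligned block this entry covers */
    int word_valid[WORDS_PER_ENTRY];    /* per-word valid bits */
    unsigned long data[WORDS_PER_ENTRY];
    int used;
};

/* Try to merge a one-word write (word address) into an entry already holding
   the same block; return 0 so the caller can allocate a new entry (omitted)
   when no entry matches. */
static int try_merge(struct wb_entry buf[ENTRIES],
                     unsigned long word_addr, unsigned long value)
{
    unsigned long block = word_addr / WORDS_PER_ENTRY;
    int offset = (int)(word_addr % WORDS_PER_ENTRY);

    for (int i = 0; i < ENTRIES; i++) {
        if (buf[i].used && buf[i].block_addr == block) {
            buf[i].data[offset] = value;    /* combine with the existing entry */
            buf[i].word_valid[offset] = 1;
            return 1;
        }
    }
    return 0;
}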

9. Reducing Misses by Compiler Optimizations


McFarling [1989] reduced misses by 75% in software on an 8 KB direct-mapped cache with 4-byte blocks.
Instructions:
Reorder procedures in memory to reduce conflict misses.
Use profiling to look at conflicts (with tools they developed).
Data:
Merging arrays: improve spatial locality by using a single array of compound elements instead of two separate arrays.
Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
Loop fusion: combine two independent loops that have the same looping structure and some variable overlap.
Blocking: improve temporal locality by accessing blocks of data repeatedly instead of walking down whole columns or rows (see the blocked example after loop fusion below).

Merging Arrays Example


/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality.

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

Loop Fusion Example


/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Two misses per access to a and c become one miss per access; improves temporal locality.
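Blocking is listed above but no example is shown. A sketch in the same style as the previous examples (B is the blocking factor, chosen so the tiles touched in the inner loops fit in the cache; min() stands for the usual minimum helper):

/* Before: matrix multiply; each row of z is refetched for every i */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* After: iterate over B x B tiles; x[] is assumed zero-initialized,
   since partial sums accumulate across the kk tiles */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k = k+1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

Each B x B tile of z is reused from the cache instead of being refetched from memory, improving temporal locality.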

10. Reducing Misses by Hardware Prefetching of Instructions & Data


Prefetching relies on having extra memory bandwidth that can be used without penalty.
Instruction prefetching: typically, the CPU fetches 2 blocks on a miss, the requested block and the next one. The requested block goes into the instruction cache; the prefetched block goes into an instruction stream buffer.
Data prefetching: the Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages. Prefetching is invoked on 2 successive L2 cache misses to a page if the distance between the missed cache blocks is less than 256 bytes.

11. Reducing Misses by Software Prefetching Data


Data prefetch comes in two forms:
Register prefetch: load the data into a register (HP PA-RISC loads).
Cache prefetch: load the data into the cache (MIPS IV, PowerPC, SPARC v.9).
Special prefetching instructions cannot cause faults; they are a form of speculative execution.
Prefetch instructions take issue time: is the cost of issuing prefetches less than the savings from reduced misses? Wider superscalar processors reduce this issue-bandwidth problem.
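As an illustration only (using GCC's __builtin_prefetch rather than the MIPS IV or PA-RISC instructions named above), a compiler- or programmer-inserted cache prefetch might look like this; the distance of 16 elements is an assumed tuning value:

/* Prefetch a[i + DIST] while summing a[i]; the hint cannot fault, so
   running past the end of the array is safe. Second argument 0 = read. */
#define DIST 16
double sum_with_prefetch(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + DIST], 0);
        sum += a[i];
    }
    return sum;
}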
