
Chapter 5

Exploiting Memory Hierarchy


So Far in CDA 4205
C Programs → RISC-V Assembly → Datapath → Caches → Memory

C program:

    #include <stdlib.h>

    int fib(int n) {
        if (n < 2) return n;   /* base case, needed for the recursion to terminate */
        return fib(n-1) + fib(n-2);
    }

RISC-V assembly:

    ld   x5, 4(x10)
    addi x6, x5, 3
    beq  x6, x7, foo

Datapath

Caches

Memory
What Happens at Boot?
When the computer switches ON, the CPU executes instructions from some start address (stored in Flash ROM, memory mapped into the address space).
PC = 0x2000 (some default value): the code at this address copies the firmware into regular memory and jumps into it.


What Happens at Boot?
1. BIOS*: Find a storage device and load the first sector (block of data)
2. Bootloader (stored on, e.g., disk): Load the OS kernel from disk into a location in memory and jump into it
3. OS Boot: Initialize services, drivers, etc.
4. Init: Launch an application that waits for input in a loop (e.g., Terminal/Desktop/...)

*BIOS: Basic Input Output System


Principle of Locality
Programs access a small proportion of their address space at any time
Temporal locality
If a data location is referenced, it will tend to be referenced again soon (i.e., the program accesses the same set of memory locations for a period of time)
Better to store frequently accessed values nearer the CPU
e.g., instructions in a loop

Spatial locality
If a data location is referenced, data locations with nearby addresses will tend to be referenced soon
Useful to pre-load data that is close (in address) to other recently accessed data
e.g., sequential instruction access, array data
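
A toy C function (my illustration, not from the slides) showing both kinds of locality:

    /* sum and i are reused on every iteration (temporal locality);
       a[0], a[1], ... are touched at consecutive addresses (spatial locality). */
    int sum_array(const int *a, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }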
Great Idea #3: Principle of Locality/Memory Hierarchy
Taking Advantage of Locality
Memory hierarchy
Store everything on disk
Copy recently accessed (and nearby) items from disk to
smaller DRAM memory
Main memory
Copy more recently accessed (and nearby) items from
DRAM to smaller SRAM memory
Cache memory attached to CPU
Memory Hierarchy Levels
Block (aka line): unit of copying
May be multiple words

Upper level: closer to the processor, smaller, and faster than the lower level, since the upper level uses more expensive technology.

If accessed data is present in the upper level
Hit: access satisfied by upper level
Hit ratio: hits/accesses
If accessed data is absent
Miss: block copied from lower level
Time taken: miss penalty
Miss ratio: misses/accesses = 1 − hit ratio
Then accessed data supplied from upper level
Memory Hierarchy Levels
By implementing the memory system as a hierarchy, the program has the illusion of a memory that is as large as the largest level of the hierarchy but can be accessed as if it were all built from the fastest memory.

Flash memory has replaced disks in many personal mobile devices, and may lead to a new level in the storage hierarchy for desktop and server computers.
SRAM Technology
SRAMs are simply integrated circuits that are memory arrays with (usually) a single access port that can provide either a read or a write. SRAMs have a fixed access time to any piece of information, though the read and write access times may differ.

SRAMs don't need to refresh, so the access time is very close to the cycle time. SRAMs typically use six to eight transistors per bit to prevent the information from being disturbed when it is read. SRAM needs only minimal power to retain the charge in standby mode.

In the past, most PCs and server systems used separate SRAM chips for their primary, secondary, or even tertiary caches.

Today, thanks to Moore's Law, all levels of caches are integrated onto the processor chip, so the market for separate SRAM chips has nearly evaporated.
DRAM Technology
Data stored as a charge in a capacitor
Single transistor used to access the charge
Must periodically be refreshed
Read contents and write back
Advanced DRAM Organization
Bits in a DRAM are organized as a rectangular array
DRAM accesses an entire row
Burst mode: supply successive words from a row with
reduced latency
Double data rate (DDR) DRAM
Get twice as much bandwidth based on the clock rate and
the data width
Quad data rate (QDR) DRAM
Separate DDR inputs and outputs
DRAM Performance Factors
Row buffer
Allows several words to be read and refreshed in parallel

Synchronous DRAM
Allows for consecutive accesses in bursts without needing
to send each address
Improves bandwidth

DRAM banking
Allows simultaneous access to multiple DRAMs
Improves bandwidth
Flash Types
NOR flash: bit cell like a NOR gate
Random read/write access
Used for instruction memory in embedded systems

NAND flash: bit cell like a NAND gate
Denser (bits/area), but block-at-a-time access
Cheaper per GB
Not suitable for direct RAM or disk replacement

Wear levelling: remap data to less-used blocks
What About SSD?
Made with transistors
Nothing mechanical that turns
Fast access to all locations, regardless of address
Still much slower than registers, DRAM
Read/write blocks, not bytes
Potential reliability issues
Memory Terms
Memory hierarchy: A structure that uses multiple levels of memories; as the distance from the processor increases, the size of the memories and the access time both increase.
Block (or line): The minimum unit of information that can be either present or not present in a cache.
Hit rate: The fraction of memory accesses found in a level of the memory hierarchy.
Miss rate: The fraction of memory accesses not found in a level of the memory hierarchy.
Hit time: The time required to access a level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss.
Miss penalty: The time required to fetch a block into a level of the memory hierarchy from the lower level, including the time to access the block, transmit it from one level to the other, insert it in the level that experienced the miss, and then pass the block to the requestor.
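
These terms combine into the standard average memory access time figure, AMAT = hit time + miss rate × miss penalty (standard in the textbook, though not stated on this slide). A minimal sketch in C:

    #include <stdio.h>

    /* AMAT = hit time + miss rate * miss penalty */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* illustrative numbers: 1 ns hit time, 5% miss rate, 100 ns miss penalty */
        printf("AMAT = %.1f ns\n", amat(1.0, 0.05, 100.0));  /* prints 6.0 */
        return 0;
    }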
The Basics of Caches
Cache: represents the level of the memory hierarchy between the
processor and main memory in the first commercial computer to have
this extra level. The memories in the Datapath are simply replaced by
caches.
Cache Memory
The level of the memory hierarchy closest to the CPU
Given accesses X1, …, Xn−1, Xn (references)

How do we know if the data is present?
Where do we look?

If each word can go in exactly one place in the cache, then it is straightforward to find the word if it is in the cache.

The simplest way to assign a location in the cache for each word in memory is to assign the cache location based on the address of the word in memory.
Direct Mapped Cache
A cache structure in which each memory location is mapped to exactly one location in the cache.
Location determined by address
Direct mapped: only one choice
(Block address) modulo (#Blocks in cache)
#Blocks is a power of 2
Use low-order address bits

Because there are eight words in the cache, an address X maps to the direct-mapped cache word X modulo 8.

The low-order log2(8) = 3 bits are used as the cache index.

Addresses 00001two, 01001two, 10001two, and 11001two all map to entry 001two of the cache, while addresses 00101two, 01101two, 10101two, and 11101two all map to entry 101two of the cache.
Tags and Valid Bits
Valid bit: A field in the tables of a memory hierarchy that indicates that the associated block in the hierarchy contains valid data.

Tag: A field in a table used for a memory hierarchy that contains the address information required to identify whether the associated block in the hierarchy corresponds to a requested word.

How do we know which particular block is stored in a cache location?
Store block address as well as the data
Actually, only need the high-order bits
Called the tag
What if there is no data in a location?
Valid bit: 1 = present, 0 = not present
Initially 0
Address Subdivision
This cache holds 1024 words, or 4 KiB.

The tag from the cache is compared against the upper portion of the address to determine whether the entry in the cache corresponds to the requested address.

Because the cache has 2^10 (or 1024) words and a block size of one word, 10 bits are used to index the cache, leaving 32 − 10 − 2 = 20 bits to be compared against the tag.

The size of the tag field follows from N = K + M + 2 bits:
N = length of the virtual address
K bits: tag field in each cache entry
M bits: middle of the address that points to one cache entry (the index)
2 bits (byte offset): not used to address data in the cache

If the tag and upper 20 bits of the address are equal and the valid bit is on, then the request hits in the cache, and the word is supplied to the processor. Otherwise, a miss occurs.
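
A small C sketch of this address split for the 4 KiB, one-word-block cache (the helper names are my own):

    #include <stdint.h>

    #define OFFSET_BITS 2    /* byte offset within a 4-byte word (not used to index data) */
    #define INDEX_BITS  10   /* 2^10 = 1024 one-word blocks */

    static inline uint32_t cache_index(uint32_t addr) {
        return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    }

    static inline uint32_t cache_tag(uint32_t addr) {
        return addr >> (OFFSET_BITS + INDEX_BITS);   /* upper 32 - 10 - 2 = 20 bits */
    }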
Initial state of the cache after power-on
8 blocks, 1 word/block, direct mapped

The cache is initially empty, with all valid bits (the V entry in the cache) turned off (N). Since the cache is empty, several of the first references are misses.

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    N
111    N
After handling a miss of address (10110two)
The processor requests the following addresses: 10110two (miss), 11010two (miss), 10110two (hit), 11010two (hit), 10000two (miss), 00011two (miss), 10000two (hit), 10010two (miss), and 10000two (hit).

Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Miss      110

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
After handling a miss of address (11010two)
The cache contents are shown after each miss in the sequence has been handled. When address 10010two (18) is referenced, the entry for address 11010two (26) must be replaced, and a reference to 11010two will cause a subsequent miss. The tag field will contain only the upper portion of the address. The full address of a word contained in cache block i with tag field j for this cache is j × 8 + i, or equivalently the concatenation of the tag field j and the index i. For example, in the final cache state shown later, index 010two has tag 10two and corresponds to address 10010two.

Word addr  Binary addr  Hit/miss  Cache block
26         11 010       Miss      010

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
After hits of addresses (10110two) and (11010two)
The cache state is unchanged by these hits.

Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Hit       110
26         11 010       Hit       010

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
After handling misses of addresses (10000two) and (00011two)

Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Miss      000
3          00 011       Miss      011
16         10 000       Hit       000

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
After handling a miss of address (10010two)

Word addr  Binary addr  Hit/miss  Cache block
18         10 010       Miss      010

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  10   Mem[10010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
Summary of the Reference Sequence

Decimal address  Binary address  Hit or miss   Assigned cache block (where found or placed)
22               10110two        miss (5.6b)   (10110two mod 8) = 110two
26               11010two        miss (5.6c)   (11010two mod 8) = 010two
22               10110two        hit           (10110two mod 8) = 110two
26               11010two        hit           (11010two mod 8) = 010two
16               10000two        miss (5.6d)   (10000two mod 8) = 000two
3                00011two        miss (5.6e)   (00011two mod 8) = 011two
16               10000two        hit           (10000two mod 8) = 000two
18               10010two        miss (5.6f)   (10010two mod 8) = 010two
16               10000two        hit           (10000two mod 8) = 000two
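
As a cross-check, a short C simulation (my own sketch, not from the slides) of the 8-block direct-mapped cache reproduces the hit/miss column above:

    #include <stdio.h>
    #include <stdbool.h>

    int main(void) {
        bool valid[8] = {false};
        unsigned tag[8] = {0};
        unsigned seq[] = {22, 26, 22, 26, 16, 3, 16, 18, 16};  /* word addresses */
        for (int i = 0; i < 9; i++) {
            unsigned index = seq[i] % 8;  /* low-order 3 bits          */
            unsigned t     = seq[i] / 8;  /* remaining high-order bits */
            bool hit = valid[index] && tag[index] == t;
            if (!hit) { valid[index] = true; tag[index] = t; }  /* fill on miss */
            printf("addr %2u -> index %u: %s\n", seq[i], index, hit ? "hit" : "miss");
        }
        return 0;
    }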
Block Size Considerations
Larger blocks should reduce miss rate
Due to spatial locality

But in a fixed-sized cache
Larger blocks → fewer of them
More competition → increased miss rate
Larger blocks → pollution

Larger miss penalty
Can override benefit of reduced miss rate
Early restart and critical-word-first can help
Cache Misses
A request for data from the cache that cannot be filled because
the data is not present in the cache.
On cache hit, CPU proceeds normally
On cache miss
Stall the CPU pipeline
Fetch block from next level of hierarchy
Instruction cache miss
Restart instruction fetch
Data cache miss
Complete data access
Write Policy
Write-through
Update both upper and lower levels
Simplifies replacement, but may require write buffer
Write-back
Update upper level only
Update lower level when block is replaced
Need to keep more state
Virtual memory
Only write-back is feasible, given disk write latency
Write-Through
A scheme in which writes always update both the cache and the next lower level of the memory hierarchy, ensuring that data is always consistent between the two.
On data-write hit, could just update the block in cache
But then cache and memory would be inconsistent
Write through: also update memory
But makes writes take longer
Solution: write buffer
Holds data waiting to be written to memory
CPU continues immediately
Only stalls on write if write buffer is already full
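
A minimal write-through-plus-write-buffer sketch in C (the 4-entry buffer and toy memory are my illustrative assumptions; the cache update itself is elided):

    #define BUF_SIZE 4
    typedef struct { unsigned addr, data; } WriteReq;
    static WriteReq buf[BUF_SIZE];
    static int head = 0, count = 0;
    static unsigned memory[1024];   /* toy word-addressed backing memory */

    static void drain_one(void) {   /* retire the oldest buffered write */
        WriteReq r = buf[head];
        memory[r.addr % 1024] = r.data;
        head = (head + 1) % BUF_SIZE;
        count--;
    }

    static void cpu_write(unsigned addr, unsigned data) {
        /* write-through: update the cache block here (elided), then buffer
           the memory update so the CPU need not wait for slow memory */
        while (count == BUF_SIZE)   /* stall only if the buffer is full */
            drain_one();
        buf[(head + count) % BUF_SIZE] = (WriteReq){addr, data};
        count++;                    /* CPU continues immediately */
    }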
Write-Back
Alternative: On data-write hit, just update the block in cache
Keep track of whether each block is dirty

When a dirty block is replaced
Write it back to memory
Can use a write buffer to allow the replacing block to be read first
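
A sketch of the dirty-bit bookkeeping on replacement (the block layout and memory model are my own illustrative assumptions):

    #include <stdbool.h>
    #include <string.h>

    #define WORDS_PER_BLOCK 4
    typedef struct {
        bool valid, dirty;
        unsigned tag;
        unsigned data[WORDS_PER_BLOCK];
    } CacheBlock;

    static unsigned memory[1 << 16];   /* toy word-addressed backing memory */

    /* Replace the victim in blk: write it back only if dirty, then fetch the
       new block. old_base/new_base are word addresses of each block's first word. */
    static void replace_block(CacheBlock *blk, unsigned old_base, unsigned new_base) {
        if (blk->valid && blk->dirty)
            memcpy(&memory[old_base], blk->data, sizeof blk->data);  /* write-back */
        memcpy(blk->data, &memory[new_base], sizeof blk->data);      /* fetch */
        blk->tag   = new_base / WORDS_PER_BLOCK;  /* simplistic block-number tag */
        blk->valid = true;
        blk->dirty = false;   /* clean until the CPU writes it again */
    }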
Write Allocation
What should happen on a write miss?
Alternatives for write-through
Allocate on miss: fetch the block
Write around: don't fetch the block
Since programs often write a whole block before reading it (e.g., initialization)
For write-back
Usually fetch the block
Associative Caches
Fully associative
A cache structure in which a block can be placed in any location in the cache

Set associative
A cache that has a fixed number of locations (at least two) where each block can be
placed.

Direct Mapped - There is a direct mapping from any block address in memory to a single
location in the upper level of the hierarchy.
Associative Cache Example
In direct-mapped placement, there is only one cache block where memory block 12 can be found, and that block is given by (12 modulo 8) = 4.

In a two-way set-associative cache, there would be four sets, and memory block 12 must be in set (12 modulo 4) = 0; the memory block could be in either element of the set.

In a fully associative placement, the memory block for block address 12 can appear in any of the eight cache blocks.
Spectrum of Associativity
For a cache with 8 entries
The total size of the cache in blocks is equal to the number of sets times the associativity.

Thus, for a fixed cache size, increasing the associativity decreases the number of sets while increasing the number of elements per set. With eight blocks, an eight-way set-associative cache is the same as a fully associative cache.
Associativity Example
Compare 4-block caches: direct mapped, 2-way set associative, fully associative
Block access sequence: 0, 8, 0, 6, 8

Block address  Cache block
0              (0 modulo 4) = 0
6              (6 modulo 4) = 2
8              (8 modulo 4) = 0

Direct mapped

Block address  Cache index  Hit/miss  Cache content after access (indexes 0..3)
0              0            miss      [0]=Mem[0]
8              0            miss      [0]=Mem[8]
0              0            miss      [0]=Mem[0]
6              2            miss      [0]=Mem[0], [2]=Mem[6]
8              0            miss      [0]=Mem[8], [2]=Mem[6]
Associativity Example
2-way set associative

Block address  Cache set
0              (0 modulo 2) = 0
6              (6 modulo 2) = 0
8              (8 modulo 2) = 0

Block address  Cache index  Hit/miss  Set 0 content after access
0              0            miss      Mem[0]
8              0            miss      Mem[0], Mem[8]
0              0            hit       Mem[0], Mem[8]
6              0            miss      Mem[0], Mem[6]
8              0            miss      Mem[8], Mem[6]
(Set 1 remains empty.)

Fully associative

Block address  Hit/miss  Cache content after access
0              miss      Mem[0]
8              miss      Mem[0], Mem[8]
0              hit       Mem[0], Mem[8]
6              miss      Mem[0], Mem[8], Mem[6]
8              hit       Mem[0], Mem[8], Mem[6]
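
For the 2-way case, a compact C simulation (my own sketch; it assumes LRU replacement within each set, consistent with the table above):

    #include <stdio.h>

    int main(void) {
        int valid[2][2] = {{0}}, tag[2][2] = {{0}};
        int lru[2] = {0};                    /* lru[s]: which way to evict next */
        int seq[] = {0, 8, 0, 6, 8};
        for (int i = 0; i < 5; i++) {
            int set = seq[i] % 2, t = seq[i] / 2;
            int way = -1;
            for (int w = 0; w < 2; w++)      /* search both ways of the set */
                if (valid[set][w] && tag[set][w] == t) way = w;
            int hit = (way >= 0);
            if (!hit) {                      /* miss: fill the LRU way */
                way = lru[set];
                valid[set][way] = 1;
                tag[set][way] = t;
            }
            lru[set] = 1 - way;              /* the other way is now LRU */
            printf("block %d -> set %d: %s\n", seq[i], set, hit ? "hit" : "miss");
        }
        return 0;
    }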
Set Associative Cache Organization
The comparators determine which element of the selected set (if any) matches the tag. The output of the comparators is used to select the data from one of the four blocks of the indexed set, using a multiplexor with a decoded select signal.

In some implementations, the Output enable signals on the data portions of the cache RAMs can be used to select the entry in the set that drives the output. The Output enable signal comes from the comparators, causing the element that matches to drive the data outputs. This organization eliminates the need for the multiplexor.
Block Placement
Determined by associativity
Direct mapped (1-way associative)
One choice for placement
n-way set associative
n choices within a set
Fully associative
Any location
Higher associativity reduces miss rate
Increases complexity, cost, and access time
Finding a Block

Associativity          Location method                        Tag comparisons
Direct mapped          Index                                  1
n-way set associative  Set index, then search entries         n
                       within the set
Fully associative      Search all entries                     #entries
                       Full lookup table                      0

Hardware caches
Reduce comparisons to reduce cost
Virtual memory
Full table lookup makes full associativity feasible
Benefit in reduced miss rate
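
The "search entries within the set" row corresponds to a loop like this C sketch (a software illustration; real hardware performs the n comparisons in parallel, one comparator per way):

    /* Compare the request tag against every way of the selected set:
       n tag comparisons for an n-way set-associative cache. */
    int find_way(const unsigned tags[], const int valid[], int n, unsigned req_tag) {
        for (int w = 0; w < n; w++)
            if (valid[w] && tags[w] == req_tag)
                return w;      /* hit: index of the matching way */
        return -1;             /* miss */
    }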
Cache Design Trade-offs

Design change           Effect on miss rate         Negative performance effect
Increase cache size     Decrease capacity misses    May increase access time
Increase associativity  Decrease conflict misses    May increase access time
Increase block size     Decrease compulsory misses  Increases miss penalty. For very
                                                    large block size, may increase
                                                    miss rate due to pollution.

Terminology:
Compulsory miss (aka cold-start miss): first time a block is accessed
Capacity miss: replaced block is later accessed again; occurs due to finite cache size
Conflict miss (aka collision miss): due to competition for entries in a set; would not occur in a fully associative cache of the same size
Interface Signals

CPU ↔ Cache              Cache ↔ Memory
Read/Write               Read/Write
Valid                    Valid
Address (32 bits)        Address (32 bits)
Write Data (32 bits)     Write Data (128 bits)
Read Data (32 bits)      Read Data (128 bits)
Ready                    Ready

Multiple cycles per access