Chapter 7 – Memory System Design
Introduction
RAM structure: Cells and Chips
Memory boards and modules
Two-level memory hierarchy
The cache
Virtual memory
The memory as a sub-system of the computer
Computer Systems Design and Architecture Second Edition © 2004 Prentice Hall
Introduction
So far, we’ve treated memory as an array of words limited in size only by the number
of address bits. Life is seldom so easy...
What other issues can you think of that will influence memory design?
In This Chapter we will cover–
•Memory components:
•RAM memory cells and cell arrays
•Static RAM–more expensive, but less complex
•Tree and Matrix decoders–needed for large RAM chips
•Dynamic RAM–less expensive, but needs “refreshing”
•Chip organization
•Timing
•Commercial RAM products: SDRAM and DDR RAM
•ROM–Read only memory
•Memory Boards
•Arrays of chips give more addresses and/or wider words
•2-D and 3-D chip arrays
• Memory Modules
•Large systems can benefit by partitioning memory for
•separate access by system components
•fast access to multiple words
–more–
In This Chapter we will also cover–
• The memory hierarchy: from fast and expensive to slow and cheap
Example: Registers → Cache → Main Memory → Disk
At first, consider just two adjacent levels in the hierarchy
The Cache: High speed and expensive
Kinds: Direct mapped, associative, set associative
Virtual memory–makes the hierarchy transparent
Translate the address from CPU’s logical address to the
physical address where the information is actually stored
Memory management - how to move information back and forth
Multiprogramming - what to do while we wait
The “TLB” helps speed up the address translation process
We will discuss temporal and spatial locality as the basis for the success of
cache and virtual memory techniques.
• Overall consideration of the memory as a subsystem.
Fig. 7.1 The CPU–Main Memory Interface
Sequence of events:
Read:
1. CPU loads the MAR, asserts Read and REQUEST
2. Main Memory transmits the word(s) to the MDR
3. Main Memory asserts COMPLETE.
Write:
1. CPU loads the MAR and MDR, asserts Write and REQUEST
2. Value in MDR is written into address in MAR.
3. Main Memory asserts COMPLETE.
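A minimal C sketch of these two sequences, modeling the MAR, MDR, and control lines
as fields of a struct. All names and the memory size are illustrative, not from the text:

    #include <stdint.h>
    #include <stdbool.h>

    #define MEM_WORDS 1024                    /* assumed memory size */

    typedef struct {
        uint32_t mar;                         /* memory address register */
        uint32_t mdr;                         /* memory data register    */
        bool read, write, request, complete;  /* control lines           */
    } BusInterface;

    static uint32_t main_memory[MEM_WORDS];

    /* Read: CPU loads MAR and asserts Read and REQUEST; memory places the
       word in the MDR and asserts COMPLETE. */
    void memory_read(BusInterface *bus, uint32_t addr) {
        bus->mar = addr;
        bus->read = true;  bus->request = true;   /* step 1 */
        bus->mdr = main_memory[bus->mar];         /* step 2 */
        bus->complete = true;                     /* step 3 */
        bus->read = bus->request = false;
    }

    /* Write: CPU loads MAR and MDR and asserts Write and REQUEST; memory
       stores the MDR value at the MAR address and asserts COMPLETE. */
    void memory_write(BusInterface *bus, uint32_t addr, uint32_t value) {
        bus->mar = addr;  bus->mdr = value;
        bus->write = true;  bus->request = true;  /* step 1 */
        main_memory[bus->mar] = bus->mdr;         /* step 2 */
        bus->complete = true;                     /* step 3 */
        bus->write = bus->request = false;
    }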
The CPU–Main Memory Interface – cont'd.
Additional points:
•if b<w, Main Memory must make w/b b-bit transfers.
•some CPUs allow reading and writing of word sizes <w.
Example: Intel 8088: m = 20, w = 16, s = b = 8.
8- and 16-bit values can be read and written
•If memory is sufficiently fast, or if its response is predictable,
then COMPLETE may be omitted.
•Some systems use separate R and W lines, and omit REQUEST.
Table 7.1 Some Memory Properties
Big-Endian and Little-Endian Storage
When data types having a word size larger than the smallest addressable unit are
stored in memory, the question arises: in which order should the bytes be stored?
For the 2-byte value ABCD (hex):
Big-endian storage: address 0 holds AB, address 1 holds CD.
Little-endian storage: address 0 holds CD, address 1 holds AB.
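A short C check of which convention the host machine uses; the value 0xABCD mirrors
the bytes in the figure, and the program itself is only an illustrative sketch:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint16_t word = 0xABCD;
        uint8_t *bytes = (uint8_t *)&word;    /* view the word byte by byte */

        /* Little-endian machines put 0xCD at the lowest address;
           big-endian machines put 0xAB there. */
        printf("address 0: %02X, address 1: %02X -> %s-endian\n",
               bytes[0], bytes[1], bytes[0] == 0xCD ? "little" : "big");
        return 0;
    }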
Table 7.2 Memory Performance Parameters
(Table columns: Symbol, Definition, Units, Meaning)
Table 7.3 The Memory Hierarchy, Cost, and Performance
Some typical values:
Component:  CPU       Cache     Main Memory   Disk Memory   Tape Memory
Access:     Random    Random    Random        Direct        Sequential
Fig. 7.4 An 8-bit register as a 1D RAM array
The entire register is selected with one select line, and uses one R/W line.
Fig. 7.5 A 4x8 2D Memory Cell Array
A 2-to-4 line decoder, driven by the 2-bit address, selects one of the four 8-bit words.
R/W is common to all cells.
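A software model of the decoder's behavior (not the gate-level circuit): exactly one of
the four select lines is asserted, chosen by the 2-bit address.

    #include <stdint.h>

    /* 2-to-4 line decoder: select[i] becomes 1 only for the row whose
       index equals the 2-bit address. */
    void decode_2to4(uint8_t addr2, int select[4]) {
        for (int row = 0; row < 4; row++)
            select[row] = ((addr2 & 0x3) == row);
    }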
Fig. 7.6 A 64Kx1 bit static RAM (SRAM) chip
A roughly square array fits the IC design paradigm.
This chip requires 24 pins including power and ground, and so will require a 24-pin
package. Package size and pin count can dominate chip cost.
Fig 7.8 Matrix and Tree Decoders
•2-level decoders are limited in size because of gate fan-in.
Most technologies limit fan-in to ~8.
•When decoders must be built with fan-in > 8, additional levels of gates are required.
•Tree and matrix decoders are two ways to design decoders with large fan-in:
Fig 7.9 A 6-Transistor Static RAM Cell
A value is read by precharging the bit lines to a value halfway between a 0 and a 1,
while asserting the word line. This allows the latch to drive the bit lines to the value
stored in the latch.
Fig 7.10 Static RAM Read Timing
Access time from Address – the time required by the RAM array to decode the
address and provide the value to the data bus.
Fig 7.11 Static RAM Write Timing
Write time – the time the data must be held valid in order to decode the address and
store the value in the memory cells.
Fig 7.12 A Dynamic RAM (DRAM) Cell
The capacitor will discharge in 4–15 ms. The cell is refreshed by reading (sensing)
the value on the bit line, amplifying it, and writing it back to recharge the capacitor.
Fig 7.13 DRAM Chip Organization
Figs 7.14, 7.15 DRAM Read and Write Cycles
Typical DRAM read and write operations: the row address and then the column
address are placed on the multiplexed address lines (latched by RAS and CAS),
R/W selects the operation, and the data lines carry the value. The figures show the
access time tA, the data hold time tDHR, and the cycle time tC.
DRAM Refresh and Row Access
•Refresh is usually accomplished by a “RAS-only” cycle. The row address
is placed on the address lines and RAS is asserted. This refreshes the entire row.
CAS is not asserted. The absence of a CAS phase signals the chip that a
row refresh is requested, and thus no data is placed on the external data lines.
•Many chips use “CAS before RAS” to signal a refresh. The chip has an internal
counter, and whenever CAS is asserted before RAS, it is a signal to refresh the row
pointed to by the counter, and to increment the counter (see the sketch below).
•Most DRAM vendors also supply one-chip DRAM controllers that encapsulate
the refresh and other functions.
•Page mode, nibble mode, and static column mode allow rapid access to
the entire row that has been read into the column latches.
•Video RAMs (VRAMs) clock an entire row into a shift register, where it can be
rapidly read out, bit by bit, for display.
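A software model of the internal counter used by “CAS before RAS” refresh, as
described above; the row count and the names are assumptions for illustration:

    #include <stdint.h>

    #define DRAM_ROWS 512                /* assumed number of rows */

    typedef struct {
        uint16_t refresh_row;            /* internal refresh counter */
        /* cell array, sense amplifiers, etc. omitted */
    } DramChip;

    /* Asserting CAS before RAS refreshes the row named by the counter and
       advances the counter; no data appears on the external data lines. */
    void cas_before_ras_refresh(DramChip *chip) {
        /* (sense and rewrite row chip->refresh_row here) */
        chip->refresh_row = (chip->refresh_row + 1) % DRAM_ROWS;
    }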
Fig 7.16 A CMOS ROM Chip
2-D CMOS ROM chip: a row decoder driven by the address selects one word line,
CS enables the output, and the example row shown stores the pattern 1 0 1 0.
Tbl 7.4 Kinds of ROM
(Table columns: ROM Type, Cost, Programmability, Time to program, Time to erase)
Memory Boards and Modules
•There is a need for memories that are larger and wider than a single chip
•Chips can be organized into “boards.”
•Boards may not be actual, physical boards, but may consist of
structured chip arrays present on the motherboard.
•A board or collection of boards makes up a memory module.
•Memory modules:
•Satisfy the processor–main memory interface requirements
•May have DRAM refresh capability
•May expand the total main memory capacity
•May be interleaved to provide faster access to blocks of words.
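A small C sketch of low-order interleaving, where consecutive word addresses fall in
consecutive modules so a block of words can be accessed in parallel; the module
count is an assumption for illustration:

    #include <stdint.h>

    #define MODULE_BITS 2                       /* 4 interleaved modules (assumed) */
    #define NUM_MODULES (1u << MODULE_BITS)

    /* The low-order address bits pick the module; the remaining bits are
       the address within that module. */
    unsigned module_of(uint32_t word_addr)      { return word_addr & (NUM_MODULES - 1); }
    uint32_t addr_in_module(uint32_t word_addr) { return word_addr >> MODULE_BITS; }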
Fig 7.17 General Structure of a Memory Chip
This is a slightly different view of the memory chip than the previous one: an m-bit
address feeds the address decoder, which selects a row of the memory cell array;
CS and R/W control the chip; and an s-bit data I/O multiplexer connects the cell
array to a bi-directional data bus.
Fig 7.18 Word Assembly from Narrow Chips
All chips have common Select (CS), R/W, and Address lines. Each of the p chips
supplies s bits, and their outputs are placed side by side to form a p×s-bit word.
Fig 7.19 Increasing the Number of Words by a Factor of 2^k
The additional k address bits are used to select one of 2^k chips, each one of which
has 2^m words:
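In C, that address split can be sketched as follows; m and k are parameters, as in the
figure:

    #include <stdint.h>

    /* Split an (m+k)-bit word address: the high k bits select one of 2^k
       chips, the low m bits select one of 2^m words within that chip. */
    void split_address(uint32_t addr, unsigned m, unsigned k,
                       uint32_t *chip_select, uint32_t *word_in_chip) {
        *word_in_chip = addr & ((1u << m) - 1);
        *chip_select  = (addr >> m) & ((1u << k) - 1);
    }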
Fig 7.20 Chip Matrix Using Two Chip Selects
Multiple chip select lines are used to replace the last level of gates in this matrix
decoder scheme.
This scheme simplifies the decoding from the use of a (q+k)-bit decoder to using
one q-bit and one k-bit decoder.
Fig 7.21 A 3-D DRAM Array
Fig 7.22 A Memory Module Interface
Must provide–
2/e •Read and Write signals.
•Ready: memory is ready to accept commands.
•Address–to be sent with Read/Write command.
•Data–sent with Write or available upon Read when Ready is asserted.
•Module Select–needed when there is more than one module.
The module's bus interface contains a (k+m)-bit address register (the k bits select
the chip/board, the m bits go to the memory boards and/or chips), a control signal
generator, and a w-bit data register on the data lines.
Control signal generator: for SRAM, it just strobes data on Read and provides
Ready on Read/Write. For DRAM, it also provides CAS, RAS, and R/W, multiplexes
the address, generates refresh signals, and provides Ready.
Fig 7.23 DRAM Module with Refresh Control
As in Fig 7.22, the bus interface has a (k+m)-bit address register (k bits for board
and chip selects) and a w-bit data register. A refresh controller and the memory
timing generator arbitrate (Request/Grant) between external Read/Write requests
and refresh cycles; the timing generator multiplexes the m address bits onto m/2
address lines and drives RAS, CAS, and R/W to the dynamic RAM array, asserting
Ready when the operation completes.
Fig 7.24 Two Kinds of Memory Module Organization
Memory modules are used to allow access to more than one word simultaneously.
•Scheme (a) supports filling a cache line.
•Scheme (b) allows multiple processes or processors to access memory at once.
Fig 7.25 Timing of Multiple Modules on a Bus
If the time to transmit information over the bus, tb, is less than the module cycle
time, tc, it is possible to time-multiplex information transmission to several modules.
Example: store one word of each cache line in a separate module.
For a read:
•return of data from memory
•transmission of completion signal
For a write:
•transmission of data to memory (usually simultaneous with address)
•storage of data into memory cells
•transmission of completion signal
Propagation time for address and command to reach the chip: 120 ns.
Considering Any Two Adjacent Levels of the Memory Hierarchy
Some definitions:
Temporal locality: the property of most programs that if a given memory location is
referenced, it is likely to be referenced again, “soon.”
Spatial locality: the property that if a given memory location is referenced, locations
near it are likely to be referenced soon.
Working set: the set of memory locations referenced over a fixed period of time, or in
a time window.
Notice that temporal and spatial locality both work to assure that the contents of the
working set change only slowly over execution time.
Defining the primary and secondary levels: the primary level is the faster, smaller
level closer to the CPU; the secondary level is the slower, larger one.
Primary and Secondary Levels of the Memory Hierarchy
Speed between levels is defined by latency, the time to access the first word, and
bandwidth, the number of words per second transmitted between levels.
Typical latencies: cache latency, a few clocks; disk latency, 100,000 clocks.
•As working set changes, blocks are moved back/forth through the
hierarchy to satisfy memory access requests.
Fig 7.29 Addressing and Accessing a 2-Level Hierarchy
The computer system, in hardware or software, must perform any address translation
that is required:
Hits and Misses; Paging; Block Placement
Hit: the word was found at the level from which it was requested.
Miss: the word was not found at the level from which it was requested.
(A miss will result in a request for the block containing the word from
the next higher level in the hierarchy.)
Demand paging: pages are moved from disk to main memory only when
a word in the page is requested by the processor.
Recall that disk accesses may require 100,000 clock cycles to complete, due to the
slow access time of the disk subsystem. Once the processor has, through mediation
of the operating system, made the proper request to the disk subsystem, the
processor is free to work on other tasks.
Decisions in Designing a 2-Level Hierarchy
•Translation procedure to translate from system address to primary address.
•Direct access to secondary level–in the cache regime, can the processor
directly access main memory upon a cache miss?
•Write through–can the processor write directly to main memory upon a cache
miss?
•Read through–can the processor read directly from main memory upon a
cache miss as the cache is being updated?
The same 32-bit address is partitioned into two fields, a block field and a word field.
The word field is the offset into the block specified in the block field:
Block Number (26 bits) | Word (6 bits)
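In C, that 26/6 partition is one shift and one mask (a sketch, with the 6-bit word field
as given above):

    #include <stdint.h>

    #define WORD_BITS 6                  /* 64 words per block, per the 26/6 split */

    uint32_t block_number(uint32_t addr) { return addr >> WORD_BITS; }
    uint32_t word_offset(uint32_t addr)  { return addr & ((1u << WORD_BITS) - 1); }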
Advantages and Disadvantages of the Associative Mapped Cache
Advantage
•Most flexible of all–any MM block can go anywhere in the cache.
Disadvantages
•Large tag memory.
•Need to search entire tag memory simultaneously means lots of
hardware.
–next–
Direct mapped caches simplify the hardware by allowing each MM block
to go into only one place in the cache.
Fig 7.34 The Direct Mapped Cache
Fig 7.35 Direct Mapped Cache Operation
1. Decode the group number of the incoming MM address to select the group.
2. Compare the cache tag with the incoming tag.
3. If Match AND Valid,
4. then it is a hit: gate out the cache line,
5. and use the word field to select the desired word.
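A minimal C sketch of this lookup. The field widths (6-bit word, 8-bit group, remaining
bits as tag) and the structure layout are illustrative assumptions; the code models the
steps above in software rather than reproducing the hardware:

    #include <stdint.h>
    #include <stdbool.h>

    #define WORD_BITS  6                      /* 64 words per line (assumed)  */
    #define GROUP_BITS 8                      /* 256 groups (assumed)         */
    #define NUM_GROUPS (1u << GROUP_BITS)
    #define LINE_WORDS (1u << WORD_BITS)

    typedef struct {
        bool     valid[NUM_GROUPS];
        uint32_t tag[NUM_GROUPS];
        uint32_t data[NUM_GROUPS][LINE_WORDS];
    } DirectMappedCache;

    /* Returns true on a hit and stores the requested word in *word_out. */
    bool cache_read(const DirectMappedCache *c, uint32_t addr, uint32_t *word_out) {
        uint32_t word  = addr & (LINE_WORDS - 1);                 /* word field  */
        uint32_t group = (addr >> WORD_BITS) & (NUM_GROUPS - 1);  /* group field */
        uint32_t tag   = addr >> (WORD_BITS + GROUP_BITS);        /* tag field   */

        if (c->valid[group] && c->tag[group] == tag) {            /* match AND valid */
            *word_out = c->data[group][word];                     /* gate out the word */
            return true;                                          /* hit  */
        }
        return false;                                             /* miss */
    }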
Direct Mapped Caches
•The direct mapped cache uses less hardware, but is
much more restrictive in block placement.
Fig 7.36 2-Way Set Associative Cache
Example shows 256 groups, a set of two per group.
Sometimes referred to as a 2-way set associative cache.
Getting Specific: The Intel Pentium Cache
•The Pentium actually has two separate caches–one for instructions and
one for data. Pentium issues 32-bit MM addresses.
Cache Read and Write Policies
•Read and Write cache hit policies
•Write-through–updates both cache and MM upon each write.
•Write back–updates only cache. Updates MM only upon block removal.
•“Dirty bit” is set upon first write to indicate block must be written back.
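A short C sketch contrasting the two write-hit policies; the data structures and names
are illustrative, not from the text:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t data;
        bool     dirty;                 /* used only by write-back */
    } CachedWord;

    static uint32_t main_memory[1 << 16];     /* assumed MM size */

    /* Write-through: update both the cache and MM on every write hit. */
    void write_through(CachedWord *cw, uint32_t addr, uint32_t value) {
        cw->data = value;
        main_memory[addr] = value;
    }

    /* Write-back: update only the cache and set the dirty bit; MM is updated
       later, when the block is removed from the cache. */
    void write_back(CachedWord *cw, uint32_t value) {
        cw->data  = value;
        cw->dirty = true;
    }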
Block Replacement Strategies
•Not needed with direct mapped cache
Cache Performance
Recall the access time, ta = h • tp + (1 − h) • ts, for the primary and secondary levels,
where tp is the cache access time, ts is the main memory access time, and h is the
hit ratio.
Having a model for cache and MM access times, and for the cache line fill time, the
speedup can be calculated once the hit ratio is known.
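A worked example of the formula in C; the times and hit ratio below are illustrative
assumptions, not values from the text:

    #include <stdio.h>

    /* Effective access time for two adjacent levels: ta = h*tp + (1-h)*ts. */
    double effective_access_time(double h, double tp, double ts) {
        return h * tp + (1.0 - h) * ts;
    }

    int main(void) {
        double tp = 2.0, ts = 60.0;     /* assumed cache and MM access times, ns */
        double h  = 0.95;               /* assumed hit ratio */
        double ta = effective_access_time(h, tp, ts);
        printf("ta = %.2f ns, speedup over MM alone = %.2f\n", ta, ts / ta);
        return 0;
    }

With these assumed numbers, ta is 4.9 ns, roughly a 12x speedup over accessing
main memory alone.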
Fig 7.37 Getting Specific: The PowerPC 601 Cache
•The PPC 601 has a unified cache – that is, a single cache for both instructions and
data.
•It is 32KB in size, organized as 64 sets of 8 blocks (8-way set associative), with each
block being eight 8-byte words organized as 2 independent 4-word sectors for
convenience in the updating process.
•A cache line can be updated in two single-cycle operations of 4 words each.
•Normal operation is write back, but write through can be selected on a per line
basis via software. The cache can also be disabled via software.
Virtual Memory
The Memory Management Unit (MMU) is responsible for mapping logical addresses
issued by the CPU to physical addresses that are presented to the Cache and Main
Memory.
•Virtual Address – the address generated from the logical address by the Memory
Management Unit, MMU.
This is the origin of those “bus error” and “segmentation fault” messages.
Fig 7.39 Memory Management by Segmentation
•Notice that moving segments into and out of physical memory will result in gaps
between segments. This is called external fragmentation.
Fig 7.40 Segmentation Mechanism
Fig 7.41 The Intel 8086 Segmentation Scheme
Fig 7.42 Memory Management by Paging
•This figure shows the mapping between virtual memory pages, physical memory
pages, and pages in secondary memory. Page n-1 is not present in physical
memory, but only in secondary memory.
•The MMU manages this mapping. –more–
Fig 7.43 The Virtual to Physical Address Translation Process
•1 table per user per program unit
•One translation per memory access
•Potentially large page table
A page fault will result in 100,000 or more cycles passing before the page
has been brought from secondary storage to MM.
Page Placement and Replacement
Page tables are direct mapped, since the physical page is computed
directly from the virtual page number.
Page tables such as those on the previous slide result in large page
tables, since there must be a page table entry for every page in the
program unit.
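A C sketch of that direct-mapped page-table lookup; the 4 KB page size, the entry
layout, and the names are assumptions for illustration:

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_BITS 12                      /* 4 KB pages (assumed)        */
    #define NUM_PAGES (1u << 20)              /* 32-bit virtual space / 4 KB */

    typedef struct {
        bool     present;                     /* page is in main memory       */
        uint32_t frame;                       /* physical page (frame) number */
    } PageTableEntry;

    /* One entry per virtual page: the virtual page number indexes the table
       directly. Returns false on a page fault. */
    bool translate(const PageTableEntry table[NUM_PAGES],
                   uint32_t virtual_addr, uint32_t *physical_addr) {
        uint32_t vpn    = virtual_addr >> PAGE_BITS;
        uint32_t offset = virtual_addr & ((1u << PAGE_BITS) - 1);

        if (!table[vpn].present)
            return false;                     /* page fault: OS must bring the page in */
        *physical_addr = (table[vpn].frame << PAGE_BITS) | offset;
        return true;
    }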
Fast Address Translation: Regaining Lost Ground
•The concept of virtual memory is very attractive, but leads to considerable
overhead:
•There must be a translation for every memory reference
•There must be two memory references for every program reference:
•one to retrieve the page table entry,
•one to retrieve the data.
•Most caches are addressed by physical address, so there must be a virtual to
physical translation before the cache can be accessed.
The answer: a small cache in the processor that retains the last few virtual
to physical translations: A Translation Lookaside Buffer, TLB.
The TLB contains not only the virtual to physical translations, but also the
valid, dirty, and protection bits, so a TLB hit allows the processor to access
physical memory directly.
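A C sketch of a fully associative TLB lookup consistent with the description above; the
entry count and field names are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 64                    /* small, assumed size */

    typedef struct {
        bool     valid, dirty;                /* status/protection bits live here too */
        uint32_t vpn;                         /* virtual page number   */
        uint32_t pfn;                         /* physical frame number */
    } TlbEntry;

    /* Search every entry for the virtual page number. On a hit the frame
       number is available at once; on a miss the page table must be used. */
    bool tlb_lookup(const TlbEntry tlb[TLB_ENTRIES], uint32_t vpn, uint32_t *pfn) {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *pfn = tlb[i].pfn;
                return true;                  /* TLB hit */
            }
        }
        return false;                         /* TLB miss */
    }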
Fig 7.44 TLB Structure and Operation
Fig 7.45 Operation of the Memory Hierarchy
Fig 7.46 The PowerPC 601 MMU Operation
“Segments” are actually more akin to large (256 MB) blocks.
Fig 7.47 I/O Connection to a Memory with a Cache
•The memory system is quite complex, and affords many possible tradeoffs.
•The only realistic way to choose among these alternatives is to study a typical
workload, using either simulations or prototype systems.
•Instruction and data accesses usually have different patterns.
•It is possible to employ a cache at the disk level, using the disk hardware.
•Traffic between MM and disk is I/O, and Direct Memory Access (DMA) can be used
to speed the transfers: