ACA - Memory
Memory Hierarchy:
Objective: to match the processor speed with the rate of information transfer (the bandwidth) of the memory at the lowest level, at a reasonable cost.
A major difference exists between the hierarchical memory structures of the two kinds of systems because of the different memory reference characteristics of a multiprogrammed uniprocessor system and a parallel processor system. In a parallel processing system, concurrent memory requests come from different processors at the same level. A conflict occurs when two or more of these concurrent requests reference the same section of memory at the same level, and such conflicts degrade the performance of the system. This type of conflict can be reduced by partitioning the memory at a given level into several modules to achieve some degree of concurrent access.
The parameters of interest at each level are:
Access time t_a
Capacity of storage S
Cost per bit C
Design Goals:
For fast memories:
t_a is small
S is small
C is high
For high capacity memories:
t_a is large
S is large
C is low
[Figure: the memory hierarchy pyramid - CPU registers at the top, then cache, main memory, and disk; cost per bit and speed increase toward the top, capacity increases toward the bottom.]
Random Access Memory (RAM) - in RAM the access time t_a of a memory word is independent of its location.
For block-access devices, t_a is the average access time and t_b is the block transfer time. For drums and fixed-head disks, t_a is the time it takes for the initial word of the desired block to rotate into position. For movable-arm disks, an additional "seek time" t_s is required to move the arms into track position.
The memory hierarchy is structured in such a way that level i is "higher" than level i+1.
If c_i, t_i and s_i are the cost per byte, average access time and total memory size at level i respectively, then
c_i > c_{i+1}
t_i < t_{i+1}
s_i < s_{i+1}
for all i >= 1.
[Figure: a typical memory hierarchy - the CPU is served by a cache and the main memory; an I/O processor connects the main memory to the auxiliary memory (magnetic disks and magnetic tape).]
[Figure: memory hierarchy of a parallel processor system - processor P1 accesses level 1 (M1,1) directly, level 2 (main memory modules M2,0 - M2,3) through a processor-memory interconnection network, and level 3 (M3,1, auxiliary storage) through channels; the access times satisfy t1 < t2 < t3.]
M2,0 - M2,3: main memory modules, designed either with Metal Oxide Semiconductor (MOS) or with ferromagnetic (core) technology. The unit of information transfer between main memory and cache is a block of contiguous information. The primary memory can be extended with Large Core Storage (LCS) or with Extended Core Storage (ECS).
The processor usually references an item in the memory by providing the address of that item. The address space of level i is a subset of that of level i+1, but an address A_k of level i does not necessarily equal the address A_k at level i+1. Any information that exists in level i also exists in level i+1.
Data Inconsistency or Coherence problem: Some of the information in level i may be more current than that in level i+1. A data consistency problem arises between adjacent levels because they hold different copies of the same information. Usually level i+1 is eventually updated with the modified information from level i. A data consistency problem may also exist between the local memories and caches when two concurrent processes, executing on separate processors, interact via one or more shared variables: one process may have the updated value of a shared variable in its local memory, while the other process continues with the old value in its own local memory.
Hit Ratio & Miss Ratio: The Hit Ratio (H) is the probability of finding the requested information in the memory of a given level. In general, H depends on the granularity of information transfer, the capacity of the memory at that level, the management strategy, etc. Usually H is most sensitive to the memory size s.
Since copies of information in level i are assumed to exist in all levels greater than i, the probability of a hit at level i and of misses at all levels below i is
h_i = H(s_i) - H(s_{i-1})
where h_i is the access frequency at level i and indicates the relative number of successful accesses to level i. The missing-item fault frequency at level i is then
f_i = 1 - h_i
The effective access time T_i from the processor to the i-th level of the memory hierarchy is the sum of the individual average access times t_k of each level from k = 1 to i:

T_i = \sum_{k=1}^{i} t_k

t_k includes the wait time due to memory conflicts at level k and the delay in the switching network between levels k-1 and k. The degree of conflict depends on the number of processors.
The effective access time for each memory reference in an n-level memory hierarchy is

T = \sum_{i=1}^{n} h_i T_i
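As an illustration (not part of the notes), a minimal C sketch that evaluates these two formulas for a hypothetical three-level hierarchy; the hit ratios H(s_i) and the per-level times t_k are assumed numbers:

```c
#include <stdio.h>

/* A minimal sketch: computes h_i = H(s_i) - H(s_(i-1)) and the effective
 * access time T = sum over i of h_i * T_i for an assumed 3-level hierarchy.
 * The hit ratios H and per-level times t_k are illustrative values only. */
int main(void) {
    double H[4] = {0.0, 0.90, 0.99, 1.00};  /* cumulative hit ratio H(s_i); H(s_0) = 0 */
    double t[4] = {0.0, 1.0, 10.0, 100.0};  /* t_k: access time of level k, in cycles  */
    double T_cum = 0.0, T_eff = 0.0;

    for (int i = 1; i <= 3; i++) {
        double h_i = H[i] - H[i - 1];  /* hit at level i, miss at all lower levels */
        T_cum += t[i];                 /* T_i = t_1 + t_2 + ... + t_i              */
        T_eff += h_i * T_cum;          /* T = sum over i of h_i * T_i              */
    }
    printf("Effective access time T = %.2f cycles\n", T_eff);  /* 3.00 here */
    return 0;
}
```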
Miss Penalty: The extra time needed to bring the desired information into the cache is called the miss penalty. In general, the miss penalty is the time needed to bring a block of data from a slower unit in the memory hierarchy to a faster unit. The miss penalty is reduced if efficient mechanisms for transferring data between the various units of the hierarchy are implemented.
t_ave = hC + (1-h)M
where
C = time to access the cache
M = miss penalty (the time to access the slower memory on a miss)
h = hit ratio
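A quick worked sketch of this formula; the values of h, C and M below are assumed for illustration:

```c
#include <stdio.h>

/* Evaluates t_ave = h*C + (1-h)*M with assumed illustrative values. */
int main(void) {
    double h = 0.95;   /* cache hit ratio (assumed)                       */
    double C = 1.0;    /* cache access time, in cycles                    */
    double M = 50.0;   /* miss penalty: time to fetch from main memory    */
    double t_ave = h * C + (1.0 - h) * M;
    printf("t_ave = %.2f cycles\n", t_ave);  /* 0.95*1 + 0.05*50 = 3.45 */
    return 0;
}
```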
VIRTUAL MEMORY: Virtual memory gives the user the illusion of a single, large, directly addressable and fast main memory. A virtual memory system allows its users to use a large addressable memory space without worrying about the size limitations of the physical main memory.
To implement a virtual memory system, the main memory is divided into fixed-size contiguous areas called page frames. In addition, the online disk storage is divided into pieces of the same size, called either pages or segments. Only those pages or segments of a program that are actually needed at a particular point in the processing need be in primary storage; the remaining pages are kept on the disk, from where they can be loaded into main memory as and when required. The binary addresses that the processor issues for either instructions or data are called virtual or logical addresses. These addresses are translated into physical addresses by a combination of hardware and software components. If a virtual address refers to a part of the program or data space that is currently in physical memory, the contents of that address are accessed immediately; otherwise they are brought into physical memory before use.
[Figure: the processor issues virtual addresses to the MMU; the MMU supplies physical addresses to the cache and main memory; data moves between main memory and disk storage by DMA transfer.]
VIRTUAL MEMORY ORGANIZATION
[Figure: a 64K virtual address space (addresses 0 - 65535) and a 4K main memory (addresses 0 - 4095).]
Virtual addresses 4096 to 8191 are mapped onto main memory addresses 0 - 4095.
On a machine with virtual storage, referencing an address that is not currently in main memory is not an error. The following steps take place: the referenced page is brought from the disk into main memory (replacing another page if necessary), and the address map is updated.
Virtual address space (virtual page -> page frame):
0-4K    -> 2
4K-8K   -> 1
8K-12K  -> 6
12K-16K -> 0
16K-20K -> 4
20K-24K -> 3
24K-28K -> X
28K-32K -> X
32K-36K -> X
36K-40K -> 5
40K-44K -> X
44K-48K -> 7
48K-52K -> X
52K-56K -> X
56K-60K -> X
60K-64K -> X
(X = virtual page not mapped; the physical memory consists of eight 4K page frames, covering physical addresses 0-4K through 28K-32K.)
In the following example a virtual address 8196 (0010000000000100 in binary) is mapped by the MMU.
Virtual address:  0010 0000 0000 0100  (virtual page 2, offset 4)
Physical address:  110 0000 0000 0100  (page frame 6, offset 4)
In this example, the 16-bit virtual address is taken as a 4-bit virtual page number and a 12-bit address within the selected page. Virtual page 2 is mapped onto page frame 6 (binary 110), so virtual address 8196 translates to physical address 24580.
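A minimal C sketch of this translation, using the page table from the figure above (the -1 marker for unmapped pages is an assumption for illustration):

```c
#include <stdio.h>
#include <stdint.h>

/* The page table shown above: index = virtual page, value = page frame;
 * -1 marks the pages labelled X (not in main memory). */
static const int page_table[16] = {2, 1, 6, 0, 4, 3, -1, -1,
                                   -1, 5, -1, 7, -1, -1, -1, -1};

int translate(uint16_t vaddr) {
    unsigned page   = vaddr >> 12;     /* high-order 4 bits: virtual page    */
    unsigned offset = vaddr & 0x0FFF;  /* low-order 12 bits: offset in page  */
    int frame = page_table[page];
    if (frame < 0) return -1;          /* page fault: page not in memory     */
    return (frame << 12) | offset;     /* physical = frame number + offset   */
}

int main(void) {
    printf("%d\n", translate(8196));   /* page 2, offset 4 -> frame 6 -> 24580 */
    return 0;
}
```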
PAGE FAULT: So far it has been assumed that the referenced virtual page is in main memory. This assumption is not always true, because there is not enough room in main memory for all virtual pages. When a reference is made to an address on a page that is not present in main memory, a page fault occurs.
If the page size is n words, the average amount of space wasted in the last page of a program by fragmentation will be n/2 - a situation that suggests using a small page size to minimize waste. However, a small page size means many pages, and hence a large page table. If the page table is maintained in hardware, a large number of registers is required; in addition, more time is needed to load and save these registers whenever a program is started or stopped. Furthermore, small pages make inefficient use of secondary memories with long access times, such as disks, because the transfer time is usually shorter than the combined seek and rotational delay.
Optimal Page Replacement Algorithm: Label each page with the number of instructions that will be executed before that page is next referenced; the page with the highest label should be removed. If one page will not be used for the next 8 million instructions and another page will not be used for the next 6 million instructions, removing the former pushes the next page fault as far into the future as possible.
Least Recently Used (LRU) Page Replacement Algorithm: This is a good approximation to the optimal algorithm. It is based on the observation that pages that have been heavily used in the last few instructions will probably be heavily used again in the next few, while pages that have not been used for ages will probably remain unused for a long time. When a page fault occurs, throw out the page that has been unused for the longest time.
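A minimal sketch of LRU in C (not from the notes; the 3-frame memory and the reference string are assumed for illustration):

```c
#include <stdio.h>

#define FRAMES 3
#define NREFS  6

/* LRU sketch: each resident page records the time of its last use; on a
 * page fault the page with the oldest timestamp is evicted. */
int main(void) {
    int page[FRAMES]     = {-1, -1, -1};  /* -1 means the frame is empty */
    int last_use[FRAMES] = {-1, -1, -1};
    int refs[NREFS] = {0, 1, 2, 0, 3, 1}; /* an assumed reference string */
    int faults = 0;

    for (int t = 0; t < NREFS; t++) {
        int slot = -1;
        for (int i = 0; i < FRAMES; i++)          /* look for a hit        */
            if (page[i] == refs[t]) slot = i;
        if (slot < 0) {                           /* page fault            */
            faults++;
            slot = 0;                             /* pick the LRU frame    */
            for (int i = 1; i < FRAMES; i++)
                if (last_use[i] < last_use[slot]) slot = i;
            page[slot] = refs[t];
        }
        last_use[slot] = t;                       /* record the use time   */
    }
    printf("%d page faults\n", faults);           /* 5 for this string     */
    return 0;
}
```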
SEGMENTATION:
[Figure: segments of different sizes (4K, 8K, and 12K), each with its own address space starting at 0.]
Because each segment forms a logical entity of which the programmer is aware,
such as a procedure, or an array, or a stack, different segments can have different
kinds of protection. A procedure segment can be specified as execute only,
prohibiting attempts to read from it or store into it.
[Figure: checkerboarding (external fragmentation) - in phases A through E, segments (Seg 3 8K, Seg 4 7K, Seg 5 4K, Seg 6 4K) are swapped in and out, leaving free holes (3K, 10K) between the segments in memory.]
ADDRESS TRANSLATION:
[Figure: address translation - the virtual page number is added to the contents of the page table base register to locate the corresponding page table entry.]
Each entry in the page table holds information about one page. This information includes the main memory address where the page is stored and the current status of the page. The starting address of the page table is kept in a Page Table Base Register. By adding the virtual page number to the contents of this register, the address of the corresponding entry in the page table is obtained. The contents of this location give the starting address of the page if the page is currently in main memory.
Each entry in the page table also includes some control bits that describe the status of the page while it is in main memory. One bit indicates whether the page is available in main memory. Another bit indicates whether the page has been modified during its residency in main memory.
The page table information is used by the MMU for every read and write access. The MMU is normally implemented as part of the CPU chip, so it is impossible to accommodate the entire page table within the MMU; the page table is therefore kept in main memory. A copy of a small portion of the page table is kept in the MMU - the entries that correspond to the most recently accessed pages. A small cache, usually called the Translation Lookaside Buffer (TLB), is incorporated into the MMU for this purpose.
Address translation proceeds as follows. Given a virtual address, the MMU looks in the TLB for the referenced page. If the page table entry for this page is found in the TLB, the physical address is obtained immediately. If there is a miss in the TLB, the required entry is obtained from the page table in main memory and the TLB is updated. The address translation process in the MMU takes some time, which depends mostly on the time needed to look up entries in the TLB. We can reduce the average translation time by including one or more special registers that retain the virtual page number and the physical page frame of the most recently performed translations; the information in these registers can be accessed more quickly than the TLB.
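A minimal sketch of the TLB lookup just described; the direct-mapped TLB organization and the toy page table are assumptions for illustration, not the notes' design:

```c
#include <stdio.h>

#define TLB_SIZE  8
#define PAGE_BITS 12

/* On a hit the frame number comes straight from the TLB; on a miss the
 * page table in main memory is consulted and the TLB is updated. */
struct tlb_entry { int valid; unsigned vpage, frame; };
static struct tlb_entry tlb[TLB_SIZE];
static unsigned page_table[16] = {2, 1, 6, 0, 4, 3, 0, 0, 0, 5, 0, 7, 0, 0, 0, 0};
static int tlb_misses = 0;

unsigned translate(unsigned vaddr) {
    unsigned vpage  = vaddr >> PAGE_BITS;
    unsigned offset = vaddr & ((1u << PAGE_BITS) - 1);
    unsigned slot   = vpage % TLB_SIZE;             /* direct-mapped TLB      */

    if (!tlb[slot].valid || tlb[slot].vpage != vpage) {
        tlb_misses++;                               /* TLB miss: go to memory */
        tlb[slot] = (struct tlb_entry){1, vpage, page_table[vpage]};
    }
    return (tlb[slot].frame << PAGE_BITS) | offset; /* physical address       */
}

int main(void) {
    printf("%u\n", translate(8196));  /* TLB miss, then 24580       */
    printf("%u\n", translate(8200));  /* same page: TLB hit, 24584  */
    printf("%d TLB misses\n", tlb_misses);
    return 0;
}
```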
CACHE MAPPING:
Consideration: 1) A cache consisting of 128 block frames of 16 words each, for a total of 2048 (2K) words; the main memory holds 4096 blocks (64K words).
DIRECT MAPPING:
[Figure: direct-mapped cache - the 4096 main memory blocks (Block 0 - Block 4095) map onto the 128 cache block frames; block frame 0 receives blocks 0, 128, 256, ..., block frame 1 receives blocks 1, 129, 257, ..., and so on (block j goes to frame j mod 128).]
Main memory address: Tag (5 bits) | Block (7 bits) | Word (4 bits)
The high-order 5 bits of the memory address of the block are stored in 5 tag bits associated with its location in the cache; they distinguish among the 32 memory blocks (4096/128) that can map to the same block frame. The 7-bit Block field points to a particular block frame location in the cache, and the 4-bit Word field selects a word within the block.
ADVANTAGE: Easy to implement.
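A small C sketch of the 5/7/4 address split described above (the sample address is arbitrary):

```c
#include <stdio.h>

/* Decomposes a 16-bit word address for the cache above: 16-word blocks,
 * 128 block frames, 4096 main memory blocks. */
int main(void) {
    unsigned addr  = 0xABCD;               /* an arbitrary 16-bit word address */
    unsigned word  = addr & 0xF;           /* low 4 bits: word within block    */
    unsigned frame = (addr >> 4) & 0x7F;   /* next 7 bits: cache block frame   */
    unsigned tag   = addr >> 11;           /* high 5 bits: tag                 */
    unsigned block = addr >> 4;            /* memory block number (tag+frame)  */

    printf("block %u maps to frame %u with tag %u, word %u\n",
           block, frame, tag, word);       /* block j goes to frame j mod 128  */
    return 0;
}
```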
ASSOCIATIVE-MAPPED CACHE:
[Figure: associative-mapped cache - any main memory block (Block 0 - Block 4095) may be placed in any of the 128 cache block positions.]
Main memory address: Tag (12 bits) | Word (4 bits)
A main memory block can be placed into any cache block position, so 12 tag bits are required to identify a memory block when it is resident in the cache. The tag bits of a CPU-generated address are compared to the tag bits of each block of the cache to see whether the desired block is present. This is called the associative mapping technique.
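A minimal sketch of the associative lookup in C; in hardware all tag comparisons happen in parallel, and the loop below merely models that search:

```c
#include <stdio.h>

#define BLOCKS 128

/* Any of the 128 cache blocks may hold the requested memory block, so the
 * 12-bit tag of the address is compared against every stored tag. */
struct cache_block { int valid; unsigned tag; };
static struct cache_block cache[BLOCKS];

int find_block(unsigned addr16) {
    unsigned tag = addr16 >> 4;          /* high 12 bits: block number = tag */
    for (int i = 0; i < BLOCKS; i++)     /* models the parallel tag match    */
        if (cache[i].valid && cache[i].tag == tag)
            return i;                    /* hit: block i holds the data      */
    return -1;                           /* miss                             */
}

int main(void) {
    cache[5] = (struct cache_block){1, 0x123};     /* assume block 0x123 cached */
    printf("%d\n", find_block((0x123 << 4) | 7));  /* prints 5 (a hit)          */
    return 0;
}
```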
SET-ASSOCIATIVE MAPPED CACHE:
[Figure: set-associative cache - the 128 block frames are grouped into 64 sets of 2 frames each; set 0 may hold memory blocks 0, 64, 128, ..., 4032, set 1 may hold blocks 1, 65, 129, ..., and set 63 may hold blocks 63, 127, ..., 4095.]
Main memory address: Tag (6 bits) | Set (6 bits) | Word (4 bits)
There are 64 sets; the 6-bit Set field of the address determines which set of the cache might contain the desired block, and the 6-bit Tag field is compared against the tags of the blocks in that set.
One more control bit, called the Valid Bit, must be provided for each block. The valid bits are all set to 0 when power is initially applied to the system or when the main memory is loaded with new programs and data from the disk. The valid bit of a particular cache block is set to 1 the first time this block is loaded from the main memory. Whenever a main memory block is updated by a source that bypasses the cache, a check is made to determine whether the block being loaded is currently in the cache; if it is, its valid bit is cleared to 0.
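A minimal sketch of the two-way set-associative lookup with valid bits, in C (the array layout is an assumption for illustration):

```c
#include <stdio.h>

#define SETS 64
#define WAYS 2

/* The 6-bit Set field selects one of 64 sets; only the 2 tags (and valid
 * bits) within that set are compared against the 6-bit Tag field. */
struct way { int valid; unsigned tag; };
static struct way cache[SETS][WAYS];

int lookup(unsigned addr16) {
    unsigned set = (addr16 >> 4) & 0x3F;  /* 6-bit Set field */
    unsigned tag = addr16 >> 10;          /* 6-bit Tag field */
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return w;                     /* hit in this way */
    return -1;                            /* miss            */
}

int main(void) {
    cache[3][1] = (struct way){1, 9};               /* memory block 9*64+3 = 579  */
    printf("%d\n", lookup((9u << 10) | (3u << 4))); /* prints 1 (hit, way 1)      */
    return 0;
}
```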
SECTOR-MAPPED CACHE:
[Figure: sector-mapped cache - main memory is partitioned into sectors of 16 blocks each (sector 0 holds blocks 0-15, sector 1 holds blocks 16-31, ..., up to sector 1023); the cache consists of sector frames (0-7), each holding 16 block frames and a sector tag.]
Memory is partitioned into a number of sectors and the cache is divided into a number of sector frames. If a request is made for a block that is not in the cache, the sector to which this block belongs is brought into a sector frame. A valid bit is associated with each block frame to indicate which blocks of a sector have actually been referenced and retrieved from memory.
ADVANTAGE: Reduces the cost of the mapping, since relatively few tags are needed, which permits simultaneous comparison with all tags.
MEMORY BANDWIDTH: Let W be the number of words that can be delivered per memory cycle t_m.
Maximum memory bandwidth: B_m = W / t_m (words/s or bytes/s)
Utilized CPU rate: B_P^u = R_w / T_P (words/s)
where T_P is the total CPU time required to generate the R_w results.
MEMORY INTERLEAVING:
Memory address: Module (k bits) | Address in module (m bits)
In the high-order scheme, the high-order k bits name one of the n modules, and the low-order m bits name a particular word in that module. When consecutive locations are accessed, as happens when a block of data is transferred to a cache, only one module is involved; at the same time, other devices with direct memory access (DMA) can access information in other memory modules.
A failure of one module in this scheme affects only a localized area of the address space.
Memory address: Address in module (m bits) | Module (k bits)
[Figure: low-order interleaving - the module field of the memory address selects one of modules 0 through 2^k - 1.]
The more efficient way to address the modules is low-order interleaving: the low-order k bits of the memory address select a module, and the high-order m bits name a location within that module. In this way, consecutive addresses are located in successive modules, so any component of the system that generates requests for access to consecutive memory locations can keep several modules busy at any one time. To implement this interleaving, there must be 2^k modules.
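A small C sketch of the low-order address split; the choice of k = 2 (four modules) is assumed for illustration:

```c
#include <stdio.h>

#define K 2  /* 2^K = 4 memory modules (assumed) */

/* Low-order interleaving: the low-order k bits select the module, so
 * consecutive addresses fall in successive modules and can be accessed
 * concurrently. */
int main(void) {
    for (unsigned addr = 0; addr < 8; addr++) {
        unsigned module = addr & ((1u << K) - 1);  /* low-order k bits       */
        unsigned offset = addr >> K;               /* word within the module */
        printf("address %u -> module %u, word %u\n", addr, module, offset);
    }
    return 0;
}
```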
WRITE BUFFER: When the write-through protocol is used, each write operation results in writing a new value into the main memory, and the CPU is slowed down by these write requests. To improve performance, a write buffer can be included for temporary storage of write requests. The CPU places each write request into the buffer and continues executing the next instruction. The write requests stored in the write buffer are sent to the main memory whenever the main memory is not responding to read requests.
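A minimal sketch of such a write buffer as a FIFO queue in C; the structure and the drain function are assumptions for illustration:

```c
#include <stdio.h>

#define BUF_SIZE 8

/* The CPU enqueues write requests and continues; buffered writes are
 * drained to main memory when it is not serving read requests. */
struct write_req { unsigned addr, data; };
static struct write_req buf[BUF_SIZE];
static int head = 0, count = 0;

int cpu_write(unsigned addr, unsigned data) {
    if (count == BUF_SIZE) return 0;       /* buffer full: CPU must stall   */
    buf[(head + count++) % BUF_SIZE] = (struct write_req){addr, data};
    return 1;                              /* CPU continues immediately     */
}

void drain_one(void) {                     /* called when memory is idle    */
    if (count == 0) return;
    struct write_req r = buf[head];
    head = (head + 1) % BUF_SIZE;
    count--;
    printf("memory <- write %u to address %u\n", r.data, r.addr);
}

int main(void) {
    cpu_write(100, 42);
    cpu_write(104, 43);
    drain_one();                           /* writes 42 to address 100 */
    drain_one();                           /* writes 43 to address 104 */
    return 0;
}
```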
b) Write-Through With Invalidation Protocol: When a processor writes a new value into its cache, the value is also written into the memory module, and all copies in other caches are invalidated. Broadcasting can be used to send the invalidation requests throughout the system.
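A minimal sketch of this protocol in C; the cache organization and function names are assumptions for illustration:

```c
#include <stdio.h>

#define NCACHES 4
#define NLINES  64

/* Write-through with invalidation: a write goes to memory, and a broadcast
 * invalidation makes every other cache drop its copy of that block. */
struct line { int valid; unsigned block; };
static struct line cache[NCACHES][NLINES];
static unsigned memory[1024];

void write_block(int writer, unsigned block, unsigned value) {
    memory[block] = value;                   /* write-through to memory */
    for (int c = 0; c < NCACHES; c++) {
        if (c == writer) continue;
        for (int l = 0; l < NLINES; l++)     /* broadcast invalidation  */
            if (cache[c][l].valid && cache[c][l].block == block)
                cache[c][l].valid = 0;
    }
}

int main(void) {
    cache[1][3] = (struct line){1, 42};      /* cache 1 holds block 42   */
    write_block(0, 42, 7);                   /* cache 0 writes block 42  */
    printf("cache 1 copy valid? %d\n", cache[1][3].valid);  /* prints 0 */
    return 0;
}
```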