
Computer System

Architecture
Faculty: SHIBU V S
Module:2
Topic: Memory Hierarchy Technology
Hierarchical Memory Technology
• Storage devices such as registers, caches, main memory, disk
devices, and backup storage are often organized as a hierarchy
• The memory technology and storage organization at each level are
characterized by five parameters:
• the access time (ti),
• memory size (si),
• cost per byte (ci),
• transfer bandwidth (bi),
• unit of transfer (xi).
• The access time ti refers to the round-trip time from the CPU to the
ith-level memory.
• The memory size si is the number of bytes or words in level i.
• The cost of the ith-level memory is estimated by the product ci·si.
• The bandwidth bi refers to the rate at which information is
transferred between adjacent levels.
• The unit of transfer xi refers to the grain size for data transfer
between levels i and i+1.
• Memory devices at a lower level are faster to access, smaller in size,
and more expensive per byte, having a higher bandwidth and using
a smaller unit of transfer as compared with those at a higher level.
• t(i-1) < ti , s(i-1) < si , c(i-1) > ci , b(i-1) > bi , x(i-1) < xi
• for i = 1, 2, 3, and 4 in the hierarchy, where i = 0 corresponds to the
CPU register level. The cache is at level 1, main memory at level 2,
the disks at level 3, and backup storage at level 4.
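
These relations can be made concrete with a toy example. The Python sketch below tabulates the five parameters for an assumed four-level hierarchy; every number is invented purely for illustration, not taken from any real machine, and the asserts check the five inequalities above for each pair of adjacent levels.

# Toy four-level hierarchy; all parameter values below are assumed.
# Each tuple: (name, access time ns, size bytes, cost $/byte, bandwidth MB/s, transfer unit bytes)
levels = [
    ("register", 1,         512,            100.0,  80000, 8),
    ("cache",    5,         512 * 1024,     1.0,    8000,  32),
    ("main",     50,        512 * 1024**2,  0.01,   800,   4096),
    ("disk",     5_000_000, 512 * 1024**3,  0.0001, 80,    64 * 1024),
]

for (n0, t0, s0, c0, b0, x0), (n1, t1, s1, c1, b1, x1) in zip(levels, levels[1:]):
    # t(i-1) < ti, s(i-1) < si, c(i-1) > ci, b(i-1) > bi, x(i-1) < xi
    assert t0 < t1 and s0 < s1 and c0 > c1 and b0 > b1 and x0 < x1, (n0, n1)
print("all five relations hold between every pair of adjacent levels")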
Registers
• The registers are part of the processor; multi-level caches are built
either on the processor chip or on the processor board.
• Register assignment is made by the compiler.
• Register transfer operations are directly controlled by the processor
after instructions are decoded.
• Register transfer is conducted at processor speed, in one clock
cycle.
Caches
• The cache is controlled by the MMU and is programmer-
transparent.
• The cache can also be implemented at one or multiple levels,
depending on the speed and application requirements.
• Over the last two or three decades, processor speeds have
increased at a much faster rate than memory speeds.
• Therefore multi-level cache systems have become essential to deal
with memory access latency.
Main Memory
• The main memory is sometimes called the primary memory of a
computer system.
• It is usually much larger than the cache and often implemented by
the most cost-effective RAM chips, such as DDR SDRAMs, i.e. double
data rate synchronous dynamic RAMs.
• The main memory is managed by an MMU in cooperation with the
operating system.
Disk Drives and backup Storage
• The disk storage is considered the highest level of on-line memory.
• It holds the system programs such as the OS and compilers, and
user programs and their data sets.
user programs and their data sets.
• Optical disks and magnetic tape units are off-line memory for use as
archival and backup storage.
• They hold copies of present and past user programs and processed
results and files.
• Disk drives are also available in the form of RAID arrays.
• A typical workstation computer has the cache and main memory on
a processor board and hard disks in an attached disk drive.
Peripheral Technology
• Besides disk drives and backup storage, peripheral devices include
printers, plotters, terminals, monitors, graphics displays, optical
scanners, image digitizers, output microfilm devices, etc.
• Some I/O devices are tied to special-purpose or multimedia
applications.
Inclusion, Coherence and Locality
• Information stored in a memory hierarchy (M1, M2, …, Mn) satisfies
three important properties: inclusion, coherence and locality.
• We consider cache memory the innermost level M1, which directly
communicates with the CPU registers.
• The outermost level Mn contains all the information words stored.
The collection of all addressable words in Mn forms the virtual
address space of a computer.
Inclusion Property
• The inclusion property is stated as M1 ⊂ M2 ⊂ M3 ⊂ … ⊂ Mn.
• The inclusion relationship implies that all information items are
originally stored in the outermost level Mn.
• During the processing, subsets of Mn are copied into Mn-1. Similarly,
subsets of Mn-1 are copied into Mn-2, and so on.
• In other words, if an information word is found in Mi, then copies of
the same word can also be found in all upper levels Mi+1, Mi+2, …, Mn.
• A word stored in Mi+1 may not be found in Mi.
• A word miss in Mi implies that it is also missing from all lower levels
Mi-1, Mi-2, …, M1. The highest level is the backup storage, where
everything can be found.
• Information transfer between the CPU and cache is in terms of
words (4 or 8 bytes each depending on the word length of a
machine).
• The cache (M1) is divided into cache blocks, also called cache lines
by some authors. Each block may be typically 32 bytes (8 words).
Blocks are the units of data transfer between the cache and main
memory, or between L1 and L2 cache, etc.
• The main memory (M2) is divided into pages, say, 4 Kbytes each.
Each page contains 128 blocks. Pages are the units of information
transferred between disk and main memory.
• Scattered pages are organized as a segment in the disk memory; for
example, segment F contains page A, page B, and other pages. The
size of a segment varies depending on the user's needs.
• Data transfer between the disk and backup storage is handled at
the file level, such as segments F and G.
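
The grain sizes quoted above are tied together by simple arithmetic. A two-line check using the example sizes from this section (4-byte words, 32-byte blocks, 4 Kbyte pages):

word, block, page = 4, 32, 4 * 1024   # bytes; the example sizes used in the text

print(block // word, "words per cache block")   # 8, matching "32 bytes (8 words)"
print(page // block, "blocks per page")         # 128, matching the text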
Coherence Property
• The coherence property requires that copies of the information
item at successive memory levels be consistent.
• If a word is modified in the cache, copies of that word must be
updated immediately or eventually at all higher levels. The
hierarchy should be maintained as such.
• Frequently used information is often found in the lower levels in
order to minimize the effective access time of the memory
hierarchy.
• In general, there are two strategies for maintaining coherence in a
memory hierarchy.
• The first method is called write-through (WT), which demands
immediate update in Mi+1 if a word is modified in Mi,
for i = 1, 2, …, n-1.
• The second method is write-back (WB), which delays the update in
Mi+1 until the word being modified in Mi is replaced or removed
from Mi.
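
A minimal Python sketch of the two policies over an assumed two-level model (the class and method names are illustrative, not a real cache API): write-through propagates a modified word to Mi+1 immediately, while write-back defers the update until the block is replaced.

# Toy two-level model: M1 is a small cache over M2.
class TwoLevel:
    def __init__(self, write_back=False):
        self.write_back = write_back
        self.m1 = {}       # cache: addr -> value
        self.m2 = {}       # next level: addr -> value
        self.dirty = set()

    def write(self, addr, value):
        self.m1[addr] = value
        if self.write_back:
            self.dirty.add(addr)     # WB: defer the update to M2
        else:
            self.m2[addr] = value    # WT: update M2 immediately

    def evict(self, addr):
        if self.write_back and addr in self.dirty:
            self.m2[addr] = self.m1[addr]   # WB: M2 updated only on replacement
            self.dirty.discard(addr)
        self.m1.pop(addr, None)

wt, wb = TwoLevel(), TwoLevel(write_back=True)
for c in (wt, wb):
    c.write(0x10, 42)
print(wt.m2.get(0x10), wb.m2.get(0x10))  # 42 None: WB has not propagated yet
wb.evict(0x10)
print(wb.m2[0x10])                       # 42 once the block is replaced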
Locality of Reference
• The memory hierarchy was developed based on a program behavior
known as Locality of reference.
• Memory references are generated by the CPU for either instruction or
data access. These accesses tend to be clustered in certain regions in
time, space, and ordering.
• In other words, most programs act in favor of a certain portion of their
address space during any time window.
• Hennessy and Patterson have pointed out a 90-10 rule, which states that a
typical program may spend 90% of its execution time on only 10% of the
code, such as the innermost loop of a nested looping operation.
• There are three dimensions of the locality property: temporal,
spatial, and sequential.
Temporal Locality
• Recently referenced items (instructions or data) are likely to be
referenced again in the near future. This is often caused by special
program constructs such as iterative loops, process stacks,
temporary variables, or subroutines.
• Once a loop is entered or a subroutine is called, a small code
segment will be referenced repeatedly many times. Thus temporal
locality tends to cluster the access in the recently used areas.
Spatial Locality
• This refers to the tendency for a process to access items whose
addresses are near one another. For example, operations on tables
or arrays involve accesses of a certain clustered area in the address
space.
• Program segments, such as routines and macros, tend to be stored
in the same neighborhood of the memory space.
Sequential Locality
• In typical programs, the execution of instructions follows a
sequential order (or the program order) unless branch instructions
create out-of-order executions.
• The ratio of in-order execution to out-of-order execution is roughly
5 to 1 in ordinary programs. Besides, the access of a large data array
also follows a sequential order.
Memory Design Implications
• The sequentiality in program behavior also contributes to the
spatial locality because sequentially coded instructions and array
elements are often stored in adjacent locations.
• Each type of locality affects the design of the memory hierarchy.
• The temporal locality leads to the popularity of the least recently used
(LRU) replacement algorithm (see the sketch after this list).
• The spatial locality assists us in determining the size of unit data transfers
between adjacent memory levels.
• The temporal locality also helps determine the size of memory at
successive levels.
• The sequential locality affects the determination of grain size for optimal
scheduling (grain packing).
• Prefetch techniques are heavily affected by the locality properties.
• The principle of locality guides the design of cache, main memory, and
even virtual memory organization.
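
As a concrete illustration of the first point, here is a toy LRU cache in Python (a sketch, not a hardware-accurate model): on overflow it evicts the block that has gone unreferenced the longest, on the assumption that temporal locality makes it the least likely to be needed again.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # insertion order tracks recency

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # hit: mark most recently used
            return "hit"
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # miss: evict the LRU block
        self.blocks[tag] = True
        return "miss"

cache = LRUCache(2)
print([cache.access(t) for t in "ABAB"])  # ['miss', 'miss', 'hit', 'hit']
print([cache.access(t) for t in "CAC"])   # C evicts A, A evicts B: ['miss', 'miss', 'hit']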
Memory Capacity Planning
• The performance of a memory hierarchy is determined by the
effective access time Teff to any level in the hierarchy.
• It depends on the hit ratios and access frequencies at successive
levels.
Hit Ratios: The hit ratio is a concept defined for any two adjacent levels of
a memory hierarchy.
• When an information item is found in Mi, we call it a hit, otherwise,
a miss.
• Consider memory levels Mi and Mi-1 in a hierarchy, i = 1, 2, …, n. The
hit ratio hi at Mi is the probability that an information item will be
found in Mi.
• It is a function of the characteristics of the two adjacent levels Mi-1
and Mi. The miss ratio at Mi is defined as 1 - hi.
• The hit ratios at successive levels are a function of memory
capacities, management policies, and program behavior.
• Successive hit ratios are independent random variables with values
between 0 and 1. To simplify the derivation, we assume h0 = 0 and
hn = 1, which means the CPU always accesses M1 first, and the
access to the outermost memory Mn is always a hit.
• The access frequency to Mi is defined as
fi = (1 - h1)(1 - h2) … (1 - h(i-1)) hi, with f1 = h1.
• This is indeed the probability of successfully accessing Mi when there
are i - 1 misses at the lower levels and a hit at Mi.
• Due to the locality property, the access frequencies decrease very
rapidly from low to high levels; that is,
f1 >> f2 >> f3 >> … >> fn.
• This implies that the inner levels of memory are accessed more
often than the outer levels.
Effective Access Time
• In practice, we wish to achieve as high a hit ratio as possible at M1.
• Every time a miss occurs, a penalty must be paid to access the next
higher level of memory. The misses have been called block misses in
the cache and page faults in the main memory because blocks and
pages are the units of transfer between these levels.
• The time penalty for a page fault is much longer than that for a
block miss due to the fact that t1<t2<t3.
• Stone ( 1990) pointed out that a cache miss is 2 to 4 times as costly
as a cache hit, but a page fault is 1000 to 10,000 times as costly as a
page hit.
• In modern systems a cache miss has a greater cost relative to a
cache hit, because main memory speeds have not increased as fast
as processor speeds.
• Using the access frequencies fi for i = 1, 2, …, n, we can formally
define the effective access time of a memory hierarchy as
Teff = f1t1 + f2t2 + … + fntn
     = h1t1 + (1 - h1)h2t2 + (1 - h1)(1 - h2)h3t3 + … + (1 - h1)(1 - h2) … (1 - h(n-1))tn
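
A short sketch that evaluates both formulas, borrowing the hit ratios and access times from the worked three-level example later in this topic (h1 = 0.98, h2 = 0.99, t1 = 25 ns, t2 = 1250 ns, t3 = 4 ms):

# Access frequencies fi = (1 - h1)...(1 - h(i-1)) * hi, then Teff = sum(fi * ti).
h = [0.98, 0.99, 1.0]        # h1, h2, h3 (hn = 1: the outermost level always hits)
t = [25e-9, 1250e-9, 4e-3]   # t1, t2, t3 in seconds

f, miss = [], 1.0
for hit in h:
    f.append(miss * hit)
    miss *= 1 - hit

t_eff = sum(fi * ti for fi, ti in zip(f, t))
print([round(fi, 4) for fi in f])       # [0.98, 0.0198, 0.0002]
print(f"Teff = {t_eff * 1e9:.2f} ns")   # 849.25 ns, within the 850 ns design goal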
Hierarchy Optimization
• The total cost of a memory hierarchy is estimated as
Ctotal = c1s1 + c2s2 + … + cnsn
• This implies that the cost is distributed over n levels. Since
c1 > c2 > c3 > … > cn, we have to choose s1 < s2 < s3 < … < sn.
• The optimal design of a memory hierarchy should result in a Teff
close to the t1 of M1 and a total cost close to the cost of Mn.
• In reality, this is difficult to achieve due to the tradeoffs among n
levels.
• The optimization process can be formulated as a linear
programming problem, given a ceiling C0 on the total cost; that is,
a problem to minimize
Teff = f1t1 + f2t2 + … + fntn
• subject to the following constraints:
• si > 0, ti > 0 for i = 1, 2, …, n
• Ctotal = c1s1 + c2s2 + … + cnsn < C0

• The unit cost ci and capacity si at each level Mi depend on the
speed ti required.
• Therefore, the above optimization involves tradeoffs among ti, ci, si
and fi (or hi) at all levels i = 1, 2, …, n.
Example: The design of a memory hierarchy
• Consider the design of a three-level memory hierarchy whose
memory characteristics include a cache access time t1 = 25 ns and a
disk access time t3 = 4 ms (the values used in the calculation below).
• The design goal is to achieve an effective memory-access time t =
850 ns with a cache hit ratio h1 = 0.98 and a hit ratio h2 = 0.99 in
main memory. Also, the total cost of the memory hierarchy is
upper-bounded by $1,500.
The memory hierarchy cost is calculated as
C = c1s1 + c2s2 + c3s3 ≤ $1,500
The maximum capacity of the disk is thus obtained as s3 = 40 Gbytes
without exceeding this budget.
• Next, we want to choose the access time t2 of the RAM used to build
the main memory. The effective memory-access time must satisfy
Teff ≤ 850 ns:
850 × 10^-9 ≥ 0.98 × 25 × 10^-9 + 0.02 × 0.99 × t2 + 0.02 × 0.01 × 1 × 4 × 10^-3
• Choosing t2 = 1250 ns satisfies this bound.
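
A quick numeric check of this step (plain arithmetic on the values above): solving the inequality for t2 gives a bound of roughly 1288 ns, so a 1250 ns RAM meets the 850 ns goal with a small margin.

# Solve 850 ns >= h1*t1 + (1-h1)*h2*t2 + (1-h1)*(1-h2)*1*t3 for t2.
t_goal, t1, t3 = 850e-9, 25e-9, 4e-3
h1, h2 = 0.98, 0.99

slack = t_goal - h1 * t1 - (1 - h1) * (1 - h2) * 1.0 * t3  # time left for M2 accesses
t2_max = slack / ((1 - h1) * h2)
print(f"t2 <= {t2_max * 1e9:.0f} ns")   # ~1288 ns, so t2 = 1250 ns is acceptable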
• Suppose one wants to double the main memory to 64 Mbytes at
the expense of reducing the disk capacity under the same budget
limit. This change will not affect the cache hit ratio. But it may
increase the hit ratio in the main memory, and thereby, the
effective memory-access time will be reduced.
Problem No: 1
a) The average access time is
ta = h1t1 + (1 - h1)h2t2 = h·t1 + (1 - h)·10t1 = (10 - 9h)t1
(taking h1 = h, h2 = 1, t2 = 10t1, and t1 = 20 ns)
If h = 0.7, ta = 3.7t1 = 74 ns
If h = 0.9, ta = 1.9t1 = 38 ns
If h = 0.98, ta = 1.18t1 = 23.6 ns

b) The average byte cost is
c = (c1s1 + c2s2) / (s1 + s2) = (20c2·s1 + c2 × 4000) / (s1 + 4000)
  = (20 × 0.2·s1 + 0.2 × 4000) / (s1 + 4000) = (4s1 + 800) / (s1 + 4000)
For s1 = 64, c = 0.26
For s1 = 128, c = 0.32
For s1 = 256, c = 0.43

c) For the three designs, the product of access time and byte cost is
1) 74 ns × 0.26 = 19.24
2) 38 ns × 0.32 = 12.16
3) 23.6 ns × 0.43 = 10.15

The third option is the best
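
A quick numeric check of parts (a) to (c) in Python (t1 = 20 ns is inferred from ta = 3.7t1 = 74 ns; the products differ slightly from the rounded values above because unrounded byte costs are used):

t1 = 20   # ns, inferred from ta = 3.7 * t1 = 74 ns in part (a)
for h in (0.7, 0.9, 0.98):
    print(h, round((10 - 9 * h) * t1, 1))   # 74.0, 38.0, 23.6 ns

for s1, ta in ((64, 74.0), (128, 38.0), (256, 23.6)):
    c = (4 * s1 + 800) / (s1 + 4000)        # average byte cost from part (b)
    print(s1, round(c, 2), round(ta * c, 2))
# products: 19.23, 12.08, 10.11 -> the third design gives the lowest product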


Problem No: 2
a) The effective access time is
teff = h1t1 + (1 - h1)h2t2 = h1t1 + (1 - h1)t2 (with h2 = 1)
     = 0.95t1 + 0.05t2
b) The total cost is c = c1s1 + c2s2
c) i) The total cost is upper-bounded by $15,000:
0.01 × 512 × 1024 + 0.0005 × s2 ≤ 15000
s2 ≤ 18.6 MB
ii) For teff = 40 ns:
20 × 0.95 + 0.05 × t2 ≤ 40
t2 ≤ 420 ns
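
And a corresponding check for part (c), using the unit costs and sizes that appear in the calculation ($0.01/byte and 512 Kbytes for the cache, $0.0005/byte for main memory, t1 = 20 ns):

c1, s1 = 0.01, 512 * 1024        # cache: $0.01/byte, 512 Kbytes
c2 = 0.0005                      # main memory: $0.0005/byte
s2 = (15000 - c1 * s1) / c2      # the rest of the $15,000 budget buys M2
print(f"s2 <= {s2 / 2**20:.1f} MB")   # 18.6 MB

t1, teff = 20, 40                # ns
t2 = (teff - 0.95 * t1) / 0.05   # from teff = 0.95*t1 + 0.05*t2
print(f"t2 <= {t2:.0f} ns")      # 420 ns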
Computer System
Architecture
Faculty: SHIBU V S
Module:2
Topic: Advanced Processor Technology
Module 2- Processors and Memory
Hierarchy
• This chapter presents modern processor technology and the
supporting memory hierarchy.
Advanced Processor Technology
• Architectural families of modern processors are introduced, from
processors used in workstations or multiprocessors to those
designed for mainframes and supercomputers.
• Major processor families to be studied include the CISC, RISC,
superscalar, VLIW, superpipelined, vector, and symbolic
processors.
• Scalar and vector processors are for numerical computations.
• Symbolic processors have been developed for AI applications.
Design Space of Processors
Various processor families can be mapped onto a coordinated space of
clock rate versus cycles per instruction (CPI).

The two broad categories which we shall discuss are CISC and RISC.
• Under both CISC and RISC categories, products designed for multi-
core chips, embedded applications, or for low cost and/or low
power consumption tend to have lower clock speeds.

• High-performance processors must necessarily be designed to
operate at high clock speeds.

• The category of vector processors has been marked VP; vector
processing features may be associated with CISC or RISC main
processors.
The Design Space
• Conventional processors like the Intel Pentium, Motorola M68040, older
VAX/8600, IBM 390, etc. fall into the family known as complex-
instruction-set computers (CISC) architecture.

• With advanced implementation techniques, the clock rate of
today's CISC processors ranges up to a few GHz.

• The CPI of different CISC instructions varies from 1 to 20. Therefore,
CISC processors are at the upper part of the design space.
• Reduced-instruction-set computers (RISC) processors include
SPARC, Power series, MIPS, Alpha, ARM, etc.
• With the use of efficient pipelines, the average CPI of RISC
instructions has been reduced to between one and two cycles.
• An important subclass of RISC processors are the superscalar
processors, which allow multiple instructions to be issued
simultaneously during each cycle.
• Thus the effective CPI of a superscalar processor should be lower
than that of a scalar RISC processor.

• The clock rate of superscalar processors matches that of scalar RISC
processors.
• The very-long instruction word (VLIW) architecture can in theory
use even more functional units than a superscalar processor. Thus
the CPI of a VLIW processor can be further lowered. Intel’s i860 RISC
processor had VLIW architecture.

• The processors in vector supercomputers use multiple functional
units for concurrent scalar and vector operations.
Instruction Pipeline
The execution cycle of a typical instruction includes four phases: fetch,
decode, execute, and write-back. These instruction phases are often
executed by an instruction pipeline.
A pipeline cycle is defined as the time required for each phase to
complete its operation, assuming equal delay in all phases (pipeline
stages).
1) Instruction pipeline cycle—the clock period of the instruction
pipeline.
2) Instruction issue latency—the time (in cycles) required between the
issuing of two adjacent instructions.
3) Instruction issue rate—the number of instructions issued per cycle,
also called the degree of a superscalar processor.
4) Simple operation latency—simple operations make up the vast
majority of instructions executed by the machine, such as integer adds,
loads, stores, branches, moves, etc. Complex operations are those
requiring an order-of-magnitude longer latency, such as divides, cache
misses, etc. These latencies are measured in number of cycles.

5) Resource conflicts—this refers to the situation where two or more
instructions demand use of the same functional unit at the same time.
A base scalar processor is defined as a machine with one instruction
issued per cycle, a one-cycle latency for a simple operation, and a one-
cycle latency between instruction issues. The instruction pipeline can
be fully utilized if successive instructions can enter it continuously at
the rate of one per cycle.
If the instruction issue latency is two cycles per instruction, the pipeline
can be underutilized.
Another underpipelined situation occurs when the pipeline cycle time is
doubled by combining pipeline stages. In this case, the fetch and
decode phases are combined into one pipeline stage, and execute and
write-back are combined into another stage. This will also result in
poor pipeline utilization.
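
A toy timing model of these cases (an assumed 4-stage pipeline with fill time plus one issue slot every L cycles; a sketch, not an exact machine model):

def total_cycles(n, stages=4, issue_latency=1):
    # (stages - 1) cycles to fill the pipeline, then one issue every issue_latency cycles
    return (stages - 1) + issue_latency * n

n = 100
for lat in (1, 2):
    t = total_cycles(n, issue_latency=lat)
    print(f"issue latency {lat}: {t} cycles, {n / t:.2f} instructions per cycle")
# latency 1: 103 cycles (~0.97 IPC, fully utilized)
# latency 2: 203 cycles (~0.49 IPC, underutilized)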
• The figure shows the data path architecture and control unit of a
typical, simple scalar processor which does not employ an
instruction pipeline. Main memory, I/O controllers, etc. are
connected to the external bus.
• The control unit generates control signals required for the fetch,
decode, ALU operation, memory access, and write result phases of
instruction execution.
• The control unit itself may employ hardwired logic, or, as was
more common in older CISC-style processors, microcoded logic.
• Modern RISC processors employ hardwired logic, and even modern
CISC processors make use of many of the techniques originally
developed for high-performance RISC processors.
CISC vs RISC
Superscalar Processors and Vector
Processors
• A CISC or a RISC scalar processor can be improved with a
superscalar or vector architecture.
• Scalar processors are those executing one instruction per cycle.
Only one instruction is issued per cycle, and only one completion of
instruction is expected from the pipeline per cycle.
• In a superscalar processor, multiple instructions are issued per cycle
and multiple results are generated per cycle.
• A vector processor executes vector instructions on arrays of data;
each vector instruction involves a string of repeated operations,
which are ideal for pipelining with one result per cycle.
Superscalar Processors
• Superscalar processors are designed to exploit more instruction-
level parallelism in user programs.
• Only independent instructions can be executed in parallel without
causing a wait state.
• The amount of instruction level parallelism varies widely depending
on the type of code being executed.
• The instruction issue degree in a superscalar processor has thus
been limited to 2 to 5 in practice.
Pipelining in Superscalar Processors
• The fundamental structure of a three-issue superscalar pipeline is
shown in Fig.

• Superscalar processors were originally developed as an alternative
to vector processors, with a view to exploiting a higher degree of
instruction-level parallelism.
• A superscalar processor of degree m can issue m instructions per
cycle. In this sense, the base scalar processor, implemented either
in RISC or CISC, has m = 1.

• In order to fully utilize a superscalar processor of degree m, m
instructions must be executable in parallel. This situation may not
be true in all clock cycles. In that case, some of the pipelines may be
stalling in a wait state (see the timing sketch below).
• In a superscalar processor, the simple operation latency should
require only one cycle, as in the base scalar processor.
• Due to the desire for a higher degree of instruction-level parallelism
in programs, the superscalar processor depends more on an
optimizing compiler to exploit parallelism.
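
The timing sketch below uses a simple idealized model, T(m) = k + (N - m)/m cycles for N independent instructions on a degree-m superscalar with k pipeline stages; real machines fall short of this because of dependences and resource conflicts.

import math

def superscalar_cycles(n, m, k=4):
    # Idealized model: k cycles to fill the pipelines, then the remaining
    # n - m instructions issue m at a time (assumes fully independent code).
    return k + math.ceil((n - m) / m)

n = 120
for m in (1, 2, 3):
    t = superscalar_cycles(n, m)
    print(f"degree m = {m}: {t} cycles, effective CPI = {t / n:.2f}")
# the effective CPI falls below 1 as m grows (about 0.36 at m = 3)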
The VLIW Architecture
• The VLIW architecture is generalized from two well-established
concepts: horizontal microcoding and superscalar processing.
• A typical VLIW (very long instruction word) machine has instruction
words hundreds of bits in length.
• Multiple functional units are used concurrently in a VLIW processor.
• All functional units share the use of a common large register file.
• The operations to be simultaneously executed by the functional
units are synchronized in a VLIW instruction, say, 256 or 1024 bits
per instruction word.
• Different fields of the long instruction word carry the opcodes to be
dispatched to different functional units.

• Programs written in conventional short instruction words (say 32
bits) must be compacted together to form the VLIW instructions.
• This code compaction must be done by a compiler which can
predict branch outcomes using elaborate heuristics or run-time
statistics.
Pipelining in VLIW Architectures
• The execution of instructions by an ideal VLIW processor is shown in
Fig.
• Each instruction specifies multiple operations. The effective CPI
becomes 0.33 in this particular example.
• VLIW machines behave much like superscalar machines with three
differences:
1. the decoding of VLIW instructions is easier than that of
superscalar instructions.
2. the code density of the superscalar machine is better when the
available instruction-level parallelism is less than that exploitable
by the VLIW machine. This is because the fixed VLIW format
includes bits for non-executable operations, while the superscalar
processor issues only executable instructions.
3. a superscalar machine can be object-code-compatible with a large
family of non-parallel machines. On the contrary, a VLIW machine
exploiting different amounts of parallelism would require different
instruction sets.
• Instruction parallelism and data movement in a VLIW architecture
are completely specified at compile time.
• Run-time resource scheduling and synchronization are in theory
completely eliminated.
• One can view a VLIW processor as an extreme example of a
superscalar processor in which all independent or unrelated
operations are already synchronously compacted together in
advance.
• The CPI of a VLIW processor can be even lower than that of a
superscalar processor.
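
A toy sketch of the compile-time compaction described above (a hypothetical greedy scheduler, not a real VLIW compiler): ready operations are packed into fixed 3-slot long words, and slots with nothing to issue are padded with no-ops, which is exactly the code-density cost noted earlier.

# Toy dependence graph: op -> list of ops it must wait for (assumed data).
ops = {"a": [], "b": [], "c": ["a"], "d": ["a", "b"], "e": ["c"]}

words, done = [], set()
remaining = dict(ops)
while remaining:
    ready = [op for op, deps in remaining.items()
             if all(d in done for d in deps)][:3]   # at most 3 slots per long word
    for op in ready:
        del remaining[op]
    done.update(ready)
    words.append(ready + ["nop"] * (3 - len(ready)))

for w in words:
    print(w)   # ['a', 'b', 'nop'], ['c', 'd', 'nop'], ['e', 'nop', 'nop']
# 5 operations in 3 long words: effective CPI = 3/5 here; with all slots
# filled (3 operations per word) it would reach 1/3, as in the text's example.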
