
COMPUTER ORGANIZATION AND ARCHITECTURE

II Year / III Semester

UNIT III

PARALLEL PROCESSOR
• Parallel processing and its challenges
• Instruction level parallelism
• Flynn's classification: SISD, MIMD, SIMD, SPMD and vector
• Hardware multithreading
• Multicore processors: shared memory multiprocessor and cluster multiprocessor
Parallel processing
• A parallel processing system can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously. The data can be distributed among the multiple functional units.
• The following diagram shows one possible way of
separating the execution unit into eight functional units
operating in parallel.
• The operation performed in each functional unit is indicated in each block of the diagram.
Parallel processing
• The adder and integer multiplier perform the arithmetic operations with integer numbers.
• The floating-point operations are separated into three
circuits operating in parallel.
• The logic, shift, and increment operations can be
performed concurrently on different data. All units are
independent of each other, so one number can be shifted
while another number is being incremented.
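
As a concrete illustration, the straight-line C fragment below contains six independent operations; a machine with the functional-unit split described above could dispatch each to a different unit in the same cycle (a sketch only; the variable names and values are illustrative):

```c
#include <stdio.h>

/* Six independent operations; with parallel functional units, each
   could execute simultaneously. */
int main(void) {
    int a = 6, b = 7, c = 2, d = 3, p = 0xF0, q = 0x3C, r = 5, s = 9;
    float x = 1.5f, y = 2.5f;

    int   sum  = a + b;    /* integer adder        */
    int   prod = c * d;    /* integer multiplier   */
    float fsum = x + y;    /* floating-point adder */
    int   mask = p & q;    /* logic unit           */
    int   shl  = r << 2;   /* shift unit           */
    int   inc  = s + 1;    /* incrementer          */

    printf("%d %d %f %d %d %d\n", sum, prod, fsum, mask, shl, inc);
    return 0;
}
```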
PARALLEL PROCESSING CHALLENGES
The Hardware Model
• An ideal processor is one where all constraints on ILP are
removed. The only limits on ILP in such a processor are those
imposed by the actual data flows through either registers or
memory.

• The assumptions made for an ideal or perfect processor are as follows:

1) Register renaming
There are an infinite number of virtual registers available, and hence
all WAW and WAR hazards are avoided and an unbounded
number of instructions can begin execution simultaneously.

2) Branch prediction
Branch prediction is perfect. All conditional branches are predicted
exactly.
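
Assumption 1 (register renaming) can be made concrete with a minimal sketch, using C variables to stand in for registers (the names r1..r5 and p7 are illustrative, not a real register file):

```c
/* A WAR hazard and its removal by renaming. */
int renaming_demo(void) {
    int r2 = 2, r3 = 3, r4 = 4, r5 = 5;

    /* WAR hazard: the second statement writes r2 while the first
       reads it, so the hardware must keep them in order. */
    int r1 = r2 + r3;   /* reads r2  */
    r2     = r4 * r5;   /* writes r2 */

    /* With renaming, the multiply writes a fresh virtual register
       p7 instead of r2; the two operations are now independent
       and can begin execution simultaneously. */
    int p7 = r4 * r5;

    return r1 + r2 + p7;
}
```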
PARALLEL PROCESSING CHALLENGES

3) Jump prediction
• All jumps (including jump register used for return and
computed jumps) are perfectly predicted. When combined with
perfect branch prediction, this is equivalent to having a processor
with perfect speculation and an unbounded buffer of instructions
available for execution.
 
4) Memory address alias analysis
• All memory addresses are known exactly, and a load can be
moved before a store provided that the addresses are not
identical. Note that this implements perfect address alias analysis.
PARALLEL PROCESSING CHALLENGES

5) Perfect caches
• All memory accesses take 1 clock cycle. In practice, superscalar
processors will typically consume large amounts of ILP hiding
cache misses, making these results highly optimistic.
PARALLEL PROCESSING CHALLENGES
Limitations on the Window Size and Maximum Issue Count

• To build a processor that even comes close to perfect branch prediction and perfect alias analysis requires extensive dynamic analysis, since static compile time schemes cannot be perfect. Of course, most realistic dynamic schemes will not be perfect, but the use of dynamic schemes will provide the ability to uncover parallelism that cannot be analyzed by static compile time analysis.

• Thus, a dynamic processor might be able to more closely match the amount of parallelism uncovered by our ideal processor.
PARALLEL PROCESSING CHALLENGES
The Effects of Realistic Branch and Jump Prediction

• Our ideal processor assumes that branches can be perfectly predicted: the outcome of any branch in the program is known before the first instruction is executed! Of course, no real processor can ever achieve this.
• We assume a separate predictor is used for jumps. Jump
predictors are important primarily with the most accurate branch
predictors, since the branch frequency is higher and the accuracy
of the branch predictors dominates.

1) Perfect: All branches and jumps are perfectly predicted at the start of execution.

2) Tournament-based branch predictor: The prediction scheme uses a correlating 2-bit predictor and a noncorrelating 2-bit predictor together with a selector, which chooses the best predictor for each branch.
PARALLEL PROCESSING CHALLENGES
The Effects of Finite Registers

• Our ideal processor eliminates all name dependences among register references using an infinite set of virtual registers.

• To date, the IBM Power5 has provided the largest numbers of virtual registers: 88 additional floating-point and 88 additional integer registers, in addition to the 64 registers available in the base architecture. All 240 registers are shared by two threads when executing in multithreading mode, and all are available to a single thread when in single-thread mode.
PARALLEL PROCESSING CHALLENGES
The Effects of Imperfect Alias Analysis

• Our optimal model assumes that it can perfectly analyze all memory dependences, as well as eliminate all register name dependences. Of course, perfect alias analysis is not possible in practice: the analysis cannot be perfect at compile time, and it requires a potentially unbounded number of comparisons at run time (since the number of simultaneous memory references is unconstrained).
INSTRUCTION-LEVEL-PARALLELISM

• All processors since about 1985 use pipelining to overlap the execution of instructions and improve performance. This potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel.

• There are two largely separable approaches to exploiting ILP: an approach that relies on hardware to help discover and exploit the parallelism dynamically, and an approach that relies on software technology to find parallelism statically at compile time. Processors using the dynamic, hardware-based approach, including the Intel Pentium series, dominate in the market; those using the static approach, including the Intel Itanium, have more limited uses in scientific or application-specific environments.

• The value of the CPI (cycles per instruction) for a pipelined processor is the sum of the base CPI and all contributions from stalls:

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
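
For example, assuming an ideal pipeline CPI of 1.0 and stall contributions of 0.1 (structural), 0.3 (data hazard), and 0.2 (control), the pipeline CPI would be 1.0 + 0.1 + 0.3 + 0.2 = 1.6, giving an IPC of 1/1.6 ≈ 0.63 (the numbers here are illustrative).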
INSTRUCTION-LEVEL-PARALLELISM

• The ideal pipeline CPI is a measure of the maximum performance attainable by the implementation. Reducing each of the terms on the right-hand side minimizes the overall pipeline CPI or, equivalently, increases the IPC (instructions per clock).

• The simplest and most common way to increase the ILP is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism. There are a number of techniques for converting such loop-level parallelism into instruction-level parallelism. Basically, such techniques work by unrolling the loop either statically by the compiler or dynamically by the hardware. An important alternative method for exploiting loop-level parallelism is the use of vector instructions. A vector instruction exploits data-level parallelism by operating on data items in parallel.
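
A minimal sketch of static unrolling in C (assuming, for simplicity, that n is a multiple of 4):

```c
/* Loop-level parallelism converted to ILP by unrolling the loop by
   a factor of 4. */
void scale(float *x, float s, int n) {
    for (int i = 0; i < n; i += 4) {
        /* The four multiplies are independent of one another, so a
           multiple-issue processor can overlap them, and the loop
           overhead (increment, compare, branch) is paid once per
           four elements instead of once per element. */
        x[i]     *= s;
        x[i + 1] *= s;
        x[i + 2] *= s;
        x[i + 3] *= s;
    }
}
```

Compilers can perform this transformation automatically (for example, GCC's -funroll-loops option).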
Flynn's Classification
• In 1966, Michael Flynn proposed a classification for computer architectures based on the number of instruction streams and data streams (Flynn's Taxonomy).
• Flynn uses the stream concept for describing a machine's
structure.
• A stream simply means a sequence of items (data or
instructions).

Flynn’s Taxonomy
• SISD: Single instruction, single data
  – Classical von Neumann architecture
• SIMD: Single instruction, multiple data
• MISD: Multiple instructions, single data
  – Nonexistent, just listed for completeness
• MIMD: Multiple instructions, multiple data
  – Most common and general parallel machine
Multiple Processor Organization
• Single instruction, single data stream - SISD
• Single instruction, multiple data stream - SIMD
• Multiple instruction, single data stream - MISD
• Multiple instruction, multiple data stream - MIMD
Single Instruction, Single Data Stream - SISD
• SISD (Single-Instruction stream, Single-Data stream)

• Single processor
• Single instruction stream
• Data stored in single memory
• Uni-processor

• SISD corresponds to the traditional mono-processor (von Neumann computer). A single data stream is processed by one instruction stream.
• A single-processor computer (uni-processor) in which a single stream of instructions is generated from the program.
Single Instruction, Multiple Data Stream - SIMD
• SIMD (Single-Instruction stream, Multiple-Data streams)

• A single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis.
• Each processing element has an associated data memory.
• Each instruction is executed on a different set of data by the different processors.
• Vector and array processors fall into this category.

• Each instruction is executed on a different set of data by different processors, i.e., multiple processing units of the same type operate on multiple data streams.
• This group is dedicated to array processing machines.
• Sometimes, vector processors can also be seen as a part of this group.
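
As an illustration, x86 SSE intrinsics express the SIMD idea directly: the single instruction generated by _mm_add_ps below performs four additions at once (a sketch; it assumes an SSE-capable processor and a compiler providing immintrin.h):

```c
#include <immintrin.h>

/* SIMD sketch: one instruction, four data elements. */
void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);     /* load four floats from a    */
    __m128 vb = _mm_loadu_ps(b);     /* load four floats from b    */
    __m128 vc = _mm_add_ps(va, vb);  /* single instruction, 4 adds */
    _mm_storeu_ps(out, vc);          /* store the four results     */
}
```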
Multiple Instruction, Single Data Stream - MISD
MISD (Multiple-Instruction streams, Single-Data stream)

• A sequence of data is transmitted to a set of processors.
• Each processor executes a different instruction sequence.
• In MISD computers, multiple processing units operate on one single data stream.
• In practice, this kind of organization has never been implemented.
Multiple Instruction, Multiple Data Stream - MIMD
• MIMD (Multiple-Instruction streams, Multiple-Data streams)
• Set of processors
• Simultaneously execute different instruction sequences
• Different sets of data
• SMPs, clusters and NUMA systems

• Each processor has a separate program.
• An instruction stream is generated from each program.
• Each instruction operates on different data.
• This last machine type builds the group of traditional multiprocessors: several processing units operate on multiple data streams.
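
A minimal MIMD-style sketch using POSIX threads: two threads run different instruction streams on different data (all names are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

/* Two different instruction streams operating on two data sets. */
void *sum_task(void *arg)  { int *a = arg; a[0] = a[1] + a[2]; return NULL; }
void *prod_task(void *arg) { int *a = arg; a[0] = a[1] * a[2]; return NULL; }

int main(void) {
    int x[3] = {0, 2, 3}, y[3] = {0, 4, 5};
    pthread_t t1, t2;

    pthread_create(&t1, NULL, sum_task, x);   /* stream 1, data set 1 */
    pthread_create(&t2, NULL, prod_task, y);  /* stream 2, data set 2 */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("%d %d\n", x[0], y[0]);            /* prints: 5 20 */
    return 0;
}
```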
Taxonomy of Parallel Processor Architectures (figure)
HARDWARE MULTITHREADING
Exploiting Thread-Level Parallelism within a Processor

• Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion.
• To permit this sharing, the processor must duplicate the independent state of each thread.
• For example, a separate copy of the register file, a separate PC, and a separate page table are required for each thread.
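
Conceptually, the duplicated per-thread state might be pictured as the C struct below (a sketch only; the field names and sizes are illustrative, not any real machine's layout):

```c
#include <stdint.h>

/* Per-thread state that a multithreaded processor replicates; the
   functional units, caches, and memory remain shared. */
struct hw_thread_state {
    uint64_t regs[32];         /* private copy of the register file */
    uint64_t pc;               /* private program counter           */
    uint64_t page_table_base;  /* private page-table pointer        */
};

/* One copy per hardware thread, e.g. four threads: */
struct hw_thread_state thread_state[4];
```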
HARDWARE MULTITHREADING
There are two main approaches to multithreading:

• Fine-grained multithreading
• Coarse-grained multithreading
HARDWARE MULTITHREADING

Fine-grained multithreading

• Fine-grained multithreading switches between threads on each instruction, causing the execution of multiple threads to be interleaved.
• This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that time.
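
The selection policy can be sketched as follows (hypothetical code; the stalled[] flags are assumed to come from the pipeline):

```c
#define NTHREADS 4

/* Fine-grained selection: each cycle, move to the next thread in
   round-robin order, skipping any thread that is stalled. */
int next_thread(int current, const int stalled[NTHREADS]) {
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (current + i) % NTHREADS;
        if (!stalled[t])
            return t;      /* issue this cycle from thread t */
    }
    return -1;             /* every thread is stalled this cycle */
}
```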
HARDWARE MULTITHREADING

Coarse-grained multithreading

• Coarse-grained multithreading was invented as an alternative to fine-grained multithreading. It switches threads only on costly stalls, such as level-two cache misses.
• This change relieves the need to have thread switching be essentially free and is much less likely to slow the processor down, since instructions from other threads will only be issued when a thread encounters a costly stall.
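
In contrast to the round-robin sketch above, a coarse-grained policy keeps issuing from one thread and switches only on a costly event (a sketch; the event codes are illustrative):

```c
enum stall_event { NO_STALL, L1_MISS, L2_MISS };

/* Coarse-grained selection: stay with the current thread unless it
   hits a costly stall such as a level-two cache miss. */
int select_thread(int current, enum stall_event e, int nthreads) {
    if (e == L2_MISS)
        return (current + 1) % nthreads;  /* switch on costly stall */
    return current;                       /* otherwise keep issuing */
}
```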
HARDWARE MULTITHREADING
Simultaneous Multithreading:

• Simultaneous multithreading (SMT) is a variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP at the same time it exploits ILP.
• The key insight that motivates SMT is that modern multiple-issue processors often have more functional unit parallelism available than a single thread can effectively use.
• Furthermore, with register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to the dependences among them; the resolution of the dependences can be handled by the dynamic scheduling capability.
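
The issue idea can be sketched as below: in a single cycle, issue slots are filled from whichever threads have ready instructions, rather than from one thread only (hypothetical code, not a real issue queue):

```c
#define SLOTS    4   /* issue slots per clock cycle */
#define NTHREADS 4

/* ready[t] holds how many independent, ready instructions thread t
   has this cycle; slots are filled from all threads together. */
int smt_issue(int ready[NTHREADS]) {
    int filled = 0;
    for (int t = 0; t < NTHREADS && filled < SLOTS; t++) {
        while (ready[t] > 0 && filled < SLOTS) {
            ready[t]--;   /* issue one instruction from thread t */
            filled++;     /* TLP and ILP fill the slots together */
        }
    }
    return filled;        /* slots used this cycle */
}
```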
Simultaneous Multithreading:

The following figure illustrates the differences in a processor's ability to exploit the resources of a superscalar for the following processor configurations:

• a superscalar with no multithreading support,
• a superscalar with coarse-grained multithreading,
• a superscalar with fine-grained multithreading, and
• a superscalar with simultaneous multithreading.
Simultaneous Multithreading:

• In the superscalar without multithreading support, the use of issue slots is limited by a lack of ILP.
• In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by switching to another thread that uses the resources of the processor.
• In the fine-grained case, the interleaving of threads eliminates fully empty slots, but because only one thread issues instructions in a given clock cycle, ILP limitations still leave some slots unused.
• In the SMT case, thread-level parallelism (TLP) and instruction-level parallelism (ILP) are exploited simultaneously, with multiple threads using the issue slots in a single clock cycle.
• Although the figure greatly simplifies the real operation of these processors, it does illustrate the potential performance advantages of multithreading in general and SMT in particular.
A multi-core processor:
• A multi-core processor is a processing system composed of two or more independent cores (or CPUs). The cores are typically integrated onto a single integrated circuit die (known as a chip multiprocessor or CMP), or they may be integrated onto multiple dies in a single chip package.
• A multi-core processor implements multiprocessing in a single physical package. Cores in a multi-core device may be coupled together tightly or loosely. For example, cores may or may not share caches, and they may implement message passing or shared memory inter-core communication methods. Common network topologies to interconnect cores include bus, ring, 2-dimensional mesh, and crossbar.
A multi-core processor:
• All cores are identical in symmetric multi-core systems, and they are not identical in asymmetric multi-core systems. Just as with single-processor systems, cores in multi-core systems may implement architectures such as superscalar, vector processing, or multithreading.
• Multi-core processors are widely used across many application domains, including general-purpose, embedded, network, digital signal processing, and graphics.
A multi-core processor:
• The amount of performance gained by the use of a multi-core processor is strongly dependent on the software algorithms and implementation.
• Multi-core processing is a growing industry trend as single-core processors rapidly reach the physical limits of possible complexity and speed.
• Companies that have produced or are working on multi-core products include AMD, ARM, Broadcom, Intel, and VIA.
• With a shared on-chip cache memory, communication events can be reduced to just a handful of processor cycles.
• Therefore, with low latencies, communication delays have a much smaller impact on overall performance.
• Threads can also be much smaller and still be effective.
• Automatic parallelization becomes more feasible.


A multi-core processor:
• Multiple cores run in parallel (figure).
A multi-core processor:
Properties of multi-core systems

• Cores will be shared with a wide range of other applications dynamically.
• Load can no longer be considered symmetric across the cores.
• Cores will likely be asymmetric, as accelerators become common for scientific hardware.
• Source code will often be unavailable, preventing compilation against the specific hardware configuration.
A multi-core processor:
Applications that benefit from multi-core

• Database servers
• Web servers
• Telecommunication markets
• Multimedia applications
• Scientific applications
Shared Memory Multiprocessors
• In shared-memory multiprocessors, numerous processors are
accessing one or more shared memory modules. The processors
may be physically connected to the memory modules in many
ways, but logically every processor is connected to every memory
module.
• One of the major characteristics of shared memory
multiprocessors is that all processors have equally direct access
to one large memory address space. The limitation of shared
memory multiprocessors is memory access latency.
• The figure shows shared-memory multiprocessors.
Shared Memory Multiprocessors
• Shared memory multiprocessors have a major benefit over other multiprocessors, since all the processors share the same view of the memory.
• These processors are also termed Uniform Memory Access (UMA) systems. This term denotes that memory is equally accessible to every processor, providing access at the same performance rate.
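
A minimal shared-memory sketch using POSIX threads: every thread (standing in for a processor) accesses the same address space directly, with a mutex ordering the concurrent updates (names are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

long counter = 0;   /* one shared address space, directly accessible */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                 /* direct access to shared memory */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("%ld\n", counter);      /* prints: 400000 */
    return 0;
}
```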
Clustered Multiprocessors
• A clustered system integrates several machines into one system to complete tasks.
• Cluster systems are a mix of hardware clusters and software clusters.
• Hardware clusters help to share high-performance disks among the machines.
• The software clusters make the systems work together. Every node of the clustered system contains the cluster software.
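
Since cluster nodes do not share memory, they typically cooperate by message passing; the sketch below uses MPI, one common choice (it assumes an MPI installation and at least two processes, e.g. mpirun -np 2):

```c
#include <mpi.h>
#include <stdio.h>

/* Two cluster nodes cooperating by message passing. */
int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* to node 1 */
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                         /* from node 0 */
        printf("node 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```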
