UNIT - 5
Topics covered:
Parallelism
Need, types of Parallelism
Applications of Parallelism
Parallelism in Software
Instruction level parallelism
Data level parallelism
Challenges in parallel processing
Architectures of Parallel Systems - Flynn's classification
SISD, SIMD
MIMD, MISD
Hardware multithreading
Coarse Grain parallelism, Fine Grain parallelism
Uniprocessor and Multiprocessors
Multi-core processors
Textbooks: Chapters 1 and 2, and online resources
Parallel Computing
Parallel Processing
Example of Parallelism
What is a Parallel Architecture?
What is Parallelism?
Parallelism
Importance of Parallelism
Single-thread performance, CPU clock frequency (MHz), CPU power consumption (watts), and the number of cores per CPU chip are shown from 1970 to 2020. The parallel computing era begins around 2005, when core counts began to rise while clock frequency and power consumption plateaued, yet performance steadily increased. (Horowitz et al. and Rupp)
Moore's Law
Challenges in Parallel Processing
Types of Parallelism
Parallelism in Hardware
• Parallelism in a uniprocessor
  – Pipelining
  – Superscalar, VLIW, etc.
• SIMD instructions, vector processors, GPUs
• Multiprocessors
  – Symmetric shared-memory multiprocessors
  – Distributed-memory multiprocessors
  – Chip multiprocessors, a.k.a. multi-cores
• Multicomputers, a.k.a. clusters
Parallelism in Software (a short sketch follows this list)
• Instruction-level parallelism
• Task-level parallelism
• Data parallelism
• Transaction-level parallelism
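As a software-side illustration of the forms listed above (a minimal sketch, not from the slides), the C++ program below contrasts data parallelism, where the same operation is applied to different halves of one array, with task-level parallelism, where two unrelated computations run concurrently. The array size and the two example tasks are arbitrary choices.

```cpp
// Minimal sketch: data parallelism vs. task-level parallelism with std::thread.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    // Data parallelism: the same operation (doubling) applied to different halves of one array.
    std::vector<int> data(1000);
    std::iota(data.begin(), data.end(), 1);
    auto doubleRange = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) data[i] *= 2;
    };
    std::thread d1(doubleRange, 0, data.size() / 2);
    std::thread d2(doubleRange, data.size() / 2, data.size());
    d1.join();
    d2.join();

    // Task-level parallelism: two unrelated tasks run concurrently.
    long long sum = 0;
    long long maxVal = 0;
    std::thread t1([&] { sum = std::accumulate(data.begin(), data.end(), 0LL); });
    std::thread t2([&] { maxVal = *std::max_element(data.begin(), data.end()); });
    t1.join();
    t2.join();

    std::cout << "sum = " << sum << ", max = " << maxVal << "\n";
    return 0;
}
```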
Uniprocessor vs Multi-processor system
Most computer manufacturers started with the development of systems having a single central processor, called uniprocessor systems.
Uniprocessor systems are limited in the performance they can achieve.
The computing power of a uniprocessor can be increased by allowing multiple processing elements to operate under one controller.
The computer structure can also be extended to include multiple processors that share memory space and peripherals under the control of one integrated operating system. Such a computer is called a multiprocessor system.
As far as parallel processing is concerned, the general architectural trend is shifting away from conventional uniprocessor systems toward multiprocessor systems, or toward arrays of processing elements controlled by one uniprocessor.
Basic Uniprocessor Architecture
A typical uniprocessor computer consists of three major components: the main memory, the central processing unit (CPU), and the input-output (I/O) subsystem.
The architectures of two commercially available uniprocessor computers are given below to show the possible interconnection structures among the three subsystems.
The next slide shows the architectural components of the superminicomputer VAX-11/780, manufactured by Digital Equipment Corporation. The CPU contains the master controller of the VAX system.
There are sixteen 32-bit general-purpose registers, one of which serves as the program counter (PC). There is also a special CPU status register containing information about the current state of the processor and of the program being executed.
The CPU contains an arithmetic and logic unit (ALU) with an optional floating-point accelerator, and some local cache memory with an optional diagnostic memory. The operator can intervene in the CPU through the console, which is connected to a floppy disk drive.
Basic Uniprocessor Architecture
The CPU, the main memory (2^32 words of 32 bits each), and the I/O subsystems are all connected to a common bus, the synchronous backplane interconnect (SBI). Through this bus, all I/O devices can communicate with each other, with the CPU, or with the memory. Peripheral storage or I/O devices can be connected directly to the SBI through the Unibus and its controller (which can be connected to PDP-11 series minicomputers), or through a Massbus and its controller.
Parallel Processing Mechanisms
Taxonomy of Parallel Computers
Flynn's Classification
Flynn's Computer System Classification
Types of Parallelism in Applications
Levels of Parallelism
Multithreading: Basics
Thread
• Instruction stream with state (registers and memory)
• Register state is also called the “thread context”
• Threads could be part of the same process (program) or from different programs
• Threads in the same program share the same address space (shared memory model)
Traditionally, the processor keeps track of the context of a single thread.
Multitasking: when a new thread needs to be executed, the old thread's context in hardware is written back to memory and the new thread's context is loaded.
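A minimal sketch (not from the slides) of the shared-address-space point above: two C++ threads inside one process update the same counter variable, so they must coordinate through a mutex. The counter name and iteration count are arbitrary.

```cpp
// Two threads of the same process share the address space: both see `counter`.
#include <iostream>
#include <mutex>
#include <thread>

int main() {
    long long counter = 0;        // lives in the process's shared address space
    std::mutex m;                 // protects the shared state

    auto work = [&] {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> lock(m);
            ++counter;            // both threads update the same variable
        }
    };

    std::thread t1(work), t2(work);
    t1.join();
    t2.join();

    std::cout << "counter = " << counter << "\n";   // 200000
    return 0;
}
```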
Hardware and Software Parallelism
To solve the mismatch problem between software parallelism and hardware parallelism, one approach is to develop compilation support; the other is hardware redesign for more efficient exploitation by an intelligent compiler.
These two approaches must cooperate with each other to produce the best result.
Hardware processors can be better designed to exploit parallelism by an optimizing compiler. Pioneering work in processor technology with this objective can be found in the IBM 801, Stanford MIPS, and Berkeley RISC.
Most processors use a large register file and sustained instruction pipelining to execute nearly one instruction per cycle. The large register file supports fast access to temporary values generated by an optimizing compiler. The registers are exploited by the code optimizer and global register allocator in such a compiler.
The instruction scheduler exploits the pipeline hardware by filling branch and load delay slots. In superscalar and superpipelined processors, hardware and software branch prediction, multiple instruction issue, speculative execution, a high-bandwidth instruction cache, and support for dynamic scheduling are needed to facilitate the detection of parallelism opportunities. The architecture must be designed interactively with the compiler.
Granularity: Fine Grain and Coarse Grain Parallelism
Grain size or granularity is a measure of the amount of computation involved in a software process.
The simplest measure is to count the number of instructions in a
grain (program segment).
Grain size determines the basic program segment chosen for
parallel processing.
Grain sizes are commonly described as fine, medium or coarse
depending on the processing levels involved.
Latency is a time measure of the communication overhead incurred
between machine subsystems. For example, the memory latency is
the time required by a processor to access the memory.
The time required for two processes to synchronize with each
other is called the synchronization latency. Computational
granularity and communication latency are closely related.
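To make the relation between granularity and latency concrete, the toy calculation below (illustrative numbers only, not from the slides) assumes 1 ns per instruction and a fixed 1000 ns synchronization latency per grain, and prints the fraction of time lost to synchronization for a fine grain of 20 instructions versus a coarse grain of 2000 instructions.

```cpp
// Illustrative only: fraction of time spent on synchronization per grain.
// Assumes 1 ns per instruction and 1000 ns synchronization latency (made-up numbers).
#include <cstdio>

int main() {
    const double ns_per_instruction = 1.0;
    const double sync_latency_ns = 1000.0;
    const int grain_sizes[] = {20, 2000};   // fine grain vs. coarse grain

    for (int g : grain_sizes) {
        double compute = g * ns_per_instruction;
        double overhead = sync_latency_ns / (compute + sync_latency_ns);
        std::printf("grain = %4d instructions: %.0f%% of the time is synchronization overhead\n",
                    g, 100.0 * overhead);
    }
    return 0;
}
```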
Granularity: Fine Grain and Coarse Grain Parallelism
Instruction Level
At the instruction or statement level, a typical grain contains fewer than 20 instructions, called fine grain. Depending on the individual program, fine-grain parallelism at this level may range from two-fold to thousands-fold.
The advantage of fine-grain computation lies in the abundance of parallelism. The exploitation of fine-grain parallelism can be assisted by an optimizing compiler, which should be able to automatically detect the parallelism and translate the source code into a parallel form that can be recognized by the run-time system. Instruction-level parallelism is rather tedious for an ordinary programmer to detect in source code.
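As a small illustration (not from the slides) of fine-grain parallelism at this level: the four multiplications in the function below are mutually independent, so an optimizing compiler or a superscalar processor can overlap them; only the final sum depends on all four results. The function and variable names are arbitrary.

```cpp
// The four products are independent of one another (fine-grain parallelism),
// so the hardware/compiler may execute them in any overlapped order.
#include <iostream>

double dot4(const double a[4], const double b[4]) {
    double p0 = a[0] * b[0];   // independent
    double p1 = a[1] * b[1];   // independent
    double p2 = a[2] * b[2];   // independent
    double p3 = a[3] * b[3];   // independent
    return (p0 + p1) + (p2 + p3);   // only this step needs all four results
}

int main() {
    double a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    std::cout << dot4(a, b) << "\n";   // 70
    return 0;
}
```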
Granularity: Fine Grain and Coarse Grain Parallelism
Loop Level
This corresponds to iterative loop operations. A typical loop contains fewer than 500 instructions. Some loop operations, if independent across successive iterations, can be vectorized for pipelined execution or for lock-step execution on SIMD machines.
Some loop operations can be self-scheduled for parallel
execution on MIMD machines. Loop-level parallelism is
the most optimized program construct to execute on a
parallel or vector computer. However, recursive loops are
rather difficult to parallelize.
Vector processing is mostly exploited at the loop level
(level 2) by a vectorizing compiler. The loop level is still
considered a fine grain of computation.
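A minimal sketch (not from the slides) of loop-level parallelism: the iterations of the loop below are independent of one another, so they can be split across two C++ threads (or, equivalently, vectorized for SIMD execution). The vector size and the two-way split are arbitrary.

```cpp
// Loop whose iterations are independent, split across two worker threads.
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1'000'000;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // y[i] = 2*x[i] + y[i] for a half-open range of iterations.
    auto axpy = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) y[i] = 2.0f * x[i] + y[i];
    };

    std::thread t1(axpy, 0, n / 2);        // first half of the iterations
    std::thread t2(axpy, n / 2, n);        // second half, no dependence on the first
    t1.join();
    t2.join();

    std::cout << y[0] << " " << y[n - 1] << "\n";   // 4 4
    return 0;
}
```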
Granularity: Fine Grain and Coarse Grain Parallelism
Procedure Level
This level corresponds to medium-grain size at the task,
procedural, subroutine, and coroutine levels. A typical
grain at this level contains less than 2000 instructions.
Detection of parallelism at this level is much more
difficult than at the finer-grain levels. Interprocedural
dependence analysis is much more involved and history-
sensitive.
The communication requirement is often less compared
with that required in MIMD execution mode. SPMD
execution mode is a special case at this level.
Multitasking also belongs in this category. Significant
efforts by programmers may be needed to restructure a
program at this level, and some compiler assistance is
also needed.
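A minimal sketch (not from the slides) of medium-grain, procedure-level parallelism: two independent subroutines are launched as separate tasks with std::async and synchronize only when their results are combined. The two procedures are placeholders invented for the example.

```cpp
// Two independent procedures run as separate tasks (medium-grain parallelism).
#include <future>
#include <iostream>

// Placeholder procedures standing in for larger subroutines.
long long sumUpTo(long long n) {
    long long s = 0;
    for (long long i = 1; i <= n; ++i) s += i;
    return s;
}

long long countMultiplesOf3(long long n) {
    long long c = 0;
    for (long long i = 1; i <= n; ++i) if (i % 3 == 0) ++c;
    return c;
}

int main() {
    auto f1 = std::async(std::launch::async, sumUpTo, 10'000'000LL);
    auto f2 = std::async(std::launch::async, countMultiplesOf3, 10'000'000LL);

    // Each procedure is a grain of many instructions; the tasks synchronize only here.
    std::cout << "sum = " << f1.get() << ", multiples of 3 = " << f2.get() << "\n";
    return 0;
}
```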
Granularity: Fine Grain and Coarse Grain Parallelism
Subprogram Level
This corresponds to the level of job steps and related subprograms. The grain size may typically contain thousands of instructions. Job steps can overlap across different jobs. Subprograms can be scheduled for different processors in SPMD or MPMD mode, often on message-passing multicomputers.
Multiprogramming on a uniprocessor or on a
multiprocessor is conducted at this level. In the past,
parallelism at this level has been exploited by algorithm
designers or programmers, rather than by compilers. We
do not have good compilers for exploiting medium- or
coarse-grain parallelism at present.
Summary
Fine-grain parallelism is often exploited at instruction or loop levels,
preferably assisted by a parallelizing or vectorizing compiler.
Medium-grain parallelism at the task or job-step level demands significant roles for the programmer as well as for compilers.
Coarse-grain parallelism at the program level relies heavily on an
effective OS and on the efficiency of the algorithm used. Shared-
variable communication is often used to support fine-grain and
medium-grain computations.
Message-passing multicomputers have been used for medium- and
coarse-grain computations. In general, the finer the grain size, the
higher the potential for parallelism and the higher the communication
and scheduling overhead.
Fine grain provides a higher degree of parallelism, but heavier
communication overhead, as compared with coarse-grain computations.
Massive parallelism is often explored at the fine-grain level, such as
data parallelism on SIMD or MIMD computers.
Hardware Multithreading
Benefit
• Latency tolerance
• Better hardware utilization (when?)
• Reduced context switch penalty
Cost
• Requires multiple thread contexts to be implemented in hardware (area, power, latency cost)
• Usually reduced single-thread performance
  – Resource sharing, contention
  – Switching penalty (can be reduced with additional hardware)
Types of Hardware Multithreading
Fine-grained
• Cycle by cycle
Coarse-grained (contrasted with fine-grained in the sketch after this list)
• Switch on event (e.g., cache miss)
• Switch on quantum/timeout
Simultaneous
• Instructions from multiple threads executed concurrently in the same cycle
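The toy model below (an illustrative sketch, not a description of any real processor) contrasts the first two policies: fine-grained switching picks a different thread every cycle, while coarse-grained switching stays on one thread until that thread hits a simulated long-latency event such as a cache miss. The stall pattern and cycle count are made up.

```cpp
// Toy model of fine-grained vs. coarse-grained hardware multithreading.
#include <iostream>
#include <string>
#include <vector>

int main() {
    const int threads = 2;
    const int cycles = 12;
    // Made-up stall pattern: thread 0 "misses" on its 4th issued instruction.
    auto misses = [](int thread, int instr) { return thread == 0 && instr == 4; };

    for (const std::string policy : {"fine", "coarse"}) {
        std::vector<int> issued(threads, 0);
        int current = 0;
        std::cout << policy << "-grained issue order: ";
        for (int c = 0; c < cycles; ++c) {
            if (policy == "fine") {
                current = c % threads;                 // switch every cycle (stalls hidden naturally)
            } else if (misses(current, issued[current] + 1)) {
                current = (current + 1) % threads;     // switch only when the running thread stalls
            }
            ++issued[current];
            std::cout << "T" << current << " ";
        }
        std::cout << "\n";
    }
    return 0;
}
```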
Fine-grained Multithreading
Coarse-grained Multithreading
Simultaneous Multithreading
Application of Parallelism
Energy Resources Exploration
• Seismic exploration
• Reservoir modeling
• Plasma fusion power
• Nuclear reactor safety
  – On-line analysis of reactor conditions
Medical, Military, and Basic Research
• Computer-assisted tomography
• Genetic engineering
• Weapon research and defense
  – Multiwarhead nuclear weapon design (Cray-1)
  – Simulation of atomic weapon effects by solving hydrodynamics and radiation problems (Cyber-205)
  – Intelligence gathering, such as radar signal processing on the associative processor for the antiballistic missile (ABM) program
Multi-core processor
Diagram of a generic dual-core processor with CPU-local level-1 caches and a shared, on-die level-2 cache
An AMD Athlon X2 6400+ dual-core processor
Multi-core processor
A multi-core processor implements multiprocessing in a single
physical package. Designers may couple cores in a multi-core
device tightly or loosely.
For example, cores may or may not share caches, and they may
implement message passing or shared-memory inter-core
communication methods. Common network topologies used to
interconnect cores include bus, ring, two-dimensional mesh, and
crossbar.
Homogeneous multi-core systems include only identical cores; heterogeneous multi-core systems have cores that are not identical (e.g., ARM big.LITTLE has heterogeneous cores that share the same instruction set, while AMD Accelerated Processing Units have cores that do not share the same instruction set).
Just as with single-processor systems, cores in multi-core
systems may implement architectures such as VLIW,
superscalar, vector, or multithreading.
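A minimal sketch (not from the text above) of the two inter-core communication styles mentioned: in the shared-memory style one thread writes a variable that another then reads, while in the message-passing style values travel through an explicit channel, here a mutex-protected queue standing in for real message-passing hardware.

```cpp
// Shared-memory vs. message-passing communication between two threads (stand-ins for cores).
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

int main() {
    // Shared-memory style: both threads access the same variable directly.
    int shared_value = 0;
    std::mutex m;
    std::thread writer([&] { std::lock_guard<std::mutex> lk(m); shared_value = 42; });
    writer.join();
    std::cout << "shared-memory read: " << shared_value << "\n";

    // Message-passing style: values travel through an explicit channel (a guarded queue).
    std::queue<int> channel;
    std::condition_variable cv;
    bool done = false;
    std::thread producer([&] {
        for (int i = 0; i < 3; ++i) {
            { std::lock_guard<std::mutex> lk(m); channel.push(i); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    });
    std::thread consumer([&] {
        std::unique_lock<std::mutex> lk(m);
        while (true) {
            cv.wait(lk, [&] { return !channel.empty() || done; });
            while (!channel.empty()) {
                std::cout << "message received: " << channel.front() << "\n";
                channel.pop();
            }
            if (done) break;
        }
    });
    producer.join();
    consumer.join();
    return 0;
}
```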
Multi-core processor
Types of Processor Cores
• A simple single-threaded processor.
• A multi-threaded complex processor.
• A processor that is just an arithmetic logic unit with a multiplier and an accumulator (MAC unit).
• A hybrid system combining simple and complex processors.
Number of Processing Cores
• A small number (< 16) of identical cores.
• A medium number of cores (of the order of hundreds).
• A very large number (thousands) of multiply-accumulate (MAC) units.
Interconnection of Processor Cores
• The cores may be interconnected using a bus.
• The cores may be interconnected by a ring, a crossbar switch, or they may form a grid.
Design Consideration of Multi-core Microprocessor
Multi-core processor
Instruction Parallelism and Data Parallelism