UNIT 2 CLASSIFICATION OF PARALLEL
COMPUTERS
Structure Page Nos.
2.0 Introduction 27
2.1 Objectives 27
2.2 Types of Classification 28
2.3 Flynn’s Classification 28
2.3.1 Instruction Cycle
2.3.2 Instruction Stream and Data Stream
2.3.3 Flynn’s Classification
2.4 Handler’s Classification 33
2.5 Structural Classification 34
2.5.1 Shared Memory System/Tightly Coupled System
2.5.1.1 Uniform Memory Access Model
2.5.1.2 Non-Uniform Memory Access Model
2.5.1.3 Cache-only Memory Architecture Model
2.5.2 Loosely Coupled Systems
2.6 Classification Based on Grain Size 39
2.6.1 Parallelism Conditions
2.6.2 Bernstein Conditions for Detection of Parallelism
2.6.3 Parallelism Based on Grain Size
2.7 Summary 44
2.8 Solutions/ Answers 44
2.0 INTRODUCTION
Parallel computers are those that emphasize parallel processing among operations in some way. In the previous unit, all the basic terms of parallel processing and computation were defined. Parallel computers can be characterized based on the data and instruction streams, which form various types of computer organisations. They can also be classified based on computer structure, e.g., multiple processors having separate memory or one shared global memory. Parallel processing levels can also be defined based on the size of instructions in a program, called grain size. Thus, parallel computers can be classified based on various criteria. This unit discusses all the types of classification of parallel computers based on the above-mentioned criteria.
2.1 OBJECTIVES
After going through this unit, you should be able to:
• explain the various criteria on which the classification of parallel computers is based;
• discuss Flynn's classification based on instruction and data streams;
• describe the structural classification based on different computer organisations;
• explain Handler's classification based on three distinct levels of computer: processor control unit (PCU), arithmetic logic unit (ALU), and bit-level circuit (BLC); and
• describe the sub-tasks or instructions of a program that can be executed in parallel based on the grain size.
Elements of Parallel Computing and Architecture

2.2 TYPES OF CLASSIFICATION
2.3 FLYNN'S CLASSIFICATION

2.3.1 Instruction Cycle

[Figure 1: A typical instruction format — bits 0–5 hold the operation code; bits 6–15 hold the operand field (addressing mode and operand address)]
The control unit of the CPU fetches the instructions of the program, one at a time. The fetched instruction is then decoded by the decoder, which is a part of the control unit, and the processor executes the decoded instruction. The result of execution is temporarily stored in the Memory Buffer Register (MBR), also called the Memory Data Register. The normal execution steps are shown in Figure 2.
[Figure 2: Instruction execution steps — fetch, decode and execute, repeated while more instructions remain]

[Figure 3: Instruction and data stream]
Thus, it can be said that the sequence of instructions executed by the CPU forms the instruction stream, and the sequence of data (operands) required for the execution of instructions forms the data stream.
Is = Ds = 1

[Figure 4: SISD organisation — the control unit issues a single instruction stream (Is) to the ALU, which operates on a single data stream (Ds) from main memory]
Is = 1, Ds > 1

[Figure 5: SIMD organisation — a single control unit issues one instruction stream (Is) to processing elements PE1 … PEn, each operating on its own data stream DS1 … DSn from memory modules MM1 … MMn]
Is > 1, Ds = 1

[Figure 6: MISD organisation — control units CU1 … CUn issue separate instruction streams IS1 … ISn to processing elements PE1 … PEn, all of which operate on a single data stream (DS) from main memory]
This classification is not popular in commercial machines, as the concept of a single data stream executing on multiple processors is rarely applied. But for specialized applications, MISD organisation can be very helpful. For example, real-time computers need to be fault tolerant, where several processors execute on the same data to produce redundant results. This is also known as N-version programming. All these redundant results are compared; they should be the same, otherwise the faulty unit is replaced. Thus, MISD machines can be applied to fault-tolerant real-time computers.
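The comparison-and-voting idea behind N-version redundancy can be sketched in a few lines. This is a purely illustrative toy, not taken from any real system; the `vote` helper and its values are invented for this sketch:

```python
# Toy sketch of N-version redundancy: several independent units process the
# same input; a majority vote picks the result and flags dissenting units
# as potentially faulty. All names and values here are illustrative.
from collections import Counter

def vote(results):
    """Return the majority result and the indices of units that disagree."""
    winner, _ = Counter(results).most_common(1)[0]
    faulty = [i for i, r in enumerate(results) if r != winner]
    return winner, faulty

results = [9, 9, 7]            # unit 2 produced a deviating result
value, faulty_units = vote(results)
print(value, faulty_units)     # -> 9 [2]
```

A real system would of course compare results from genuinely independent hardware or software versions; the voting logic, however, is essentially this simple.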
Is > 1, Ds > 1

[Figure 7: MIMD organisation — control units CU1 … CUn issue independent instruction streams IS1 … ISn to processing elements PE1 … PEn, each operating on its own data stream DS1 … DSn with memory modules MM1 … MMn]
Of the classifications discussed above, MIMD organisation is the most popular for a parallel computer. In the real sense, parallel computers execute instructions in MIMD mode.
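The difference between the stream models can be made concrete with a small sketch: SISD applies one instruction to one operand pair at a time, while SIMD conceptually applies one instruction across many data elements at once. This is only an analogy in ordinary Python, not real vector hardware:

```python
# SISD-style: one instruction stream processes one data item per step.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

sisd_result = []
for i in range(len(a)):        # each iteration handles a single operand pair
    sisd_result.append(a[i] + b[i])

# SIMD-style (conceptually): a single "add" is applied across whole data
# streams; real SIMD hardware would do this in one vector instruction.
simd_result = [x + y for x, y in zip(a, b)]

print(sisd_result)  # -> [11, 22, 33, 44]
print(simd_result)  # -> [11, 22, 33, 44]
```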
Check Your Progress 1
1) What are various criteria for classification of parallel computers?
…………………………………………………………………………………………
…………………………………………………………………………………………
…………………………………………………………………………………………
…………………………………………………………………………………………
2) Define instruction and data streams.
…………………………………………………………………………………………
…………………………………………………………………………………………
…………………………………………………………………………………………
…………………………………………………………………………………………
3) State whether True or False for the following:
a) SISD computers can be characterized as Is > 1 and Ds > 1
b) SIMD computers can be characterized as Is > 1 and Ds = 1
c) MISD computers can be characterized as Is = 1 and Ds = 1
d) MIMD computers can be characterized as Is > 1 and Ds > 1
2.4 HANDLER'S CLASSIFICATION

Handler's classification uses the following three pairs of integers to describe a computer:
Computer = (p * p', a * a', b * b')
where:
p = number of PCUs
p' = number of PCUs that can be pipelined
a = number of ALUs controlled by each PCU
a' = number of ALUs that can be pipelined
b = number of bits in an ALU or processing element (PE) word
b' = number of pipeline segments in all ALUs or in a single PE
The following rules and operators are used to show the relationship between various
elements of the computer:
• The '*' operator is used to indicate that the units are pipelined or macro-pipelined
with a stream of data running through all the units.
• The '+' operator is used to indicate that the units are not pipelined but work on
independent streams of data.
• The 'v' operator is used to indicate that the computer hardware can work in one of
several modes.
• The '~' symbol is used to indicate a range of values for any one of the parameters.
• Peripheral processors are shown before the main processor using another three pairs of integers. If the value of the second element of any pair is 1, it may be omitted for brevity.
Handler's classification is best explained by showing how the rules and operators are used
to classify several machines.
The CDC 6600 has a single main processor supported by 10 I/O processors. One control
unit coordinates one ALU with a 60-bit word length. The ALU has 10 functional units
which can be formed into a pipeline. The 10 peripheral I/O processors may work in
parallel with each other and with the CPU. Each I/O processor contains one 12-bit ALU.
The description for the 10 I/O processors is:
CDC 6600 I/O = (10, 1, 12)
The description for the main processor is:
CDC 6600 main = (1, 1 * 10, 60)
The main processor and the I/O processors can be regarded as forming a macro-pipeline, so the '*' operator is used to combine the two structures:
CDC 6600 = (I/O processors) * (central processor) = (10, 1, 12) * (1, 1 * 10, 60)
Texas Instruments' Advanced Scientific Computer (ASC) has one controller coordinating four arithmetic units. Each arithmetic unit is an eight-stage pipeline with 64-bit words. Thus we have:
ASC = (1, 4, 64 * 8)
The Cray-1 is a 64-bit single processor computer whose ALU has twelve functional units, eight of which can be chained together to form a pipeline. Different functional units have from 1 to 14 segments, which can also be pipelined. Handler's description of the Cray-1 is:
Cray-1 = (1, 12 * 8, 64 * (1 ~ 14))
Another sample system is Carnegie-Mellon University's C.mmp multiprocessor. This
system was designed to facilitate research into parallel computer architectures and
consequently can be extensively reconfigured. The system consists of 16 PDP-11
'minicomputers' (which have a 16-bit word length), interconnected by a crossbar
switching network. Normally, the C.mmp operates in MIMD mode for which the
description is (16, 1, 16). It can also operate in SIMD mode, where all the processors are
coordinated by a single master controller. The SIMD mode description is (1, 16, 16).
Finally, the system can be rearranged to operate in MISD mode. Here the processors are arranged in a chain with a single stream of data passing through all of them. The MISD mode description is (1 * 16, 1, 16). The 'v' operator is used to combine descriptions of the same piece of hardware operating in different modes. Thus, Handler's description for the complete C.mmp is:
C.mmp = (16, 1, 16) v (1, 16, 16) v (1 * 16, 1, 16)
The '*' and '+' operators are used to combine several separate pieces of hardware. The 'v' operator is of a different form from the other two, in that it is used to combine the different operating modes of a single piece of hardware.
While Flynn's classification is easy to use, Handler's classification is cumbersome. The direct use of numbers in the nomenclature of Handler's classification makes it much more abstract and hence difficult. Handler's classification is highly geared towards the description of pipelines and chains. While it is well able to describe the parallelism in a single processor, the variety of parallelism in multiprocessor computers is not addressed well.
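Handler's notation is mechanical enough that it can be generated with a few lines of code. The sketch below (the `handler` helper is invented for illustration) stores the six parameters and renders the description, dropping a pair's second element when it is 1, as the convention allows:

```python
# Render Handler descriptions (p * p', a * a', b * b'); a pair's second
# element is omitted when it equals 1. The helper name is illustrative.
def handler(p, p2, a, a2, b, b2):
    def pair(x, y):
        return str(x) if y == 1 else f"{x} * {y}"
    return "(" + ", ".join([pair(p, p2), pair(a, a2), pair(b, b2)]) + ")"

print(handler(1, 1, 1, 10, 60, 1))   # CDC 6600 main processor -> (1, 1 * 10, 60)
print(handler(1, 1, 4, 1, 64, 8))    # TI ASC -> (1, 4, 64 * 8)
print(handler(16, 1, 1, 1, 16, 1))   # C.mmp, MIMD mode -> (16, 1, 16)
```

Note that this sketch does not model the '~' range notation or the 'v' and '+' composition operators; those would need a richer representation than a flat six-tuple.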
2.5 STRUCTURAL CLASSIFICATION

[Figure: Structure of parallel computers — classified into tightly coupled systems and loosely coupled systems]

[Figure: Tightly coupled system — processors P1, P2, …, Pn access a shared memory through an interconnection network]

[Figure: Loosely coupled system — each processor P1, P2, …, Pn has its own local memory (LM) and communicates through an interconnection network]
[Figure 11: Tightly coupled system organisation — processors P1, P2, …, Pn connect to shared-memory modules M1, M2, …, Mn through a processor–memory interconnection network, and to I/O devices D1, D2, …, Dn through an I/O–processor interconnection network]
Each processor may use a cache memory for the frequent references made by the processor, as shown in Figure 12.
[Figure 12: Tightly coupled system with cache memories — each processor P1, P2, …, Pn has a private cache (C) between it and the interconnection network to memory modules M1, M2, …, Mn]
The shared memory multiprocessor systems can further be divided into three modes
which are based on the manner in which shared memory is accessed. These modes are
shown in Figure 13 and are discussed below.
the access is not uniform; it depends on the location of the memory. Thus, all memory words are not accessed uniformly.
2.5.1.3 Cache-Only Memory Architecture Model (COMA)
As we have discussed earlier, shared memory multiprocessor systems may use cache memories with every processor for reducing the execution time of an instruction. Thus, in the NUMA model, if we use cache memories instead of local memories, it becomes the COMA model. The collection of cache memories forms a global memory space. Remote cache access is also non-uniform in this model.
2.5.2 Loosely Coupled Systems
These systems do not share the global memory because shared memory concept gives rise
to the problem of memory conflicts, which in turn slows down the execution of
instructions. Therefore, to alleviate this problem, each processor in loosely coupled
systems is having a large local memory (LM), which is not shared by any other processor.
Thus, such systems have multiple processors with their own local memory and a set of
I/O devices. This set of processor, memory and I/O devices makes a computer system.
Therefore, these systems are also called multi-computer systems. These computer systems
are connected together via message passing interconnection network through which
processes communicate by passing messages to one another. Since every computer system
or node in multicomputer systems has a separate memory, they are called distributed
multicomputer systems. These are also called loosely coupled systems, meaning that
nodes have little coupling between them as shown in Figure 14.
[Figure 14: Loosely coupled system — each node consists of a processing element (P1, …, Pn) with its own local memory (LM); nodes communicate through a message-passing interconnection network]
Since local memories are accessible only to the attached processor, no processor can access remote memory. Therefore, these systems are also known as no-remote-memory-access (NORMA) systems. The message-passing interconnection network provides connection to every node, and inter-node communication via messages depends on the type of interconnection network. For example, the interconnection network for a non-hierarchical system can be a shared bus.
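The NORMA behaviour can be sketched with a toy simulation: each node owns a private local memory, and the only way information moves between nodes is by depositing a message in the destination's mailbox via the "network". The `Node` class and all names here are invented for this illustration:

```python
# Toy multicomputer (NORMA) simulation: nodes have private local memory and
# communicate only by message passing through an interconnection "network"
# (here, a dict of nodes). All names are illustrative.
from collections import deque

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.local_memory = {}      # private: no other node can touch this
        self.mailbox = deque()      # incoming messages

    def send(self, network, dest, payload):
        network[dest].mailbox.append((self.id, payload))

    def receive(self):
        sender, payload = self.mailbox.popleft()
        self.local_memory[f"from_{sender}"] = payload
        return payload

network = {i: Node(i) for i in range(3)}
network[0].send(network, 2, "hello")
print(network[2].receive())         # -> hello
print(network[2].local_memory)      # -> {'from_0': 'hello'}
```

Contrast this with a tightly coupled system, where all processors would read and write one shared dictionary directly, and memory conflicts would have to be arbitrated.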
Check Your Progress 2
1) What are the various rules and operators used in Handler’s classification for various
machine types?
…………………………………………………………………………………………..
…………………………………………………………………………………………..
…………………………………………………………………………………………..
…………………………………………………………………………………………..
2) What is the base for structural classification of parallel computers?
…………………………………………………………………………………………..
…………………………………………………………………………………………..
…………………………………………………………………………………………..
…………………………………………………………………………………………..
…………………………………………………………………………………………..
3) Define loosely coupled systems and tightly coupled systems.
………………………………………………………………………………………….
………………………………………………………………………………………….
………………………………………………………………………………………….
………………………………………………………………………………………….
4) Differentiate between UMA, NUMA and COMA.
…………………………………………………………………………………………..
…………………………………………………………………………………………..
…………………………………………………………………………………………..
…………………………………………………………………………………………..
2.6 CLASSIFICATION BASED ON GRAIN SIZE

[Figure: Program flow — alternative execution orders of statements S1, S2 and S3]
But it is not sufficient to check for the parallelism between statements or processes in a
program. The decision of parallelism also depends on the following factors:
• Number and types of processors available, i.e., architectural features of host
computer
• Memory organisation
• Dependency of data, control and resources
2.6.1 Parallelism Conditions
As discussed above, parallel computing requires that the segments to be executed in parallel must be independent of each other. So, before exploiting parallelism, all the conditions of parallelism between the segments must be analyzed. In this section, we discuss three types of dependency conditions between the segments (shown in Figure 16).
[Figure 16: Dependency conditions — data dependency, control dependency and resource dependency]
Data Dependency: This refers to the situation in which two or more instructions share the same data. The instructions in a program can be arranged based on the relationship of data dependency; this means how two instructions or segments are data dependent on each other. The following types of data dependencies are recognised:
i) Flow dependence: If instruction I2 follows I1 and the output of I1 becomes the input of I2, then I2 is said to be flow dependent on I1.
ii) Antidependence: When instruction I2 follows I1 such that the output of I2 overlaps with the input of I1 on the same data.
iii) Output dependence: When the outputs of two instructions I1 and I2 overlap on the same data, the instructions are said to be output dependent.
iv) I/O dependence: When read and write operations by two instructions are invoked on the same file, it is a situation of I/O dependence.
Consider the following program instructions:
I1 : a = b
I2 : c = a + d
I3 : a = c
In this program segment, instructions I1 and I2 are flow dependent because variable a is generated by I1 as output and used by I2 as input. Instructions I2 and I3 are antidependent because variable a is generated by I3 but used by I2, and in sequence I2 comes first. I3 is flow dependent on I2 because of variable c. Instructions I3 and I1 are output dependent because variable a is generated by both instructions.
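These dependences can be checked mechanically from each instruction's read set (R) and write set (W). The sketch below encodes the three instructions above; the helper names are invented for illustration, and i is assumed to come before j in program order:

```python
# Read/write sets for I1: a = b, I2: c = a + d, I3: a = c, and set-based
# tests for the dependence types (i is assumed to precede j in sequence).
R = {1: {"b"}, 2: {"a", "d"}, 3: {"c"}}
W = {1: {"a"}, 2: {"c"}, 3: {"a"}}

def flow(i, j):   return bool(W[i] & R[j])   # Ij reads what Ii writes
def anti(i, j):   return bool(R[i] & W[j])   # Ij overwrites what Ii reads
def output(i, j): return bool(W[i] & W[j])   # both write the same location

print(flow(1, 2))    # -> True : I2 reads 'a', written by I1
print(anti(2, 3))    # -> True : I3 overwrites 'a', read by I2
print(flow(2, 3))    # -> True : I3 reads 'c', written by I2
print(output(1, 3))  # -> True : both I1 and I3 write 'a'
```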
Control Dependence: Instructions or segments in a program may contain control structures. Therefore, dependency among the statements can arise within control structures as well. But the order of execution in control structures is not known before run time. Thus, dependency among the instructions in control structures must be analyzed carefully. For example, the successive iterations of the following loop are dependent on one another.
for (i = 1; i <= n; i++)
{
    if (x[i - 1] == 0)
        x[i] = 0;
    else
        x[i] = 1;
}
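The reason these iterations cannot run in parallel is a loop-carried dependence: each iteration reads x[i-1], which the previous iteration may have written. A loop whose iterations touch disjoint data has no such constraint. A minimal sketch with illustrative values:

```python
# Loop-carried dependence: iteration i reads x[i-1], written by iteration
# i-1, so the iterations must run in order.
n = 6
x = [0] * (n + 1)
for i in range(1, n + 1):
    x[i] = 0 if x[i - 1] == 0 else 1

# Independent iterations: each element of y depends only on i, so these
# iterations could, in principle, execute in parallel.
y = [2 * i for i in range(1, n + 1)]

print(x)  # -> [0, 0, 0, 0, 0, 0, 0]
print(y)  # -> [2, 4, 6, 8, 10, 12]
```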
Resource Dependence: The parallelism between instructions may also be affected by shared resources. If two instructions are using the same shared resource, it is a resource dependency condition. For example, when floating point units or registers are shared, this is known as ALU dependency. When memory is being shared, it is called storage dependency.
2.6.2 Bernstein Conditions for Detection of Parallelism
For execution of instructions or blocks of instructions in parallel, it should be ensured that the instructions are independent of each other. These instructions can be data dependent, control dependent, or resource dependent on each other. Here we consider only data dependency among the statements for taking decisions about parallel execution. A. J. Bernstein elaborated the work on data dependency and derived some conditions based on which we can decide the parallelism of instructions or processes.
Bernstein's conditions are based on the following two sets of variables:
i) The read set or input set RI, which consists of the memory locations read by statement or instruction I.
ii) The write set or output set WI, which consists of the memory locations written into by instruction I.
The sets RI and WI are not necessarily disjoint, as the same locations may be used for both reading and writing by a statement.
The following are Bernstein's parallelism conditions, which are used to determine whether two statements can execute in parallel:
1) The locations in R1 from which S1 reads and the locations W2 onto which S2 writes must be mutually exclusive; that is, S1 does not read from any memory location onto which S2 writes. It can be denoted as:
R1 ∩ W2 = φ
2) Similarly, the locations in R2 from which S2 reads and the locations W1 onto which S1 writes must be mutually exclusive; that is, S2 does not read from any memory location onto which S1 writes. It can be denoted as:
R2 ∩ W1 = φ
3) The memory locations W1 and W2 onto which S1 and S2 write must also be mutually exclusive; that is, S1 and S2 do not write onto the same memory location. It can be denoted as:
W1 ∩ W2 = φ
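The three conditions combine into a single predicate: two statements may execute in parallel only if all three intersections are empty. A minimal sketch, using statements S1: X = Y + Z, S2: Z = U + V and S3: R = S + V from this unit (the helper name is invented for illustration):

```python
# Bernstein's conditions: S1 and S2 can run in parallel iff
# R1∩W2, R2∩W1 and W1∩W2 are all empty.
def bernstein_parallel(r1, w1, r2, w2):
    return not (r1 & w2) and not (r2 & w1) and not (w1 & w2)

R1, W1 = {"Y", "Z"}, {"X"}   # S1: X = Y + Z
R2, W2 = {"U", "V"}, {"Z"}   # S2: Z = U + V
R3, W3 = {"S", "V"}, {"R"}   # S3: R = S + V

print(bernstein_parallel(R1, W1, R2, W2))  # -> False : R1 ∩ W2 = {'Z'}
print(bernstein_parallel(R1, W1, R3, W3))  # -> True  : all intersections empty
print(bernstein_parallel(R2, W2, R3, W3))  # -> True
```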
2.6.3 Parallelism Based on Grain Size

[Figure: Parallelism levels — Level 1: instruction level, Level 2: loop level, Level 3: procedure or sub-program level, Level 4: program level; the degree of parallelism decreases from Level 1 to Level 4]
1) Instruction level: This is the lowest level, and the degree of parallelism is highest at this level. Fine grain size is used at the instruction or statement level, as only a few instructions form a grain here. The fine grain size may vary according to the type of program. For example, for scientific applications the instruction-level grain size may be higher. As a higher degree of parallelism can be achieved at this level, the overhead for a programmer is also greater.
2) Loop level: This is another level of parallelism where iterative loop instructions can be parallelized. Fine grain size is used at this level also. Simple loops in a program are easy to parallelize, whereas recursive loops are difficult. This type of parallelism can be achieved through compilers.
2) Use Bernstein’s conditions for determining the maximum parallelism between the
instructions in the following segment:
S1: X = Y + Z
S2: Z = U + V
S3: R = S + V
S4: Z = X + R
S5: Q = M + Z
…………………………………………………………………………………………
…………………………………………………………………………………………
…………………………………………………………………………………………
…………………………………………………………………………………………
3) Discuss instruction level parallelism.
…………………………………………………………………………………………
…………………………………………………………………………………………
…………………………………………………………………………………………
………………………………………………………………………………
2.7 SUMMARY
2.8 SOLUTIONS/ANSWERS

Check Your Progress 2

1) The following rules and operators are used:
• The '*' operator is used to indicate that the units are pipelined or macro-pipelined with a stream of data running through all the units.
• The '+' operator is used to indicate that the units are not pipelined but work on independent streams of data.
• The 'v' operator is used to indicate that the computer hardware can work in one of
several modes.
• The '~' symbol is used to indicate a range of values for any one of the parameters.
• Peripheral processors are shown before the main processor using another three
pairs of integers. If the value of the second element of any pair is 1, it may be
omitted for brevity.
2) The base for structural classification is multiple processors with memory being globally shared between processors, or all the processors having their own local copies of memory.
3) When multiprocessors communicate through global shared memory modules, the organization is called a shared memory computer or tightly coupled system. When every processor in a multiprocessor system has its own local memory and the processors communicate via messages transmitted between their local memories, the organization is called a distributed memory computer or loosely coupled system.
4) In UMA, each processor has equal access time to the shared memory. In NUMA, local memories are connected with every processor, and a reference to the local memory of a remote processor is not uniform. In COMA, all the local memories of NUMA are replaced with cache memories.
Check Your Progress 3
1) Instructions I1 and I2 are both flow dependent and antidependent. Instructions I2 and I3 are output dependent, and instructions I1 and I3 are independent.
2) R1 = {Y,Z} W1 = {X}
R2 = {U,V} W2 = {Z}
R3 = {S,V} W3 = {R}
R4 = {X,R} W4 = {Z}
R5 = {M,Z} W5= {Q}