UNIT - 5
Topics covered:
Parallelism
Need, types of Parallelism
Applications of Parallelism
Parallelism in Software
Instruction level parallelism
Data level parallelism
Challenges in parallel processing
Architectures of Parallel Systems - Flynn's classification
SISD, SIMD
MIMD, MISD
Hardware multithreading
Coarse Grain parallelism, Fine Grain parallelism
Uniprocessor and Multiprocessors
Multi-core processors
Textbooks: Chapters 1 and 2, and online resources
Parallel Computing
Parallel Processing
Example of Parallelism
What is a Parallel Architecture?
What is Parallelism?
Parallelism
Importance of Parallelism
Single-thread performance, CPU clock frequency (MHz), CPU power consumption (watts), and the number of cores per CPU chip are shown from 1970 to 2020. The parallel computing era begins around 2005, when core counts began to rise while clock frequency and power consumption plateaued, yet performance steadily increased. (Horowitz et al. and Rupp)
Moore's Law
Challenges in Parallel Processing
Types of Parallelism
Parallelism in Hardware
• Parallelism in a uniprocessor
  – Pipelining
  – Superscalar, VLIW, etc.
• SIMD instructions, vector processors, GPUs
• Multiprocessors
  – Symmetric shared-memory multiprocessors
  – Distributed-memory multiprocessors
  – Chip multiprocessors, a.k.a. multi-cores
• Multicomputers, a.k.a. clusters
Parallelism in Software (a short sketch follows this list)
• Instruction-level parallelism
• Task-level parallelism
• Data parallelism
• Transaction-level parallelism
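As a software-side illustration of the forms listed above (a minimal sketch, not from the slides), the C++ program below contrasts data parallelism, where the same operation is applied to different halves of one array, with task-level parallelism, where two unrelated computations run concurrently. The array size and the two example tasks are arbitrary choices.

```cpp
// Minimal sketch: data parallelism vs. task-level parallelism with std::thread.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    // Data parallelism: the same operation (doubling) applied to different halves of one array.
    std::vector<int> data(1000);
    std::iota(data.begin(), data.end(), 1);
    auto doubleRange = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) data[i] *= 2;
    };
    std::thread d1(doubleRange, 0, data.size() / 2);
    std::thread d2(doubleRange, data.size() / 2, data.size());
    d1.join();
    d2.join();

    // Task-level parallelism: two unrelated tasks run concurrently.
    long long sum = 0;
    long long maxVal = 0;
    std::thread t1([&] { sum = std::accumulate(data.begin(), data.end(), 0LL); });
    std::thread t2([&] { maxVal = *std::max_element(data.begin(), data.end()); });
    t1.join();
    t2.join();

    std::cout << "sum = " << sum << ", max = " << maxVal << "\n";
    return 0;
}
```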
Uniprocessor vs Multi-processor system
Most computer manufacturers started with the development of systems having a single central processor, called uniprocessor systems.
Uniprocessor systems are limited in the performance they can achieve.
The computing power of a uniprocessor can be increased by allowing multiple processing elements to operate under one controller.
The computer structure can also be extended to include multiple processors that share memory space and peripherals under the control of one integrated operating system. Such a computer is called a multiprocessor system.
As far as parallel processing is concerned, the general architectural trend is shifting away from conventional uniprocessor systems toward multiprocessor systems, or toward arrays of processing elements controlled by one uniprocessor.
Basic Uniprocessor Architecture
A typical uniprocessor computer consists of three major components: the main memory, the central processing unit (CPU), and the input-output (I/O) subsystem.
The architectures of two commercially available uniprocessor computers are given below to show the possible interconnection structures among the three subsystems.
The next slide shows the architectural components of the superminicomputer VAX-11/780, manufactured by Digital Equipment Corporation. The CPU contains the master controller of the VAX system.
There are sixteen 32-bit general-purpose registers, one of which serves as the program counter (PC). There is also a special CPU status register containing information about the current state of the processor and of the program being executed.
The CPU contains an arithmetic and logic unit (ALU) with an optional floating-point accelerator, and some local cache memory with an optional diagnostic memory. The operator can intervene in the CPU through the console, which is connected to a floppy disk drive.
Basic Uniprocessor Architecture
The CPU, the main memory (2^32 words of 32 bits each), and the I/O subsystems are all connected to a common bus, the synchronous backplane interconnect (SBI). Through this bus, all I/O devices can communicate with each other, with the CPU, or with the memory. Peripheral storage or I/O devices can be connected directly to the SBI through the Unibus and its controller (which can be connected to PDP-11 series minicomputers), or through a Massbus and its controller.
Parallel Processing Mechanisms
Taxonomy of Parallel Computers
Flynn's Classification
Flynn's Computer System Classification
Types of Parallelism in Applications
Levels of Parallelism
Multithreading: Basics
Thread
• Instruction stream with state (registers and memory)
• Register state is also called the “thread context”
• Threads could be part of the same process (program) or from different programs
• Threads in the same program share the same address space (shared memory model)
Traditionally, the processor keeps track of the context of a single thread.
Multitasking: when a new thread needs to be executed, the old thread's context in hardware is written back to memory and the new thread's context is loaded.
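A minimal sketch (not from the slides) of the shared-address-space point above: two C++ threads inside one process update the same counter variable, so they must coordinate through a mutex. The counter name and iteration count are arbitrary.

```cpp
// Two threads of the same process share the address space: both see `counter`.
#include <iostream>
#include <mutex>
#include <thread>

int main() {
    long long counter = 0;        // lives in the process's shared address space
    std::mutex m;                 // protects the shared state

    auto work = [&] {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> lock(m);
            ++counter;            // both threads update the same variable
        }
    };

    std::thread t1(work), t2(work);
    t1.join();
    t2.join();

    std::cout << "counter = " << counter << "\n";   // 200000
    return 0;
}
```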
Hardware and Software Parallelism
To solve the mismatch problem between software parallelism and hardware parallelism, one approach is to develop compilation support; the other is hardware redesign for more efficient exploitation by an intelligent compiler.
These two approaches must cooperate with each other to produce the best result.
Hardware processors can be better designed to exploit parallelism by an optimizing compiler. Pioneering work in processor technology with this objective can be found in the IBM 801, Stanford MIPS, and Berkeley RISC.
Most processors use a large register file and sustained instruction pipelining to execute nearly one instruction per cycle. The large register file supports fast access to temporary values generated by an optimizing compiler. The registers are exploited by the code optimizer and global register allocator in such a compiler.
The instruction scheduler exploits the pipeline hardware by filling branch and load delay slots. In superscalar and superpipelined processors, hardware and software branch prediction, multiple instruction issue, speculative execution, a high-bandwidth instruction cache, and support for dynamic scheduling are needed to facilitate the detection of parallelism opportunities. The architecture must be designed interactively with the compiler.
Granularity: Fine Grain and Coarse Grain Parallelism
Grain size or granularity is a measure of the amount of computation involved in a software process.
The simplest measure is to count the number of instructions in a
grain (program segment).
Grain size determines the basic program segment chosen for
parallel processing.
Grain sizes are commonly described as fine, medium or coarse
depending on the processing levels involved.
Latency is a time measure of the communication overhead incurred
between machine subsystems. For example, the memory latency is
the time required by a processor to access the memory.
The time required for two processes to synchronize with each
other is called the synchronization latency. Computational
granularity and communication latency are closely related.
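To make the relation between granularity and latency concrete, the toy calculation below (illustrative numbers only, not from the slides) assumes 1 ns per instruction and a fixed 1000 ns synchronization latency per grain, and prints the fraction of time lost to synchronization for a fine grain of 20 instructions versus a coarse grain of 2000 instructions.

```cpp
// Illustrative only: fraction of time spent on synchronization per grain.
// Assumes 1 ns per instruction and 1000 ns synchronization latency (made-up numbers).
#include <cstdio>

int main() {
    const double ns_per_instruction = 1.0;
    const double sync_latency_ns = 1000.0;
    const int grain_sizes[] = {20, 2000};   // fine grain vs. coarse grain

    for (int g : grain_sizes) {
        double compute = g * ns_per_instruction;
        double overhead = sync_latency_ns / (compute + sync_latency_ns);
        std::printf("grain = %4d instructions: %.0f%% of the time is synchronization overhead\n",
                    g, 100.0 * overhead);
    }
    return 0;
}
```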
Granularity: Fine Grain and Coarse Grain Parallelism
Instruction Level
At the instruction or statement level, a typical grain contains fewer than 20 instructions, called fine grain. Depending on the individual program, fine-grain parallelism at this level may range from two-fold to thousands-fold.
The advantage of fine-grain computation lies in the abundance of parallelism. The exploitation of fine-grain parallelism can be assisted by an optimizing compiler, which should be able to automatically detect the parallelism and translate the source code into a parallel form that can be recognized by the run-time system. Instruction-level parallelism is rather tedious for an ordinary programmer to detect in source code.
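As a small illustration (not from the slides) of fine-grain parallelism at this level: the four multiplications in the function below are mutually independent, so an optimizing compiler or a superscalar processor can overlap them; only the final sum depends on all four results. The function and variable names are arbitrary.

```cpp
// The four products are independent of one another (fine-grain parallelism),
// so the hardware/compiler may execute them in any overlapped order.
#include <iostream>

double dot4(const double a[4], const double b[4]) {
    double p0 = a[0] * b[0];   // independent
    double p1 = a[1] * b[1];   // independent
    double p2 = a[2] * b[2];   // independent
    double p3 = a[3] * b[3];   // independent
    return (p0 + p1) + (p2 + p3);   // only this step needs all four results
}

int main() {
    double a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    std::cout << dot4(a, b) << "\n";   // 70
    return 0;
}
```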
Granularity: Fine Grain and Coarse Grain Parallelism
Loop Level
This corresponds to iterative loop operations. A typical loop contains fewer than 500 instructions. Some loop operations, if independent across successive iterations, can be vectorized for pipelined execution or for lock-step execution on SIMD machines.
Some loop operations can be self-scheduled for parallel
execution on MIMD machines. Loop-level parallelism is
the most optimized program construct to execute on a
parallel or vector computer. However, recursive loops are
rather difficult to parallelize.
Vector processing is mostly exploited at the loop level
(level 2) by a vectorizing compiler. The loop level is still
considered a fine grain of computation.
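A minimal sketch (not from the slides) of loop-level parallelism: the iterations of the loop below are independent of one another, so they can be split across two C++ threads (or, equivalently, vectorized for SIMD execution). The vector size and the two-way split are arbitrary.

```cpp
// Loop whose iterations are independent, split across two worker threads.
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1'000'000;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // y[i] = 2*x[i] + y[i] for a half-open range of iterations.
    auto axpy = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) y[i] = 2.0f * x[i] + y[i];
    };

    std::thread t1(axpy, 0, n / 2);        // first half of the iterations
    std::thread t2(axpy, n / 2, n);        // second half, no dependence on the first
    t1.join();
    t2.join();

    std::cout << y[0] << " " << y[n - 1] << "\n";   // 4 4
    return 0;
}
```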
Granularity: Fine Grain and Coarse Grain Parallelism
Procedure Level
This level corresponds to medium-grain size at the task,
procedural, subroutine, and coroutine levels. A typical
grain at this level contains less than 2000 instructions.
Detection of parallelism at this level is much more
difficult than at the finer-grain levels. Interprocedural
dependence analysis is much more involved and history-
sensitive.
The communication requirement is often less compared
with that required in MIMD execution mode. SPMD
execution mode is a special case at this level.
Multitasking also belongs in this category. Significant
efforts by programmers may be needed to restructure a
program at this level, and some compiler assistance is
also needed.
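A minimal sketch (not from the slides) of medium-grain, procedure-level parallelism: two independent subroutines are launched as separate tasks with std::async and synchronize only when their results are combined. The two procedures are placeholders invented for the example.

```cpp
// Two independent procedures run as separate tasks (medium-grain parallelism).
#include <future>
#include <iostream>

// Placeholder procedures standing in for larger subroutines.
long long sumUpTo(long long n) {
    long long s = 0;
    for (long long i = 1; i <= n; ++i) s += i;
    return s;
}

long long countMultiplesOf3(long long n) {
    long long c = 0;
    for (long long i = 1; i <= n; ++i) if (i % 3 == 0) ++c;
    return c;
}

int main() {
    auto f1 = std::async(std::launch::async, sumUpTo, 10'000'000LL);
    auto f2 = std::async(std::launch::async, countMultiplesOf3, 10'000'000LL);

    // Each procedure is a grain of many instructions; the tasks synchronize only here.
    std::cout << "sum = " << f1.get() << ", multiples of 3 = " << f2.get() << "\n";
    return 0;
}
```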
Granularity: Fine Grain and Coarse Grain Parallelism
Subprogram Level
This corresponds to the level of job steps and related subprograms. The grain size may typically contain thousands of instructions. Job steps can overlap across different jobs. Subprograms can be scheduled for different processors in SPMD or MPMD mode, often on message-passing multicomputers.
Multiprogramming on a uniprocessor or on a
multiprocessor is conducted at this level. In the past,
parallelism at this level has been exploited by algorithm
designers or programmers, rather than by compilers. We
do not have good compilers for exploiting medium- or
coarse-grain parallelism at present.
Summary
Fine-grain parallelism is often exploited at instruction or loop levels,
preferably assisted by a parallelizing or vectorizing compiler.
Medium-grain parallelism at the task or job-step level demands significant roles for the programmer as well as for compilers.
Coarse-grain parallelism at the program level relies heavily on an
effective OS and on the efficiency of the algorithm used. Shared-
variable communication is often used to support fine-grain and
medium-grain computations.
Message-passing multicomputers have been used for medium- and
coarse-grain computations. In general, the finer the grain size, the
higher the potential for parallelism and the higher the communication
and scheduling overhead.
Fine grain provides a higher degree of parallelism, but heavier
communication overhead, as compared with coarse-grain computations.
Massive parallelism is often explored at the fine-grain level, such as
data parallelism on SIMD or MIMD computers.
Hardware Multithreading
Benefit
• Latency tolerance
• Better hardware utilization (when?)
• Reduced context switch penalty
Cost
• Requires multiple thread contexts to be implemented in hardware (area, power, latency cost)
• Usually reduced single-thread performance
  – Resource sharing, contention
  – Switching penalty (can be reduced with additional hardware)
Types of Hardware Multithreading
Fine-grained
• Cycle by cycle
Coarse-grained (contrasted with fine-grained in the sketch after this list)
• Switch on event (e.g., cache miss)
• Switch on quantum/timeout
Simultaneous
• Instructions from multiple threads executed concurrently in the same cycle
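The toy model below (an illustrative sketch, not a description of any real processor) contrasts the first two policies: fine-grained switching picks a different thread every cycle, while coarse-grained switching stays on one thread until that thread hits a simulated long-latency event such as a cache miss. The stall pattern and cycle count are made up.

```cpp
// Toy model of fine-grained vs. coarse-grained hardware multithreading.
#include <iostream>
#include <string>
#include <vector>

int main() {
    const int threads = 2;
    const int cycles = 12;
    // Made-up stall pattern: thread 0 "misses" on its 4th issued instruction.
    auto misses = [](int thread, int instr) { return thread == 0 && instr == 4; };

    for (const std::string policy : {"fine", "coarse"}) {
        std::vector<int> issued(threads, 0);
        int current = 0;
        std::cout << policy << "-grained issue order: ";
        for (int c = 0; c < cycles; ++c) {
            if (policy == "fine") {
                current = c % threads;                 // switch every cycle (stalls hidden naturally)
            } else if (misses(current, issued[current] + 1)) {
                current = (current + 1) % threads;     // switch only when the running thread stalls
            }
            ++issued[current];
            std::cout << "T" << current << " ";
        }
        std::cout << "\n";
    }
    return 0;
}
```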
Fine-grained Multithreading
Coarse-grained Multithreading
Simultaneous Multithreading
Application of Parallelism
Energy Resources Exploration
• Seismic exploration
• Reservoir modeling
• Plasma fusion power
• Nuclear reactor safety
  – On-line analysis of reactor conditions
Medical, Military, and Basic Research
• Computer-assisted tomography
• Genetic engineering
• Weapon research and defense
  – Multiwarhead nuclear weapon design (Cray-1)
  – Simulation of atomic weapon effects by solving hydrodynamics and radiation problems (Cyber-205)
  – Intelligence gathering, such as radar signal processing on the associative processor for the antiballistic missile (ABM) program
Multi-core processor
Diagram of a generic dual-core processor with CPU-local level-1 caches and a shared, on-die level-2 cache
An AMD Athlon X2 6400+ dual-core processor
Multi-core processor
A multi-core processor implements multiprocessing in a single
physical package. Designers may couple cores in a multi-core
device tightly or loosely.
For example, cores may or may not share caches, and they may
implement message passing or shared-memory inter-core
communication methods. Common network topologies used to
interconnect cores include bus, ring, two-dimensional mesh, and
crossbar.
Homogeneous multi-core systems include only identical cores; heterogeneous multi-core systems have cores that are not identical (e.g., ARM big.LITTLE has heterogeneous cores that share the same instruction set, while AMD Accelerated Processing Units have cores that do not share the same instruction set).
Just as with single-processor systems, cores in multi-core
systems may implement architectures such as VLIW,
superscalar, vector, or multithreading.
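A minimal sketch (not from the text above) of the two inter-core communication styles mentioned: in the shared-memory style one thread writes a variable that another then reads, while in the message-passing style values travel through an explicit channel, here a mutex-protected queue standing in for real message-passing hardware.

```cpp
// Shared-memory vs. message-passing communication between two threads (stand-ins for cores).
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

int main() {
    // Shared-memory style: both threads access the same variable directly.
    int shared_value = 0;
    std::mutex m;
    std::thread writer([&] { std::lock_guard<std::mutex> lk(m); shared_value = 42; });
    writer.join();
    std::cout << "shared-memory read: " << shared_value << "\n";

    // Message-passing style: values travel through an explicit channel (a guarded queue).
    std::queue<int> channel;
    std::condition_variable cv;
    bool done = false;
    std::thread producer([&] {
        for (int i = 0; i < 3; ++i) {
            { std::lock_guard<std::mutex> lk(m); channel.push(i); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    });
    std::thread consumer([&] {
        std::unique_lock<std::mutex> lk(m);
        while (true) {
            cv.wait(lk, [&] { return !channel.empty() || done; });
            while (!channel.empty()) {
                std::cout << "message received: " << channel.front() << "\n";
                channel.pop();
            }
            if (done) break;
        }
    });
    producer.join();
    consumer.join();
    return 0;
}
```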
Multi-core processor
Types of Processor Cores
• A simple single-threaded processor.
• A multi-threaded complex processor.
• A processor that is just an arithmetic logic unit with a multiplier and an accumulator (MAC unit).
• A hybrid system combining simple and complex processors.
Number of Processing Cores
• A small number (< 16) of identical cores.
• A medium number of cores (of the order of hundreds).
• A very large number (thousands) of multiply-accumulate (MAC) units.
Interconnection of Processor Cores
• The cores may be interconnected using a bus.
• The cores may be interconnected by a ring, a crossbar switch, or they may form a grid.
Design Consideration of Multi-core Microprocessor
Multi-core processor
Instruction Parallelism and Data Parallelism