
High Performance Computing
LECTURE #2
Agenda
o What is parallel computing?
o Terminologies of parallel computing
o Performance Evaluation of parallel computing
o Challenges of parallel computing
o Parallel processing concepts:
   o How is parallelism expressed in a program
   o Architectural concepts related to parallelism (parallel processing)

What is parallel computing?
Multiple processors cooperating concurrently to solve one problem.

What is parallel computing?

“A parallel computer is a collection of processing elements that can communicate and cooperate to solve large problems fast” (Almasi and Gottlieb)

“communicate and cooperate”


• Nodes and interconnect architecture
• Problem partitioning (Co-ordination of events in a process)

“large problems fast”


• Programming model
• Match of model and architecture

What is parallel computing?

Some broad issues:


• Resource Allocation:
– How large a collection?
– How powerful are the elements?

• Data access, Communication and Synchronization


– How are data transmitted between processors?
– How do the elements cooperate and communicate?
– What are the abstractions and primitives for cooperation?

• Performance and Scalability


– How does it all translate into performance?
– How does it scale? (A service is said to be scalable when ...?)
Terminologies
❑ Core: a single computing unit with its own independent control.

❑ Multicore: a processor with several cores that can access the same memory concurrently.

❑ A computation is decomposed into several parts, called tasks, that can be computed in parallel.

❑ Finding enough parallelism is one of the critical steps toward high performance (Amdahl's law).
Performance Metrics
❑ Execution time:
The time elapsed between the beginning and the end of a program's execution.

❑ Speedup:
The ratio of serial time to parallel time.
Speedup = Ts / Tp

❑ Efficiency:
The ratio of speedup to the number of processors.
Efficiency = Speedup / P
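A small worked example (the timings below are made up for illustration): if a serial run takes Ts = 80 s and the parallel run on P = 16 processors takes Tp = 10 s, then

    Ts, Tp, P = 80.0, 10.0, 16    # assumed example timings (seconds) and processor count
    speedup = Ts / Tp             # 8.0
    efficiency = speedup / P      # 0.5, i.e. each processor is only 50% utilized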

Performance Metrics
❑ Amdahl’s Law
Used to predict maximum speedup using multiple processors.
• Let f = fraction of work performed sequentially.
• (1 - f) = fraction of work that is parallelizable.
• P = number of processors
On 1 CPU: T1 = f + (1 - f) = 1
On P processors: Tp = f + (1 - f)/P

• Speedup = T1/Tp = 1/(f + (1 - f)/P) < 1/f

Speedup limited by sequential part
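A quick numeric check of the bound, assuming a sequential fraction f = 0.1 (a value chosen only for illustration):

    f = 0.1                                    # assumed sequential fraction
    for p in (2, 4, 16, 1024):
        speedup = 1.0 / (f + (1.0 - f) / p)
        print(p, round(speedup, 2))            # ~1.82, 3.08, 6.4, 9.91
    # no matter how many processors are added, the speedup stays below 1/f = 10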

Challenges
All parallel programs contain:
❑ Parallel sections
❑ Sequential sections
❑ Sequential sections are where work is duplicated or no useful work is done (e.g., waiting for other processes).

1) Building efficient algorithms avoiding:


❑ Communication delay
❑ Idling
❑ Synchronization
2) Memory system challenges
Challenges
1) Sources of overhead in parallel programs
❑ Inter-process interaction:
The time spent communicating data between processing elements is
usually the most significant source of parallel processing overhead.

❑ Idling:
Processes may become idle due to many reasons such as load
imbalance, synchronization, and presence of serial components in a
program.

❑ Excess Computation:
The fastest known sequential algorithm for a problem may be difficult
or impossible to parallelize, forcing us to use a parallel algorithm
based on a poorer but easily parallelizable sequential algorithm.

Challenges
❖ 2) Memory system challenges
❖ The effective performance of a program on a computer relies on:
1. The speed of the processor (clock rates increased from 40 MHz (e.g., MIPS R3000, 1988) to 2.0 GHz (e.g., Pentium 4, 2002) to 8.429 GHz (AMD's Bulldozer-based FX chips, 2012))
2. The ability of the memory system to feed data to the processor

❖ Memory System Performance is mainly captured by two parameters, latency and bandwidth.

• Latency is the time from the issue of a memory request to the time the data is available at the processor.
Example: if the memory has a latency of 100 ns (no caches), and the processor has two multiply-add units and can execute four instructions in each 1 ns cycle, then the processor must wait 100 cycles before it can process the data.
[Figure: processor connected to memory through the data path]
◦ Bandwidth is the rate at which data can be pumped to the processor by the memory system.
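Using only the numbers already given on this slide (100 ns latency, 1 ns cycle, four instructions per cycle), a rough estimate of what one uncached memory access costs:

    latency_ns = 100                               # memory latency from the example
    cycle_ns = 1                                   # processor cycle time
    issue_width = 4                                # instructions issued per cycle

    stall_cycles = latency_ns // cycle_ns          # 100 cycles spent waiting for one word
    lost_slots = stall_cycles * issue_width        # up to 400 instruction slots wasted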
How is parallelism expressed in a program

IMPLICITLY:
❑ Define tasks only, rest implied; or define tasks and work decomposition, rest implied.
❑ OpenMP is a high-level parallel programming model, which is mostly an implicit model.
❑ Details are hidden from the programmer (implicit parallelism).

EXPLICITLY:
❑ Define tasks, work decomposition, data decomposition, communication, and synchronization.
❑ MPI is a library for fully explicit parallelization.
❑ Details are exposed to the programmer (explicit parallelism).
1- IMPLICITLY
❑It is a characteristic of a programming language that allows a compiler or interpreter to
automatically exploit the parallelism inherent to the computations expressed by some of the
language's constructs.

❑ A pure implicitly parallel language does not need special directives, operators or functions
to enable parallel execution.

❑ Programming languages with implicit parallelism include Axum, HPF, Id, LabVIEW, and MATLAB M-code, among others.

❑ Example: to take the sine or logarithm of a group of numbers, a language that provides implicit parallelism might allow the programmer to write the operation as shown below:
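A minimal sketch of what such an instruction typically looks like, written here with NumPy-style whole-array operations (the variable names and the use of NumPy are illustrative, not from the lecture):

    import numpy as np

    numbers = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
    result = np.sin(numbers)   # one array-level operation; the runtime is free to
                               # evaluate the eight sines in parallel, with no
                               # explicit task division by the programmer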

Advantages:
❑ The programmer does not need to worry about task division or process communication, and can focus instead on the problem the program is intended to solve.
❑ It generally facilitates the design of parallel programs.

Disadvantages:
❑ It reduces the control that the programmer has over the parallel execution of the program, sometimes resulting in less-than-optimal parallel efficiency.
❑ Debugging can sometimes be difficult.
2- EXPLICITLY
How is parallelism expressed in a program
❑ It is the representation of concurrent computations by means of primitives in the form
of special-purpose directives or function calls.

❑ Most parallel primitives are related to process synchronization, communication or task


partitioning.
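For contrast with the implicit example above, a minimal explicit-parallel sketch using MPI's Python binding mpi4py (the data decomposition and sizes are illustrative, not from the lecture): each process sums its own slice of the data and the partial sums are combined by an explicit communication call.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()          # this process's id
    size = comm.Get_size()          # total number of processes (degree of parallelism)

    # work decomposition: each process builds and sums its own block of data
    local = np.arange(rank * 1000, (rank + 1) * 1000, dtype=np.float64)
    partial = local.sum()

    # explicit communication: combine the partial sums on process 0
    total = comm.reduce(partial, op=MPI.SUM, root=0)
    if rank == 0:
        print("total =", total)

Run, for example, with mpiexec -n 4 python sum_example.py (the file name is hypothetical). Everything an implicit model would hide, the decomposition, the communication, and the collection of the result, is spelled out by the programmer.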

Advantages:
❑ Absolute programmer control over the parallel execution.
❑ A skilled parallel programmer can take advantage of explicit parallelism to produce very efficient code.

Disadvantages:
❑ Programming with explicit parallelism is often difficult, especially for non-specialists in computing, because of the extra work involved in planning the task division and synchronization of concurrent processes.
Architectural concepts related to parallelism
❖ Important architectural concepts related to parallel processing:

1- Implicit parallel platforms.


2- Explicit parallel platforms.

Parallel Processing Concepts

1- Implicit Parallelism
❖ Concerning memory-processor data-path bottlenecks, microprocessor designers have pursued alternate routes to cost-effective performance.

❖ One such route is the execution of multiple instructions in a single clock cycle.
Ex: the Itanium, Sparc Ultra, MIPS, and Power4 microprocessors.

❖ So, what mechanisms do these processors use to support the execution of multiple instructions in a single clock cycle?
1.1- Pipelining and superscalar execution.
1.2- Very long instruction word processors.
Parallel Processing Concepts
Implicit Parallelism
1.1- Pipelining & superscalar execution

❖ The Pentium 4, operating at 2.0 GHz, has a 20-stage pipeline.

◦ To increase the speed further, the pipeline must be divided into smaller and smaller stages.

❖ Also, the speed of a single pipeline is limited by the largest atomic task in the pipeline.

◦ For further speed improvement, multiple pipelines are used.

◦ During each cycle, multiple instructions are piped into the processor in parallel and then executed on multiple functional units.

Parallel Processing Concepts
Implicit Parallelism

❖ Use the idea of pipelining in a computer:
Fetch + Decode + Execute + Write

❖ Pipelining overlaps the various stages of instruction execution to achieve higher performance.

❖ Superscalar execution: the ability of a processor to issue multiple instructions in the same cycle.
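A rough timing model of why overlapping stages pays off (the stage count and instruction count below are assumed, and stalls are ignored): a k-stage pipeline finishes n instructions in about k + n - 1 cycles instead of k * n.

    k = 5                                  # pipeline stages (assumed)
    n = 1000                               # instructions (assumed)

    unpipelined = k * n                    # each instruction runs alone: 5000 cycles
    pipelined = k + n - 1                  # stages overlap: 1004 cycles
    speedup = unpipelined / pipelined      # ~4.98, approaching the stage count k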

Parallel Processing Concepts
Implicit Parallelism
❖ Consider a processor with two pipelines and the ability to simultaneously issue two instructions per clock cycle (hence it is called dual-issue execution).
- Consider the execution of code for adding 4 numbers.

Three different code fragments for adding a list of four numbers:
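As a minimal sketch of the idea behind such fragments, using plain Python assignments as stand-ins for instructions (the variable names are illustrative):

    a, b, c, d = 1.0, 2.0, 3.0, 4.0

    # Fragment 1: a serial dependency chain; every add needs the previous result,
    # so the dual-issue processor can never issue two of these adds together.
    s = a + b
    s = s + c
    s = s + d

    # Fragment 2: the first two adds are independent of each other, so they can
    # be issued in the same cycle; only the final add has to wait.
    t1 = a + b
    t2 = c + d
    s = t1 + t2

The same arithmetic, written with a different dependency structure, exposes a different amount of instruction-level parallelism to the hardware.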

Parallel Processing Concepts
Implicit Parallelism
Limitations:
❖ True Data Dependency: The result of one operation is an input to the next.

❖ Resource Dependency: Two operations require the same resource.

❖ Branch Dependency: in typical program traces, every fifth or sixth instruction is a conditional jump! This requires very accurate branch prediction.

❖ Scheduling instructions across conditional branch statements cannot be done deterministically a priori.

Parallel Processing Concepts
Implicit Parallelism
Superscalar Execution
❖ Scheduling of instructions is determined by a number of factors:
• True Data Dependency
• Resource Dependency
• Branch Dependency

• The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.

• The complexity of this hardware is an important constraint on superscalar processors.
Parallel Processing Concepts
Implicit Parallelism
Superscalar Execution (cont.): Issue Mechanisms
❖In the simpler model, instructions can be issued only in the order in which they
are encountered. That is, if the second instruction cannot be issued because it
has a data dependency with the first, only one instruction is issued in the cycle.
This is called in-order issue.

❖In a more aggressive model, instructions can be issued out of order. In this case,
if the second instruction has data dependencies with the first, but the third
instruction does not, the first and third instructions can be co-scheduled. This is
also called dynamic issue.

❖Performance of in-order issue is generally limited.
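A small illustration of the two issue models (the assignments are hypothetical stand-ins for machine instructions):

    x, y, z, u, v = 2.0, 3.0, 1.0, 4.0, 5.0

    a = x * y        # I1
    b = a + z        # I2: true data dependency on I1
    c = u * v        # I3: independent of I1 and I2

    # In-order issue: I2 cannot be issued together with I1, so only I1 issues in
    # the first cycle, and I3 must wait its turn behind I2.
    # Dynamic (out-of-order) issue: I1 and I3 can be co-scheduled in one cycle,
    # and I2 issues as soon as the result of I1 is available.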

Efficiency Considerations
❖ Due to limited parallelism in typical instruction traces, dependencies, or the inability of the scheduler to extract parallelism, the performance of superscalar processors is eventually limited.
Parallel Processing Concepts
Implicit Parallelism
1.2- Very Long Instruction Word (VLIW) Processors

❖ The parallelism extracted by a superscalar processor is often limited by the instruction look-ahead.

❖ The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.

❖ An alternative concept for exploiting instruction-level parallelism is used in VLIW processors.

❖ VLIW processors rely on the compiler to resolve dependencies and resource availability at compile time, where:

◦ Instructions that can be executed concurrently are packed into groups and given to the processor as a single long instruction word, to be executed on multiple functional units at the same time.
◦ Very sensitive to the compiler's ability to detect data and resource dependencies.
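A minimal sketch (not from the lecture) of this compile-time scheduling idea: operations whose dependencies are already satisfied are packed into one "long instruction word". The operation names and dependency lists below are hypothetical.

    ops = {
        "op1": [],              # no dependencies
        "op2": ["op1"],         # needs the result of op1
        "op3": [],              # independent
        "op4": ["op2", "op3"],  # needs op2 and op3
    }

    bundles = []                # each bundle models one long instruction word
    done = set()
    while len(done) < len(ops):
        bundle = [name for name, deps in ops.items()
                  if name not in done and all(d in done for d in deps)]
        bundles.append(bundle)
        done.update(bundle)

    print(bundles)              # [['op1', 'op3'], ['op2'], ['op4']]

All of this grouping happens at compile time; at run time the processor simply executes each bundle on its functional units, which is why VLIW depends so heavily on the compiler's dependency analysis.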
2- Explicit Parallelism
❖ Elements of a Parallel Computer

*Hardware:
▪ Multiple Processors
▪ Multiple Memories
▪ Interconnection Network

*System Software:
▪ Parallel Operating System
▪ Programming Constructs to Express/Orchestrate Concurrency

*Application Software:
▪ Parallel Algorithms

❖ Goal:
▪ Utilize the Hardware, System, & Application Software to either:
- Achieve Speedup.
- Solve problems requiring a large amount of memory.
Think Different
How many people doing the work → (Degree of Parallelism)

What is needed to begin the work → (Initialization)

Who does what → (Work distribution)

Access to work part → (Data/IO access)

Whether they need info from each other to finish their own job → (Communication)

When are they all done → (Synchronization)

What needs to be done to collate the result

