0% found this document useful (0 votes)

140 views

Mcap Notes

Uploaded by

bhuvanesh.cse23

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

140 views

Mcap Notes

Uploaded by

bhuvanesh.cse23

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 186

www.rejinpaul.

com – Multicore Architectures and programming (IV CSE )

UNIT - I
Single core to Multi-core architectures – SIMD and MIMD systems – Interconnection networks
Symmetric and Distributed Shared Memory Architectures – Cache coherence - Performance Issues –
Parallel program design.

om
Single-Core Processors
Single core processors have only one processor in die to process instructions. All the processor
developed by different manufacturers till 2005 were single core. In today’s’ computers we use multicore
processors but single core processor also perform very well. Single core processors have been discontinued
in new computers, so these are available at very cheap rates.

.c
ul
pa
Problems of Single Core Processors:
➢ Due to increase in the clock speed of this processor, the amount of heat produced by the chip also
jin
increases.
➢ It is a big hindrance in the way of single core processors to continue evolving.

MULTI-CORE PROCESSORS
➢ Multicore processor are the latest processors which became available in the market after 2005.
.re

➢ These processors use two or more cores to process instructions at the same time by using hyper
threading.
➢ The multiple cores are embedded in the same die.
➢ The multicore processor may looks like a single processor but actually it contains two (dual-core),
w

three (tricore), four (quad-core), six (hexa-core), eight (octa-core) or ten (deca-core) cores.
➢ Some processor even have 22 or 32 cores.
➢ Due to power and temperature constraint, the multicore processors are only practical solution for
w

increasing the speed of future computers.

Page | 1
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
Problems with Multicore processors:
➢ According to Amdahl’s law, the performance of parallel computing is limited by its serial
components.
➢ So, increasing the number of cores may not be the best solution.There is need to increase the clock
speed of individual cores.

om
Comparison of Single Core and Multi Cores :
Parameter Single Core Multi Core
Number of cores on a die Single Multiple
Instruction Execution Can execute Single instruction at a Can execute multiple instructions
time Can execute multiple by using
instructions by using multiple cores

.c
Gain Speed up every program or Speed up the programs which are
software being designed
executed for multi-core processors

ul
Performance Dependent on the clock frequency Dependent on the frequency,
of the core number of cores and program to be
executed
Examples Processor launched before 2005 Processor launched after 2005 like
pa
like80386,486, AMD 29000, AMD Core-2Duo,Athlon 64 X2, I3,I5
K6, Pentium I,II,III etc. And I7 etc.

SIMD and MIMD Systems:

Computer architecture can be classified into four main categories. These categories are defined under the
jin
Flynn’s Taxonomy. Computer architecture is classified by the number of instructions that are running in parallel and
how its data is managed. The four categories that computer architecture can be classified under are:
1. SISD: Single Instruction, Single Data
2. SIMD: Single Instruction, Multiple Data
3. MISD: Multiple Instruction, Single Data
.re

4. MIMD: Multiple Instruction, Multiple Data

The following diagram explains about Flynns technology

w
w
w

Page | 2
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
SIMD Architecture

Single Instruction stream, Multiple Data stream (SIMD) processors one instruction works on several
data items simultaneously by using several processing elements, all of which carried out same operation.

om
.c
SIMD system comprise one of the three most commercially successful classes of parallel computers
(the other beinvector supercomputer and MIMD systems). A number of factors have contributed to this

ul
success including :
➢ Simplicity of concept and programming
➢ Regularity of structure pa
➢ Easy scalability of size and performance
➢ Straightforward applicability in a number of fields which demands parallelism to achieve necessary
performance.

Basic Principles:
jin
➢ There is a two-dimensional array of processing elements, each connected to its four nearest
neighbors.
➢ All processors execute the same instruction simultaneously.
➢ Each processor incorporates local memory.
.re

➢ The processors are programmable, that is, they can perform a variety of functions.
➢ Data can propagate quickly through the array.
w
w
w

Implementing SIMD Architecture:

Two types of SIMD architectures exist:
1. True SIMD
2. Pipelined SIMD

Page | 3
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
True SIMD architecture: True SIMD architectures can be determined by its usage of distributed
memory and shared memory. Both true SIMD architectures possess similar implementation as seen on
Fig.4, but differ on placement of processor and memory modules.

om
.c
ul
True SIMD architecture with distributed memory:
➢ A true SIMD architecture with distributed memory possesses a control unit that interacts with every
processingelement on the architecture.
pa
➢ Each processor possesses their own local memory as observe on Fig.5.
➢ The processor elements are used as an arithmetic unit where the instructions are provided by the
controlling unit.
➢ In order for one processing element to communicate with another memory on the same architecture,
such as for information fetching, it will have to acquire it through the controlling unit.
jin
➢ This controlling unit handles the transferring of the information from one processing element to
another.
➢ The main drawback is with the performance time where the controlling unit has to handle the data
transfer.
.re
w
w
w

True SIMD architecture with Shared Memory:

➢ In this architecture, a processing element does not have a local memory but instead it’s connected to
a network where it can communicate with a memory component.
➢ In this architecture, the controlling unit is ignored when it comes to processing elements sharing
information.
➢ However, provides an instruction to the processing elements for computational reasons.
➢ The disadvantage in this architecture is that if there is a need to expand this architecture, each

Page | 4
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
module (processing elements and memory) has to be added separately and configured.
➢ This architecture is still beneficial since it improved performance time and the information can be
transferred more freely without the controlling unit.

om
.c
Pipelined SIMD Architecture:
➢ It implements the logic behind pipelining an instruction. Each processing element will receive an
instruction from the controlling unit, using a shared memory, and will perform computation at

ul
multiple stages.
➢ The controlling unit provides the parallel processing elements with instructions.
➢ The sequential processing element is used to handle other instructions.
pa
jin
.re

MIMD Architecture

MIMD stands for Multiple Instruction, Multiple Data. The MIMD class of parallel architecture is the
most familiar and possibly most basic form of parallel processor. MIMD architecture consists of a
collection of N independent, tightly coupled processors, each with memory that may be common to all
w

processors, and /or local and not directly accessible by the other processors.
w
w

Page | 5
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
Two types of MIMD architecture-:
1. Shared Memory MIMD architecture
2. Distributed Memory MIMD architecture
Shared Memory MIMD architecture:
➢ Create a set of processors and memory modules. Any processor can directly access any memory
module via an interconnection network

om
➢ The set of memory modules defines a global address space which is shared among the processors.

.c
ul
Distributed Memory MIMD architecture:
➢ It replicates the processor/memory pairs and connects them via an interconnection network.
➢ The processor/memory pair is called processing element (PE).
pa
➢ Each processing element (PE) can interact with each other via sending messages.
jin
.re
w

Comparison of SIMD and MIMD :

SIMD MIMD
Features
w

Abbreviation Single Instruction Multiple Data Multiple Instruction Multiple Data

Ease of programming and Single program, Processing Multiple communication programs,
Debugging element(PE) operate synchronously Processing element(PE) operate
w

asynchronously
Lower program memory One copy of the program is stored Each PE stores it own program
requirements
Lower instruction cost One decoder in control unit One decoder in each PE
Complexity of architectures Simple Complex
Cost Low Medium
Size and Performance Scalability in size and performance Complex size and good performance
Conditional Statements Conditional statements depends The multiple instruction stream of

Page | 6
www.rejinpaul.com
CS – Multicore Architectures and programming (IV CSE )
upon data local to processors, all of MIMD allow for more efficient
instructions of then block must execution of conditional statements
broadcast, followed by all else (e.g.-: if then else) because each
block processor can independently
follow either decision path
Low synchronization Implicit in program Explicit data structures and operations

om
overheads needed
Low PE-to-PE Automatic synchronization of all Explicit synchronization and
communication overheads “send” and “receive” operations identification protocols needed.
Efficient execution of Total execution time equals the Total execution time equals the
variable-time instructions sum of maximal execution times maximum execution time on a given
through all processors processor

.c
Using Flynn’s Taxonomy-: SIMD and MIMD. SIMD allow for more faster and multiple computation in
this field where sacrifice cannot be made on the delay of time. SIMD processing architecture
example-: a graphic processor processing instructions for translation or rotation or other operations are

ul
done on multiple data. MIMD processing architecture example is super computer or distributed computing
systems with distributed or single shared memory

Interconnection networks:
pa
➢ Networking strategy was originally employed in the 1950's by the telephone industry as a means of
reducing the time required for a call to go through.
➢ Similarly, the computer industry employs networking strategy to provide fast communication between
computer subparts, particularly with regard to parallel machines.
jin
➢ Any parallel system that employs more than one processor per application program must be designed
to allow its processors to communicate efficiently; otherwise, the advantages of parallel processing
may be negated by inefficient communication.
➢ This fact emphasizes the importance of interconnection networks to overall parallel system
.re

performance.
➢ In many proposed or existing parallel processing architectures, an interconnection network is used to
realize transportation of data between processors or between processors and memory modules.
Network Topology :
➢ Network topology refers to the layouts of links and switch boxes that establish interconnections.
w

➢ The links are essentially physical wires (or channels); the switch boxes are devices that connect a set
of input links to a set of output links. There are two groups of network topologies: static and
w

dynamic. Static networks provide fixed connections between nodes.

➢ With a static network, links between nodes are unchangeable and cannot be easily reconfigured.
Dynamic networks provide reconfigurable connections between nodes.
➢ The switch box is the basic component of the dynamic network.
w

➢ With a dynamic network the connections between nodes are established by the setting of a set of
interconnected switch boxes.

Static networks:
There are various types of static networks, all of which are characterized by their node degree; node
degree is the number of links (edges) connected to the node. Some well-known static networks are the

Page | 7
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
following:

Degree 1: shared bus

Degree 2: linear array, ring
Degree 3: binary tree, fat tree, shuffle-exchange
Degree 4: two-dimensional mesh (Illiac, torus)

om
Varying degree: n-cube, n-dimensional mesh, k-ary n-cube
A measurement unit, called diameter, can be used to compare the relative performance
characteristics of different networks. More specifically, the diameter of a network is defined as the largest
minimum distance between any pair of nodes. The minimum distance between a pair of nodes is the
minimum number of communication links (hops) that data from one of the nodes must traverse in order to

.c
reach the other node.

Shared bus :

ul
➢ The shared bus, also called common bus, is the simplest type of static network.
➢ The shared bus has a degree of 1. In a shared bus architecture, all the nodes share a common
communication link. pa
➢ The shared bus is the least expensive network to implement.
➢ Also, nodes (units) can be easily added or deleted from this network.
➢ However, it requires a mechanism for handling conflict when several nodes request the bus
simultaneously. This mechanism can be achieved through a bus controller, which gives access to the
bus either on a first-come, first-served basis or through a priority scheme.
jin
➢ The shared bus has a diameter of 1 since each node can access the other nodes through the shared
bus.
.re

Linear array :
The linear array (degree of 2) has each node connected with two neighbors (except the far ends
w

nodes). The linear quality of this structure comes from the fact that the first and last nodes are not
connected. Although the linear array has a simple structure, its design can mean long communication
delays, especially between far-end nodes. This is because any data entering the network from one end must
w

pass through a number of nodes in order to reach the other end of the network. A linear array, with N
nodes, has a diameter of N-1.
w

Ring :
➢ Another networking configuration with a simple design is the ring structure.
➢ A ring network has a degree of 2.
➢ Similar to the linear array, each node is connected to two of its neighbors, but in this case the
first and last nodes are also connected to form a ring.
➢ A ring can be unidirectional or bidirectional.

Page | 8
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ In a unidirectional ring the data can travel in only one direction, clockwise or counterclockwise.
Such a ring has a diameter of N-1, like the linear array.
➢ Bidirectional ring,in which data travel in both directions, reduces the diameter by a factor of 2, or
less if N is even. A bidirectional ring with N nodes has a diameter of 2/N . lthough this ring's
diameter is much better than that of the linear array, its configuration can still cause long
communication delays between distant nodes for large N.

om
.c
Binary tree :
➢ The top node is called the root, the four nodes at the bottom are called leaf (or terminal) nodes, and
the rest of the nodes are called intermediate nodes.

ul
➢ In such a network, each intermediate node has two children. The root has node address 1. The
addresses of the children of a node are obtained by appending 0 and 1 to the node's address that is,
the children of node x are labeled 2x and 2x+1.
➢ A binary tree with N nodes has diameter 2(h-1), where h= log2 N is the height of the tree. The binary
pa
tree has the advantages of being expandable and having a simple implementation.
jin

Fat tree:
➢ The structure of the fat tree is based on a binary tree. Each edge of the binary tree corresponds to
.re

two channels of the fat tree.

➢ One of the channels is from parent to child, and the other is from child to parent.
➢ The number of communication links in each channel increases as we go up the tree from the leaves
and is determined by the amount of hardware available.
w

➢ The fat tree can be used to interconnect the processors of a general-purpose parallel machine.
➢ Since its communication bandwidth can be scaled independently from the number of processors, it
provides great flexibility in design.
w
w

Page | 9
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
Shuffle-exchange :
➢ Another method for establishing networks is the shuffle-exchange connectioshuffle-exchange
network is a combination of two functions: shuffle and exchange.
➢ Each is a simbijection function in which each input is mapped onto one and only one output.
➢ Let sn-1, sn-2 ... s0 binary representation of a node address; then the shuffle function can be described
as shuffle(sn-1 sn-2 ... s0) = sn-2sn-3 ... s0 sn-1.

om
For example, using the shuffle function for N=8 (i.e. 2 nodes) the following connections can be established
between the nodes.

.c
ul
The reason that the function is called shuffle is that it reflects the process of shuffling cards. Given that there
are eight cards, the shuffle function performs a perfect playing card shuffle as follows. First, the deck is cut
in half, between cards 3 and 4. Then the two half decks are merged by selecting cards from each half
pa
in an alternative order.
In general, a polynomial of degree N can be represented as

where a0, a1, ... aN are the coefficients and x is a variable. One way to evaluate such a polynomial is to use
jin
the architecture shown in figure.In this figure, assume that each node represents a processor having three
registers: one to hold the coefficient, one to hold the variable x, and the third to hold a bit called the mask
bit with three registers
.re

The following represents the steps involved in the computation of aixi. Figure a shows the initial values of
w

the registers of each node. The coefficient ai, for i=0 to 7, is stored in node i. The value of the variable x is
stored in each node. The mask register of node i, for i=1, 3, 5, and 7, is set to 1; others are set to 0. In each
w

step of computation, every node checks the content of its mask register. When the content of the mask
register is 1, the content of the coefficient register is multiplied with the content of the variable register,
and the result is stored in the coefficient register. When the content of the mask register is zero, the content
w

of the coefficient register remains unchanged. The content of the variable register is multiplied with itself.
The contents of the mask registers are shuffled between the nodes using the shuffle network. Figures b,
c, and d show the values of the registers after the first step, second step, and third step, respectively. At the
end of the third step, each registers contains aixi.

Page | 10
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )

om
.c
ul
pa
jin
.re
w

Two-dimensional mesh :
w

➢ A two-dimensional mesh consists of k1*k0 nodes, where ki ≥ 2 denotes the number of nodes along
dimension i.
➢ The following represents a two-dimensional mesh for k0=4 and k1=2.
w

➢ There are four nodes along dimension 0, and two nodes along dimension 1. two-dimensional mesh
network each node is connected to its north, south, east, and west neighbors.
➢ In general, a node at row i and column j is connected to the nodes at locations (i-1, j), (i+1, j), (i, j-1),
and (i,j+1).
➢ The nodes on the edge of the network have only two or three immediate neighbors.

Page | 11
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )

om
Following figure shows a mesh with 16 nodes. From this point forward, the term mesh will indicated a two
dimensional mesh with an equal number of nodes along each dimension.

.c
ul
The routing of data through a mesh can be accomplished in a straightforward manner. The following
simple routing algorithm routes a packet from source S to destination D in a mesh with n nodes.
1. Compute the row distance R as
pa
2. Compute the column distance C as
jin
3. Add the values R and C to the packet header at the source node.
4. Starting from the source, send the packet for R rows and then for C columns.

➢ The values R and C determine the number of rows and columns that the packet needs to travel. The
.re

direction the message takes at each node is determined by the sign of the values R and C.
➢ When R (C) is positive, the packet travels downward (right); otherwise, the packet travels upward
(left). Each time that the packet travels from one node to the adjacent node downward, the value R is
decremented by 1, and when it travels upward, R is incremented by 1. Once R becomes 0, the packet
starts traveling in the horizontal direction.
w

➢ Each time that the packet travels from one node to the adjacent node in the right direction, the value
C is decremented by 1, and when it travels in the left direction, C is incremented by 1.
➢ When C becomes 0, the packet has arrived at the destination. For example, to route a packet from
w

node 6 (i.e., S=6) to node 12 (i.e., D= 12), the packet goes through two paths as shown in the figure
w

Page | 12
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
n-Cube or Hypercube :
➢ n is called the dimension of the n-cube network. When the node addresses are considered as the
corners of an n dimensional cube, the network connects each node to its n neighbors.
➢ n-cube, individual nodes are uniquely identified by n-bit addresses ranging from 0 to N-1.
➢ Given a node with binary address d, this node is connected to all nodes whose binary addresses
differ from d in exactly 1 bit.

om
➢ For example, in a 3cube, in which there are eight nodes, node 7(111) is connected to nodes 6(110),
5(101),and 3(011).

.c
ul
pa
The above figure explains about the connection between various nodes.
➢ As can be seen in the 3-cube, two nodes are directly connected if their binary addresses differ by 1
bit.
➢ This method of connection is used to control the routing of data through the network in a simple
manner. The following simple routing algorithm routes a packet from its source S = (sn-1 . . . s0) to
jin
destination D = (dn-1 . . . d0).
.re
w
w
w

➢ As shown in the above figure to route a packet from node 0 to node 5, the packet could go through
two different paths, P1 and P2. Here T=000 101 = 101.
➢ If we first consider the bit t0 and then t2, the packet goes through the path P1. Since t0 =1, the packet
is sent through the 0th-dimension link to node 1.

Page | 13
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ At node 1, t0 is set to 0; thus T now becomes equal to 100.
➢ Now, since t2=1, the packet is sent through the second-dimension link to node 5.
➢ If, instead of t0, bit t2 is considered first, the packet goes through P2.
➢ The maximum distance between nodes is 3. This is because the distance etween nodes is equal to the
number of bit positions in which their binary addresses differ.
➢ Since each address consists of 3 bits, the difference between two addresses can be at most 3 when

om
every bit at the same position differs.
➢ In general, in an n-cube the maximum distance between nodes is n, making the diameter equal to n.
➢ The n-cube network has several features that make it very attractive for parallel computation.
➢ It appears the same from every node, and no node needs special treatment.

.c
Symmetric and Distributed Shared Memory Architectures:
➢ Numerous designs on how to interconnect the processing nodes and memory modules were
published in the literature.
➢ Examples of such message based systems included Intel Paragon, N-Cube, IBM' SP systems.

ul
➢ As compared to shared memory systems, distributed memory (or message passing) systems can
accommodate larger number of computing nodes.
➢ This scalability was expected to increase the utilization of message-passing architectures.
pa
There are two principal types of MIMD systems:
• shared-memory systems and
• distributed-memory systems
In a shared-memory system a collection of autonomous processors is connected to a memory
system via an interconnection network, and each processor can access each memory location. In
jin
ashared-memory system, the processors usually communicate implicitly by accessing shared data
structures.
.re
w

In a distributed-memory system, each processor is paired with its own private memory, and the
w

processor-memory pairs communicate over an interconnection network. So in distributed-

memory systems the processors usually communicate explicitly by sending messages or by using
special functions that provide access to the memory of another processor.
w

Page | 14
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )

om
Shared-memory systems
The most widely available shared-memory systems use one or more multicore
processors.A multicore processor has multiple CPUs or cores on a single chip. Typically, the cores
have private level 1 caches, while other caches may or may not be shared between the cores.

.c
➢ In shared-memory systems to different multicore processors, the interconnect can either
interface all the processors legitimately to primary memory or every processor can have an
immediate association with a square of fundamental memory, and the processors can get to

ul
every others' squares of principle memory through exceptional equipment incorporated
with the processors. See Figures 2.5 and 2.6.
➢ In the first type of system, the time to access all the memory locations will be the same for
all the cores, while in the second type a memory location to which a core is directly
pa
connected can be accessed more quickly than a memory location that must be accessed
through another chip.
➢ Thus, the first type of system is called a uniform memory access, or UMA, system, while
the second type is called a nonuniform memory access, or NUMA, system.
jin
.re

➢ UMA frameworks are normally simpler to program, since the software engineer doesn't
w

have to stress over various access times for various memory areas.
➢ This advantage can be offset by the faster access to the directly connected memory in
NUMA systems. Furthermore NUMA systems have the potential to use larger amounts of
w

memory than UMA systems.

Page | 15
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
Distributed-memory systems
➢ The most widely available distributed-memory systems are called clusters.
➢ They are composed of a collection of commodity systems—for example, PCs—connected
by commodity interconnection network—for example, Ethernet.
➢ In fact, the nodes of these systems, the individual computational units joined together by

om
the communication network, are usually shared-memory systems with one or more
multicore processors.
➢ To distinguish such systems from pure distributed-memory systems, they are sometimes
called hybrid systems.
➢ These days, it's generally comprehended that a group will have shared-memory hubs. The
framework gives the foundation important to transform huge systems of geologically

.c
appropriated PCs into a brought together dispersed memory framework. As a rule, such a
framework will be heterogeneous, that is, the individual hubs might be worked from
various sorts of equipment.

ul
Differences in Types of implementation:
➢ There are two major differences in the two implementation, related to how the work is distributed and
how the worker processes access the needed data.
pa
➢ The Pthread version shows that each worker process is given the address of the data they need for
their work.
➢ In the MPI version, the actual data is sent to the worker processors. The worker processes of the
Pthread version access the needed data directly as if the data is local.
jin
➢ It can also be seen that the worker processors directly accumulate their partial results in a single
global variable (using mutual exclusion).
➢ The worker processes of the MPI program are supplied the actual data via messages, and they send
their partial results back to the master for the purpose of accumulation.
.re

Issues in Shared Memory Systems :

1. Data Coherency.
➢ The use of cache memories is so pervasive in today’s computer systems it is difficult to
imagine processors without them.
w

➢ Cache memories, along with virtual memories and processor registers form a continuum of
memory hierarchies that rely on the principle of locality of reference.
2. Cache Coherency.
➢ Techniques are needed to ensure that consistent data is available to all processors in a
w

multiprocessor system.
➢ Cache coherency can be maintained either by hardware techniques or software techniques
w

Entry and Scope Consistency :

➢ In both Weak ordering and Release consistency models, the shared memory is made consistent when
any synchronization variable is released (or accessed).
➢ if we can associate a set of shared variable with each synchronization variable, then we only need to
maintain consistency on these variables when the associated synchronization variable is released (or
accessed).

Page | 16
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ Entry consistency requires that the program specify the association between shared memory and
synchronization variables.
➢ Scope consistency is similar to Entry consistency, but the associations between shared variables and
synchronization variables is implicitly extracted.

Cache Coherence:

om
➢ CPU reserves are overseen by framework equipment: software engineers don't have direct power
over them.
➢ A few significant ramifications for shared-memory frameworks. To comprehend these issues,
assume we have a mutual memory framework with two centers, every one of which has its own
private information reserve. See Figure 2.17. For whatever length of time that the two centers just
read shared information, there is no issue.

.c
➢ For instance, assume that x is a mutual variable that has been instated to 2, y0 is private and claimed
by center 0, and y1 and z1 are private and possessed by center 1. Presently assume the
accompanying explanations are executed at the showed occasions

ul
pa
➢ Then the memory location for y0 will eventually get the value 2, and the memory location for y1
will eventually get the value 6.
➢ However, it’s not so clear what value z1 will get. It might at first appear that since core 0 updates x
jin
to 7 before the assignment to z1, z1 will get the value 4*7= 28.
➢ However, at time 0, x is in the cache of core 1. So unless for some reason x is evicted from core 0’s
cache and then reloaded into core 1’s cache, it actually appears that the original value x = 2 may be
used, and z1 will get the value 4*2= 8.
.re
w
w
w

➢ Note that this unpredictable behavior will occur regardless of whether the system is using a write-
through or a write-back policy.

Page | 17
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ If it’s using a write-through policy, the main memory will be updated by the assignment x = 7.
➢ However, this will have no effect on the value in the cache of core 1. If the system is using a write-
back policy,the new value of x in the cache of core 0 probably won’t even be available to core 1
when it updates z1.
➢ Clearly, this is a problem. The programmer doesn’t have direct control over when the caches are
updated, so her program cannot execute these apparently innocuous statements and know what will

om
be stored in z1.
At the point when shared information are reserved, the mutual worth might be recreated in numerous
stores. Notwithstanding the decrease in get to dormancy and required memory transfer speed, this
replication likewise gives a decrease in dispute that may exist for shared information things that are
being perused by various processors at the same time. This issue is called as reserve cognizance.There
are two main approaches to insuring cache coherence:
➢ snooping cache coherence

.c
➢ directory-based cache coherence.

Snooping cache coherence :

The idea behind snooping comes from bus-based systems:

ul
1. When the cores share a bus, any signal transmitted on the bus can be “seen” by all the cores
connected to the bus.
2. Hence, when center 0 updates the duplicate of x put away in its store, on the off chance that it
pa
additionally communicates this data over the transport, and if center 1 is "snooping" the transport, it
will see that x has been refreshed and it can check its duplicate of x as invalid. This is pretty much
how snooping reserve rationality functions.
3. The principal difference between our description and the actual snooping protocol is that the
broadcast only informs the other cores that the cache line containing x has been updated, not that x
jin
has been updated.
A couple of points should be made regarding snooping.
1. First, it’s not essential that the interconnect be a bus, only that it support broadcasts from each
processor to all the other processors.
2. Second, snooping works with both write-through and write back caches.
.re

3. In principle, if the interconnect is shared—as with a bus—with write through caches there’s no need
for additional traffic on the interconnect, since each core can simply “watch” for writes.
4. With write-back caches, on the other hand, an extra communication is necessary, since updates to the
cache don’t get immediately sent to memory.

Directory-based cache coherence :

1. Unfortunately, in large networks broadcasts are expensive, and snooping cache coherence requires a
broadcast every time a variable is updated .
w

2. So snooping cache coherence isn’t scalable, because for larger systems it will cause
performance to degrade.
3. For example, suppose we have a system with the basic distributed-memory architecture (Figure 2.4).
However, the system provides a single address space for all the memories. So, for example, core 0
w

can access the variable x stored in core 1’s memory, by simply executing a statement such as y = x.
4. Such a system can, in principle, scale to very large numbers of cores. However, snooping cache
coherence is clearly a problem since a broadcast across the interconnect will be very slow relative to
the speed of accessing local memory.

➢ Directory-based cache coherence protocols endeavor to take care of this issue using an information
structure called a catalog.

Page | 18
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ The registry stores the status of each reserve line. Commonly, this information structure is
appropriated; in our model, each center/memory pair may be answerable for putting away the piece
of the structure that determines the status of the store lines in its neighborhood memory.
➢ Along these lines, when a line is added something extra to, state, center 0's reserve, the registry
passage comparing to that line would be refreshed demonstrating that center 0 has a duplicate of the
line.
➢ At the point when a variable is refreshed, the index is counseled, and the store controllers of the

om
centers that have that variable's reserve line in their reserves are refuted.
➢ Plainly there will be considerable extra stockpiling required for the index, however when a reserve
variable is refreshed, just the centers putting away that variable should be reached

Performance:
➢ Speedup and efficiency

.c
Usually the best we can hope to do is to equally divide the work among the cores, while at the same
time introducing no additional work for the cores. If we succeed in doing this, and we run our program with
p cores, one thread or process on each core, then our parallel program will run p times faster than the serial
program.

ul
o If we call the serial run-time Tserial and our parallel run-time Tparallel, then the best we can hope
for is Tparallel = Tserial/p. When this happens, we say that our parallel program has linear
speedup.So if we define the speedup of a parallel program to be
pa
then linear speedup has S = p, which is unusual. Furthermore, as p increases, we expect S to become
jin
a smaller and smaller fraction of the ideal, linear speedup p.
o Another method for saying this is S=p will most likely get littler and littler as p increases..This
esteem, S=p, is in some cases called the effectiveness of the equal program. In the event that we
substitute the recipe for S, we see that the proficiency is
.re

o Many parallel programs are developed by dividing the work of the serial program among the
processes/threads and adding in the necessary“parallel overhead” such as mutual exclusion or
communication.
w

o Therefore, if Toverhead denotes this parallel overhead, it’s often the case that Tparallel = Tserial/p +
Toverhead.
o Furthermore, as the problem size is increased, Toverhead often grows more slowly than Tserial.
w

When this is the case the speedup and the efficiency will increase.
o A final issue to consider is what values of Tserial should be used when reporting speedups and
efficiencies.
w

➢ Amdahl’s law
o Gene Amdahl mentioned an objective fact that is gotten known as Amdahl's law. It says, generally,
that except if basically the entirety of a sequential program is parallelized, the conceivable speedup
will be exceptionally constrained—paying little heed to the quantity of centers accessible.
o Suppose, for example, that we’re able to parallelize 90% of a serialprogram.
o Further suppose that the parallelization is “perfect,” that is, regardless of the number of cores p we
use, the speedup of this part of the program will be p.

Page | 19
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
o If the serial run-time is Tserial= 20 seconds, then the run-time of the parallelized part will be
0.9*Tserial/p = 18/p and
o The run-time of the “unparallelized” part will be 0.1*Tserial =2. The overall parallel run-time will
be

om
o Now as p gets larger and larger, 0.9*Tserial/p =18/p gets closer and closer to 0, so the total parallel
run-time can’t be smaller than 0.1_Tserial D 2. That is, the denominator in S can’t be smaller than

.c
0.1_Tserial D 2. The fraction S must therefore be smaller than

ul
o That is, . This is saying that even though we’ve done a perfect job in parallelizing 90% of
the program, and even if we have, say, 1000 cores, we’ll never get a speedup better than 10.
pa
➢ Scalability
o The word “scalable” has a wide variety of informal uses. Roughly speaking, a technology is
scalable if it can handle ever-increasing problem sizes.
o However, in discussions of parallel program performance,scalability has a somewhat more formal
jin
definition.
o Suppose we run a parallel program with a fixed number of processes/threads and a fixed input
size, and we obtain an efficiency E.
o Suppose we now increase the number of processes/threads that are used by the program. If we can
find a corresponding rate of increase in the problem size so that the program always has
.re

efficiency E, then the program is scalable.

o As an example, suppose that Tserial =n, where the units of Tserial are in microseconds, and n is
also the problem size. Also suppose that Tparallel = n/p+1. Then
w

o To see if the program is scalable, we increase the number of processes/threads by a factor of k,

and we want to find the factor x that we need to increase the problem
w

size by so that E is unchanged. The number of processes/threads will be kp and the

problem size will be xn, and we want to solve the following equation for x:
w

o Well, if x =k, there will be a common factor of k in the denominator xn+kp =kn+kp = k(n+p), and
we can reduce the fraction to get

➢ Taking timings

Page | 20
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
There are a lot ofdifferent approaches, and with parallel programs the details may depend on the
API.However, there are a few general observations we can make that may make things a little easier.
oThe first thing to note is that there are at least two different reasons for taking timings.
oDuring program development we may take timings in order to determine if the
program is behaving as we intend.
oFor example, in a distributed-memory program we might be interested in finding out how much time

om
the processes are spending waiting for messages, because if this value is large, there is almost
certainly something wrong either with our design or our implementation.
o On the other hand, once we’ve completed development of the program, we’re often interested in
determining how good its performance is.
oPerhaps surprisingly, the way we take these two timings is usually different

Parallel Program Design:

.c
➢ The two main types of parallel systems are: shared memory systems and distributed
memory systems.
➢ In a shared-memory system, the cores can share access to the computer’s memory; in

ul
principle, each core can read and write each memory location.
➢ In a shared-memory system, we can coordinate the cores by having them examine and
update shared-memory locations. pa
➢ In a distributed-memory system, on the other hand, each core has its own, private memory,
and the cores must communicate explicitly by doing something like sending messages
across a network.
jin
.re
w

➢ The structural and computational patterns are composed to deﬁne software architecture.
➢ A software architecture deﬁnes the components that make up an application, the communication
among components, and the fundamental computations that occur inside components.
w

➢ The software architecture, however, has little to say about how the software architecture is mapped
onto the hardware of a parallel computer.
➢ To address parallelism, we combined the computational and structural patterns with the lower level,
parallel programming design pattern languages from these parallel programming patterns deﬁne a
w

distinct pattern language for parallel programming (PLPP).

➢ It's quite direct to compose a sequential program that creates a histogram. We have to choose what
the containers are, decide the quantity of estimations in each receptacle, and print the bars of the
histogram. Since we're not concentrating on I/O, we'll constrain ourselves to simply the initial two
stages, so the info will be
1. the number of measurements, data count
2. an array of data count floats, data

Page | 21
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
3. the minimum value for the bin containing the smallest values, min
4. the maximum value for the bin containing the largest values, max
5. the number of bins, bin count;
➢ The output will be an array containing the number of elements of data that lie in each bin. To make
things precise, we’ll make use of the following data structures:

om
➢ The cluster receptacle maxes will store the upper destined for each container, and canister checks
will store the quantity of information components in each canister. To be unequivocal, we can
characterize

Then bin maxes will be initialized by

.c
➢ We'll embrace the show that receptacle b will be all the estimations in the range

ul
➢ Obviously, this doesn't bode well if b = 0, and right now utilize the standard that receptacle 0 will be
the estimations in the range
pa
➢ This implies we generally need to treat receptacle 0 as an uncommon case, however this isn't
excessively burdensome. When we've instated container maxes and doled out 0 to all the
components of bin_counts, we can get the checks by utilizing the accompanying pseudo-code
jin

➢ The Find_bin work restores the receptacle that data[i] has a place with. This could be a
straightforward direct pursuit work: search through receptacle maxes until you discover a canister b
.re

that fulfills
w
w
w

Page | 22
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
UNIT - II
Performance – Scalability – Synchronization and data sharing – Data races – Synchronization primitives
(mutexes, locks, semaphores, barriers) – deadlocks and livelocks – communication between threads
condition variables, signals, message queues and pipes).

om
The past few decades have seen large ﬂuctuations in the perceived value of parallel computing. At
times, parallel computation has optimistically been viewed as the solution to all of our computational
limitations. At different occasions, many have contended that it is a misuse of exertion given the rate at
which processor paces and memory costs keep on improving. Recognitions keep on swaying between these
two limits because of various variables, among them: the consistent changes in the "hot" issues being

.c
explained, the programming conditions accessible to clients, the supercomputing market, the merchants
associated with building these supercomputers, and the scholarly network's concentration at some random
point and time. The outcome is a fairly muddied picture from which it is difﬁcult to impartially pass

ul
judgment on the worth and guarantee of equal registering.
➢ Picking a small workload will mislead you as to which parts of the code need to be optimized.
➢ You may have spent time optimizing the algorithmically simpler part of the code, when the

complex part of the code.

pa
performance of the application in a real-world situation will be dominated by the algorithmically

➢ This emphasizes why it is important to select appropriate workloads for developing and testing the
application.
➢ Different parts of the application will scale differently as the workload size changes, and regions that
jin
appear to take no time can suddenly become dominant.
➢ Another important point to realize is that a change of algorithm is one of the few things that can
make an order of magnitude difference to performance.
➢ If 80% of the application's runtime was spent arranging a 1,000-component cluster, at that point
.re

changing from an air pocket sort to a quicksort could have a 300× effect to the exhibition of that
work, making the time spent arranging 300× littler than it recently performed.
➢ The 80% of the runtime spent arranging would to a great extent vanish, and the application would
wind up running around multiple times quicker..
The following table illustrates, it takes remarkably few elements for an O(N2) algorithm to
w

start consuming significant amounts of time.

w
w

The challenges to realizing this potential can be grouped into two main problems: the hardware
problem and the software problem.

Structure Impacts Performance

Three attributes of the construction of an application can be considered as “structure.”

Page | 1
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
1. The first of these is the assemble structure, for example, how the source code is appropriated
between the source records.
2. The second structure is the manner by which the source records are consolidated into applications
and supporting libraries
3. Finally, and likely the most self-evident, is that way information is sorted out in the application

om
➢ Performance and Convenience Trade-Offs in Source Code and Build Structures The structure of the
source code for an application can make contrasts its exhibition.
➢ Source code is frequently disseminated across source documents for the comfort of the engineers.
➢ Performance openings are lost when the compiler sees just a solitary record at once.
➢ The single record may not present the compiler with all the open doors for enhancements that it may
have had if it somehow managed to see a greater amount of the source code.

.c
➢ This sort of restriction is obvious when a program utilizes an accessor work—a short capacity that
profits the estimation of some factor.
➢ A trivial optimization is for the compiler to replace this function call with a direct load of the value

ul
of the variable

Parallel Architectures : pa
This dissertation categorizes parallel platforms as being one of three rough types:
1. Distributed memory
2. Shared memory
3. Shared address space.
➢ This taxonomy is somewhat coarse given the wide variety of parallel architectures that have been
jin
developed, but it provides a useful characterization of current architectures for the purposes of this
dissertation.
➢ Distributed memory machines are considered to be those in which each processor hasa local memory
with its own address space.
.re

➢ A processor's memory can't be gotten to legitimately by another processor, requiring the two
processors to be included when imparting esteems starting with one memory then onto the next.
Instances of circulated memory machines incorporate item Linux bunches.
➢ Shared memory machines are those in which a solitary location space and worldwide memory are
w

shared between different processors.

➢ Each processor claims a neighborhood store, and its qualities are kept sound with the worldwide
memory by the working framework.
➢ Data can be traded between processors essentially by putting the qualities, or pointers to values, in a
w

predeﬁned area and synchronizing suitably.

➢ Examples of shared memory machines include the SGI Origin series and the Sun Enterprise. Shared
w

address space architectures are those in which each processor has its own local memory, but a single
shared address space is mapped across the distinct memories.
➢ Such architectures allow a processor to access the memories of other processors without theirdirect
involvement, but they differ from shared memory machines in that there is no implicit caching of
values located on remote machines.
➢ The primary example of a shared address machine is Cray’s T3D/T3E line.

Page | 2
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
The CTA Machine Model

➢ Compilation and execution on these diverse architectures by describing them using a single machine
model known as the Candidate Type Architecture (CTA)
➢ The CTA is a reasonably vague model, and deliberately so. It characterizes parallel machines as a
group of von Neumann processors, connected by a sparse network of unspeciﬁed topology. Each

om
processor has a local memory that it can access at unit cost.
➢ Processors can also access other processors’ values at a cost signiﬁcantly higher than unit cost by
communicating over the network.
➢ The CTA also speciﬁes a controller used for global communications and synchronization, though
that will not be of concern in this discussion. The following figure explains about CTA architecture

.c
ul
pa
Using Libraries to Structure Applications :
jin
Libraries are the usual mechanism for structuring applications as they become larger. There are some good
technical reasons to use libraries:
➢ Common functionality can be extracted into a library that can be shared between different projects or
.re

applications. This can lead to better code reuse, more efficient use of developer time, and more
effective use of memory and disk space.
➢ Placing functionality into libraries can lead to more convenient upgrades where only the library is
upgraded instead of replacing all the executables that use the library.
➢ Libraries can provide better separation between interface and implementation. The implementation
w

details of the library can be hidden from the users, allowing the implementation of the library to
evolve while maintaining a consistent interface.
➢ Stratifying functionality into libraries according to frequency of use can improve application start-up
w

time and memory footprint by loading only the libraries that are needed. Functionality can be loaded
on demand rather than setting up all possible features when the application starts.
➢ Libraries can be used as a mechanism to dynamically provide enhanced functionality. The
w

functionality can be made available without having to change or even restart the application.
➢ Libraries can enable functionality to be selected based on the runtime environment or characteristics
of the system. For instance, an application may load different optimized libraries depending on the
underlying hardware or select libraries at runtime depending on the type of work it is being asked to
perform.

Page | 3
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
On the other hand, there are some nontechnical reasons why functionality gets placed into libraries. These
reasons may represent the wrong choice for the user.
➢ Libraries frequently speak to an advantageous item for an authoritative unit. One gathering of
engineers may be answerable for a specific library of code, however that doesn't naturally suggest
that a solitary library speaks to the most ideal route for that code to be conveyed to the end clients.
➢ Libraries are likewise used to gather related usefulness. For instance, an application may contain a

om
library of string-taking care of capacities. Such a library may be suitable in the event that it contains
a huge group of code. Then again, in the event that it contains just a couple of little schedules, it may
be progressively proper to join it with another library.

There are a few contributors to cost:

➢ Library calls might be executed utilizing a table of capacity addresses. This table might be a

.c
rundown of addresses for the schedules remembered for a library. A library routine calls into this
table, which at that point hops to the real code for the everyday practice.
➢ Each library and its information are normally put onto new TLB passages. Calls into a library will as

ul
a rule additionally bring about an ITLB miss and conceivably a DTLB miss if the code gets to
library-explicit information.
➢ If the library is being languid stacked (that is, stacked into memory on request), there will be costs
pa
related with circle access and setting up the addresses of the library capacities in memory.
➢ Unix stages regularly give libraries as position-free code. This empowers a similar library to be
partaken in memory between various running applications. The expense of this is an expansion in
code length. Windows makes the contrary exchange off; it utilizes position-subordinate code in
jin
libraries, lessening the chance of sharing libraries between running applications however delivering
somewhat quicker code.
.re
w
w

Impact of Data Structures on Performance :

➢ When an application needs an item of data, it fetches it from memory and installs it in cache.
➢ The idea with caches is that data that is frequently accessed will become resident in the cache.
➢ The cost of fetching data from the cache is substantially lower than the cost of fetching it from
memory.
➢ Hence, the application will spend less time waiting for frequently accessed data to be retrieved from
memory.

Page | 4
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ The amount of data loaded into each level of cache by a load instruction depends on the size of the
cache line.
➢ 64 bytes is a typical length for a cache line; however, some caches have longer lines than this, and
some caches have shorter lines.
➢ Often the caches that are closer to the processor have shorter lines, and the lines further from the
processor have longer lines.

om
The following figure illustrates what happens when a line is fetched into cache from memory.
➢ On a cache miss, a cache line will be fetched from memory and installed into the second level cache.
The portion of the cache line requested by the memory operation is installed into the first-level
cache.
➢ In this scenario, Accesses to data on the same 16-byte cache line as the original item will also be

.c
available from the first-level cache.

ul
pa
jin

➢ Accesses to data that share the same 64-byte cache line will be fetched from the second-level cache.
.re

➢ Accesses to data outside the 64-byte cache line will result in another fetch from memory.

If data is fetched from memory when it is needed, the processor will experience the entire latency of the
memory operation. On a modern processor, the time taken to perform this fetch can be several hundred
cycles.
w

There are techniques that reduce this latency:

1. Out-of-request execution is the place the processor will look the guidance stream for future
guidelines that it can execute.
w

2. Hardware prefetching of information streams Hardware prefetching can be exceptionally compelling

in circumstances where information is brought as a stream or through a strided get to design. It can't
prefetch information where the entrance design is less evident.
w

3. Software prefetching is the demonstration of adding guidelines to bring information from memory
before it is required.

➢ Another approach to covering memory latency costs is with CMT processors. When one thread stalls
because of a cache miss, the other running threads get to use the processor resources of the stalled
thread.

Page | 5
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ This approach does not improve the execution speed of a single thread. This can enable the
processor to achieve more work by sustaining more active threads, improving throughput rather than
single-threaded performance.
➢ There are a number of common coding styles that can often result in suboptimal layout
of data in memory.

om
Improving Performance Through Data Density and Locality :

➢ Paying regard for the request in which factors are pronounced and spread out in memory can
improve execution.
➢ When a heap acquires a variable from memory, it likewise gets the remainder of the store line in
which the variable resides.

.c
➢ Placing factors that are ordinarily gotten to together into a structure so they dwell on a similar store
line will prompt execution gains.
struct s

ul
{
int var1;
int padding1[15]; pa
int var2;
int padding2[15];
}
When the structure member var1 is accessed, the fetch will also bring in the surrounding 64 bytes.
The size of an integer variable is 4 bytes, so the total size of var1 plus padding1 is 64 bytes. This ensures
jin
that the variable var2 is located on the next cache line.Important Structure Members Are Likely to Share a
Cache Line
struct s
{
.re

int var1;
int var2;
int padding1[15];
int padding2[15];
w

➢ If the structure doesn't fit precisely into the length of the reserve line, there will be circumstances
w

when the neighboring var1 and var2 are part more than two cache lines.
➢ This presents an issue. Is it better to pack the structures as close as conceivable to fit however many
of them as would be prudent into a similar reserve line, or is it better to add cushioning to the
w

structures to make them reliably line up with the store line limits? The accompanying figure shows
the two circumstances.
➢ The answer will rely upon different components. By and large, the best answer is likely to pack the
structures as firmly as would be possible.
➢ This will imply that when one structure is gotten to, the entrance will likewise bring pieces of the
encompassing structures.

Page | 6
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )

om
➢ Where it is appropriate to add padding to the structure is when the structures are always accessed
randomly, so it is more important to ensure that the critical data is not split across a cache line.

.c
➢ The performance impact of poorly ordered structures can be hard to detect. The cost is spread over
all the accesses to the structure over the entire application.
➢ Reordering the structure members can improve the performance for all the routines that access the

ul
structures.
➢ Determining the optimal layout for the structure members can also be difficult.One guideline would
be to order the structure members by access frequency or group them by those that are accessed in
pa
the hot regions of code.
➢ It is also worth considering that changing the order of structure members could introduce a
performance regression if the existing ordering happens to have been optimal for a different
frequently executed region of code.
➢ A similar optimization is structure splitting, where an existing structure is split into members that are
jin
accessed frequently and members that are accessed infrequently.
➢ If the infrequently accessed structure members are removed and placed into another structure, then
each fetch of data from memory will result in more of the critical structures being fetched in one
action.
.re

➢ Taking the previous example, where we assume that var3 is rarely needed, we would end up with a
resulting pair of structures, as shown in following figure.
w
w
w

Page | 7
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
In this instance, the original structure s has been split into two, with s0 containing all the frequently accessed
data and s1 containing all the infrequently accessed data. In the limit, this optimization is converting what
might be an array of structures into a set of arrays, one for each of the original structure members.

Selecting the Appropriate Array Access Pattern :

One common data access pattern is striding through elements of an array. The performance of the

om
application would be better if the array could be arranged so that the selected elements were contiguous.The
following code explains Non contagious memory access pattern .
Noncontiguous Memory Access Pattern
{
double ** array;

.c
double total=0;
…
for (int i=0; i<cols; i++)
for (int j=0; j<rows; j++)

ul
total += array[j][i];
…
} pa
➢ These components won't be situated in bordering memory. In Fortran, the contrary requesting is
followed, so neighboring components of the primary file are nearby in memory.
➢ This is called section significant request. Getting to components by a walk is a typical blunder in
codes made an interpretation of from Fortran into C. shows how memory is tended to in C, where
jin
nearby components straight are neighboring in memory.
.re
w

➢ Fortunately, most compilers are often able to correctly interchange the loops and improve the
memory access patterns.
➢ However, there are many situations where the compiler is unable to make the necessary
w

transformations because of aliasing or the order in which the elements are accessed in the loop.
➢ In these cases, it is necessary for the developer to determine the appropriate layout and then
restructure the code appropriately
w

Scalability :

Data Races and the Scaling Limitations of Mutex Locks :

1. Bugs as a result of information races are the most evident appearance of a parallelization
issue.
2. These identify with updates of factors without guaranteeing selective access to the factors.

Page | 8
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
3. These are normally settled by including synchronization natives, (for example, mutex locks)
into the code to guarantee selective access to the factors.
4. Although mutex locks can be utilized to guarantee that lone a solitary string approaches an
asset at once, they can't authorize the requesting of gets to information.
➢ An alternative approach is required if there is an ordering constraint on the accesses to shared
resources.

om
➢ For example, if two threads need to update a variable, a mutex can ensure that they do not update the
variable at the same time.
➢ However, a mutex cannot force one of the two threads to be the last to perform the update. The
problem with adding mutex locks into the code is that they serialize the access to the variables.
➢ Only a single thread can hold the lock, so if there are multiple threads that need to access the data,

.c
the application effectively runs serially because only one thread can make progress at a time.
➢ Even if requiring the mutex lock is a rare event, it can become a bottleneck if the lock is held for a
long time or if there are many threads requiring access to the lock.

ul
Here we find out scaling issues in code provided not only in application and also in libraries.The
following is the Code to Testing Scaling of malloc() and free()
#include <stdlib.h>
#include <pthread.h>
int nthreads;
pa
void *work( void * param )
{
jin
int count = 1000000 / nthreads;
for( int i=0; i<count; i++ )
{
void *mem = malloc(1024);
.re

free( mem );
}
}
int main( int argc, char*argv[] )
{
w

pthread_t thread[50];
nthreads = 8;
if ( argc > 1 )
w

{
nthreads = atoi( argv[1] );
}
w

for( int i=0; i<nthreads; i++ )

{
pthread_create( &thread[i], 0, work, 0 );
}
for( int i=0; i<nthreads; i++ )
{

Page | 9
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
pthread_join( thread[i], 0 );
}
return 0;
}
➢ If a default execution of malloc() and free() utilizes a solitary mutex lock then execution won't
improve with numerous strings.

om
➢ Consider an option malloc() that utilizes an alternate calculation. Each string has its own load of
memory, so it doesn't require a mutex lock. This option malloc() scales as the quantity of strings
increments.
➢ As expected, the default execution doesn't scale, so the runtime doesn't improve. The expansion in
runtime is a direct result of more strings battling for the single mutex lock..

.c
➢ The elective execution shows awesome scaling. As the quantity of strings builds, the runtime of the
application decreases.
➢ For the singlethreaded case, the default malloc() gives preferred execution over the elective usage.

ul
The calculation that gives improved scaling likewise adds an expense to the single-strung
circumstance; it tends to be difficult to create a calculation that is quick for the single-strung case
and scales well with different strings.

Superlinear Scaling :
pa
➢ Doubling the resources yet get more than double the performance as a result.
➢ In most instances, going from one thread to two will result in, at most, a doubling of performance.
➢ However, there will be applications that do see super linear scaling—the application ends up running
jin
more than twice as fast.
➢ This is typically because the data that the application uses becomes cache resident at some point.
Imagine an application that uses 4MB of data.
➢ On a processor with a 2MB cache, only half the data will be resident in the cache.
.re

➢ Adding a second processor adds an additional 2MB of cache; then all the data becomes cache
resident, and the time spent waiting on memory becomes substantially lower.
Challenges to Parallel Programming :

➢ Writing parallel programs is strictly more difﬁcult than writing sequential ones.
w

➢ In sequential programming, the programmer must design an algorithm and then express it to the
computer in some manner that is correct, clear, and efﬁcient to execute.
➢ Parallel programming involves these same issues, but also adds a number of additional challenges
w

that complicate development and have no counterpart in the sequential realm.

➢ These challenges include: ﬁnding and expressing concurrency, managing data distributions,
managing interprocessor communication, balancing the computational load, and simply
w

implementing the parallel algorithm correctly.

Hardware constraints applicable to improve scaling.

There are three critical areas that can make a large difference to scaling.
1. The amount of bandwidth to cache and the memory will be divided among the active threads on the
system.

Page | 10
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
2. The design of the caches will determine how much time is lost because of capacity and conflict-
induced cache misses.
3. The way that the processor core pipelines are shared between active software threads will determine
how instruction issue rates change as the number of active threads increases.

Bandwidth Sharing Between Cores :

om
➢ Bandwidth is another asset shared between threads.
➢ The data transfer capacity limit of a framework relies upon the structure of the processor and the
memory framework just as the memory chips and their area in the framework.
➢ The data transfer capacity a processor can expend is a component of the quantity of remarkable
memory demands and the rate at which these can be returned.
➢ These memory solicitations can emerge out of either equipment or programming prefetches, just as

.c
from burden or store tasks.
➢ Since each string can give memory demands, the more strings that a processor can run, the more data
transmission the processor can expend. string-taking care of library schedules, for example, strlen()

ul
or memset() can be enormous shoppers of memory transfer speed.
➢ The following code explains memset to define memory bandwidth

#include <stdio.h>
#include <stdlib.h>
pa
#include <strings.h>
#include <pthread.h>
jin
#include <sys/time.h>
#define BLOCKSIZE 1024*1025
int nthreads = 8;
char * memory;
.re

double now()
{
struct timeval time;
gettimeofday( &time, 0 );
return (double)time.tv_sec + (double)time.tv_usec / 1000000.0;
w

}
void *experiment( void *id )
{
w

unsigned int seed = 0;

int count = 20000;
for( int i=0; i<count; i++ )
w

{
memset( &memory[BLOCKSIZE * (int)id], 0, BLOCKSIZE );
}
if ( seed == 1 ){ printf( "" ); }
}
int main( int argc, char* argv[] )
{

Page | 11
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
pthread_t threads[64];
memory = (char*)malloc( 64*BLOCKSIZE );
if ( argc > 1 )
{
nthreads = atoi( argv[1] );
}

om
double start = now();
for( int i=0; i<nthreads; i++ )
{
pthread_create( &threads[i], 0, experiment, (void*)i );
}

.c
for ( int i=0; i<nthreads; i++ )
{
pthread_join( threads[i], 0 );

ul
}
double end = now();
printf( "%i Threads Time %f s Bandwidth %f GB/s\n", nthreads,(end – start) ,( (double)nthreads *
BLOCKSIZE * 20000.0 ) / ( end – start) / 1000000000.0 );

}
return 0;
pa
the bandwidth measured by the test code for one to eight virtual CPUs on a system with 64 virtual CPUs.
Memory Bandwidth Measured on a System with 64 Virtual CPUs
jin
1 Threads Time 7.082376 s Bandwidth 2.76 GB/s
2 Threads Time 7.082576 s Bandwidth 5.52 GB/s
3 Threads Time 7.059594 s Bandwidth 8.31 GB/s
4 Threads Time 7.181156 s Bandwidth 10.89 GB/s
.re

5 Threads Time 7.640440 s Bandwidth 12.79 GB/s

6 Threads Time 11.252412 s Bandwidth 10.42 GB/s
7 Threads Time 14.723671 s Bandwidth 9.29 GB/s
8 Threads Time 17.267288 s Bandwidth 9.06 GB/s
For this particular system, the bandwidth scales nearly linearly with the number of threads until about six
w

threads. After six threads, the bandwidth reduces.

There are several effects that can cause this :
➢ The threads are interfering on the processor.
w

➢ A second interaction effect is if the threads start interfering in the caches, such as multiple threads
attempting to load data to the same set of cache lines.
➢ One other impact is the conduct of memory chips when they become soaked. the chips begin
w

encountering lining latencies where the reaction time for each solicitation increments. Memory chips
are organized in banks. Getting to a specific location will prompt a solicitation to a specific bank of
memory. Each bank needs a hole between returning two reactions. In the event that different strings
happen to hit a similar bank, at that point the reaction time becomes represented by the rate at which
the bank can return memory back.

Page | 12
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
Memory Bandwidth Measured on a System with Four Virtual CPUs
1 Threads Time 7.437563 s Bandwidth 2.63 GB/s
2 Threads Time 15.238317 s Bandwidth 2.57 GB/s
3 Threads Time 24.580981 s Bandwidth 2.39 GB/s
4 Threads Time 37.457352 s Bandwidth 2.09 GB/s
False Sharing

om
False sharing is the situation where multiple threads are accessing items of data held on a single cache line.
Although the threads are all using separate items of data, the cache line
itself is shared between them so only a single thread can write to it at any one time. This
is purely a performance issue
Example of False Sharing

.c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

ul
#include <sys/time.h>
double now()
{
struct timeval time;
gettimeofday( &time, 0 );
pa
return (double)time.tv_sec + (double)time.tv_usec / 1000000.0;
}
#define COUNT 100000000
jin
volatile int go = 0;
volatile int counters[20];
void *spin( void *id )
{
.re

int myid = (int)id + 1;

while( !go ) {}
counters[myid] = 0;
while ( counters[myid]++ < COUNT ) {}
}
w

int main( int argc, char* argv[] )

{
pthread_t threads[256];
w

int nthreads = 1;
if ( argc > 1 ) { nthreads = atoi( argv[1] ); }
for( int i=1; i<nthreads; i++ )
w

{
pthread_create( &threads[i], 0, spin, (void*)i );
}
double start = now();
go = 1;
spin( 0 );

Page | 13
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
double end = now();
printf("Time %f ns\n", ( end – start ) );
for( int i=0; i<nthreads; i++ )
{
pthread_join( threads[i], 0 );
}

om
return 0;
}
If we run ABOVE CODE WITH a single thread, the thread completes its work in about nine seconds on a
system with two dual-core processors. Using four threads on the same system results in a runtime for the
code of about 100 seconds—a slowdown of about 10 times.It is very easy to solve false sharing by padding

.c
the accessed structures so that the variable used by each thread resides on a separate cache line.

Cache Conflict and Capacity :

➢ The two issues that can occur with shared caches: capacity misses and conflict misses.A conflict

ul
cache miss is where one thread has caused data needed by another thread to be evicted from the
cache.
➢ Data structures such as stacks tend to be aligned on cache line boundaries, which increases the
pa
likelihood that structures from different processes will map onto the same address.
➢ The following code is used to Print the Stack Address for Different Threads
printf("Stack base address = %x for thread %i\n", &stack, (int)param);
➢ The expected output when this code is run on 32-bit Solaris indicates that threads are created with a
jin
1MB offset between the start of each stack.
➢ For a processor with a cache size that is a power of two and smaller than 1MB, a stride of 1MB
would ensure the base of the stack for all threads is in the same set of cache lines.
➢ The associativity of the cache will reduce the chance that this would be a problem.
➢ A cache with an associativity greater than the number of threads sharing is less likely to have a
.re

problem with conflict misses.

Data Races

➢ Data races are the most widely recognized programming blunder found in equal code. An
w

information race happens when numerous strings utilize similar information thing and at
least one of those strings are refreshing it. It is best delineated by a model. Assume you
w

have the code appeared in following code, where a pointer to a whole number variable is
passed in and the capacity augments the estimation of this variable by 4.
w

Page | 14
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )

om
➢ In the example, each thread adds 4 to the variable, but because they do it at exactly the
same time, the value 14 ends up being stored into the variable. If the two threads had
executed the code at different times, then the variable would have ended up with the value
of 18.

.c
➢ Another circumstance may be the point at which one string is running, yet the other string
has been setting turned off of the processor. Envision that the primary string has stacked
the estimation of the variable an and afterward gets setting turned off the processor. At the

ul
point when it in the end runs once more, the estimation of the variable a will have changed,
and the last store of the reestablished string will cause the estimation of the variable a to
relapse to an old worth. pa
➢ "Using POSIX Threads." The code makes two threads, the two of which execute the routine
func(). The fundamental thread at that point hangs tight for both the threads to finish their
work. The two threads will endeavor to augment the variable counter. We can incorporate
this code with GNU gcc and afterward use Helgrind, which is a piece of the Valgrind1
jin
suite, to distinguish the information race. Valgrind is an tool that empowers an application
to be instrumented and its runtime conduct inspected. The Helgrind tool utilizes this
instrumentation to accumulate information about information races.
➢ The output from Helgrind shows that there is a potential data race between two threads,
.re

both executing line 7 in the file race.c.

➢ Another tool that is able to detect potential data races is the Thread Analyzer in Oracle
Solaris Studio. This tool requires an instrumented build of the application, data collection is
done by the collect tool, and the graphical interface is launched with the command tha. The
w

following code performs detecting Data racses using Sun studio Thread Analyzer.
w
w

➢ The above figure explains list of data races detected by Solaris studio Thread Analyzer

Page | 15
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
Synchronization primitives:
➢ Synchronization is used to coordinate the activity of multiple threads. Most operating systems
provide a rich set of synchronization primitives.
➢ It is usually most appropriate to use these rather than attempting to write custom methods of
synchronization.
➢ The tools will be able to do a better job of detecting data races or correctly labeling synchronization

om
costs.

Mutexes and Critical Regions :

➢ The simplest form of synchronization is a mutually exclusive (mutex) lock.
➢ Only one thread at a time can acquire a mutex lock, so they can be placed around a data structure to

.c
ensure that the data structure is modified by only one thread at a time.

ul
pa
jin

➢ In the example, the two routines Increment() and Decrement() will either increment or decrement the
variable counter.
.re

➢ To modify the variable, a thread has to first acquire the mutex lock. Only one thread at a time can do
this; all the other threads that want to acquire the lock need to wait until the thread holding the lock
releases it.
➢ The two schedules utilize the equivalent mutex; therefore, just each string in turn can either addition
or decrement the variable counter.
w

➢ On the off chance that different strings are endeavoring to procure the equivalent mutex
simultaneously, at that point just one string will succeed, and different strings should pause. This
circumstance is known as a fought mutex.
w

➢ The district of code between the obtaining and arrival of a mutex lock is known as a basic area, or
basic locale. Code right now be executed by just each string in turn.As an example of a critical
section, imagine that an operating system does not have an implementation of malloc() that is thread-
w

safe, or safe for multiple threads to call at the same time.

➢ One way to fix this is to place the call to malloc() in a critical section by surrounding it with a mutex
lock.

Page | 16
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )

➢ On the off chance that all the calls to malloc() are supplanted with the threadSafeMalloc() call, at

om
that point just each string in turn can be in the first malloc() code, and the calls to malloc() become
string safe.
➢ Threads square on the off chance that they endeavor to secure a mutex lock that is as of now held by
another string.
➢ Blocking implies that the strings are sent to rest either promptly or after a couple of ineffective

.c
endeavors to gain the mutex. One issue with this methodology is that it can serialize a program.
➢ On the off chance that numerous strings all the while call threadSafeMalloc(), just each string in turn
will gain ground.
➢ This causes the multithreaded program to have just a solitary executing string, which prevents the

ul
program from exploiting various centers

Spin Locks
pa
➢ Spin locks are basically mutex locks. The distinction between a mutex lock and a Spin lock is that a
Thread holding on to secure a Spin lock will continue attempting to gain the lock without dozing.
➢ The benefit of utilizing Spin locks is that they will get the lock when it is discharged, while a mutex
lock should be woken by the working framework before it can get the lock.
jin
➢ The hindrance is that a turn lock will Spin on a virtual CPU cornering that asset.
➢ In correlation, a mutex lock will rest and free the virtual CPU for another string to utilize
Semaphores
➢ Semaphores are counters that can be either incremented or decremented. An example might be a
.re

buffer that has a fixed size.

➢ Every time an element is added to a buffer, the number of available positions is decreased. Every
time an element is removed, the number available is increased.
➢ Semaphores can also be used to mimic mutexes; if there is only one element in the semaphore, then
it can be either acquired or available, exactly as a mutex can be either locked or unlocked.
w

➢ Semaphores will also signal or wake up threads that are waiting on them to use available resources;
hence, they can be used for signaling between threads.
➢ For example, a thread might set a semaphore once it has completed some initialization. Other threads
w

could wait on the semaphore and be signaled to start work once the initialization is complete.
Readers-Writer Locks :
➢ Data races are a worry just when shared information is changed. Various threads perusing the
w

common information don't present an issue.

➢ Peruse just information doesn't, in this way, need assurance with a lock. In any case, once in a while
information that is ordinarily perused just should be refreshed.
➢ A readerswriter lock (or different peruser lock) permits numerous threads to peruse the common
information however would then be able to bolt the perusers strings out to permit one string to
secure an author lock to alter the information.

Page | 17
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ An essayist can't get the compose lock until all the perusers have discharged their peruser locks.
Thus, the locks will in general be one-sided toward journalists; when one is lined, the lock quits
permitting further perusers to enter.
➢ This activity causes the quantity of perusers holding the lock to decrease and will in the end
the essayist to get selective access

om
.c
ul
Barriers :
➢ There are circumstances where various strings have to all finish their work before any of the strings
pa
can begin the following assignment.
➢ For instance, assume various strings figure the qualities put away in a network.
➢ The variable complete should be determined utilizing the qualities put away in the network.
➢ A boundary can be utilized to guarantee that all the strings total their calculation of the lattice before
jin
the variable absolute is determined. The following code clarifies this idea

➢ The variable total can be computed only when all threads have reached the barrier.
.re

➢ This avoids the situation where one of the threads is still completing its computations while the other
threads start using the results of the calculations.
➢ Notice that another barrier could well be needed after the computation of the value for total if that
value is then used in further calculations.
The following code explains multiple barriers
w
w

Deadlocks and Live locks.

The fundamental ways to share access to resources between threads :
w

1. Deadlock
2. Livelocks
➢ The deadlock, where two or more threads that they need are held by different threads. It is most
effortless to clarify this with a model.
➢ Assume two threads need to get mutex locks An and B to finish some errand.
➢ On the off chance that thread 1 has just procured lock An and string 2 has just gained lock B, at that
point A can't gain forward ground since it is hanging tight for lock B, and thread 2 can't gain ground
Page | 18
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
since it is sitting tight for lock A. The two threads are halted.

om
➢ The best way to avoid deadlocks is to ensure that threads always acquire the locks in the same order.
➢ So if thread 2 acquired the locks in the order A and then B, it would stall while waiting for lock A
without having first acquired lock B.

.c
➢ This would enable thread 1 to acquire B and then eventually release both locks, allowing thread 2 to
make progress.
➢ A livelock traps threads in an unending loop releasing and acquiring locks. Livelocks can be caused

ul
by code to back out of deadlocks.
➢ If the thread cannot obtain the second lock it requires, it releases the lock that it already holds. The
two routines update1() and update2() each have an outer loop.
pa
jin
.re
w

➢ Routine update1() acquires lock A and then attempts to acquire lock B, whereas update2() does this
w

in the opposite order.

➢ This is a classic deadlock opportunity, and to avoid it, the developer has written some code that
causes the held lock to be released if it is not possible to acquire the second lock.
w

➢ The routine canAquire(), in this example, returns immediately either having acquired the lock or
having failed to acquire the lock.
Communication Between Threads and Processes - Condition Variables :
➢ Condition variables communicate readiness between threads by enabling a thread to be woken up
when a condition becomes true.

Page | 19
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ Without condition variables, the waiting thread would have to use some form of polling to check
whether the condition had become true.
➢ For example, the producer consumer model can be implemented using condition variables. Suppose
an application has one producer thread and one consumer thread.
➢ The producer adds data onto a queue, and the consumer removes data from the queue.
➢ If there is no data on the queue, then the consumer needs to sleep until it is signaled that an item of

om
data has been placed on the queue.

Acquire Mutex();
Add Item to Queue();
If ( Only One Item on Queue )

.c
{
Signal Conditions Met();
}

ul
Release Mutex();

➢ The producer thread needs to signal a waiting consumer thread only if the queue was empty and it
has just added a new item into that queue.
pa
➢ If there were multiple items already on the queue, then the consumer thread must be busy processing
those items and cannot be sleeping.
Acquire Mutex();
Repeat Item = 0;
jin
If ( No Items on Queue() )
{
Wait on Condition Variable();
}
.re

If (Item on Queue())
{
Item = remove from Queue();
}
Until ( Item != 0 );
w

Release Mutex();

➢ The producer thread can use two types of wake-up calls: Either it can wake up a single thread or it
w

can broadcast to all waiting threads.

Consumer Thread Code with Potential Lost Wake-Up Problem
Repeat
w

Item = 0;
If ( No Items on Queue() )
{
Acquire Mutex();
Wait on Condition Variable();
Release Mutex();

Page | 20
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
}
Acquire Mutex();
If ( Item on Queue() )
{
Item = remove from Queue();
}

om
Release Mutex();
Until ( Item!=0 );

Signals and Events

➢ Signals are a UNIX instrument where one procedure can impart a sign to another procedure and have

.c
a handler in the accepting procedure play out some errand upon the receipt of the message.
➢ Numerous highlights of UNIX are actualized utilizing signals. Halting a showing application to
squeezing.
➢ Windows has a comparable instrument for occasions. The treatment of console presses and mouse

ul
moves are performed through the occasion system.
➢ Squeezing one of the catches on the mouse will make a tick occasion be sent to the objective
window.
pa
➢ Signs and occasions are truly upgraded for sending restricted or no information alongside the sign,
and all things considered they are likely not the best component for correspondence when contrasted
with different alternatives Installing and Using a Signal Handler
void signalHandler(void *signal)
jin
{ ... }
int main()
{
installHandler( SIGNAL, signalHandler );
.re

sendSignal( SIGNAL );
}

Message Queues
➢ A message queue is a structure that can be shared between multiple processes.
w

➢ Messages can be placed into the queue and will be removed in the same order in which they were
added.
➢ Constructing a message queue looks rather like constructing a shared memory segment. The first
w

thing needed is a descriptor, typically the location of a file in the file system.
➢ This descriptor can either be used to create the message queue or be used to attach to an existing
message queue.
w

➢ Once the queue is configured, processes can place messages into it or remove messages from it.
Once the queue is finished, it needs to be deleted.
➢ Creating and Placing Messages into a Queue ID = Open Message Queue Queue( Descriptor ); Put
Message in Queue( ID, Message ); ... Close Message Queue( ID ); Delete Message Queue(
Description );

Page | 21
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ Using the descriptor for an existing message queue enables two processes to communicate by
sending and receiving messages through the queue.
➢ Opening a Queue and Receiving Messages ID=Open Message Queue ID(Descriptor);
Message=Remove Message from Queue(ID); ... Close Message Queue(ID);

Named Pipes:

om
➢ UNIX uses pipes to pass data from one process to another. For example, the output from the
command ls, which lists all the files in a directory, could be piped into the wc command, which
counts the number of lines, words, and characters in the input.
➢ The combination of the two commands would be a count of the number of files in the directory.
Named pipes provide a similar mechanism that can be controlled programmatically.

.c
➢ Setting Up and Writing into a Pipe Make Pipe( Descriptor ); ID = Open Pipe( Descriptor ); Write
Pipe( ID, Message, sizeof(Message) ); ... Close Pipe( ID ); Delete Pipe( Descriptor );
➢ Opening an Existing Pipe to Receive Messages ID=Open Pipe( Descriptor ); Read Pipe( ID, buffer,

ul
sizeof(buffer) ); ... Close Pipe( ID );

other approaches to Sharing Data Between Threads:

pa
➢ There are several other approaches to sharing data. For example, data can be written to a file to be
read by another process at a later point.
➢ This might be acceptable if the data needs to be stored persistently or if the data will be used at some
later point.
➢ Still, writing to disk presents a long latency operation, which is not the best mechanism if the
jin
purpose is purely communication
➢ There is also operating system–specific approaches to sharing data between processes.
➢ Solaris doors allow one process to pass an item of data to another process and have the processed
result returned.
.re

➢ Doors are optimized for the round-trip and hence can be cheaper than using two different messages.

Evaluating Parallel Programs :

➢ For any of the parallel programming approaches described in the previous section, there are a
w

number of metrics that can be used to evaluate its effectiveness.

➢ This section describes ﬁve of the most important metrics that will be used to evaluate parallel
programming in this dissertation: performance, clarity, portability, generality, and a programmer’s
w

ability to reason about the implementation.

Performance :
➢ Performance is typically viewed as the bottom line in parallel computing.
w

➢ Since improved performance is often the primary motivation for using parallel computers, failing to
achieve good performance reﬂects poorly on a language, library, or compiler.

Clarity :
➢ The importance of clarity is often brushed aside in favor of the all-consuming pursuit of
performance. However, this is a mistake that should not be made.

Page | 22
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ Clarity is perhaps the single most important factor that prevents more scientists and programmers
from utilizing parallel computers today. Local-view libraries continue to be the predominant
approach to parallel programming, yet their syntactic overheads are such that clarity is greatly
compromised.
➢ This requires programmers to focus most of their attention on making the program

om
Generality :
➢ Generality simply refers to the ability of a parallel programming approach to express algorithms for
varying types of problems.
➢ For example, a library which only supports matrix , multiplication operations is not very general, and
would not be very helpful for writing a parallel quicksort algorithm.
➢ Conversely, a global-view functional language might make it

.c
ul
pa
jin
.re
w
w
w

Page | 23
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
UNIT - III

OpenMP Execution Model – Memory Model – OpenMP Directives – Work-sharing Constructs - Library
functions – Handling Data and Functional Parallelism – Handling Loops - PerformanceConsiderations

om
Open specifications for Multi Processing via collaborative work between interested parties from the
hardware and software industry, government and academia. OpenMP is an Application Program Interface
(API) that may be used to explicitly direct multi-threaded, shared memory parallelism.API components:

.c
Compiler Directives, Runtime Library Routines, Environment Variables. OpenMP is a directive-based
method to invoke parallel computations on share-memory multiprocessor

ul
➢ OpenMP and Pthreads are both APIs for shared-memory programming,they have many fundamental
differences.
➢ Pthreads requires that the programmer explicitly specify the behavior of each thread.
pa
➢ OpenMP, on the other hand, sometimes allows the programmer to simply state that a block of code
should be executed in parallel, and the precise determination of the tasks and which thread should
execute them is left to the compiler and the run-time system.
jin
➢ This suggests a further difference between OpenMP and Pthreads, that is, that Pthreads (like MPI) is
a library of functions that can be linked to a C program, so any Pthreads program can be used with
any C compiler, provided the system has a Pthreads library. OpenMP, on the other hand, requires
.re

compiler support for some operations, and hence it’s entirely possible that you may run across a C
compiler that can’t compile OpenMP programs into parallel programs.
➢ Critical directives insure that only one thread at a time can execute the structured block. If multiple
threads try to execute the code in the critical section, all but one of them will block before the critical
w

section. As threads finish the critical section, other threads will be unblocked and enter the code.

➢ Named critical directives can be used in programs having different critical sections that can be
w

executed concurrently. Multiple threads trying to execute code in critical section(s) with the same
name will be handled in the same way as multiple threads trying to execute an unnamed critical
w

section. However, threads entering critical sections with different names can execute concurrently.

OpenMP uses the fork-join model of parallel execution.

o All OpenMP programs begin with a single master thread.
o The master thread executes sequentially until a parallel region is encountered, when it creates
a team of parallel threads (FORK).

Page | 1
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
o When the team threads complete the parallel region, they synchronize and terminate, leaving
only the master thread that executes sequentially (JOIN).

Simple Open MP hello world program:

om
.c
ul
pa
jin
.re

5.1.1Compiling and running OpenMP programs

To compile this with gcc we need to include the
w

To run the program, we specify the number of threads on the command line.
w

The following will be the output

Hello from thread 0 of 4

Hello from thread 1 of 4
Hello from thread 2 of 4
Hello from thread 3 of 4

Page | 2
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
Open MP Memory Model :
➢ The OpenMP API gives a casual consistency, shared-memory model. All OpenMP strings approach
a spot to store and to recover factors, called the memory.
➢ Likewise, each string is permitted to have its own impermanent perspective on the memory. The

om
impermanent perspective on memory for each string is certainly not a necessary piece of the
OpenMP memory model, yet can speak to any sort of mediating structure,
➢ for example, machine registers, reserve, or other neighborhood stockpiling, between the string and
the memory.
➢ The brief perspective on memory permits the string to store factors and in this manner to abstain

.c
from going to memory for each reference to a variable.
➢ Each string likewise approaches another kind of memory that must not be gotten to by different

ul
strings, called thread private memory.
➢ The minimum size at which a memory update may likewise peruse and compose back nearby factors
that are a piece of another variable (as exhibit or structure components) is usage characterized yet is
pa
no bigger than required by the base language.
➢ A solitary access to a variable might be actualized with various burden or store guidelines and,
consequently, isn't destined to be nuclear regarding different gets to a similar variable.
jin
➢ Gets to factors littler than the usage characterized least size or to C or C++ bit-fields might be
executed by perusing, adjusting, and modifying a bigger unit of memory, and may consequently
meddle with updates of factors or fields in a similar unit of memory.
.re

➢ In the event that various strings compose without synchronization to a similar memory unit,
including cases because of atomicity contemplations as depicted above, at that point an information
race happens.
➢ Correspondingly, if in any event one string peruses from a memory unit and at any rate one string
w

composes without synchronization to that equivalent memory unit,

Shared-Memory Model :
w

➢ The shared-memory model is an abstraction of the generic centralized multiprocessor.

➢ The underlying hardware is assumed to be a collection of processors, each with access to
w

the same shared memory.

➢ Because they have access 10 the same memory locations, processors can interact and
synchronize with each other through shared variables.

Page | 3
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )

om
➢ A key difference, at that point, between the mutual memory model and the message passing model is
that in the message-passing model all procedures regularly stay dynamic all through the execution of

.c
the program, though in the common memory model the quantity of dynamic strings is one at the
program's beginning and finish and may change dynamically all through the execution of the

ul
program.
➢ To see, a consecutive program as an uncommon instance of a mutual memory equal program: it is
pa
just one with no fork/participates in it. Equal shared-memory programs run from those with just a
solitary fork/join around a solitary circle to those in which the majority of the code sections are
executed in equal.
➢ Henceforth the common memory model backings gradual paI1lllelization, the way toward changing
jin
a successive program into an equal program each square of code in turn.
➢ The capacity of the common memory model to help steady parallelization is probably the best
preferred position over the message-passing model. consider each square sick turn starting with the
.re

most tedious, parallelize each square managable to resemble execution. what's more, stop when the
required to accomplish further execution enhancements isn't justified.
➢ Consider, in contrast, message-passing programs. They have no shared memory 0 hold variables,
and the parallel processes are active throughout the execution of the program. Transforming a
w

sequential program into a parallel program is not incremental

Device Data Environments
w

➢ When an OpenMP program starts, every gadget has an underlying gadget information
condition. The underlying gadget information condition for the host gadget is the
w

information condition related with the underlying initial task region.

➢ Directives that accept data-mapping attribute clauses determine how an original variable is
mapped to a corresponding variable in a device data environment.
➢ The original variable is the variable with the same name that exists in the data environment
of the task that encounters the directive.

Page | 4
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ If a comparing variable is available in the encasing gadget information condition, the new
gadget information condition acquires the relating variable from the encasing gadget
information condition.
➢ If a corresponding variable is not present in the enclosing device data environment, a new

om
corresponding variable (of the same type and size) is created in the new device data
environment.
➢ In the latter case, the initial value of the new corresponding variable is determined from the
clauses and the data environment of the encountering thread.

.c
➢ The corresponding variable in the device data environment may share storage with the
original variable. Writes to the corresponding variable may alter the value of the original

ul
variable.
➢ When a task executes in the context of a device data environment, references to the original
variable refer to the corresponding variable in the device data environment.
pa
➢ The relationship between the value of the original variable and the initial or final value of
the corresponding variable depends on the map-type. Details of this issue, as well as other
issues with mapping a variable
jin
➢ The unique variable in an information domain and the comparing variable(s) in at least one
gadget information conditions may share stockpiling. Without mediating synchronization
data races can happen.
.re

The Flush Operation

➢ The memory model has relaxed-consistency because a thread’s temporary view of memory
is not required to be consistent with memory at all times.
w

➢ A value written to a variable can remain in the thread’s temporary view until it is forced to
memory at a later time. Likewise, a read from a variable may retrieve the value from the
w

thread’s temporary view, unless it is forced to read from memory.

➢ The OpenMP flush operation enforces consistency between the temporary view and
w

memory.
➢ The flush operation is applied to a set of variables called the flush-set. The flush operation
restricts reordering of memory operations that an implementation might otherwise do.

Page | 5
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ Implementations must not reorder the code for a memory operation for a given variable, or
the code for a flush operation for the variable, with respect to a flush operation that refers
to the same variable.

OpenMP directives :

om
➢ OpenMP directives exploit shared memory parallelism by defining various types of parallel regions.
Parallel regions can include both iterative and non-iterative segments of program code.
➢ Pragmas fall into these general categories:

.c
✓ Pragmas that let you define parallel regions in which work is done by threads in parallel
(#pragma omp parallel). Most of the OpenMP directives either statically or dynamically bind
to an enclosing parallel region.

ul
✓ Pragmas that let you define how work is distributed or shared across the threads in a parallel
region (#pragma omp section, #pragma omp for, #pragma omp single, #pragma omp task).
pa
✓ Pragmas that let you control synchronization among threads (#pragma omp atomic, #pragma
omp master, #pragma omp barrier, #pragma omp critical, #pragma omp flush, #pragma omp
ordered) .
✓ Pragmas that let you define the scope of data visibility across threads (#pragma omp
jin
threadprivate).
✓ Pragmas for task synchronization (#pragma omp taskwait, #pragma omp barrier)
.re

OpenMP compiler directives

➢ parallel, which goes before a square of code to be executed in equal by numerous strings for, which
goes before a for circle with free iteratiOlls that might be separated among strings executing in equal
w

➢ parallel for, a combination of the parallel and for directives

➢ sections, which goes before a progression of hinders that might be executed in equal
➢ parallel sections, a combination of the parallel
w

➢ critical , which precedes a critical section

➢ sing1e, which precedes a code block executed as a single thread
w

Four important OpenMP functions:

➢ omp_get_num_procs, which returns the number of CPUs in the multiprocessor on which this thread
is executing
➢ omp_get_num_threads, which returns the number of threads active in the current parallel region

Page | 6
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ omp_get _thread_num, which returns the thread identification number
➢ omp_set_num_threads, which allows you to fiX the number of threads executing the parallel
sections of code

om
Directive Format

The syntax of an OpenMP directive is formally specified by the grammar

#pragma omp directive-name [clause[ [,] clause]...] new-line

.c
➢ Each directive starts with #pragma omp, to reduce the potential for conflict with other (non-OpenMP
or vendor extensions to OpenMP) pragma directives with the same names.

ul
➢ The remainder of the directive follows the conventions of the C and C++ standards for compiler
directives. In particular, white space can be used before and after the #, and sometimes white space
must be used to separate the words in a directive.
pa
➢ Preprocessing tokens following the #pragma omp are subject to macro replacement.
➢ Directives are case-sensitive. The order in which clauses appear in directives is not significant.
Clauses on directives may be repeated as needed, subject to the restrictions listed in the description
jin
of each clause.
➢ If variable-list appears in a clause, it must specify only variables. Only one directive-name can be
specified per directive.
.re

OpenMP fixed source form

The following formats for specifying directives are equivalent (the first line represents the position of the
w

first 9 columns):
w

C23456789
!$OMP PARALLEL DO SHARED(A,B,C)
w

C$OMP PARALLEL DO
C$OMP+SHARED(A,B,C)

C$OMP PARALLELDOSHARED(A,B,C)

Page | 7
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
OpenMP free source form

The following formats for specifying directives are equivalent (the first line represents the position of the
first 9 columns):

om
!23456789
!$OMP PARALLEL DO &
!$OMP SHARED(A,B,C)

.c
!$OMP PARALLEL &
!$OMP&DO SHARED(A,B,C)

ul
!$OMP PARALLEL DO SHARED(A,B,C)
pa
➢ One or more blanks or tabs must be used to separate adjacent keywords in directives
➢ Comments are allowed inside directives. Comments can appear on the same line as a directive.
➢ In free source form, the exclamation point initiates a comment; in fixed source form, it initiates a
jin
comment when it appears after column 6.
➢ Regardless of form, the comment extends to the end of the source line and is ignored. If the first
nonblank character after the initial prefix (or after a continuation directive line in fixed source form)
.re

is an exclamation point, the line is ignored.

Conditional Compilation
w

➢ Fortran statements can be compiled conditionally as long as they are preceded by one of the
following conditional compilation prefixes: !$, C$, or *$.
➢ The prefix must be followed by a Fortran statement on the same line.
w

➢ During compilation, the prefix is replaced by two spaces, and the rest of the line is treated as a
normal Fortran statement.
w

➢ The program must be compiled with the -mp option in order for the compiler to honor statements
preceded by conditional compilation prefixes; without the mp command line option, statements
preceded by conditional compilation prefixes are treated as comments.
➢ First define the _OPENMP symbol to be used for conditional compilation.

Page | 8
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ This symbol is defined during OpenMP compilation to have the decimal
value YYYYMM where YYYY and MM are the year and month designators of the version of the
OpenMP Fortran API is supported.
➢ The !$ prefix is accepted when compiling either fixed source form files or free source form files.

om
The C$ and *$ prefixes are accepted only when compiling fixed source form. The source form you
are using also dictates the following:
➢ In fixed source form, the prefixes must start in column one and appear as a single word with no
intervening white space.
➢ Fortran fixed form line length, case sensitivity, white space, continuation, and column rules apply to

.c
the line. Initial lines must have a space or zero in column six, and continuation lines must have a
character other than a space or zero in column six.

ul
➢ In free source form, the !$ prefix can appear in any column as long as it is preceded only by white
space.
➢ It must appear as a single word with no intervening white space. Fortran free source form line
pa
length, case sensitivity, white space, and continuation rules apply to the line. Initial lines must have a
space after the prefix.
➢ Continued lines must have an ampersand as the last nonblank character on the line prior to any
jin
comment appearing on the conditionally compiled line.
➢ Continuation lines can have an ampersand after the prefix, with optional white space before and after
the ampersand.
.re

Data dependences :

➢ If a for loop fails to satisfy one of the rules outlined in the preceding section, the compiler will
w

simply reject it.

➢ For example, suppose we try to compile a program with the following linear search function:
w
w

Page | 9
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
The gcc compiler reports:

Line 6: error: invalid exit from OpenMP structured block

➢ A more insidious problem occurs in loops in which the computation in one iteration depends on the

om
results of one or more previous iterations. As an example, consider the following code, which
computes the ﬁrst n ﬁbonacci numbers:

.c
ul
Although we may be suspicious that something isn’t quite right, let’s try parallellizing the for loop with a
parallel for directive:
pa
jin
The compiler will create an executable without complaint. However, if we try running it with more than one
thread, we may ﬁnd that the results are, at best, unpredictable.For example, on one of our systems if we try
.re

using two threads to compute the ﬁrst 10 Fibonacci numbers, we sometimes get

1 1 2 3 5 8 13 21 34 55
w

How the Per-Data Environment ICVs Work

➢ At the point when an errand develop or equal build is experienced, the produced task(s) acquire the
w

estimations of the information condition checked ICVs from the creating undertaking's ICV esteems.
➢ At the point when an assignment build is experienced, the produced task acquires the estimation of
w

nthreads-var from the creating errand's nthreads-var esteem.

➢ At the point when an equal build is experienced, and the producing undertaking's nthreads-var list
contains a solitary ICV Scope component, the created task(s) acquire that rundown as the estimation
of nthreads-var.

Page | 10
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ At the point when an equal develop is experienced, and the producing undertaking's nthreads-var list
contains numerous components, the created task(s) acquire the estimation of nthreads-var as the
rundown got by erasure of the principal component from the creating errand's nthreads-var esteem.
➢ The quandary var ICV is dealt with similarly as the nthreads-var ICV.

om
➢ At the point when a gadget build is experienced, the new gadget information condition acquires the
estimations of the information condition perused ICVs from the encasing gadget information
condition of the gadget that will execute the district.
➢ On the off chance that a groups develop with a thread_limit proviso is experienced, as far as possible
var ICV of the new gadget information condition isn't acquired yet rather is set to a worth that is not

.c
exactly or equivalent to the worth determined in the statement.
➢ While experiencing a circle worksharing locale with schedule(runtime), all verifiable undertaking

ul
districts that comprise the coupling equal area must have a similar incentive for run-sched-var in
their information surroundings.
➢ Something else, the conduct is unknownWork-sharing Constructs
pa
➢ This is one of the category of Open MP language Extensions, it distribute work among threads using
do/parallel and do/section directives
jin
SPMD vs. worksharing

➢ A parallel construct by itself creates an SPMD or “Single Program Multiple Data” program i.e., each
thread redundantly executes the same code.
.re

➢ To split up pathways through the code between threads within a team.This is called worksharing.
#pragma omp c
􀂃 Thread creation construct
PARALLEL / parallel
w

➢ Original process (master thread) forked additional threads to run code enclosed in the parallel
construct.
w

➢ Thread ID for master thread is 0. Work-sharing constructs

– DO / for split up loop iterations among the threads in a parallel region
w

– SECTIONS / sections
1. divide consecutive but independent section(s) of code block amongst the threads
2. barrier implied at the end unless the NOWAIT/nowait clause if used.
➢ code block to be executed by 1 thread
➢ barrier implied at the end unless the NOWAIT/nowait clause if used

Page | 11
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ MASTER / master : code block to be executed by master thread only (no barrier on other threads
implied)

Workshare :
➢ execution of structured block is divided into separate units of work, each to be executed once

om
➢ Structured block must consist only
1. array or scalar assignment
2. FORALL and WHERE statements
3. FORALL and WHERE constructs

.c
4. atomic, critical or parallel constructs
Data scoping clauses

ul
1. SHARED / shared : data are visible and accessible by all threads simultaneously. All
variables in work-sharing region are shared by default,except the loop iteration.
2. PRIVATE / private : data is private to each thread. A private variable is not initialized. Loop
pa
iteration counter in work-sharing region is private.
3. DEFAULT / default : default data scoping in the work-sharing region (shared / private /
none)
jin
Synchronization clauses
1. CRITICAL / critical : enclosed code block to be executed by all threads, but only one thread
.re

at a time
2. ATOMIC / atomic : a mini-critical section specifying that a memory location must be
updated atomically
3. ORDERED / ordered : iteration of enclosed code block is executed in same order as
w

sequential
4. BARRIER / barrier : synchronizes all threads at the barrier. The barrier synchronization is
w

implied at end of work-sharing construct.

5. NOWAIT / nowait : thread completing code block execution can proceed, at end of work-
w

sharing construct where barrier is implied by default

Scheduling clauses
➢ SCHEDULE( type, [chunk] ) / schedule( type, [chunk] )
1. STATIC / static : loop iterations are divided into chunk size and statically assigned to the
threads. The iterations are evenly divided contiguously if no chunk is specified.

Page | 12
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
2. DYNAMIC / dynamic : loop iterations are divided into chunk size and dynamically
scheduled amongst the threads. The default chunk size is 1.
3. GUIDED / guided : For a chunk_size of 1, the size of each chunk is proportional to the
number of unassigned iterations divided by the number of threads in the team, decreasing to

om
4. RUNTIME /runtime : scheduling decision deferred until runtime by environment variable
OMP_SECHEULE
➢ AUTO / auto : scheduling decision delegated to the compiler and/or runtime system
The loop worksharing Constructs
The loop workharing construct splits up loop iterations among the threads in a team

.c
#pragma omp parallel
{

ul
#pragma omp for
for (I=0;I<N;I++){ // The variable I is made “private” to each thread by default. We
could do this explicitly with a “private(I)” clause //
NEAT_STUFF(I);
pa
}
}
jin
Library functions
➢ OpenMP is a library for parallel programming in the SMP (symmetric multi-processors, or shared-
memory processors) model.
.re

➢ When programming with OpenMP, all threads share memory and data. OpenMP supports C, C++
and Fortran.
➢ The OpenMP functions are included in a header file called omp.h .
➢ OpenMP program structure: An OpenMP program has sections that are sequential and sections that
w

are parallel.
➢ In general an OpenMP program starts with a sequential section in which it sets up the environment,
w

initializes the variables, and so on.

➢ When run, an OpenMP program will use one thread (in the sequential sections), and several threads
w

(in the parallel sections).

➢ There is one thread that runs from the beginning to the end, and it's called the master thread.
➢ The parallel sections of the program will cause additional threads to fork. These are called
the slave threads.

Page | 13
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ A section of code that is to be executed in parallel is marked by a special directive (omp pragma).
When the execution reaches a parallel section (marked by omp pragma), this directive will cause
slave threads to form.
➢ Each string executes the equal segment of the code freely. At the point when a string completes, it

om
joins the ace. At the point when all strings finish, the ace proceeds with code following the equal
area.
➢ Each thread has an ID attached to it that can be obtained using a runtime library function
(called omp_get_thread_num()). The ID of the master thread is 0.
➢ First we have to Modify/Check the number of threads

.c
omp_set_num_threads()
omp_get_num_threads()

ul
omp_get_thread_num()
omp_get_max_threads()
The active parallel region is found as omp_in_parallel()
pa
The system to dynamically vary the number of threads from one parallel construct to another
➢ omp_set_dynamic,
➢ omp_get_dynamic();
jin
To find number of processors in a system we use omp_num_procs() .Set the default number of threads to
use.
OMP_NUM_THREADS int_literal sets the number of threads to use in a team
.re

➢ This subroutine sets the quantity of strings that will be utilized in the following equal area. The
dynamic strings component adjusts the impact of this daily schedule.
➢ Whenever empowered, indicates the most extreme number of strings that can be utilized for any
equal area.
w

➢ Whenever impaired, indicates definite number of strings to use until next call to this everyday
practice.
w

➢ This routine must be called from the sequential parts of the code. This bring has priority over the
OMP_NUM_THREADS condition variable.void omp_set_num_threads(int num_threads)
w

Execution Environment Routine.

1. omp_set_num_threads :
➢ The omp_set_num_threads routine affects the number of threads to be used for subsequent parallel
regions that do not specify a num_threads clause, by setting the value of the first element of the
nthreads-var ICV of the current task.

Page | 14
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )

Format
void omp_set_num_threads(int num_threads);
subroutine omp_set_num_threads(num_threads)

om
2. Constraints on Arguments
➢ The value of the argument passed to this routine must evaluate to a positive integer, or
else the behavior of this routine is implementation defined.
➢ The binding task set for an omp_set_num_threads region is the generating task.
➢ The effect of this routine is to set the value of the first element of the nthreads-var ICV

.c
of the current task to the value specified in the argument.
omp_get_num_threads

ul
The omp_get_num_threads routine returns the number of threads in the current
team. int omp_get_num_threads(void);
➢ The binding region for an omp_get_num_threads region is the innermost enclosing
parallel region.
pa
➢ The omp_get_num_threads routine returns the number of threads in the team executing the parallel
region to which the routine region binds. If called from the sequential part of a program, this routine
jin
returns
omp_get_max_threads
➢ The omp_get_max_threads routine returns an upper bound on the number of threads that could be
.re

used to form a new team if a parallel construct without a num_threads clause were encountered after
execution returns from this routine.
➢ The binding task set for an omp_get_max_threads region is the generating task.
➢ The value returned by omp_get_max_threads is the value of the first element of the nthreads-var
w

ICV of the current task. This value is also an upper bound on the number of threads that could be
used to form a new team if a parallel region without a num_threads clause were encountered after
w

execution returns from this routine.

omp_get_thread_num
w

➢ The omp_get_thread_num routine returns the thread number, within the current team, of the calling
thread.
int omp_get_thread_num(void);
➢ The binding thread set for an omp_get_thread_num region is the current team. The binding region
for an omp_get_thread_num region is the innermost enclosing parallel region.

Page | 15
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ The omp_get_thread_num routine returns the thread number of the calling thread, within the team
executing the parallel region to which the routine region binds.
➢ The thread number is an integer between 0 and one less than the value returned by omp_get_ num_
threads , inclusive. The thread number of the master thread of theteam is 0. The routine returns 0 if it

om
is called from the sequential part of a program.
omp_in_parallel
➢ The omp_in_parallel routine returns true if the active-levels-var ICV is greater than zero; otherwise,
it returns false. int omp_in_parallel(void);
➢ The binding task set for an omp_in_parallel region is the generating task.

.c
➢ The effect of the omp_in_parallel routine is to return true if the current task is enclosed by an active
parallel region, and the parallel region is enclosed by the outermost initial task region on the device;

ul
otherwise it returns false.
Handling Data and Functional Parallelism
➢ First think about a calculation to process a connected rundown of undertakings. Think about a
pa
comparable calculation when planned an answer for the report grouping..
➢ In that design, assumed a message-passing model. Because that model has no shared memory, to
gave a single process, which are called the manager, responsibility for maintaining the entire list of
jin
tasks.
➢ Worker tasks sent messages to the manager when they were ready to process another task.
➢ In contrase, the shared-memory model allows every thread to access the same “to=do” list. So there
.re

is no need for a separate manager thread.

➢ The following code segments are part of a program that processes work stored in a singly linked to-
do list.
Int main(int arc,argv[])
w

{
Struct job_struct job_ptr;
w

Struct task_struct task_ptr;

---
w

Task_ptr=get_next_task(&job_ptr);
While(task_ptr!=NULL){
Complete_task(task_ptr);
Task_ptr=get_next_task(&job_ptr);
}

Page | 16
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
---
}
Char get-next_task(struct job_struct job_ptr){
Struct task_struct answer;

om
It(job_ptr==NULL) answer = Null;
Else{
Answer=(job-ptr)->task;
Job_ptr=(job_ptr)-> next;
}

.c
Return answer;
}

ul
An algorithm to process a linked list of tasks. The shared-memory model allows every thread to access the
same “to-do” list, so there is no need for a separate manager thread.
The following code segments are part of a program that processes work stored in a singly linked to-do list.
pa
To ensure that no two threads take the same task from the list. It is important to execute function
get_next_task automatically.
jin
Handling Loops
➢ A compiler order in C or C++ is called pragma. The word pragama is short for sober minded data.
➢ A pragama is an approach to impart data to the compiler.
➢ The data is superfluous as in the compiler may disregard the data and still produce a right article
.re

program.
➢ A pragama begin with the # character.A pragama in C or C++ has this syntax:
#pragama omp <rest of pragama>
w

➢ The first pragama we are going to consider is the parallel for pragama. The simplest for, of the
parallel for pragama is :
w

#pragama omp parallel for

➢ Putting this line immediately before the for loop instructs the compiler to try to parallelize the loop:
#pragama omp parallel for
w

For (i=first;i<size;i+=prime) marked[i]=1;

➢ The for loop must not contain statements that allow the loop to be exited prematurely.
➢ Examples include the break statement, return statement, exit statement, and go to statement to labels
outside the loop.

Page | 17
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ The continue statement is allowed, however, because its execution does not affect the number of
loop iterations.

om
.c
ul
➢ During parallel execution of the for circle, the ace string makes extra strings, and all strings
pa
cooperate to cover the emphasess of the circle.
➢ Each string has its own execution setting; a location space containing the entirety of the factors the
string may get to.
jin
➢ The execution setting incorporate static factors, progressively assigned information structures in the
pile, and factors on the run-time stack.
➢ The execution context includes its own additional fun-time stack, where the frames for functions it
.re

invokes are kept.

➢ Other variables may either be shared or private. A shared variables has the same address in the
execution context of every thread.
➢ All threads have access to shared variables. A private variable has a different address in the
w

execution context of every thread.

➢ A thread can access its own private variables, but cannot access the private variable of another
w

thread.
➢ In the case of the parallel for pragma, variables are by default shared, with the exception that the
w

loop index variables its private.

Conditionally Executing Loops

➢ If a loop does not have enough iterations , the time spent forking and joining threads may exceed the
time saved by dividing the loop iteration among multiple threads.

Page | 18
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ Consider, for example the parallel implementation of the rectangle rule examined earlier.
Area = 0.0
#pragama omp parallel for private(x) reduction(+;area)
For(i=o;i<n;i++)

om
{
X=(i+0.5)/n;
Area+=4.0 / (1.0 + x* x);
}
Pi = area /n;

.c
➢ The following figure reveals the average execution time of this program segment on a sun enterprise

ul
server 4000, for various values of n and various numbers of threads.
➢ When n is 100, the sequential execution time is so small that adding threads only increased overall
execution time.
pa
➢ When n is 100,000 the parallel program executing on four threads achievers a speedup of 3.16 over
the sequential program.
➢ If clause gives us the ability to direct the compiler to inser code that determines at run-time whether
jin
the loop should be executed in parallel or.
The clause has this syntax :
if(<scalar expression>)
.re

if the scalar expression avaluated to true, the loop will be executed in parallel. Otherwise it will be executed
serially.

Finding loop-carried dependences :

➢ Perhaps the ﬁrst thing to observe is that when we’re attempting to use a parallel for directive, we
only need to worry about loop-carried dependences.
w

➢ We don’t need to worry about more general data dependences. For example, in the loop
w

➢ There is a data dependence between Lines 2 and 3. However, there is no problem with the
parallelization

Page | 19
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )

om
Odd-even transposition sort :
➢ Odd-even transposition sort is a sorting algorithm that’s similar to bubble sort, but that has
considerably more opportunities for parallelism. that serial odd-even transposition sort can be
implemented as follows:

.c
ul
➢
pa
The list a stores n ints, and the algorithm sorts them into increasing order. During an “even phase”
(phase % 2 == 0), each odd-subscripted element, a[i], is compared to the element to its “left,” a[i1],
jin
and if they’re out of order, they’re swapped.
➢ During an “odd” phase, each odd-subscripted element is compared to the element to its right, and if
they’re out of order, they’re swapped.
.re

➢ A theorem guarantees that after n phases, the list will be sorted.

Scheduling Loops:
➢ In some-loops the time needed to execute different loop iterations varies considerable. For example ,
w

consider the following doubtly nested loop that initializes an upper triangular matrix:
w

fori(i=0;i<n;i++1)
{
w

for(j=1;j<n;j++)
a[i][j]=alpha_omegaa(i,j);
➢ Assuming there are no data dependences among iterations, prefer to execute the outermost loop . in
parallel in order to minimize fork/join overhead.

Page | 20
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ If every call to function alpha_omega takes the same amount of time, then the first iteration of the
outermost loop(when I equals0) requires n times more work than the last iteration (when I equals n-
1). Inverting the two loops will not remedy the imbalance.
➢ Suppose these n iterations are being executed on t threads. If each thread is assigned a continguous

om
block of either[n/t] or [n/t] threads, the parallel loop execution will have poor efficiency, because
some threads will complete their share of the iterations much faster than others.
Performance Considerations

The following are some general techniques for improving performance of OpenMP applications.

.c
➢ Minimize synchronization.
o Avoid or minimize the use of BARRIER, CRITICAL sections, ORDERED regions, and

ul
locks.
o Use the NOWAIT clause where possible to eliminate redundant or unnecessary barriers. For
pa
example, there is always an implied barrier at the end of a parallel region.
Adding NOWAIT to a final DO in the region eliminates one redundant barrier.
o Use named CRITICAL sections for fine-grained locking.
o Use explicit FLUSH with care. Flushes can cause data cache restores to memory, and
jin
subsequent data accesses may require reloads from memory, all of which decrease efficiency.
➢ As a matter of course, idle threads will be taken care of after a specific break period. It may be the
case that the default break period isn't adequate for your application, making the threads rest too
.re

early or past the point of no return. The SUNW_MP_THR_IDLE condition variable can be utilized
to abrogate the default break period, even up to where the idle threads will never be taken care of
and stay dynamic constantly.
➢ Parallelize at the highest level possible, such as outer DO/FOR loops. Enclose multiple loops in one
w

parallel region. In general, make parallel regions as large as possible to reduce parallelization
overhead
w

➢ Sometimes transforming a sequential for loop into a parallel for loop can actually increase a
program’s execution time.
w

Inverting loops :
Consider the following code segment :
for(i=1;i<m;i++)
for(j=1;j<m;j++)
a[i][j]=2*a[i-1][j];

Page | 21
www.rejinpaul.com – Multicore Architectures and programming (IV CSE )
➢ Draw a data dependence diagram to help us understand data dependencies in this code. The diagram
appears in following figure.
➢ The two rows may not be updated simultaneously, because there are data depedences between rows.
The columns may be updated simultaneously.
➢ This means the loop indexed by j may be executed to parallel, but not the loop indexed by Insert a

om
parallel for pragma before the inner loop, the resulting parallel program will execute correctly, but it
may not exhibit good performance, because it will require m-1 fork/join steps, one per iteration of
the outer loop.
#pragama parallel for private(i)

.c
for(j=0;j<n;j++)
{

ul
for(i=1;i<m;i++)
a[i][j]=2*a[i-j][j];
➢ Only a single’s fork/join step is required . the data dependences have not changed, the iterations of
pa
the loop indexed by j are still independent of each other. In this respect definintely improved the
code.
➢ How code transformations affect the cache hit rate. Each thread is now working through-columns, of
jin
a rather than rows. Since C matrices are stored in row-major order, inverting loops may lower the
cache hit rate, depending upon m,n, the number of activate threads, and the architecture of the
underlying system.
.re
w
w
w

Page | 22
www.rejinpaul.com

om
UNIT IV - DISTRIBUTED MEMORY PROGRAMMING WITH MPI

.c
ul
SYLLABUS

pa
MPI program execution – MPI constructs – libraries – MPI send and receive – Point-to-point and
Collective communication – MPI derived datatypes – Performance evaluation

jin
.re
w
w
w
ldenti
www.rejinpaul.com MPI processes

om
■ Common practice to identify processes by

.c
nonnegative integer ranks.

ul
■ p processes are numbered 0, 1, 2, .. p-1

pa
jin
.re
w
w
w
www.rejinpaul.com
1 / For strlen
#include <stdio.h> l

om
2 #include <string.h> /+ For MPI functions, etc l
3 #include <mpi.h>
4
const int MAX_STRING = 100;
6
7 int main(void) {8 char greeting[MAX_STRING];

.c
9 int comm_sz; / Number of processes l
10 int my_rank; /» My process rank l
11
12 MP I_Init (NULL, NULL);

ul
13 MP I_Comm_size(MP I_C0MM_W0RLD, &comm_sz );
14 MP I_Comm_rank (MP I_C0MM_W0RLD, &my_rank );
15
16 if (my_rank != 0) {

pa
17 sprintf(greeting, "Greetings from process Sd of %d!",
18 my_rank, comm_sz);
19 MP I_Send(greeting, strlen(greeting)+1, MPI_CHAR, 0, 0,
20 MP I_COMM_W0RLD);
21 } else {
22
23
24
25
26
jin
printf("Greetings from process %d of %d!\n", my_rank,
for (int q = 1; q < comm_sz; q++) {
MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q,
0, MP?I_C0MM_W0RLD, MPI_STATUS_IGN0RE);
printf("ss\n, greeting);
comm_sz);
.re
27 }
28 }
29
30 MPI_Finalize 0;
31 return 0;
w

32 } / main l
w
w
Com ilation
www.rejinpaul.com

om
wrapper script to compile

.c
\ source file

ul
mpicc -g -Wall -o mpi_hello mpi_hello.c

pa
produce
;
create this executable file name
debugging
information
jin (as opposed to default a.out)
.re
turns on all warnings
w
w
w
Execution
www.rejinpaul.com

om
.c
mpiexec -n <number of processes> <executable>

ul
pa
mpiexec -n 1 ./mpi_hello

jin L run with 1 process

.re
mpiexec -n 4 ./mpi_hello

L
w

run with 4 processes

w
w
Execution
www.rejinpaul.com

om
mpiexec -n 1 ./mpi_hello

.c
Greetings from process O of 1 !

ul
pa
mpiexec -n 4 ./mpi_hello

jin
Greetings from process O of 4!
.re
Greetings from process 1 of 4!
Greetings from process 2 of 4!
w

Greetings from process 3 of 4!

w
w
MPI Pro rams
www.rejinpaul.com

om
■ Written in C.
■ Has main.

.c
■ Uses stdio.h, string.h, etc.

ul
■ Need to add mpi.h header file.

pa
■ Identifiers defined by MPI start with
"MPI ".
jin
■ First letter following underscore is
.re
uppercase.
■ For function names and MPl-defined types.
w

■ Helps to avoid confusion.

w
w
MPI Com onents
www.rejinpaul.com

om
■ MPI lnit

.c
■ Tells MPI to do all the necessary setup.

ul
int MPI_Init(
int+ argc_p / in/out /,

pa
char # argv_p / in/out D);

■ MPI Finalize
■ jin
Tells MPI we're done, so clean up anything
.re
allocated for this program.
I int MPI_Finalize(void );
w
w
w
Basic Outline
www.rejinpaul.com

om
#include <mpi.h>

.c
int main(int argc, char+ argv[]) {

ul
/+ No MPI calls before this

pa
MPI_Init(&argc, &argv);

jin
MPI_Finalize0;
/+ No MPI calls after this l
.re
return 0;
}
w
w
w
Communicators
www.rejinpaul.com

om
■ A collection of processes that can send

.c
messages to each other.
■ MPI - lnit defines a communicator that

ul
consists of all the processes created when

pa
the program is started.

■
jin
Called MPI - COMM -WORLD.
.re
w
w
w
Communicators
www.rejinpaul.com

om
int MPI_Comm_size(
MP I_Comm comm /,

.c
comm_Sz_P / out /);

ul
number of processes in the communicator

pa
int MPI_Comm_rank(
MP I_Comm
int+ jincomm
my_rank_p
/+
/
in /
out +D);
'!>
.re
my rank
(the process making this call)
w
w
w
SPMD
www.rejinpaul.com

om
■ Single-Program Multiple-Data

.c
■ We compile one program.
■ Process O does something different.

ul
■ Receives messages and prints them while the

pa
other processes do the work.

■ jin
The if-else construct makes our program
.re
SPMD.
w
w
w
Communication
www.rejinpaul.com

om
int MP I_Send(

.c
■
void + mSG_ buf P /+ 1n /,
■
int mSg_S1ze /+ 1n /,

ul
■
■
MP I _Datatype msg_type /+ 1n / '

pa
■
int de st /+ 1n /
■
'
int tag /+ 1n /,
■

MP I Comm
jin
communicator /+ 1n /);
.re
w
w
w
www.rejinpaul.com
Data t es

om
MPI datatype C datatype
MP I - CHAR signed char

.c
MP I - SHORT signed short int
MP I - INT signed int

ul
MP I - LONG signed long int

pa
MP I - LONG LONG signed long long int
MP I UNSIGNED CHAR unsigned char
MP I UNSIGNED - SHORT unsigned short int

jin
MP I UNSIGNED
MP I UNSIGNED - LONG
unsigned int
unsigned long int
.re
MP I - FLOAT float
MP I DOUBLE double
MP I - LONG_DOUBLE long double
w

MP I - BYTE
w

MP I - PACKED
w
Communication
www.rejinpaul.com

om
int MP I_Rec(

.c
void + mSg_ buf _ p /+ out +/ ,
. .
int buf s 7e /+ 1n +/ ,

ul
- 1
.

pa
MP I _Datatype buf _type /+ 1n /,
.
int source /+ 1n /,
.
int tag /+ 1n /,
MP I Comm jin communicator /+
.
1n/
/+ out /);
':a
.re
MP I - Status status P
w
w
w
Messa e matching
www.rejinpaul.com

om
MP I_Send(send_buf_p, send_buf_sz, sen_type. @esp.ssm_tad.
<3cad_soi. /

.c
r

ul
MP/ Send

pa
jin dest = r
.re
MP I_Recv(recv'_buf_p, recv_buf_sz, reev_type.Groce_tad
<res_eon». &status):
~
w

q
w
w
Receivin
www.rejinpaul.commessages

om
■ A receiver can get a message without

.c
knowing:
■ the amount of data in the message,

ul
■ the sender of the message,

pa
■ or the tag of the message.

jin
.re
w
w
w
status_p argument
www.rejinpaul.com

om
MP I_Recv(recv_buf_p, recv_buf_sz, recv_type, src, recv_tag,

.c
recv_comm, &status);

ul
MPI Status

pa
MPI_Status* status;
jin MP/ SOURCE
MP/ TAG
.re
MP/ ERROR
status.MPI SOURCE
status.MPI TAG
w
w
w
How much data am I receiving?
www.rejinpaul.com

om
.c
int MP I_Get_count (
"
MP I - Status+ status _P /+ In / ~

ul
I

MP I _Datatype type /+ Ir / ~

int + count _P /+ out /);

pa
jin
.re
w
w
w
Issues with send and receive
www.rejinpaul.com

om
■ Exact behavior is determined by the MPI

.c
implementation.
■ MPI_Send may behave differently with

ul
regard to buffer size, cutoffs and blocking.

pa
■ MPI_Recv always blocks until a matching

jin
message is received.
■ Know your implementation;
.re
don't make assumptions!
w
w
w
www.rejinpaul.com

om
.c
ul
pa
jin
.re
TRAPEZOIDAL RULE IN MPI
w
w
w
The Tra ezoidal Rule
www.rejinpaul.com

om
.c
y y

ul
pa
a jin
b X a b X
.re
{a) {b)
w
w
w
The Tra ezoidal Rule
www.rejinpaul.com

om
h
= ,[f(») +fan)]

.c
Area of one trapezoid

ul
/]•
. b-a
n

pa
o=a,n=a+h,x=a+2h, ..., =a+(n1)h, x,=b
jin Jn 1
.re
Sum of trapezoid areas = h[f(ro)/2+f() +f()+·· +f(n1)+f()/2]
w
w
w
One tra ezoid
www.rejinpaul.com

om
y
y=f(x)

.c
f(x)

ul
f(x;4) +--------•

pa
jin
.re
X, Xi X
w

h
w
w
Pseudo-code
www.rejinpaul.com for a serial

om
program

.c
/ Input: a, b, n l
h = (b a)/n;

ul
approx = (f(a) + f(b))/2.0;

pa
for (i = 0; i <=nl; i++) {
x i =a + ih;
jin
approx += f(x_i);
.re
}
approx = h+approx;
w
w
w
Parallelizin
www.rejinpaul.com the Trapezoidal Rule

om
1. Partition problem solution into tasks.

.c
2. Identify communication channels between

ul
tasks.

pa
3. Aggregate tasks into composite tasks.
4. Map composite tasks to cores.
jin
.re
w
w
w
Parallel Pseudo-code
www.rejinpaul.com

om
I Get a, b, n;
2 h = (b a)/n;
3 local n - n/comm_sz ;

.c
4 local_a = a + my_rank+local_nh;
5 local_b = local_a + local_nh;

ul
6 local_integral = Trap(local_a. local_b. local_n. h);
7 if (my_rank != 0)

pa
8 Send local_integral to process 0;
9 else / my_rank == 0 l
10 total_integral = local_integral;
11
12 jin
for (proc = l; proc < comm_sz; proc++) {
Receive local_integral from proc;
.re
13 total_integral += local_integral;
14 }
15
w

16 if (my_rank == 0)
17 print result;
w
w
Tasks and communications for
www.rejinpaul.com

om
Trapezoidal Rule

.c
mpute a mpute a mpute a

ul
of trap 0 of trap 1 trap n-

pa
jin
.re
w
w
w
First version 1)
www.rejinpaul.com

om
I int main(void) {
2 int my_rank. comm_sz, n = 1024. local_n;
3 double a = 0.0, b = 3.0, h, local_a. local_b;

.c
4 double local_int, total_int;
5 int source;

ul
6
7 MP I_Init(NULL, NULL);
8 MP I_Comm_rank (MP I_COMM_WORLD, &my_rank );

pa
9 MP I_Comm_size (MP I_COMM_WORLD, &comm_sz );
10
H h = (ba)/n; / h is the same for all processes l
12
13
14
localn n/comm_sz;
=

jin/ So is the number of trapezoids

local a = a + my_rank+local_nh;
l
.re
15 local b = local_a + local_nh;
16 local int = Trap(local_a. local_b, local_n, h);
17
I8 if (my_rank != 0) {
w

19 MP I_Send(&local_int, I, MPI_DOUBLE, 0, 0,
20 MP I_COMM_WORLD);
w
w
First version 2)
www.rejinpaul.com

om
21 } else {
22 total_int = local_int;
for (source = l; source < comm_sz; source++) {

.c
23
24 MP I_Recv(&local_int, I, MPI_DOUBLE, source, 0,
25 MP I_COMM_WORLD, MPI_STATUS_IGNORE);

ul
26 total int += local_int;
27 }

pa
28 }
29
30 if (my_rank == 0) {
31 printf("With n = %d trapezoids, our estimate\n", n);
32
33
34 }
jin
printf("of the integral from %f to %f = %.15e\n",
a. b, total_int);
.re
35 MPI_Finalize 0;
36 return 0;
37 } / man al
w
w
w
First version 3)
www.rejinpaul.com

om
1 double Trap(
2 double left_endpt / in /

.c
'
3 double right_endpt / in /
'
4 int trap_count / in /
'

ul
5 double base len / in /) {
6 double estimate, x;
7 int i;

pa
8
9 estimate = (f(left_endpt) + f(right_endpt ))/2.0;
10 for (i = l; i <= trap_count l; i++){
11
12
13 }
jin
x = left_endpt + ibase_len;
estimate += f(x);
.re
14 estimate = estimatebase_len;
15
16 return estimate;
w

17 } /+ Trap /
w
w
Deal in with 1/0
www.rejinpaul.com

om
#include <stdio.h>
#include <mpi.h>

.c
Each process just
int main(void) { prints a message.

ul
int my_rank, comm_sz;

pa
MP I_Init (NULL, NULL);
MP I_Comm_Size (MP I_COMM_WORLD, &comm_sz );
MP I_Comm_rank (MP I_COMM_WORLD. &my_rank);

jin
printf("Proc %d of %d > Does
my_rank. comm_sz);
anyone have a toothpick?\n",
.re
MPI_Finalize 0;
return 0;
} / main l
w
w
w
_www.rejinpaul.com
processes R

om
Proc 0 of 6 > Does anyone have a toothpick?

.c
Proc I of 6 > Does anyone have a
toothpick? Proc ?.....
Does anyone have£
6 >
a toothpick?

ul
I O
Proc 4 OI 6 > Does anyone
£
have a toothpick?

pa
Proc 3 0I 6 > Does anyone
£
have a toothpick?
Proc 5 OI 6 > Does anyone
£
have a toothpick?

jin
unpredictable output
.re
w
w
w
Inwww.rejinpaul.com
ut

om
■ Most MPI implementations only allow

.c
process O in MPI_COMM_WORLD access
to stdin.

ul
■ Process O must read the data (scanf) and

pa
send to the other processes.

jin
MP I_Comm_rank (MP I_COMM_WORLD,
MP I_Comm_Size (MP I_COMM_WORLD,
&my_rank);
&comm_sz );
.re
Get_data(my_rank, comm_sz, &a, &b, &n);
w

h = (b a)/n;
w
w
Function for reading user input
www.rejinpaul.com

om
void Get_input(
int my_ rank /+in /,
int comm- sz / in /,
double + a _P / out /,

.c
double+ b _P / out /,
int+ n _P / out /) {
int dest;

ul
if (my_rank -- 0) {

pa
printf("Enter a, b, and n\n");
scanf("Slf Slf d". a_p. b_p. n_p):
for (dest = l; dest < comm_sz; dest++) {
MP I_Send(a_p. I, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);

} jin
MP I_Send(b_p.
MPI_Send(n_P.
I, MPI_DOUBLE, dest,
I, MPI_INT, dest, 0,
0, MPI_COMM_WORLD);
MPI_COMM_WORLD);
.re
} else { /+ my_rank != 0 l
MPI_Recv(a_p. I, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
MP I_STATUS_IGNORE):
MPI_Recv(b_p. I, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
w

MP I_STATUS_IGNORE );:
MP I_Recv(n_P. I, MPI_INT, 0, 0, MPI_COMM_WORLD,
MP I_STATUS_IGNORE):
w

}
} /» Get_input l
w
www.rejinpaul.com

om
.c
ul
pa
jin
.re
COLLECTIVE
w

COMMUNICATION
w
w
Tree-structured
www.rejinpaul.com communication

om
1. In the first phase:
(a) Process 1 sends to 0, 3 sends to 2, 5 sends to 4, and

.c
7 sends to 6.

ul
(b) Processes 0, 2, 4, and 6 add in the received values.
(c) Processes 2 and 6 send their new values to

pa
processes 0 and 4, respectively.
(d) Processes 0 and 4 add the received values into their
new values.
jin
.re
2. (a) Process 4 sends its newest value to process 0.
(b) Process 0 adds the received value to its newest
w

value.
w
w
Awww.rejinpaul.com
tree-structured global sum

om
Processes
0 1 2 3 4 5 6 7

s (

.c
3

ul
pa
jin
.re
w
w
w
An alternative tree-structured
www.rejinpaul.com

om
global sum
Processes

.c
0 1 2 3 5 6 7
3 @ € (

ul
pa
jin
.re
w
w
w
MPI Reduce
www.rejinpaul.com
-

om
int MP I - Reduce (
void + /%

.c
input_ data _P In / +

void + output_ data D / out +/ ,,

ul
int count /% 1n +/ "
MP I _Datatype datatype /+ 1n +/ "

pa
MP I _Op operator /% In /
'
int de st _process /% 1n +/ "
MP I Comm comm /¥ 1n /);

jin
MP I_Reduce(&local_int, &total_int, I, MPI_DOUBLE, MPI_SUM, 0,
.re
MP I_COMM_WORLD);

double local_x[N], sum[N];

MP I_Reduce(local_x, sum, N, MPI_DOUBLE, MPI_SUM, 0,

MP I_COMM_WORLD);
w
w
Predefined
www.rejinpaul.com reduction operators

om
in MPI
[ Operation Value [ Meaning

.c
MP I_MAX Maximum
MP I_MIN Minimum

ul
MP I_SUM Sum

pa
MP I_PROD Product
MP I_LAND Logical and
Bitwise and
MP I_BAND
MP I_LOR
MPI....BOR
jinLogical or
Bitwise or
.re
MP I_LXOR Logical exclusive or
MPI....BXOR Bitwise exclusive or
w

MP I_MAXLC Maximum and location of maximum

MP I_MINLOC Minimum and location of minimum

w
Collective
www.rejinpaul.comvs. Point-to-Point

om
Communications
■ All the processes in the communicator

.c
must call the same collective function.

ul
pa
■ For example, a program that attempts to
jin
match a call to MPI - Reduce on one
process with a call to MPI_Recv on
.re
another process is erroneous, and, in all
likelihood, the program will hang or crash.
w
w
w
Collective
www.rejinpaul.comvs. Point-to-Point

om
Communications
■ The arguments passed by each process to

.c
an MPI collective communication must be

ul
"compatible."

pa
■
jin
For example, if one process passes in O as
the dest_process and another passes in 1,
.re
then the outcome of a call to MPI - Reduce
is erroneous, and, once again, the
w

program is likely to hang or crash.

w
w
Collective
www.rejinpaul.comvs. Point-to-Point

om
Communications
■ The output_data_p argument is only used

.c
on dest_process.

ul
pa
■ However, all of the processes still need to
pass in an actual argument corresponding
jin
to output_data_p, even if it's just NULL.
.re
w
w
w
Collective
www.rejinpaul.comvs. Point-to-Point

om
Communications
■ Point-to-point communications are

.c
matched on the basis of tags and

ul
communicators.

pa
■ Collective communications don't use tags.
jin
■ They're matched solely on the basis of the
.re
communicator and the order in which
they're called.
w
w
w
Exam
www.rejinpaul.com

om
.c
[ Time ][Process 0 I Process 1 [Process 2

ul
0 a =1; c= 2 a= 1; C = 2 a = l; c = 2
I MPI_Reduce(a, &b, ...) MP I_Reduce (&c, &d, ...) MP I_Reduce(&a, &b, ...)
2 MPI_Reduce(&c, &d, ...) MPI_Reduce(&a, &b, ...) MP I_Reduce(&c, d, ...)

pa
jin
Multiple calls to MPI_Reduce
.re
w
w
w
Exam
www.rejinpaul.com

om
■ Suppose that each process calls

.c
MPI_Reduce with operator MPI_SUM, and
destination process 0.

ul
pa
■ At first glance, it might seem that after the

jin
two calls to MPI_Reduce, the value of b
will be 3, and the value of d will be 6.
.re
w
w
w
Exam
www.rejinpaul.com

om
■ However, the names of the memory

.c
locations are irrelevant to the matching of

ul
the calls to MPI - Reduce.

pa
■ The order of the calls will determine the
jin
matching so the value stored in b will be
1+2+1 = 4, and the value stored in d will
.re
be 2+1+2 =5.
w
w
w
MPI Allreduce
www.rejinpaul.com
-

om
■ Useful in a situation in which all of the

.c
processes need the result of a global sum

ul
in order to complete some larger
computation.

pa
int MPI_Allreduce(
~
void+
void+ jin input_data_p
output_data_p
/+
Ik
1n
out
~
/
kI
'
'
.re
int count / 1n /
., '
MP I_Datatype datatype /+ 1n / 7
~
MP I_Op operator /+ 1n /
' .
w

.,
MPI Comm comm /+ 1n / ) ,.
w
w
www.rejinpaul.com

om
Broadcast

.c
■ Data belonging to a single process is sent

ul
to all of the processes in the

pa
communicator.
int MPI_Bcast(
void+ jin data_p Ik 1n I out kI
"
.re
int count Ik 1n kI
"
MP I_Datatype datatype /+ 1n kI
"
int source_proc /+ 1n kI
"
w

MP I Comm comm /+ 1n kI ) ,.,

w
w
www.rejinpaul.com

om
( A tree-structured broadcast.

.c
I
I
I
I

ul
6

pa
I
I I
I I
I I

$ I
jin 6
I
.re
I I
I I I
I I I I

@ 0 @ 0
w

0 1 2 3 4 5 6 7
w

Processes
w
Awww.rejinpaul.com
version of Get_input that uses

om
MPI - Beast
void Get_input (

.c
int my_ rank /+ In /
'
int

ul
comm - S Z / In /
'
double + a_P /+ out /
'
double + b p out /

pa
/+
'
int ·k n_P / out /) {

==
if (my_rank

jin
0) {
printf("Enter a, b, and n\n");
scanf("If %lf %d", a_p. b_p. n_p):
.re
}
MPI_Bcast(a_p. I. MPI_DOUBLE, O, MPI_COMM_WORLD);
MPI_Bcast(b_p. I, MPI_DOUBLE, O, MPI_COMM_WORLD);
w

MPI_Bcast(n_p. I. MPI_INT, O, MP I_COMM_WORLD);

} / Get_input l
w
w
www.rejinpaul.com

om
■ Block partitioning

.c
■ Assign blocks of consecutive components to
each process.

ul
■ Cyclic partitioning

pa
■ Assign components in a round robin fashion.
■ Block-cyclic partitioning
■ jin
Use a cyclic distribution of blocks of
.re
components.
w
w
w
Parallel implementation of
www.rejinpaul.com

om
vector addition

.c
void Parallel_vector_sum(
double local_x [] / 1n #/
"

ul
double local_y [l / 1n /
"
double local_z [] / out /
"
int / n /) {

pa
local n
int local_i;

for (local_i = 0; local_i < local_n: local_i++)

} he jin
local_z[local_i] = local_x[local_i] + local_y[local_il:
Parallel_vector_sum
.re
w
w
w
Scatter
www.rejinpaul.com

om
■ MPI Scatter can be used in a function that

.c
reads in an entire vector on process O but
only sends the needed components to

ul
each of the other processes.

pa
int MP I_Scatter(
void +
int
jin
MP I _Datatype
send - buf _P
send - count
send _type
/%
/
/
In
n
n
/ ,,
/ ,,
/,
.re
void + recv - buf _P /# out / ,,
int recv - count / n / ,,
MP I _Datatype recv_type / n /,
int src_proc / In / ,,
w

MP I - Comm comm / n /);

w
w
www.rejinpaul.com

om
void Read_vector(
double local_a[] / out /,
int local n /+ in /,
int n /+ in /,

.c
char vec_name[] /+ in /,
int my_ rank / in /,

ul
MP I Comm comm /+ in /) {

double + a = NULL;

pa
int i :

if (my_rank == 0) {
a = malloc(n+sizeof (double ));

jin
printf("Enter the vector %s\n", vec_name);
for (i = 0; i <n; i++)
scanf("Slf", &ali):
.re
MPI_Scatter(a, local_n, MPI_DOUBLE, local_a. local_n, MPI_DOUBLE.
0, comm);
free(a);
} else {
w

MPI_Scatter(a, local_n, MPI_DOUBLE, local_a. local_n, MPI_DOUBLE.

0, comm);
}
w

} / Read_vector l
w
Gather
www.rejinpaul.com

om
■ Collect all of the components of the vector

.c
onto process 0, and then process O can
process all of the components.

ul
pa
int MP I - Gather (
void k send - buf _P / 1n /
"
int send - count / 1n /
"

void k jin
MP I _Datatype send _type
recv - buf p.
/
/
1n
out
/
/
,,
"
.re
int recv - count / 1n /
"
MP I _Datatype recv_type / 1n / "
int dest _proc / 1n / ,,
w

MP I - Comm comm / 1n / ) .,,

w
w
Print a distributed vector (1)
www.rejinpaul.com

om
void Print_vector(

.c
double local_b[] / in l,
int local n /+ in l,

ul
int n /+ in l,

pa
char title[] / in #l
'

int my_rank /+ in l,

jin
MP I Comm comm /+ in +D) {
.re
double+ b = NULL;
int i;
w
w
w
Print a distributed vector (2)
www.rejinpaul.com

om
.c
if (my_rank == 0) {
b = malloc(n+sizeof(double ));

ul
MPI_Gather(local_b. local_n, MPI_DOUBLE, b. local_n, MPI_DOUBLE,
0, comm);
printf("%s\n", title):

pa
for (i = 0: i < n; i++)
printf(&E ", b[i]):
printf("\n");
free(b);
} else {
MPI_Gather(local_b,
jin
local_n, MPI_DOUBLE, b. local_n, MPI_DOUBLE,
.re
0, comm);
}
} h» Print_vector l
w
w
w
All ather
www.rejinpaul.com

om
■ Concatenates the contents of each

.c
process' send_buf_p and stores this in
each process' recv_buf_p.

ul
■ As usual, recv_count is the amount of data

pa
being received from each process.
void +
int
jin
int MPI_Allgather(
send - buf _P
send - count
/%
/%
n
In
/
'
/ ,,
.re
MP I _Datatype send _type /% In /
'
void + recv - buf _P /% out / ,
int recv - count /+ In +/ ,,
w

MPI _Datatype recv_type /+ In +/

'
MPI - Comm comm /+ In /) ,,.
w
w
Multi a matrix by a vector
www.rejinpaul.com

om
.c
/« For each row of A /
for (i = 0; i <m; i++) {

ul
/# Form dot product of ith row with x +l

pa
v[il = 0.0:
for (j = 0; j < n; j++)

} jin
ylil += Alillil+xlil:
.re
Serial pseudo-code
w
w
w
Cwww.rejinpaul.com
st le arra s

om
0 l 3 2

.c
6 7
4 5
\

ul
, ._

8 9 10 11
stored as

pa
jin
.re
w
w
w
Serial matrix-vector
www.rejinpaul.com

om
multiplication
d Mat_vect_mult(
Vol•

.c
double A[] / In /
"
double x[] /» 1n /
"

ul
double y[] / out /
"
int m / 1n / ,.

pa
int n / 1n /) {
.
int 1 J "
'

jin
for (i = 0;
yli] = 0.0;
i <m; i++) {
.re
for (j = 0; <
n; j++)
j
yIi] += A[i+n+jl+x[jl:
}
w
w
w
An MPI matrix-vector
www.rejinpaul.com

om
multiplication function (1)
void Mat_vect_mult(

.c
■

double local_A [L /·.·.,.- In /

"
double ■

ul
local_x[] / In /
"
double local_y[l
- out
/· · /
"

pa
■

int I··- 1n
local m
-/ ·•~- ■
/
"
int n / ,.
-• 1n ■

int local n /±" /

jin
MP I Comm
double+ x;
comm /%
-•
1n
■

1n
"
/) {
.re
int local_i, j:
int local ok = 1:'
w
w
w
An MPI matrix-vector
www.rejinpaul.com

om
multiplication function (2)

.c
x = malloc(n+sizeof (double ));

ul
MP I_Allgather(local_x. local_n, MPI_DOUBLE,
x, local_n, MPI_DOUBLE, comm);

pa
for (local_i = 0; local_i < local_m; local_i++) {
local_y[local_i] = 0.0;

jin
for (j = 0; j < n; j++)
local_y[local_i] += local_A[local_in+jl+x[jl;
.re
}
w
w
w
www.rejinpaul.com

om
.c
ul
pa
jin
.re
MPI DERIVED DATATYPES
w
w
w
www.rejinpaul.com

om
■ Used to represent any collection of data items in
memory by storing both the types of the items

.c
and their relative locations in memory.

ul
■ The idea is that if a function that sends data
knows this information about a collection of data

pa
items, it can collect the items from memory

jin
before they are sent.
■ Similarly, a function that receives data can
.re
distribute the items into their correct destinations
in memory when they're received.
w
w
w
Derived DATATYPES
www.rejinpaul.com

om
■ Formally, consists of a sequence of basic
MPI data types together with a

.c
displacement for each of the data types.

ul
■ Trapezoidal Rule example:

pa
[Variable Address]
a
b
24
40 jin
{(MP I_DOUBLE,O), (MP I_DOUBLE, 16), (MP I_INT,24)}
.re
n 48
w
w
w
MPI_Type
www.rejinpaul.comcreate_struct

om
■ Builds a derived datatype that consists of

.c
individual elements that have different

ul
basic types.

pa
int MPI_Type_create_struct(
int count /+ In /
'
int
MPI Aint
MP I_Datatype
jin
array_of_blocklengths[]
array_of_displacements[]
array_of_types[]
/+
/+
/+
In
In
In
+/
/
+/
'
·~
.re
'
MP I_Datatype+ new_type_p /+ out ±/ ) ,,
4
w
w
w
MPI Get address
www.rejinpaul.com
- -

om
■ Returns the address of the memory

.c
location referenced by location_p.

ul
■ The special type MPI_Aint is an integer
type that is big enough to store an address

pa
on the system.

void+
jin
int MPI_Get_address(
,.
.re
location_p /+ 1n /
'
MP I Aint+ address_p /# out +/) ■

'
w
w
w
MPI_Type_commit
www.rejinpaul.com

om
■ Allows the MPI implementation to optimize

.c
its internal representation of the datatype
for use in communication functions.

ul
pa
int MPI_Type_commit(MPI_Datatype new_mpi_t_p /+ in/out /);

jin
.re
w
w
w
MPI_Type_free
www.rejinpaul.com

om
■ When we're finished with our new type,

.c
this frees any additional storage used.

ul
pa
int MPI_Type_free(MPI_Datatype old_mpi_t_p /+ in/out /);

jin
.re
w
w
w
Get input function with a derived
www.rejinpaul.com

om
datatype (1)

.c
void Build_mpi_type(
double + n /

ul
a_P /#
'
double + b -P /+ n /
'
int + n_P /#% In /,

pa
MPI _Datatype+ input_mpi_t_p /# out +/) {

int array_of_blocklengths[3] = {I, 1, 1};

jin
MPI_Datatype array_of_types[3] = {MPI_DOUBLE,
MPI Aint a_addr, b_addr. n_addr;
MPIAint array_of_displacements[3] = {0};
MPI_DOUBLE, MPI_INT};
.re
w
w
w
Get input function with a derived
www.rejinpaul.com

om
datatype (2)

.c
MP I_Get_address(a_p. &a_addr );

ul
MP I_Get_address(b_p. &b_addr);
MP I_Get_address(n_p. &n_addr ):

pa
array_of_displacements[I] = b_addr-a_addr;
array_of_displacements[2] = n_addra_addr;

jin
MP I_Type_create_struct(3, array_of_blocklengths,
array_of_displacements, array_of_types,
.re
input_mpi_t_p):
MP I_Type_commit(input_mpi_t_p);
} h» Build_mpi_type «/
w
w
w
Get input function with a derived
www.rejinpaul.com

om
datatype (3)

.c
void Get_input(int my_rank, int comm_sz, double# a_p, double+ b_p.
int+ n_p) {

ul
MP I_Datatype input_mpi_t;

pa
Build_mpi_type(a_p. b_p. n_p. &input_mpi_t);

if (my_rank 0) {
==

}
jin
printf("Enter a, b, and n\n");
scanf("&If %lf %d", a_p. b_p. n_p):
.re
MPI_Bcast(a_p. I. input_mpi_t, 0, MPI_COMM_WORLD);

MP I_Type_free(&input_mpi_t);
w

} /+ Get_input /
w
w
www.rejinpaul.com

om
.c
ul
pa
jin
.re
PERFORMANCE EVALUATION
w
w
w
Ela sed arallel time
www.rejinpaul.com

om
■ Returns the number of seconds that have

.c
elapsed since some time in the past.

ul
double MP I_Wtime(void );

pa
double start, finish;

start =
jin
MPI_Wtime0;
/ Code to be timed l
.re
finish = MPI_Wtime0);
printf("Proc d > Elapsed time = Se seconds\n"
w

my_rank, finish-start);
w
w
Ela sed serial time
www.rejinpaul.com

om
■ In this case, you don't need to link in the

.c
MPI libraries.
■ Returns time in microseconds elapsed

ul
from some point in the past.

pa
#include "timer.h"

jin
double now;
.re
GET_TIME(now);
w
w
w
Ela sed serial time
www.rejinpaul.com

om
.c
#.nc.1u d e "" t 1mer.h""

ul
double start, finish;

pa
GET_TIME(start):
/ Code to be timed l

GET_TIME(finish);
printf("Elapsed time jin = Se seconds\n", finish-start);
.re
w
w
w
MPI Barrier
www.rejinpaul.com
-

om
■ Ensures that no process will return from

.c
calling it until every process in the
communicator has started calling it.

ul
pa
int MPI_Barrier(MPI_Comm comm /+ in 1);

jin
.re
w
w
w
www.rejinpaul.com

om
MPI - Barrier

.c
double local_start, local_finish,

ul
local_elapsed, elapsed;

pa
MP I_Barrier(comm);
local start = MPI_Wtime (0);
/ Code to be timed el

jin
local_finish = MPI_Wtime0);
.re
local_elapsed = local_finish - local_start;
MP I_Reduce(&local_elapsed, &elapsed, I,
MPI_DOUBLE, MP I_MAX. 0, comm);
w

if (my_rank 0)
w

printf("Elapsed time = %e seconds\n", elapsed);

I
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

UNIT V- PARALLEL PROGRAM DEVELOPMENT

Case studies - n-Body solvers – Tree Search – OpenMP and MPI implementations
and comparison.

Introduction
Many physical phenomena directly or indirectly (when solving a discrete version of a

om
continuous problem) involve, or can be simulated with particle systems, where each
particle interacts with all other particles according to the laws of physics. Examples
include the gravitational interaction among the stars in a galaxy or the Coulomb forces
exerted by the atoms in a molecule. The challenge of efficiently carrying out the related
calculations is generally known as the N-body problem.
Mathematically, the N-body problem can be formulated as

.c
U (x 0 ) F (x 0 ,x i ) (1)
i

where U(x0) is a physical quantity at x0 which can be obtained by summing the pairwise

ul
interactions F(x0,xi) over the particles of the system. For instance, assume a system of
N particles, located at xi and having a mass of mi. The gravitational force exerted on a
particle x having a mass m is then expressed as
pa
N
x xi
F(x) Gmmi 3 (2)
i 1 x xi

where G is the gravitational constant.

The task of evaluating the function U(x0) for all N particles using (1) requires O(n)
jin

operations for each particle, resulting in a total complexity of O(n2). In this paper we
will see how this complexity can be reduced to O(n log n) or O(n) by using efficient
methods to approximate the sum in the right hand term of (1), while still preserving
such important physical properties as energy and momentum.
.re

Example N-body problem: 10 million-star galaxy

The summation techniques presented in this paper may be applied to a wide range of
N-body problems. However, in order to clarify it is beneficial to have an actual
problem to apply the ideas on. In this case, I have chosen the problem of simulating the
w

movements of the stars of a galaxy. Let’s assume that there are about N:=10 million
starts in the galaxy, although this is clearly much less than “in the real world”.
Furthermore, for the sake of clarity the simulation will be done in two dimensions.
w

In the model, each star has the following quantities associated with it:
mass, mi
w

position, xi (depends on time)

Newtonian (let’s not allow Einstein to mess things up) physics tells us that for each
star
2 N x j xi
x (t) Gmi 3 (3)
x 2 j i 1, j i x j xi

Then, for each timestep from time tk to tk+1:= t+tk we need to integrate the right hand
term of equation (3) in order to obtain the change in position:
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

xj F(x j (t))dtdt (4)

t k ,t k 1

where
N xj xi
F(x j ) Gmi 3 (5)
i 1 xj xi

(4) is a somewhat difficult integral equation, since xj is present on both sides. Also, xi

om
is dependent on t, which means we have a system of N coupled integral equations for
each time step.
A discrete version of (4) (which can be obtained by making certain assumptions) has
the general form1
k
xj ci F(x j (t hi )), k (6)

.c
i 1

and is thus a linear combination of the function F evaluated at different time points;
different discrete integration schemes yield different coefficients ci and hi. A commonly

ul
used integrator is the so-called Leapfrog integration scheme.
We can now formulate an algorithm for the simulation:
pa
1. Set initial positions
2. for each timestep t do
3. for each particle j do
4. evaluate F(xj(t)) at timepoints required by
the integrator
5. use the integrator to calculate xj
jin

6. xj(t+ t)=xj(t)+ x
7. endfor
8. endfor
The function F is of the form (1), and thus the N-body force calculation algorithms
.re

presented in this paper can be used to speed up step 4 of the algorithm.

The Particle-Particle (PP) method

The method of evaluating the right hand side of (1) directly is generally referred to as
the Particle-Particle (PP) method2. This brute-force approach is clearly not feasible for
w

large amount of particles due to the O(n2) time requirement, but can effectively be used
for small amounts of particles3. As this method does not approximate the sum, the
accuracy equals machine precision.
w

Tree codes
Let a be the radius of the smallest disc, call it D, so that the set of particles
w

P:=(xi1,..,xiN) are inside the disc. Many physical systems have the property that the
field U(x) generated by the the particle set P may be very complex inside D, but
smooth (“low on information content”) at some distance c a from D. The gravitational
force, for instance, has this property. 4
This observation is used in the so-called tree code-approach to the N-body problem:
Clusters of particles at enough distance from the “target particle” x0 of equation 1 are
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

given a computationally simpler representation in order to speed up summation5. This

approach can be illustrated by the following example: when calculating the
gravitational force exerted by earth on an orbiting satellite, we do not try to sum the
gravitational forces of all atoms that constitute planet earth. Instead we approximate
the force with that of an infinitesimal particle, which have the same mass as earth and is
located at earth’s center of mass.

om
P
x0

.c
P
x0

ul
Figure 1 Approximating the set P

The tree-code approach is effective because it can be applied recursively6. For

instance, in Figure 5, we might also be able to use a subset of P’ P to approximate the
pa
force on the particle down to the left, as the radius a’ of the disc D’ enclosing P’ is less
than a, and thus the minimal distance for approximation, ca’, is also smaller.
We need a data structure that supports the recursive nature of the idea, in 2D this is a
quadtree and in 3D an octree. Each node in the quadtree is a square, and has four child
jin

nodes (unless it is a leaf node), representing a break-up into four smaller squares, ¼
the size of the original square (see Figure 2). An octree is built in a similar manner
(each node has 8 children).
.re
w
w

Figure 2 Quadtree
w

We now construct a quadtree, so that each leaf node contains only 1 particle. This is
done by recursively subdividing the computational box; each node is further subdivided
if it contains more than one particle.
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

om
.c
Figure 3 Adaptive quadtree with one particle/leaf. The picture is from [Demmel1]

ul
Assuming the particles are not at arbitrarily small distances from each other (at least
the machine precision sets a limit) a quadtree can be built in O(n min(b,log n)) time,
pa
where b is the machine precision7.
Now assume that the distance between a cluster and a particle must be at least the
length of the side of the cluster box in order to obtain an accurate approximation.
When calculating the force on a particle x0, the tree is recursively traversed from the
root. At each level, there may be no more than 9 boxes (the ones surrounding the box
jin

containing the particle) which need further subdivision, limiting the number of force
calculations on the next level to 27 (=2*3*2*3-9, see Figure 4). Thus as each level a
maximum of 27 O(1) operations are performed. The depth of the tree is min(b,log(n))
yielding a total complexity (for all N particles) of O(n min(b,log n)).
.re

2 2 2 2 2 2
1 3 3 3 3 3 3
2 2 3 3 3 2
3 3 X 3
2 2 3 2
w

3 3
1 3 3 3 3 3 3
2 2 3 3 3 3 3 3 2

2 2 2 2 2 2
w

1
2 2 2 2 2 2
w

1 1 1 1

Figure 4 Quadtree nodes calculated at each level for particle x

Tree codes thus reduces the computational complexity from O(n2) to O(n log n) or
O(n) depending on your point of view - certainly a vast improvement! But as the
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

saying goes, there’s no such thing as a free lunch: tree codes are less accurate than
simple PP, and require more auxiliary storage8.

The Barnes-Hut algorithm: applying tree codes.

The Barnes-Hut algorithm is a good example of how to use tree codes in an algorithm.
The algorithm was presented in [Barnes&Hut] in 1986, and is widely used in
astrophysics. The algorithm has also been successfully parallized.9 This discussion of

om
the algorithm is mainly based on [Demmel1], which describes Barnes-Hut for a 2D N-
body system similar to our example.
The main idea is to approximate long-range forces by aggregating particles into one
particle, and using the force exerted by this particle. A quadtree structure (or octree in
3D10), as described in the previous section, is used to store the particles. The tree steps
of the algorithm are

.c
1. Build quadtree as described in section 0
2. Traverse the quadtree from the leaves to the root, computing
center of mass and total mass for each parent node.
3. For each particle, traverse the tree from the root,

ul
calculating the force during the traversal

Step 2
pa
Step 2 calculates the approximations for the long-range force. The approximation is
made by considering several particles as one, with a position equal to the center of
mass of the approximated particles, and a mass the sum of the approximated particles’
masses. More formally, to find the mass and position associated with a node N:
calculate_approxiamtions( N )
jin

if N is a leaf node
return; // Node has a (real) particle => has mass & position
for all children n of N do
calculate_approximations( n )
M := 0
cm := (0,0)
.re

for all children n of N do

M := M + mass of n
cm := cm + mass of n * position of n
endfor
cm := 1/M * cm
mass of N := M
position of N := cm
w

end
Step 3
Consider the ratio,
w

D
(7)
r
where D is the size of the current node (call it A) “box” and r is the distance to the
w

center of mass of another node (called B). If this ratio is sufficiently small, we can use
the center of mass and mass of B to compute the force in A. If this is not the case, we
need to go to the children of B and do the same test. Figure 4 shows this for =1.0;
the numbers indicate the relative depth of the nodes. It can clearly be seen that further
away from the particle, x, large nodes are used and closer to it smaller. The method is
accurate to approximately 1% with =1.011. Expressed in pseudocode
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

treeForce(x, N)
if N is a leaf or size(N)/ x-N <
return force(x,N)
else
F := 0
for all children n of N do
F := F + treeForce(x, n)
endfor
return F
endif
end

om
The Particle-Mesh (PM) method
Whereas we have discretized time, position is still continuous. The PM method goes
one step further: it effectively discretizes position too12. However, before we explore
this idea further, we need the concept of potential.

.c
Let us assume that there exists a quantity , which is related to the physical quantity U
we are studying according to:
U (8)

ul
and, furthermore, that
U c (9)
where is the density function (e.g. mass, charge) obtained from the particle
pa
distribution and c is a constant . This leads to the Poisson equation13
2
c (10)
In the sample N-body problem U corresponds to the force, and to the potential
energy in the gravitational field generated by the particles. is the mass density
jin

(mass/area unit). In the continued discussion of the PM method the quantities used will
be these.
The idea of the PM method is that we set up a mesh (grid) over the computational box,
and then solve the potential (i.e. Poisson’s equation) at the meshpoints. Forces at the
.re

meshpoints can then be obtained by calculating the gradient of the potential. To find
the force on a particle not located at a meshpoint we can either use the force at the
nearest meshpoint, or interpolate the force from the closest meshpoints.14
Mesh point
w

1 2 Mesh cell
Particle mi
w

3 4
w

Figure 5 Mesh density function assignment

Creating the density function

There are several ways of assigning the particle masses to the density function15 :
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

Nearest gridpoint (NGP): The mass of each particle mi is assigned to the gridpoint
closest to the mass. In Figure 5, this would mean assigning the entire mass of the
particle to cell 2. NGP is also referred to as zero-order interpolation
Cloud-in-Cell (CIC): The mass of each particle is weighted over the four (in 2D)
closest cells; the weighting is proportional to the intersection of the “cloud”
surrounding the particle and the cell. In Figure 5 almost all mass would be assigned to
cell 2, then approximately the same amount to cells 1 and 4, and finally about 1/16 to

om
cell 3. CIC implements first order (linear) interpolation.
Higher order interpolations: The “cloud“ (weighting function) around the particle
can be made to cover even more cells, resulting in higher order interpolations, e.g.
TSC (triangular shaped cloud).

Calculating the potentials

.c
Now that we have the density function , we can solve (10) in order to obtain the
potential at the meshpoints. This is done by rewriting (10) as a system of discrete
difference equations, and solving the system. The system can be solved in O(G logG)

ul
time, where G is the number of gridpoints, by using the Fast Fourier Transform.16

Finding the forces

pa
Thus the potential is solved. Calculating the force is easy, we only need to calculate the
gradient of the potential by taking the difference between two potential values. Some
special considerations regarding the degree of force interpolation must however be
taken in order to conserve such physical properties as momentum and energy.17
jin

PM pros and cons

The PM method has a complexity of O(n) with respect to the particles, the slowest
steps being the FFT requiring O(G log G) operations.18 The cost of the speed is the
limited spatial resolution of PM - phenomena at a smaller scale than the mesh spacing
.re

are not modelled accurately, i.e. it is only suited for modelling of systems where
collisions occur. On the other hand, large-scale phenomena can be shown quite
accurately.19

The Particle-Particle/Particle-Mesh algorithm (P3M)

The Particle-Particle/Particle-Mesh algorithm is a hybrid algorithm that strives to

correct the shortcomings of PM when it comes to modelling of short-range
phenomena. The following discussion is based on [Hockney] 20 and applied to our
w

sample N-body problem.

The general idea is to split up the gravitational force in short-range and long-range
forces, just as we did in section 0. The PM method is used to determine the long range
w

force, and PP to correct the PM force at short distances.

Let re be the distance (typically 2-3 times the size of the mesh cell), below which we
want to use the PP method for force calculation. Denote by x0 the position where the
force is to be evaluated. In order to be able to efficiently find the particles xi whose
distance from x0 is less than re we employ as so-called chaining mesh for the particles,
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

see Figure 6. As can be see from the figure, those xi closer to x0 than re must be in the
neighbouring 8 chaining mesh cells of x0’s chaining mesh cell.

Chaining mesh
mesh cell

om
x0

.c
ul
Figure 6 Mesh cells and chaining mesh

Mathematically, we define the correction, or short-range, force fsr as

pa
f tot R f sr f sr f tot R (11)
where ftot is the total force and R is the force obtained using the PM method.
Suppose we have a particle xi closer to than re. We’d like to calculate the total force
on x0, using direct PP to correct the force R obtained by PM calculations. With some
jin

lengthy mathematical manipulations it can be shown that PM treats the particle xi as a

“mass cloud” (the density and shape of which depend on the details of the PM
method). Thus, we can correct the PM force by subtracting the force induced by such
a “mass cloud” (this force is included in the PM force) and then adding the direct PP
.re

interaction force. Mathematically, in the case of pointlike masses

x0 xi
f sr Gmmi 3 e(x 0 ,x i ) (12)
x 0 xi
where e(x0,xi) is the function for the gravitational force of the “mass cloud” the PM
method treats xi as. We can now formulate the P3M algorithm:
w

1.for each particle x do

2. find force on x=:F using the PM method
3. locate particles closer than re for PP calculation using the
w

chaining mesh. Calculate short range force using these

particles
4. force on x:= F + short range forces
5. endfor
w

The P3M method has been used widely in cosmological simulations, and is easy to use
when forces can be easily split into short and long-range (i.e. gravity). A problem is
that the algorithm easily becomes dominated by the PP phase.21

Fast multipole method (FMM)

Multipole expansion can be viewed as a tree code, with the following main differences
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

Potential instead of force is used

Multipole expansions of the potential of a box are more accurate than a simple
center of mass substitution, but also computationally “heavier”. This is compensated
by having more than 1 particle per tree leaf.22
FMM only require evaluation of F(x0,xi) at leaf nodes, and possesses a “richer
analytical structure”, meaning it can be adapted to a wider variety of problems23

om
Potentials and multipole expansion of the gravitational field in
2D
The mathematics associated with FMM is somewhat lengthy, but not excessively
advanced. This section will deal with multipole expansions of the gravitational field of
our example N-body problem, similar methods are used when expanding other
quantities and/or in other dimensionalities24. The following presentation of the

.c
mathematical ideas is based on [Demmel2].
Recall that the potential of a particle satisfies Poissions equation (equation 10). A
solution to (10) for a point mass located at x0 in 2D is

ul
(x) log( x x o ) (13)
Using the complex number z=a+bi to represent the point x=(a,b) the potential can be
pa
rewritten as the real part of the complex logarithm, which is analytic. Remembering
that potentials are additive, the total potential from n particles can be expressed as
n n
(z) mi log(z zi ) mi (log(z) log(1 zi / z)) (14)
i 1 i 1
jin

Since (z) is analytic, it is possible to do a Taylor expansion around origo of log(1-

zi/z), yielding
j
(z) M log(z) j z (12)
j 1
.re

where M is the combined mass of all particles and

n
mi zi j
j (13)
i 1 j
We approximate the potential (z) by summing a finite number, p, of terms in (12).
w

The error is then proportional to

p 1
max(| zi | )
(14)
|z|
w

Now, suppose that all zi lie inside a DD square centered at origin and z is evaluated
outside a 3D 3D square centered at origin, then 2.12-p (see Figure 7). We say that z
w

and zi are well-separated when this condition holds. Also note that the potential is
expressed as a power series, and the gradient can thus easily be computed analytically,
avoiding further discretization errors (compare to PM!).
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

3D
zi

om
D

Figure 7 The expansion of the particles zi converge outside the outer box

Expansion around a point zc instead of origo is similar. Denote by outer(M, 1,...,

.c
p,zc) this expansion.

Similarly, we can do an inner expansion which approximates the potential from

particles outside a 3D 3D box and is valid in a D D box in the center of the 3D 3D box.

ul
This is denoted by inner(M, 1,..., p,zc)
Furthermore, we’ll need functions to translate the center of expansion for the inner and
outer expansions. These functions are defined as
pa
outer(M, 1,..., p ,zc’)=outer_shift(outer(M, 1,..., p ,zc),zc’)
and
inner(M, 1
,..., p
,zc’)=inner_shift(inner(M, 1
,..., p
,zc),zc’)
jin

These functions can be implemented in O(p2)

Finally a conversion function between outer and inner expansions is needed (i.e. when
an outer expansion of a box A is valid in the region of another box B, the outer
expansion of A can be converted to an inner expansion around B’s center). This can
.re

also be performed in O(p2) and is denoted with

inner(M, 1
,..., p
,zc’)=convert(outer(M, 1,..., ,zc),zc’)
p

To further compress the notation, I’ll use only the cell as identifier for an expansion,
e.g. the inner expansion of cell A centered at zc is written as:
w

inner(A)

The FMM algorithm: Treecodes with a

twist
Now that we are done with the math, let’s see how all these expansions help us to
build a fast and accurate force calculation algorithm. The discussion is based on
w

[Demmel2]
We begin by constructing a quadtree, as in section 0, with the exception that
subdivison is not continued until there is 1 particle per cell, but less particles than some
limit s. Then the quadtree is traversed from the leaves toward the root, computing
outer expansions for each quadtree node. Next, the quadtree is traversed from the root
to the leaves, computing inner expansions for the nodes (using the outer expansions
provided in the previous step). Now, the force on the particles can be calculated by
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

adding the inner expansion for the node (accounts for the potential for all particles well
separated from the node) and the direct sum of potentials from particles in nearby (non
well-separated) nodes. A more detailed description follows:
Step 1: Building the quadtree
For simplicity, we well assume that the tree is fully populated, that is each leave is at
the same distance from the root (this can be achieved by augmenting the tree with
empty leaves). A version of FMM using adaptive trees can be found in [Carrier].

om
Step 2: Computing outer() for each node
Recall that we were able to move the center of expansion with the outer_shift
function. Now we can move the center of expansion for the child cells B1..B4 to the
center of parent cell A (za), and then simply add the coefficients j of the shifted
expansions to obtain an outer expansion for cell A. The expansions of B1..B4 converge

.c
at distances larger than D from za, and thus the criterion of convergence for the outer
expansion of A is satisfied.
We start from form the leaves, by calculating their outer expansions directly, and then

ul
proceed towards the root, “merging” outer expansions as described above. Note the
similarity to the center of mass calculation step in the Barnes&Hut algorithm.
B1 B2
pa
outer_shift outer_shift
A
outer_shift outer_shift
jin

B3 B4

Figure 8 Step 2 of the FMM

.re

Step 3: Computing inner() for each node

The idea is that we want to be able to calculate an expansion, which is valid inside a
cell A, by using the outer expansions of other cells. Since the outer expansion of a cell
Bi only converges (with the desired accuracy) at a distance 1.5D from the center of
w

the cell, the immediate neighbour cells of A cannot be used to calculate the inner
expansion. More formally, we define the interaction set I of A, in which our particle
resides:
w
w
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

I(A):={nodes Bi such that Bi is a child of a neighbour of parent(A),

but Bi is not itself a neighbour of A}

om
.c
ul
pa
Figure 9 Interaction set for a cell A. From [Demmel2]
jin

The interaction set of A is pictured in Figure 9.

In this stage, we start traversing the tree from the root. At each level, we have an inner
expansion which is generated from the interaction set of the previous level (the root
does not have any interaction set, and thus it’s expansion from “previous levels” is
.re

empty). Now, by converting the outer expansions of the interaction set at the current
level to inner expansions for A, summing them up, and finally adding the shifted inner
expansion of A’s parent, we have obtained an inner expansion for A that includes the
potential of all cells except the neighbours of A. If A is itself a parent node, we then
recurse once more. More formally:
w

Build_inner(A)
P:=parent(A)
inner(A):=EMPTY
for all Bi I(A) do
w

inner(A):=inner(A)+convert(outer(Bi),A)
endfor
inner(A):=inner(A)+inner_shift(inner(P),A)
for all C:=children of A do
build_inner(C)
w

endfor
end
Note the difference from Barnes-Hut: This step does not calculate the force on a single
particle, rather the potential of an entire leaf cell.
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

om
.c
ul
pa
A "binary search tree" (BST) or "ordered binary tree" is a type of binary tree where the nodes are arranged in order:
for each node, all elements in its left subtree are less-or-equal to the node (<=), and all the elements in its right
subtree are greater than the node (>). The tree shown above is a binary search tree -- the "root" node is a 5, and its
left subtree nodes (1, 3, 4) are <= 5, and its right subtree nodes (6, 9) are > 5. Recursively, each of the subtrees must
also obey the binary search tree constraint: in the (1, 3, 4) subtree, the 3 is the root, the 1 <= 3 and 4 > 3. Watch out
jin
for the exact wording in the problems -- a "binary search tree" is different from a "binary tree".

The nodes at the bottom edge of the tree have empty subtrees and are called "leaf" nodes (1, 4, 6) while the others
are "internal" nodes (3, 5, 9).
.re

Binary Search Tree Niche

Basically, binary search trees are fast at insert and lookup. The next section presents the code for these two
algorithms. On average, a binary search tree algorithm can locate a node in an N node tree in order lg(N) time (log
base 2). Therefore, binary search trees are good for "dictionary" problems where the code inserts and looks up
information indexed by some key. The lg(N) behavior is the average case -- it's possible for a particular tree to be
much slower depending on its shape.
w

Strategy
w

Some of the problems in this article use plain binary trees, and some use binary search trees. In any case, the
problems concentrate on the combination of pointers and recursion. (See the articles linked above for pointer articles
that do not emphasize recursion.)
w

For each problem, there are two things to understand...

The node/pointer structure that makes up the tree and the code that manipulates it
The algorithm, typically recursive, that iterates over the tree

When thinking about a binary tree problem, it's often a good idea to draw a few little trees to think about the
various cases.
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

Typical Binary Tree Code in C/C++

As an introduction, we'll look at the code for the two most basic binary search tree operations -- lookup() and
insert(). The code here works for C or C++.

In C or C++, the binary tree is built with a node type like this...

om
struct node {
int data;
struct node* left;
struct node* right;
}

.c
Lookup()

Given a binary search tree and a "target" value, search the tree to see if it contains the target. The basic pattern of
the lookup() code occurs in many recursive tree algorithms: deal with the base case where the tree is empty, deal

ul
with the current node, and then use recursion to deal with the subtrees. If the tree is a binary search tree, there is
often some sort of less-than test on the node to decide if the recursion should go left or right.

/* pa
Given a binary tree, return true if a node
with the target data is found in the tree. Recurs
down the tree, chooses the left or right
branch by comparing the target to each node.
*/
static int lookup(struct node* node, int target) {
jin
// 1. Base case == empty tree
// in that case, the target is not found so return false
if (node == NULL) {
return(false);
}
.re

else {
// 2. see if found here
if (target == node->data) return(true);
else {
// 3. otherwise recur down the correct subtree
if (target < node->data) return(lookup(node->left, target));
else return(lookup(node->right, target));
w

}
}
}
w

The lookup() algorithm could be written as a while-loop that iterates down the tree. Our version uses recursion to
help prepare you for the problems below that require recursion.
w

Pointer Changing Code

There is a common problem with pointer intensive code: what if a function needs to change one of the pointer
parameters passed to it? For example, the insert() function below may want to change the root pointer. In C and
C++, one solution uses pointers-to-pointers (aka "reference parameters"). That's a fine technique, but here we will
use the simpler technique that a function that wishes to change a pointer passed to it will return the new value of
the pointer to the caller. The caller is responsible for using the new value. Suppose we have a change() function
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

that may change the the root, then a call to change() will look like this...

// suppose the variable "root" points to the tree

root = change(root);

We take the value returned by change(), and use it as the new value for root. This construct is a little awkward, but
it avoids using reference parameters which confuse some C and C++ programmers, and Java does not have reference

om
parameters at all. This allows us to focus on the recursion instead of the pointer mechanics.

Insert()

Insert() -- given a binary search tree and a number, insert a new node with the given number into the tree in the
correct place. The insert() code is similar to lookup(), but with the complication that it modifies the tree structure.
As described above, insert() returns the new tree pointer to use to its caller. Calling insert() with the number 5 on

.c
this tree...

2
/ \

ul
1 10

returns the tree...

1
2
/ \
10
pa
/
5
jin
The solution shown here introduces a newNode() helper function that builds a single node. The base-case/recursion
structure is similar to the structure in lookup() -- each call checks for the NULL case, looks at the node at hand, and
then recurs down the left or right subtree if needed.

/*
Helper function that allocates a new node
.re

with the given data and NULL left and right

pointers.
*/
struct node* NewNode(int data) {
struct node* node = new(struct node); // "new" is like "malloc"
node->data = data;
w

node->left = NULL;
node->right = NULL;

return(node);
w

/*
w

Give a binary search tree and a number, inserts a new node

with the given number in the correct place in the tree.
Returns the new root pointer which the caller should
then use (the standard trick to avoid using reference
parameters).
*/
struct node* insert(struct node* node, int data) {
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

// 1. If the tree is empty, return a new, single node

if (node == NULL) {
return(newNode(data));
}
else {
// 2. Otherwise, recur down the tree
if (data <= node->data) node->left = insert(node->left, data);

om
else node->right = insert(node->right, data);

return(node); // return the (unchanged) node pointer

}
}

The shape of a binary tree depends very much on the order that the nodes are inserted. In particular, if the nodes

.c
are inserted in increasing order (1, 2, 3, 4), the tree nodes just grow to the right leading to a linked list shape where
all the left pointers are NULL. A similar thing happens if the nodes are inserted in decreasing order (4, 3, 2, 1). The
linked list shape defeats the lg(N) performance. We will not address that issue here, instead focusing on pointers
and recursion.

ul
Binary Tree Problems
Here are 14 binary tree problems in increasing order of difficulty. Some of the problems operate on binary search
pa
trees (aka "ordered binary trees") while others work on plain binary trees with no special ordering. The next
section, shows the solution code in C/C++. gives the background and solution code in Java. The basic structure
and recursion of the solution code is the same in both languages -- the differences are superficial.

Reading about a data structure is a fine introduction, but at some point the only way to learn is to actually try to
solve some problems starting with a blank sheet of paper. To get the most out of these problems, you should at least
jin
attempt to solve them before looking at the solution. Even if your solution is not quite right, you will be building up
the right skills. With any pointer-based code, it's a good idea to make memory drawings of a a few simple cases to
see how the algorithm should work.

build123()
.re

This is a very basic problem with a little pointer manipulation. (You can skip this problem if you are already
comfortable with pointers.) Write code that builds the following little 1-2-3 binary search tree...

2
/ \
w

1 3

Write the code in three different ways...

a: by calling newNode() three times, and using three pointer variables

b: by calling newNode() three times, and using only one pointer variable
c: by calling insert() three times passing it the root pointer to build up the tree
w

(In Java, write a build123() method that operates on the receiver to change it to be the 1-2-3 tree with the given
coding constraints.

struct node* build123() {

size()
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

This problem demonstrates simple binary tree traversal. Given a binary tree, count the number of nodes in the tree.

int size(struct node* node) {

maxDepth()

om
Given a binary tree, compute its "maxDepth" -- the number of nodes along the longest path from the root node down
to the farthest leaf node. The maxDepth of the empty tree is 0, the maxDepth of the tree on the first page is 3.

int maxDepth(struct node* node) {

minValue()

.c
Given a non-empty binary search tree (an ordered binary tree), return the minimum data value found in that tree.
Note that it is not necessary to search the entire tree. A maxValue() function is structurally very similar to this
function. This can be solved with recursion or with a simple while loop.

ul
int minValue(struct node* node) {

printTree() pa
Given a binary search tree (aka an "ordered binary tree"), iterate over the nodes to print them out in increasing
order. So the tree...

4
/ \
jin
2 5
/ \
1 3

Produces the output "1 2 3 4 5". This is known as an "inorder" traversal of the tree.
.re

Hint: For each node, the strategy is: recur left, print the node data, recur right.

void printTree(struct node* node) {

printPostorder()
w

Given a binary tree, print out the nodes of the tree according to a bottom-up "postorder" traversal -- both subtrees of
a node are printed out completely before the node itself is printed, and each left subtree is printed before the right
subtree. So the tree...
w

4
/ \
w

2 5
/ \
1 3

Produces the output "1 3 2 5 4". The description is complex, but the code is simple. This is the sort of bottom-up
traversal that would be used, for example, to evaluate an expression tree where a node is an operation like '+' and
its subtrees are, recursively, the two subexpressions for the '+'.
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

void printPostorder(struct node* node) {

hasPathSum()
We'll define a "root-to-leaf path" to be a sequence of nodes in a tree starting with the root node and proceeding
downward to a leaf (a node with no children). We'll say that an empty tree contains no root-to-leaf paths. So for

om
example, the following tree has exactly four root-to-leaf paths:

5
/ \
4 8
/ / \
11 13 4
/ \ \

.c
7 2 1

Root-to-leaf paths:
path 1: 5 4 11 7

ul
path 2: 5 4 11 2
path 3: 5 8 13
path 4: 5 8 4 1
pa
For this problem, we will be concerned with the sum of the values of such a path -- for example, the sum of the
values on the 5-4-11-7 path is 5 + 4 + 11 + 7 = 27.

Given a binary tree and a sum, return true if the tree has a root-to-leaf path such that adding up all the values
along the path equals the given sum. Return false if no such path can be found. (Thanks to Owen Astrachan for
suggesting this problem.)
jin
int hasPathSum(struct node* node, int sum) {

printPaths()
.re

Given a binary tree, print out all of its root-to-leaf paths as defined above. This problem is a little harder than it
looks, since the "path so far" needs to be communicated between the recursive calls. Hint: In C, C++, and Java,
probably the best solution is to create a recursive helper function printPathsRecur(node, int path[], int pathLen),
where the path array communicates the sequence of nodes that led up to the current call. Alternately, the problem
may be solved bottom-up, with each node returning its list of paths. This strategy works quite nicely in Lisp, since
it can exploit the built in list and mapping primitives. (Thanks to Matthias Felleisen for suggesting this problem.)
w

Given a binary tree, print out all of its root-to-leaf paths, one per line.

void printPaths(struct node* node) {

mirror()
w

Change a tree so that the roles of the left and right pointers are swapped at every node.

So the tree...
4
/ \
2 5
/ \
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

1 3

is changed to...
4
/ \
5 2
/ \

om
3 1

The solution is short, but very recursive. As it happens, this can be accomplished without changing the root node
pointer, so the return-the-new-root construct is not necessary. Alternately, if you do not want to change the tree
nodes, you may construct and return a new mirror tree based on the original tree.

void mirror(struct node* node) {

.c
doubleTree()
For each node in a binary search tree, create a new duplicate node, and insert the duplicate as the left child of the

ul
original node. The resulting tree should still be a binary search tree.

So the tree...
2 pa
/ \
1 3

is changed to...
2
/ \
jin
2 3
/ /
1 3
/
1
.re

As with the previous problem, this can be accomplished without changing the root node pointer.

void doubleTree(struct node* node) {

sameTree()
w

Given two binary trees, return true if they are structurally identical -- they are made of nodes with the same values
arranged in the same way. (Thanks to Julie Zelenski for suggesting this problem.)
w

int sameTree(struct node* a, struct node* b) {

countTrees()
w

This is not a binary tree programming problem in the ordinary sense -- it's more of a math/combinatorics recursion
problem that happens to use binary trees. (Thanks to Jerry Cain for suggesting this problem.)

Suppose you are building an N node binary search tree with the values 1..N. How many structurally different
binary search trees are there that store those values? Write a recursive function that, given the number of distinct
values, computes the number of structurally unique binary search trees that store those values. For example,
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

countTrees(4) should return 14, since there are 14 structurally unique binary search trees that store 1, 2, 3, and 4. The
base case is easy, and the recursion is short but dense. Your code should not construct any actual trees; it's just a
counting problem.

int countTrees(int numKeys) {

om
Binary Search Tree Checking (for problems 13 and 14)

This background is used by the next two problems: Given a plain binary tree, examine the tree to determine if it
meets the requirement to be a binary search tree. To be a binary search tree, for every node, all of the nodes in its
left tree must be <= the node, and all of the nodes in its right subtree must be > the node. Consider the following four
examples...

.c
a. 5 -> TRUE
/ \
2 7

ul
b. 5 -> FALSE, because the 6 is not ok to the left of the 5
/ \
6 7 pa
c. 5 -> TRUE
/ \
2 7
/
jin
1

d. 5 -> FALSE, the 6 is ok with the 2, but the 6 is not ok with the 5
/ \
2 7
/ \
.re

1 6

For the first two cases, the right answer can be seen just by comparing each node to the two nodes immediately
below it. However, the fourth case shows how checking the BST quality may depend on nodes which are several
layers apart -- the 5 and the 6 in that case.
w

isBST() -- version 1

Suppose you have helper functions minValue() and maxValue() that return the min or max int value from a
w

non-empty tree (see problem 3 above). Write an isBST() function that returns true if a tree is a binary search tree
and false otherwise. Use the helper functions, and don't forget to check every node in the tree. It's ok if your
solution is not very efficient. (Thanks to Owen Astrachan for the idea of having this problem, and comparing it to
problem 14)
w

Returns true if a binary tree is a binary search tree.

int isBST(struct node* node) {

isBST() -- version 2
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

Version 1 above runs slowly since it traverses over some parts of the tree many times. A better solution looks at each
node only once. The trick is to write a utility helper function isBSTRecur(struct node* node, int min, int max) that
traverses down the tree keeping track of the narrowing min and max allowed values as it goes, looking at each node
only once. The initial values for min and max should be INT_MIN and INT_MAX -- they narrow from there.

/*
Returns true if the given tree is a binary search tree

om
(efficient version).
*/
int isBST2(struct node* node) {
return(isBSTRecur(node, INT_MIN, INT_MAX));
}

/*
Returns true if the given tree is a BST and its

.c
values are >= min and <= max.
*/
int isBSTRecur(struct node* node, int min, int max) {

ul
Tree-List

The Tree-List problem is one of the greatest recursive pointer problems ever devised, and it happens to use binary
pa
trees as well. CLibarary works through the Tree-List problem in detail and includes solution code in C and Java.
The problem requires an understanding of binary trees, linked lists, recursion, and pointers. It's a great problem,
but it's complex.
jin
C/C++ Solutions
Make an attempt to solve each problem before looking at the solution -- it's the best way to learn.

Build123() Solution (C/C++)

.re

// call newNode() three times

struct node* build123a() {
struct node* root = newNode(2);
struct node* lChild = newNode(1);
struct node* rChild = newNode(3);
w

root->left = lChild;
root->right= rChild;
w

return(root);
}

// call newNode() three times, and use only one local variable
w

struct node* build123b() {

struct node* root = newNode(2);
root->left = newNode(1);
root->right = newNode(3);

return(root);
}
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

/*
Build 123 by calling insert() three times.
Note that the '2' must be inserted first.
*/
struct node* build123c() {
struct node* root = NULL;

om
root = insert(root, 2);
root = insert(root, 1);
root = insert(root, 3);
return(root);
}

size() Solution (C/C++)

.c
/*
Compute the number of nodes in a tree.
*/

ul
int size(struct node* node) {
if (node==NULL) {
return(0);
} else { pa
return(size(node->left) + 1 + size(node->right));
}
}

maxDepth() Solution (C/C++)

jin
/*
Compute the "maxDepth" of a tree -- the number of nodes along
the longest path from the root node down to the farthest leaf node.
*/
.re

int maxDepth(struct node* node) {

if (node==NULL) {
return(0);
}
else {
// compute the depth of each subtree
w

int lDepth = maxDepth(node->left);

int rDepth = maxDepth(node->right);

// use the larger one

if (lDepth > rDepth) return(lDepth+1);

else return(rDepth+1);
}
}
w

minValue() Solution (C/C++)

/*
Given a non-empty binary search tree,
return the minimum data value found in that tree.
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

Note that the entire tree does not need to be searched.

*/
int minValue(struct node* node) {
struct node* current = node;

// loop down to find the leftmost leaf

while (current->left != NULL) {

om
current = current->left;
}

return(current->data);
}

printTree() Solution (C/C++)

.c
/*
Given a binary search tree, print out
its data elements in increasing

ul
sorted order.
*/
void printTree(struct node* node) {
if (node == NULL) return; pa
printTree(node->left);
printf("%d ", node->data);
printTree(node->right);
}
jin
printPostorder() Solution (C/C++)

/*
Given a binary tree, print its
.re

nodes according to the "bottom-up"

postorder traversal.
*/
void printPostorder(struct node* node) {
if (node == NULL) return;

// first recur on both subtrees

printTree(node->left);
printTree(node->right);

// then deal with the node

printf("%d ", node->data);

}
w

hasPathSum() Solution (C/C++)

/*
Given a tree and a sum, return true if there is a path from the root
down to a leaf, such that adding up all the values along the path
equals the given sum.
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

Strategy: subtract the node value from the sum when recurring down,
and check to see if the sum is 0 when you run out of tree.
*/
int hasPathSum(struct node* node, int sum) {
// return true if we run out of tree and sum==0
if (node == NULL) {
return(sum == 0);

om
}
else {
// otherwise check both subtrees int
subSum = sum - node->data;
return(hasPathSum(node->left, subSum) ||
hasPathSum(node->right, subSum));
}
}

.c
printPaths() Solution (C/C++)

ul
/*
Given a binary tree, print out all of its root-to-leaf
paths, one per line. Uses a recursive helper to do the work.
*/ pa
void printPaths(struct node* node) {
int path[1000];

printPathsRecur(node, path, 0);

}
jin
/*
Recursive helper function -- given a node, and an array containing
the path from the root node up to but not including this node,
print out all the root-leaf paths.
*/
void printPathsRecur(struct node* node, int path[], int pathLen) {
.re

if (node==NULL) return;

// append this node to the path array

path[pathLen] = node->data;
pathLen++;
w

// it's a leaf, so print the path that led to here

if (node->left==NULL && node->right==NULL) {
printArray(path, pathLen);
}
w

else {
// otherwise try both subtrees
printPathsRecur(node->left, path, pathLen);
printPathsRecur(node->right, path, pathLen);
w

}
}

// Utility that prints out an array on a line.

void printArray(int ints[], int len) {
int i;
for (i=0; i<len; i++) {
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

printf("%d ", ints[i]);

}
printf("\n");
}

mirror() Solution (C/C++)

om
/*
Change a tree so that the roles of the
left and right pointers are swapped at every node.

So the tree...
4
/ \

.c
2 5
/ \
1 3

ul
is changed to...
4
/ \
5 2 pa
/ \
3 1
*/
void mirror(struct node* node) {
if (node==NULL) {
return;
jin
}
else {
struct node* temp;

// do the subtrees
mirror(node->left);
.re

mirror(node->right);

// swap the pointers in this node

temp = node->left;
node->left = node->right;
node->right = temp;
w

}
}
w

doubleTree() Solution (C/C++)

/*
For each node in a binary search tree,
w

create a new duplicate node, and insert

the duplicate as the left child of the original node.
The resulting tree should still be a binary search tree.

So the tree...
2
/ \
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

1 3

Is changed to...
2
/ \
2 3
/ /

om
1 3
/
1

*/
void doubleTree(struct node* node) {
struct node* oldLeft;

.c
if (node==NULL) return;

// do the subtrees
doubleTree(node->left);

ul
doubleTree(node->right);

// duplicate this node to its left

oldLeft = node->left; pa
node->left = newNode(node->data);
node->left->left = oldLeft;
}

sameTree() Solution (C/C++)

jin
/*
Given two trees, return true if they are
structurally identical.
*/
int sameTree(struct node* a, struct node* b) {
.re

// 1. both empty -> true

if (a==NULL && b==NULL) return(true);

// 2. both non-empty -> compare them

else if (a!=NULL && b!=NULL) {
return(
w

a->data == b->data &&

sameTree(a->left, b->left) &&
sameTree(a->right, b->right)
);
w

}
// 3. one empty, one not -> false
else return(false);
}
w

countTrees() Solution (C/C++)

/*
For the key values 1...numKeys, how many structurally unique
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

binary search trees are possible that store those keys.

Strategy: consider that each value could be the root.

Recursively find the size of the left and right subtrees.
*/
int countTrees(int numKeys) {

om
if (numKeys <=1) {
return(1);
}
else {
// there will be one value at the root, with whatever remains
// on the left and right each forming their own subtrees.
// Iterate through all the values that could be the root...
int sum = 0;

.c
int left, right, root;

for (root=1; root<=numKeys; root++) {

left = countTrees(root - 1);

ul
right = countTrees(numKeys - root);

// number of possible trees with this root == left*right

sum += left*right; pa
}

return(sum);
}
}
jin
isBST1() Solution (C/C++)

/*
Returns true if a binary tree is a binary search tree.
.re

*/
int isBST(struct node* node) {
if (node==NULL) return(true);

// false if the min of the left is > than us

if (node->left!=NULL && minValue(node->left) > node->data)
w

return(false);

// false if the max of the right is <= than us

if (node->right!=NULL && maxValue(node->right) <= node->data)
w

return(false);

// false if, recursively, the left or right is not a BST

if (!isBST(node->left) || !isBST(node->right))
w

return(false);

// passing all that, it's a BST

return(true);
}
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

Message Passing Interface

The Message Passing Interface, MPI is a controlled API standard for programming a wide
array of parallel architectures. Though MPI was originally intended for classic distributed
memory architectures, it is used on various architectures from networks of PCs via large
shared memory systems, such as the SGI Origin 2000, to massive parallel architectures,
such as Cray T3D and Intel paragon. The complete MPI API offers 186 operations, which
makes this is a rather complex programming API. However, most MPI applications use
only six to ten of the available operations.

om
MPI is intended for the Single Program Multiple Data (SPMD) programming paradigm
– all nodes run the same application-code. The SPMD paradigm is efficient and easy to use
for a large set of scientific applications with a regular execution pattern. Other, less regular,
applications are far less suited to this paradigm and implementation in MPI is tedious.
MPI's point-to-point communication comes in four shapes: standard, ready, synchronous
and buffered. A standard-send operation does not return until the send buffer has been

.c
copied, either to another buffer below the MPI layer or to the network interface, (NIC). The
ready-send operations are not initiated until the addressed process has initiated a
corresponding receive-operation. The synchronous call sends the message, but does not

ul
return until the receiver has initiated a read of the message. The fourth model, the buffered
send, copies the message to a buffer in the MPI-layer and then allows the application to
continue. Each of the four models also comes in asynchronous (in MPI called non-
blocking) modes. The non-blocking calls return immediately, and it is the programmer's
pa
responsibility to check that the send has completed before overwriting the buffer. Likewise
a non-blocking receive exist, which returns immediately and the programmer needs to
ensure that the receive operation has finished before using the data.
MPI supports both group broadcasting and global reductions. Being SPMD, all nodes
have to meet at a group operation, i.e. a broadcast operation blocks until all the processes in
jin

the context have issued the broadcast operation. This is important because it turns all group-
operations into synchronization points in the application. The MPI API also supports
scatter-gather for easy exchange of large data-structures and virtual architecture topologies,
which allow source-code compatible MPI applications to execute efficiently across
.re

different platforms.

Experiment Environment

Cluster
w

The cluster comprises 51 Dell Precision Workstation 360s, each with a 3.2GHz Intel
Prescott processor, 2GB RAM and a 120GB Serial ATA hard-disk2. The nodes are
w

coPn4nected using Gigabit Ethernet over two HP Procurve 2848 switches. 32 nodes are
connected to the first switch, and 19 nodes to the second switch. The two switches are
trunked3 with 4 copper cables, providing 4Gbit/s bandwidth between the switches, see
w

Figure 1. The nodes are running RedHat Linux 9 with a patched Linux 2.4.26 kernel to
support Serial ATA. Hyperthreading is switched on, and Linux is configured for Symmetric
Multiprocessor support.

2
The computers have a motherboard with Intel’s 875P chipset. The chipset supports Gigabit Ethernet over
Intel’s CSA (Communication Streaming Architecture) bus, but Dell’s implementation of the motherboards
use an Intel 82540EM Gigabit Ethernet controller connected to the PCI bus instead.
3
Trunking is a method where traffic between two switches is loadbalanced across a set of links in order to
provide a higher available bandwidth between the switches.
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

Switches trunked with 4

Copper Cables
32 Dell WS 360 19 Dell WS 360

om
Figure 1: the experiment cluster

MPICH

MPICH is the official reference implementation of MPI and has a high focus on being

.c
portable. MPICH is available for all UNIX flavors and for Windows, a special GRID
enabled version, MPICH-G2, is available for Globus[11]. Many of the MPI
implementations for specialized hardware, i.e. cluster interconnects, are based on MPICH.

ul
MPICH version 1.2.52 is used for the below experiments.

LAM-MPI
pa
Local Area Multicomputer-MPI, LAM-MPI, started out as an implementation for running
MPI applications on LANs. An integrated part of this model was ‘on-the-fly’ endian-
conversion to support different architectures to collaborate on an MPI execution. While
endian-conversion still is supported, it is no longer performed per default as it is assumed
that most executions will be on homogenous clusters. The experiments in this paper are
jin

performed with LAM-MPI 7.0.5.

MESH-MPI

MESH-MPI is only just released and the presented results are thus brand-new. MESH-MPI
.re

is ‘yet-another-commercial-MPI’, but with a strong focus on performance, rather than

simply improved support over the open-source versions. In addition to improved
performance, MESH-MPI also promotes true non-blocking operations, thread safety, and
scalable collective operations. Future versions have announced support for a special Low
Latency Communication library (LLC) and a Runtime Data Dependency Analysis (RDDA)
w

functionality to schedule communication. These functions are not available in the current
version which is 1.0a.
w

Benchmarks

This section describes the benchmark suites we have chosen for examining the performance
w

of the three MPI implementations. One suite, Pallas, is a micro-benchmark suite, which
gives a lot of information about the performance of the different MPI functions, while the
other, NPB, is an application/kernel suite, which describes the application level
performance. The NPB suite originates from NASA and is used as the basis for deciding on
new systems at NASA. This benchmark tests both the processing power of the system and
the communication performance.
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

Pallas Benchmark Suite

The Pallas benchmark suite[9] from Pallas GmbH is a suite, which measures the
performance of different MPI functions. The performance is measured for individual
operations rather than on the application level. The results can thus be used in two ways;
either to choose an MPI implementation that performs well for the operations one uses, or
to determine which operations performs poorly on the available MPI implementation so
that one can avoid them when coding applications. The tests/operations that are run in
Pallas are:

om
• PingPong
The time it takes to pass a message between two processes and back
• PingPing
The time it takes to send a message from one process to another
•

.c
SendRecv
The time it takes to send and receive a message in parallel
• Exchange
The time it takes to exchange contents of two buffers

ul
• Allreduce
The time it takes to create a common result, i.e. a global sum
• Reduce
pa
The same as Allreduce but the result is delivered to only one process
• Reduce Scatter
The same as Reduce but the result is distributed amongst the processes
• Allgather
The time it takes to collect partial results from all processes and deliver the data
jin

to all processes
• Allgatherv
Same as Allgather, except that the partial results need not have the same size
• Alltoall
The time it takes for all processes to send data to all other processes and receive
.re

from all other processes – the data that is sent is unique to each reciever
• Bcast - the time it takes to deliver a message to all processes

NAS Parallel Benchmark Suite

The NAS Parallel Benchmark, NPB, suite is described as:

The NAS Parallel Benchmarks (NPB) are a set of 8 programs designed to

help evaluate the performance of parallel supercomputers. The benchmarks,
w

which are derived from computational fluid dynamics (CFD) applications,

consist of five kernels and three pseudo-applications. The NPB come in
several flavors. NAS solicits performance results for each from all sources.
w

[10]

NPB is available for threaded, OpenMP and MPI systems and we naturally run the MPI
version. NPB is available with five different data-sets, A through D, and W which is for
workstations only. We use dataset C since D won’t fit on the cluster, and also since C is the
most widely reported dataset.
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

The application kernels in NPB are:

• MG – Multigrid
• CG – Conjugate Gradient
• FT – Fast Fourier Transform
• IS – Integer Sort
• EP – Embarrassingly Parallel
• BT – Block Tridiagonal

om
• SP – Scalar Pentadiagonal
• LU – Lower Upper Gauss-Seidel

Results

In this section we present and analyze the results of running the benchmarks from section 3

.c
on the systems described in section 2. All the Pallas benchmarks are run on 32 CPUs (they
run on 2x sized systems) as are the NPB benchmarks except BT and SP which are run on 36
CPUs (they run on X2 sized systems).

ul
Pallas Benchmark Suite

First in the Pallas benchmark is the point to point experiments, the extreme case is the
pa
concurrent Send and Recv experiments where MPICH uses more than 12 times longer than
MESH-MPI, but otherwise all three are fairly close. MPICH performs worse than the other
two and the commercial MESH-MPI loses only on the ping-ping experiment.
The seemingly large differences on ping-pong and ping-ping are not as significant as
they may seem since they are the result of the interrupt throttling rate on the Intel Ethernet
jin

chipsets which – when set at the recommended 8000, discretises latencies in chunks of
125us, thus the difference between 62.5 us and 125us is not as significant as it may seem
and would probably be much smaller on other Ethernet chipsets.
.re

200

180

160

140
120
w time (us)

MESH
100 LAM
MPICH
80
w

20
w

0
PingPong PingPing SendRecv Exchange

Figure 2: point-to-point latencies from Pallas 0B messages.

www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

Switching to large messages, 4MB, the picture is more uniform and MPICH consistently
looses to the other two. LAM-MPI and MESH-MPI are quite close in all these experiments
and are running within 2% of each other. The only significant exception is in the ping-ping
experiment where LAM-MPI outperforms MESH-MPI with 5%.
200000
746178 us
180000
363107 us
160000

om
140000
120000
time (us)

MESH
100000 LAM
MPICH
80000

60000

40000

.c
20000

0
PingPong PingPing SendRecv Exchange

ul
Figure 3: point-to-point latencies from Pallas 4MB messages.

In the collective operations, the small data is tested on 8B (eight bytes) rather than 0B
pa
because 0B on group-operations are often not performed at all and resulting times are
reported in the 0.05us range, thus to test the performance on small packages we use the size
of a double precision number. The results are shown in Figure 4.
In the collective operations, the extreme case is Allgatherv using LAM-MPI which
reports a whopping 4747us or 11 times longer than when using MESH-MPI. Except for the
jin

Alltoall benchmark where LAM-MPI is fastest, MESH-MPI is consistently the faster,

and for most experiments, the advantage is significant, measured in multiples rather than
percentages. The Bcast operation, which is a frequently used operation in many
applications, shows MESH-MPI to be 7 times faster than MPICH and 12 times faster than
.re

LAM-MPI.
2000

1800
2800 us 4747 us
1600

1400
w

1200
time (us)

MESH
1000 LAM
MPICH
w

800

600

400
w

200

0
Allreduce Reduce Reduce Allgather Allgatherv Alltoall Bcast
Scatter

Figure 4: collective Operations latencies from Pallas 8B messages.

www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

For large messages, the results have been placed in two Figures, 5 and 6, in order to fit
the time-scale better. With the large messages, MESH-MPI is consistently better than both
open-source candidates, ranging from nothing, -1%, to a lot: 11 times. On average MESH-
MPI outperforms LAM-MPI with 4.6 times and MPICH with 4.3 times. MESH-MPI is on
average 3.5 times faster than the best of the two open-source implementations.

1200000

om
1000000

800000
time (us)

MESH
600000 LAM
MPICH

400000

.c
200000

ul
0
Allreduce Reduce Reduce Scatter Bcast

Figure 5: collective Operations latencies from Pallas 4MB messages.

pa
35000000

30000000
jin

25000000

20000000
time (us)

MESH
LAM
15000000 MPICH
.re

10000000

5000000

0
w

Allgather Allgatherv Alltoall

Figure 6: collective Operations latencies from Pallas 4MB messages.

NPB Benchmark Suite

While micro-benchmarks are interesting from an MPI perspective, users are primarily
interested in the performance at application level. Here, according to Amdahl’s law,
improvements are limited by the fraction of time spent on MPI operations. Thus the runtime
of the NPB suite is particularly interesting, since it allows us to predict the value of running
a commercial MPI, and it will even allow us to determine if the differences at the operation
level performance can be seen at the application level.
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

The results are in favour of the commercial MPI; MESH-MPI finished the suite 14.5%
faster than LAM and 37.1% faster than MPICH. Considering that these are real-world
applications doing real work and taking Amdahl’s law into consideration, this is significant.

800
716
700

597
600

om
522
500
time (s)

400

300

200

.c
100

0
MESH LAM MPICH

ul
Figure 7: runtime of the NPB benchmark
pa
If we break down the results in the individual applications the picture is a little less
obvious and LAM-MPI actually outperforms MESH-MPI on two of the experiments; the
FT by 3% and the LU by 6%. Both of these makes extensive use of the Alltoall operation
where MESH-MPI has the biggest problems keeping up with LAM-MPI in the Pallas tests.
jin

25000

20000

15000
.re

MESH
MOPS

LAM
MPICH
10000

5000
w

0
BT CG EP FT IS LU MG SP
Benchmark
w

Figure 8: runtime of the individual NPB benchmarks

w
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

OpenMP and MPI implementations and comparison

MPI OpenMP

om
1 . Available from different vendor and can be 1 .OpenMP are hooked with compiler so
compiled in desired platform with desired compiler. with gnu compiler and with Intel
One can use any of MPI API i.e MPICH, OpenMPI compiler one have specific

.c
or other implementation. User is at liberty with
changing compiler but not with openmp
implementation.

ul
2. MPI support C,C++ and FORTRAN 2.OpenMP support C,C++ and
FORTRAN
pa
3.OpenMPI one of API for MPI is providing 3.Few projects try to replicate openmp
provisional support for Java for Java.
jin

4. MPI target both distributed as well shared memory 4.OpenMP target only shared memory
system system

5.Based on both process and thread based approach

.re

.(Earlier it was mainly process based parallelism but 5.Only thread based parallelism.
now with MPI 2 and 3 thread based parallelism is
there too. Usually a process can contain more than 1
thread and call MPI subroutine as desired
w

6. Overhead for creating process is one time 6. Depending on implementation threads

can be created and joined for particular
w

task which add overhead

7.There are overheads associated with transferring 7.No such overheads, as thread can share
message from one process to another variables

8. Process in MPI has private variable only, no 8. In OpenMP , threads have both private
shared variable as well shared variable
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

9.Data racing is not there if not using 9. Data racing is inherent in OpenMP
any thread in process . model

om
10.Compilation of MPI program require 10. Need to add omp.h and then can
1. Adding header file : #include "mpi.h" directly compile code with -fopenmp in
2. compiler as:(in linux ) Linux environment
mpic++ mpi.cxx -o mpiExe g++ -fopenmp openmp.cxx -o
openmpExe
(User need to set environment variable PATH and

.c
LD_LIBRARY_PATH to MPI as OpenMPI installed
folder or binaries) (For Linux)

11 . Running MPI program . 11.User can launch executable

ul
a ) User need to make sure that bin and library folder OpenMPExe in normal way.
from MPI installation are included in environmental
variable PATH and LD_LIBRARY_PATH.
pa
b) For running executable from command line ,user
need to supply following command and specify
number of processor as in example below it is four .

mpirun -np 4 mpiExe ./openmpExe

jin
.re

Sample MPI program

#include <iostream>
#include <mpi.h>
w

/**************************************************************************
This is a simple hello world program. Each processor print its id
w

************************************************************/
using namespace std;
int main(int argc,char** argv)
{
int myid, numprocs;

MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

/* output my rank */
cout<<"Hello from "<<myid<<endl;
MPI_Finalize();

Command to run executable with name a.out in Linux = mpirun -np 4 a.out

om
Output

Hello from 1
Hello from 0
Hello from 2
Hello from 3

.c
Sample OpenMP Program

ul
#include<iostream>
#include<omp.h>
using namespace std;
pa
/********************************************************************
Sample OpenMP program which at stage 1 has 4 threads and at stage 2 has 2 threads
**********************************************************/
int main()
{
jin

#pragma omp parallel num_threads(4) //*create 4 threads and region inside it will be
executed by all threads . */
{
#pragma omp critical//allow one thread at a time to access below statement
cout<<" Thread Id in OpenMP stage 1= "<<omp_get_thread_num()<< endl;
.re

} //here all thread get merged into one thread id

cout<<"I am alone"<<endl;

#pragma omp parallel num_threads(2)//create two threads

{
cout<<" Thread Id in OpenMP stage 2= "<<omp_get_thread_num()<< endl;;
}
w

}
Command to run executable with name a.out on Linux : /a.out
w

Output

Thread Id in OpenMP stage 1= 2

Thread Id in OpenMP stage 1=0
Thread Id in OpenMP stage 1=3
Thread Id in OpenMP stage 1= 1
I am alone
Thread Id in OpenMP stage 2= 1
Thread Id in OpenMP stage 2=0
www.rejinpaul.com
UNIT V -PARALLEL PROGRAM DEVELOPMENT

Summary

MPI and OpenMP have its own advantages and limitations . OpenMP is relatively easy to
implement and involves few pragma directives to achieve desired tasks. OpenMP can be used
in recursive function as well i.e as traversing in binary tree. However it suffers from problem
of memory limitations for memory intensive calculations.
MPI usually serve those problem well which involve large memory. With MPI 3 , shared

om
memory advantage can be utilized within MPI too. Also one can use OpenMP with MPI i.e
for shared memory in targeted platform OpenMP can be used whereas for distributed one,
MPI can be used.

.c
ul
pa
jin
.re
w
w
w

Bayesian Learning Unit 3 PDF
No ratings yet
Bayesian Learning Unit 3 PDF
18 pages
Manufacturing Trial Balance
100% (1)
Manufacturing Trial Balance
3 pages
CP4253 Map Unit I
No ratings yet
CP4253 Map Unit I
31 pages
CP4292 Multicore ME Lab Record
No ratings yet
CP4292 Multicore ME Lab Record
30 pages
Java Notes-Ii CS
No ratings yet
Java Notes-Ii CS
265 pages
hpc qb with answer
No ratings yet
hpc qb with answer
17 pages
Multimedia Database
No ratings yet
Multimedia Database
14 pages
Agile Module 1 Notes
No ratings yet
Agile Module 1 Notes
17 pages
mini project(1)
No ratings yet
mini project(1)
43 pages
cs8076 Gpu Architecture and Programming
No ratings yet
cs8076 Gpu Architecture and Programming
11 pages
Study On Intel 80386 Microprocessor
No ratings yet
Study On Intel 80386 Microprocessor
3 pages
Superpipelining
No ratings yet
Superpipelining
7 pages
CP4253 Map Unit Iv
No ratings yet
CP4253 Map Unit Iv
22 pages
Deep Learning Unit1
No ratings yet
Deep Learning Unit1
63 pages
Multimedia DB
No ratings yet
Multimedia DB
30 pages
CP4253 Map Unit Ii
No ratings yet
CP4253 Map Unit Ii
23 pages
CP4253 Map Unit Iii
No ratings yet
CP4253 Map Unit Iii
26 pages
Unit Iii - Aca
No ratings yet
Unit Iii - Aca
13 pages
Deep Learning Techniques Notes
No ratings yet
Deep Learning Techniques Notes
42 pages
Cs2203 Object Oriented Programming Iiird Sem Question Bank Unit - I Part - A (2 Marks)
No ratings yet
Cs2203 Object Oriented Programming Iiird Sem Question Bank Unit - I Part - A (2 Marks)
6 pages
PowerPoint Slides To Chapter 07
No ratings yet
PowerPoint Slides To Chapter 07
49 pages
Eceg-4221-Vlsi Lec 01 Overview
No ratings yet
Eceg-4221-Vlsi Lec 01 Overview
42 pages
IF4071 - Deep Learning Laboratory
No ratings yet
IF4071 - Deep Learning Laboratory
1 page
Chapter 8 Code Optimization and Code Generation
No ratings yet
Chapter 8 Code Optimization and Code Generation
58 pages
Types of Pipeline
100% (1)
Types of Pipeline
2 pages
RM4151 Class Notes3
No ratings yet
RM4151 Class Notes3
14 pages
Mobile Application Development
No ratings yet
Mobile Application Development
193 pages
Mini Project HPC
No ratings yet
Mini Project HPC
17 pages
Features of Authoring Tools: Card and Page Based Tools
No ratings yet
Features of Authoring Tools: Card and Page Based Tools
4 pages
M.Tech (CSE) Big Data Analytics Curriculum
No ratings yet
M.Tech (CSE) Big Data Analytics Curriculum
69 pages
DLT Unit-2
No ratings yet
DLT Unit-2
50 pages
William Stallings Computer Organization and Architecture 9 Edition
No ratings yet
William Stallings Computer Organization and Architecture 9 Edition
51 pages
SPM 3-I Couse File Format
No ratings yet
SPM 3-I Couse File Format
18 pages
Hive Lecture Notes
100% (1)
Hive Lecture Notes
17 pages
ML Question Bank
No ratings yet
ML Question Bank
29 pages
Tangent Prop and Manifold Tangent Classifier are b
No ratings yet
Tangent Prop and Manifold Tangent Classifier are b
4 pages
Data Structures Design - AD3251 - Important Questions with Answer - Unit 1 - Abstract Data Types
No ratings yet
Data Structures Design - AD3251 - Important Questions with Answer - Unit 1 - Abstract Data Types
15 pages
Transaction With Replicated Data PDF
No ratings yet
Transaction With Replicated Data PDF
3 pages
IJPREMS Template January 2023
No ratings yet
IJPREMS Template January 2023
2 pages
Parallel Computer Models: CSE7002: Advanced Computer Architecture
No ratings yet
Parallel Computer Models: CSE7002: Advanced Computer Architecture
37 pages
CS3401 Algorithms Syllabus
No ratings yet
CS3401 Algorithms Syllabus
3 pages
Data Mining UNIT-2 Notes
No ratings yet
Data Mining UNIT-2 Notes
91 pages
18CS61 - SS and C - Module 5
No ratings yet
18CS61 - SS and C - Module 5
36 pages
UNIT 4 Information Retrieval Using NLP
No ratings yet
UNIT 4 Information Retrieval Using NLP
13 pages
Data Science PPT Final
No ratings yet
Data Science PPT Final
13 pages
"Introduction To Computer Vision": Submitted by
No ratings yet
"Introduction To Computer Vision": Submitted by
45 pages
CS8602 - CD - Unit 3
No ratings yet
CS8602 - CD - Unit 3
7 pages
Agreement Protocols
No ratings yet
Agreement Protocols
17 pages
006 Practical List of DM-2023
No ratings yet
006 Practical List of DM-2023
1 page
SDN Lab Manual
No ratings yet
SDN Lab Manual
24 pages
Multiprocessor Configuration
100% (1)
Multiprocessor Configuration
7 pages
Computer Vision Module Application For Finding A Target in A Live Camera
No ratings yet
Computer Vision Module Application For Finding A Target in A Live Camera
8 pages
Report
100% (1)
Report
32 pages
Exception Handling in Java PDF
No ratings yet
Exception Handling in Java PDF
1 page
Unit 2 Notes
No ratings yet
Unit 2 Notes
27 pages
CCS369
No ratings yet
CCS369
2 pages
Storage Technologies Question Bank (1)
No ratings yet
Storage Technologies Question Bank (1)
61 pages
Constraint Satisfaction Problems: AIMA: Chapter 6
No ratings yet
Constraint Satisfaction Problems: AIMA: Chapter 6
64 pages
Parallel Database Systems
No ratings yet
Parallel Database Systems
17 pages
SDN Unit 1 Notes
No ratings yet
SDN Unit 1 Notes
9 pages
The Datadog Handbook: A Guide to Monitoring, Metrics, and Tracing
From Everand
The Datadog Handbook: A Guide to Monitoring, Metrics, and Tracing
Robert Johnson
No ratings yet
Unit 01 (1)
No ratings yet
Unit 01 (1)
66 pages
Unit 04
No ratings yet
Unit 04
20 pages
unit wise important question unit wise
No ratings yet
unit wise important question unit wise
178 pages
DE&V TWO MARKS QUESTIONS WITH ANSWERS
No ratings yet
DE&V TWO MARKS QUESTIONS WITH ANSWERS
19 pages
Vol 11 3
No ratings yet
Vol 11 3
5 pages
CP4153 Unit 2 Network Technologies
100% (2)
CP4153 Unit 2 Network Technologies
27 pages
PPL Me Previousqp 2 Previous Year QP
No ratings yet
PPL Me Previousqp 2 Previous Year QP
4 pages
Routledge International Handbooks Eileen Z Taylor and Paul F Williams The Routledge Handbook of Accounting Ethics Rout
No ratings yet
Routledge International Handbooks Eileen Z Taylor and Paul F Williams The Routledge Handbook of Accounting Ethics Rout
441 pages
Chapter 7
No ratings yet
Chapter 7
22 pages
Time of Supply-20
No ratings yet
Time of Supply-20
44 pages
A HighDensity, High 10 KV Silicon Carbide (Speed SiC) MOSFET Power Module
No ratings yet
A HighDensity, High 10 KV Silicon Carbide (Speed SiC) MOSFET Power Module
73 pages
UserManualforCorrectionModuleofWBTN Rural
No ratings yet
UserManualforCorrectionModuleofWBTN Rural
32 pages
Literature Study-TNCDBR Norms
No ratings yet
Literature Study-TNCDBR Norms
5 pages
Excel 2010
No ratings yet
Excel 2010
28 pages
The Effect of Brainstorming On Students Creative
No ratings yet
The Effect of Brainstorming On Students Creative
5 pages
Kotlin Tutorial PDF
100% (1)
Kotlin Tutorial PDF
5 pages
Mathlinks 8 - Section 8 2
No ratings yet
Mathlinks 8 - Section 8 2
17 pages
Evidence Based Management PDF
No ratings yet
Evidence Based Management PDF
2 pages
Handling of Precast Column-Footing and Precast Prestressed Hollowcore Slabs
No ratings yet
Handling of Precast Column-Footing and Precast Prestressed Hollowcore Slabs
5 pages
Northern Lights Warranty
No ratings yet
Northern Lights Warranty
2 pages
List of Companies (Vru)
No ratings yet
List of Companies (Vru)
3 pages
Bikol Reporter January 29 - February 4, 2017 Issue
No ratings yet
Bikol Reporter January 29 - February 4, 2017 Issue
8 pages
Sample Speech On The Topic Smile (For Kids)
No ratings yet
Sample Speech On The Topic Smile (For Kids)
2 pages
Pisting Yawa Nga Notes
No ratings yet
Pisting Yawa Nga Notes
1 page
Hot Deformation Behavior of 2219
No ratings yet
Hot Deformation Behavior of 2219
18 pages
MSDS Petron XD3
No ratings yet
MSDS Petron XD3
5 pages
AZ-400: Designing and Implementing Microsoft DevOps-02
No ratings yet
AZ-400: Designing and Implementing Microsoft DevOps-02
67 pages
Servo-Hydraulic Actuator SHA Fields of Application: RE 08137, Edition: 2018-02, Bosch Rexroth AG
No ratings yet
Servo-Hydraulic Actuator SHA Fields of Application: RE 08137, Edition: 2018-02, Bosch Rexroth AG
12 pages
An Etymological Dictionary of Common Chinese Characters
No ratings yet
An Etymological Dictionary of Common Chinese Characters
190 pages
Mobilization Construction Equipment & Tools
No ratings yet
Mobilization Construction Equipment & Tools
1 page
Chapter 8 - Individual Variation in Drug Responses
100% (3)
Chapter 8 - Individual Variation in Drug Responses
3 pages
Data Sheet
No ratings yet
Data Sheet
6 pages
10.254.020 (01) - Var-330150-Escape-Horizontal - Farmall 75A Rops - JPY100355
No ratings yet
10.254.020 (01) - Var-330150-Escape-Horizontal - Farmall 75A Rops - JPY100355
3 pages
Patanjali - Strategic Management - Group Project
No ratings yet
Patanjali - Strategic Management - Group Project
18 pages
Practical Research 1
No ratings yet
Practical Research 1
4 pages
Hongkong Policing
No ratings yet
Hongkong Policing
11 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.