Omp Sync Data Runtime Environment

The document provides an overview of OpenMP for intranode programming, focusing on synchronization, data sharing environments, and runtime library variables. It explains the OpenMP memory model, syntax, scheduling clauses, and various synchronization mechanisms such as barriers, critical sections, and atomic operations. Additionally, it discusses data-sharing attributes and the management of private and shared data in parallel programming contexts.

OpenMP for Intranode Programming

Synchronization, Data Sharing Environment, and Runtime Library and Environment Variables

These slides were originally written by Dr. Barbara Chapman, University of Houston
OpenMP Memory Model
 OpenMP assumes a shared memory
 Threads communicate by sharing variables.
 Synchronization protects data conflicts.
 Synchronization is expensive.
 Change how data is accessed to minimize the need for synchronization.
OpenMP Syntax
 Most OpenMP constructs are compiler directives
 For C and C++, they are pragmas with the form:
#pragma omp construct [clause [clause]…]
 For Fortran, the directives may have fixed or free form:
*$OMP construct [clause [clause]…]
C$OMP construct [clause [clause]…]
!$OMP construct [clause [clause]…]
 Include file and the OpenMP lib module
#include <omp.h>
use omp_lib
 Most OpenMP constructs apply to a “structured block”.
 A block of one or more statements with one point of entry at the top
and one point of exit at the bottom.
 It’s OK to have an exit() within the structured block.

OpenMP sentinel forms: #pragma omp !$OMP


OpenMP schedule Clause
The schedule clause affects how loop iterations are mapped onto threads:

schedule ( static | dynamic | guided [, chunk] )
schedule ( auto | runtime )

static: Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
dynamic: Fixed portions of work; size is controlled by the value of chunk. When a thread finishes, it starts on the next portion of work
guided: Same dynamic behavior as "dynamic", but the size of the portion of work decreases exponentially
auto: The compiler (or runtime system) decides what is best to use; the choice could be implementation dependent
runtime: Iteration scheduling scheme is set at runtime through the environment variable OMP_SCHEDULE
Example Of A Static Schedule
A loop of length 16 using 4 threads

Thread 0 1 2 3
no chunk * 1-4 5-8 9-12 13-16
chunk = 2 1-2 3-4 5-6 7-8
9-10 11-12 13-14 15-16

*) The precise distribution is implementation defined


The Schedule Clause

Schedule    When To Use                              Runtime cost
STATIC      Pre-determined and predictable           Least work at runtime:
            by the programmer                        scheduling done at compile-time
DYNAMIC     Unpredictable, highly variable           Most work at runtime:
            work per iteration                       complex scheduling logic used at run-time
GUIDED      Special case of dynamic to reduce
            scheduling overhead
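The static round-robin mapping described above can be checked with a small helper. This is a sketch, not part of the OpenMP API: `static_owner` is a hypothetical function of mine that mimics how schedule(static, chunk) deals chunks out round-robin, assuming the implementation follows the common round-robin rule (the precise distribution is implementation defined).

```c
#include <stdio.h>

/* Hypothetical helper: which thread owns 0-based iteration i under
   schedule(static, chunk) with nthreads threads, assuming chunks are
   dealt out round-robin. */
int static_owner(int i, int chunk, int nthreads) {
    return (i / chunk) % nthreads;
}

/* Reproduces the earlier static-schedule table: 16 iterations, 4 threads,
   chunk = 2 -> thread 0 gets iterations 1-2 and 9-10, thread 1 gets
   3-4 and 11-12, and so on. */
void print_static_map(void) {
    for (int i = 0; i < 16; i++)
        printf("iteration %2d -> thread %d\n", i + 1, static_owner(i, 2, 4));
}
```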
OpenMP Synchronization
 Synchronization enables the user to
 Control the ordering of executions in different threads
 Ensure that at most one thread executes an operation or region of code at any given time (mutual exclusion)
 High level synchronization:
 critical section
 atomic
 barrier
 ordered
 Low level synchronization:
 flush
 locks (both simple and nested)
Barrier
We need to update all of a[ ] before using a[ ]*

for (i=0; i < N; i++)
    a[i] = b[i] + c[i];

/* barrier: wait here until all threads are done */

for (i=0; i < N; i++)
    d[i] = a[i] + b[i];

All threads wait at the barrier point and only continue when all threads have reached the barrier point.

*) If the mapping of iterations onto threads is guaranteed to be identical for both loops, we do not need to wait in this case.
Barrier
[Figure: threads that reach the barrier region early sit idle until the last thread arrives (time runs left to right)]

Barrier syntax in OpenMP:
#pragma omp barrier      (C/C++)
!$omp barrier            (Fortran)
Barrier
 Each thread waits until all threads arrive.

#pragma omp parallel shared (A, B, C) private(id)
{
  id = omp_get_thread_num();
  A[id] = big_calc1(id);
#pragma omp barrier
#pragma omp for
  for (i=0; i<N; i++) { C[i] = big_calc3(i, A); }  /* implicit barrier at the end of a for worksharing construct */
#pragma omp for nowait
  for (i=0; i<N; i++) { B[i] = big_calc2(C, i); }  /* no implicit barrier due to nowait */
  A[id] = big_calc4(id);
}                                                  /* implicit barrier at the end of the parallel region */
The Nowait Clause

 Barriers are implied at end of parallel region,


for/do, sections and single constructs
 Barrier can be suppressed by using the
optional nowait clause
 If present, threads do not synchronize/wait at the
end of that particular construct

C/C++:
#pragma omp for nowait
  ...

Fortran:
!$omp do
  ...
!$omp end do nowait
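The barrier/nowait pattern above can be written out as one complete worksharing sketch (the function and variable names are mine, not from the slides). Compiled with -fopenmp the pragmas take effect; without it they are ignored and the code simply runs serially, producing the same result.

```c
#define N 8

/* Two dependent loops in one parallel region: the implicit barrier at
   the end of the first `omp for` guarantees all of a[] is written
   before any thread reads it in the second loop; `nowait` drops the
   unneeded barrier after the last loop (the end of the parallel
   region supplies one anyway). */
void two_loops(double *a, double *d, const double *b, const double *c) {
    #pragma omp parallel
    {
        #pragma omp for              /* implicit barrier at the end */
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        #pragma omp for nowait       /* safe: nothing after depends on it */
        for (int i = 0; i < N; i++)
            d[i] = a[i] + b[i];
    }                                /* implicit barrier ends the region */
}
```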
Critical Section
 Mutual exclusion: Code may only be executed by
at most one thread at any given time
 Could lead to long wait times for other threads
 Atomic updates for individual operations
 Critical regions and locks for structured regions of code
[Figure: threads serialize one at a time through the critical region as time progresses]
Critical Region (Section)
 Only one thread at a time can enter a critical region.

float res;
#pragma omp parallel
{
  float B;
  int i;
#pragma omp for
  for (i=0; i<niters; i++) {
    B = big_job(i);
#pragma omp critical
    consume(B, res);   /* threads wait their turn; only one at a time calls consume() */
  }
}

Use e.g. when all threads update a variable; if the order in which they do so is unimportant, we need to ensure that they do not do it at the same time.
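A runnable version of the critical-region pattern above (a sketch with names of my own: `big_job` becomes a simple expression and the slide's `consume()` becomes a plain accumulation into a shared result):

```c
/* Each thread computes a private value B; the critical region
   serializes the update of the shared result so only one thread
   at a time touches `res`. */
double sum_big_jobs(int niters) {
    double res = 0.0;
    #pragma omp parallel
    {
        double B;                     /* private to each thread */
        #pragma omp for
        for (int i = 0; i < niters; i++) {
            B = i * 0.5;              /* stand-in for big_job(i) */
            #pragma omp critical
            res += B;                 /* stand-in for consume(B, res) */
        }
    }
    return res;
}
```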
Atomic
 Atomic is a special case of mutual exclusion.
 It applies only to the update of a memory location.

The statement inside the atomic must be one of:
  x binop= expr
  x = x binop expr
  x = expr binop x
  x++ , ++x , x-- , --x
where x is an lvalue of scalar type and binop is a non-overloaded built-in operator.

C$OMP PARALLEL PRIVATE(B)
      B = DOIT(I)
      tmp = big_ugly()
C$OMP ATOMIC
      X = X + tmp
C$OMP END PARALLEL

OpenMP 3.1 describes the behavior in more detail via these clauses: read, write, update, capture. The pre-3.1 atomic construct is equivalent to #pragma omp atomic update.
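A minimal atomic sketch using one of the allowed statement forms (x++); the counter name and loop are mine:

```c
/* Counts n events in parallel; the atomic directive makes each
   x++ an indivisible update of the shared memory location. */
int count_hits(int n) {
    int x = 0;                 /* shared across the team */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        x++;                   /* allowed form: x++ */
    }
    return x;
}
```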
Ordered
 The ordered construct enforces the sequential order for a block.
 Code is executed in the order in which iterations would be performed sequentially.
 The worksharing construct has to have the ordered clause.

#pragma omp parallel private (tmp)
#pragma omp for ordered
for (i=0; i<N; i++) {
  tmp = NEAT_STUFF(i);
#pragma omp ordered
  res += consum(tmp);
}
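A runnable ordered sketch (the slide's NEAT_STUFF becomes `i*i`; `out` and `pos` are my names). Because the ordered region executes in sequential iteration order, the append below produces the array in order even with many threads:

```c
/* The ordered region inside an `omp for ordered` loop executes in
   the loop's sequential iteration order, so the append is ordered. */
void fill_in_order(int *out, int n) {
    int pos = 0;                        /* shared append position */
    #pragma omp parallel for ordered
    for (int i = 0; i < n; i++) {
        int tmp = i * i;                /* stand-in for NEAT_STUFF(i) */
        #pragma omp ordered
        out[pos++] = tmp;               /* appended in iteration order */
    }
}
```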
Updates to Shared Data
 Blocks of data are fetched into cache lines
 Values may temporarily differ from other copies of data within a parallel region

[Figure: a shared variable "a" lives in shared memory; each processor proc1..procN may hold its own copy of "a" in its cache (cache1..cacheN)]
The Flush Directive
 The flush construct denotes a sequence point where
a thread tries to create a consistent view of memory
for specified variables.
 All memory operations (both reads and writes) defined
prior to the sequence point must complete.
 All memory operations (both reads and writes) defined
after the sequence point must follow the flush.
 Variables in registers or write buffers must be updated
in memory.
 Arguments to flush specify which variables are
flushed.
 If no arguments are specified, all thread visible
variables are flushed.
What Else Does Flush Influence?
The flush operation does not
actually synchronize different
threads. It just ensures that a
thread’s values are made
consistent with main memory.
Something to note:
Compilers reorder instructions to better exploit the functional
units and keep the machine busy
Flush prevents the compiler from doing the following:
 Reorder read/writes of variables in a flush set relative to a flush.
 Reorder flush constructs when flush sets overlap.
A compiler CAN do the following:
 Reorder instructions NOT involving variables in the flush set
relative to the flush.
 Reorder flush constructs that don’t have overlapping flush sets.

A Flush Example
Pair-wise synchronization.

      integer ISYNC(NUM_THREADS)
C$OMP PARALLEL DEFAULT (PRIVATE) SHARED (ISYNC)
      IAM = OMP_GET_THREAD_NUM()
      ISYNC(IAM) = 0
C$OMP BARRIER
      CALL WORK()
      ISYNC(IAM) = 1      ! I'm all done; signal this to other threads
C$OMP FLUSH(ISYNC)        ! Make sure other threads can see my write
      DO WHILE (ISYNC(NEIGH) .EQ. 0)
C$OMP FLUSH(ISYNC)        ! Make sure the read picks up a good copy from memory
      END DO
C$OMP END PARALLEL

Note: flush is analogous to a fence in other shared memory APIs.
Implied Flush
Flushes are implicitly performed during execution:
 In a barrier region
 At exit from worksharing regions, unless a nowait is present
 At entry to and exit from parallel, critical, ordered and parallel worksharing regions
 During omp_set_lock and omp_unset_lock regions
 During omp_test_lock, omp_set_nest_lock, omp_unset_nest_lock and omp_test_nest_lock regions, if the region causes the lock to be set or unset
 Immediately before and after every task scheduling point
 At entry to and exit from atomic regions, where the list contains only the variable updated in the atomic construct
 But not on entry to a worksharing region, or on entry to/exit from a master region
Managing the data environment
Data-Sharing Attributes

 In OpenMP code, data needs to be “labeled”


 There are two basic types:
 Shared – there is only one instance of the data
 Threads can read and write the data simultaneously
unless protected through a specific construct
 All changes made are visible to all threads
– But not necessarily immediately, unless enforced ......
 Private - Each thread has a copy of the data
 No other thread can access this data
 Changes only visible to the thread owning the data
OpenMP Data Environment
 Most variables are shared by default
 Global variables are SHARED among threads
 Fortran: COMMON blocks, SAVE variables, MODULE
variables
 C: File scope variables, static
 But not everything is shared by default...
 Stack variables in sub-programs called from parallel
regions are PRIVATE
 Automatic variables defined inside the parallel region are
PRIVATE.
 The default status can be modified with:
 DEFAULT (PRIVATE | SHARED | NONE)

All data clauses apply to parallel regions and worksharing constructs except "shared", which only applies to parallel regions.
About Storage Association

 Private variables are undefined on entry and


exit of the parallel region
 A private variable within a parallel region has
no storage association with the same variable
outside of the region
 Use the firstprivate and lastprivate clauses
to override this behavior
 We illustrate these concepts with an example
OpenMP Data Environment

double a[size][size], b = 4;
#pragma omp parallel private (b)
{ .... }

[Figure: a[size][size] is shared data visible to all threads; each thread T0..T3 has its own private copy of b (e.g. b=6 in T0, b=8 in T1); b becomes undefined on exit from the region]
OpenMP Data Environment

      program sort                       subroutine work (index)
      common /input/ A(10)               common /input/ A(10)
      integer index(10)                  integer index(*)
C$OMP PARALLEL                           real temp(10)
      call work (index)                  …………
C$OMP END PARALLEL
      print*, index(1)

A and index are shared by all threads; temp is local to each thread.
OpenMP Private Clause
 private(var) creates a local copy of var for each thread.
 The value is uninitialized
 Private copy is not storage-associated with the original
 The original is undefined at the end

      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO J=1,1000
         IS = IS + J        ! IS was not initialized
      END DO
C$OMP END PARALLEL DO
      print *, IS           ! IS is undefined here, regardless of initialization
(In)Visibility of Private Data

#pragma omp parallel private(x) shared(p0, p1)

Thread 0:               Thread 1:
  x = …;                  x = …;
  p0 = &x;                p1 = &x;
  /* references in the following line are not allowed */
  … *p1 …                 … *p0 …

You cannot reference another thread's private variables, even if you have a shared pointer between the two threads.
The Firstprivate And Lastprivate Clauses
firstprivate (list)
 All variables in the list are initialized with the
value the original object had before entering
the parallel construct

lastprivate (list)
 The thread that executes the sequentially last
iteration or section updates the value of the
objects in the list
Firstprivate Clause
 firstprivate is a special case of private.
 Initializes each private copy with the corresponding value from the master thread.

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
      DO 20 J=1,1000
         IS = IS + J
20    CONTINUE
C$OMP END PARALLEL DO     ! Each thread gets its own IS with an initial value of 0
      print *, IS         ! Regardless of initialization, IS is undefined at this point
Lastprivate Clause
 Lastprivate passes the value of a private variable from the last iteration to the variable of the master thread

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
C$OMP& LASTPRIVATE(IS)
      DO 20 J=1,1000
         IS = IS + J      ! Each thread gets its own IS with an initial value of 0
20    CONTINUE
C$OMP END PARALLEL DO
      print *, IS         ! IS is defined as its value at the last iteration (i.e. for J=1000). Are you sure?
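The Fortran examples above translate to C as follows (a sketch with function names of my own, assuming default scheduling). The first function shows lastprivate handing back the value assigned in the sequentially last iteration; the second shows firstprivate pre-initializing each thread's copy:

```c
/* lastprivate: after the loop, `last` holds the value assigned in
   the sequentially last iteration (j == 1000). */
int last_value(void) {
    int last = 0;
    #pragma omp parallel for lastprivate(last)
    for (int j = 1; j <= 1000; j++)
        last = j;
    return last;                       /* 1000 */
}

/* firstprivate: every thread's private `base` starts at 100, the
   value it had before the region; `total` is updated atomically. */
int offset_sum(void) {
    int base = 100;
    int total = 0;
    #pragma omp parallel for firstprivate(base)
    for (int j = 0; j < 10; j++) {
        #pragma omp atomic
        total += base + j;             /* 10*100 + (0+...+9) = 1045 */
    }
    return total;
}
```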
A Data Environment Checkup
 Consider this example of PRIVATE and FIRSTPRIVATE
C variables A,B, and C = 1
C$OMP PARALLEL PRIVATE(B)
C$OMP& FIRSTPRIVATE(C)
 Are A,B,C local to each thread or shared inside the parallel region?
 What are their initial values inside and after the parallel region?

A Data Environment Checkup
 Consider this example of PRIVATE and FIRSTPRIVATE
C variables A,B, and C = 1
C$OMP PARALLEL PRIVATE(B)
C$OMP& FIRSTPRIVATE(C)
 Are A,B,C local to each thread or shared inside the parallel region?
 What are their initial values inside and after the parallel region?

Inside this parallel region ...


 “A” is shared by all threads; equals 1
 “B” and “C” are local to each thread.
– B’s initial value is undefined
– C’s initial value equals 1
Outside this parallel region ...
 The values of "B" and "C" are undefined.
OpenMP Reduction
 If it’s the sum of all J values that you need, there is a way to do that too.
 We have already seen how

IS = 0
C$OMP PARALLEL DO REDUCTION(+:IS)
DO 1000 J=1,1000
IS = IS + J
1000 CONTINUE
print *, IS

Result variable is shared by default

OpenMP Reduction

 Combines an accumulation operation across threads:


reduction (op : list)
 Inside a parallel or work-sharing construct:
 A local copy of each list variable is made and initialized
depending on the “op” (e.g. 0 for “+”).
 Compiler finds standard reduction expressions containing “op”
and uses them to update the local copy.
 Local copies are reduced into a single value and combined with
the original global value.
 The variables in “list” must be shared in the enclosing
parallel region.

The Reduction Clause

reduction ( operator : list )                        C/C++
reduction ( [operator | intrinsic] : list )          Fortran

 Reduction variable(s) must be shared variables (check the specs for details)
 A reduction is defined as:

Fortran                              C/C++
x = x operator expr                  x = x operator expr
x = expr operator x                  x = expr operator x
x = intrinsic (x, expr_list)         x++, ++x, x--, --x
x = intrinsic (expr_list, x)         x <binop>= expr
"min" and "max" intrinsic

 Note that the value of a reduction variable is undefined from the moment the first thread reaches the clause till the operation has completed
 The reduction can be hidden in a function call
Reduction Example
 Remember the code we used to demo private, firstprivate and lastprivate.
 If it's the sum of all J values that you need, there is a way to do that too.

      program closer
      IS = 0
      DO 1000 J=1,1000
         IS = IS + J
1000  CONTINUE
      print *, IS

      program closer
      IS = 0
C$OMP PARALLEL DO REDUCTION(+:IS)
      DO 1000 J=1,1000
         IS = IS + J
1000  CONTINUE
      print *, IS
Example - The Reduction Clause

      sum = 0.0
!$omp parallel default(none) &
!$omp shared(n,x) private(i)
!$omp do reduction (+:sum)     ! variable SUM is a shared variable
      do i = 1, n
         sum = sum + x(i)
      end do
!$omp end do
!$omp end parallel
      print *,sum

 Care needs to be taken when updating shared variable SUM
 With the reduction clause, the OpenMP compiler generates code that avoids a race condition
Reduction Operands/Initial Values
 Associative operands used with reduction
 Initial values are the ones that make sense mathematically

Operand   Initial value
+         0
*         1
-         0
.AND.     all 1's (.TRUE.)
.OR.      0 (.FALSE.)
MAX       most negative number
MIN       largest positive number
//        all 1's
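A C version of the reduction pattern (names mine): each thread's private copy of the sum starts at 0, the identity for "+", and the copies are combined with the original at the end:

```c
/* Parallel sum via reduction: the runtime gives each thread a
   private copy of `sum` initialized to 0 (the identity for "+")
   and combines the copies with the original when the loop ends. */
double reduce_sum(const double *x, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum;
}
```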
The Default Clause

default ( none | shared )                                  C/C++
default ( none | shared | private | firstprivate )         Fortran

none
 No implicit defaults; have to scope all variables explicitly
shared
 All variables are shared
 The default in absence of an explicit "default" clause
private
 All variables are private to the thread
 Includes common block data, unless THREADPRIVATE
firstprivate
 All variables are private to the thread; pre-initialized
Default Clause Example

Are these two codes equivalent?

      itotal = 1000
C$OMP PARALLEL PRIVATE(np, each)
      np = omp_get_num_threads()
      each = itotal/np
      ………
C$OMP END PARALLEL

      itotal = 1000
C$OMP PARALLEL DEFAULT(PRIVATE) SHARED(itotal)
      np = omp_get_num_threads()
      each = itotal/np
      ………
C$OMP END PARALLEL
Yes, the two codes are equivalent: in both versions np and each are private inside the region, and itotal is shared.
OpenMP Threadprivate
 Makes global data private to a thread and persistent, so that it crosses parallel region boundaries
 Fortran: COMMON blocks
 C: File scope and static variables
 Different from making them PRIVATE
 With PRIVATE, global variables are masked.
 THREADPRIVATE preserves global scope within each thread
 Threadprivate variables can be initialized using COPYIN or by using DATA statements.
 Some limitations on use of threadprivate
 Consult specification before using
A Threadprivate Example
Consider two different routines called within a parallel region.

      subroutine poo                     subroutine bar
      parameter (N=1000)                 parameter (N=1000)
      common/buf/A(N),B(N)               common/buf/A(N),B(N)
!$OMP THREADPRIVATE(/buf/)               !$OMP THREADPRIVATE(/buf/)
      do i=1, N                          do i=1, N
         B(i) = const * A(i)                A(i) = sqrt(B(i))
      end do                             end do
      return                             return
      end                                end

Because of the threadprivate construct, each thread executing these routines has its own copy of the common block /buf/.
Threadprivate/Copyin
 You initialize threadprivate data using a copyin clause.

      parameter (N=1000)
      common/buf/A(N)
C$OMP THREADPRIVATE(/buf/)

C     Initialize the A array
      call init_data(N,A)

C$OMP PARALLEL COPYIN(A)
C     Now each thread sees threadprivate array A initialized
C     to the global value set in the subroutine init_data()
C$OMP END PARALLEL
      ....
C$OMP PARALLEL
C     Values of threadprivate are persistent across parallel regions
C$OMP END PARALLEL
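The same idea in C, with a file-scope variable instead of a common block (names mine): a threadprivate global initialized with copyin. A sketch assuming compilation with -fopenmp; compiled serially the pragmas are ignored and `counter` behaves like a plain global, giving the same result.

```c
/* A file-scope variable made threadprivate: each thread gets its
   own persistent copy. */
int counter = 0;
#pragma omp threadprivate(counter)

int tp_demo(void) {
    counter = 10;                  /* set the master thread's copy */
    #pragma omp parallel copyin(counter)
    {
        counter += 1;              /* each thread bumps its own copy */
    }
    return counter;                /* master's copy: 10 + 1 = 11 */
}
```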
The Copyin Clause

copyin (list)
 Applies to THREADPRIVATE common blocks only
 At the start of the parallel region, data of the master thread is copied to the thread private copies

Example:

      common /cblock/velocity
      common /fields/xfield, yfield, zfield
! create thread private common blocks
!$omp threadprivate (/cblock/, /fields/)
!$omp parallel &
!$omp default (private) &
!$omp copyin ( /cblock/, zfield )   ! data now available to threads
Copyprivate
Used with a single region to broadcast values of private variables from one member of a team to the rest of the team.

#include <omp.h>
void input_parameters (int*, int*); // fetch values of input parameters
void do_work(int, int);

int main()
{
  int Nsize, choice;

#pragma omp parallel private (Nsize, choice)
  {
    .....
#pragma omp single copyprivate (Nsize, choice)
    input_parameters (&Nsize, &choice);

    do_work(Nsize, choice);
  }
}
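A self-contained copyprivate sketch (names mine; the constant 42 stands in for a value that input_parameters would read):

```c
/* One thread produces a value inside `single`; copyprivate
   broadcasts it into every thread's private copy. */
int broadcast_value(void) {
    int nsize;                     /* private inside the region */
    int seen = 0;                  /* shared, written once below */
    #pragma omp parallel private(nsize)
    {
        #pragma omp single copyprivate(nsize)
        nsize = 42;                /* stand-in for input_parameters() */

        /* at this point every thread's nsize == 42 */
        #pragma omp single
        seen = nsize;
    }
    return seen;                   /* 42 */
}
```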
Fortran - Allocatable Arrays
Fortran allocatable arrays whose status is
“currently allocated” are allowed to be specified as
private, lastprivate, firstprivate, reduction, or copyprivate

integer, allocatable,dimension (:) :: A


integer i
allocate (A(n))

!$omp parallel private (A)


do i = 1, n
A(i) = i
end do
...
!$omp end parallel
C++ And Threadprivate

❑ OpenMP 3.0 clarified where/how threadprivate objects are constructed and destructed
❑ Allow C++ static class members to be threadprivate

class T {
public:
  static int i;
#pragma omp threadprivate(i)
  ...
};
The runtime library and environment variables
OpenMP Runtime Functions

 OpenMP provides a set of runtime functions


 They all start with “omp_”
 These functions can be used to:
 Query for a specific feature
 E.g. what is my thread ID?
 Change a setting
 E.g. to change the number of threads in next parallel
region
 A special category consists of the locking
functions
C/C++ : Need to include file <omp.h>
Fortran : Add “use omp_lib” or include file “omp_lib.h”
OpenMP Library Routines

 Modify/Check the number of threads
 omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
 Are we in a parallel region?
 omp_in_parallel()
 How many processors in the system?
 omp_get_num_procs()
OpenMP Library Routines
 To use a known, fixed number of threads in a program: (1) tell the system that you don't want dynamic adjustment of the number of threads, (2) set the number of threads, then (3) save the number you got.

#include <omp.h>
void main()
{
  int num_threads;
  omp_set_dynamic( 0 );                        /* disable dynamic adjustment of the number of threads */
  omp_set_num_threads( omp_get_num_procs() );  /* request as many threads as you have processors */
#pragma omp parallel
  {
    int id = omp_get_thread_num();
#pragma omp single                             /* protect this op since memory stores are not atomic */
    num_threads = omp_get_num_threads();
    do_lots_of_stuff(id);
  }
}

Even in this case, the system may give you fewer threads than requested. If the precise # of threads matters, test for it and respond accordingly.
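A compilable version of the pattern above. The #ifdef fallbacks are my addition so the sketch also builds without -fopenmp (in that case it trivially reports one thread):

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* serial fallbacks so the sketch also builds without -fopenmp */
static int  omp_get_num_threads(void)  { return 1; }
static int  omp_get_num_procs(void)    { return 1; }
static void omp_set_dynamic(int d)     { (void)d; }
static void omp_set_num_threads(int n) { (void)n; }
#endif

/* (1) disable dynamic adjustment, (2) request one thread per
   processor, (3) record how many threads we actually got. */
int query_threads(void) {
    int num_threads = 1;
    omp_set_dynamic(0);
    omp_set_num_threads(omp_get_num_procs());
    #pragma omp parallel
    {
        #pragma omp single
        num_threads = omp_get_num_threads();
    }
    return num_threads;
}
```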
OpenMP Runtime Functions
Name Functionality
omp_set_num_threads Set number of threads
omp_get_num_threads Number of threads in team
omp_get_max_threads Max num of threads for parallel region
omp_get_thread_num Get thread ID
omp_get_num_procs Maximum number of processors
omp_in_parallel Check whether in parallel region
omp_set_dynamic Activate dynamic thread adjustment
(but implementation is free to ignore this)
omp_get_dynamic Check for dynamic thread adjustment
omp_set_nested Activate nested parallelism
(but implementation is free to ignore this)
omp_get_nested Check for nested parallelism
omp_get_wtime Returns wall clock time
omp_get_wtick Number of seconds between clock ticks
OpenMP Runtime Functions
Name Functionality
omp_set_schedule Set schedule (if “runtime” is used)
omp_get_schedule Returns the schedule in use
omp_get_thread_limit Max number of threads for
program
omp_set_max_active_levels Set number of active parallel
regions
omp_get_max_active_levels Number of active parallel regions
omp_get_level Number of nested parallel regions
omp_get_active_level Number of nested active par.
regions
omp_get_ancestor_thread_num Thread id of ancestor thread
omp_get_team_size (level) Size of the thread team at this level
omp_in_final Check in final task or not
OpenMP Environment Variables

 Set the default number of threads to use.
 OMP_NUM_THREADS int_literal
 Control how "omp for schedule(RUNTIME)" loop iterations are scheduled.
 OMP_SCHEDULE "schedule[, chunk_size]"
OpenMP Environment Variables/1

OpenMP Environment Variable              Default (Oracle Solaris Studio)
OMP_NUM_THREADS                          2
OMP_SCHEDULE "schedule,[chunk]"          static, "N/P"
OMP_DYNAMIC {TRUE | FALSE}               TRUE
OMP_NESTED {TRUE | FALSE}                FALSE
OMP_STACKSIZE "size [B|K|M|G]"           4 MB (32 bit) / 8 MB (64 bit)
OMP_WAIT_POLICY [ACTIVE | PASSIVE]       PASSIVE
OMP_MAX_ACTIVE_LEVELS                    4

 The names are in uppercase, the values are case insensitive
 Be careful when relying on defaults (because they are compiler dependent)
OpenMP Environment Variables/2

OpenMP Environment Variable              Default (Oracle Solaris Studio)
OMP_THREAD_LIMIT                         1024
OMP_PROC_BIND {TRUE | FALSE}             FALSE
