Performance Issues I
C/C++ stores matrices in row-major order. Loop interchange may increase cache locality.
{
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            A[i][j] = B[i][j] + C[i][j];
        }
    }
}
Performance Issues II
Move synchronization points outwards. Here the inner loop is parallelized: in each iteration of the outer loop a new parallel region is created, which causes parallelization overhead.
{
    for (i = 0; i < N; i++) {
        #pragma omp parallel for
        for (j = 0; j < M; j++) {
            A[i][j] = B[i][j] + C[i][j];
        }
    }
}
Conditional Compilation
Keep the sequential and parallel versions of a program in a single source file.
#ifdef _OPENMP
#include <omp.h>
#endif

int main() {
    #ifdef _OPENMP
    omp_set_num_threads(3);
    #endif
    for (i = 0; i < N; i++) {
        #pragma omp parallel for
        for (j = 0; j < M; j++) {
            A[i][j] = B[i][j] + C[i][j];
        }
    }
}
a[i] is written in loop iteration i and read in loop iteration i+1 (a flow dependence). This loop cannot be executed in parallel; the results may not be correct.
Anti-dependence
for (i = 0; i < N-1; i++) {
    x = b[i] + c[i];
    a[i] = a[i+1] + x;
}
#pragma omp parallel for private(i)
for (j = 0; j < n; j++)
    for (i = 1; i < m; i++) {
        a[i][j] = 2.0 * a[i-1][j];
    }
Parallel version
#pragma omp parallel for shared(a,b)
for (i = 0; i < N/2; i++) {
    a[i] = a[i] + a[i+N/2];
    b[i] = i*(i-1)/2;
    c[i] = pow(2,i);
}
Parallel version
b[1] = b[1] + a[0];
#pragma omp parallel for shared(a,b,c)
for (i = 1; i < N-1; i++) {
    a[i] = a[i] + c[i];
    b[i+1] = b[i+1] + a[i];
}
a[N-1] = a[N-1] + c[N-1];
Cyclic reduction is probably the best method to solve tridiagonal systems.

Z. Liu, B. Chapman, Y. Wen and L. Huang. Analyses for the Translation of OpenMP Codes into SPMD Style with Array Privatization. In OpenMP Shared Memory Parallel Programming: International Workshop on OpenMP.
C. Addison, Y. Ren and M. van Waveren. OpenMP Issues Arising in the Development of Parallel BLAS and LAPACK Libraries. Scientific Programming, 11(2), 2003.
S.F. McGinn and R.E. Shaw. Parallel Gaussian Elimination Using OpenMP and MPI.
[Figure: functional decomposition example with tasks alpha, beta, gamma, delta and epsilon]
#pragma omp sections [clause list]
    private (list)
    firstprivate (list)
    lastprivate (list)
    reduction (operator: list)
    nowait
{
    #pragma omp section
        structured_block
    #pragma omp section
        structured_block
}
#include <omp.h>
#define N 1000

int main() {
    int i;
    double a[N], b[N], c[N], d[N];
    for (i = 0; i < N; i++) {
        a[i] = i * 2.0;
        b[i] = i + a[i] * 22.5;
    }
    #pragma omp parallel shared(a,b,c,d) private(i)
    {
        #pragma omp sections nowait
        {
            #pragma omp section
            for (i = 0; i < N; i++) c[i] = a[i] + b[i];
            #pragma omp section
            for (i = 0; i < N; i++) d[i] = a[i] * b[i];
        }
    }
}

By default, there is a barrier at the end of the sections. Use the nowait clause to turn off the barrier.
#include <omp.h>

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        v = alpha();
        #pragma omp section
        w = beta();
    }
    #pragma omp sections
    {
        #pragma omp section
        x = gamma(v,w);
        #pragma omp section
        y = delta();
    }
    printf("%g\n", epsilon(x,y));
}
Synchronization I
Threads communicate through shared variables. Uncoordinated access to these variables can lead to undesired effects.
E.g. if two threads update (write) a shared variable in the same step of execution, the result depends on the order in which the variable is accessed. This is called a race condition.
To prevent race conditions, access to shared variables must be synchronized. Synchronization can be time consuming. The barrier directive synchronizes all threads: each thread waits at the barrier until all of them have arrived.
Synchronization II

Synchronization imposes order constraints and is used to protect access to shared data.
High level synchronization:
    critical
    atomic
    barrier
    ordered
Synchronization: critical
Mutual exclusion: only one thread at a time can enter a critical region.

{
    double res;
    #pragma omp parallel
    {
        double B;
        int i, id, nthrds;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        for (i = id; i < niters; i += nthrds) {
            B = some_work(i);
            #pragma omp critical
            consume(B, res);
        }
    }
}

Threads wait at the critical construct: only one thread at a time calls consume(), so this is a piece of sequential code inside the for loop.
{
    #pragma omp parallel shared(best_cost)
    {
        #pragma omp for nowait
        for (i = 0; i < N; i++) {
            int my_cost;
            my_cost = estimate(i);
            #pragma omp critical
            {
                if (best_cost < my_cost)
                    best_cost = my_cost;
            }
        }
    }
}

Only one thread at a time executes the if() statement. This ensures mutual exclusion when accessing shared data. Without critical, this would set up a race condition, in which the computation exhibits nondeterministic behavior when performed by multiple threads accessing a shared variable.
Synchronization: atomic
atomic provides mutual exclusion, but only applies to the load/update of a memory location. It is a lightweight, special form of a critical section, applied only to the single assignment statement that immediately follows it.
{
    #pragma omp parallel
    {
        double tmp, B;
        ...
        #pragma omp atomic
        X += tmp;
    }
}
ic is a counter. The atomic construct ensures that no updates are lost when multiple threads update the counter value.
The atomic construct may only be used together with an expression statement using one of the operators: +, *, -, /, &, ^, |, <<, >>.
The atomic construct does not prevent multiple threads from executing the function bigfunc() at the same time.
Synchronization: barrier
Suppose each of the following two loops is run in parallel over i. This may give a wrong answer.
for (i = 0; i < N; i++)
    a[i] = b[i] + c[i];

for (i = 0; i < N; i++)
    d[i] = a[i] + b[i];
There could be a data race on a[].
for (i = 0; i < N; i++)
    a[i] = b[i] + c[i];

#pragma omp barrier    /* all threads wait here */

for (i = 0; i < N; i++)
    d[i] = a[i] + b[i];

To avoid the race condition, all threads must wait at the barrier point and only continue when all threads have reached it. Barrier syntax: #pragma omp barrier
Synchronization: barrier

barrier: each thread waits until all threads arrive.
#pragma omp parallel shared(A,B,C) private(id)
{
    id = omp_get_thread_num();
    A[id] = big_calc1(id);
    #pragma omp barrier
    #pragma omp for
    for (i = 0; i < N; i++) { C[i] = big_calc3(i,A); }
    #pragma omp for nowait
    for (i = 0; i < N; i++) { B[i] = big_calc2(i,C); }
    A[id] = big_calc4(id);
}
When to Use Barriers

If data is updated asynchronously and data integrity is at risk. Examples:
    Between parts of the code that read and write the same section of memory
    After one timestep/iteration in a numerical solver

Barriers are expensive and also may not scale to a large number of processors.
master Construct
The master construct defines a structured block that is executed only by the master thread; the other threads skip it. No synchronization is implied: there is no implied barrier on entry to or exit from the construct. The lack of a barrier may lead to problems.
#pragma omp parallel
{
    #pragma omp master
    {
        exchange_information();
    }
    #pragma omp barrier
}
single Construct
The single construct builds a block of code that is executed by only one thread (not necessarily the master thread). A barrier is implicitly set at the end of the single block (the barrier can be removed with the nowait clause).
#pragma omp parallel
{
    #pragma omp single
    {
        exchange_information();
    }
    do_other_things();
}
Synchronization: ordered
The ordered region executes in the sequential order
#pragma omp parallel private(tmp)
{
    #pragma omp for ordered reduction(+:res)
    for (i = 0; i < N; i++) {
        tmp = compute(i);
        #pragma omp ordered
        res += consume(tmp);
    }
    do_other_things();
}
Locking Example
The protected region contains the update of a shared variable. One thread acquires the lock and performs the update. Meanwhile, the other threads perform some other work. When the lock is released again, the next thread performs its update.
omp_lock_t lck;
omp_init_lock(&lck);            /* initialize a lock associated with lock
                                   variable lck for use in subsequent calls */
#pragma omp parallel shared(lck) private(tmp, id)
{
    id = omp_get_thread_num();
    tmp = do_some_work(id);
    omp_set_lock(&lck);         /* wait here until the lock is acquired */
    printf("%d %d\n", id, tmp);
    omp_unset_lock(&lck);       /* release the lock so the next thread gets a turn */
}
omp_destroy_lock(&lck);         /* dissociate the lock variable from any locks */
Allow system to dynamically vary the number of threads from one parallel construct to another
void omp_set_dynamic(int set)

    set = true: enables dynamic adjustment of team sizes
    set = false: disables dynamic adjustment

int omp_get_dynamic(void)

    returns whether dynamic adjustment is enabled
http://gcc.gnu.org/onlinedocs/libgomp/index.html#Top
A private variable has multiple storage locations, one within the execution context of each thread.

Not all variables are shared:
    Stack variables in functions called from parallel regions are private.
    Automatic variables within a statement block are private.

This holds for pointers as well. Therefore, do not assign a private pointer the address of a private variable of another thread; the result is not defined.
/** main file **/
#include <stdio.h>
#include <stdlib.h>

double A[100];

int main() {
    int index[50];
    #pragma omp parallel
    work(index);
    printf("%d\n", index[0]);
}

/** file 1 **/
#include <stdio.h>
#include <stdlib.h>

extern double A[100];

void work(int *index) {
    double temp[50];
    static int count;
}

Variables A, index and count are shared by all threads. Variable temp is local (private) to each thread.
The final value of a private variable inside a parallel for loop can be transmitted to the shared variable outside the loop with:

lastprivate

The data clauses listed here apply both to parallel regions and to worksharing constructs, except shared, which applies only to parallel constructs.
Private Clause
The private (variable list) clause creates a new local copy of the listed variables for each thread.

Values of these variables are not initialized on entry to the parallel region. Values of the data specified in the private clause can no longer be accessed after the corresponding region terminates (the values are undefined on exit of the parallel region).
/*** wrong implementation ***/
int main() {
    int tmp = 0;
    #pragma omp parallel for private(tmp)
    for (int j = 0; j < 1000; j++)
        tmp += j;
    printf("%d\n", tmp);    /* tmp is undefined here */
}
Firstprivate Clause
firstprivate initializes each private copy with the corresponding value from the master thread.
/*** still wrong implementation ***/
int main() {
    int tmp = 0;
    #pragma omp parallel for firstprivate(tmp)
    for (int j = 0; j < 1000; j++)
        tmp += j;
    printf("%d\n", tmp);    /* tmp is still undefined here */
}
Lastprivate Clause
The lastprivate clause passes the value of a private variable from the last iteration to a global variable.
It is supported on the work-sharing loop and sections constructs. It ensures that the last value of a data object listed is accessible after the corresponding construct has completed execution. When used with a work-shared loop, the object has the value from the iteration of the loop that would be last in a sequential execution.
/*** useless implementation ***/
int main() {
    int tmp = 0;
    #pragma omp parallel for firstprivate(tmp) lastprivate(tmp)
    for (int j = 0; j < 5; j++)
        tmp += j;
    printf("%d\n", tmp);
}

tmp is defined as its value at the last sequential iteration, i.e., j = 4.
Default Clause

C/C++ only has default(shared) or default(none). Only Fortran supports default(private). The default data attribute is default(shared).

Exception: #pragma omp task

default(none): no default attribute for variables in the static extent. The storage attribute for each variable in the static extent must be listed explicitly. Good programming practice.
[Figures: static extent vs. dynamic extent of a parallel region]
Threadprivate
Threadprivate makes global data private to a thread.
C/C++: file-scope variables, static variables, static class members. Each thread gets its own set of global variables, with initial values undefined.
If both of the conditions below hold, and if a threadprivate object is referenced in two consecutive (at run time) parallel regions, then threads with the same thread number in their respective regions reference the same copy of that variable:
    Neither parallel region is nested inside another parallel region.
    The number of threads used to execute both parallel regions is the same.
#include <stdio.h>
#include <stdlib.h>
#include "omp.h"

int *pglobal;
#pragma omp threadprivate(pglobal)

The threadprivate directive is used to give each thread a private copy of the global pointer pglobal.

int main() {
    #pragma omp parallel for private(i,j,sum,TID) shared(n,length,check)
    for (i = 0; i < n; i++) {
        TID = omp_get_thread_num();
        if ((pglobal = (int*) malloc(length[i]*sizeof(int))) != NULL) {
            for (j = sum = 0; j < length[i]; j++)
                pglobal[j] = j+1;
            sum = calculate_sum(length[i]);
            printf("TID %d: value of sum for i = %d is %d\n", TID, i, sum);
            free(pglobal);
        } else {
            printf("TID %d: not enough memory: length[%d] = %d\n", TID, i, length[i]);
        }
    }
}
int calculate_sum(int length) {
    int j, sum = 0;
    for (j = 0; j < length; j++) {
        sum += pglobal[j];
    }
    return sum;
}
Each thread has its own copy of sum0, updated in a parallel region that is entered several times. The value of sum0 from one execution of the parallel region is still available when the region is next started.
Copyin Clause

The copyin clause copies the master thread's threadprivate variables to the corresponding threadprivate variables of the other threads.
int global[100];
#pragma omp threadprivate(global)

int main() {
    for (int i = 0; i < 100; i++)
        global[i] = i + 2;    // initialize data
    #pragma omp parallel copyin(global)
    {
        // parallel region: each thread gets a copy of global, with initialized values
    }
}
Copyprivate Clause
The copyprivate clause is supported on the single directive to broadcast the values of private variables from one thread of a team to the other threads in the team.
The typical usage is to have one thread read or initialize private data that is subsequently used by the other threads as well. After the single construct has ended, but before the threads have left the associated barrier, the values of the variables specified in the associated list are copied to the other threads. Do not use copyprivate in combination with the nowait clause.

#include <omp.h>

void input_parameters(int*, int*);    // fetch values of input parameters

int main() {
    int Nsize, choice;
    #pragma omp parallel private(Nsize, choice)
    {
        #pragma omp single copyprivate(Nsize, choice)
        input_parameters(&Nsize, &choice);
        do_work(Nsize, choice);
    }
}
Flush Directive
OpenMP supports a shared memory model.
However, processors can have their own local high-speed memory: the registers and cache. If a thread updates shared data, the new value is first saved in a register and then stored back to the local cache. The update is thus not necessarily immediately visible to other threads.
The flush directive makes a thread's temporary view of shared data consistent with the value in memory.
#pragma omp flush (list)

Thread-visible variables are written back to memory at this point. For pointers in the list, note that the pointer itself is flushed, not the object it points to.
References:
http://bisqwit.iki.fi/story/howto/openmp/
http://openmp.org/mp-documents/omp-hands-onSC08.pdf
https://computing.llnl.gov/tutorials/openMP/
http://www.mosaic.ethz.ch/education/Lectures/hpc
R. van der Pas. An Overview of OpenMP.
B. Chapman, G. Jost and R. van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, Cambridge, Massachusetts, London, England.
B. Estrade. Hybrid Programming with MPI and OpenMP.