OpenMP Workshop Day 3

Dr.-Ing. Michael Klemm
Chief Executive Officer
OpenMP Architecture Review Board
Agenda
◼ OpenMP Architecture Review Board
◼ Introduction to OpenMP Offload Features
◼ Case Study: NWChem TCE CCSD(T)
◼ Detachable Tasks

2
Introduction to
OpenMP Offload Features
Running Example for this Presentation: saxpy

void saxpy() {
    float a, x[SZ], y[SZ];
    // left out initialization
    double t = 0.0;
    double tb, te;                               // timing code (not needed, just to have
    tb = omp_get_wtime();                        // a bit more code to show)
    #pragma omp parallel for firstprivate(a)
    for (int i = 0; i < SZ; i++) {               // this is the code we want to execute
        y[i] = a * x[i] + y[i];                  // on a target device (i.e., a GPU)
    }
    te = omp_get_wtime();                        // timing code
    t = te - tb;
    printf("Time of kernel: %lf\n", t);
}

Don't do this at home! Use a BLAS library for this!
4
Device Model
◼ As of version 4.0, the OpenMP API supports accelerators/coprocessors
◼ Device model:
▪ One host for “traditional” multi-threading
▪ Multiple accelerators/coprocessors of the same kind for offloading

[Figure: one host connected to multiple accelerator devices]
5
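A minimal sketch (not from the slides) of how a host program can query this device model; omp_get_num_devices() and omp_is_initial_device() are standard OpenMP API routines, the rest is illustrative:

#include <omp.h>
#include <stdio.h>

int main(void) {
    int ndev = omp_get_num_devices();     // number of available non-host devices
    int on_host = 1;

    // Offload to the default device; without any device, the region falls back to the host.
    #pragma omp target map(from: on_host)
    on_host = omp_is_initial_device();

    printf("%d device(s); target region ran on the %s\n",
           ndev, on_host ? "host" : "device");
    return 0;
}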
Execution Model
◼ Offload region and data environment is lexically scoped
▪ Data environment is destroyed at closing curly brace
▪ Allocated buffers/data are automatically released

[Figure: host and device memory for pointer pA; (1) buffer allocated on the device, (2) "to" data copied host → device, (3) target region executes, (4) "from" data copied device → host]

#pragma omp target      \
        map(alloc: ...) \
        map(to: ...)    \
        map(from: ...)
{ ... }

6

OpenMP for Devices - Constructs


◼ Transfer control and data from the host to the device
◼ Syntax (C/C++)
#pragma omp target [clause[[,] clause],…]
structured-block
◼ Syntax (Fortran)
!$omp target [clause[[,] clause],…]
structured-block
!$omp end target
◼ Clauses
device(scalar-integer-expression)
map([{alloc | to | from | tofrom}:] list)
if(scalar-expr)
Example: saxpy

void saxpy() {
    float a, x[SZ], y[SZ];
    double t = 0.0;
    double tb, te;
    tb = omp_get_wtime();
    #pragma omp target                      // implicitly "map(tofrom:y[0:SZ])"
    for (int i = 0; i < SZ; i++) {
        y[i] = a * x[i] + y[i];
    }
    te = omp_get_wtime();
    t = te - tb;
    printf("Time of kernel: %lf\n", t);
}

The compiler identifies variables that are used in the target region.
All accessed arrays are copied from host to device and back.
[Figure: a, x[0:SZ], y[0:SZ] transferred host → target; x[0:SZ], y[0:SZ] transferred target → host]
Presence check: only transfer if not yet allocated on the device.
Copying x back is not necessary: it was not changed.

clang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908
8
Example: saxpy

subroutine saxpy(a, x, y, n)
  use iso_fortran_env
  integer :: n, i
  real(kind=real32) :: a
  real(kind=real32), dimension(n) :: x
  real(kind=real32), dimension(n) :: y

!$omp target                               ! implicitly "map(tofrom:y(1:n))"
  do i=1,n
    y(i) = a * x(i) + y(i)
  end do
!$omp end target
end subroutine

The compiler identifies variables that are used in the target region.
All accessed arrays are copied from host to device and back.
[Figure: a, x(1:n), y(1:n) transferred host → target; x(1:n), y(1:n) transferred target → host]
Presence check: only transfer if not yet allocated on the device.
Copying x back is not necessary: it was not changed.

flang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908
9
Example: saxpy

void saxpy() {
    double a, x[SZ], y[SZ];
    double t = 0.0;
    double tb, te;
    tb = omp_get_wtime();
    #pragma omp target map(to:x[0:SZ]) \
                       map(tofrom:y[0:SZ])
    for (int i = 0; i < SZ; i++) {
        y[i] = a * x[i] + y[i];
    }
    te = omp_get_wtime();
    t = te - tb;
    printf("Time of kernel: %lf\n", t);
}

[Figure: a, x[0:SZ], y[0:SZ] transferred host → target; only y[0:SZ] transferred target → host]

clang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908
10
Example: saxpy

The compiler cannot determine the size of the memory behind a pointer.
Programmers have to help the compiler with the size of the data transfer needed.

void saxpy(float a, float* x, float* y,
           int sz) {
    double t = 0.0;
    double tb, te;
    tb = omp_get_wtime();
    #pragma omp target map(to:x[0:sz]) \
                       map(tofrom:y[0:sz])
    for (int i = 0; i < sz; i++) {
        y[i] = a * x[i] + y[i];
    }
    te = omp_get_wtime();
    t = te - tb;
    printf("Time of kernel: %lf\n", t);
}

[Figure: a, x[0:sz], y[0:sz] transferred host → target; y[0:sz] transferred target → host]

clang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908
11
Creating Parallelism on the Target Device
◼ The target construct transfers the control flow to the target device
▪ Transfer of control is sequential and synchronous
▪ This is intentional!

◼ OpenMP separates offload and parallelism


▪ Programmers need to explicitly create parallel regions on the target device
▪ In theory, this can be combined with any OpenMP construct
▪ In practice, there is only a useful subset of OpenMP features for a target device such
as a GPU, e.g., no I/O, limited use of base language features.

12
Example: saxpy

void saxpy(float a, float* x, float* y,
           int sz) {
    #pragma omp target map(to:x[0:sz]) \
                       map(tofrom:y[0:sz])
    #pragma omp parallel for simd
    for (int i = 0; i < sz; i++) {
        y[i] = a * x[i] + y[i];
    }
}

GPUs are multi-level devices: SIMD, threads, thread blocks.
Create a team of threads to execute the loop in parallel using SIMD instructions.

clang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908
13
teams Construct
◼ Support multi-level parallel devices
◼ Syntax (C/C++):
#pragma omp teams [clause[[,] clause],…]
structured-block
◼ Syntax (Fortran):
!$omp teams [clause[[,] clause],…]
structured-block
◼ Clauses
num_teams(integer-expression), thread_limit(integer-expression)
default(shared | firstprivate | private | none)
private(list), firstprivate(list), shared(list), reduction(operator:list)

14
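A small illustrative sketch (not from the slides): limiting the league size with num_teams/thread_limit and recording which team handled each iteration; the clause values and the function name are arbitrary examples:

#include <omp.h>

void teams_info(int* team_of, int n) {
    // Each team writes its team number into the iterations it executes.
    #pragma omp target teams distribute num_teams(4) thread_limit(64) \
            map(from: team_of[0:n])
    for (int i = 0; i < n; i++)
        team_of[i] = omp_get_team_num();
}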
Multi-level Parallel saxpy
◼ Manual code transformation
▪ Tile the loops into an outer loop and an inner loop
▪ Assign the outer loop to “teams” (OpenCL: work groups)
▪ Assign the inner loop to the “threads” (OpenCL: work items)
void saxpy(float a, float* x, float* y, int sz) {
    #pragma omp target teams map(to:x[0:sz]) map(tofrom:y[0:sz])
    {
        int bs = sz / omp_get_num_teams();
        #pragma omp distribute
        for (int i = 0; i < sz; i += bs) {
            #pragma omp parallel for simd firstprivate(i,bs)
            for (int ii = i; ii < i + bs; ii++) {
                y[ii] = a * x[ii] + y[ii];
            }
        }
    }
}

15
Multi-level Parallel saxpy
◼ For convenience, OpenMP defines composite constructs to implement the required code transformations
void saxpy(float a, float* x, float* y, int sz) {
#pragma omp target teams distribute parallel for simd \
num_teams(num_blocks) map(to:x[0:sz]) map(tofrom:y[0:sz])
for (int i = 0; i < sz; i++) {
y[i] = a * x[i] + y[i];
}
}

subroutine saxpy(a, x, y, n)
! Declarations omitted
!$omp target teams distribute parallel do simd &
!$omp& num_teams(num_blocks) map(to:x) map(tofrom:y)
do i=1,n
y(i) = a * x(i) + y(i)
end do
!$omp end target teams distribute parallel do simd
end subroutine
16
Optimize Data Transfers
◼ Reduce the amount of time spent transferring data
▪ Use map clauses to enforce direction of data transfer.
▪ Use target data, target enter data, target exit data constructs to keep
data environment on the target device.
void zeros(float* a, int n) {
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++)
        a[i] = 0.0f;
}

void saxpy(float a, float* y, float* x, int n) {
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

void example() {
    float tmp[N], a[N], b[N], c[N];
    #pragma omp target data map(alloc:tmp[:N]) \
                            map(to:a[:N],b[:N]) \
                            map(tofrom:c[:N])
    {
        zeros(tmp, N);
        compute_kernel_1(tmp, a, N); // uses target
        saxpy(2.0f, tmp, b, N);
        compute_kernel_2(tmp, b, N); // uses target
        saxpy(2.0f, c, tmp, N);
    }
}

17
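The target enter data / target exit data constructs mentioned above create an unstructured device data environment. A minimal sketch (not from the slides), with hypothetical kernel bodies:

void pipeline(float* data, int n) {
    #pragma omp target enter data map(to: data[0:n])   // allocate on device and copy once

    #pragma omp target teams distribute parallel for   // reuses the resident device copy
    for (int i = 0; i < n; i++)
        data[i] *= 2.0f;

    #pragma omp target teams distribute parallel for   // still no host-device transfer
    for (int i = 0; i < n; i++)
        data[i] += 1.0f;

    #pragma omp target exit data map(from: data[0:n])  // copy back and release
}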
target data Construct Syntax
◼ Create scoped data environment and transfer data from the host to the device and back
◼ Syntax (C/C++)
#pragma omp target data [clause[[,] clause],…]
structured-block
◼ Syntax (Fortran)
!$omp target data [clause[[,] clause],…]
structured-block
!$omp end target data
◼ Clauses
device(scalar-integer-expression)
map([{alloc | to | from | tofrom | release | delete}:] list)
if(scalar-expr)

18
target update Construct Syntax
◼ Issue data transfers to or from an existing device data environment
◼ Syntax (C/C++)
#pragma omp target update [clause[[,] clause],…]

◼ Syntax (Fortran)
!$omp target update [clause[[,] clause],…]

◼ Clauses
device(scalar-integer-expression)
to(list)
from(list)
if(scalar-expr)

19
Example: target data and target update

#pragma omp target data device(0) map(alloc:tmp[:N]) map(to:input[:N]) map(from:res)
{
    #pragma omp target device(0)
    #pragma omp parallel for
    for (i=0; i<N; i++)
        tmp[i] = some_computation(input[i], i);

    update_input_array_on_the_host(input);

    #pragma omp target update device(0) to(input[:N])

    #pragma omp target device(0)
    #pragma omp parallel for reduction(+:res)
    for (i=0; i<N; i++)
        res += final_computation(input[i], tmp[i], i);
}

20
Asynchronous Offloads
◼ OpenMP target constructs are synchronous by default
▪ The encountering host thread awaits the end of the target region before continuing
▪ The nowait clause makes the target constructs asynchronous (in OpenMP speak: they become
an OpenMP task)

#pragma omp task depend(out:a)
    init_data(a);

#pragma omp target map(to:a[:N]) map(from:x[:N]) nowait depend(in:a) depend(out:x)
    compute_1(a, x, N);

#pragma omp target map(to:b[:N]) map(from:z[:N]) nowait depend(out:y)
    compute_3(b, z, N);

#pragma omp target map(to:y[:N]) map(to:z[:N]) nowait depend(in:x) depend(in:y)
    compute_4(z, x, y, N);

#pragma omp taskwait

21
Case Study: NWChem TCE CCSD(T)

TCE: Tensor Contraction Engine
CCSD(T): Coupled-Cluster with Single, Double, and perturbative Triple replacements
22
NWChem
◼ Computational chemistry software package
▪ Quantum chemistry
▪ Molecular dynamics
◼ Designed for large-scale supercomputers
◼ Developed at the EMSL at PNNL
▪ EMSL: Environmental Molecular Sciences Laboratory
▪ PNNL: Pacific Northwest National Lab
◼ URL: http://www.nwchem-sw.org

23
Finding Offload Candidates
◼ Requirements for offload candidates
▪ Compute-intensive code regions (kernels)
▪ Highly parallel
▪ Compute scaling stronger than data transfer, e.g., compute O(n³) vs. data size O(n²) (see the sketch below)

24
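As an illustration of the last point (not from the slides): a dense matrix multiplication performs O(n³) work on O(n²) data, so the transfer cost is amortized; the kernel below is a sketch only:

void matmul(float* a, float* b, float* c, int n) {
    #pragma omp target teams distribute parallel for collapse(2) \
            map(to: a[0:n*n], b[0:n*n]) map(tofrom: c[0:n*n])
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = c[i*n + j];
            for (int k = 0; k < n; k++)        // O(n^3) work vs. O(n^2) data moved
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}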
Example Kernel (1 of 27 in total)

      subroutine sd_t_d1_1(h3d,h2d,h1d,p6d,p5d,p4d,
     1   h7d,triplesx,t2sub,v2sub)
c     Declarations omitted.
      double precision triplesx(h3d*h2d,h1d,p6d,p5d,p4d)
      double precision t2sub(h7d,p4d,p5d,h1d)
      double precision v2sub(h3d*h2d,p6d,h7d)
!$omp target                  ! "presence?(triplesx,t2sub,v2sub)"
!$omp teams distribute parallel do private(p4,p5,p6,h2,h3,h1,h7)
      do p4=1,p4d
       do p5=1,p5d
        do p6=1,p6d
         do h1=1,h1d
          do h7=1,h7d
           do h2h3=1,h3d*h2d
            triplesx(h2h3,h1,p6,p5,p4)=triplesx(h2h3,h1,p6,p5,p4)
     1        - t2sub(h7,p4,p5,h1)*v2sub(h2h3,p6,h7)
           end do
          end do
         end do
        end do
       end do
      end do
!$omp end teams distribute parallel do
!$omp end target
      end subroutine

◼ All kernels have the same structure
◼ 7 perfectly nested loops
◼ Some kernels contain an inner product loop (then, 6 perfectly nested loops)
◼ Trip count per loop is equal to the "tile size" (20-30 in production)
◼ Naïve data allocation (tile size 24)
▪ Per-array transfer for each target construct
▪ triplesx: 1458 MB (1.5 GB transferred host to device and device to host)
▪ t2sub, v2sub: 2.5 MB each

25
Invoking the Kernels / Data Management

◼ Simplified pseudo-code:

!$omp target enter data alloc(triplesx(1:tr_size))
c     for all tiles
      do ...
        call zero_triplesx(triplesx)
        do ...
          call comm_and_sort(t2sub, v2sub)
!$omp target data map(to:t2sub(t2_size)) map(to:v2sub(v2_size))
          if (...)
            call sd_t_d1_1(h3d,h2d,h1d,p6d,p5d,p4d,h7,triplesx,t2sub,v2sub)
          end if
c         same for sd_t_d1_2 until sd_t_d1_9
!$omp end target data
        end do
        do ...
c         Similar structure for sd_t_d2_1 until sd_t_d2_9, incl. target data
        end do
        call sum_energy(energy, triplesx)
      end do
!$omp target exit data release(triplesx(1:size))

◼ Reduced data transfers:
▪ triplesx: allocated once (1.5 GB), always kept on the target device
▪ t2sub, v2sub: allocated after communication; the 2 x 2.5 MB update is kept for (potentially) multiple kernel invocations

26
Invoking the Kernels / Data Management

◼ Same simplified pseudo-code and kernel as on the previous two slides: triplesx is allocated once with target enter data, and t2sub/v2sub are mapped by the enclosing target data region.
◼ Inside sd_t_d1_1, the presence check on the target construct determines that all three arrays have already been allocated in the device data environment, so no additional transfers are issued.

27
Advanced Task Synchronization
Asynchronous API Interaction
◼ Some APIs are based on asynchronous operations
▪ MPI asynchronous send and receive
▪ Asynchronous I/O
▪ HIP, CUDA and OpenCL stream-based offloading
▪ In general: any other API/model that executes asynchronously with OpenMP (tasks)
◼ Example: CUDA memory transfers
do_something();
cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
do_something_else();
cudaStreamSynchronize(stream);
do_other_important_stuff(dst);

◼ Programmers need a mechanism to marry asynchronous APIs with the parallel task model of
OpenMP
▪ How to synchronize completion events with task execution?

29
Try 1: Use just OpenMP Tasks

void cuda_example() {
    #pragma omp task // task A
    {
        do_something();
        cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
    }
    #pragma omp task // task B
    {
        do_something_else();
    }
    #pragma omp task // task C
    {
        cudaStreamSynchronize(stream);
        do_other_important_stuff(dst);
    }
}

Race condition between tasks A & C: task C may start execution before task A enqueues the memory transfer.

◼ This solution does not work!

30
Try 2: Use just OpenMP Task Dependences

void cuda_example() {
    #pragma omp task depend(out:stream) // task A
    {
        do_something();
        cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
    }
    #pragma omp task // task B
    {
        do_something_else();
    }
    #pragma omp task depend(in:stream) // task C
    {
        cudaStreamSynchronize(stream);
        do_other_important_stuff(dst);
    }
}

Synchronize the execution of tasks through the dependence. This may work, but task C will be blocked waiting for the data transfer to finish.

◼ This solution may work, but
▪ takes a thread away from execution while the system is handling the data transfer.
▪ may be problematic if the called interface is not thread-safe.

31
OpenMP Detachable Tasks
◼ OpenMP 5.0 introduces the concept of a detachable task
▪ Task can detach from executing thread without being “completed”
▪ Regular task synchronization mechanisms can be applied to await completion of a
detached task
▪ Runtime API to complete a task

◼ Detached task events: omp_event_t datatype


◼ Detached task clause: detach(event)
◼ Runtime API: void omp_fulfill_event(omp_event_t *event)

32
Detaching Tasks

omp_event_t *event;
void detach_example() {
    #pragma omp task detach(event)
    {
        important_code();
    }
    #pragma omp taskwait
}

Some other thread/task: omp_fulfill_event(event);

1. Task detaches.
2. taskwait construct cannot complete.
3. Some other thread/task signals the event for completion.
4. Task completes and taskwait can continue.

33
Putting It All Together

void CUDART_CB callback(cudaStream_t stream, cudaError_t status, void *cb_data) {
    omp_fulfill_event((omp_event_t *) cb_data);
}

void cuda_example() {
    omp_event_t *cuda_event;
    #pragma omp task detach(cuda_event) // task A
    {
        do_something();
        cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamAddCallback(stream, callback, cuda_event, 0);
    }
    #pragma omp task // task B
        do_something_else();
    #pragma omp taskwait
    #pragma omp task // task C
    {
        do_other_important_stuff(dst);
    }
}

1. Task A detaches.
2. taskwait does not continue.
3. When the memory transfer completes, the callback is invoked to signal the event for task completion.
4. taskwait continues, task C executes.

34
Removing the taskwait Construct

void CUDART_CB callback(cudaStream_t stream, cudaError_t status, void *cb_data) {
    omp_fulfill_event((omp_event_t *) cb_data);
}

void cuda_example() {
    omp_event_t *cuda_event;
    #pragma omp task depend(out:dst) detach(cuda_event) // task A
    {
        do_something();
        cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamAddCallback(stream, callback, cuda_event, 0);
    }
    #pragma omp task // task B
        do_something_else();
    #pragma omp task depend(in:dst) // task C
    {
        do_other_important_stuff(dst);
    }
}

1. Task A detaches; task C will not execute because of its unfulfilled dependency on A.
2. When the memory transfer completes, the callback is invoked to signal the event for task completion.
3. Task A completes and C's dependency is fulfilled.

35
Summary
◼ The OpenMP API is ready to use discrete GPUs for offloading compute
▪ Mature offload model w/ support for asynchronous offload/transfer
▪ Tightly integrates with OpenMP multi-threading on the host
◼ More advanced features (not covered here)
▪ Memory management API
▪ Interoperability with native data management
▪ Interoperability with native streaming interfaces
▪ Unified shared memory support

36
Visit www.openmp.org for more information
Tools for OpenMP Programming

Advanced OpenMP
1
OpenMP Tools
◼ Correctness Tools
→ThreadSanitizer

→Intel Inspector XE (or whatever the current name is)

◼ Performance Analysis
→Performance Analysis basics

→Overview on available tools

Advanced OpenMP
2
Data Race
◼ Data Race: the typical OpenMP programming error, when:
→two or more threads access the same memory location, and

→at least one of these accesses is a write, and

→the accesses are not protected by locks or critical regions, and

→the accesses are not synchronized, e.g. by a barrier.


◼ Non-deterministic occurrence: e.g. the sequence of the execution of
parallel loop iterations is non-deterministic
→In many cases private clauses, barriers or critical regions are missing
◼ Data races are hard to find using a traditional debugger
Advanced OpenMP
3
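A minimal illustrative example of such a race (not from the slides): all threads update the same shared counter without protection, so increments can be lost and the printed result varies between runs:

#include <omp.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    #pragma omp parallel for            // data race: unsynchronized read-modify-write of count
    for (int i = 0; i < 100000; i++)
        count++;
    printf("count = %d\n", count);      // often less than 100000
    return 0;
}

Using reduction(+:count) on the loop, or an atomic construct around the update, removes the race.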
ThreadSanitizer: Overview
◼ Correctness checking for threaded applications

◼ Integrated in clang and gcc compiler

◼ Low runtime overhead: 2x – 15x

◼ Used to find data races in browsers like Chrome and Firefox

Advanced OpenMP
4
ThreadSanitizer: Usage

Module in Aachen: module load clang          (https://pruners.github.io)

• Compile the program with the clang compiler:

  C:        clang -fsanitize=thread -fopenmp -g myprog.c -o myprog
  C++:      clang++ -fsanitize=thread -fopenmp -g myprog.cpp -o myprog
  Fortran:  gfortran -fsanitize=thread -fopenmp -g myprog.f -c
            clang -fsanitize=thread -fopenmp -lgfortran myprog.o -o myprog

• Execute:
  OMP_NUM_THREADS=4 ./myprog

• Understand and correct the detected threading errors


Advanced OpenMP
5
ThreadSanitizer: Example

 1 #include <stdio.h>
 2
 3 int main(int argc, char **argv) {
 4   int a = 0;
 5 #pragma omp parallel
 6 {
 7   if (a < 100) {
 8 #pragma omp critical
 9     a++;
10   }
11 }
12 }

WARNING: ThreadSanitizer: data race
  Read of size 4 at 0x7fffffffdcdc by thread T2:
    #0 .omp_outlined. race.c:7 (race+0x0000004a6dce)
    #1 __kmp_invoke_microtask <null> (libomp_tsan.so)

  Previous write of size 4 at 0x7fffffffdcdc by main thread:
    #0 .omp_outlined. race.c:9 (race+0x0000004a6e2c)
    #1 __kmp_invoke_microtask <null> (libomp_tsan.so)

Advanced OpenMP
6
Intel Inspector XE
◼ Detection of
→Memory Errors

→Deadlocks

→Data Races
◼ Support for
→WIN32-Threads, Posix-Threads, Intel Threading Building Blocks and OpenMP
◼ Features
→Binary instrumentation gives full functionality

→Independent stand-alone GUI for Windows and Linux


Advanced OpenMP
7
PI example / 1

π = ∫₀¹ 4 / (1 + x²) dx

double f(double x)
{
    return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;
    #pragma omp parallel for private(fX,i) reduction(+:fSum)
    for (i = 0; i < n; i++)
    {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);
    }
    return fH * fSum;
}

Advanced OpenMP
8
PI example / 2

double f(double x)
{
    return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;
    #pragma omp parallel for private(fX,i) reduction(+:fSum)   // What if we had forgotten this clause?
    for (i = 0; i < n; i++)
    {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);
    }
    return fH * fSum;
}

Advanced OpenMP
9
Inspector XE: create project / 1
$ module load Inspector ; inspxe-gui

Advanced OpenMP
10
Inspector XE: create project / 2
- ensure that multiple threads are used
- choose a small dataset (really!),
execution time can increase
10X – 1000X

Advanced OpenMP
11
Inspector XE: configure analysis
Threading Error Analysis Modes (increasing detail and overhead from 1 to 3):
1. Detect Deadlocks
2. Detect Deadlocks and Data Races
3. Locate Deadlocks and Data Races

Advanced OpenMP
12
Inspector XE: results / 1

[Screenshot: (1) detected problems, (2) filters, (3) code location, (4) timeline]

Advanced OpenMP
13
Inspector XE: results / 2

[Screenshot: (1) source code producing the issue – a double click opens an editor, (2) corresponding call stack]

Advanced OpenMP
14
Inspector XE: results / 3

[Screenshot: (1) source code producing the issue – a double click opens an editor, (2) corresponding call stack]
The missing reduction is detected.

Advanced OpenMP
15
Sampling vs. Instrumentation

Sampling
◼ Running program is periodically interrupted to take a measurement
◼ Statistical inference of program behavior
◼ Works with unmodified executables
[Figure: timeline of a run through main/foo/bar/baz with measurements taken only at periodic sample points t1 … t9]

Instrumentation
◼ Every event of interest is captured directly
◼ More detailed and exact information
◼ Typically: recompile for instrumentation
[Figure: the same timeline with a measurement at every function entry and exit, t1 … t14]
Advanced OpenMP
16
Tracing vs. Profiling

Trace
◼ Chronologically ordered sequence of event records

Profile (from instrumentation or from sampling)
◼ Aggregated information

[Figure: timeline of main/foo/bar/baz; a trace records each event in order, while a profile aggregates the time per function, built either from instrumentation events (t1 … t14) or from periodic samples (t1 … t9)]
Advanced OpenMP
17
OMPT support for sampling

◼ OMPT defines states like barrier-wait, work-serial or work-parallel
→ Allows to collect OMPT state statistics in the profile
→ Profile break-down for different OMPT states

◼ OMPT provides frame information
→ Allows to identify OpenMP runtime frames
→ Runtime frames can be eliminated from call trees

void foo() {}
void bar() { foo(); }
void baz() { bar(); }
int main() { foo(); bar(); baz(); return 0; }

[Figure: sampled timeline of main/foo/bar/baz with measurements at t1 … t9]
Advanced OpenMP
18
OMPT support for instrumentation

◼ OMPT provides event callbacks
→ Parallel begin / end
→ Implicit task begin / end
→ Barrier / taskwait
→ Task create / schedule

◼ Tool can instrument those callbacks (see the sketch after this slide)

◼ OpenMP-only instrumentation might be sufficient for some use-cases

void foo() {}
void bar() {
    #pragma omp task
    foo();
}
void baz() {
    #pragma omp task
    bar();
}
int main() {
    #pragma omp parallel sections
    { foo(); bar(); baz(); }
    return 0;
}

Advanced OpenMP
19
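A minimal sketch of a first-party tool registering one of these callbacks (not from the slides; it assumes the OpenMP 5.0 omp-tools.h interface and only prints a message per parallel region):

#include <omp-tools.h>
#include <stdio.h>

// Invoked by the runtime at every parallel-region begin.
static void on_parallel_begin(ompt_data_t *task_data, const ompt_frame_t *task_frame,
                              ompt_data_t *parallel_data, unsigned int requested_parallelism,
                              int flags, const void *codeptr_ra) {
    printf("parallel region requested %u threads\n", requested_parallelism);
}

static int my_init(ompt_function_lookup_t lookup, int initial_device_num,
                   ompt_data_t *tool_data) {
    ompt_set_callback_t set_callback = (ompt_set_callback_t) lookup("ompt_set_callback");
    set_callback(ompt_callback_parallel_begin, (ompt_callback_t) on_parallel_begin);
    return 1;                            // non-zero keeps the tool active
}

static void my_fini(ompt_data_t *tool_data) {}

// The OpenMP runtime looks for this symbol at program start.
ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version) {
    static ompt_start_tool_result_t result = { &my_init, &my_fini, {0} };
    return &result;
}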
VI-HPS Tools / 1
◼ Virtual institute – high productivity supercomputing

◼ Tool development

◼ Training:
→ VI-HPS/PRACE tuning workshop series

→ SC/ISC tutorials

◼ Many performance tools available under vi-hps.org
→ vi-hps.org → Tools → VI-HPS Tools Guide
→ Tools Guide: flyer with a 2-page summary for each tool

Advanced OpenMP
20
VI-HPS Tools / 2
Data collection
◼ Score-P : instrumentation based profiling / tracing
◼ Extrae : instrumentation based profiling / tracing

Data processing
◼ Scalasca : trace-based analysis

Data presentation
◼ ARM Map, ARM performance report
◼ CUBE : display for profile information
◼ Vampir : display for trace data (commercial/test)
◼ Paraver : display for extrae data
◼ Tau : visualization

Advanced OpenMP
21
Performance tools GUI

[Screenshot: HPC Toolkit GUI]

Advanced OpenMP
22
Summary
Correctness:
◼ Data Races are very hard to find, since they do not show up every program run.
◼ Intel Inspector XE or ThreadSanitizer help a lot in finding these errors.
◼ Use really small datasets, since the runtime increases significantly.

Performance:
◼ Start with simple performance measurements like hotspots analyses and then focus
on these hot spots.
◼ In OpenMP applications analyze the waiting time of threads. Is the waiting time
balanced?
◼ Hardware counters might help for a better understanding of an application, but they
might be hard to interpret.
Advanced OpenMP
23
OpenMP Parallel Loops

1 Advanced OpenMP
loop Construct

◼ Existing loop constructs are tightly bound to the execution model:

  #pragma omp parallel for          #pragma omp simd                  #pragma omp taskloop
  for (i=0; i<N;++i) {…}            for (i=0; i<N;++i) {…}            for (i=0; i<N;++i) {…}

[Figure: parallel for → fork, distribute work, barrier, join; taskloop → generate tasks, …, taskwait]

◼ The loop construct is meant to tell OpenMP about truly parallel semantics of a loop.

2 Advanced OpenMP
OpenMP Fully Parallel Loops

int main(int argc, const char* argv[]) {
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Define scalars n, a, b & initialize x, y

    #pragma omp parallel
    {
        #pragma omp loop
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
    }
}

3 Advanced OpenMP
loop Constructs, Syntax
◼ Syntax (C/C++)
#pragma omp loop [clause[[,] clause],…]
for-loops

◼ Syntax (Fortran)
!$omp loop [clause[[,] clause],…]
do-loops
[!$omp end loop]

4 Advanced OpenMP
loop Constructs, Clauses
◼ bind(binding)
→ Binding region the loop construct should bind to
→ One of: teams, parallel, thread

◼ order(concurrent)
→ Tell the OpenMP compiler that the loop can be executed in any order.
→ Default!

◼ collapse(n)
◼ private(list)
◼ lastprivate(list)
◼ reduction(reduction-id:list)

5 Advanced OpenMP
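A small illustrative sketch (not from the slides) of the bind clause: the loop binds to the enclosing teams region, so its iterations are distributed across the league of teams:

void saxpy_loop(float a, float* x, float* y, int n) {
    #pragma omp target teams map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp loop bind(teams)
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}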
Extensions to Existing Constructs
◼ Existing loop constructs have been extended to also have truly parallel
semantics.

◼ C/C++ Worksharing:
#pragma omp [for|simd] order(concurrent) \
[clause[[,] clause],…]
for-loops

◼ Fortran Worksharing:
!$omp [do|simd] order(concurrent) &
[clause[[,] clause],…]
do-loops
[!$omp end [do|simd]]

6 Advanced OpenMP
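A brief sketch (not from the slides) of the extended worksharing loop: order(concurrent) asserts that the iterations may execute in any order, which gives the implementation freedom to reorder or vectorize them:

void scale(float a, float* y, int n) {
    #pragma omp parallel for order(concurrent)
    for (int i = 0; i < n; ++i)
        y[i] = a * y[i];
}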
DOACROSS Loops

7 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
DOACROSS Loops
◼ “DOACROSS” loops are loops with special loop schedules
→ Restricted form of loop-carried dependencies
→ Require fine-grained synchronization protocol for parallelism

◼ Loop-carried dependency:
→ Loop iterations depend on each other
→ Source of the dependency must be scheduled before the sink of the dependency

◼ DOACROSS loop:
→ Data dependency is an invariant for the execution of the whole loop nest

8 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
Parallelizable Loops

◼ A parallel loop cannot have any loop-carried dependencies (simplified just a little bit!)

for (int i = 1; i < N; ++i) {
    for (int j = 1; j < M; ++j) {
        b[i][j] = f(b[i][j],
                    b[i][j], a[i][j]);
    }
}

[Figure: (i, j) iteration space split between Thread 1 and Thread 2; execution order and dependencies stay within each iteration]
9 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS
Michael Klemm
Non-parallelizable Loops

◼ If there is a loop-carried dependency, a loop cannot be parallelized anymore ("easily", that is)

for (int i = 1; i < N; ++i) {
    for (int j = 1; j < M; ++j) {
        b[i][j] = f(b[i-1][j],
                    b[i][j-1], a[i][j]);
    }
}

[Figure: (i, j) iteration space split between Thread 1 and Thread 2; dependencies cross the thread boundary, so parallel execution is erroneous]
10 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS
Michael Klemm
Wavefront-Parallel Loops

◼ If the data dependency is invariant, then skewing the loop helps remove the data dependency

for (int i = 1; i < N; ++i) {
    for (int j = i+1; j < i+N; ++j) {
        b[i][j-i] = f(b[i-1][j-i],
                      b[i][j-i-1], a[i][j]);
    }
}

[Figure: skewed (i, j) iteration space split between Thread 1 and Thread 2; execution order and dependencies shown]
11 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS
Michael Klemm
DOACROSS Loops with OpenMP
◼ OpenMP 4.5 extends the notion of the ordered construct to describe loop-carried
dependencies
◼ Syntax (C/C++):
#pragma omp for ordered(d) [clause[[,] clause],…]
for-loops
and
#pragma omp ordered [clause[[,] clause],…]
where clause is one of the following:
depend(source)
depend(sink:vector)
◼ Syntax (Fortran):
!$omp do ordered(d) [clause[[,] clause],…]
do-loops
!$omp ordered [clause[[,] clause],…]
12 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS
Michael Klemm
Example
◼ The ordered clause tells the compiler about loop-carried dependencies and their
distances
#pragma omp parallel for ordered(2)
for (int i = 1; i < N; ++i) {
for (int j = 1; j < M; ++j) {
#pragma omp ordered depend(sink:i-1,j) depend(sink:i,j-1)
b[i][j] = f(b[i-1][j],
b[i][j-1], a[i][j]);
}
#pragma omp ordered depend(source)
}

13 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
Example: 3D Gauss-Seidel
#pragma omp for ordered(2) private(j,k)
for (i = 1; i < N-1; ++i) {
for (j = 1; j < N-1; ++j) {
#pragma omp ordered depend(sink: i-1,j-1) depend(sink: i-1,j) \
depend(sink: i-1,j+1) depend(sink: i,j-1)
for (k = 1; k < N-1; ++k) {
double tmp1 = (p[i-1][j-1][k-1] + p[i-1][j-1][k] + p[i-1][j-1][k+1]
+ p[i-1][j][k-1] + p[i-1][j][k] + p[i-1][j][k+1]
+ p[i-1][j+1][k-1] + p[i-1][j+1][k] + p[i-1][j+1][k+1]);
double tmp2 = (p[i][j-1][k-1] + p[i][j-1][k] + p[i][j-1][k+1]
+ p[i][j][k-1] + p[i][j][k] + p[i][j][k+1]
+ p[i][j+1][k-1] + p[i][j+1][k] + p[i][j+1][k+1]);
double tmp3 = (p[i+1][j-1][k-1] + p[i+1][j-1][k] + p[i+1][j-1][k+1]
+ p[i+1][j][k-1] + p[i+1][j][k] + p[i+1][j][k+1]
+ p[i+1][j+1][k-1] + p[i+1][j+1][k] + p[i+1][j+1][k+1]);
p[i][j][k] = (tmp1 + tmp2 + tmp3) / 27.0;
}
#pragma omp ordered depend(source)
}
}

14 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
OpenMP Meta-Programming

15 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
The metadirective Directive
◼ Construct OpenMP directives for different OpenMP contexts
◼ Limited form of meta-programming for OpenMP directives and clauses

#pragma omp target map(to:v1,v2) map(from:v3)
#pragma omp metadirective \
        when( device={arch(nvptx)}: teams loop ) \
        default( parallel loop )
for (i = lb; i < ub; i++)
    v3[i] = v1[i] * v2[i];

!$omp begin metadirective &
!$omp&   when( implementation={unified_shared_memory}: target ) &
!$omp&   default( target map(mapper(vec_map),tofrom: vec) )
!$omp teams distribute simd
do i=1, vec%size()
    call vec(i)%work()
end do
!$omp end teams distribute simd
!$omp end metadirective

16 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
Nothing Directive

17 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
The nothing Directive
◼ The nothing directive makes meta programming a bit clearer and more flexible.
◼ If a certain criterion matches, the nothing directive can stand to indicate that no
(other) OpenMP directive should be used.
→ The nothing directive is implicitly added if no condition matches

!$omp begin metadirective &
!$omp&   when( implementation={unified_shared_memory}: &
!$omp&         target teams distribute parallel do simd ) &
!$omp&   default( nothing )
do i=1, vec%size()
    call vec(i)%work()
end do
!$omp end metadirective

18 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
Error Directive

19 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
Error Directive Syntax

◼ Syntax (C/C++)
#pragma omp error [clause[[,] clause],…]

◼ Syntax (Fortran)
!$omp error [clause[[,] clause],…]

◼ Clauses
one of: at(compilation), at(execution)
one of: severity(fatal), severity(warning)
message(msg-string)
20 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS
Michael Klemm
Error Directive
◼ Can be used to issue a warning or an error at compile time and runtime.
◼ Consider this a “directive version” of assert(), but with a bit more flexibility.

#pragma omp parallel
{
    if (omp_get_num_threads() % 2) {
        #pragma omp error at(execution) severity(warning) \
                message("Running on odd number of threads\n")
    }
    do_stuff_that_works_best_with_even_thread_count();
}

21 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
Error Directive
◼ Can be used to issue a warning or an error at compile time and runtime.
◼ Consider this a “directive version” of assert(), but with a bit more flexibility.
◼ More useful in combination with OpenMP metadirective

!$omp begin metadirective &
!$omp&   when( device={arch(fancy_processor)}: parallel ) &
!$omp&   default( error severity(fatal) at(compilation) &
!$omp&            message("No implementation available") )
call fancy_impl_for_fancy_processor()
!$omp end metadirective

22 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
OpenMP API Version 5.1 – State of the Union

Architecture Review Board

The mission of the OpenMP ARB (Architecture Review Board) is to standardize directive-based multi-language high-level parallelism that is performant, productive and portable.
Development Process of the Specification
◼ Modifications of the OpenMP specification follow a (strict) process:

Proposal → 1st vote → Impl. in LaTeX → 2nd vote → Verification → Merge to "mainline"

◼ Release process for specifications:

Editing Draft → Comment Draft → Corrections → ARB Approval → Final Draft

3
OpenMP Roadmap
◼ OpenMP has a well-defined roadmap:
▪ 5-year cadence for major releases
▪ One minor release in between
▪ (At least) one Technical Report (TR) with feature previews in every year

[Timeline Nov'17 – Nov'23: TR6, OpenMP 5.0, TR8, OpenMP 5.1, TR11*, OpenMP 6.0, with Public Comment Drafts (TR7, TR9, TR10, TR12) released in between]

* Numbers assigned to TRs may change if additional TRs are released.
4


OpenMP API Version 6.0 Outlook – Plans
◼ Better support for descriptive and prescriptive control
◼ More support for memory affinity and complex memory hierarchies
◼ Support for pipelining, other computation/data associations
◼ Continued improvements to device support
▪ Extensions of deep copy support (serialize/deserialize functions)
◼ Task-only, unshackled, or free-agent threads
◼ Event-driven parallelism

5
Printed OpenMP API Specification
◼ Save your printer ink and get the full
specification as a paperback book!
▪ Always have the spec in easy reach.
▪ Includes the entire specification with the same
pagination and line numbers as the PDF.
▪ Available at a near-wholesale price.

◼ Get yours at Amazon at


https://link.openmp.org/book51

6
Recent Books about OpenMP

[Book covers: one book covers all of the OpenMP 4.5 features (2017), the other introduces the OpenMP Common Core (2019)]
7
Help Us Shape the Future of OpenMP
◼ OpenMP continues to grow
▪ 33 members currently

◼ You can contribute to our annual releases

◼ Attend IWOMP, become a cOMPunity member

◼ OpenMP membership types now include less expensive memberships


▪ Please get in touch with me if you are interested
Visit www.openmp.org for more information
