OpenMP Workshop Day 3

Dr.-Ing. Michael Klemm
Chief Executive Officer
OpenMP Architecture Review Board
Agenda
◼ OpenMP Architecture Review Board
◼ Introduction to OpenMP Offload Features
◼ Case Study: NWChem TCE CCSD(T)
◼ Detachable Tasks

2
Introduction to
OpenMP Offload Features
Running Example for this Presentation: saxpy

void saxpy() {
    float a, x[SZ], y[SZ];
    // left out initialization
    double t = 0.0;
    double tb, te;                               // timing code (not needed, just to have
    tb = omp_get_wtime();                        // a bit more code to show)
    #pragma omp parallel for firstprivate(a)
    for (int i = 0; i < SZ; i++) {               // this is the code we want to execute
        y[i] = a * x[i] + y[i];                  // on a target device (i.e., a GPU)
    }
    te = omp_get_wtime();                        // timing code
    t = te - tb;
    printf("Time of kernel: %lf\n", t);
}

Don't do this at home! Use a BLAS library for this!
4
Device Model
◼ As of version 4.0, the OpenMP API supports accelerators/coprocessors
◼ Device model:
▪ One host for “traditional” multi-threading
▪ Multiple accelerators/coprocessors of the same kind for offloading

[Figure: one host connected to multiple accelerator devices]
5
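A minimal sketch (not from the slides) of how a host program can query this device model; omp_get_num_devices() and omp_is_initial_device() are standard OpenMP API routines, the rest is illustrative:

#include <omp.h>
#include <stdio.h>

int main(void) {
    int ndev = omp_get_num_devices();     // number of available non-host devices
    int on_host = 1;

    // Offload to the default device; without any device, the region falls back to the host.
    #pragma omp target map(from: on_host)
    on_host = omp_is_initial_device();

    printf("%d device(s); target region ran on the %s\n",
           ndev, on_host ? "host" : "device");
    return 0;
}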
Execution Model
◼ Offload region and data environment is lexically scoped
▪ Data environment is destroyed at closing curly brace
▪ Allocated buffers/data are automatically released

[Figure: host and device memory for pointer pA; (1) buffer allocated on the device, (2) "to" data copied host → device, (3) target region executes, (4) "from" data copied device → host]

#pragma omp target      \
        map(alloc: ...) \
        map(to: ...)    \
        map(from: ...)
{ ... }

6

OpenMP for Devices - Constructs


◼ Transfer control and data from the host to the device
◼ Syntax (C/C++)
#pragma omp target [clause[[,] clause],…]
structured-block
◼ Syntax (Fortran)
!$omp target [clause[[,] clause],…]
structured-block
!$omp end target
◼ Clauses
device(scalar-integer-expression)
map([{alloc | to | from | tofrom}:] list)
if(scalar-expr)
Example: saxpy

void saxpy() {
    float a, x[SZ], y[SZ];
    double t = 0.0;
    double tb, te;
    tb = omp_get_wtime();
    #pragma omp target                      // implicitly "map(tofrom:y[0:SZ])"
    for (int i = 0; i < SZ; i++) {
        y[i] = a * x[i] + y[i];
    }
    te = omp_get_wtime();
    t = te - tb;
    printf("Time of kernel: %lf\n", t);
}

The compiler identifies variables that are used in the target region.
All accessed arrays are copied from host to device and back.
[Figure: a, x[0:SZ], y[0:SZ] transferred host → target; x[0:SZ], y[0:SZ] transferred target → host]
Presence check: only transfer if not yet allocated on the device.
Copying x back is not necessary: it was not changed.

clang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908
8
Example: saxpy

subroutine saxpy(a, x, y, n)
  use iso_fortran_env
  integer :: n, i
  real(kind=real32) :: a
  real(kind=real32), dimension(n) :: x
  real(kind=real32), dimension(n) :: y

!$omp target                               ! implicitly "map(tofrom:y(1:n))"
  do i=1,n
    y(i) = a * x(i) + y(i)
  end do
!$omp end target
end subroutine

The compiler identifies variables that are used in the target region.
All accessed arrays are copied from host to device and back.
[Figure: a, x(1:n), y(1:n) transferred host → target; x(1:n), y(1:n) transferred target → host]
Presence check: only transfer if not yet allocated on the device.
Copying x back is not necessary: it was not changed.

flang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908
9
Example: saxpy

void saxpy() {
    double a, x[SZ], y[SZ];
    double t = 0.0;
    double tb, te;
    tb = omp_get_wtime();
    #pragma omp target map(to:x[0:SZ]) \
                       map(tofrom:y[0:SZ])
    for (int i = 0; i < SZ; i++) {
        y[i] = a * x[i] + y[i];
    }
    te = omp_get_wtime();
    t = te - tb;
    printf("Time of kernel: %lf\n", t);
}

[Figure: a, x[0:SZ], y[0:SZ] transferred host → target; only y[0:SZ] transferred target → host]

clang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908
10
Example: saxpy

The compiler cannot determine the size of the memory behind a pointer.
Programmers have to help the compiler with the size of the data transfer needed.

void saxpy(float a, float* x, float* y,
           int sz) {
    double t = 0.0;
    double tb, te;
    tb = omp_get_wtime();
    #pragma omp target map(to:x[0:sz]) \
                       map(tofrom:y[0:sz])
    for (int i = 0; i < sz; i++) {
        y[i] = a * x[i] + y[i];
    }
    te = omp_get_wtime();
    t = te - tb;
    printf("Time of kernel: %lf\n", t);
}

[Figure: a, x[0:sz], y[0:sz] transferred host → target; y[0:sz] transferred target → host]

clang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908
11
Creating Parallelism on the Target Device
◼ The target construct transfers the control flow to the target device
▪ Transfer of control is sequential and synchronous
▪ This is intentional!

◼ OpenMP separates offload and parallelism


▪ Programmers need to explicitly create parallel regions on the target device
▪ In theory, this can be combined with any OpenMP construct
▪ In practice, there is only a useful subset of OpenMP features for a target device such
as a GPU, e.g., no I/O, limited use of base language features.

12
Example: saxpy

void saxpy(float a, float* x, float* y,
           int sz) {
    #pragma omp target map(to:x[0:sz]) \
                       map(tofrom:y[0:sz])
    #pragma omp parallel for simd
    for (int i = 0; i < sz; i++) {
        y[i] = a * x[i] + y[i];
    }
}

GPUs are multi-level devices: SIMD, threads, thread blocks.
Create a team of threads to execute the loop in parallel using SIMD instructions.

clang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908
13
teams Construct
◼ Support multi-level parallel devices
◼ Syntax (C/C++):
#pragma omp teams [clause[[,] clause],…]
structured-block
◼ Syntax (Fortran):
!$omp teams [clause[[,] clause],…]
structured-block
◼ Clauses
num_teams(integer-expression), thread_limit(integer-expression)
default(shared | firstprivate | private | none)
private(list), firstprivate(list), shared(list), reduction(operator:list)

14
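A small illustrative sketch (not from the slides): limiting the league size with num_teams/thread_limit and recording which team handled each iteration; the clause values and the function name are arbitrary examples:

#include <omp.h>

void teams_info(int* team_of, int n) {
    // Each team writes its team number into the iterations it executes.
    #pragma omp target teams distribute num_teams(4) thread_limit(64) \
            map(from: team_of[0:n])
    for (int i = 0; i < n; i++)
        team_of[i] = omp_get_team_num();
}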
Multi-level Parallel saxpy
◼ Manual code transformation
▪ Tile the loops into an outer loop and an inner loop
▪ Assign the outer loop to “teams” (OpenCL: work groups)
▪ Assign the inner loop to the “threads” (OpenCL: work items)
void saxpy(float a, float* x, float* y, int sz) {
    #pragma omp target teams map(to:x[0:sz]) map(tofrom:y[0:sz])
    {
        int bs = sz / omp_get_num_teams();
        #pragma omp distribute
        for (int i = 0; i < sz; i += bs) {
            #pragma omp parallel for simd firstprivate(i,bs)
            for (int ii = i; ii < i + bs; ii++) {
                y[ii] = a * x[ii] + y[ii];
            }
        }
    }
}

15
Multi-level Parallel saxpy
◼ For convenience, OpenMP defines composite constructs to implement the required code transformations
void saxpy(float a, float* x, float* y, int sz) {
#pragma omp target teams distribute parallel for simd \
num_teams(num_blocks) map(to:x[0:sz]) map(tofrom:y[0:sz])
for (int i = 0; i < sz; i++) {
y[i] = a * x[i] + y[i];
}
}

subroutine saxpy(a, x, y, n)
! Declarations omitted
!$omp target teams distribute parallel do simd &
!$omp& num_teams(num_blocks) map(to:x) map(tofrom:y)
do i=1,n
y(i) = a * x(i) + y(i)
end do
!$omp end target teams distribute parallel do simd
end subroutine
16
Optimize Data Transfers
◼ Reduce the amount of time spent transferring data
▪ Use map clauses to enforce direction of data transfer.
▪ Use target data, target enter data, target exit data constructs to keep
data environment on the target device.
void zeros(float* a, int n) {
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++)
        a[i] = 0.0f;
}

void saxpy(float a, float* y, float* x, int n) {
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

void example() {
    float tmp[N], a[N], b[N], c[N];
    #pragma omp target data map(alloc:tmp[:N]) \
                            map(to:a[:N],b[:N]) \
                            map(tofrom:c[:N])
    {
        zeros(tmp, N);
        compute_kernel_1(tmp, a, N); // uses target
        saxpy(2.0f, tmp, b, N);
        compute_kernel_2(tmp, b, N); // uses target
        saxpy(2.0f, c, tmp, N);
    }
}

17
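The target enter data / target exit data constructs mentioned above create an unstructured device data environment. A minimal sketch (not from the slides), with hypothetical kernel bodies:

void pipeline(float* data, int n) {
    #pragma omp target enter data map(to: data[0:n])   // allocate on device and copy once

    #pragma omp target teams distribute parallel for   // reuses the resident device copy
    for (int i = 0; i < n; i++)
        data[i] *= 2.0f;

    #pragma omp target teams distribute parallel for   // still no host-device transfer
    for (int i = 0; i < n; i++)
        data[i] += 1.0f;

    #pragma omp target exit data map(from: data[0:n])  // copy back and release
}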
target data Construct Syntax
◼ Create scoped data environment and transfer data from the host to the device and back
◼ Syntax (C/C++)
#pragma omp target data [clause[[,] clause],…]
structured-block
◼ Syntax (Fortran)
!$omp target data [clause[[,] clause],…]
structured-block
!$omp end target data
◼ Clauses
device(scalar-integer-expression)
map([{alloc | to | from | tofrom | release | delete}:] list)
if(scalar-expr)

18
target update Construct Syntax
◼ Issue data transfers to or from an existing device data environment
◼ Syntax (C/C++)
#pragma omp target update [clause[[,] clause],…]

◼ Syntax (Fortran)
!$omp target update [clause[[,] clause],…]

◼ Clauses
device(scalar-integer-expression)
to(list)
from(list)
if(scalar-expr)

19
Example: target data and target update

#pragma omp target data device(0) map(alloc:tmp[:N]) map(to:input[:N]) map(from:res)
{
    #pragma omp target device(0)
    #pragma omp parallel for
    for (i=0; i<N; i++)
        tmp[i] = some_computation(input[i], i);

    update_input_array_on_the_host(input);

    #pragma omp target update device(0) to(input[:N])

    #pragma omp target device(0)
    #pragma omp parallel for reduction(+:res)
    for (i=0; i<N; i++)
        res += final_computation(input[i], tmp[i], i);
}

20
Asynchronous Offloads
◼ OpenMP target constructs are synchronous by default
▪ The encountering host thread awaits the end of the target region before continuing
▪ The nowait clause makes the target constructs asynchronous (in OpenMP speak: they become
an OpenMP task)

#pragma omp task depend(out:a)
    init_data(a);

#pragma omp target map(to:a[:N]) map(from:x[:N]) nowait depend(in:a) depend(out:x)
    compute_1(a, x, N);

#pragma omp target map(to:b[:N]) map(from:z[:N]) nowait depend(out:y)
    compute_3(b, z, N);

#pragma omp target map(to:y[:N]) map(to:z[:N]) nowait depend(in:x) depend(in:y)
    compute_4(z, x, y, N);

#pragma omp taskwait

21
Case Study: NWChem TCE CCSD(T)

TCE: Tensor Contraction Engine
CCSD(T): Coupled-Cluster with Single, Double, and perturbative Triple replacements
22
NWChem
◼ Computational chemistry software package
▪ Quantum chemistry
▪ Molecular dynamics
◼ Designed for large-scale supercomputers
◼ Developed at the EMSL at PNNL
▪ EMSL: Environmental Molecular Sciences Laboratory
▪ PNNL: Pacific Northwest National Lab
◼ URL: http://www.nwchem-sw.org

23
Finding Offload Candidates
◼ Requirements for offload candidates
▪ Compute-intensive code regions (kernels)
▪ Highly parallel
▪ Compute scaling stronger than data transfer, e.g., compute O(n³) vs. data size O(n²) (see the sketch below)

24
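As an illustration of the last point (not from the slides): a dense matrix multiplication performs O(n³) work on O(n²) data, so the transfer cost is amortized; the kernel below is a sketch only:

void matmul(float* a, float* b, float* c, int n) {
    #pragma omp target teams distribute parallel for collapse(2) \
            map(to: a[0:n*n], b[0:n*n]) map(tofrom: c[0:n*n])
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = c[i*n + j];
            for (int k = 0; k < n; k++)        // O(n^3) work vs. O(n^2) data moved
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}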
Example Kernel (1 of 27 in total)

      subroutine sd_t_d1_1(h3d,h2d,h1d,p6d,p5d,p4d,
     1   h7d,triplesx,t2sub,v2sub)
c     Declarations omitted.
      double precision triplesx(h3d*h2d,h1d,p6d,p5d,p4d)
      double precision t2sub(h7d,p4d,p5d,h1d)
      double precision v2sub(h3d*h2d,p6d,h7d)
!$omp target                  ! "presence?(triplesx,t2sub,v2sub)"
!$omp teams distribute parallel do private(p4,p5,p6,h2,h3,h1,h7)
      do p4=1,p4d
       do p5=1,p5d
        do p6=1,p6d
         do h1=1,h1d
          do h7=1,h7d
           do h2h3=1,h3d*h2d
            triplesx(h2h3,h1,p6,p5,p4)=triplesx(h2h3,h1,p6,p5,p4)
     1        - t2sub(h7,p4,p5,h1)*v2sub(h2h3,p6,h7)
           end do
          end do
         end do
        end do
       end do
      end do
!$omp end teams distribute parallel do
!$omp end target
      end subroutine

◼ All kernels have the same structure
◼ 7 perfectly nested loops
◼ Some kernels contain an inner product loop (then, 6 perfectly nested loops)
◼ Trip count per loop is equal to the "tile size" (20-30 in production)
◼ Naïve data allocation (tile size 24)
▪ Per-array transfer for each target construct
▪ triplesx: 1458 MB (1.5 GB transferred host to device and device to host)
▪ t2sub, v2sub: 2.5 MB each

25
Invoking the Kernels / Data Management

◼ Simplified pseudo-code:

!$omp target enter data alloc(triplesx(1:tr_size))
c     for all tiles
      do ...
        call zero_triplesx(triplesx)
        do ...
          call comm_and_sort(t2sub, v2sub)
!$omp target data map(to:t2sub(t2_size)) map(to:v2sub(v2_size))
          if (...)
            call sd_t_d1_1(h3d,h2d,h1d,p6d,p5d,p4d,h7,triplesx,t2sub,v2sub)
          end if
c         same for sd_t_d1_2 until sd_t_d1_9
!$omp end target data
        end do
        do ...
c         Similar structure for sd_t_d2_1 until sd_t_d2_9, incl. target data
        end do
        call sum_energy(energy, triplesx)
      end do
!$omp target exit data release(triplesx(1:size))

◼ Reduced data transfers:
▪ triplesx: allocated once (1.5 GB), always kept on the target device
▪ t2sub, v2sub: allocated after communication; the 2 x 2.5 MB update is kept for (potentially) multiple kernel invocations

26
Invoking the Kernels / Data Management

◼ Same simplified pseudo-code and kernel as on the previous two slides: triplesx is allocated once with target enter data, and t2sub/v2sub are mapped by the enclosing target data region.
◼ Inside sd_t_d1_1, the presence check on the target construct determines that all three arrays have already been allocated in the device data environment, so no additional transfers are issued.

27
Advanced Task Synchronization
Asynchronous API Interaction
◼ Some APIs are based on asynchronous operations
▪ MPI asynchronous send and receive
▪ Asynchronous I/O
▪ HIP, CUDA and OpenCL stream-based offloading
▪ In general: any other API/model that executes asynchronously with OpenMP (tasks)
◼ Example: CUDA memory transfers
do_something();
cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
do_something_else();
cudaStreamSynchronize(stream);
do_other_important_stuff(dst);

◼ Programmers need a mechanism to marry asynchronous APIs with the parallel task model of
OpenMP
▪ How to synchronize completion events with task execution?

29
Try 1: Use just OpenMP Tasks

void cuda_example() {
    #pragma omp task // task A
    {
        do_something();
        cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
    }
    #pragma omp task // task B
    {
        do_something_else();
    }
    #pragma omp task // task C
    {
        cudaStreamSynchronize(stream);
        do_other_important_stuff(dst);
    }
}

Race condition between tasks A & C: task C may start execution before task A enqueues the memory transfer.

◼ This solution does not work!

30
Try 2: Use just OpenMP Task Dependences

void cuda_example() {
    #pragma omp task depend(out:stream) // task A
    {
        do_something();
        cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
    }
    #pragma omp task // task B
    {
        do_something_else();
    }
    #pragma omp task depend(in:stream) // task C
    {
        cudaStreamSynchronize(stream);
        do_other_important_stuff(dst);
    }
}

Synchronize the execution of tasks through the dependence. This may work, but task C will be blocked waiting for the data transfer to finish.

◼ This solution may work, but
▪ takes a thread away from execution while the system is handling the data transfer.
▪ may be problematic if the called interface is not thread-safe.

31
OpenMP Detachable Tasks
◼ OpenMP 5.0 introduces the concept of a detachable task
▪ Task can detach from executing thread without being “completed”
▪ Regular task synchronization mechanisms can be applied to await completion of a
detached task
▪ Runtime API to complete a task

◼ Detached task events: omp_event_t datatype


◼ Detached task clause: detach(event)
◼ Runtime API: void omp_fulfill_event(omp_event_t *event)

32
Detaching Tasks

omp_event_t *event;
void detach_example() {
    #pragma omp task detach(event)
    {
        important_code();
    }
    #pragma omp taskwait
}

Some other thread/task: omp_fulfill_event(event);

1. Task detaches.
2. taskwait construct cannot complete.
3. Some other thread/task signals the event for completion.
4. Task completes and taskwait can continue.

33
Putting It All Together

void CUDART_CB callback(cudaStream_t stream, cudaError_t status, void *cb_data) {
    omp_fulfill_event((omp_event_t *) cb_data);
}

void cuda_example() {
    omp_event_t *cuda_event;
    #pragma omp task detach(cuda_event) // task A
    {
        do_something();
        cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamAddCallback(stream, callback, cuda_event, 0);
    }
    #pragma omp task // task B
        do_something_else();
    #pragma omp taskwait
    #pragma omp task // task C
    {
        do_other_important_stuff(dst);
    }
}

1. Task A detaches.
2. taskwait does not continue.
3. When the memory transfer completes, the callback is invoked to signal the event for task completion.
4. taskwait continues, task C executes.

34
Removing the taskwait Construct

void CUDART_CB callback(cudaStream_t stream, cudaError_t status, void *cb_data) {
    omp_fulfill_event((omp_event_t *) cb_data);
}

void cuda_example() {
    omp_event_t *cuda_event;
    #pragma omp task depend(out:dst) detach(cuda_event) // task A
    {
        do_something();
        cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamAddCallback(stream, callback, cuda_event, 0);
    }
    #pragma omp task // task B
        do_something_else();
    #pragma omp task depend(in:dst) // task C
    {
        do_other_important_stuff(dst);
    }
}

1. Task A detaches; task C will not execute because of its unfulfilled dependency on A.
2. When the memory transfer completes, the callback is invoked to signal the event for task completion.
3. Task A completes and C's dependency is fulfilled.

35
Summary
◼ The OpenMP API is ready to use discrete GPUs for offloading compute
▪ Mature offload model w/ support for asynchronous offload/transfer
▪ Tightly integrates with OpenMP multi-threading on the host
◼ More advanced features (not covered here)
▪ Memory management API
▪ Interoperability with native data management
▪ Interoperability with native streaming interfaces
▪ Unified shared memory support

36
Visit www.openmp.org for more information
Tools for OpenMP Programming

Advanced OpenMP
1
OpenMP Tools
◼ Correctness Tools
→ThreadSanitizer

→Intel Inspector XE (or whatever the current name is)

◼ Performance Analysis
→Performance Analysis basics

→Overview on available tools

Advanced OpenMP
2
Data Race
◼ Data Race: the typical OpenMP programming error, when:
→two or more threads access the same memory location, and

→at least one of these accesses is a write, and

→the accesses are not protected by locks or critical regions, and

→the accesses are not synchronized, e.g. by a barrier.


◼ Non-deterministic occurrence: e.g. the sequence of the execution of
parallel loop iterations is non-deterministic
→In many cases private clauses, barriers or critical regions are missing
◼ Data races are hard to find using a traditional debugger
Advanced OpenMP
3
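A minimal illustrative example of such a race (not from the slides): all threads update the same shared counter without protection, so increments can be lost and the printed result varies between runs:

#include <omp.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    #pragma omp parallel for            // data race: unsynchronized read-modify-write of count
    for (int i = 0; i < 100000; i++)
        count++;
    printf("count = %d\n", count);      // often less than 100000
    return 0;
}

Using reduction(+:count) on the loop, or an atomic construct around the update, removes the race.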
ThreadSanitizer: Overview
◼ Correctness checking for threaded applications

◼ Integrated in clang and gcc compiler

◼ Low runtime overhead: 2x – 15x

◼ Used to find data races in browsers like Chrome and Firefox

Advanced OpenMP
4
ThreadSanitizer: Usage

Module in Aachen: module load clang          (https://pruners.github.io)

• Compile the program with the clang compiler:

  C:        clang -fsanitize=thread -fopenmp -g myprog.c -o myprog
  C++:      clang++ -fsanitize=thread -fopenmp -g myprog.cpp -o myprog
  Fortran:  gfortran -fsanitize=thread -fopenmp -g myprog.f -c
            clang -fsanitize=thread -fopenmp -lgfortran myprog.o -o myprog

• Execute:
  OMP_NUM_THREADS=4 ./myprog

• Understand and correct the detected threading errors


Advanced OpenMP
5
ThreadSanitizer: Example

 1 #include <stdio.h>
 2
 3 int main(int argc, char **argv) {
 4   int a = 0;
 5 #pragma omp parallel
 6 {
 7   if (a < 100) {
 8 #pragma omp critical
 9     a++;
10   }
11 }
12 }

WARNING: ThreadSanitizer: data race
  Read of size 4 at 0x7fffffffdcdc by thread T2:
    #0 .omp_outlined. race.c:7 (race+0x0000004a6dce)
    #1 __kmp_invoke_microtask <null> (libomp_tsan.so)

  Previous write of size 4 at 0x7fffffffdcdc by main thread:
    #0 .omp_outlined. race.c:9 (race+0x0000004a6e2c)
    #1 __kmp_invoke_microtask <null> (libomp_tsan.so)

Advanced OpenMP
6
Intel Inspector XE
◼ Detection of
→Memory Errors

→Deadlocks

→Data Races
◼ Support for
→WIN32-Threads, Posix-Threads, Intel Threading Building Blocks and OpenMP
◼ Features
→Binary instrumentation gives full functionality

→Independent stand-alone GUI for Windows and Linux


Advanced OpenMP
7
PI example / 1

π = ∫₀¹ 4 / (1 + x²) dx

double f(double x)
{
    return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;
    #pragma omp parallel for private(fX,i) reduction(+:fSum)
    for (i = 0; i < n; i++)
    {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);
    }
    return fH * fSum;
}

Advanced OpenMP
8
PI example / 2

double f(double x)
{
    return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;
    #pragma omp parallel for private(fX,i) reduction(+:fSum)   // What if we had forgotten this clause?
    for (i = 0; i < n; i++)
    {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);
    }
    return fH * fSum;
}

Advanced OpenMP
9
Inspector XE: create project / 1
$ module load Inspector ; inspxe-gui

Advanced OpenMP
10
Inspector XE: create project / 2
- ensure that multiple threads are used
- choose a small dataset (really!),
execution time can increase
10X – 1000X

Advanced OpenMP
11
Inspector XE: configure analysis
Threading Error Analysis Modes (increasing detail and overhead from 1 to 3):
1. Detect Deadlocks
2. Detect Deadlocks and Data Races
3. Locate Deadlocks and Data Races

Advanced OpenMP
12
Inspector XE: results / 1

[Screenshot: (1) detected problems, (2) filters, (3) code location, (4) timeline]

Advanced OpenMP
13
Inspector XE: results / 2

[Screenshot: (1) source code producing the issue – a double click opens an editor, (2) corresponding call stack]

Advanced OpenMP
14
Inspector XE: results / 3

[Screenshot: (1) source code producing the issue – a double click opens an editor, (2) corresponding call stack]
The missing reduction is detected.

Advanced OpenMP
15
Sampling vs. Instrumentation

Sampling
◼ Running program is periodically interrupted to take a measurement
◼ Statistical inference of program behavior
◼ Works with unmodified executables
[Figure: timeline of a run through main/foo/bar/baz with measurements taken only at periodic sample points t1 … t9]

Instrumentation
◼ Every event of interest is captured directly
◼ More detailed and exact information
◼ Typically: recompile for instrumentation
[Figure: the same timeline with a measurement at every function entry and exit, t1 … t14]
Advanced OpenMP
16
Tracing vs. Profiling

Trace
◼ Chronologically ordered sequence of event records

Profile (from instrumentation or from sampling)
◼ Aggregated information

[Figure: timeline of main/foo/bar/baz; a trace records each event in order, while a profile aggregates the time per function, built either from instrumentation events (t1 … t14) or from periodic samples (t1 … t9)]
Advanced OpenMP
17
OMPT support for sampling

◼ OMPT defines states like barrier-wait, work-serial or work-parallel
→ Allows to collect OMPT state statistics in the profile
→ Profile break-down for different OMPT states

◼ OMPT provides frame information
→ Allows to identify OpenMP runtime frames
→ Runtime frames can be eliminated from call trees

void foo() {}
void bar() { foo(); }
void baz() { bar(); }
int main() { foo(); bar(); baz(); return 0; }

[Figure: sampled timeline of main/foo/bar/baz with measurements at t1 … t9]
Advanced OpenMP
18
OMPT support for instrumentation

◼ OMPT provides event callbacks
→ Parallel begin / end
→ Implicit task begin / end
→ Barrier / taskwait
→ Task create / schedule

◼ Tool can instrument those callbacks (see the sketch after this slide)

◼ OpenMP-only instrumentation might be sufficient for some use-cases

void foo() {}
void bar() {
    #pragma omp task
    foo();
}
void baz() {
    #pragma omp task
    bar();
}
int main() {
    #pragma omp parallel sections
    { foo(); bar(); baz(); }
    return 0;
}

Advanced OpenMP
19
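A minimal sketch of a first-party tool registering one of these callbacks (not from the slides; it assumes the OpenMP 5.0 omp-tools.h interface and only prints a message per parallel region):

#include <omp-tools.h>
#include <stdio.h>

// Invoked by the runtime at every parallel-region begin.
static void on_parallel_begin(ompt_data_t *task_data, const ompt_frame_t *task_frame,
                              ompt_data_t *parallel_data, unsigned int requested_parallelism,
                              int flags, const void *codeptr_ra) {
    printf("parallel region requested %u threads\n", requested_parallelism);
}

static int my_init(ompt_function_lookup_t lookup, int initial_device_num,
                   ompt_data_t *tool_data) {
    ompt_set_callback_t set_callback = (ompt_set_callback_t) lookup("ompt_set_callback");
    set_callback(ompt_callback_parallel_begin, (ompt_callback_t) on_parallel_begin);
    return 1;                            // non-zero keeps the tool active
}

static void my_fini(ompt_data_t *tool_data) {}

// The OpenMP runtime looks for this symbol at program start.
ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version) {
    static ompt_start_tool_result_t result = { &my_init, &my_fini, {0} };
    return &result;
}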
VI-HPS Tools / 1
◼ Virtual institute – high productivity supercomputing

◼ Tool development

◼ Training:
→ VI-HPS/PRACE tuning workshop series

→ SC/ISC tutorials

◼ Many performance tools available under vi-hps.org
→ vi-hps.org → Tools → VI-HPS Tools Guide
→ Tools Guide: flyer with a 2-page summary for each tool

Advanced OpenMP
20
VI-HPS Tools / 2
Data collection
◼ Score-P : instrumentation based profiling / tracing
◼ Extrae : instrumentation based profiling / tracing

Data processing
◼ Scalasca : trace-based analysis

Data presentation
◼ ARM Map, ARM performance report
◼ CUBE : display for profile information
◼ Vampir : display for trace data (commercial/test)
◼ Paraver : display for extrae data
◼ Tau : visualization

Advanced OpenMP
21
Performance tools GUI

[Screenshot: HPC Toolkit GUI]

Advanced OpenMP
22
Summary
Correctness:
◼ Data Races are very hard to find, since they do not show up every program run.
◼ Intel Inspector XE or ThreadSanitizer help a lot in finding these errors.
◼ Use really small datasets, since the runtime increases significantly.

Performance:
◼ Start with simple performance measurements like hotspots analyses and then focus
on these hot spots.
◼ In OpenMP applications analyze the waiting time of threads. Is the waiting time
balanced?
◼ Hardware counters might help for a better understanding of an application, but they
might be hard to interpret.
Advanced OpenMP
23
OpenMP Parallel Loops

1 Advanced OpenMP
loop Construct

◼ Existing loop constructs are tightly bound to the execution model:

  #pragma omp parallel for          #pragma omp simd                  #pragma omp taskloop
  for (i=0; i<N;++i) {…}            for (i=0; i<N;++i) {…}            for (i=0; i<N;++i) {…}

[Figure: parallel for → fork, distribute work, barrier, join; taskloop → generate tasks, …, taskwait]

◼ The loop construct is meant to tell OpenMP about truly parallel semantics of a loop.

2 Advanced OpenMP
OpenMP Fully Parallel Loops

int main(int argc, const char* argv[]) {
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Define scalars n, a, b & initialize x, y

    #pragma omp parallel
    {
        #pragma omp loop
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
    }
}

3 Advanced OpenMP
loop Constructs, Syntax
◼ Syntax (C/C++)
#pragma omp loop [clause[[,] clause],…]
for-loops

◼ Syntax (Fortran)
!$omp loop [clause[[,] clause],…]
do-loops
[!$omp end loop]

4 Advanced OpenMP
loop Constructs, Clauses
◼ bind(binding)
→ Binding region the loop construct should bind to
→ One of: teams, parallel, thread

◼ order(concurrent)
→ Tell the OpenMP compiler that the loop can be executed in any order.
→ Default!

◼ collapse(n)
◼ private(list)
◼ lastprivate(list)
◼ reduction(reduction-id:list)

5 Advanced OpenMP
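A small illustrative sketch (not from the slides) of the bind clause: the loop binds to the enclosing teams region, so its iterations are distributed across the league of teams:

void saxpy_loop(float a, float* x, float* y, int n) {
    #pragma omp target teams map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp loop bind(teams)
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}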
Extensions to Existing Constructs
◼ Existing loop constructs have been extended to also have truly parallel
semantics.

◼ C/C++ Worksharing:
#pragma omp [for|simd] order(concurrent) \
[clause[[,] clause],…]
for-loops

◼ Fortran Worksharing:
!$omp [do|simd] order(concurrent) &
[clause[[,] clause],…]
do-loops
[!$omp end [do|simd]]

6 Advanced OpenMP
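A brief sketch (not from the slides) of the extended worksharing loop: order(concurrent) asserts that the iterations may execute in any order, which gives the implementation freedom to reorder or vectorize them:

void scale(float a, float* y, int n) {
    #pragma omp parallel for order(concurrent)
    for (int i = 0; i < n; ++i)
        y[i] = a * y[i];
}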
DOACROSS Loops

7 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
DOACROSS Loops
◼ “DOACROSS” loops are loops with special loop schedules
→ Restricted form of loop-carried dependencies
→ Require fine-grained synchronization protocol for parallelism

◼ Loop-carried dependency:
→ Loop iterations depend on each other
→ Source of the dependency must be scheduled before the sink of the dependency

◼ DOACROSS loop:
→ Data dependency is an invariant for the execution of the whole loop nest

8 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
Parallelizable Loops

◼ A parallel loop cannot have any loop-carried dependencies (simplified just a little bit!)

for (int i = 1; i < N; ++i) {
    for (int j = 1; j < M; ++j) {
        b[i][j] = f(b[i][j],
                    b[i][j], a[i][j]);
    }
}

[Figure: (i, j) iteration space split between Thread 1 and Thread 2; execution order and dependencies stay within each iteration]
9 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS
Michael Klemm
Non-parallelizable Loops

◼ If there is a loop-carried dependency, a loop cannot be parallelized anymore ("easily", that is)

for (int i = 1; i < N; ++i) {
    for (int j = 1; j < M; ++j) {
        b[i][j] = f(b[i-1][j],
                    b[i][j-1], a[i][j]);
    }
}

[Figure: (i, j) iteration space split between Thread 1 and Thread 2; dependencies cross the thread boundary, so parallel execution is erroneous]
10 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS
Michael Klemm
Wavefront-Parallel Loops

◼ If the data dependency is invariant, then skewing the loop helps remove the data dependency

for (int i = 1; i < N; ++i) {
    for (int j = i+1; j < i+N; ++j) {
        b[i][j-i] = f(b[i-1][j-i],
                      b[i][j-i-1], a[i][j]);
    }
}

[Figure: skewed (i, j) iteration space split between Thread 1 and Thread 2; execution order and dependencies shown]
11 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS
Michael Klemm
DOACROSS Loops with OpenMP
◼ OpenMP 4.5 extends the notion of the ordered construct to describe loop-carried
dependencies
◼ Syntax (C/C++):
#pragma omp for ordered(d) [clause[[,] clause],…]
for-loops
and
#pragma omp ordered [clause[[,] clause],…]
where clause is one of the following:
depend(source)
depend(sink:vector)
◼ Syntax (Fortran):
!$omp do ordered(d) [clause[[,] clause],…]
do-loops
!$omp ordered [clause[[,] clause],…]
12 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS
Michael Klemm
Example
◼ The ordered clause tells the compiler about loop-carried dependencies and their
distances
#pragma omp parallel for ordered(2)
for (int i = 1; i < N; ++i) {
for (int j = 1; j < M; ++j) {
#pragma omp ordered depend(sink:i-1,j) depend(sink:i,j-1)
b[i][j] = f(b[i-1][j],
b[i][j-1], a[i][j]);
}
#pragma omp ordered depend(source)
}

13 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
Example: 3D Gauss-Seidel
#pragma omp for ordered(2) private(j,k)
for (i = 1; i < N-1; ++i) {
for (j = 1; j < N-1; ++j) {
#pragma omp ordered depend(sink: i-1,j-1) depend(sink: i-1,j) \
depend(sink: i-1,j+1) depend(sink: i,j-1)
for (k = 1; k < N-1; ++k) {
double tmp1 = (p[i-1][j-1][k-1] + p[i-1][j-1][k] + p[i-1][j-1][k+1]
+ p[i-1][j][k-1] + p[i-1][j][k] + p[i-1][j][k+1]
+ p[i-1][j+1][k-1] + p[i-1][j+1][k] + p[i-1][j+1][k+1]);
double tmp2 = (p[i][j-1][k-1] + p[i][j-1][k] + p[i][j-1][k+1]
+ p[i][j][k-1] + p[i][j][k] + p[i][j][k+1]
+ p[i][j+1][k-1] + p[i][j+1][k] + p[i][j+1][k+1]);
double tmp3 = (p[i+1][j-1][k-1] + p[i+1][j-1][k] + p[i+1][j-1][k+1]
+ p[i+1][j][k-1] + p[i+1][j][k] + p[i+1][j][k+1]
+ p[i+1][j+1][k-1] + p[i+1][j+1][k] + p[i+1][j+1][k+1]);
p[i][j][k] = (tmp1 + tmp2 + tmp3) / 27.0;
}
#pragma omp ordered depend(source)
}
}

14 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
OpenMP Meta-Programming

15 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
The metadirective Directive
◼ Construct OpenMP directives for different OpenMP contexts
◼ Limited form of meta-programming for OpenMP directives and clauses

#pragma omp target map(to:v1,v2) map(from:v3)
#pragma omp metadirective \
        when( device={arch(nvptx)}: teams loop ) \
        default( parallel loop )
for (i = lb; i < ub; i++)
    v3[i] = v1[i] * v2[i];

!$omp begin metadirective &
!$omp&   when( implementation={unified_shared_memory}: target ) &
!$omp&   default( target map(mapper(vec_map),tofrom: vec) )
!$omp teams distribute simd
do i=1, vec%size()
    call vec(i)%work()
end do
!$omp end teams distribute simd
!$omp end metadirective

16 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
Nothing Directive

17 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
The nothing Directive
◼ The nothing directive makes meta programming a bit clearer and more flexible.
◼ If a certain criterion matches, the nothing directive can stand to indicate that no
(other) OpenMP directive should be used.
→ The nothing directive is implicitly added if no condition matches

!$omp begin metadirective &
!$omp&   when( implementation={unified_shared_memory}: &
!$omp&         target teams distribute parallel do simd ) &
!$omp&   default( nothing )
do i=1, vec%size()
    call vec(i)%work()
end do
!$omp end metadirective

18 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
Error Directive

19 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
Error Directive Syntax

◼ Syntax (C/C++)
#pragma omp error [clause[[,] clause],…]

◼ Syntax (Fortran)
!$omp error [clause[[,] clause],…]

◼ Clauses
one of: at(compilation), at(execution)
one of: severity(fatal), severity(warning)
message(msg-string)
20 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS
Michael Klemm
Error Directive
◼ Can be used to issue a warning or an error at compile time and runtime.
◼ Consider this a “directive version” of assert(), but with a bit more flexibility.

#pragma omp parallel
{
    if (omp_get_num_threads() % 2) {
        #pragma omp error at(execution) severity(warning) \
                message("Running on odd number of threads\n")
    }
    do_stuff_that_works_best_with_even_thread_count();
}

21 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
Error Directive
◼ Can be used to issue a warning or an error at compile time and runtime.
◼ Consider this a “directive version” of assert(), but with a bit more flexibility.
◼ More useful in combination with OpenMP metadirective

!$omp begin metadirective &
!$omp&   when( device={arch(fancy_processor)}: parallel ) &
!$omp&   default( error severity(fatal) at(compilation) &
!$omp&            message("No implementation available") )
call fancy_impl_for_fancy_processor()
!$omp end metadirective

22 Advanced OpenMP Tutorial – Advanced Language Features: DOACROSS


Michael Klemm
OpenMP API Version 5.1 – State of the Union

Architecture Review Board

The mission of the OpenMP ARB (Architecture Review Board) is to standardize directive-based multi-language high-level parallelism that is performant, productive and portable.
Development Process of the Specification
◼ Modifications of the OpenMP specification follow a (strict) process:

Proposal → 1st vote → Impl. in LaTeX → 2nd vote → Verification → Merge to "mainline"

◼ Release process for specifications:

Editing Draft → Comment Draft → Corrections → ARB Approval → Final Draft

3
OpenMP Roadmap
◼ OpenMP has a well-defined roadmap:
▪ 5-year cadence for major releases
▪ One minor release in between
▪ (At least) one Technical Report (TR) with feature previews in every year

[Timeline Nov'17 – Nov'23: TR6, OpenMP 5.0, TR8, OpenMP 5.1, TR11*, OpenMP 6.0, with Public Comment Drafts (TR7, TR9, TR10, TR12) released in between]

* Numbers assigned to TRs may change if additional TRs are released.
4


OpenMP API Version 6.0 Outlook – Plans
◼ Better support for descriptive and prescriptive control
◼ More support for memory affinity and complex memory hierarchies
◼ Support for pipelining, other computation/data associations
◼ Continued improvements to device support
▪ Extensions of deep copy support (serialize/deserialize functions)
◼ Task-only, unshackled, or free-agent threads
◼ Event-driven parallelism

5
Printed OpenMP API Specification
◼ Save your printer ink and get the full
specification as a paperback book!
▪ Always have the spec in easy reach.
▪ Includes the entire specification with the same
pagination and line numbers as the PDF.
▪ Available at a near-wholesale price.

◼ Get yours at Amazon at


https://link.openmp.org/book51

6
Recent Books about OpenMP

[Book covers: one book covers all of the OpenMP 4.5 features (2017), the other introduces the OpenMP Common Core (2019)]
7
Help Us Shape the Future of OpenMP
◼ OpenMP continues to grow
▪ 33 members currently

◼ You can contribute to our annual releases

◼ Attend IWOMP, become a cOMPunity member

◼ OpenMP membership types now include less expensive memberships


▪ Please get in touch with me if you are interested
Visit www.openmp.org for more information
