OpenMP Workshop Day 3
Michael Klemm
Chief Executive Officer
OpenMP Architecture Review Board
Agenda
◼ OpenMP Architecture Review Board
◼ Introduction to OpenMP Offload Features
◼ Case Study: NWChem TCE CCSD(T)
◼ Detachable Tasks
Introduction to
OpenMP Offload Features
Running Example for this Presentation: saxpy
void saxpy() {
    float a, x[SZ], y[SZ];
    // left out initialization
    double t = 0.0;
    double tb, te;
    tb = omp_get_wtime();
    #pragma omp parallel for firstprivate(a)
    for (int i = 0; i < SZ; i++) {
        y[i] = a * x[i] + y[i];
    }
    te = omp_get_wtime();
    t = te - tb;
    printf("Time of kernel: %lf\n", t);
}
The timing code is not needed; it is only there to have a bit more code to show. The parallel loop is the code we want to execute on a target device (i.e., a GPU).
Accelerators
(Figure: a host system with attached accelerator devices.)
Execution Model
◼ Offload region and data environment are lexically scoped
▪ Data environment is destroyed at closing curly brace
▪ Allocated buffers/data are automatically released
(Figure: host and device timeline — a mapped pointer pA is allocated on the device, data is transferred to the device, the target region executes, and data is transferred back to the host before the buffers are released.)
Example: saxpy
The compiler identifies the variables used in the target region; a, x[0:SZ], and y[0:SZ] are transferred from host to device and back.

void saxpy() {
    float a, x[SZ], y[SZ];
    double t = 0.0;
    double tb, te;
    tb = omp_get_wtime();
    #pragma omp target   // implicitly maps the accessed data, e.g. “map(tofrom:y[0:SZ])”
    for (int i = 0; i < SZ; i++) {
        y[i] = a * x[i] + y[i];
    }
    te = omp_get_wtime();
    t = te - tb;
    printf("Time of kernel: %lf\n", t);
}

Presence check: only transfer x[0:SZ] if not yet allocated on the device. Copying x back is not necessary: it was not changed.
clang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908
Example: saxpy
The compiler identifies the variables used in the target region; all accessed arrays are copied from host to device and back.

subroutine saxpy(a, x, y, n)
  use iso_fortran_env
  integer :: n, i
  real(kind=real32) :: a
  real(kind=real32), dimension(n) :: x
  real(kind=real32), dimension(n) :: y
  !$omp target
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do
  !$omp end target
end subroutine

Presence check: only transfer x(1:n) if not yet allocated on the device. Copying x back is not necessary: it was not changed.
flang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908
Example: saxpy
Programmers have to help the compiler with the size of the data transfer needed:

void saxpy() {
    float a, x[SZ], y[SZ];
    double t = 0.0;
    double tb, te;
    tb = omp_get_wtime();
    #pragma omp target map(to:x[0:SZ]) \
                       map(tofrom:y[0:SZ])
    for (int i = 0; i < SZ; i++) {
        y[i] = a * x[i] + y[i];
    }
    te = omp_get_wtime();
    t = te - tb;
    printf("Time of kernel: %lf\n", t);
}

Transferred to the device: a, x[0:SZ], y[0:SZ]; transferred back to the host: y[0:SZ].
Example: saxpy
void saxpy(float a, float* x, float* y,
           int sz) {
    #pragma omp target map(to:x[0:sz]) \
                       map(tofrom:y[0:sz])
    #pragma omp parallel for simd
    for (int i = 0; i < sz; i++) {
        y[i] = a * x[i] + y[i];
    }
}

GPUs are multi-level devices: SIMD, threads, thread blocks. The parallel for simd construct creates a team of threads that execute the loop in parallel using SIMD instructions.
Multi-level Parallel saxpy
◼ Manual code transformation
▪ Tile the loops into an outer loop and an inner loop
▪ Assign the outer loop to “teams” (OpenCL: work groups)
▪ Assign the inner loop to the “threads” (OpenCL: work items)
void saxpy(float a, float* x, float* y, int sz) {
#pragma omp target teams map(to:x[0:sz]) map(tofrom:y[0:sz])
{
int bs = sz / omp_get_num_teams();
#pragma omp distribute
for (int i = 0; i < sz; i += bs) {
#pragma omp parallel for simd firstprivate(i,bs)
for (int ii = i; ii < i + bs; ii++) {
y[ii] = a * x[ii] + y[ii];
}
}
}
}
Multi-level Parallel saxpy
◼ For convenience, OpenMP defines composite constructs to implement the required code transformations:
void saxpy(float a, float* x, float* y, int sz) {
#pragma omp target teams distribute parallel for simd \
num_teams(num_blocks) map(to:x[0:sz]) map(tofrom:y[0:sz])
for (int i = 0; i < sz; i++) {
y[i] = a * x[i] + y[i];
}
}
subroutine saxpy(a, x, y, n)
! Declarations omitted
!$omp target teams distribute parallel do simd &
!$omp& num_teams(num_blocks) map(to:x) map(tofrom:y)
do i=1,n
y(i) = a * x(i) + y(i)
end do
!$omp end target teams distribute parallel do simd
end subroutine
Optimize Data Transfers
◼ Reduce the amount of time spent transferring data
▪ Use map clauses to enforce direction of data transfer.
▪ Use target data, target enter data, target exit data constructs to keep
data environment on the target device.
void example() {
    float tmp[N], a[N], b[N], c[N];
    #pragma omp target data map(alloc:tmp[:N]) \
                            map(to:a[:N],b[:N]) \
                            map(tofrom:c[:N])
    {
        zeros(tmp, N);
        compute_kernel_1(tmp, a, N); // uses target
        saxpy(2.0f, tmp, b, N);
        compute_kernel_2(tmp, b, N); // uses target
        saxpy(2.0f, c, tmp, N);
    }
}

void zeros(float* a, int n) {
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++)
        a[i] = 0.0f;
}

void saxpy(float a, float* y, float* x, int n) {
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
target data Construct Syntax
◼ Create scoped data environment and transfer data from the host to the device and back
◼ Syntax (C/C++)
#pragma omp target data [clause[[,] clause],…]
structured-block
◼ Syntax (Fortran)
!$omp target data [clause[[,] clause],…]
structured-block
!$omp end target data
◼ Clauses
device(scalar-integer-expression)
map([{alloc | to | from | tofrom | release | delete}:] list)
if(scalar-expr)
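The target enter data and target exit data constructs mentioned above provide the unstructured counterpart to this scoped form. A minimal sketch (hypothetical function and buffer names):

void init_device_buffer(float* buf, int n) {
    // allocate buf in the device data environment and copy the host values over
    #pragma omp target enter data map(to: buf[0:n])
}

void free_device_buffer(float* buf, int n) {
    // copy the device values back to the host and release the device memory
    #pragma omp target exit data map(from: buf[0:n])
}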
target update Construct Syntax
◼ Issue data transfers to or from an existing device data environment
◼ Syntax (C/C++)
#pragma omp target update [clause[[,] clause],…]
◼ Syntax (Fortran)
!$omp target update [clause[[,] clause],…]
◼ Clauses
device(scalar-integer-expression)
to(list)
from(list)
if(scalar-expr)
Example: target data and target update
#pragma omp target data device(0) map(alloc:tmp[:N]) map(to:input[:N]) map(from:res)
{
    #pragma omp target device(0)
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        tmp[i] = some_computation(input[i], i);

    update_input_array_on_the_host(input);

    #pragma omp target update device(0) to(input[:N])

    #pragma omp target device(0)
    #pragma omp parallel for reduction(+:res)
    for (i = 0; i < N; i++)
        res += final_computation(input[i], tmp[i], i);
}
Asynchronous Offloads
◼ OpenMP target constructs are synchronous by default
▪ The encountering host thread awaits the end of the target region before continuing
▪ The nowait clause makes the target constructs asynchronous (in OpenMP speak: they become
an OpenMP task)
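A minimal sketch of such an asynchronous offload (hypothetical helper name do_independent_host_work; arrays as in the earlier saxpy examples):

void async_saxpy(float a, float* x, float* y, int sz) {
    // nowait turns the target region into a deferrable target task
    #pragma omp target teams distribute parallel for \
            map(to: x[0:sz]) map(tofrom: y[0:sz]) nowait
    for (int i = 0; i < sz; i++)
        y[i] = a * x[i] + y[i];

    do_independent_host_work();   // hypothetical host work overlapping the offload

    #pragma omp taskwait          // wait for the target task before using y on the host
}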
Case Study: NWChem TCE
CCSD(T)
TCE: Tensor Contraction Engine
Finding Offload Candidates
◼ Requirements for offload candidates
▪ Compute-intensive code regions (kernels)
▪ Highly parallel
▪ Compute scaling stronger than data transfer,
e.g., compute O(n³) vs. data size O(n²)
Example Kernel (1 of 27 in total)
      subroutine sd_t_d1_1(h3d,h2d,h1d,p6d,p5d,p4d,
     1     h7d,triplesx,t2sub,v2sub)
c     Declarations omitted.
      double precision triplesx(h3d*h2d,h1d,p6d,p5d,p4d)
      double precision t2sub(h7d,p4d,p5d,h1d)
      double precision v2sub(h3d*h2d,p6d,h7d)
!$omp target
!$omp teams distribute parallel do private(p4,p5,p6,h2,h3,h1,h7)
      do p4=1,p4d
      do p5=1,p5d
      do p6=1,p6d
      do h1=1,h1d
      do h7=1,h7d
      do h2h3=1,h3d*h2d
        triplesx(h2h3,h1,p6,p5,p4)=triplesx(h2h3,h1,p6,p5,p4)
     1    - t2sub(h7,p4,p5,h1)*v2sub(h2h3,p6,h7)
      end do
      end do
      end do
      end do
      end do
      end do
!$omp end teams distribute parallel do
!$omp end target
      end subroutine

◼ All kernels have the same structure
◼ 7 perfectly nested loops
◼ Some kernels contain an inner product loop (then, 6 perfectly nested loops)
◼ Trip count per loop is equal to the "tile size" (20-30 in production)
◼ The target construct performs a presence check for triplesx, t2sub, and v2sub
◼ Naïve data allocation (tile size 24)
▪ Per-array transfer for each target construct
▪ triplesx: 1458 MB (1.5 GB transferred host to device and 1.5 GB device to host)
▪ t2sub, v2sub: 2.5 MB each
Invoking the Kernels / Data Management
◼ Simplified pseudo-code:

!$omp target enter data map(alloc:triplesx(1:tr_size))
c     for all tiles
      do ...
        call zero_triplesx(triplesx)
        do ...
          call comm_and_sort(t2sub, v2sub)
!$omp target data map(to:t2sub(1:t2_size)) map(to:v2sub(1:v2_size))
          if (...) then
            call sd_t_d1_1(h3d,h2d,h1d,p6d,p5d,p4d,h7d,triplesx,t2sub,v2sub)
          end if
c         same for sd_t_d1_2 until sd_t_d1_9
!$omp end target data
        end do
        do ...
c         Similar structure for sd_t_d2_1 until sd_t_d2_9, incl. target data
        end do
        call sum_energy(energy, triplesx)
      end do
!$omp target exit data map(release:triplesx(1:tr_size))

◼ Reduced data transfers:
▪ triplesx: allocated once (1.5 GB), always kept on the target device
▪ t2sub, v2sub: allocated after communication and sorting; the 2 x 2.5 MB update is kept for (potentially) multiple kernel invocations
Invoking the Kernels / Data Management
◼ Inside each kernel (e.g., sd_t_d1_1 shown above), the presence check on the target construct determines that triplesx, t2sub, and v2sub have already been allocated in the device data environment, so no additional transfers are issued.
Advanced Task Synchronization
Asynchronous API Interaction
◼ Some APIs are based on asynchronous operations
▪ MPI asynchronous send and receive
▪ Asynchronous I/O
▪ HIP, CUDA and OpenCL stream-based offloading
▪ In general: any other API/model that executes asynchronously with OpenMP (tasks)
◼ Example: CUDA memory transfers
do_something();
cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
do_something_else();
cudaStreamSynchronize(stream);
do_other_important_stuff(dst);
◼ Programmers need a mechanism to marry asynchronous APIs with the parallel task model of
OpenMP
▪ How to synchronize completion events with task execution?
Try 1: Use just OpenMP Tasks
void cuda_example() {
    #pragma omp task // task A
    {
        do_something();
        cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
    }
    #pragma omp task // task B
    {
        do_something_else();
    }
    #pragma omp task // task C
    {
        cudaStreamSynchronize(stream);
        do_other_important_stuff(dst);
    }
}

Race condition between tasks A and C: task C may start execution before task A has enqueued the memory transfer.
Detaching Tasks
omp_event_handle_t event;

void detach_example() {
    #pragma omp task detach(event)
    {
        important_code();
    }
    #pragma omp taskwait
}

Some other thread or task completes the detached task by fulfilling its event:
    omp_fulfill_event(event);

Applied to the CUDA example: task A detaches, and the stream callback fulfills the event once the asynchronous transfer has finished.

void callback(cudaStream_t stream, cudaError_t status, void *cb_data) {
    omp_fulfill_event((omp_event_handle_t) cb_data);
}

void cuda_example() {
    omp_event_handle_t cuda_event;
    #pragma omp task detach(cuda_event) // task A
    {
        do_something();
        cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamAddCallback(stream, callback, (void *) cuda_event, 0);
    }
    #pragma omp task // task B
        do_something_else();
}

Adding a dependence lets later tasks that depend on dst wait for the detached task to complete:

void cuda_example() {
    omp_event_handle_t cuda_event;
    #pragma omp task depend(out:dst) detach(cuda_event) // task A
    {
        do_something();
        cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamAddCallback(stream, callback, (void *) cuda_event, 0);
    }
    // ...
}
Visit www.openmp.org for more information
Tools for OpenMP Programming
OpenMP Tools
◼ Correctness Tools
→ThreadSanitizer
◼ Performance Analysis
→Performance Analysis basics
Data Race
◼ Data Race: the typical OpenMP programming error, when:
→ two or more threads access the same memory location, and
→ at least one of these accesses is a write, and
→ the accesses are not protected by synchronization (e.g., locks or critical regions) and not ordered by, e.g., a barrier.
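A minimal sketch of such a race (not from the slides): every thread updates the shared variable sum without synchronization, so the result can differ from run to run.

void race_example(void) {
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        sum += i;             // unsynchronized read-modify-write on shared sum: a data race
    // fix: #pragma omp parallel for reduction(+:sum)
}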
ThreadSanitizer: Usage
Module in Aachen: module load clang
https://pruners.github.io

• Compile the program with the clang compiler:
  C:       clang -fsanitize=thread -fopenmp -g myprog.c -o myprog
  C++:     clang++ -fsanitize=thread -fopenmp -g myprog.cpp -o myprog
  Fortran: gfortran -fsanitize=thread -fopenmp -g myprog.f -c
           clang -fsanitize=thread -fopenmp -lgfortran myprog.o -o myprog
• Execute:
  OMP_NUM_THREADS=4 ./myprog
Intel Inspector XE
◼ Detection of
→Memory Errors
→Deadlocks
→Data Races
◼ Support for
→WIN32-Threads, Posix-Threads, Intel Threading Building Blocks and OpenMP
◼ Features
→Binary instrumentation gives full functionality
PI example / 2
double f(double x)
{
    return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;

    #pragma omp parallel for private(fX,i) reduction(+:fSum)
    for (i = 0; i < n; i++)
    {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);
    }
    return fH * fSum;
}

What if we had forgotten the reduction(+:fSum) clause?
Inspector XE: create project / 1
$ module load Inspector ; inspxe-gui
Inspector XE: create project / 2
- ensure that multiple threads are used
- choose a small dataset (really!); the execution time can increase 10X – 1000X
Inspector XE: configure analysis
Threading Error Analysis Modes (increasing detail, increasing overhead):
1. Detect Deadlocks
2. Detect Deadlocks and Data Races
3. Locate Deadlocks and Data Races
Inspector XE: results / 1
Screenshot legend:
1 – detected problems
2 – filters
3 – code location
4 – timeline
Inspector XE: results / 2
Screenshot legend:
1 – Source code producing the issue (a double-click opens an editor)
2 – Corresponding call stack
Inspector XE: results / 3
Screenshot legend:
1 – Source code producing the issue (a double-click opens an editor)
2 – Corresponding call stack
The missing reduction is detected.
Sampling vs. Instrumentation
Sampling
◼ Running program is periodically interrupted to take measurement
◼ Statistical inference of program behavior
◼ Works with unmodified executables
(Figure: periodic samples t1…t9 taken over time while main, foo, bar, and baz execute.)
Instrumentation
◼ Every event of interest is captured directly
◼ More detailed and exact information
◼ Typically: recompile for instrumentation
(Figure: events t1…t14 recorded over time at every function entry and exit.)
Tracing vs. Profiling
Trace
◼ Chronologically ordered sequence of event records
(Figure: timeline of calls to main, foo, bar, and baz.)
OMPT support for sampling
◼ OMPT defines states like barrier-wait, work-serial or work-parallel
→ Allows a tool to collect OMPT state statistics in the profile
→ Profile break-down for the different OMPT states
◼ OMPT provides frame information
→ Allows a tool to identify OpenMP runtime frames

void foo() {}
void bar() {foo();}
void baz() {bar();}
int main() {foo(); bar(); baz(); return 0;}
OMPT support for instrumentation
◼ OMPT provides event callbacks
→ Parallel begin / end
→ Implicit task begin / end
→ Barrier / taskwait
→ Task create / schedule
◼ A tool can instrument those callbacks
◼ OpenMP-only instrumentation might be sufficient for some use-cases

void foo() {}
void bar() {
  #pragma omp task
  foo();
}
void baz() {
  #pragma omp task
  bar();
}
int main() {
  #pragma omp parallel sections
  {foo(); bar(); baz();}
  return 0;
}
VI-HPS Tools / 1
◼ Virtual Institute – High Productivity Supercomputing
◼ Tool development
◼ Training:
→ VI-HPS/PRACE tuning workshop series
→ SC/ISC tutorials
VI-HPS Tools / 2
Data collection
◼ Score-P : instrumentation based profiling / tracing
◼ Extrae : instrumentation based profiling / tracing
Data processing
◼ Scalasca : trace-based analysis
Data presentation
◼ ARM Map, ARM performance report
◼ CUBE : display for profile information
◼ Vampir : display for trace data (commercial/test)
◼ Paraver : display for extrae data
◼ Tau : visualization
Performance tools GUI
HPC Toolkit
Summary
Correctness:
◼ Data Races are very hard to find, since they do not show up in every program run.
◼ Intel Inspector XE or ThreadSanitizer help a lot in finding these errors.
◼ Use really small datasets, since the runtime increases significantly.
Performance:
◼ Start with simple performance measurements, like a hotspot analysis, and then focus
on these hot spots.
◼ In OpenMP applications analyze the waiting time of threads. Is the waiting time
balanced?
◼ Hardware counters might help for a better understanding of an application, but they
might be hard to interpret.
OpenMP Parallel Loops
loop Construct
◼ Existing loop constructs are tightly bound to execution model:
#pragma omp parallel for       → fork threads, distribute work, barrier, join
for (i=0; i<N; ++i) {…}
#pragma omp simd               → execute iterations with SIMD instructions
for (i=0; i<N; ++i) {…}
#pragma omp taskloop           → create tasks, taskwait
for (i=0; i<N; ++i) {…}
OpenMP Fully Parallel Loops
loop Constructs, Syntax
◼ Syntax (C/C++)
#pragma omp loop [clause[[,] clause],…]
for-loops
◼ Syntax (Fortran)
!$omp loop [clause[[,] clause],…]
do-loops
[!$omp end loop]
loop Constructs, Clauses
◼ bind(binding)
→ Binding region the loop construct should bind to
→ One of: teams, parallel, thread
◼ order(concurrent)
→ Tell the OpenMP compiler that the loop can be executed in any order.
→ Default!
◼ collapse(n)
◼ private(list)
◼ lastprivate(list)
◼ reduction(reduction-id:list)
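A minimal sketch (assuming a, x, y, and n are in scope): the loop construct binds to the enclosing parallel region and leaves the mapping of iterations to threads, SIMD lanes, etc. to the implementation.

void saxpy_loop(float a, float* x, float* y, int n) {
    #pragma omp parallel
    #pragma omp loop bind(parallel)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   // iterations may execute concurrently and in any order
}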
Extensions to Existing Constructs
◼ Existing loop constructs have been extended to also have truly parallel
semantics.
◼ C/C++ Worksharing:
#pragma omp [for|simd] order(concurrent) \
[clause[[,] clause],…]
for-loops
◼ Fortran Worksharing:
!$omp [do|simd] order(concurrent) &
[clause[[,] clause],…]
do-loops
[!$omp end [do|simd]]
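For example, a conventional worksharing loop can be given the same fully parallel semantics (sketch, assuming the saxpy variables from above):

void saxpy_concurrent(float a, float* x, float* y, int n) {
    #pragma omp parallel for order(concurrent)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   // iterations asserted to be order-independent
}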
DOACROSS Loops
◼ Loop-carried dependency:
→ Loop iterations depend on each other
→ Source of a dependency must be scheduled before the sink of the dependency
◼ DOACROSS loop:
→ Data dependency is an invariant for the execution of the whole loop nest
(Figure: 2D iteration space (i, j) with the per-thread execution order and the loop-carried dependencies.)
Non-parallelizable Loops
◼ If there is a loop-carried dependency, a loop cannot be parallelized anymore
(“easily” that is)
for (int i = 1; i < N; ++i) {
for (int j = 1; j < M; ++j) {
b[i][j] = f(b[i-1][j],
b[i][j-1], a[i][j]);
}
}
(Figure: naively parallelizing the i loop across threads violates the loop-carried dependencies — error.)
Wavefront-Parallel Loops
◼ If the data dependency is invariant, skewing the loop nest aligns the independent
iterations so they can be executed as a wavefront
for (int i = 1; i < N; ++i) {
    for (int j = i+1; j < i+M; ++j) {
        b[i][j-i] = f(b[i-1][j-i],
                      b[i][j-i-1], a[i][j-i]);
    }
}
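A minimal sketch of the resulting wavefront schedule (assumes b, a, f, N, and M from the example above): all elements on one anti-diagonal i + j == d are mutually independent, so each diagonal can be executed in parallel while the diagonals themselves are processed in order.

for (int d = 2; d <= N + M - 2; ++d) {            // d = i + j selects one anti-diagonal
    int ilo = (d - (M - 1) > 1) ? d - (M - 1) : 1;
    int ihi = (d - 1 < N - 1) ? d - 1 : N - 1;
    #pragma omp parallel for
    for (int i = ilo; i <= ihi; ++i) {
        int j = d - i;                            // stays within [1, M-1]
        b[i][j] = f(b[i-1][j], b[i][j-1], a[i][j]);
    }
}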
(Figure: skewed 2D iteration space with its dependencies; elements on the same anti-diagonal are independent.)
DOACROSS Loops with OpenMP
◼ OpenMP 4.5 extends the notion of the ordered construct to describe loop-carried
dependencies
◼ Syntax (C/C++):
#pragma omp for ordered(d) [clause[[,] clause],…]
for-loops
and
#pragma omp ordered [clause[[,] clause],…]
where clause is one of the following:
depend(source)
depend(sink:vector)
◼ Syntax (Fortran):
!$omp do ordered(d) [clause[[,] clause],…]
do-loops
!$omp ordered [clause[[,] clause],…]
Example
◼ The ordered clause tells the compiler about loop-carried dependencies and their
distances
#pragma omp parallel for ordered(2)
for (int i = 1; i < N; ++i) {
for (int j = 1; j < M; ++j) {
#pragma omp ordered depend(sink:i-1,j) depend(sink:i,j-1)
b[i][j] = f(b[i-1][j],
b[i][j-1], a[i][j]);
}
#pragma omp ordered depend(source)
}
Error Directive
◼ Can be used to issue a warning or an error at compile time and at runtime.
◼ Consider this a "directive version" of assert(), but with a bit more flexibility.
◼ Syntax (C/C++)
#pragma omp error [clause[[,] clause],…]
◼ Syntax (Fortran)
!$omp error [clause[[,] clause],…]
◼ Clauses
one of: at(compilation), at(execution)
one of: severity(fatal), severity(warning)
message(msg-string)
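A minimal usage sketch (hypothetical message texts):

void check_config(int device_count) {
    // compile-time note, emitted while this translation unit is being compiled
    #pragma omp error at(compilation) severity(warning) \
            message("TODO: tune the tile size for the target GPU")
    if (device_count == 0) {
        // runtime check: abort with a diagnostic if no offload device is available
        #pragma omp error at(execution) severity(fatal) \
                message("no offload device available")
    }
}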
OpenMP Roadmap
◼ OpenMP has a well-defined roadmap:
▪ 5-year cadence for major releases
▪ One minor release in between
▪ (At least) one Technical Report (TR) with feature previews in every year
TR6 → OpenMP 5.0 → TR8 → OpenMP 5.1 → … → TR11* → OpenMP 6.0
Printed OpenMP API Specification
◼ Save your printer ink and get the full
specification as a paperback book!
▪ Always have the spec in easy reach.
▪ Includes the entire specification with the same
pagination and line numbers as the PDF.
▪ Available at a near-wholesale price.
Recent Books about OpenMP