OpenACC Fundamentals
3 Ways to Program GPUs
Applications
• Libraries
• Compiler Directives
• Programming Languages
OpenACC Directives
• Manage Data Movement
• Initiate Parallel Execution
• Optimize Loop Mappings

#pragma acc data copyin(x,y) copyout(z)
{
    ...
    #pragma acc parallel
    {
        #pragma acc loop gang vector
        for (i = 0; i < n; ++i) {
            z[i] = x[i] + y[i];
            ...
        }
    }
    ...
}

• Incremental
• Single source
• Interoperable
• Performance portable (CPU, GPU, MIC)
Accelerated Computing
10x Performance & 5x Energy Efficiency for HPC

• CPU: optimized for serial tasks
• GPU Accelerator: optimized for parallel tasks
What is Accelerated Computing?
Application Code
• Compute-intensive functions: a few % of the code but a large % of the time, run on the GPU
• Rest of the sequential code: runs on the CPU
OpenACC Example
#pragma acc data \
  copy(b[0:n][0:m]) \
  create(a[0:n][0:m])
{
    ...
}

Jacobi iteration: each point A(i,j) is updated from its four neighbors
A(i-1,j), A(i+1,j), A(i,j-1), and A(i,j+1):

    A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4
Jacobi Iteration: C Code
Iterate until converged:

while ( err > tol && iter < iter_max ) {
    err=0.0;
    ...
    iter++;
}
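The elided loop body above did not survive extraction. Below is a minimal sketch of what goes in its place, reconstructed from the fragments that appear later in these slides (the Anew update, the copy-back loop, and the max-error reduction); fmax/fabs from <math.h> stand in for whatever max/abs form the original code used.

    /* Update each interior point from its four neighbors. */
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            err = fmax( err, fabs( Anew[j][i] - A[j][i] ) );
        }
    }

    /* Copy the new values back for the next sweep. */
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }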
Look For Parallelism
Data dependency between iterations:

while ( err > tol && iter < iter_max ) {
    err=0.0;
    ...
    iter++;
}
OPENACC DIRECTIVE SYNTAX
C/C++
#pragma acc directive [clause [,] clause]…
…often followed by a structured code block
Fortran
!$acc directive [clause [,] clause]…
...often paired with a matching end directive surrounding a structured code block:
!$acc end directive
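As an illustrative sketch (not from the original deck), a C directive with clauses applies to the loop or structured block that immediately follows it:

/* Parallelize a vector add; the clauses shape and move x, y, and z. */
#pragma acc parallel loop copyin(x[0:n], y[0:n]) copyout(z[0:n])
for (int i = 0; i < n; ++i) {
    z[i] = x[i] + y[i];
}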
OpenACC Parallel Directive
Generates parallelism
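The code that accompanied this slide was lost in extraction. A minimal sketch of how parallel is typically used, paired with loop so the iterations are work-shared (variable names reuse the earlier vector-add example):

#pragma acc parallel            /* launch gangs on the accelerator  */
{
    #pragma acc loop            /* share the iterations across gangs */
    for (int i = 0; i < n; ++i)
        z[i] = x[i] + y[i];
}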
OpenACC Loop Directive
Identifies loops to run in parallel.

* A reduction means that all of the N*M values for err will be reduced to just one, the max.
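This slide's code example was also lost. A plausible sketch of the Jacobi compute loop with the loop directive and the max reduction described in the note (the exact clause placement on the original slide may differ; fmax/fabs are from <math.h>):

#pragma acc parallel loop reduction(max:err)
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            err = fmax( err, fabs( Anew[j][i] - A[j][i] ) );
        }
    }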
BUILDING THE CODE
$ pgcc -fast -acc -ta=tesla -Minfo=all laplace2d.c
main:
40, Loop not fused: function call before adjacent loop
Generated vector sse code for the loop
51, Loop not vectorized/parallelized: potential early exits
55, Accelerator kernel generated
55, Max reduction generated for error
56, #pragma acc loop gang /* blockIdx.x */
58, #pragma acc loop vector(256) /* threadIdx.x */
55, Generating copyout(Anew[1:4094][1:4094])
Generating copyin(A[:][:])
Generating Tesla code
58, Loop is parallelizable
66, Accelerator kernel generated
67, #pragma acc loop gang /* blockIdx.x */
69, #pragma acc loop vector(256) /* threadIdx.x */
66, Generating copyin(Anew[1:4094][1:4094])
Generating copyout(A[1:4094][1:4094])
Generating Tesla code
69, Loop is parallelizable
Speed-up (Higher is Better)
Intel Xeon E5-2698 v3 @ 2.30GHz (Haswell) vs. NVIDIA Tesla K40 & P100

Single Thread     1.00X
2 Threads         1.94X
4 Threads         3.69X
6 Threads         4.59X
8 Threads         5.00X
OpenACC (K40)     0.61X
OpenACC (P100)    0.66X

Why did OpenACC slow down here?
Compiler: PGI 16.10
Very low compute/memcpy ratio:
    Compute        4 seconds
    Memory Copy   51 seconds
[Profiler timeline: 112ms/iteration; PCIe copies]
Excessive Data Transfers
#pragma acc parallel loop
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}

Does the CPU need the data between iterations of the convergence loop?
Data regions
The data directive defines a region of code in which GPU arrays remain on
the GPU and are shared among all kernels in that region.
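A skeletal sketch (not from the original slides) of the structure this describes: two kernels inside one data region share the device copies of the arrays, so no transfers occur between them.

#pragma acc data copyin(x[0:n]) create(y[0:n]) copyout(z[0:n])
{
    #pragma acc parallel loop        /* kernel 1: y stays resident on the GPU     */
    for (int i = 0; i < n; ++i)
        y[i] = 2.0 * x[i];

    #pragma acc parallel loop        /* kernel 2: reuses device x and y, no PCIe  */
    for (int i = 0; i < n; ++i)
        z[i] = x[i] + y[i];
}   /* z is copied back to the host only when the region exits */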
Data Clauses
copy ( list )       Allocates memory on GPU and copies data from host to GPU when
                    entering region, and copies data to the host when exiting region.

copyin ( list )     Allocates memory on GPU and copies data from host to GPU when
                    entering region.

copyout ( list )    Allocates memory on GPU and copies data to the host when
                    exiting region.

deviceptr ( list )  The variable is a device pointer (e.g. CUDA) and can be used
                    directly on the device.
Array Shaping
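The body of this slide was lost in extraction. In data clauses, a sub-array "shape" of the form array[start:count] (as used in the copy/copyin clauses elsewhere in these slides) tells the compiler exactly how much to allocate and transfer, since it cannot deduce sizes from a bare pointer. A small illustrative sketch; the names first and count are examples, not from the original:

/* Only elements [first, first+count) of x are needed on the device,
   so state the shape explicitly. Sub-array copies keep the original indices. */
#pragma acc parallel loop copyin(x[first:count]) copyout(z[first:count])
for (int i = first; i < first + count; ++i)
    z[i] = 2.0 * x[i];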
Add Data Clauses
Copy A to/from the accelerator only when needed.
Create Anew as a device temporary.

#pragma acc data copy(A) create(Anew)
while ( err > tol && iter < iter_max ) {
    err=0.0;

#pragma acc parallel loop
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            ...
Visual Profiler: Data Region
[Profiler timeline with the data region: Iteration 1, Iteration 2; was 112ms per iteration]
Speed-Up (Higher is Better)

Single Thread     1.00X
2 Threads         1.94X
4 Threads         3.69X
6 Threads         4.59X
8 Threads         5.00X
OpenACC K40      14.92X
OpenACC P100     34.71X

Socket/Socket: 3X
The loop directive gives the compiler additional information about the next loop in
the source code through several clauses.
• independent – all iterations of the loop are independent
• collapse(N) – turn the next N loops into one, flattened loop
• tile(N[,M,…]) – break the next 1 or more loops into tiles based on the provided dimensions
These clauses and more will be discussed in greater detail in a later class.
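An illustrative sketch of one of these clauses on the Jacobi copy-back loops (not from the original deck):

/* collapse(2): treat the j and i loops as a single, larger parallel loop. */
#pragma acc parallel loop collapse(2)
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }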
Optimize Loop Performance
#pragma acc data copy(A) create(Anew)
while ( err > tol && iter < iter_max ) {
err=0.0;
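The rest of this slide's code was lost in extraction. A plausible sketch of a tuned compute loop, assuming the tile clause introduced above is what was applied; the 32x4 tile sizes are example values only, not taken from the original slide:

#pragma acc parallel loop tile(32,4) reduction(max:err)
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            err = fmax( err, fabs( Anew[j][i] - A[j][i] ) );
        }
    }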
Speed-up (higher is better):

Single Thread           1.00X
2 Threads               1.94X
4 Threads               3.69X
6 Threads               4.59X
8 Threads               5.00X
OpenACC (K40)          14.92X
OpenACC Tuned (K40)    15.46X
OpenACC (P100)         34.71X
OpenACC Tuned (P100)   35.00X
Intel Xeon E5-2698 v3 @ 2.30GHz (Haswell) vs. NVIDIA Tesla K40 & Tesla P100

[Chart: speed-up (higher is better) comparing Single Thread, Intel OpenMP (Best), PGI OpenMP (Best),
PGI OpenACC (Best), OpenACC K40, and OpenACC P100; recovered bar labels include 1.00X, 5.25X, 5.33X,
6.31X, 15.46X, and 35.00X]

Intel C Compiler 16, PGI 16.10 (OpenMP, K40, & P100), PGI 15.10 (multicore)
Next Lecture