Introduction To GPU Computing With CUDA
Siegfried Höfinger
October 2, 2021
→ https://tinyurl.com/cuda4dummies/i/l1/notes-l1.pdf
Components
Historical
Consumer/Enterprise-Grade GPUs
GPU/Accelerator: HPC/Server:
→ https://en.wikipedia.org/wiki/Supercomputer
→ https://www.nvidia.com/en-us/data-center/a100
            Fermi                   Kepler
Year        2010                    2012
System      Tsubame 2.0 (M2050)     Titan (K20X)
Site        GSIC/TITech             ORNL
Peak        2.3 PFlop/s             17.6 PFlop/s
                                    ⋆ 3x #cores (1536)
                                    ⋆ improved power efficiency
→ https://www.nvidia.com
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
Consumer grade: made for gaming; cheaper devices with lower specs (FP64,
HBM2), and 24x7 usage in datacentres prohibited since a 12/2017 change to
the NVIDIA driver EULA; GeForce, Titan, Tegra
Enterprise grade: made for heavy HPC workloads and large-scale AI; expensive
(roughly 10:1 in price) high-end devices with certified top-notch components
and an explicit warranty for stable and reliable 24x7 operation; Tesla,
Quadro, DGX
Academia is perhaps fine; NVIDIA doesn't want to ban non-commercial use and
research, so the key question is what qualifies as a "data center"
→ https://www.nvidia.com/enterpriseservices
→ https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus
[sh@n566-009]$ nvidia-smi topo -m
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
→ https://developer.nvidia.com/cuda-faq
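From inside a program the corresponding question, whether two GPUs can reach
each other directly (peer to peer, e.g. over NVLink), can be asked via the
CUDA runtime API. A minimal sketch, not from the slides:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
   int n;

   cudaGetDeviceCount(&n);
   // check every ordered pair of GPUs for peer-to-peer capability
   for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++) {
         if (i == j) continue;
         int ok = 0;
         cudaDeviceCanAccessPeer(&ok, i, j);
         printf("GPU %d -> GPU %d: peer access %s\n", i, j,
                ok ? "possible" : "not possible");
      }
   return 0;
}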
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
// kernel definition
__global__ void VecAdd(float *A, float *B, float *C)
{
   int i;

   i = threadIdx.x;
   C[i] = A[i] + B[i];
}

int main()
{
   ...
   // kernel invocation with N threads
   N = 100;
   VecAdd <<< 1, N >>> (A, B, C);
   ...
}
→ https://tinyurl.com/cuda4dummies/i/l1/single_thread_block_vector_addition.cu
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
Note the two CUDA extensions to standard C here: 1) the __global__
declaration specifier, which marks VecAdd as a kernel running on the device,
and 2) the kernel execution configuration <<< 1, N >>>, which launches the
kernel with 1 thread block of N threads.
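The '...' in main hides the host-side setup. A minimal self-contained sketch
of what the linked example plausibly does (host array names hA/hB/hC are my
own; the actual .cu file may differ):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void VecAdd(float *A, float *B, float *C)
{
   int i = threadIdx.x;
   C[i] = A[i] + B[i];
}

int main()
{
   const int N = 100;
   size_t sz = N * sizeof(float);
   float hA[N], hB[N], hC[N];
   float *A, *B, *C;

   for (int i = 0; i < N; i++) { hA[i] = i; hB[i] = 2 * i; }

   // allocate device buffers and copy the inputs over
   cudaMalloc((void **) &A, sz);
   cudaMalloc((void **) &B, sz);
   cudaMalloc((void **) &C, sz);
   cudaMemcpy(A, hA, sz, cudaMemcpyHostToDevice);
   cudaMemcpy(B, hB, sz, cudaMemcpyHostToDevice);

   // one thread block of N threads, one element per thread
   VecAdd <<< 1, N >>> (A, B, C);

   // fetch the result and spot-check one element
   cudaMemcpy(hC, C, sz, cudaMemcpyDeviceToHost);
   printf("hC[99] = %g (expected 297)\n", hC[99]);

   cudaFree(A); cudaFree(B); cudaFree(C);
   return 0;
}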
int main()
{
int numBlocks;
dim3 threadsPerBlock;
...
// kernel invocation with one block of N * N threads
numBlocks = 1;
threadsPerBlock.x = N;
threadsPerBlock.y = N;
MatAdd <<< numBlocks, threadsPerBlock >>> (A, B, C);
...
}
→ https://tinyurl.com/cuda4dummies/i/l1/single_thread_block_matrix_addition.cu
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
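The matching kernel definition is not shown on this slide; in the style of the
later slides it would look roughly as follows (a sketch only; the float** form
assumes the host has set up arrays of device row pointers, as the linked
example presumably does):

// kernel definition; one thread per matrix element, indexed by
// the 2D thread coordinates within the single block
__global__ void MatAdd(float **A, float **B, float **C)
{
   int i, j;

   i = threadIdx.x;
   j = threadIdx.y;
   C[i][j] = A[i][j] + B[i][j];
}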
There is an upper limit to the number of threads in a thread block, e.g. 1024
for current GPUs, because all threads of a thread block are supposed to run on
the same SM (streaming multiprocessor)
A single SM typically contains 64-128 CUDA cores (INT32, FP32, FP64, TC)
Different GPU architectures vary in their number of SMs, e.g. the GTX 1080 has
20 SMs, the V100 80 SMs, the A40 84 SMs and the A100 108 SMs
However, multiple thread blocks can be launched in parallel, as defined by the
first parameter, numBlocks, in the kernel execution configuration
<<< numBlocks, threadsPerBlock >>>
→ https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth
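Both limits can be queried at runtime; a minimal sketch using
cudaGetDeviceProperties():

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
   cudaDeviceProp prop;

   // query device 0
   cudaGetDeviceProperties(&prop, 0);
   printf("%s: %d SMs, max %d threads per block\n",
          prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock);
   return 0;
}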
[sh@n566-009]$ deviceQuery
/opt/sw/x86_64/glibc-2.17/ivybridge-ep/cuda/11.0.2/NVIDIA_CUDA-11.0_Samples/1_Utilities/deviceQuery/deviceQuery Starting...
Device 0: "A40"
CUDA Driver Version / Runtime Version 11.2 / 11.0
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 45634 MBytes (47850782720 bytes)
MapSMtoCores for SM 8.6 is undefined. Default to use 64 Cores/SM
MapSMtoCores for SM 8.6 is undefined. Default to use 64 Cores/SM
(84) Multiprocessors, ( 64) CUDA Cores/MP: 5376 CUDA Cores
GPU Max Clock rate: 1740 MHz (1.74 GHz)
Memory Clock rate: 7251 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
[Figure: grid of thread blocks; each block holds blockDim.x x blockDim.y threads]
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
// kernel definition
__global__ void MatAdd(float **A, float **B, float **C)
{
int i, j;
i = (blockIdx.x * blockDim.x) + threadIdx.x;
j = (blockIdx.y * blockDim.y) + threadIdx.y;
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
dim3 threadsPerBlock, numBlocks;
...
// kernel invocation with blocks of 256 threads
threadsPerBlock.x = 16;
threadsPerBlock.y = 16;
numBlocks.x = N / threadsPerBlock.x;
numBlocks.y = N / threadsPerBlock.y;
MatAdd <<< numBlocks, threadsPerBlock >>> (A, B, C);
...
}
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
→ https://tinyurl.com/cuda4dummies/i/l1/multiple_thread_blocks_matrix_addition.cu
Note the 2D initializations of threadsPerBlock and numBlocks via the general
type dim3, and the resulting kernel execution configuration
<<< numBlocks, threadsPerBlock >>>.
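Note that numBlocks.x = N / threadsPerBlock.x covers all elements only when N
is a multiple of 16. For arbitrary N the usual pattern (a sketch of mine, not
from the slides) rounds the grid size up and guards the kernel with a bounds
check:

// kernel guarded against out-of-range threads in the last blocks
__global__ void MatAdd(float **A, float **B, float **C, int n)
{
   int i = (blockIdx.x * blockDim.x) + threadIdx.x;
   int j = (blockIdx.y * blockDim.y) + threadIdx.y;

   if (i < n && j < n)
      C[i][j] = A[i][j] + B[i][j];
}

...
// round the number of blocks up so all N x N elements are covered
numBlocks.x = (N + threadsPerBlock.x - 1) / threadsPerBlock.x;
numBlocks.y = (N + threadsPerBlock.y - 1) / threadsPerBlock.y;
MatAdd <<< numBlocks, threadsPerBlock >>> (A, B, C, N);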
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide