Introduction To GPU Computing With CUDA: Siegfried Höfinger


Introduction to GPU Computing with CUDA

Siegfried Höfinger

VSC Research Center, TU Wien

October 2, 2021

→ https://tinyurl.com/cuda4dummies/i/l1/notes-l1.pdf

PRACE Autumn School 2021 — GPU Programming with CUDA


Outline

Current Situation — Glimpse into top500

Components

Historical

Consumer/Enterprise-Grade GPUs

CUDA — Basic Design Principles

Take Home Messages



Current Situation — Glimpse into top500

[Figure: Performance (PFLOPs/s) of the top-10 systems by rank, 2010–2020]

HPC — 3rd pillar of scientific discovery

Supercomputers — why GPUs?

Trend is likely to continue

→ https://www.scientific-computing.com/sites/default/files/content/white-paper/pdfs/PRACE-Software%20Strategy%20for%20European%20Exa


Current Situation — Glimpse into top500

[Figure: Power efficiency (GFLOPs/Watt) of the top-10 systems by rank, 2010–2020]

HPC — 3rd pillar of scientific discovery

Supercomputers — why GPUs?

Trend is likely to continue

Power efficiency is key

→ https://www.scientific-computing.com/sites/default/files/content/white-paper/pdfs/PRACE-Software%20Strategy%20for%20European%20Exa


Components

GPU/Accelerator — Specs (A100):
6912 cores, clock freq 1.4 GHz, 40 GB HBM2, 1.6 TB/s, FP64/FP32/TC-FP64 10/20/20 TFLOPs/s, TDP 400 Watt, PCIe4.0/NVLink3 32/600 GB/s

HPC/Server — Specs (Skylake Platinum 8180M):
28 cores, clock freq 2.5 GHz, up to 1500 GB DDR4, up to 119 GB/s, FP64 (AVX-512) 2 TFLOPs/s, TDP 205 Watt

Identical basic components

GPU has many more cores, but less RAM

No network on the GPU (massive parallelism onboard)

→ https://en.wikipedia.org/wiki/Supercomputer
→ https://www.nvidia.com/en-us/data-center/a100


Historical

Fermi (2010): Tsubame 2.0 (M2050), GSIC/TITech, 2.3 PFLOPs/s

Kepler (2012): Titan (K20X), ORNL, 17.6 PFLOPs/s
⋆ 3x #cores (1536)
⋆ improved power efficiency

Pascal (2016): Piz Daint (P100), CSCS, 25.4 PFLOPs/s
⋆ NVLink, 5x PCIe bw
⋆ HBM2, 3x memory bw
⋆ Unified memory, multi-GPU/CPU

Volta (2018): Summit/Sierra (V100), ORNL/LLNL, 148.6/94.6 PFLOPs/s
⋆ NVLink2, 2x previous
⋆ AI, 640 tensor cores

Ampere (2020): Perlmutter/JuwelsBooster (A100), NERSC/FZJ, 64.6/44.1 PFLOPs/s (#5/8)
⋆ FP64@TC (19.5 TFLOPs/s)
⋆ all up by 1.5x
⋆ ≈25 GFLOPs/Watt (#6/7)

→ https://www.nvidia.com


Historical
Compute Capabilities

Version Number   GPU Architecture
8.0              Ampere
7.5              Turing
7.x              Volta
6.x              Pascal
5.x              Maxwell
3.x              Kepler
2.x              Fermi
1.x              Tesla

Major revision number: identifies core architecture/hardware features
Minor revision number: incremental update to the core architecture, e.g. Turing vs. Volta

→ https://docs.nvidia.com/cuda/cuda-c-programming-guide


Consumer/Enterprise-Grade GPUs

Consumer grade: made for gaming; cheaper devices with lower specs (FP64, HBM2)
whose 24x7 usage in datacentres is prohibited (EULA change of 12/2017 affecting
the NVIDIA driver); GeForce, Titan, Tegra
Enterprise grade: aimed at heavy HPC workloads and large-scale AI; expensive
(roughly 10:1) high-end devices with certified top-notch components and an
explicit warranty for stable and reliable 24x7 operation; Tesla, Quadro, DGX
Academia is perhaps fine: NVIDIA does not want to ban non-commercial use and
research; the key question is what qualifies as a “data center”

→ https://www.nvidia.com/enterpriseservices
→ https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus



Consumer/Enterprise-Grade GPUs
Figuring Out Own Setup

[sh@n566-009]$ nvidia-smi

Wed Sep 29 10:39:15 2021


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A40 Off | 00000000:41:00.0 Off | 0 |
| 0% 46C P0 75W / 300W | 0MiB / 45634MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 A40 Off | 00000000:A1:00.0 Off | 0 |
| 0% 40C P0 72W / 300W | 0MiB / 45634MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+



Consumer/Enterprise-Grade GPUs
Figuring Out Own Setup cont.

[sh@n566-009]$ nvidia-smi topo --matrix

GPU0 GPU1 mlx5_0 CPU Affinity NUMA Affinity


GPU0 X SYS NODE 0-7,16-23 0
GPU1 SYS X SYS 8-15,24-31 1
mlx5_0 NODE SYS X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks



CUDA — Basic Design Principles
3 Basic Components

Driver: kernel modules nvidia.ko, nvidia-uvm.ko; also includes libcuda.so;
considers compute capability!

CUDA Toolkit: nvcc, cuda-gdb, nsight..., libcudart.so, libcublas.so...;
also considers compute capability!

CUDA SDK: examples in 7 sub-directories
→ https://www.nvidia.com/Download/index.aspx?lang=en-us
→ https://developer.nvidia.com/cuda-gpus
→ https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html



CUDA — Basic Design Principles
Good to Know Facts

CUDA: Compute Unified Device Architecture (NVIDIA 2006)


GPU programming model based on threads, shared memory and barrier
synchronization
Linux, Mac OS and Windows supported
Simple extensions to C functions (becoming kernels) to run on the GPU in parallel
as N different CUDA threads
Multiple GPUs per host supported
CUDA is the de facto standard in HPC for science
CUDA cores use simplified logic, with more silicon devoted to ALUs rather than
out-of-order execution, branch prediction, etc.
→ https://developer.nvidia.com/cuda-faq



CUDA — Basic Design Principles
Good to Know Facts cont.

SIMD operations, single instruction multiple data


CUDA programs can be called from C, C++, Fortran, Python...
PTX (parallel thread execution) format is forward-compatible with upcoming GPU
generations: *.cu → nvcc → PTX → driver/cudart → *.exe
Provides a mini HPC cluster on the desktop computer
Data movement over PCIe is still critical
OpenCL (alternative API) is supported too
Competitors: AMD (ATI), Intel

→ https://developer.nvidia.com/cuda-faq



CUDA — Basic Design Principles
CUDA C-Programming Guide, Introduction

GPU specialized for compute-intensive, highly parallel computation

More transistors are dedicated to data processing rather than data caching and
flow control

GPU is a highly parallel, multithreaded, manycore processor with very high
memory bandwidth

Power efficiency is key — eco-friendly computing

→ https://docs.nvidia.com/cuda/cuda-c-programming-guide


CUDA — Basic Design Principles
CUDA C-Programming Guide, Programming Model

Single Thread Block Vector Addition

// kernel definition
__global__ void VecAdd(float *A, float *B, float *C)
{
    int i;
    i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // kernel invocation with N threads
    N = 100;
    VecAdd <<< 1, N >>> (A, B, C);
    ...
}

Only 3 basic elements:
1) kernel declaration specifier (__global__)
2) kernel execution configuration (<<< 1, N >>>)
3) built-in variables, e.g. threadIdx.x = 0,1,2...

→ https://tinyurl.com/cuda4dummies/i/l1/single_thread_block_vector_addition.cu
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide


CUDA — Basic Design Principles
Compile and Run and Monitor
[sh@n566-009]$ nvcc single_thread_block_vector_addition.cu
[sh@n566-009]$ ./a.out
0 100.000000
1 100.000000
...
99 100.000000
[sh@n566-009]$ watch -n 0.1 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A40 Off | 00000000:41:00.0 Off | 0 |
| 0% 48C P0 76W / 300W | 17MiB / 45634MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 1 A40 Off | 00000000:A1:00.0 Off | 0 |
| 0% 40C P0 73W / 300W | 0MiB / 45634MiB | 0% Default |
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 6231 C ./a.out 59MiB |
+-----------------------------------------------------------------------------+



CUDA — Basic Design Principles
CUDA C-Programming Guide, Programming Model cont.

__global__ declares a function as being a GPU-kernel
1. executed on the device
2. callable from the host
3. also callable from the device for devices of compute capability ≥ 3.2

__host__ declares a function as being a host-kernel
1. executed on the host
2. callable from the host only
3. default when omitted

__device__ declares a function as being a GPU-only-kernel
1. executed on the device
2. callable from the device only

→ https://docs.nvidia.com/cuda/cuda-c-programming-guide


CUDA — Basic Design Principles
CUDA C-Programming Guide, Programming Model cont.

threadIdx.[x,y,z] may be one-dimensional, two-dimensional or three-dimensional,
referring to a one-dimensional, two-dimensional or three-dimensional thread block

threadIdx.[x,y,z] provides a direct formal abstraction of domains, i.e. it
facilitates straightforward reference to the elements of a vector, matrix, or
a volume

threadIdx.[x,y,z] for N threads goes from 0 to N-1

when working with thread blocks of two/three-dimensional shapes, the declaration
switches from int N; to
dim3 threadsPerBlock(N, N); (structure with 3 members, x, y, z, of type
unsigned int)

→ https://docs.nvidia.com/cuda/cuda-c-programming-guide


CUDA — Basic Design Principles
CUDA C-Programming Guide, Programming Model cont.

Single Thread Block Matrix Addition

// kernel definition
__global__ void MatAdd(float **A, float **B, float **C)
{
    int i, j;
    i = threadIdx.x;
    j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    int numBlocks;
    dim3 threadsPerBlock;
    ...
    // kernel invocation with one block of N * N threads
    numBlocks = 1;
    threadsPerBlock.x = N;
    threadsPerBlock.y = N;
    MatAdd <<< numBlocks, threadsPerBlock >>> (A, B, C);
    ...
}

Note the built-in 2D thread indices (threadIdx.x, threadIdx.y = 0,1,2...) and
the 2D thread block initialization, which goes directly into the kernel
execution configuration.

→ https://tinyurl.com/cuda4dummies/i/l1/single_thread_block_matrix_addition.cu
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide


CUDA — Basic Design Principles
CUDA C-Programming Guide, Programming Model cont.

There is an upper limit to the number of threads in a thread block, e.g. ≈1024
for current GPUs, because all threads of a thread block are supposed to run on
the same SM (streaming multiprocessor)

A single SM typically contains 64-128 CUDA cores (INT32, FP32, FP64, TC)

Different GPU architectures vary in terms of numbers of SMs, e.g. the GTX 1080
has 20 SMs, the V100 has 80 SMs, the A40 has 84 SMs, the A100 has 108 SMs

However, multiple thread blocks can be launched in parallel as defined by the
initial parameter numBlocks used in the kernel execution configuration
<<< numBlocks, threadsPerBlock >>>

→ https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth


CUDA — Basic Design Principles
How to Determine Max #Threads and Related

[sh@n566-009]$ deviceQuery

/opt/sw/x86_64/glibc-2.17/ivybridge-ep/cuda/11.0.2/NVIDIA_CUDA-11.0_Samples/1_Utilities/deviceQuery/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "A40"
CUDA Driver Version / Runtime Version 11.2 / 11.0
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 45634 MBytes (47850782720 bytes)
MapSMtoCores for SM 8.6 is undefined. Default to use 64 Cores/SM
MapSMtoCores for SM 8.6 is undefined. Default to use 64 Cores/SM
(84) Multiprocessors, ( 64) CUDA Cores/MP: 5376 CUDA Cores
GPU Max Clock rate: 1740 MHz (1.74 GHz)
Memory Clock rate: 7251 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes



CUDA — Basic Design Principles
How to Determine Max #Threads and Related cont.
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 65 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from A40 (GPU0) -> A40 (GPU1) : Yes
> Peer access from A40 (GPU1) -> A40 (GPU0) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 11.0, NumDevs = 2
Result = PASS



CUDA — Basic Design Principles
CUDA C-Programming Guide, Programming Model cont.

The organization of the block grid via numBlocks is similar to threadsPerBlock
Can again be one-dimensional, two-dimensional or three-dimensional
Again declared as type dim3
blockIdx.[x,y,z] is again a built-in variable at the kernel level to identify the corresponding thread block for each of the parallel threads
blockDim.[x,y,z] is another built-in variable for the kernel to assess the thread block dimensions

[Figure: 2D grid of thread blocks, block extents labelled blockDim.x and blockDim.y]
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide

PRACE Autumn School 2021 — GPU Programming with CUDA


CUDA — Basic Design Principles
CUDA C-Programming Guide, Programming Model cont.
Multiple Thread Blocks Matrix Addition

#define N 256

// kernel definition: general usage of the built-in variables in 2D
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i, j;
    i = (blockIdx.x * blockDim.x) + threadIdx.x;
    j = (blockIdx.y * blockDim.y) + threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // general type of the kernel execution configuration
    dim3 threadsPerBlock, numBlocks;
    ...
    // 2D initializations: kernel invocation with blocks of 16 x 16 = 256 threads
    threadsPerBlock.x = 16;
    threadsPerBlock.y = 16;
    numBlocks.x = N / threadsPerBlock.x;
    numBlocks.y = N / threadsPerBlock.y;
    MatAdd <<< numBlocks, threadsPerBlock >>> (A, B, C);
    ...
}
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
→ https://tinyurl.com/cuda4dummies/i/l1/multiple_thread_blocks_matrix_addition.cu


CUDA — Basic Design Principles
CUDA C-Programming Guide, Programming Model cont.

256 threads per thread block is an arbitrary but frequent choice
Thread blocks are required to execute independently, in any order
Scalability results from this requirement

→ https://docs.nvidia.com/cuda/cuda-c-programming-guide

CUDA — Basic Design Principles
CUDA C-Programming Guide, Programming Model cont.

The workflow is divided between host and device
CUDA threads execute on the GPU; the rest of the program runs on the host CPU
Thread blocks are all parallel
Host code is usually serial; both sections may execute concurrently

→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
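This host/device split follows a canonical sequence: allocate device memory, copy inputs in, launch the kernel, copy results back. A sketch of that pattern (hypothetical VecAdd kernel, error checking elided for brevity; requires a CUDA-capable GPU and nvcc to actually run):

```cuda
#include <stdio.h>

__global__ void VecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // bounds guard for surplus threads
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // serial host code: allocate and initialize inputs on the CPU
    float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // device side: allocate GPU memory and copy the inputs over
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // kernel launch is asynchronous: host code may continue concurrently
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<numBlocks, threadsPerBlock>>>(dA, dB, dC, n);

    // copying the result back synchronizes with the kernel
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```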

Take Home Messages
✈ GPU kernels need to account for proper logic
✈ The kernel execution configuration facilitates efficient operation of GPU resources
✈ Massive parallelism at the level of GPU threads: conventional loops over array elements are replaced by many individual threads, each acting directly on thread-specific data elements in parallel
✈ Built-in variables quasi-automatize various workloads, e.g. threadIdx, blockIdx = 0, 1, 2, 3, ...
✈ Shapes of thread blocks go hand in hand with the domain decomposition (vector, matrix, volume)
✈ GPU computing means eco-friendly computing!
