Introduction To GPU Computing With CUDA
Siegfried Höfinger
October 2, 2021
→ https://tinyurl.com/cuda4dummies/i/l1/notes-l1.pdf
Components
Historical
Consumer/Enterprise-Grade GPUs
GPU/Accelerator: HPC/Server:
→ https://en.wikipedia.org/wiki/Supercomputer
→ https://www.nvidia.com/en-us/data-center/a100
            Fermi                   Kepler
Year        2010                    2012
System      Tsubame 2.0 (M2050)     Titan (K20X)
Site        GSIC/TITech             ORNL
Peak        2.3 PFlop/s             17.6 PFlop/s
                                    ⋆ 3x #cores (1536)
                                    ⋆ improved power efficiency
→ https://www.nvidia.com
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
Consumer grade: made for gaming; cheaper devices with lower specs (FP64,
HBM2), and 24x7 usage in datacentres prohibited since a 12/2017 change to
the NVIDIA driver EULA; GeForce, Titan, Tegra
Enterprise grade: made for heavy HPC workloads and large-scale AI; expensive
(roughly 10:1 in price) high-end devices with certified top-notch components
and an explicit warranty for stable and reliable 24x7 operation; Tesla,
Quadro, DGX
Academia is perhaps fine; NVIDIA doesn't want to ban non-commercial use and
research, so the key question is what qualifies as a "data center"
→ https://www.nvidia.com/enterpriseservices
→ https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus
[sh@n566-009]$ nvidia-smi topo -m
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
→ https://developer.nvidia.com/cuda-faq
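From inside a program the corresponding question, whether two GPUs can reach
each other directly (peer to peer, e.g. over NVLink), can be asked via the
CUDA runtime API. A minimal sketch, not from the slides:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
   int n;

   cudaGetDeviceCount(&n);
   // check every ordered pair of GPUs for peer-to-peer capability
   for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++) {
         if (i == j) continue;
         int ok = 0;
         cudaDeviceCanAccessPeer(&ok, i, j);
         printf("GPU %d -> GPU %d: peer access %s\n", i, j,
                ok ? "possible" : "not possible");
      }
   return 0;
}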
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
// kernel definition
__global__ void VecAdd(float *A, float *B, float *C)
{
   int i;

   i = threadIdx.x;
   C[i] = A[i] + B[i];
}

int main()
{
   ...
   // kernel invocation with N threads
   N = 100;
   VecAdd <<< 1, N >>> (A, B, C);
   ...
}
→ https://tinyurl.com/cuda4dummies/i/l1/single_thread_block_vector_addition.cu
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
Note the two CUDA extensions to standard C here: 1) the __global__
declaration specifier, which marks VecAdd as a kernel running on the device,
and 2) the kernel execution configuration <<< 1, N >>>, which launches the
kernel with 1 thread block of N threads.
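The '...' in main hides the host-side setup. A minimal self-contained sketch
of what the linked example plausibly does (host array names hA/hB/hC are my
own; the actual .cu file may differ):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void VecAdd(float *A, float *B, float *C)
{
   int i = threadIdx.x;
   C[i] = A[i] + B[i];
}

int main()
{
   const int N = 100;
   size_t sz = N * sizeof(float);
   float hA[N], hB[N], hC[N];
   float *A, *B, *C;

   for (int i = 0; i < N; i++) { hA[i] = i; hB[i] = 2 * i; }

   // allocate device buffers and copy the inputs over
   cudaMalloc((void **) &A, sz);
   cudaMalloc((void **) &B, sz);
   cudaMalloc((void **) &C, sz);
   cudaMemcpy(A, hA, sz, cudaMemcpyHostToDevice);
   cudaMemcpy(B, hB, sz, cudaMemcpyHostToDevice);

   // one thread block of N threads, one element per thread
   VecAdd <<< 1, N >>> (A, B, C);

   // fetch the result and spot-check one element
   cudaMemcpy(hC, C, sz, cudaMemcpyDeviceToHost);
   printf("hC[99] = %g (expected 297)\n", hC[99]);

   cudaFree(A); cudaFree(B); cudaFree(C);
   return 0;
}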
int main()
{
int numBlocks;
dim3 threadsPerBlock;
...
// kernel invocation with one block of N * N threads
numBlocks = 1;
threadsPerBlock.x = N;
threadsPerBlock.y = N;
MatAdd <<< numBlocks, threadsPerBlock >>> (A, B, C);
...
}
→ https://tinyurl.com/cuda4dummies/i/l1/single_thread_block_matrix_addition.cu
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
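The matching kernel definition is not shown on this slide; in the style of the
later slides it would look roughly as follows (a sketch only; the float** form
assumes the host has set up arrays of device row pointers, as the linked
example presumably does):

// kernel definition; one thread per matrix element, indexed by
// the 2D thread coordinates within the single block
__global__ void MatAdd(float **A, float **B, float **C)
{
   int i, j;

   i = threadIdx.x;
   j = threadIdx.y;
   C[i][j] = A[i][j] + B[i][j];
}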
There is an upper limit to the number of threads in a thread block, e.g. 1024
for current GPUs, because all threads of a thread block are supposed to run on
the same SM (streaming multiprocessor)
A single SM typically contains 64-128 CUDA cores (INT32, FP32, FP64, TC)
Different GPU architectures vary in their number of SMs, e.g. the GTX 1080 has
20 SMs, the V100 80 SMs, the A40 84 SMs and the A100 108 SMs
However, multiple thread blocks can be launched in parallel, as defined by the
first parameter, numBlocks, in the kernel execution configuration
<<< numBlocks, threadsPerBlock >>>
→ https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth
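Both limits can be queried at runtime; a minimal sketch using
cudaGetDeviceProperties():

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
   cudaDeviceProp prop;

   // query device 0
   cudaGetDeviceProperties(&prop, 0);
   printf("%s: %d SMs, max %d threads per block\n",
          prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock);
   return 0;
}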
[sh@n566-009]$ deviceQuery
/opt/sw/x86_64/glibc-2.17/ivybridge-ep/cuda/11.0.2/NVIDIA_CUDA-11.0_Samples/1_Utilities/deviceQuery/deviceQuery Starting...
Device 0: "A40"
CUDA Driver Version / Runtime Version 11.2 / 11.0
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 45634 MBytes (47850782720 bytes)
MapSMtoCores for SM 8.6 is undefined. Default to use 64 Cores/SM
MapSMtoCores for SM 8.6 is undefined. Default to use 64 Cores/SM
(84) Multiprocessors, ( 64) CUDA Cores/MP: 5376 CUDA Cores
GPU Max Clock rate: 1740 MHz (1.74 GHz)
Memory Clock rate: 7251 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
[Figure: grid of thread blocks; each block holds blockDim.x x blockDim.y threads]
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
// kernel definition
__global__ void MatAdd(float **A, float **B, float **C)
{
int i, j;
i = (blockIdx.x * blockDim.x) + threadIdx.x;
j = (blockIdx.y * blockDim.y) + threadIdx.y;
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
dim3 threadsPerBlock, numBlocks;
...
// kernel invocation with blocks of 256 threads
threadsPerBlock.x = 16;
threadsPerBlock.y = 16;
numBlocks.x = N / threadsPerBlock.x;
numBlocks.y = N / threadsPerBlock.y;
MatAdd <<< numBlocks, threadsPerBlock >>> (A, B, C);
...
}
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide
→ https://tinyurl.com/cuda4dummies/i/l1/multiple_thread_blocks_matrix_addition.cu
Note the 2D initializations of threadsPerBlock and numBlocks via the general
type dim3, and the resulting kernel execution configuration
<<< numBlocks, threadsPerBlock >>>.
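Note that numBlocks.x = N / threadsPerBlock.x covers all elements only when N
is a multiple of 16. For arbitrary N the usual pattern (a sketch of mine, not
from the slides) rounds the grid size up and guards the kernel with a bounds
check:

// kernel guarded against out-of-range threads in the last blocks
__global__ void MatAdd(float **A, float **B, float **C, int n)
{
   int i = (blockIdx.x * blockDim.x) + threadIdx.x;
   int j = (blockIdx.y * blockDim.y) + threadIdx.y;

   if (i < n && j < n)
      C[i][j] = A[i][j] + B[i][j];
}

...
// round the number of blocks up so all N x N elements are covered
numBlocks.x = (N + threadsPerBlock.x - 1) / threadsPerBlock.x;
numBlocks.y = (N + threadsPerBlock.y - 1) / threadsPerBlock.y;
MatAdd <<< numBlocks, threadsPerBlock >>> (A, B, C, N);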
→ https://docs.nvidia.com/cuda/cuda-c-programming-guide