GPU Fundamentals
GPU Architecture
Two Main Components
Global memory
Analogous to RAM in a CPU server
Streaming Multiprocessors (SMs)
Each SM has its own: control units, registers, execution pipelines, caches
GPU Architecture
Streaming Multiprocessor (SM)
Special-function units
cos/sin/tan, etc.
GPU Architecture
CUDA Core
Floating point & Integer unit
IEEE 754-2008 floating-point standard
Fused multiply-add (FMA) instruction for both single and double precision
[Figure: CUDA Core, showing Dispatch Port, Operand Collector, FP Unit, INT Unit]
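The practical effect of a fused multiply-add is that a*b + c is rounded once instead of twice. The sketch below illustrates that in double precision on the CPU. It is an emulation using exact rational arithmetic, not GPU code, and the helper names `fma_emulated` and `mul_then_add` are hypothetical, chosen for this example.

```python
from fractions import Fraction


def fma_emulated(a: float, b: float, c: float) -> float:
    """Emulate FMA: form a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def mul_then_add(a: float, b: float, c: float) -> float:
    """Unfused version: the product is rounded before the addition."""
    return a * b + c


# Inputs chosen so that the exact product, 1 - 2**-54, is not representable
# as a double and rounds to 1.0 when the multiply is rounded on its own.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

fused = fma_emulated(a, b, c)    # keeps the low-order bits of the product
unfused = mul_then_add(a, b, c)  # the low-order bits are lost to rounding

print("fused:  ", fused)
print("unfused:", unfused)
```

With these inputs the unfused result is exactly zero, while the fused result retains the small nonzero term that the intermediate rounding would otherwise discard.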
[Figure: Grid, Device]
Warps
GPU Memory Hierarchy Review
L2
Global Memory
GPU Architecture
Memory System on each SM
Speed v. Throughput
CPU (Speed)
Optimized for low-latency access to cached data sets
Control logic for out-of-order and speculative execution
10’s of threads
GPU (Throughput)
Optimized for data-parallel, throughput computation
Tolerant of memory latency
More transistors dedicated to computation
10,000’s of threads
Low Latency or High Throughput?
CPU architecture must minimize latency within each thread
GPU architecture hides latency with computation from other thread warps
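As a rough, back-of-the-envelope illustration of latency hiding, the sketch below estimates how many resident warps a scheduler would need so that some warp always has work while others wait on memory. The model is a simplifying assumption (every warp alternates a fixed compute burst with a fixed memory wait), and the function name and cycle counts are hypothetical, not measurements of any particular GPU.

```python
import math


def warps_needed_to_hide_latency(memory_latency_cycles: int,
                                 compute_cycles_per_warp: int) -> int:
    """Estimate the warps needed to cover one warp's memory wait.

    Simplified model: each warp computes for `compute_cycles_per_warp`
    cycles, then stalls for `memory_latency_cycles`. The other warps must
    supply enough compute to fill that stall, so we need
    (N - 1) * compute >= latency, i.e. N = ceil(latency / compute) + 1.
    """
    other_warps = math.ceil(memory_latency_cycles / compute_cycles_per_warp)
    return other_warps + 1


# Hypothetical numbers, for illustration only.
needed = warps_needed_to_hide_latency(memory_latency_cycles=400,
                                      compute_cycles_per_warp=20)
print("warps needed:", needed)
```

Under this model, the less computation each warp does between memory accesses, the more warps must be resident to keep the execution units busy, which is why GPUs run thousands of threads.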
Memory Coalescing
Global memory access happens in transactions of 32 or 128 bytes
The hardware will try to reduce to as few transactions as possible
Coalesced access:
A group of 32 contiguous threads (“warp”) accessing adjacent words
Few transactions and high utilization
Uncoalesced access:
A warp of 32 threads accessing scattered words
Many transactions and low utilization
[Figure: threads 0, 1, ..., 31 mapped to memory words for the coalesced and uncoalesced cases]
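The difference between the two access patterns can be made concrete with a small counting model. The sketch below assumes 4-byte words and that each aligned 128-byte segment touched by a warp costs one transaction. Real hardware behavior depends on the architecture and cache configuration, so this is an illustration, not a performance model. The helper names are hypothetical.

```python
WARP_SIZE = 32


def warp_addresses(stride_words: int, word_bytes: int = 4, base: int = 0):
    """Byte address accessed by each of the 32 lanes for a given stride."""
    return [base + lane * stride_words * word_bytes
            for lane in range(WARP_SIZE)]


def count_transactions(byte_addresses, segment_bytes: int = 128) -> int:
    """Count the distinct aligned segments touched by one warp's access."""
    return len({addr // segment_bytes for addr in byte_addresses})


def utilization(byte_addresses, word_bytes: int = 4,
                segment_bytes: int = 128) -> float:
    """Fraction of the transferred bytes that the warp actually requested."""
    useful = len(set(byte_addresses)) * word_bytes
    moved = count_transactions(byte_addresses, segment_bytes) * segment_bytes
    return useful / moved


coalesced = warp_addresses(stride_words=1)   # lanes read adjacent words
scattered = warp_addresses(stride_words=32)  # lanes are 128 bytes apart

print(count_transactions(coalesced), utilization(coalesced))
print(count_transactions(scattered), utilization(scattered))
```

In this model the adjacent-word pattern fits in a single 128-byte transaction with every byte used, while the strided pattern needs one transaction per thread and uses only 4 of every 128 bytes moved.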
SIMD and SIMT