Vector Processors
● The number of CUDA threads that execute a kernel for a given kernel call is specified using the <<<...>>> execution configuration syntax.
● Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through built-in variables.
● Threads are grouped into blocks.
● Specify the number of blocks and the number of threads per block (a minimal launch sketch follows this list).
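For concreteness, a minimal, self-contained sketch of a kernel launch; the kernel name vecAdd, the array names, and N are illustrative, not from the slides:

#include <cuda_runtime.h>
#include <cstdio>

// Kernel: each thread handles one element. blockIdx, blockDim, and
// threadIdx are the built-in variables that give each thread its unique ID.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (i < n)                                      // the grid may overshoot n
        c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;                          // one million elements
    float *a, *b, *c;
    cudaMallocManaged(&a, N * sizeof(float));
    cudaMallocManaged(&b, N * sizeof(float));
    cudaMallocManaged(&c, N * sizeof(float));
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // <<<number of blocks, threads per block>>>
    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, N);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);                    // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}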
Why two levels of threads?
● A grid of thread blocks is easier to manage than one big block of threads.
● The GPU has thousands of cores, grouped into tens of streaming multiprocessors (SMs).
○ Each SM has its own memory and scheduling.
○ Each SM has, e.g., 64 cores (P100 architecture); these per-SM numbers can be queried at runtime (see the sketch after this list).
● GPU can start millions of threads, but they don’t all run simultaneously.
● The scheduler (GigaThread Engine) packs up to 1024 threads into one block and assigns the block to an SM.
○ The threads have consecutive IDs.
○ Several thread blocks can be assigned to an SM at the same time.
○ Threads in a block don’t execute simultaneously either.
■ They run in warps of 32 threads; more later.
● A thread block assigned to an SM uses resources (registers, shared memory) on the SM.
○ All assigned threads have their resources pre-allocated.
■ Since the block size is known when the kernel is invoked, the SM knows how many resources to allocate.
○ This makes switching between threads very fast.
■ No dynamic resource allocation.
■ The SM has a huge number of registers (e.g., 64K), so there is no register flush when switching threads.
● Each SM has its own (warp) scheduler to manage threads assigned to it.
● When all threads in a block finish, its resources are freed.
● Then the GigaThread Engine schedules a new block onto the SM, using the freed resources.
● At any time, an SM only needs to manage a few thousand threads, instead of the entire grid of millions of threads.
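The per-SM numbers above (core count, registers, shared memory, warp size) vary by architecture. A small sketch, not from the slides, that queries them at runtime with cudaGetDeviceProperties:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("SMs:                %d\n", prop.multiProcessorCount);
    printf("Warp size:          %d\n", prop.warpSize);
    printf("Max threads/block:  %d\n", prop.maxThreadsPerBlock);       // e.g. 1024
    printf("Registers per SM:   %d\n", prop.regsPerMultiprocessor);    // e.g. 65536 (64K)
    printf("Shared mem per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}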
GPU Memory organization