DS1822 - Parallel Computing - Unit 3
GPU Architectures – Data Parallelism – CUDA Basics – CUDA Program Structure – Threads, Blocks, Grids
– Memory Handling.
I. GPU Architectures
• GPU architecture focuses more on putting the available cores to work and less on low-latency cache memory access.
• In a generic many-core GPU, less chip area is devoted to control logic and caches, and a large number of transistors is devoted to parallel data processing. The following diagram shows the GPU architecture.
1) The GPU consists of multiple Processor Clusters (PCs).
2) Each Processor Cluster contains multiple Streaming Multiprocessors (SMs).
3) Each Streaming Multiprocessor (SM) has a number of Streaming Processors (SPs), also known as cores, that share control logic and an L1 (level 1) instruction cache.
4) Each Streaming Multiprocessor (SM) uses a dedicated L1 (level 1) cache and a shared L2 (level 2) cache before pulling data from global memory, i.e. Graphics Double Data Rate (GDDR) DRAM.
5) The number of Streaming Multiprocessors (SMs) and the number of cores per SM vary with the targeted price and market segment of the GPU (the device-query sketch after this list shows how to read these values at run time).
6) The global memory of the GPU consists of multiple GBs of DRAM. The growing size of global memory allows data to be kept on the device longer, thereby reducing transfers to and from the CPU.
7) GPU architecture is tolerant of memory latency; high bandwidth makes up for long memory latency.
8) Compared to a CPU, a GPU works with fewer and smaller cache layers, because more of its transistors are dedicated to computation and it relies less on caches when retrieving data from memory.
9) The memory bus is optimized for bandwidth, allowing it to serve a large number of ALUs simultaneously.
10) GPU architecture is optimized for data-parallel throughput computations.
11) To execute tasks in parallel, tasks are scheduled at the Processor Cluster (PC) or Streaming Multiprocessor (SM) level.
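A minimal sketch (using the standard CUDA Runtime API; the exact fields printed are illustrative) of how the SM count and global memory size of the installed GPU can be queried at run time:
#include <cuda_runtime.h>
#include <stdio.h>
int main() {
    // Query the properties of device 0 (the first GPU in the system)
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device name                     : %s\n", prop.name);
    printf("Streaming Multiprocessors (SMs) : %d\n", prop.multiProcessorCount);
    printf("Global memory (GB)              : %.1f\n", prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("L2 cache size (KB)              : %d\n", prop.l2CacheSize / 1024);
    return 0;
}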
II. Data Parallelism
• Data parallelism is a key concept in parallel computing: a dataset is divided into smaller chunks, and the same operation is performed concurrently on each chunk. This approach leverages the ability to process multiple data elements simultaneously, significantly speeding up computations.
• CPUs, in contrast, use task parallelism, wherein:
a. Multiple tasks map to multiple threads, and different tasks run different instructions.
b. Threads are generally heavyweight.
c. Programming is done for the individual thread.
• Modern applications process large amounts of data, which incurs significant execution time on sequential computers.
• Data parallelism is used to advantage in applications such as image processing, computer graphics, and linear algebra libraries (e.g. matrix multiplication).
CUDA Hardware:
• CUDA (Compute Unified Device Architecture) is a scalable parallel computing platform and programming model for general-purpose computing on GPUs.
• CUDA was introduced by NVIDIA in 2006.
• CUDA is a data-parallel extension to the C/C++ languages and an API model for parallel programming.
• The CUDA parallel programming model has three key abstractions (a brief kernel sketch after this list shows how each appears in code):
o (1) a hierarchy of thread groups
o (2) shared memories, and
o (3) barrier synchronization.
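A minimal sketch (the kernel name and array size are illustrative, not from the text) showing how the three abstractions appear in a kernel: the thread hierarchy through threadIdx/blockDim, shared memory through the __shared__ qualifier, and barrier synchronization through __syncthreads():
#define N 64  // number of elements, equal to the block size in this sketch
// Reverse an array within a single thread block using shared memory
__global__ void reverse_in_block(int *d_in, int *d_out) {
    __shared__ int tile[N];       // shared memory: visible to all threads of the block
    int t = threadIdx.x;          // thread hierarchy: each thread handles one element
    tile[t] = d_in[t];            // stage the data in shared memory
    __syncthreads();              // barrier: wait until every thread has written its element
    d_out[t] = tile[N - 1 - t];   // safely read the element written by another thread
}
// Launched from host code as, for example: reverse_in_block<<<1, N>>>(d_in, d_out);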
• The programmer or compiler decomposes large computing problems into many small problems that
can be solved in parallel.
• Programs written using CUDA harness the power of the GPU and thereby increase computing performance.
• In GPU accelerated applications, the sequential part of the workload runs on the CPU (as it is optimized
for single threaded performance) and the compute intensive portion of the application runs on thousands
of GPU cores in parallel.
• Using CUDA, developers can utilize the power of GPUs to perform general computing tasks like
multiplying matrices and performing linear algebra operations (instead of just doing graphical
calculations).
• With CUDA, developers program in popular languages such as C, C++, Fortran, Python, and MATLAB (and through APIs such as DirectCompute), and express parallelism through extensions in the form of a few basic keywords.
• At a high level, a graphics card with a many-core GPU and high-speed graphics device memory sits inside a standard PC or server with one or two multicore CPUs.
• The GPU consists of multiple Streaming Multiprocessors (SMs), and each SM has a number of Streaming Processors (SPs), also known as cores. Each SM uses a dedicated L1 cache and a shared L2 cache. The following diagram shows a high-level overview of the GPU hardware.
CUDA Basics:
The major difference between a C and a CUDA implementation is the __global__ specifier and the <<<...>>> syntax. The __global__ specifier indicates a function that runs on the device (GPU). Such a function can be called from host code, e.g. from the main() function, and is known as a "kernel". When a kernel is called, its execution configuration is provided through the <<<...>>> syntax, e.g. cuda_hello<<<1,1>>>(). In CUDA terminology, this is called a "kernel launch".
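A minimal sketch of the hello-world program referred to above (the kernel name follows the cuda_hello<<<1,1>>>() call in the text; the message printed is an illustrative assumption):
#include <cuda_runtime.h>
#include <stdio.h>
// Kernel: runs on the device (GPU)
__global__ void cuda_hello() {
    printf("Hello World from GPU!\n");
}
int main() {
    // Kernel launch: a grid of 1 block containing 1 thread
    cuda_hello<<<1, 1>>>();
    // Wait for the kernel to finish before the program exits
    cudaDeviceSynchronize();
    return 0;
}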
Following is an example of vector addition implemented in C (./vector_add.c). The example computes the addition of two vectors stored in arrays a and b and puts the result in the array out; a sketch is shown below.
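A sketch of what vector_add.c might look like (the vector length N and the initial values are assumptions; the array names a, b, and out follow the text):
#include <stdlib.h>
#define N 10000000
// Sequential vector addition: out[i] = a[i] + b[i]
void vector_add(float *out, float *a, float *b, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
int main() {
    float *a   = (float*)malloc(sizeof(float) * N);
    float *b   = (float*)malloc(sizeof(float) * N);
    float *out = (float*)malloc(sizeof(float) * N);
    // Initialize the input vectors
    for (int i = 0; i < N; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }
    vector_add(out, a, b, N);
    free(a); free(b); free(out);
    return 0;
}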
• We will convert vector_add.c to the CUDA program vector_add.cu, using the hello-world program above as a guide.
• In CUDA terminology, CPU memory is called host memory and GPU memory is called device
memory. Pointers to CPU and GPU memory are called host pointer and device pointer, respectively.
• For data to be accessible by the GPU, it must be present in device memory. CUDA provides APIs for allocating device memory and for transferring data between host and device memory.
• Following is the common workflow of CUDA programs:
1. Allocate host memory and initialize host data
2. Allocate device memory
3. Transfer input data from host to device memory
4. Execute kernels
5. Transfer output from device memory to host memory
• CUDA provides several functions for allocating device memory. The most common ones are cudaMalloc() and cudaFree().
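Their declarations in the CUDA Runtime API are:
cudaError_t cudaMalloc(void **devPtr, size_t size);  // allocates size bytes of device memory; devPtr receives the device pointer
cudaError_t cudaFree(void *devPtr);                  // frees device memory previously allocated with cudaMalloc()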
Memory transfer:
• Transferring data between host and device memory is done through the cudaMemcpy function, which is similar to memcpy in C.
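Its declaration in the CUDA Runtime API is:
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind);
// kind specifies the direction of the copy, e.g. cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost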
Putting these pieces together, the structure of a typical CUDA program is:
1. Initialization:
o Initialize the input data in host memory.
2. Memory Allocation:
o Allocate device memory with cudaMalloc() and copy the input data to it with cudaMemcpy().
3. Kernel Launch:
o Define Kernel Function: Write the function that will run on the GPU.
o Configure Execution Parameters: Determine the grid and block size for execution.
4. Synchronization:
o Synchronize: Ensure all threads have completed execution before proceeding.
5. Data Transfer and Memory Cleanup:
o Transfer Data Back: Copy results from the device back to the host.
o Free Memory: Release device memory with cudaFree() to prevent leaks.
#include <cuda_runtime.h>
#include <iostream>
// Kernel function to be executed on the GPU
__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x;            // each thread handles one array element
    c[index] = a[index] + b[index];
}

int main() {
    // Initialize host data
    const int arraySize = 5;
    int hostA[arraySize] = {1, 2, 3, 4, 5};
    int hostB[arraySize] = {10, 20, 30, 40, 50};
    int hostC[arraySize];

    // Allocate device memory
    int *deviceA, *deviceB, *deviceC;
    cudaMalloc((void**)&deviceA, arraySize * sizeof(int));
    cudaMalloc((void**)&deviceB, arraySize * sizeof(int));
    cudaMalloc((void**)&deviceC, arraySize * sizeof(int));

    // Copy data from host to device
    cudaMemcpy(deviceA, hostA, arraySize * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, arraySize * sizeof(int), cudaMemcpyHostToDevice);

    // Launch kernel with one block of arraySize threads
    add<<<1, arraySize>>>(deviceA, deviceB, deviceC);

    // Copy result from device to host (this call also waits for the kernel to finish)
    cudaMemcpy(hostC, deviceC, arraySize * sizeof(int), cudaMemcpyDeviceToHost);

    // Print result
    std::cout << "Result: ";
    for (int i = 0; i < arraySize; i++) {
        std::cout << hostC[i] << " ";
    }
    std::cout << std::endl;

    // Free device memory
    cudaFree(deviceA);
    cudaFree(deviceB);
    cudaFree(deviceC);
    return 0;
}
Explanation:
• Kernel Function: The add function runs on the GPU and performs addition on arrays.
• Memory Allocation: Memory is allocated on both the host and device.
• Data Transfer: Data is copied between the host and device.
• Kernel Launch: The kernel is launched with execution parameters.
• Synchronization: The device-to-host cudaMemcpy waits until all threads of the kernel have completed before copying the results back to the host.
• Memory Cleanup: Allocated memory is freed to prevent leaks.
CUDA provides a flexible and powerful way to harness the GPU’s capabilities, enabling significant
performance improvements for parallelizable tasks.
• The CUDA programming model provides an abstraction of the GPU architecture that acts as a bridge between an application and its possible implementation on GPU hardware.
• This section outlines the main concepts of the CUDA programming model and how they are exposed in general-purpose programming languages like C/C++.
• Two keywords are used widely in the CUDA programming model: host and device.
• The host is the CPU available in the system.
• The system memory associated with the CPU is called host memory.
• The GPU is called the device, and GPU memory is likewise called device memory.
• A typical sequence of operations in a CUDA program is:
o Copy the input data from host memory to device memory, also known as a host-to-device transfer.
o Load the GPU program and execute it, caching data on-chip for performance.
o Copy the results from device memory to host memory, also called a device-to-host transfer.
CUDA kernel and thread hierarchy:
Figure 1 shows that a CUDA kernel is a function that gets executed on the GPU. The parallel portion of your application is executed K times in parallel by K different CUDA threads, as opposed to only once like a regular C/C++ function.
• Every CUDA kernel starts with a __global__ declaration specifier. Programmers give each thread a unique global ID by using built-in variables (see the index sketch after this list).
• A group of threads is called a CUDA block.
• CUDA blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads (Figure 2).
• Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in the GPU (except during preemption, debugging, or CUDA dynamic parallelism).
• One SM can run several concurrent CUDA blocks, depending on the resources the blocks need.
• Each kernel is executed on one device, and CUDA supports running multiple kernels on a device at the same time. Figure 3 shows the kernel execution and its mapping onto the hardware resources available in the GPU.
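A minimal sketch (the kernel name and launch sizes are illustrative) showing how the built-in variables threadIdx, blockIdx, and blockDim combine into a unique global thread ID, and how the grid and block dimensions are chosen at launch time:
// Each thread computes one element of y = a*x + y (a simple data-parallel kernel)
__global__ void saxpy(int n, float a, float *x, float *y) {
    // Unique global thread ID: block offset plus thread position within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {   // guard: the last block may contain more threads than remaining elements
        y[i] = a * x[i] + y[i];
    }
}
// Host-side launch (x_d and y_d are assumed to be device pointers to n floats):
//     int blockSize = 256;                              // threads per CUDA block
//     int gridSize  = (n + blockSize - 1) / blockSize;  // enough blocks to cover n elements
//     saxpy<<<gridSize, blockSize>>>(n, 2.0f, x_d, y_d);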
Memory Handling:
• Memory handling in CUDA is a crucial aspect of achieving optimal performance. Proper management of memory can significantly impact the efficiency and speed of CUDA programs. The key concepts and types of memory in CUDA are described below.
Types of Memory:
1. Global Memory:
o The largest but slowest memory on the device; accessible by all threads and used for data transfer between the host (CPU) and the device (GPU).
2. Shared Memory:
o Fast on-chip memory shared by the threads of a block; ideal for data that needs to be frequently accessed by multiple threads within a block.
3. Registers:
o The fastest memory; private to each thread and used for its local variables.
4. Constant Memory:
o Read-only memory cached on chip; optimized for broadcast operations (when many threads read the same value).
5. Texture Memory:
o Read-only memory accessed through the texture cache; optimized for access patterns with 2D spatial locality.
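A brief sketch (all names are illustrative) of how these memory spaces appear in device code:
// Constant memory: declared at file scope, written by the host with cudaMemcpyToSymbol()
__constant__ float coeffs[16];
__global__ void memory_spaces_demo(const float *input, float *output) {
    // Shared memory: one tile per block, visible to all threads in the block
    __shared__ float tile[256];
    // Registers: ordinary local variables live in per-thread registers
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = input[i];                // read from global memory
    __syncthreads();
    output[i] = tile[threadIdx.x] * coeffs[0];   // combine shared and constant memory
}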
Memory Management Steps:
1. Allocate Memory on the Device:
• Use cudaMalloc to allocate memory on the GPU.
int *d_array;
cudaMalloc((void**)&d_array, size);
2. Copy Data from Host to Device:
• Use cudaMemcpy to transfer data from the CPU to the GPU.
cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice);
3. Kernel Execution:
• Perform computations on the GPU using the allocated memory.
kernel<<<gridSize, blockSize>>>(d_array);
4. Copy Data from Device to Host:
• Use cudaMemcpy to transfer data back from the GPU to the CPU.
cudaMemcpy(h_array, d_array, size, cudaMemcpyDeviceToHost);
5. Free Memory on the Device:
• Use cudaFree to deallocate memory on the GPU.
cudaFree(d_array);
Example Code:
#include <cuda_runtime.h>
#include <stdio.h>
// Kernel: element-wise vector addition (one thread per element)
__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
int main() {
    const int arraySize = 5;
    const int size = arraySize * sizeof(int);
    int h_a[arraySize] = {1, 2, 3, 4, 5};
    int h_b[arraySize] = {10, 20, 30, 40, 50};
    int h_c[arraySize];
    // Allocate device memory
    int *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, size);
    cudaMalloc((void**)&d_b, size);
    cudaMalloc((void**)&d_c, size);
    // Copy input data from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
    // Launch kernel
    add<<<1, arraySize>>>(d_a, d_b, d_c);
    // Copy result from device to host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
    // Print result
    printf("Result: ");
    for (int i = 0; i < arraySize; i++) printf("%d ", h_c[i]);
    printf("\n");
    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
Best Practices:
• Minimize Data Transfer: Data transfer between host and device is slow. Minimize the frequency and
size of these transfers.
• Use Shared Memory: Leverage shared memory for data that needs to be frequently accessed by
multiple threads.
• Optimize Memory Access Patterns: Ensure coalesced memory access patterns for global memory to improve performance (see the sketch after this list).
• Free Unused Memory: Always free device memory when it’s no longer needed to avoid memory leaks.
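A minimal sketch (kernel names are illustrative) contrasting a coalesced access pattern with a strided one:
// Coalesced: consecutive threads read consecutive addresses, so the accesses
// of a warp are combined into a small number of memory transactions.
__global__ void coalesced_copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}
// Strided: consecutive threads read addresses that are 'stride' elements apart,
// so each access may require its own memory transaction and bandwidth is wasted.
__global__ void strided_copy(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}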