UNIT 4 GPU Computing - HPC
GPU computing is the use of a Graphics Processing Unit (GPU) to perform computation
traditionally handled by a Central Processing Unit (CPU). While GPUs were originally
designed to render graphics, their highly parallel structure makes them exceptionally well-
suited for a wide range of data-intensive and compute-heavy tasks beyond graphics.
This article explores the role of GPUs in modern computing, their architecture, applications
across various industries, and their impact on accelerating complex computations and
improving overall system performance.
A Graphics Processing Unit (GPU) is a specialized electronic circuit designed to accelerate the
creation and rendering of images, animations, and video for computer displays. Originally
developed for rendering graphics in video games, the GPU completes the required arithmetic
computations quickly, freeing up the CPU to handle other activities or tasks.
A CPU uses a few powerful cores optimized for sequential (serial) processing, whereas a
GPU has many smaller cores designed for parallel work. While each CPU core
operates independently on a distinct task, the GPU cores concurrently perform the repetitive
computations that underpin machine learning (ML) and deep learning.
GPUs are characterized by their high parallel processing power, which allows them to
perform thousands of computations simultaneously, making them well-suited for tasks with
heavy computational workloads and real-time processing requirements.
Features of GPU
A graphics processing unit (GPU) is a chip or electronic circuit that renders images for
display on an electronic device.
Although the terms are distinct, "graphics card" and "GPU" are frequently used
synonymously.
GPUs handle polygon rendering for 2-D and 3-D graphics as well as digital output to
flat-panel display monitors.
They also provide application support for graphically intensive programs such as AutoCAD
and support for the YUV color space.
Across a variety of devices and systems, including tablets, smart TVs, and smartphones,
Arm GPUs deliver the best possible visual experience for the user.
Uses of GPU
GPUs are typically used to power high-end gaming experiences by producing incredibly
smooth and lifelike graphics and rendering them in real time. However, a large
number of enterprise applications and configurations also require powerful graphics processors.
The important uses are mentioned below:
For Machine Learning: Some of the most exciting applications of GPU technology are in the
fields of artificial intelligence and machine learning. Because of their
extraordinary computational capacity, GPUs can significantly speed up workloads
such as image recognition that benefit from the highly parallel architecture of the
GPU.
For Gaming: With their expansive, incredibly realistic, and intricate in-game
worlds, video games have become exceptionally computationally demanding.
The need for graphics processing is rising quickly due to advances in display
technology, such as 4K screens and high refresh rates, as well as the growing
popularity of virtual reality games.
For Content Creation and Video Editing: Long rendering times have long plagued graphic
designers, video editors, and other creative professionals, clogging system resources and
impeding creative flow. GPU parallel processing now makes it faster and easier to render
graphics and video in higher-quality formats.
4.2 Evolution of GPUs: From Graphics to General-Purpose Computing
Concept: Use the GPU for non-graphics tasks by leveraging programmable shaders.
Challenges: Shaders were not designed for general computing, making programming
cumbersome.
Breakthrough: NVIDIA introduced CUDA (2006) – the first general-purpose
parallel computing architecture for GPUs.
Also Emerged: OpenCL (2009) – an open standard supporting multiple vendors.
Hardware Changes:
o Increased number of cores.
o Unified architecture (same core units handle multiple task types).
o Addition of high-bandwidth memory (e.g., HBM).
Use Cases Expanded:
o AI/Deep Learning (TensorFlow, PyTorch)
o Scientific simulations
o Data analytics
o Real-time rendering (ray tracing)
NVIDIA’s Tesla and later A100 series became central to data centers and AI
training.
AMD introduced ROCm, an open ecosystem similar to CUDA.
Specialized Hardware:
o Tensor Cores (NVIDIA Volta and later): Designed for matrix operations
crucial to AI.
o Ray Tracing Cores: For real-time lighting and shadows in games and
simulations.
Unified Ecosystems:
o NVIDIA: CUDA + cuDNN + TensorRT
o AMD: ROCm + HIP
Cloud & Edge Deployment:
o Cloud GPU instances (AWS, Azure, Google Cloud).
o GPUs now used in autonomous vehicles, mobile devices, and edge AI.
Next Frontier:
o AI-specific chips like NVIDIA Grace Hopper (CPU+GPU superchips).
o Integration of GPU features into CPUs (Apple Silicon, Intel Xe).
Summary Timeline
2020s–present: Specialized AI hardware and cloud GPUs (e.g., NVIDIA A100, H100, AMD MI300).
4.3 Comparison of CPU vs. GPU architectures
CPU | GPU
CPU stands for Central Processing Unit. | GPU stands for Graphics Processing Unit.
Smaller cache memory (L1, L2, L3). | Larger memory (VRAM) optimized for high-speed data transfer.
Architecture of CUDA
CUDA Applications
Benefits of CUDA
There are several advantages that give CUDA an edge over traditional
general-purpose GPU (GPGPU) computing based on graphics APIs:
Unified memory (CUDA 6.0 or later) and unified virtual addressing (CUDA 4.0 or later).
Shared memory: a fast region of on-chip memory that CUDA threads within a block can
share. It can be used as a software-managed cache and provides higher bandwidth than
texture lookups (see the sketch after this list).
Scattered reads: code can read from arbitrary addresses in memory.
Improved performance on data transfers to and from the GPU (downloads and readbacks).
CUDA has full support for bitwise and integer operations.
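As referenced above, the following is a minimal sketch of using shared memory as a fast, block-local staging area. The kernel name (sumTiles), tile size, and data layout are illustrative assumptions, not part of these notes: each block loads a tile from global memory once, then reduces it entirely in shared memory.

#define TILE 256   // assumed threads per block (must be a power of two here)

// Each block sums TILE elements using shared memory as a software-managed cache.
__global__ void sumTiles(const float *in, float *blockSums, int n) {
    __shared__ float tile[TILE];                   // fast, per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;    // read global memory exactly once
    __syncthreads();                               // wait until the whole tile is loaded

    // Tree reduction entirely inside shared memory (no further global-memory traffic).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];           // one partial sum per block
}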
Limitations of CUDA
The CUDA programming model enables developers to execute highly parallel computations
using NVIDIA GPUs. It abstracts the GPU's architecture in a way that lets you express your
problem in terms of data parallelism—mapping thousands of threads across large datasets.
Kernel: A function written in CUDA C/C++ and marked with __global__. It runs on the GPU and is launched from the CPU.
Block: A group of threads that execute together and can share memory.
Warp: A group of 32 threads scheduled and executed together by the GPU hardware.
Memory Hierarchy: Includes global memory, shared memory, local memory, constant memory, and registers. Memory type determines speed and scope of access.
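A minimal sketch that ties these terms together, using an illustrative kernel name (vectorAdd) and sizes that are not taken from the notes: the kernel runs on the GPU with one thread per array element and is launched from the CPU in blocks of threads.

// Kernel: runs on the GPU, launched from the CPU; each thread handles one element.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index from block + thread position
    if (i < n)                                      // guard threads beyond the array end
        c[i] = a[i] + b[i];
}

// Launch from the CPU: 256 threads per block, enough blocks to cover n elements.
// vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);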
💾 2. Memory Management
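A typical CUDA program explicitly manages two memory spaces: host (CPU) memory and device (GPU) memory. A minimal sketch of the usual allocate, copy, compute, copy back, and free sequence, assuming a float array of n elements and a hypothetical helper function name:

#include <cstdlib>          // malloc / free
#include <cuda_runtime.h>   // cudaMalloc, cudaMemcpy, cudaFree

void copyRoundTrip(int n) {
    float *h_data = (float *)malloc(n * sizeof(float));      // host (CPU) buffer
    float *d_data = NULL;                                     // device (GPU) buffer

    cudaMalloc(&d_data, n * sizeof(float));                   // allocate GPU memory
    cudaMemcpy(d_data, h_data, n * sizeof(float),
               cudaMemcpyHostToDevice);                       // host -> device copy
    // ... launch kernels that read and write d_data here ...
    cudaMemcpy(h_data, d_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);                       // device -> host copy

    cudaFree(d_data);                                         // release GPU memory
    free(h_data);
}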
🧮 3. Built-in Variables
Variable Description
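Inside a kernel, CUDA provides built-in read-only variables such as threadIdx, blockIdx, blockDim, and gridDim. A minimal sketch of how they are typically combined into a global index, using an illustrative kernel name (scale) and a grid-stride loop, which is one common idiom:

__global__ void scale(float *data, int n, float factor) {
    // threadIdx.x : thread's position within its block
    // blockIdx.x  : block's position within the grid
    // blockDim.x  : threads per block; gridDim.x : blocks per grid
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;       // total number of threads in the grid

    for (int i = idx; i < n; i += stride)      // grid-stride loop covers all n elements
        data[i] *= factor;
}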
🧪 5. Error Checking
cudaError_t err = cudaMemcpy(...);
if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
}
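Kernel launches do not return a cudaError_t directly, so a common companion pattern is to query the error state after the launch. A minimal sketch, assuming a hypothetical kernel myKernel and launch configuration blocks/threads:

myKernel<<<blocks, threads>>>(d_data, n);         // hypothetical kernel and arguments
cudaError_t launchErr = cudaGetLastError();       // catches invalid launch configurations
cudaError_t asyncErr  = cudaDeviceSynchronize();  // catches errors raised while the kernel runs
if (launchErr != cudaSuccess || asyncErr != cudaSuccess) {
    printf("CUDA error: %s\n",
           cudaGetErrorString(launchErr != cudaSuccess ? launchErr : asyncErr));
}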