UNIT 4 GPU Computing - HPC

HIGH PERFORMANCE COMPUTING

UNIT 4 GPU Computing 9

Overview of GPU computing - Evolution of GPUs from graphics to general-purpose
computing - Comparison of CPU vs. GPU architectures - Applications of GPU
computing - Introduction to CUDA - CUDA programming model and API - Writing a
simple CUDA program - Memory management in CUDA (global, shared, and local memory)

4.1 Overview of GPU Computing

GPU computing is the use of a Graphics Processing Unit (GPU) to perform computation
traditionally handled by a Central Processing Unit (CPU). While GPUs were originally
designed to render graphics, their highly parallel structure makes them exceptionally
well-suited for a wide range of data-intensive and compute-heavy tasks beyond graphics.

A Graphics Processing Unit (GPU) is a specialized processor originally designed to render
images and graphics efficiently for computer displays. In recent years, GPUs have evolved
into powerful co-processors that excel at parallel computation, making them indispensable
for tasks beyond graphics, such as scientific simulations, artificial intelligence, and
machine learning. GPUs speed up the rendering of 3D computer graphics on consumer
electronics, including PCs, game consoles, and smartphones. They are often referred to
informally as video cards or graphics cards.

This article explores the role of GPUs in modern computing, their architecture, applications
across various industries, and their impact on accelerating complex computations and
improving overall system performance.

What is a Graphics Processing Unit (GPU)?

A Graphics Processing Unit (GPU) is a specialized electronic circuit designed to accelerate
the creation and rendering of images, animations, and video for computer displays. GPUs were
originally developed for rendering graphics in video games. A GPU completes the required
arithmetic computations quickly, freeing the CPU for other tasks.

A CPU uses a small number of cores optimized for sequential (serial) processing, whereas a
GPU has many smaller cores designed to work in parallel. Each CPU core typically operates
independently on a distinct job, while GPU cores concurrently perform the repetitive
computations that underpin machine learning (ML) and deep learning.
GPUs are characterized by their high parallel processing power, which allows them to
perform thousands of computations simultaneously, making them well-suited for tasks
with heavy computational workloads and real-time processing requirements.

Features of GPU

A graphics processing unit (GPU) is a chip or electronic circuit that renders images for
display on an electronic device.

Although the terms are distinct, "graphics card" and "GPU" are frequently used
interchangeably: the GPU is the processor, while the graphics card is the board that carries it.

Polygon rendering for 2-D and 3-D graphics, and digital output to flat-panel display monitors.

Application support for graphics-intensive programs such as AutoCAD, and support for the
YUV color space.

Arm GPUs, for example, target a variety of devices, including tablets, smart TVs, and
smartphones.

Uses of GPU

GPUs are typically used to power high-end gaming by producing smooth, lifelike graphics.
A large number of professional and enterprise applications also require powerful graphics
processors. The important uses are described below:

For Machine Learning: Some of the most prominent uses of GPU technology are in artificial
intelligence and machine learning. Because of their computational capacity, GPUs can
significantly speed up tasks such as image recognition that benefit from the highly
parallel architecture of the GPU.

For Gaming: With their expansive, realistic, and intricate in-game worlds, video games have
become computationally demanding. The need for graphics processing is rising quickly
because of advances in display technology, such as 4K screens and high refresh rates, and
because of the growth of virtual reality games.

For Content Creation and Video Editing: Long rendering times in video editing and content
creation have long troubled graphic designers, video editors, and other professionals.
Rendering tied up system resources and interrupted creative flow. GPU parallel processing
now makes it faster and easier to render graphics and video in higher-quality formats.
4.2 Evolution of GPUs: From Graphics to General-Purpose Computing

The evolution of GPUs from fixed-function graphics processors to highly programmable
parallel computing engines marks one of the most transformative developments in modern
computing. Here’s a chronological overview of this evolution:

1. Early Days: Fixed-Function Graphics Processors (1980s–1990s)

• Primary Role: Rendering 2D and 3D graphics using fixed pipelines.
• Hardware Characteristics:
  o Rigid hardware with no programmability.
  o Tasks included rasterization, shading, and transformation.
• Key Examples: 3dfx Voodoo (1996), NVIDIA RIVA TNT (1998)

2. Introduction of Programmable Shaders (Early 2000s)

• Key Milestone: The transition from fixed-function to programmable graphics pipelines.
• Vertex & Pixel Shaders:
  o Developers could write small programs to control how vertices and pixels
    were processed.
  o Shaders were written in domain-specific languages like HLSL (DirectX) and
    GLSL (OpenGL).
• Major GPUs: NVIDIA GeForce 3 (2001), ATI Radeon 9700 (2002)
• Impact: Set the foundation for general-purpose computation on GPUs.

3. Emergence of GPGPU (General-Purpose GPU Computing) – Mid to Late 2000s

• Concept: Use the GPU for non-graphics tasks by leveraging programmable shaders.
• Challenges: Shaders were not designed for general computing, making programming
  cumbersome.
• Breakthrough: NVIDIA introduced CUDA (2006), a general-purpose parallel computing
  architecture for its GPUs.
• Also Emerged: OpenCL (2009), an open standard supporting multiple vendors.

4. GPUs as Parallel Computing Engines (2010s)

• Hardware Changes:
  o Increased number of cores.
  o Unified architecture (same core units handle multiple task types).
  o Addition of high-bandwidth memory (e.g., HBM).
• Use Cases Expanded:
  o AI/Deep Learning (TensorFlow, PyTorch)
  o Scientific simulations
  o Data analytics
  o Real-time rendering (ray tracing)
• NVIDIA’s Tesla and later A100 series became central to data centers and AI training.
• AMD introduced ROCm, an open ecosystem similar to CUDA.

5. AI-Optimized and Specialized GPU Architectures (2020s–Present)

• Specialized Hardware:
  o Tensor Cores (NVIDIA Volta and later): Designed for matrix operations
    crucial to AI.
  o Ray Tracing Cores: For real-time lighting and shadows in games and
    simulations.
• Unified Ecosystems:
  o NVIDIA: CUDA + cuDNN + TensorRT
  o AMD: ROCm + HIP
• Cloud & Edge Deployment:
  o Cloud GPU instances (AWS, Azure, Google Cloud).
  o GPUs now used in autonomous vehicles, mobile devices, and edge AI.
• Next Frontier:
  o AI-focused chips like NVIDIA Grace Hopper (CPU+GPU superchips).
  o Integration of GPU features into CPUs (Apple Silicon, Intel Xe).

Summary Timeline

Era | Key Features | Notable Hardware
1980s–1990s | Fixed-function graphics | 3dfx Voodoo, NVIDIA RIVA
Early 2000s | Programmable shaders | NVIDIA GeForce 3
Mid-2000s | Early GPGPU experiments | ATI Radeon, NVIDIA 8800
2006–2015 | CUDA, OpenCL, rise of HPC | NVIDIA Tesla, AMD FirePro
2015–2020 | AI and deep learning boom | NVIDIA Volta, AMD Vega
2020s–present | Specialized AI hardware, cloud GPUs | NVIDIA A100, H100, AMD MI300

4.3 Comparison of CPU vs. GPU architectures

Difference Between CPU and GPU


CPU | GPU
Stands for Central Processing Unit. | Stands for Graphics Processing Unit.
Used for general-purpose computation. | Used for specialized computation: graphics and parallel tasks.
Handles single-threaded, complex tasks. | Handles highly parallel tasks (e.g., graphics rendering).
Optimized for sequential processing. | Optimized for parallel processing.
Relies on a cache hierarchy (L1, L2, L3) to keep latency low. | Relies on large dedicated memory (VRAM) optimized for high-speed data transfer.
More energy-efficient for general tasks. | Consumes more power due to parallel processing needs.
Emphasizes low latency. | Emphasizes high throughput.
Runs the operating system and general applications. | Handles graphics rendering, AI, and machine learning.
Generally less expensive. | Generally more expensive due to specialized hardware.

4.4 Applications of GPU Computing


GPUs, with their parallel processing capabilities, are used in a wide range
of applications beyond gaming, including AI/ML, scientific computing, video
editing, and more. They excel in tasks requiring high throughput and
parallel calculations, such as training deep learning models, rendering 3D
graphics, and processing large datasets.
Here's a more detailed look at GPU applications:

1. AI and Machine Learning:
• GPUs are crucial for accelerating the training of large language models (LLMs) and
  other deep learning models due to their ability to perform parallel calculations on
  massive datasets.
• They enable faster processing of images, videos, and other data used in AI
  applications.
2. Scientific Computing:
• GPUs are used in simulations, modeling, and analysis in fields like engineering,
  medicine, and physics.
• They can accelerate tasks like molecular dynamics simulations and weather
  modeling.
3. Video Editing and 3D Rendering:
• GPUs handle the complex calculations involved in creating and rendering 3D visuals
  in games, movies, and other visual media.
• They are also used in professional video editing and animation software.
4. Gaming:
• GPUs are essential for delivering smooth, high-quality graphics and immersive
  gaming experiences.
5. Cryptocurrency Mining:
• The parallel processing power of GPUs has been used to mine cryptocurrencies,
  for example Ethereum before its 2022 move to proof of stake.
6. Other Applications:
• Data analysis: GPUs can accelerate the processing and analysis of large datasets
  in various fields.
• Medical imaging: GPUs can be used to process and visualize medical images.
• Cloud computing: GPUs can accelerate cloud-based AI and machine learning
  workloads.
• Edge computing: GPUs can be used in edge devices to perform real-time
  processing and analysis.
• Quantum computing simulation: GPUs can be used to simulate quantum
  systems.

4.5 Introduction to CUDA Programming




CUDA stands for Compute Unified Device Architecture. It is a parallel
computing platform and application programming interface (API) model
developed by NVIDIA, exposed to programmers as an extension of C/C++.
CUDA is not a separate programming language; it lets C/C++ programs run
selected functions on the GPU. Computations are performed in parallel,
which can give a large speed-up for suitable workloads. Using CUDA,
one can harness the power of an NVIDIA GPU to perform common
computing tasks, such as processing matrices and other linear algebra
operations, rather than only performing graphical calculations.

Why do we need CUDA?

• GPUs are designed to perform high-speed parallel computations to
  display graphics, for example in games.
• CUDA makes use of hardware that is already available: more than
  100 million CUDA-capable GPUs are already deployed.
• It provides a 30-100x speed-up over other microprocessors for some
  applications.
• GPU Arithmetic Logic Units (ALUs) are very small compared with the
  larger, more complex CPU cores, so many of them fit on one chip. This
  allows many parallel calculations, such as computing the color of
  each pixel on the screen.

Architecture of CUDA

• The reference diagram (not reproduced here) shows 16 Streaming
  Multiprocessors (SMs).
• Each Streaming Multiprocessor has 8 Streaming Processors (SPs), i.e.,
  a total of 128 Streaming Processors.
• Each Streaming Processor has a MAD unit (multiply-add unit) and an
  additional MU (multiplication unit).
• The GT200 has 30 Streaming Multiprocessors, each with 8 Streaming
  Processors, i.e., a total of 240 SPs and roughly 1 TFLOP of peak
  processing power.
• Each Streaming Processor is massively threaded and can run
  thousands of threads per application.
• The G80 card has 16 Streaming Multiprocessors, each with 8 Streaming
  Processors, i.e., a total of 128 SPs, and it supports 768 threads per
  Streaming Multiprocessor (note: not per SP).
• Since each Streaming Multiprocessor has 8 SPs, each SP supports at
  most 768/8 = 96 threads. The total number of threads that can run on
  128 SPs is 128 * 96 = 12,288.
• Therefore these processors are called massively parallel.
• The G80 chips have a memory bandwidth of 86.4 GB/s.
• The G80 also has an 8 GB/s communication channel with the CPU (4 GB/s
  for uploading to the CPU RAM, and 4 GB/s for downloading from the CPU
  RAM).
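
The G80 thread count quoted above can be checked with a short calculation. Only the figures already given in the list are used (16 SMs, 8 SPs per SM, 768 threads per SM):

```latex
\begin{align*}
\text{total SPs} &= 16 \times 8 = 128 \\
\text{threads per SP} &= \frac{768}{8} = 96 \\
\text{threads in flight} &= 128 \times 96 = 16 \times 768 = 12{,}288
\end{align*}
```

The last line also shows the shortcut: the total is simply the number of SMs multiplied by the per-SM thread limit.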

How does CUDA work?

• A GPU runs kernels (groups of tasks). Early CUDA GPUs ran one kernel
  at a time; newer devices can run several kernels concurrently.
• Each kernel consists of blocks, which are independent groups of threads.
• Each block contains threads, which are the basic units of computation.
• The threads in each block typically work together to calculate a value.
• Threads in the same block can share memory.
• In CUDA, sending information from the CPU to the GPU is often the
  most costly part of the computation.
• For each thread, registers are the fastest storage, followed by shared
  memory. Global memory is the slowest, and per-thread local memory
  resides in the same device memory. Constant and texture memory are
  read through dedicated caches.

Typical CUDA Program Flow
1. Load data into CPU memory
2. Copy data from CPU to GPU memory - e.g., cudaMemcpy(...,
   cudaMemcpyHostToDevice)
3. Call GPU kernel using device variable - e.g.,
   kernel<<<numBlocks, threadsPerBlock>>>(gpuVar)
4. Copy results from GPU to CPU memory - e.g., cudaMemcpy(...,
   cudaMemcpyDeviceToHost)
5. Use results on CPU
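
The five steps can be sketched as one small program. This is a minimal illustration, not a reference implementation: the kernel `squareKernel`, the helper `runSquareExample`, the block size of 256, and the input size are all assumptions chosen for the example, and error handling is kept to a minimum.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: each thread squares one element.
__global__ void squareKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                       // the last block may have spare threads
        data[i] = data[i] * data[i];
    }
}

// Walks through the five steps; returns true if the GPU results match the CPU check.
bool runSquareExample(int n) {
    size_t bytes = n * sizeof(float);

    // 1. Load data into CPU memory.
    float *h_data = (float *)malloc(bytes);
    if (h_data == NULL) return false;
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // 2. Copy data from CPU to GPU memory.
    float *d_data = NULL;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) { free(h_data); return false; }
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 3. Call the GPU kernel using the device variable.
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    squareKernel<<<numBlocks, threadsPerBlock>>>(d_data, n);

    // 4. Copy results from GPU to CPU memory (this copy waits for the kernel).
    cudaError_t err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    // 5. Use results on the CPU: here, compare against a CPU computation.
    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) {
        if (h_data[i] != (float)i * (float)i) ok = false;
    }

    cudaFree(d_data);
    free(h_data);
    return ok;
}

int main() {
    bool ok = runSquareExample(1000);
    printf("%s\n", ok ? "results match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

Assuming the file is saved as `square.cu`, it would typically be built with `nvcc square.cu -o square`. The input is kept small so that every squared value is exactly representable as a `float`.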

How is work distributed?

• Each thread "knows" the x and y coordinates of the block it is in, and
  its own coordinates within that block.
• These positions can be used to calculate a unique thread ID for each
  thread.
• The computational work done depends on the value of the thread ID.
  For example, a thread ID can correspond to a group of matrix elements.
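
As a sketch of how those coordinates combine, the kernel below derives a row and column from the built-in block and thread indices and turns them into one linear ID per matrix element. The names `writeIds` and `runIdExample`, the 16 x 16 block shape, and the matrix size are assumptions made for this illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread works out its (row, col) position in a width x height matrix
// and stores its own linear ID at that position.
__global__ void writeIds(int *out, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x: block offset + offset in block
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y: same idea
    if (col < width && row < height) {                // skip threads outside the matrix
        int id = row * width + col;                   // unique for every in-range thread
        out[id] = id;
    }
}

// Returns true if every element was written by exactly the thread that owns it.
bool runIdExample(int width, int height) {
    int n = width * height;
    int *d_out = NULL;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return false;
    cudaMemset(d_out, 0xFF, n * sizeof(int));         // fill with -1 to expose gaps

    dim3 block(16, 16);                               // 256 threads per block (assumed)
    dim3 grid((width + block.x - 1) / block.x,        // round up in both dimensions
              (height + block.y - 1) / block.y);
    writeIds<<<grid, block>>>(d_out, width, height);

    std::vector<int> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (h_out[i] != i) return false;
    }
    return true;
}

int main() {
    bool ok = runIdExample(40, 25);   // sizes deliberately not multiples of 16
    printf("%s\n", ok ? "every element has a unique owner" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

The bounds test matters because the grid is rounded up: with a 40 x 25 matrix and 16 x 16 blocks, some threads in the edge blocks fall outside the matrix and must do nothing.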

CUDA Applications

CUDA is best suited to applications that run parallel operations on large
amounts of data and are processing-intensive. Typical domains include:
1. Computational finance
2. Climate, weather, and ocean modeling
3. Data science and analytics
4. Deep learning and machine learning
5. Defence and intelligence
6. Manufacturing/AEC
7. Media and entertainment
8. Medical imaging
9. Oil and gas
10. Research
11. Safety and security
12. Tools and management

Benefits of CUDA

There are several advantages that give CUDA an edge over traditional
general-purpose GPU (GPGPU) computing through graphics APIs:
• Unified memory (CUDA 6.0 or later) and unified virtual addressing
  (CUDA 4.0 or later).
• Shared memory provides a fast memory region shared among the CUDA
  threads of a block. It can be used as a user-managed cache and
  provides more bandwidth than texture lookups.
• Scattered reads: code can read from arbitrary addresses in memory.
• Faster downloads and readbacks to and from the GPU.
• CUDA has full support for bitwise and integer operations.

Limitations of CUDA

• CUDA source code, whether for the host machine or the GPU, is
  processed according to C++ syntax rules. Older versions of CUDA used
  C syntax rules, so old C-style CUDA source code may or may not
  compile and behave as intended with current toolkits.
• CUDA has one-way interoperability (the ability of computer systems or
  software to exchange and make use of information) with rendering
  APIs like OpenGL. OpenGL can access registered CUDA
  memory, but CUDA cannot access OpenGL memory.
• Later versions of CUDA do not provide emulators or fallback
  support for older versions.
• CUDA supports only NVIDIA hardware.

4.6 CUDA programming model and API

The CUDA programming model enables developers to execute highly parallel computations
using NVIDIA GPUs. It abstracts the GPU's architecture in a way that lets you express your
problem in terms of data parallelism—mapping thousands of threads across large datasets.

🧠 CUDA Programming Model Overview

CUDA uses a heterogeneous computing model, meaning:

• The CPU (host) runs the main program.
• The GPU (device) executes parallel kernels (functions).
• The developer explicitly manages data transfer between host and device.

🏗️ Key Concepts in the CUDA Programming Model


Concept | Description
Kernel | A function written in CUDA C/C++ and marked with __global__. It runs on the GPU and is launched from the CPU.
Thread | The basic unit of execution on the GPU.
Block | A group of threads that execute together and can share memory.
Grid | A collection of blocks.
Warp | A group of 32 threads scheduled and executed together by the GPU hardware.
Memory Hierarchy | Includes global memory, shared memory, local memory, constant memory, and registers. Memory type determines speed and scope of access.

📏 Thread Hierarchy in CUDA

When you launch a kernel, you specify:

kernel<<<numBlocks, threadsPerBlock>>>(args);

• threadsPerBlock: Number of threads in each block (e.g., 256).
• numBlocks: Number of blocks in the grid (e.g., 128).

Inside the kernel, you access:

int threadId = threadIdx.x;    // Index within the block
int blockId = blockIdx.x;      // Block index in the grid
int blockSize = blockDim.x;    // Threads per block (a local named blockDim would shadow the built-in)
int globalId = blockId * blockSize + threadId;  // Global thread index

🧰 CUDA API Components

🧩 1. Kernel Definition and Launch


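__global__ void myKernel(int *data) {
    // One thread per element: compute this thread's global index
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    data[idx] *= 2;   // Assumes data holds at least 4 * 256 = 1024 elements
}

// Launch with 4 blocks of 256 threads
myKernel<<<4, 256>>>(devicePointer);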

💾 2. Memory Management

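Host <-> Device memory allocation and transfer

const int N = 1024;                 // Example element count (assumed)
size_t size = N * sizeof(int);
int *h_data = (int *)malloc(size);  // Allocate on host (fill before copying)
int *d_data;
cudaMalloc(&d_data, size);          // Allocate on device
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice); // Copy to GPU
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost); // Copy back
cudaFree(d_data);                   // Free device memory
free(h_data);                       // Free host memory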

🧮 3. Built-in Variables

Variable | Description
threadIdx.x | Thread index within the block
blockIdx.x | Block index within the grid
blockDim.x | Number of threads per block
gridDim.x | Number of blocks in the grid
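
All four built-in variables appear together in a grid-stride loop, a common pattern that lets a fixed-size launch cover an array of any length. The sketch below is one possible form of it; `scaleKernel`, `runScaleExample`, and the launch shape of 4 blocks of 256 threads are assumptions for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: each thread starts at its global index and then jumps
// ahead by the total number of threads in the grid until it passes n.
__global__ void scaleKernel(int *data, int n, int factor) {
    int start  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int stride = gridDim.x * blockDim.x;                 // threads in the whole grid
    for (int i = start; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Doubles n integers with a launch far smaller than n; returns true on success.
bool runScaleExample(int n) {
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = NULL;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    scaleKernel<<<4, 256>>>(d, n, 2);   // 1024 threads, each handles several elements

    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (h[i] != 2 * i) return false;
    }
    return true;
}

int main() {
    bool ok = runScaleExample(10000);
    printf("%s\n", ok ? "all elements doubled" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

Compared with one thread per element, this form does not require the grid to be sized to the data, at the cost of a loop inside the kernel.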

🗂️ 4. Memory Types


Memory Type | Scope | Latency | Example Use
Global | All threads | High | Data exchange between blocks
Shared | Threads in a block | Low | Fast collaboration within a block
Local | Single thread | High (resides in device memory, cached) | Register spills and large per-thread arrays
Registers | Single thread | Fastest | Most local variables are compiled to registers
Constant | All threads (read-only) | Low when cached | Constants used by all threads
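
To show global, shared, and per-thread storage working together, here is a sketch of a block-level sum. Each block stages its slice of global memory in shared memory, reduces it there, and writes one partial sum back to global memory. The names `blockSum` and `runBlockSumExample` and the fixed `BLOCK_SIZE` of 256 are assumptions; the halving loop requires the block size to be a power of two.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // assumed block size; must be a power of two here

__global__ void blockSum(const int *in, int *blockSums, int n) {
    __shared__ int tile[BLOCK_SIZE];            // shared: one copy per block
    int tid = threadIdx.x;                      // per-thread variable (register)
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0;            // global -> shared, pad with 0
    __syncthreads();                            // wait until the tile is full

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }

    if (tid == 0) blockSums[blockIdx.x] = tile[0];  // shared -> global
}

// Sums n ones on the GPU and returns the total, or -1 on a CUDA error.
int runBlockSumExample(int n) {
    int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    std::vector<int> h_in(n, 1);
    std::vector<int> h_sums(numBlocks, 0);

    int *d_in = NULL, *d_sums = NULL;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_sums, numBlocks * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<numBlocks, BLOCK_SIZE>>>(d_in, d_sums, n);

    cudaError_t err = cudaMemcpy(h_sums.data(), d_sums, numBlocks * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_sums);
    if (err != cudaSuccess) return -1;

    int total = 0;                              // finish the sum on the CPU
    for (int b = 0; b < numBlocks; ++b) total += h_sums[b];
    return total;
}

int main() {
    int total = runBlockSumExample(1000);
    printf("sum = %d\n", total);
    return total == 1000 ? 0 : 1;
}
```

Blocks cannot see each other's shared memory, which is why the partial sums travel through global memory and are combined on the host in this sketch.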

🧪 5. Error Checking

cudaError_t err = cudaMemcpy(...);
if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
}
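
Repeating that check after every call becomes tedious, so many codebases wrap it in a macro. The `CUDA_CHECK` macro below is a hypothetical helper written for this example, not part of the CUDA API; `fillKernel` and `runCheckedFill` are likewise illustrative names. Kernel launches return no status themselves, so the sketch checks them with `cudaGetLastError()` and `cudaDeviceSynchronize()`, both of which are real runtime calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: report the failing call's location and exit.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Illustrative kernel: set every element to the same value.
__global__ void fillKernel(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Fills n elements on the GPU and returns the first one, checking every step.
int runCheckedFill(int n, int value) {
    int *d_data = NULL;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(int)));

    fillKernel<<<(n + 255) / 256, 256>>>(d_data, n, value);
    CUDA_CHECK(cudaGetLastError());        // catches invalid launch configurations
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel ran

    int first = 0;
    CUDA_CHECK(cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_data));
    return first;
}

int main() {
    int first = runCheckedFill(1024, 7);
    printf("first element: %d\n", first);
    return first == 7 ? 0 : 1;
}
```

Exiting on the first error is a choice suited to small programs; a library would more likely return the `cudaError_t` to its caller.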

🧩 CUDA Execution Flow Summary

1. Allocate memory on device (GPU) using cudaMalloc().
2. Copy input data from host (CPU) to device using cudaMemcpy().
3. Launch kernel using kernel<<<blocks, threads>>>().
4. Copy results back to host from device using cudaMemcpy().
5. Free memory on the device using cudaFree().
