UNIT 4 GPU Computing - HPC

HIGH PERFORMANCE COMPUTING

Uploaded by

B.REENA HICET STAFF CSE

UNIT 4 GPU Computing 9

Overview of GPU computing - Evolution of GPUs from graphics to general-purpose
computing - Comparison of CPU vs. GPU architectures -
Applications of GPU computing - Introduction to CUDA - CUDA
programming model and API - Writing a simple CUDA program - Memory
management in CUDA (global, shared, and local memory)

4.1 Overview of GPU Computing

GPU computing is the use of a Graphics Processing Unit (GPU) to perform computation
traditionally handled by a Central Processing Unit (CPU). While GPUs were originally
designed to render graphics, their highly parallel structure makes them exceptionally well-
suited for a wide range of data-intensive and compute-heavy tasks beyond graphics.

A Graphics Processing Unit (GPU) is a specialized processor originally designed to render
images and graphics efficiently for computer displays. In recent years, GPUs have evolved
into powerful co-processors that excel at parallel computation, making them indispensable
for tasks beyond graphics, such as scientific simulations, artificial intelligence, and
machine learning. GPUs speed up image processing and the rendering of 3D computer graphics
on consumer electronics including PCs, game consoles, and smartphones. They are often
referred to as video cards or graphics cards, although strictly speaking the GPU is the
processor chip on the card.

This unit explores the role of GPUs in modern computing, their architecture, applications
across various industries, and their impact on accelerating complex computations and
improving overall system performance.

What is a Graphics Processing Unit (GPU)?

A Graphics Processing Unit (GPU) is a specialized electronic circuit designed to accelerate
the creation and rendering of images, animations, and video for computer displays. GPUs were
originally developed for rendering graphics in video games. A GPU completes the required
arithmetic computations quickly, freeing up the CPU for other tasks.

A CPU has a small number of powerful cores optimized for sequential processing, whereas a
GPU has many smaller cores designed to work in parallel. Each CPU core typically works
independently on a distinct job, while the GPU cores concurrently perform the same
repetitive computations on different data, the pattern that underpins machine learning (ML)
and deep learning.

GPUs are characterized by their high parallel processing power, which allows them to
perform thousands of computations simultaneously, making them well suited to heavy
computational workloads and real-time processing.

Features of GPU

A graphics processing unit (GPU) is a chip or electronic circuit that renders images for
display on an electronic device.

Although the terms are distinct, "graphics card" and "GPU" are frequently used
interchangeably.

GPUs provide high-quality polygon rendering for 2-D and 3-D graphics and digital output to
flat-panel monitors.

They also offer YUV color space support and application support for graphics-intensive
programs such as AutoCAD.

Across a variety of devices, including tablets, smart TVs, and smartphones, Arm GPUs are
designed to deliver a high-quality visual experience for the user.

Uses of GPU

GPUs are best known for powering high-end gaming, where they produce smooth, lifelike
graphics. However, a large number of professional and enterprise applications also require
powerful graphics processors. The important uses are described below:

For Machine Learning: GPU technology is widely used in artificial intelligence and machine
learning. Because of their large computational capacity, GPUs can significantly speed up
tasks such as image recognition that benefit from the highly parallel architecture of the
GPU.

For Gaming: With their expansive, realistic, and intricate in-game worlds, video games have
become exceptionally computationally demanding. The need for graphics processing is rising
quickly because of advances in display technology, such as 4K screens and high refresh
rates, and the growing popularity of virtual reality games.

For Content Creation and Video Editing: Slow rendering has long plagued graphic designers,
video editors, and other professionals, tying up system resources and impeding creative
flow. GPU parallel processing now makes it faster and easier to render graphics and video
in higher-quality formats.
4.2 Evolution of GPUs: From Graphics to General-Purpose Computing

The evolution of GPUs from fixed-function graphics processors to highly programmable
parallel computing engines marks one of the most transformative developments in modern
computing. Here is a chronological overview of this evolution:

1. Early Days: Fixed-Function Graphics Processors (1980s–1990s)

• Primary Role: Rendering 2D and 3D graphics using fixed pipelines.
• Hardware Characteristics:
   o Rigid hardware with no programmability.
   o Tasks included rasterization, shading, and transformation.
• Key Examples: 3dfx Voodoo (1996), NVIDIA RIVA TNT (1998)

2. Introduction of Programmable Shaders (Early 2000s)

• Key Milestone: The transition from fixed-function to programmable graphics pipelines.
• Vertex & Pixel Shaders:
   o Developers could write small programs to control how vertices and pixels
     were processed.
   o Shaders were written in domain-specific languages like HLSL (DirectX) and
     GLSL (OpenGL).
• Major GPUs: NVIDIA GeForce 3 (2001), ATI Radeon 9700 (2002)
• Impact: Set the foundation for general-purpose computation on GPUs.

3. Emergence of GPGPU (General-Purpose GPU Computing) – Mid to Late 2000s

• Concept: Use the GPU for non-graphics tasks by leveraging programmable shaders.
• Challenges: Shaders were not designed for general computing, making programming
  cumbersome.
• Breakthrough: NVIDIA introduced CUDA (2006) – the first general-purpose
  parallel computing architecture for GPUs.
• Also Emerged: OpenCL (2009) – an open standard supporting multiple vendors.

4. GPUs as Parallel Computing Engines (2010s)

• Hardware Changes:
   o Increased number of cores.
   o Unified architecture (same core units handle multiple task types).
   o Addition of high-bandwidth memory (e.g., HBM).
• Use Cases Expanded:
   o AI/Deep Learning (TensorFlow, PyTorch)
   o Scientific simulations
   o Data analytics
   o Real-time rendering (ray tracing)
• NVIDIA's Tesla and later A100 series became central to data centers and AI
  training.
• AMD introduced ROCm, an open ecosystem similar to CUDA.

5. AI-Optimized and Specialized GPU Architectures (2020s–Present)

• Specialized Hardware:
   o Tensor Cores (NVIDIA Volta and later): Designed for matrix operations
     crucial to AI.
   o Ray Tracing Cores: For real-time lighting and shadows in games and
     simulations.
• Unified Ecosystems:
   o NVIDIA: CUDA + cuDNN + TensorRT
   o AMD: ROCm + HIP
• Cloud & Edge Deployment:
   o Cloud GPU instances (AWS, Azure, Google Cloud).
   o GPUs now used in autonomous vehicles, mobile devices, and edge AI.
• Next Frontier:
   o AI-specific chips like NVIDIA Grace Hopper (CPU+GPU superchips).
   o Integration of GPU features into CPUs (Apple Silicon, Intel Xe).

Summary Timeline

Era           | Key Features                        | Notable Hardware
1980s–1990s   | Fixed-function graphics             | 3dfx Voodoo, NVIDIA RIVA
Early 2000s   | Programmable shaders                | NVIDIA GeForce 3
Mid-2000s     | Early GPGPU experiments             | ATI Radeon, NVIDIA 8800
2006–2015     | CUDA, OpenCL, rise of HPC           | NVIDIA Tesla, AMD FirePro
2015–2020     | AI and deep learning boom           | NVIDIA Volta, AMD Vega
2020s–present | Specialized AI hardware, cloud GPUs | NVIDIA A100, H100, AMD MI300
4.3 Comparison of CPU vs. GPU architectures

Difference Between CPU and GPU

CPU                                                   | GPU
Stands for Central Processing Unit.                   | Stands for Graphics Processing Unit.
Used for general-purpose computation.                 | Used for specialized computation for graphics and parallel tasks.
Handles single-threaded, complex tasks.               | Handles highly parallel tasks (e.g., graphics rendering).
Optimized for sequential processing.                  | Optimized for parallel processing.
Relies on a multi-level cache hierarchy (L1, L2, L3). | Has large dedicated memory (VRAM) optimized for high-speed data transfer.
More energy-efficient for general tasks.              | Consumes more power due to parallel processing needs.
Emphasizes low latency.                               | Emphasizes high throughput.
Runs the operating system, applications, and tasks.   | Handles graphics rendering, AI, and machine learning.
Generally less expensive.                             | More expensive due to specialized hardware.

4.4 Applications of GPU Computing

GPUs, with their parallel processing capabilities, are used in a wide range
of applications beyond gaming, including AI/ML, scientific computing, video
editing, and more. They excel in tasks requiring high throughput and
parallel calculations, such as training deep learning models, rendering 3D
graphics, and processing large datasets.
Here's a more detailed look at GPU applications:

1. AI and Machine Learning:

• GPUs are crucial for accelerating the training of large language models (LLMs) and
  other deep learning models due to their ability to perform parallel calculations on
  massive datasets.
• They enable faster processing of images, videos, and other data used in AI
  applications.

2. Scientific Computing:

• GPUs are used in simulations, modeling, and analysis in fields like engineering,
  medicine, and physics.
• They can accelerate tasks like molecular dynamics simulations and weather
  modeling.

3. Video Editing and 3D Rendering:

• GPUs handle the complex calculations involved in creating and rendering 3D visuals
  in games, movies, and other visual media.
• They are also used in professional video editing and animation software.

4. Gaming:

• GPUs are essential for delivering smooth, high-quality graphics and immersive
  gaming experiences.

5. Cryptocurrency Mining:

• The parallel processing power of GPUs has been used to mine cryptocurrencies such as
  Ethereum (which was GPU-mined until its 2022 move to proof of stake).

6. Other Applications:

• Data analysis: GPUs can accelerate the processing and analysis of large datasets
  in various fields.
• Medical imaging: GPUs can be used to process and visualize medical images.
• Cloud computing: GPUs can accelerate cloud-based AI and machine learning
  workloads.
• Edge computing: GPUs can be used in edge devices to perform real-time
  processing and analysis.
• Quantum computing simulation: GPUs can be used to simulate quantum
  systems.

4.5 Introduction to CUDA Programming

CUDA stands for Compute Unified Device Architecture. It is a parallel
computing platform and API (Application Programming Interface) model
developed by NVIDIA. CUDA is not a separate programming language; it
extends C/C++ with a small set of keywords and a runtime library so that
parts of a program can run on the Graphics Processing Unit (GPU). This
allows computations to be performed in parallel, often with a substantial
speed-up. Using CUDA, one can harness the power of an NVIDIA GPU for
general computing tasks, such as processing matrices and other linear
algebra operations, rather than only graphics calculations.

Why do we need CUDA?

• GPUs are designed to perform high-speed parallel computations, originally
  to display graphics such as games.
• CUDA makes use of hardware that is already available: more than
  100 million GPUs are already deployed.
• It provides a 30-100x speed-up over other microprocessors for some
  applications.
• GPUs have very small Arithmetic Logic Units (ALUs) compared to the
  somewhat larger ones in CPUs, so many more of them fit on a chip. This
  allows for many parallel calculations, such as calculating the color for
  each pixel on the screen.

Architecture of CUDA

• The architecture diagram (not reproduced here) shows 16 Streaming
  Multiprocessors (SMs).
• Each Streaming Multiprocessor has 8 Streaming Processors (SPs), i.e.,
  a total of 128 Streaming Processors.
• Each Streaming Processor has a MAD unit (multiply-add unit) and an
  additional MU (multiplication unit).
• The GT200 has 30 Streaming Multiprocessors, each with 8 Streaming
  Processors, i.e., a total of 240 Streaming Processors and roughly
  1 TFLOP of peak processing power.
• Each Streaming Processor is heavily multithreaded, and the GPU as a
  whole can run thousands of threads per application.
• The G80 card has 16 Streaming Multiprocessors with 8 Streaming
  Processors each, i.e., a total of 128 SPs, and it supports 768 threads
  per Streaming Multiprocessor (note: not per SP).
• Since each Streaming Multiprocessor has 8 SPs, each SP supports a
  maximum of 768/8 = 96 threads. The total number of threads that can
  run on 128 SPs is therefore 128 * 96 = 12,288.
• This is why these processors are called massively parallel.
• The G80 chips have a memory bandwidth of 86.4 GB/s.
• The G80 also has an 8 GB/s communication channel with the CPU (4 GB/s
  for uploading to the CPU RAM, and 4 GB/s for downloading from the CPU
  RAM).
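
The thread-capacity arithmetic above can be checked in code. The following is a minimal sketch that is not part of the original notes: resident_threads is a hypothetical helper that multiplies the SM count by the threads supported per SM, and query_resident_threads reads the same two figures for whichever GPU is installed, using the CUDA runtime call cudaGetDeviceProperties.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on threads resident on the whole GPU.
constexpr int resident_threads(int sm_count, int threads_per_sm) {
    return sm_count * threads_per_sm;
}

// G80 figures quoted in the notes: 16 SMs, 768 threads per SM.
static_assert(resident_threads(16, 768) == 12288, "G80: 16 * 768 = 12,288");

// Same calculation for the GPU actually installed. Returns -1 if the
// device cannot be queried (for example, no CUDA-capable GPU is present).
int query_resident_threads(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return resident_threads(prop.multiProcessorCount,
                            prop.maxThreadsPerMultiProcessor);
}
```

Calling query_resident_threads(0) on a recent GPU typically returns a far larger number than the G80's 12,288, since both the SM count and the per-SM thread limit have grown.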

How CUDA works?

• In the simplest case, the GPU runs one kernel (a function executed by
  many threads) at a time.
• Each kernel consists of blocks, which are independent groups of threads.
• Each block contains threads, which are the basic units of computation.
• The threads in each block typically work together to calculate a value.
• Threads in the same block can share memory.
• In CUDA, sending information from the CPU to the GPU is often the
  most time-consuming part of the computation.
• For each thread, registers are the fastest storage, followed by shared
  memory. Global memory is the slowest, as is per-thread local memory,
  which physically resides in device memory. Constant and texture memory
  are also off-chip but are accelerated by read-only caches.
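
To illustrate how the threads of one block cooperate through shared memory to calculate a single value, here is a minimal sketch of a block-level sum. The names block_sum, sum_on_gpu and kThreads are illustrative assumptions, not part of the original notes, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // threads per block (assumed; must be a power of two)

// Each block adds up to kThreads input elements and writes one partial sum.
__global__ void block_sum(const float *in, float *partial, int n) {
    __shared__ float cache[kThreads];          // visible to this block only
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                           // wait until every load is done

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = cache[0];        // one result per block
    }
}

// Hypothetical host wrapper: returns the sum of all elements.
float sum_on_gpu(const std::vector<float> &h_in) {
    int n = static_cast<int>(h_in.size());
    if (n == 0) return 0.0f;
    int blocks = (n + kThreads - 1) / kThreads;
    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, kThreads>>>(d_in, d_partial, n);

    std::vector<float> h_partial(blocks);
    cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;                        // add the few partial sums on the CPU
    for (float p : h_partial) total += p;
    return total;
}
```

Blocks cannot share memory with each other, which is why each block writes its own partial result to global memory and the last step runs on the host.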
Typical CUDA Program Flow
1. Load data into CPU memory.
2. Copy data from CPU to GPU memory, e.g., cudaMemcpy(..., cudaMemcpyHostToDevice).
3. Call the GPU kernel using the device variable, e.g., kernel<<<blocks, threads>>>(gpuVar).
4. Copy results from GPU to CPU memory, e.g., cudaMemcpy(..., cudaMemcpyDeviceToHost).
5. Use the results on the CPU.
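
The five steps above can be sketched with a SAXPY operation (y = a * x + y). This is an illustrative example under stated assumptions rather than code from the original notes: the names saxpy and saxpy_on_gpu are hypothetical, x and y are assumed to have the same length, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// y[i] = a * x[i] + y[i], one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical host wrapper that follows the five steps in order.
std::vector<float> saxpy_on_gpu(float a, const std::vector<float> &x,
                                std::vector<float> y) {
    // Step 1: the input data is already in CPU memory (x and y).
    int n = static_cast<int>(x.size());
    if (n == 0) return y;
    size_t bytes = n * sizeof(float);
    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);

    // Step 2: copy data from CPU to GPU memory.
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);

    // Step 3: call the GPU kernel using the device pointers.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, a, d_x, d_y);

    // Step 4: copy results from GPU to CPU memory.
    cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);

    // Step 5: use the results on the CPU.
    return y;
}
```

The cudaMemcpy in step 4 runs on the default stream after the kernel, so it waits for the kernel to finish before copying.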

How work is distributed?

• Each thread "knows" the x and y coordinates of the block it is in, and
  its own coordinates within that block.
• These positions can be used to calculate a unique thread ID for each
  thread.
• The computational work done depends on the value of the thread ID.
  For example, each thread ID can correspond to one element, or a group
  of elements, of a matrix.
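
As a sketch of this mapping, the kernel below combines block and thread coordinates into a row and column, and uses the row-major offset as the unique thread ID. The names fill_with_ids and ids_on_gpu and the 16 x 16 block shape are assumptions made for illustration; error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one matrix element and stores its own unique ID there.
__global__ void fill_with_ids(int *m, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x coordinate in the grid
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y coordinate in the grid
    if (row < rows && col < cols) {
        int id = row * cols + col;                    // unique ID = row-major offset
        m[id] = id;
    }
}

// Hypothetical host wrapper: returns a rows x cols matrix in row-major order.
std::vector<int> ids_on_gpu(int rows, int cols) {
    std::vector<int> h(static_cast<size_t>(rows) * cols);
    if (h.empty()) return h;
    int *d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(int));

    dim3 threads(16, 16);                             // 256 threads per block
    dim3 blocks((cols + threads.x - 1) / threads.x,   // enough blocks to cover
                (rows + threads.y - 1) / threads.y);  // every column and row
    fill_with_ids<<<blocks, threads>>>(d, rows, cols);

    cudaMemcpy(h.data(), d, h.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

The bounds test in the kernel is needed because the grid is rounded up to whole blocks, so some threads fall outside the matrix.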

CUDA Applications

CUDA applications must run parallel operations on a lot of data, and be
processing-intensive.
1. Computational finance
2. Climate, weather, and ocean modeling
3. Data science and analytics
4. Deep learning and machine learning
5. Defence and intelligence
6. Manufacturing/AEC
7. Media and entertainment
8. Medical imaging
9. Oil and gas
10. Research
11. Safety and security
12. Tools and management

Benefits of CUDA

There are several advantages that give CUDA an edge over traditional
general-purpose GPU (GPGPU) computing with graphics APIs:
• Unified memory (CUDA 6.0 or later) and unified virtual memory
  (CUDA 4.0 or later).
• Shared memory: CUDA exposes a fast region of memory that is shared among
  the threads of a block. It can be used as a caching mechanism and provides
  more bandwidth than texture lookups.
• Scattered reads: code can read from any address in memory.
• Faster downloads and readbacks to and from the GPU.
• CUDA has full support for bitwise and integer operations.
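
The unified memory mentioned in the first bullet can be sketched as follows. With cudaMallocManaged, a single pointer is usable from both the CPU and the GPU, so no explicit cudaMemcpy is needed. The names add_one and managed_sum are hypothetical, and this is an illustrative sketch rather than a tuned implementation.

```cuda
#include <cuda_runtime.h>

__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1;
    }
}

// Hypothetical helper: fills a managed array with 0..n-1 on the CPU, adds 1
// to every element on the GPU, then sums the result on the CPU.
// Returns -1 if the allocation fails.
long long managed_sum(int n) {
    if (n <= 0) return 0;
    int *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) {
        return -1;
    }
    for (int i = 0; i < n; ++i) data[i] = i;       // written by the CPU

    add_one<<<(n + 255) / 256, 256>>>(data, n);    // updated by the GPU
    cudaDeviceSynchronize();                       // finish before the CPU reads

    long long sum = 0;
    for (int i = 0; i < n; ++i) sum += data[i];    // read by the CPU
    cudaFree(data);
    return sum;
}
```

The call to cudaDeviceSynchronize matters: kernel launches are asynchronous, so the host must wait before it touches managed data that the kernel is writing.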

Limitations of CUDA

• All CUDA source code, whether for the host machine or the GPU, is
  processed according to C++ syntax rules. Earlier versions of CUDA used
  C syntax rules, which means that older C-style CUDA source code may or
  may not work as required.
• CUDA has one-way interoperability (the ability of computer systems or
  software to exchange and make use of information) with rendering
  languages like OpenGL. OpenGL can access CUDA registered
  memory, but CUDA cannot access OpenGL memory.
• Later versions of CUDA do not provide emulators or fallback
  support for older versions.
• CUDA supports only NVIDIA hardware.

4.6 CUDA programming model and API

The CUDA programming model enables developers to execute highly parallel computations
using NVIDIA GPUs. It abstracts the GPU's architecture so that a problem can be expressed
in terms of data parallelism, mapping thousands of threads across large datasets.

🧠 CUDA Programming Model Overview

CUDA uses a heterogeneous computing model, meaning:

• The CPU (host) runs the main program.
• The GPU (device) executes parallel kernels (functions).
• The developer explicitly manages data transfer between host and device.

🏗️Key Concepts in the CUDA Programming Model

Concept          | Description
Kernel           | A function written in CUDA C/C++ and marked with __global__. It runs on the GPU and is launched from the CPU.
Thread           | The basic unit of execution on the GPU.
Block            | A group of threads that execute together and can share memory.
Grid             | A collection of blocks.
Warp             | A group of 32 threads scheduled and executed together by the GPU hardware.
Memory Hierarchy | Includes global memory, shared memory, local memory, constant memory, and registers. Memory type determines speed and scope of access.
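
To make the warp concept concrete, the sketch below has every thread of one block record which warp it belongs to and its position (lane) within that warp, using the built-in warpSize variable. The names warp_info and warp_info_on_gpu are illustrative assumptions; error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread stores (warp index, lane index) for its position in the block.
__global__ void warp_info(int2 *out) {
    int t = threadIdx.x;
    out[t] = make_int2(t / warpSize,   // which warp of the block
                       t % warpSize);  // position inside that warp
}

// Hypothetical host wrapper: launches a single block of `threads` threads
// (between 1 and 1024) and returns the pair recorded by each thread.
std::vector<int2> warp_info_on_gpu(int threads) {
    std::vector<int2> h(threads);
    int2 *d = nullptr;
    cudaMalloc(&d, threads * sizeof(int2));
    warp_info<<<1, threads>>>(d);
    cudaMemcpy(h.data(), d, threads * sizeof(int2), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

With 32 threads per warp, thread 40 of a block reports warp 1 and lane 8. This is also why block sizes are usually chosen as multiples of 32: a partly filled warp still occupies a full scheduling slot.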

📏 Thread Hierarchy in CUDA

When you launch a kernel, you specify:

kernel<<<numBlocks, threadsPerBlock>>>(args);

• numBlocks: Number of blocks in the grid (e.g., 128).

• threadsPerBlock: Number of threads in each block (e.g., 256).

Inside the kernel, you access:

int threadId = threadIdx.x; // Index within the block
int blockId = blockIdx.x; // Block index in the grid
int threadsPerBlock = blockDim.x; // Threads per block (a local variable must not reuse the built-in name blockDim)
int globalId = blockId * threadsPerBlock + threadId; // Global thread index
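
In practice the data size is rarely an exact multiple of the block size, so the number of blocks is rounded up and the kernel guards against out-of-range indices. A minimal sketch, in which blocks_for, write_global_ids and launch_write_global_ids are hypothetical names:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: smallest number of blocks that covers n elements.
constexpr int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Writes each thread's global index into out. The guard matters because the
// last block usually contains threads whose index is >= n.
__global__ void write_global_ids(int *out, int n) {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    if (globalId < n) {
        out[globalId] = globalId;
    }
}

// d_out must point to at least n ints of device memory, with n > 0.
void launch_write_global_ids(int *d_out, int n) {
    const int threadsPerBlock = 256;
    write_global_ids<<<blocks_for(n, threadsPerBlock), threadsPerBlock>>>(d_out, n);
}
```

For example, 1000 elements with 256 threads per block need 4 blocks (1024 threads), and the last 24 threads do nothing.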

🧰 CUDA API Components

🧩 1. Kernel Definition and Launch

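__global__ void myKernel(int *data) {
    // Global index of this thread. There is no bounds check, so data must
    // hold at least 4 * 256 = 1024 elements for the launch below.
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    data[idx] *= 2; // Double one element per thread
}

// Launch with 4 blocks of 256 threads
myKernel<<<4, 256>>>(devicePointer);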
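
The kernel above touches only as many elements as there are threads (4 * 256 = 1024). One common way to let the same launch configuration handle an array of any length is a grid-stride loop. The sketch below is an illustration of that pattern, not part of the original notes; double_all and double_on_gpu are hypothetical names and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread starts at its global index and then jumps ahead by the total
// number of threads in the grid until it runs past the end of the array.
__global__ void double_all(int *data, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= 2;
    }
}

// Hypothetical host wrapper: doubles every element using 4 blocks of 256 threads.
std::vector<int> double_on_gpu(std::vector<int> h) {
    int n = static_cast<int>(h.size());
    if (n == 0) return h;
    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    double_all<<<4, 256>>>(d, n);
    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

With 5000 elements, each of the 1024 threads processes four or five elements, and the loop condition doubles as the bounds check.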

💾 2. Memory Management

Host <-> Device memory allocation and transfer

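size_t size = 1024 * sizeof(int); // Example size: 1024 integers
int *h_data = (int *)malloc(size); // Allocate on host (fill with input data before copying)
int *d_data;
cudaMalloc(&d_data, size); // Allocate on device
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice); // Copy to GPU
// ... launch kernels that operate on d_data ...
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost); // Copy back
cudaFree(d_data); // Free device memory
free(h_data); // Free host memory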

🧮 3. Built-in Variables

Variable    | Description
threadIdx.x | Thread index within the block
blockIdx.x  | Block index within the grid
blockDim.x  | Number of threads per block
gridDim.x   | Number of blocks in the grid

🗂️4. Memory Types

Memory Type | Scope                   | Latency                                | Example Use
Global      | All threads             | High                                   | Data exchange between blocks
Shared      | Threads in a block      | Low                                    | Fast collaboration within a block
Local       | Single thread           | High (resides in device memory, cached) | Register spills and large per-thread arrays
Registers   | Single thread           | Fastest                                | Most local variables are compiled to registers
Constant    | All threads (read-only) | Low when cached                        | Constants used by all threads
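
The sketch below shows how several of these memory types appear in source code, using a polynomial evaluation c0 + c1*x + c2*x^2 as the workload. The names coeff, eval_poly and poly_on_gpu are assumptions made for illustration, and error checking is omitted. Shared memory does not appear here; it would be declared with __shared__ inside a kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Constant memory: read-only inside kernels, cached, same values for all threads.
__constant__ float coeff[3];

// x and y point to global memory, which every thread in the grid can access.
__global__ void eval_poly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // i, xi, acc: normally registers
    if (i < n) {
        float xi = x[i];                            // one global read
        float acc = coeff[0] + xi * (coeff[1] + xi * coeff[2]);
        y[i] = acc;                                 // one global write
    }
}

// Hypothetical host wrapper: evaluates c0 + c1*x + c2*x^2 for every input.
std::vector<float> poly_on_gpu(float c0, float c1, float c2,
                               const std::vector<float> &x) {
    std::vector<float> y(x.size());
    int n = static_cast<int>(x.size());
    if (n == 0) return y;

    float h_coeff[3] = {c0, c1, c2};
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));  // fill constant memory

    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    eval_poly<<<(n + 255) / 256, 256>>>(d_x, d_y, n);

    cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}
```

Constant memory suits the coefficients because every thread reads the same few values, while the per-element inputs and outputs have to live in global memory.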

🧪 5. Error Checking

cudaError_t err = cudaMemcpy(...);
if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
}
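
Kernel launches do not return a cudaError_t, so their errors have to be fetched separately. The following sketch wraps the check above in a small helper and applies it to a launch; cuda_ok, noop and launch_checked are hypothetical names, while cudaGetLastError and cudaDeviceSynchronize are CUDA runtime calls.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: reports a failed runtime call and returns false.
inline bool cuda_ok(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error in %s: %s\n", what,
                     cudaGetErrorString(err));
        return false;
    }
    return true;
}

__global__ void noop() {}

// A launch returns no status, so errors are collected in two stages.
bool launch_checked() {
    noop<<<1, 1>>>();
    // Stage 1: launch errors, such as an invalid grid or block size.
    if (!cuda_ok(cudaGetLastError(), "kernel launch")) {
        return false;
    }
    // Stage 2: errors raised while the kernel was running.
    return cuda_ok(cudaDeviceSynchronize(), "kernel execution");
}
```

In larger programs the same idea is often packaged as a macro so that the file name and line number of the failing call can be printed as well.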

🧩 CUDA Execution Flow Summary

1. Allocate memory on device (GPU) using cudaMalloc().

2. Copy input data from host (CPU) to device using cudaMemcpy().
3. Launch kernel using kernel<<<blocks, threads>>>().
4. Copy results back to host from device.
5. Free memory on the device using cudaFree().
