
UNIT-4

GPU
UNIT-IV: 8 Hours
Introduction: GPUs as Parallel Computers, Architecture of a Modern GPU, Why More Speed or Parallelism?, GPU Computing.
Introduction to CUDA: Data Parallelism, CUDA Program Structure, A Vector Addition Kernel, Device Global Memory and Data Transfer, Kernel Functions and Threading.
Self-Study: History of GPU Computing: Evolution of Graphics Pipelines, Parallel Programming Languages and Models, GPU Memory.

Heterogeneous Parallel Computing

CPUs drove rapid performance increases and cost reductions in computer applications for more than two decades, bringing GFLOPS, or giga (10^9) floating-point operations per second, to the desktop and TFLOPS, or tera (10^12) floating-point operations per second, to cluster servers.

This drive, however, has slowed since 2003 due to energy consumption and heat dissipation issues that limit the increase of the clock frequency and the level of productive activities that can be performed in each clock period within a single CPU.
The semiconductor industry has settled on two main trajectories for designing microprocessors:

• The multicore trajectory seeks to maintain the execution speed of sequential programs while moving into multiple cores.

• In contrast, the many-thread trajectory focuses more on the execution throughput of parallel applications.
CPUS AND GPUS HAVE FUNDAMENTALLY
DIFFERENT DESIGN PHILOSOPHIES
[Figure: CPU vs. GPU chip layout – the CPU devotes much of its area to control logic, a few large ALUs, and cache; the GPU devotes most of its area to many small ALUs; each is backed by its own DRAM.]
Multicore CPU

• The design of a CPU is optimized for sequential code performance.
• Sophisticated control logic allows instructions from a single thread to execute in parallel, or even out of their sequential order, while maintaining the appearance of sequential execution.
• Large cache memories are provided to reduce the instruction and data access latencies of large, complex applications.
Memory bandwidth is another important issue. The speed of
many applications is limited by the rate at which data can be
delivered from the memory system into the processors.
Latency-oriented design

CPUs will continue to be at a disadvantage in terms of memory bandwidth for some time
Many-core GPU

• Shaped by the fast-growing video game industry, which expects a massive number of floating-point calculations per video frame.
• The motive is to maximize the chip area and power budget dedicated to floating-point calculations: optimize for the execution throughput of a massive number of threads.
• The design saves chip area and power by allowing pipelined memory channels and arithmetic operations to have long latency.
• The reduced area and power spent on memory and arithmetic allow designers to put more cores on a chip and so increase the execution throughput.
• Throughput-oriented design
CPU + GPU

• GPUs will not perform well on tasks on which CPUs are designed to perform well. For programs that have one or very few threads, CPUs with lower operation latencies can achieve much higher performance than GPUs.
• When a program has many threads, GPUs with higher execution throughput can achieve much higher performance than CPUs.
• Many applications use both CPUs and GPUs, executing the sequential parts on the CPU and the numerically intensive parts on the GPU.
WHY MASSIVELY PARALLEL PROCESSORS?

• A quiet revolution and potential build-up
• Calculation: 367 GFLOPS (GPU) vs. 32 GFLOPS (CPU)
• Memory bandwidth: 86.4 GB/s (GPU) vs. 8.4 GB/s (CPU)
• Until recently, GPUs were programmed only through graphics APIs
• A GPU is in every PC and workstation – massive volume and potential impact
Architecture of a CUDA-capable GPU
[Figure: a CUDA-capable GPU – the host interface and input assembler feed a thread execution manager, which distributes threads across an array of streaming multiprocessors; groups of streaming processors share parallel data caches and texture units, and load/store units connect the array to the off-chip global memory.]

• The GPU is organized into an array of highly threaded streaming multiprocessors (SMs).
• Two streaming multiprocessors form a building block.
• The number of SMs in a building block can vary from one generation of CUDA GPUs to another.
• Each SM has a number of streaming processors (SPs) that share control logic and an instruction cache.
• Each GPU comes with multiple gigabytes of DRAM (global memory).
• A good application runs 5,000 to 12,000 threads, whereas a CPU supports 2 to 8 threads.
Stretching Traditional Architectures

• The centre core of the peach represents the sequential portions of an application.
• These sequential portions have been the target of modern instruction-level parallelism; features of modern CPUs such as cache memories, branch prediction, and data forwarding are important in preserving instruction-level parallelism in these portions.
• The orange portion represents the data-parallel portions of an application: these have large data sets whose elements can be processed in parallel.
• Traditional CPUs do not have a sufficient amount of execution resources to achieve a dramatic increase in performance on these portions.
• The meshed layer represents the types of data-parallel applications that can be efficiently covered by many-core architectures today.
• The figure shows that there is a large population of parallel applications that are currently covered by neither CPUs nor many-core processors.
WHY MORE SPEED OR PARALLELISM?
• The main motivation for massively parallel programming is for
applications to enjoy continued speed increase in future hardware
generations.
• When an application is suitable for parallel execution, a good
implementation on a GPU can achieve more than 100 times (100X)
speedup over sequential execution on a single CPU core.
• If the application includes data parallelism, it is often a simple task to achieve a 10X speedup with just a few hours of work.
GPU Computing
[Figure: The restricted input and output capabilities of the shader programming model – a fragment program reads per-thread input registers, per-shader textures, and per-context constants; it can use temporary registers, but can write results only to its output registers, which feed the frame-buffer (FB) memory.]
GPU beyond Graphics
Architecture of a GPU

• Same components as a typical CPU; however:
  • more computing elements
  • more types of memory
• Original GPUs had vertex and pixel shaders, built specifically for graphics.
• Modern GPUs are slightly different: CUDA – Compute Unified Device Architecture.
Computational Elements of a GPU

• Streaming Processor (SP) – the core of the design; the place where all of the computation takes place.
• Streaming Multiprocessor (SM) – a group of streaming processors; in addition to the SPs, these also contain the Special Function Units and Load/Store Units.
• Instruction schedulers
• Complex control logic
Types of GPU Memory

• Global – DRAM; slowest performance.
• Texture – cached global memory; "bound" at runtime.
• Constant – cached global memory.
• Shared – local to a block of threads.
• Evolution of Graphics Pipelines — mcs572 0.7.8 documentation (uic.edu)
Terminology

• Thread – the smallest grain of the hierarchy of device computation
• Block – a group of threads
• Grid – a group of blocks
• Warp – a group of 32 threads that are executed simultaneously on the device
• Kernel – the creator of a grid for GPU execution

Grids, Blocks, and
Threads
CUDA MEMORY

• Registers – fastest, per-thread
• Shared memory – faster, per-block
• Global memory – slower, device-wide
• Constant memory – read-only, cached
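A minimal sketch (not from the original slides; the names are illustrative) of how these memory spaces are expressed in CUDA C:

    __constant__ float coeff[16];        // constant memory: read-only in kernels, cached
    __device__ float d_table[1024];      // global memory: device-wide, slowest

    __global__ void memorySpacesDemo(const float* in, float* out, int n)
    {
        __shared__ float tile[256];      // shared memory: one copy per thread block
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // automatic variables live in registers
        if (i < n) tile[threadIdx.x] = in[i];
        __syncthreads();                 // all threads in the block reach this barrier
        if (i < n) out[i] = tile[threadIdx.x] * coeff[0] + d_table[i % 1024];
    }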
ARE GPUS FASTER THAN CPUS?

NO, NOT ALWAYS

• For most general-purpose computing, a CPU performs much better than a GPU.
• That is because CPUs are designed with fewer processor cores that have higher clock speeds than the ones found on GPUs, allowing them to complete a series of tasks very quickly.
• GPUs, on the other hand, have a much greater number of cores and are designed for a different purpose.
HETEROGENEOUS COMPUTING

• Host: the CPU and its memory
• Device: the GPU and its memory
Introduction to CUDA

• CUDA C is an extension to the popular C programming language with new keywords and application programming interfaces for programmers to take advantage of heterogeneous computing systems that contain both CPUs and massively parallel GPUs.

• To a CUDA programmer, the computing system consists of a host, which is a traditional CPU such as an Intel architecture microprocessor in today's personal computers, and one or more devices, which are processors with a massive number of arithmetic units.
CUDA

• A CUDA device is typically a GPU.
• CUDA devices accelerate the execution of applications by applying their massive number of arithmetic units to the data-parallel sections of those applications.
DATA PARALLELISM

• Modern software applications often process a large amount of data and incur long execution times on sequential computers.
• Data parallelism refers to the property that many arithmetic operations can be safely performed on different parts of the data at the same time.
• In general, data parallelism is the main source of scalability for parallel programs.
Task parallelism

• Task parallelism is typically exposed through task decomposition of applications. For example, a simple application may need to do a vector addition and a matrix-vector multiplication. Each of these would be a task. Task parallelism exists if the two tasks can be done independently.
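As an illustration (not from the original slides; vecAdd and matVecMul are hypothetical helpers), the decomposition described above can be sketched in plain C as follows:

    // Task decomposition: two tasks that operate on disjoint outputs.
    void vecAdd(const float* A, const float* B, float* C, int n);     // task 1: C = A + B
    void matVecMul(const float* M, const float* V, float* W, int n);  // task 2: W = M * V

    void compute(const float* A, const float* B, float* C,
                 const float* M, const float* V, float* W, int n)
    {
        // The two tasks read only their own inputs and write disjoint outputs,
        // so they are independent and could run concurrently (e.g. on separate
        // host threads or separate CUDA streams).
        vecAdd(A, B, C, n);        // task 1
        matVecMul(M, V, W, n);     // task 2
    }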
CUDA provides both Data and Task Parallelism

Data parallelism:
1. The same task is performed on different subsets of the same data.
2. Synchronous computation is performed.
3. There is only one execution thread operating on all sets of data, so the speedup is more.
4. The amount of parallelization is proportional to the input size.
5. It is designed for optimum load balance on a multiprocessor system.

Task parallelism:
1. Different tasks are performed on the same or different data.
2. Asynchronous computation is performed.
3. Each processor executes a different thread or process on the same or a different set of data, so the speedup is less.
4. The amount of parallelization is proportional to the number of independent tasks performed.
5. Load balancing depends on the availability of the hardware and on scheduling algorithms such as static and dynamic scheduling.

Example of data parallelism: vector addition
CUDA PROGRAM STRUCTURE
• The structure of a CUDA program reflects the coexistence of a host (CPU) and one or
more devices (GPUs) in the computer.
• Each CUDA source file can have a mixture of both host and device code.
• By default, any traditional C program is a CUDA program that contains only host code.
• One can add device functions and data declarations into any C source file.
• The function or data declarations for the device are clearly marked with special CUDA
keywords.
• These are typically functions that exhibit a rich amount of data parallelism.
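A minimal sketch (assuming the standard CUDA keywords; the file and kernel names are illustrative) of a single source file that mixes host and device code:

    // prog.cu – one CUDA C source file containing both host and device code.

    __global__ void myKernel(float* data, int n)  // __global__ marks a device (kernel) function
    {
        // device code: runs on the GPU
    }

    int main(void)                                // ordinary host code: runs on the CPU
    {
        // host code: memory allocation, data transfer, kernel launch, etc.
        return 0;
    }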
• Once device functions and data declarations are added to a source file, it is no longer acceptable to a traditional C compiler.
• The code needs to be compiled by a compiler that recognizes and understands these additional declarations.
• We will be using the CUDA C compiler from NVIDIA, called NVCC.

• CUDA keywords are used to separate the host code and the device code.
• The device code is marked with CUDA keywords for labelling data-parallel functions, called kernels, and their associated data structures.
• The device code is further compiled by a runtime component of NVCC and executed on a GPU device.
• In situations where there is no device available, or a kernel can be appropriately executed on a CPU, one can also choose to execute the kernel on a CPU using tools like MCUDA.
Execution of a CUDA program
A VECTOR ADDITION KERNEL: C program

In each piece of host code, we will prefix the names of variables that are mainly processed by the host with h_ and those of variables that are mainly processed by a device with d_.

    // Compute vector sum h_C = h_A + h_B
    void vecAdd(float* h_A, float* h_B, float* h_C, int n)
    {
        for (int i = 0; i < n; i++) h_C[i] = h_A[i] + h_B[i];
    }

    int main()
    {
        // Memory allocation for h_A, h_B, and h_C
        // I/O to read h_A and h_B, N elements each
        vecAdd(h_A, h_B, h_C, N);
    }
DEVICE GLOBAL MEMORY AND DATA TRANSFER
• In CUDA, host and devices have separate memory spaces.
• To execute a kernel on a device, the programmer needs to allocate global memory on the
device and transfer pertinent data from the host memory to the allocated device
memory.
• After device execution, the programmer needs to transfer result data from the device
memory back to the host memory and free up the device memory that is no longer
needed.
• The CUDA runtime system provides Application Programming Interface (API) functions to perform these activities on behalf of the programmer.
API functions for allocating and freeing device global memory.
Example
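The example figure itself is not reproduced here; as a stand-in, here is a sketch (using the standard CUDA runtime API) of how the host side of vecAdd allocates device global memory, transfers data, and frees the memory:

    void vecAdd(float* h_A, float* h_B, float* h_C, int n)
    {
        int size = n * sizeof(float);
        float *d_A, *d_B, *d_C;

        // Allocate device global memory for the three vectors.
        cudaMalloc((void**)&d_A, size);
        cudaMalloc((void**)&d_B, size);
        cudaMalloc((void**)&d_C, size);

        // Transfer the input vectors from host memory to device memory.
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // Kernel launch goes here (see the kernel and launch sketches below).

        // Transfer the result back to the host and free device memory.
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
    }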
KERNEL FUNCTIONS AND THREADING

• CUDA programming is an instance of the well-known SPMD (single program, multiple data) model, since all the threads in a grid execute the same code.
• When the host code launches a kernel, the CUDA runtime system generates a grid of threads that are organized in a two-level hierarchy.
• Each grid is organized into an array of thread blocks, which will be referred to as blocks.
• All blocks of a grid are of the same size; each block can contain up to 1,024 threads.
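As a side note (not part of the original slides), the per-device limit can be queried with the CUDA runtime API, for example:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                        // properties of device 0
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max grid size (x):     %d\n", prop.maxGridSize[0]);
        return 0;
    }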
• The number of threads in each thread block is specified by the host code when a kernel is launched.
• For a given grid of threads, the number of threads in a block is available in the blockDim variable.
• The same kernel can be launched with different numbers of threads at different parts of the host code.
• In the vector addition example used here, the value of the blockDim.x variable is 256.
• In general, the dimensions of thread blocks should be multiples of 32 due to hardware efficiency reasons.
• Each thread in a block has a unique threadIdx value.
• For example, the first thread in block 0 has value 0 in its threadIdx variable, the second thread has value 1, the third thread has value 2, etc.
• This allows each thread to combine its threadIdx and blockIdx values to create a unique global index for itself within the entire grid.
• A data index i is calculated as

  i = blockIdx.x * blockDim.x + threadIdx.x

• Since blockDim is 256 in our example, the i values of threads in block 0 range from 0 to 255.
• The i values of threads in block 1 range from 256 to 511. The i values of threads in block 2 range from 512 to 767. That is, the i values of the threads in these three blocks form a continuous coverage of the values from 0 to 767.
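Putting the indexing formula to work, a sketch of the vector addition kernel (following the standard CUDA formulation; the name vecAddKernel is illustrative):

    // Each thread computes one element of the output vector.
    __global__ void vecAddKernel(float* A, float* B, float* C, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index
        if (i < n)                                       // guard threads beyond the vector length
            C[i] = A[i] + B[i];
    }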
• By launching the kernel with a larger number of blocks, one can process larger vectors. By launching a kernel with n or more threads, one can process vectors of length n.
• Note that all threads execute the same kernel code.
When the host code launches a kernel, it sets the grid and thread block dimensions via execution
configuration parameters.

The configuration parameters are given between the <<< and >>> before the traditional C function
arguments.

The first configuration parameter gives the number of thread blocks in the grid. The second specifies the
number of threads in each thread block.

To ensure that we have enough threads to cover all the vector elements, we apply the C ceiling function to
n/256.0.
Example
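The example slide is not reproduced; a sketch of the corresponding launch statement, assuming the vecAddKernel and the 256-thread blocks used above:

    // Inside the host vecAdd() shown earlier, after the cudaMemcpy calls:
    // ceil(n/256.0) thread blocks of 256 threads each cover all n elements
    // (ceil comes from <math.h> in host code).
    vecAddKernel<<<(int)ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);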
