UNIT-4
GPU
UNIT-IV (8 Hours)
Introduction: GPUs as Parallel Computers, Architecture of a Modern GPU, Why More Speed or Parallelism?, GPU Computing.
Introduction to CUDA: Data Parallelism, CUDA Program Structure, A Vector Addition Kernel, Device Global Memory and Data Transfer, Kernel Functions and Threading.
Self-Study: History of GPU Computing: Evolution of Graphics Pipelines, Parallel Programming Languages and Models, GPU Memory.
Heterogeneous Parallel Computing
The CPU drove rapid performance increases and cost reductions in computer applications for more than two decades, bringing GFLOPS, or giga (10^9) floating-point operations per second, to the desktop and TFLOPS, or tera (10^12) floating-point operations per second, to cluster servers. This drive, however, has slowed since 2003 due to energy consumption and heat dissipation issues that limit both the increase of the clock frequency and the level of productive activities that can be performed in each clock period within a single CPU.
The semiconductor industry has settled on two main trajectories for designing microprocessors: the multicore CPU and the many-thread GPU.
CPU
• The design of a CPU is optimized for sequential code performance.
• CPUs will continue to be at a disadvantage in terms of memory bandwidth for some time.
GPU
• Shaped by the fast-growing video game industry, which expects a massive number of floating-point calculations per video frame.
• The motive is to maximize the chip area and power budget dedicated to floating-point calculations: optimize for the execution throughput of massive numbers of threads, which allows pipelined memory channels and arithmetic operations to have long latency.
• The reduced area and power spent on memory and arithmetic allows designers to put more cores on a chip, increasing the execution throughput.
• Throughput-oriented design
• A GPU will not perform well on tasks on which CPUs are designed to perform well. For programs that have one or very few threads, CPUs with lower operation latencies can achieve much higher performance than GPUs.
CPU + GPU
• When a program has many threads, GPUs with higher execution throughput can achieve much higher performance than CPUs.
• Many applications use both CPUs and GPUs,
executing the sequential parts on the CPU and
numerically intensive parts on the GPUs.
WHY MASSIVELY PARALLEL PROCESSORS?
• A quiet revolution and potential build-up
• Calculation: 367 GFLOPS (GPU) vs. 32 GFLOPS (CPU)
• Memory bandwidth: 86.4 GB/s (GPU) vs. 8.4 GB/s (CPU)
• Until recently, GPUs could be programmed only through graphics APIs
• A GPU is in every PC and workstation – massive volume and potential impact
Architecture of a CUDA-capable GPU
Figure: the host feeds work through an input assembler to an array of streaming multiprocessors; each multiprocessor has its own parallel data cache and texture units, and all of them share the device's global memory.
Figure: the restricted input and output capabilities of a shader programming model – a fragment program reads textures, constants, and temporary registers, and writes only to output registers destined for frame buffer (FB) memory.
GPU beyond Graphics
A GPU has the same basic kinds of components as a typical CPU. However, they are organized very differently, as described next.
Streaming Multiprocessors
• The GPU's cores are grouped into streaming multiprocessors (SMs).
• In addition to the streaming processors (SPs), each SM also contains Special Function Units and Load/Store Units.
Instruction Schedulers
• Each SM schedules and issues instructions at the granularity of warps.
• Warp – a group of 32 threads that are executed simultaneously on the device.
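As a rough sketch of how a thread block is partitioned into warps (assuming a one-dimensional block; the kernel name and output arrays are made up for illustration):

// Hypothetical kernel: records, for each thread of a 1D block,
// which warp it belongs to and its lane within that warp.
__global__ void whichWarp(int *warpIdOut, int *laneOut) {
    int t = threadIdx.x;       // thread index within the block
    warpIdOut[t] = t / 32;     // threads 0-31 form warp 0, threads 32-63 form warp 1, ...
    laneOut[t]   = t % 32;     // position of this thread inside its warp
}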
Memory spaces on the device:
• Registers – fastest, private to each thread.
• Shared memory – faster, shared per block.
• Global memory – slower, accessible to all threads.
• Constant memory – read-only, cached.
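A minimal sketch of where these memory spaces appear in a kernel (the names scale, tile, and scaleCopy are made-up illustrations, and the block size is assumed to be at most 256 threads):

__constant__ float scale;                       // constant memory: read-only on the device, cached

__global__ void scaleCopy(const float *in, float *out, int n) {
    __shared__ float tile[256];                 // shared memory: one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // index held in a per-thread register
    if (i < n)
        tile[threadIdx.x] = in[i] * scale;      // read from global memory, scale by the constant
    __syncthreads();                            // all threads of the block reach this barrier
    if (i < n)
        out[i] = tile[threadIdx.x];             // write the result back to global memory
}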
ARE GPUS FASTER THAN CPUS?
Host: the CPU and its memory.
Device: the GPU and its memory.
Introduction to CUDA
Data parallelism vs. task parallelism:
3. Data parallelism: there is only one execution thread operating on all sets of data, so the speedup is greater. Task parallelism: each processor executes a different thread or process on the same or a different set of data, so the speedup is smaller.
4. Data parallelism: the amount of parallelization is proportional to the input size. Task parallelism: the amount of parallelization is proportional to the number of independent tasks performed.
5. Data parallelism: it is designed for optimum load balance on a multiprocessor system. Task parallelism: load balancing depends on the availability of hardware and on scheduling algorithms, such as static and dynamic scheduling.
Example of data parallelism: vector addition
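A minimal sketch of the sequential version in plain C: each output element depends only on the corresponding input elements, so every loop iteration is independent and all of them could, in principle, run in parallel.

// Sequential vector addition: iteration i touches only A[i], B[i], and C[i],
// so the n additions are independent of one another (data parallelism).
void vecAdd(const float *A, const float *B, float *C, int n) {
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}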
CUDA PROGRAM STRUCTURE
• The structure of a CUDA program reflects the coexistence of a host (CPU) and one or
more devices (GPUs) in the computer.
• Each CUDA source file can have a mixture of both host and device code.
• By default, any traditional C program is a CUDA program that contains only host code.
• One can add device functions and data declarations into any C source file.
• The function or data declarations for the device are clearly marked with special CUDA
keywords.
• Device functions are typically functions that exhibit a rich amount of data parallelism.
• Once device functions and data declarations
are added to a source file, it is no longer
acceptable to a traditional C compiler.
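A minimal sketch of this mixture (the kernel name is made up, and device-side printf is assumed to be available, as it is on current GPUs): the host code is ordinary C, while the device function is marked with the CUDA keyword __global__ and the whole file must be compiled with nvcc.

#include <stdio.h>
#include <cuda_runtime.h>

// Device code: the __global__ keyword marks a kernel that runs on the GPU.
__global__ void helloKernel(void) {
    printf("Hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
}

// Host code: ordinary C that runs on the CPU.
int main(void) {
    helloKernel<<<2, 4>>>();      // launch a grid of 2 blocks, 4 threads per block
    cudaDeviceSynchronize();      // wait until the device has finished
    return 0;
}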
The configuration parameters are given between the <<< and >>> before the traditional C function
arguments.
The first configuration parameter gives the number of thread blocks in the grid. The second specifies the
number of threads in each thread block.
To ensure that we have enough threads to cover all the vector elements, we apply the C ceiling function to
n/256.0.
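For instance, with 256 threads per block, such a launch might look like the line below (the kernel name vecAddKernel and the device pointers d_A, d_B, d_C are assumptions for illustration):

// ceil(n/256.0) blocks of 256 threads guarantee at least n threads in total;
// the extra threads in the last block are masked off inside the kernel.
vecAddKernel<<<ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);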
Example
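What follows is a sketch of such a vector addition example (the names h_A, d_A, vecAddKernel, and so on are assumptions): the host allocates device global memory with cudaMalloc, transfers the input vectors with cudaMemcpy, launches the kernel, copies the result back, and frees the device memory.

#include <math.h>
#include <cuda_runtime.h>

// Kernel function: each thread computes one element of the output vector.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                       // guard: the last block may have extra threads
        C[i] = A[i] + B[i];
}

// Host function: manage device global memory, data transfer, and the kernel launch.
void vecAdd(const float *h_A, const float *h_B, float *h_C, int n) {
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **)&d_A, size);                        // allocate device global memory
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);     // copy inputs host -> device
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    vecAddKernel<<<(int)ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);   // enough threads to cover n elements

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);     // copy result device -> host

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);            // free device global memory
}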