DS1822 - Parallel Computing-unit3

This document provides an overview of GPU programming, focusing on GPU architectures, data parallelism, and the CUDA programming model. It details the structure of CUDA programs, including memory management, kernel execution, and the organization of threads, blocks, and grids. Additionally, it highlights the differences between CPU and GPU processing, emphasizing the advantages of using CUDA for parallel computing tasks.


UNIT III PROGRAMMING GPUs

GPU Architectures – Data Parallelism – CUDA Basics – CUDA Program Structure – Threads, Blocks, Grids
– Memory Handling.

I. GPU Architectures

• GPU architecture is mainly driven by the following key factors:


1. Amount of data processed at one time (Parallel processing).
2. Processing speed on each data element (Clock frequency).
3. Amount of data transferred at one time (Memory bandwidth).
4. Time for each data element to be transferred (Memory latency).
• To begin with, let us look at the main design distinctions between a CPU and a GPU.
• A CPU consists of multicore processors with large cores and large caches, using control units for optimal serial performance.
• A GPU, in contrast, runs a large number of threads with small caches and minimal control units, optimizing for execution throughput.
• A GPU provides much higher instruction throughput and memory bandwidth than a CPU within a similar price and power envelope.

• GPU architecture focuses more on putting the available cores to work and less on low-latency cache access.
• In a generic many-core GPU, little chip area is devoted to control logic and caches; a large number of transistors is devoted to supporting parallel data processing. The GPU architecture is organized as follows.
1) The GPU consists of multiple Processor Clusters (PC).

2) Each Processor Cluster (PC) contains multiple Streaming Multiprocessors (SM).

3) Each Streaming Multiprocessor (SM) has a number of Streaming Processors (SPs), also known as cores, that share control logic and an L1 (level 1) instruction cache.

4) Each Streaming Multiprocessor (SM) uses a dedicated L1 (level 1) cache and a shared L2 (level 2) cache before pulling data from global memory, i.e. Graphics Double Data Rate (GDDR) DRAM.

5) The number of Streaming Multiprocessors (SMs) and the number of cores per Streaming Multiprocessor (SM) vary with the targeted price and market segment of the GPU.
6) The global memory of the GPU consists of multiple GBs of DRAM. The growing size of global memory allows data to be kept longer in global memory, thereby reducing transfers to the CPU.

7) GPU architecture is tolerant of memory latency. Higher bandwidth makes up for memory latency.

8) In comparison to a CPU, a GPU works with fewer and smaller memory cache layers. This is because the GPU has more transistors dedicated to computation and relies less on caches to hide memory access.

9) The memory bus is optimized for bandwidth, allowing it to serve a large number of ALUs simultaneously.

10) GPU architecture is more optimized for data parallel throughput computations.

11) In order to execute tasks in parallel, tasks are scheduled at the Processor Cluster (PC) or Streaming Multiprocessor (SM) level.

II. Data Parallelism

• Data parallelism is a key concept in parallel computing where a particular dataset is divided into smaller
chunks, and the same operation is performed concurrently on each chunk. This approach leverages the
ability to process multiple data elements simultaneously, significantly speeding up computations.
• CPUs use Task Parallelism wherein
a. Multiple tasks map to multiple threads and tasks run different instructions
b. Generally threads are heavyweight
c. Programming is done for the individual thread.

• Whereas GPUs use data parallelism wherein


a. Same instruction is executed on different data
b. Generally threads are lightweight
c. Programming is done for batches of threads (e.g. one pixel shader per group of pixels).
• In Data Parallelism, performance improvement is achieved by applying the same small set of tasks
iteratively over multiple streams of data.
• It is simply a way of executing an application in parallel on multiple processors.
• In Data Parallelism, the goal is to scale the throughput of processing based on the ability to decompose
the data set into concurrent processing streams, all performing the same set of operations.
• CPU application manages the GPU and uses it to offload specific computations.
• GPU code is encapsulated in parallel routines called kernels.
• CPU executes the main program, which prepares the input data for GPU processing, invokes the kernel
on the GPU, and then obtains the results after the kernel terminates.
• A GPU kernel maintains its own application state. A GPU kernel is an ordinary sequential function, but it is executed in parallel by thousands of GPU threads.
• Data Parallelism is achieved in SIMD (Single Instruction Multiple Data) mode.
• In SIMD mode, an instruction is only decoded once and multiple ALUs perform the work on multiple
data elements in parallel.
• In this mode, either a single controller drives the parallel data operations, or multiple threads perform the same work on individual compute nodes (SPMD).
• SIMD parallelism enhances the performance of computationally intensive applications that execute the same operation on distinct elements of a dataset (see the kernel sketch at the end of this section).

• Modern applications process large amounts of data, which incurs significant execution time on sequential computers.
• Data parallelism is used to advantage in applications such as image processing, computer graphics, and linear algebra libraries (e.g. matrix multiplication).
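The following is a minimal sketch, not part of the original notes, of what such a data-parallel kernel looks like in CUDA C (the syntax is introduced in the next sections). Each thread applies the same multiply-add instruction to a different element; the kernel name saxpy, the array names, and the launch configuration are illustrative assumptions.

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index of this thread
    if (i < n) {                                    // guard against threads past the end
        y[i] = a * x[i] + y[i];                     // same operation, different data element
    }
}

// Host-side launch (assuming n elements and device pointers d_x, d_y):
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);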

III. CUDA Hardware & CUDA Basics

CUDA Hardware:
• CUDA (Compute Unified Device Architecture) is a scalable parallel computing platform and programming model for general-purpose computing on GPUs.
• CUDA was introduced by NVIDIA in 2006.
• CUDA is a data-parallel extension to the C/C++ languages and an API model for parallel programming.
• CUDA parallel programming model has three key abstractions –
o (1) a hierarchy of thread groups
o (2) shared memories, and
o (3) barrier synchronization.
• The programmer or compiler decomposes large computing problems into many small problems that
can be solved in parallel.
• Programs written using CUDA harness the power of GPU and thereby increase the computing
performance.
• In GPU accelerated applications, the sequential part of the workload runs on the CPU (as it is optimized
for single threaded performance) and the compute intensive portion of the application runs on thousands
of GPU cores in parallel.
• Using CUDA, developers can utilize the power of GPUs to perform general computing tasks like
multiplying matrices and performing linear algebra operations (instead of just doing graphical
calculations).
• With CUDA, developers program in popular languages such as C, C++, Fortran, Python, DirectCompute, and MATLAB, and express parallelism through extensions in the form of a few basic keywords.
• At a high level, a graphics card with a many-core GPU and high-speed graphics device memory sits inside a standard PC or server with one or two multicore CPUs.

• The GPU consists of multiple Streaming Multiprocessors (SMs), and each Streaming Multiprocessor (SM) has a number of Streaming Processors (SPs), also known as cores. Each Streaming Multiprocessor (SM) uses a dedicated L1 cache and a shared L2 cache. This gives a high-level overview of the GPU hardware.

CUDA Basics:

• CUDA is a platform and programming model for CUDA-enabled GPUs.


• The platform exposes GPUs for general-purpose computing. CUDA provides C/C++ language extensions and APIs for programming and managing GPUs.
• In CUDA programming, both CPUs and GPUs are used for computing.
• Typically, we refer to the CPU and the GPU as the host and the device, respectively.
• CPUs and GPUs are separate platforms with their own memory spaces. Typically, we run serial workloads on the CPU and offload parallel computation to the GPU.

Comparison between CUDA and C:

The major difference between a C and a CUDA implementation is the __global__ specifier and the <<<...>>> syntax. The __global__ specifier indicates a function that runs on the device (GPU). Such functions can be called from host code, e.g. from the main() function, and are known as "kernels".

When a kernel is called, its execution configuration is provided through the <<<...>>> syntax, e.g. cuda_hello<<<1,1>>>(). In CUDA terminology, this is called a "kernel launch".
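The hello-world program referred to above is not reproduced in this document. The following is a minimal sketch of what it typically looks like, assuming the kernel is named cuda_hello as in the launch shown above and the file is saved as hello.cu:

#include <stdio.h>

// Kernel definition: this function runs on the device (GPU)
__global__ void cuda_hello() {
    printf("Hello World from GPU!\n");
}

int main() {
    // Kernel launch: 1 block of 1 thread
    cuda_hello<<<1, 1>>>();
    // Wait for the GPU to finish before the program exits
    cudaDeviceSynchronize();
    return 0;
}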

Compiling CUDA programs:

• Compiling a CUDA program is similar to compiling a C program. NVIDIA provides a CUDA compiler called nvcc in the CUDA toolkit to compile CUDA code, which is typically stored in a file with the extension .cu. For example:

$> nvcc hello.cu -o hello

Following is an example of vector addition implemented in C (./vector_add.c). The example computes the addition of two vectors stored in arrays a and b and puts the result in array out.
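The vector_add.c listing itself is not included in this document. The following is a minimal sketch of such a program, assuming the array names a, b, and out and a vector length N as used in the description above:

#include <stdio.h>
#include <stdlib.h>

#define N 10000000

// Sequential vector addition: out[i] = a[i] + b[i]
void vector_add(float *out, float *a, float *b, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

int main() {
    float *a   = (float*)malloc(sizeof(float) * N);
    float *b   = (float*)malloc(sizeof(float) * N);
    float *out = (float*)malloc(sizeof(float) * N);

    // Initialize the input vectors
    for (int i = 0; i < N; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    vector_add(out, a, b, N);

    printf("out[0] = %f\n", out[0]);

    free(a);
    free(b);
    free(out);
    return 0;
}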

Converting vector addition to CUDA:

• We will convert vector_add.c to a CUDA program, vector_add.cu, using the hello-world program as a template.
• In CUDA terminology, CPU memory is called host memory and GPU memory is called device memory. Pointers to CPU and GPU memory are called host pointers and device pointers, respectively.
• For data to be accessible by the GPU, it must be present in device memory. CUDA provides APIs for allocating device memory and for transferring data between host and device memory.
• Following is the common workflow of CUDA programs (a code sketch of this workflow appears after the list):
1. Allocate host memory and initialize host data
2. Allocate device memory
3. Transfer input data from host to device memory
4. Execute kernels
5. Transfer output from device memory to host
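The converted vector_add.cu is likewise not reproduced here. The following is a minimal sketch showing how the five workflow steps above map onto CUDA API calls (cudaMalloc, cudaMemcpy, and cudaFree are described in the subsections that follow); the kernel body, the vector length N, and the variable names are illustrative assumptions:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

// Kernel: one thread computes one element of out
__global__ void vector_add(float *out, float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

int main() {
    size_t size = N * sizeof(float);

    // 1. Allocate host memory and initialize host data
    float *a   = (float*)malloc(size);
    float *b   = (float*)malloc(size);
    float *out = (float*)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // 2. Allocate device memory
    float *d_a, *d_b, *d_out;
    cudaMalloc((void**)&d_a, size);
    cudaMalloc((void**)&d_b, size);
    cudaMalloc((void**)&d_out, size);

    // 3. Transfer input data from host to device memory
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // 4. Execute the kernel
    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(d_out, d_a, d_b, N);

    // 5. Transfer output from device memory to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", out[0]);

    // Free device and host memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    free(a); free(b); free(out);
    return 0;
}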

Device memory management

• CUDA provides several functions for allocating device memory. The most common ones are cudaMalloc() and cudaFree(). The syntax for both functions is as follows:
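The signatures themselves do not appear in this document; the standard CUDA runtime declarations are:

// Allocate `size` bytes of device (GPU) memory; the device pointer is returned in *devPtr
cudaError_t cudaMalloc(void **devPtr, size_t size);

// Free device memory previously allocated with cudaMalloc
cudaError_t cudaFree(void *devPtr);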
Memory transfer:

• Transferring data between host and device memory can be done through the cudaMemcpy function, which is similar to memcpy in C. The syntax of cudaMemcpy is as follows:
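Again, the signature is missing from this document; the standard CUDA runtime declaration is:

// Copy `count` bytes from `src` to `dst`; `kind` specifies the direction of the copy,
// e.g. cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind);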

IV. CUDA Program Structure


Basic Structure of a CUDA Program:

1. Initialization:

o Device Query: Check the number and capabilities of available GPUs.

o Set Device: Select the appropriate GPU for computations.

2. Memory Allocation:

o Allocate Host Memory: Allocate memory on the CPU (host) side.

o Allocate Device Memory: Allocate memory on the GPU (device) side.

o Transfer Data: Copy data from the host to the device.

3. Kernel Launch:

o Define Kernel Function: Write the function that will run on the GPU.

o Configure Execution Parameters: Determine the grid and block size for execution.

o Launch Kernel: Execute the kernel function on the GPU.

4. Synchronization:
o Synchronize: Ensure all threads have completed execution before proceeding.

5. Memory Cleanup:

o Transfer Data Back: Copy results from the device back to the host.

o Free Memory: Deallocate memory on both the host and device.

#include <cuda_runtime.h>
#include <iostream>

// Kernel function to be executed on the GPU
__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x;
    c[index] = a[index] + b[index];
}

int main() {
    // Initialize host data
    const int arraySize = 5;
    int hostA[arraySize] = {1, 2, 3, 4, 5};
    int hostB[arraySize] = {10, 20, 30, 40, 50};
    int hostC[arraySize];

    // Allocate device memory
    int *deviceA, *deviceB, *deviceC;
    cudaMalloc((void**)&deviceA, arraySize * sizeof(int));
    cudaMalloc((void**)&deviceB, arraySize * sizeof(int));
    cudaMalloc((void**)&deviceC, arraySize * sizeof(int));

    // Copy data from host to device
    cudaMemcpy(deviceA, hostA, arraySize * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, arraySize * sizeof(int), cudaMemcpyHostToDevice);

    // Launch kernel
    add<<<1, arraySize>>>(deviceA, deviceB, deviceC);

    // Copy result from device to host
    cudaMemcpy(hostC, deviceC, arraySize * sizeof(int), cudaMemcpyDeviceToHost);

    // Print result
    std::cout << "Result: ";
    for (int i = 0; i < arraySize; i++) {
        std::cout << hostC[i] << " ";
    }
    std::cout << std::endl;

    // Free device memory
    cudaFree(deviceA);
    cudaFree(deviceB);
    cudaFree(deviceC);
    return 0;
}
Explanation:
• Kernel Function: The add function runs on the GPU and performs addition on arrays.
• Memory Allocation: Memory is allocated on both the host and device.
• Data Transfer: Data is copied between the host and device.
• Kernel Launch: The kernel is launched with execution parameters.
• Synchronization: The results are copied back to the host once all threads complete.
• Memory Cleanup: Allocated memory is freed to prevent leaks.
CUDA provides a flexible and powerful way to harness the GPU’s capabilities, enabling significant
performance improvements for parallelizable tasks.

V. Threads, Blocks and Grids


The CUDA Programming Model:

• The CUDA programming model provides an abstraction of GPU architecture that acts as a bridge
between an application and its possible implementation on GPU hardware.
• This section outlines the main concepts of the CUDA programming model by describing how they are exposed in general-purpose programming languages like C/C++.
• Let us first introduce two keywords widely used in the CUDA programming model: host and device.
• The host is the CPU available in the system.
• The system memory associated with the CPU is called host memory.
• The GPU is called the device, and GPU memory is likewise called device memory.

To execute any CUDA program, there are three main steps:

• Copy the input data from host memory to device memory, also known as host-to-device transfer.

• Load the GPU program and execute, caching data on-chip for performance.

• Copy the results from device memory to host memory, also called device-to-host transfer.
CUDA kernel and thread hierarchy:

A CUDA kernel is a function that gets executed on the GPU. The parallel portion of your application is executed K times in parallel by K different CUDA threads, as opposed to only once like a regular C/C++ function.

• Every CUDA kernel starts with a __global__ declaration specifier. Programmers provide a unique
global ID to each thread by using built-in variables.
• A group of threads is called a CUDA block.
• CUDA blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads.
• Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other
SMs in GPU (except during preemption, debugging, or CUDA dynamic parallelism).
• One SM can run several concurrent CUDA blocks depending on the resources needed by CUDA
blocks.
• Each kernel is executed on one device, and CUDA supports running multiple kernels on a device at the same time. A kernel's grid of blocks is mapped onto the hardware resources (SMs) available in the GPU.

• CUDA defines built-in 3D variables for threads and blocks.


• Threads are indexed using the built-in 3D variable threadIdx.
• Three-dimensional indexing provides a natural way to index elements in vectors, matrices, and volumes, and makes CUDA programming easier. Similarly, blocks are indexed using the built-in 3D variable blockIdx.
Here are a few noticeable points:
• The CUDA architecture limits the number of threads per block (a 1024-threads-per-block limit).
• The dimensions of the thread block are accessible within the kernel through the built-in blockDim variable.
• All threads within a block can be synchronized using the intrinsic function __syncthreads(). With __syncthreads(), all threads in the block must wait before any can proceed.
• The number of threads per block and the number of blocks per grid specified in the <<<…>>> syntax can be of type int or dim3.
• These triple angle brackets mark a call from host code to device code. This is also called a kernel launch (a short indexing and launch-configuration sketch follows).
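The following is a minimal sketch, not taken from the original text, that ties these points together: each thread computes a unique global index from blockIdx, blockDim, and threadIdx, and the same kernel is launched once with plain int configuration values and once with dim3 values. The kernel name and sizes are illustrative assumptions.

#include <stdio.h>

// Each thread derives its unique global index from the built-in variables.
__global__ void print_index(int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global 1D index
    if (i < n) {
        printf("block %d, thread %d -> global index %d\n", blockIdx.x, threadIdx.x, i);
    }
}

int main() {
    int n = 8;

    // Launch configuration with plain ints: 2 blocks of 4 threads each
    print_index<<<2, 4>>>(n);
    cudaDeviceSynchronize();

    // Equivalent launch configuration using dim3 (unspecified dimensions default to 1)
    dim3 grid(2);    // 2 blocks along x
    dim3 block(4);   // 4 threads per block along x
    print_index<<<grid, block>>>(n);
    cudaDeviceSynchronize();

    return 0;
}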

VI. CUDA - Memory Handling

• Memory handling in CUDA is a crucial aspect of achieving optimal performance. Proper management
of memory can significantly impact the efficiency and speed of your CUDA programs. Here are the key
concepts and types of memory in CUDA:

Types of Memory:

1. Global Memory:

o Accessible by all threads and has high latency.

o Used for data transfer between host (CPU) and device (GPU).

o Typically, the largest memory space, but also the slowest.

2. Shared Memory:

o Shared among threads within the same block.

o Much faster than global memory.

o Ideal for data that needs to be frequently accessed by multiple threads within a block (see the sketch after this list).

3. Registers:

o Fastest memory available, private to each thread.

o Limited in number and used for storing frequently accessed variables.

4. Constant Memory:

o Read-only memory accessible by all threads.

o Optimized for broadcast operations (when many threads read the same value).
5. Texture Memory:

o Read-only memory optimized for certain types of access patterns.

o Provides caching and is often used in image processing.
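To illustrate the shared-memory point above, the following is a minimal sketch, not part of the original notes, of a block-level sum that stages data in __shared__ memory and synchronizes with __syncthreads(); the kernel name and BLOCK_SIZE are assumptions, and the kernel is assumed to be launched with BLOCK_SIZE threads per block.

#define BLOCK_SIZE 256

// Each block sums up to BLOCK_SIZE elements of `in` using fast on-chip shared memory,
// then thread 0 writes the block's partial sum to out[blockIdx.x].
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK_SIZE];      // shared among the threads of this block

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Stage data from slow global memory into fast shared memory
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                        // wait until the whole tile is loaded

    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            tile[tid] += tile[tid + stride];
        }
        __syncthreads();                    // all partial sums ready before the next step
    }

    if (tid == 0) {
        out[blockIdx.x] = tile[0];          // one partial sum per block
    }
}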

Memory Handling Workflow:

1. Allocate Memory on the Device:

o Use cudaMalloc to allocate memory on the GPU.

int *d_array;
cudaMalloc((void**)&d_array, size);
2. Copy Data from Host to Device:
• Use cudaMemcpy to transfer data from the CPU to the GPU.
cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice);
3. Kernel Execution:
• Perform computations on the GPU using the allocated memory.
kernel<<<gridSize, blockSize>>>(d_array);
4. Copy Data from Device to Host:
• Use cudaMemcpy to transfer data back from the GPU to the CPU.
cudaMemcpy(h_array, d_array, size, cudaMemcpyDeviceToHost);
5. Free Memory on the Device:
• Use cudaFree to deallocate memory on the GPU.
cudaFree(d_array);

Example Code:

• Here’s a simple example demonstrating memory handling in CUDA:

#include <cuda_runtime.h>
#include <stdio.h>

// Kernel: each thread adds one pair of elements
__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x;
    c[index] = a[index] + b[index];
}

int main() {
    const int arraySize = 5;
    int h_a[arraySize] = {1, 2, 3, 4, 5};
    int h_b[arraySize] = {10, 20, 30, 40, 50};
    int h_c[arraySize];

    int *d_a, *d_b, *d_c;
    size_t size = arraySize * sizeof(int);

    // Allocate device memory
    cudaMalloc((void**)&d_a, size);
    cudaMalloc((void**)&d_b, size);
    cudaMalloc((void**)&d_c, size);

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Launch kernel
    add<<<1, arraySize>>>(d_a, d_b, d_c);

    // Copy result from device to host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // Print result
    printf("Result: ");
    for (int i = 0; i < arraySize; i++) {
        printf("%d ", h_c[i]);
    }
    printf("\n");

    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}

Best Practices:
• Minimize Data Transfer: Data transfer between host and device is slow. Minimize the frequency and
size of these transfers.
• Use Shared Memory: Leverage shared memory for data that needs to be frequently accessed by
multiple threads.
• Optimize Memory Access Patterns: Ensure coalesced memory access patterns for global memory to improve performance (see the sketch after this list).
• Free Unused Memory: Always free device memory when it’s no longer needed to avoid memory leaks.
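To illustrate the coalescing point above, the following sketch, not part of the original notes, contrasts a coalesced copy kernel, where consecutive threads access consecutive elements, with a strided one; the kernel names and the stride parameter are assumptions.

// Coalesced: thread i reads element i, so the threads of a warp touch consecutive
// addresses and the hardware combines them into a few wide memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

// Strided: thread i reads element i * stride, so the threads of a warp touch
// scattered addresses and many more memory transactions are required.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) {
        out[i] = in[i * stride];
    }
}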
