DS1822 - Parallel Computing - Unit 3
GPU Architectures – Data Parallelism – CUDA Basics – CUDA Program Structure – Threads, Blocks, Grids
– Memory Handling.
I. GPU Architectures
• GPU architecture focuses more on putting the available cores to work and less on low-latency cache memory access.
• In a generic many-core GPU, less chip area is devoted to control logic and caches, and a large number of transistors is devoted to parallel data processing. The following diagram shows the GPU architecture.
1) The GPU consists of multiple Processor Clusters (PCs).
2) Each Processor Cluster contains multiple Streaming Multiprocessors (SMs).
3) Each Streaming Multiprocessor (SM) has a number of Streaming Processors (SPs), also known as cores, that share control logic and an L1 (level 1) instruction cache.
4) Each Streaming Multiprocessor (SM) uses a dedicated L1 (level 1) cache and a shared L2 (level 2) cache before pulling data from global memory, i.e. Graphics Double Data Rate (GDDR) DRAM.
5) The number of Streaming Multiprocessors (SMs) and the number of cores per SM vary with the targeted price and market segment of the GPU (the device-query sketch after this list shows how to read these values at run time).
6) The global memory of the GPU consists of multiple GBs of DRAM. The growing size of global memory allows data to be kept on the device longer, thereby reducing transfers to and from the CPU.
7) GPU architecture is tolerant of memory latency; high bandwidth makes up for long memory latency.
8) Compared to a CPU, a GPU works with fewer and smaller cache layers, because more of its transistors are dedicated to computation and it relies less on caches when retrieving data from memory.
9) The memory bus is optimized for bandwidth, allowing it to serve a large number of ALUs simultaneously.
10) GPU architecture is optimized for data-parallel throughput computations.
11) To execute tasks in parallel, tasks are scheduled at the Processor Cluster (PC) or Streaming Multiprocessor (SM) level.
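A minimal sketch (using the standard CUDA Runtime API; the exact fields printed are illustrative) of how the SM count and global memory size of the installed GPU can be queried at run time:
#include <cuda_runtime.h>
#include <stdio.h>
int main() {
    // Query the properties of device 0 (the first GPU in the system)
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device name                     : %s\n", prop.name);
    printf("Streaming Multiprocessors (SMs) : %d\n", prop.multiProcessorCount);
    printf("Global memory (GB)              : %.1f\n", prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("L2 cache size (KB)              : %d\n", prop.l2CacheSize / 1024);
    return 0;
}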
II. Data Parallelism
• Data parallelism is a key concept in parallel computing: a dataset is divided into smaller chunks, and the same operation is performed concurrently on each chunk. This approach leverages the ability to process multiple data elements simultaneously, significantly speeding up computations.
• CPUs, in contrast, use task parallelism, wherein:
a. Multiple tasks map to multiple threads, and different tasks run different instructions.
b. Threads are generally heavyweight.
c. Programming is done for the individual thread.
• Modern applications process large amounts of data, which incurs significant execution time on sequential computers.
• Data parallelism is used to advantage in applications such as image processing, computer graphics, and linear algebra libraries (e.g. matrix multiplication).
CUDA Hardware:
• CUDA (Compute Unified Device Architecture) is a scalable parallel computing platform and programming model for general-purpose computing on GPUs.
• CUDA was introduced by NVIDIA in 2006.
• CUDA is a data-parallel extension to the C/C++ languages and an API model for parallel programming.
• The CUDA parallel programming model has three key abstractions (a brief kernel sketch after this list shows how each appears in code):
o (1) a hierarchy of thread groups
o (2) shared memories, and
o (3) barrier synchronization.
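A minimal sketch (the kernel name and array size are illustrative, not from the text) showing how the three abstractions appear in a kernel: the thread hierarchy through threadIdx/blockDim, shared memory through the __shared__ qualifier, and barrier synchronization through __syncthreads():
#define N 64  // number of elements, equal to the block size in this sketch
// Reverse an array within a single thread block using shared memory
__global__ void reverse_in_block(int *d_in, int *d_out) {
    __shared__ int tile[N];       // shared memory: visible to all threads of the block
    int t = threadIdx.x;          // thread hierarchy: each thread handles one element
    tile[t] = d_in[t];            // stage the data in shared memory
    __syncthreads();              // barrier: wait until every thread has written its element
    d_out[t] = tile[N - 1 - t];   // safely read the element written by another thread
}
// Launched from host code as, for example: reverse_in_block<<<1, N>>>(d_in, d_out);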
• The programmer or compiler decomposes large computing problems into many small problems that
can be solved in parallel.
• Programs written using CUDA harness the power of the GPU and thereby increase computing performance.
• In GPU accelerated applications, the sequential part of the workload runs on the CPU (as it is optimized
for single threaded performance) and the compute intensive portion of the application runs on thousands
of GPU cores in parallel.
• Using CUDA, developers can utilize the power of GPUs to perform general computing tasks like
multiplying matrices and performing linear algebra operations (instead of just doing graphical
calculations).
• With CUDA, developers program in popular languages such as C, C++, Fortran, Python, and MATLAB (and through APIs such as DirectCompute), and express parallelism through extensions in the form of a few basic keywords.
• At a high level, a graphics card with a many-core GPU and high-speed graphics device memory sits inside a standard PC or server with one or two multicore CPUs.
• The GPU consists of multiple Streaming Multiprocessors (SMs), and each SM has a number of Streaming Processors (SPs), also known as cores. Each SM uses a dedicated L1 cache and a shared L2 cache. The following diagram shows a high-level overview of the GPU hardware.
CUDA Basics:
The major difference between a C and a CUDA implementation is the __global__ specifier and the <<<...>>> syntax. The __global__ specifier indicates a function that runs on the device (GPU). Such a function can be called from host code, e.g. from the main() function, and is known as a "kernel". When a kernel is called, its execution configuration is provided through the <<<...>>> syntax, e.g. cuda_hello<<<1,1>>>(). In CUDA terminology, this is called a "kernel launch".
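A minimal sketch of the hello-world program referred to above (the kernel name follows the cuda_hello<<<1,1>>>() call in the text; the message printed is an illustrative assumption):
#include <cuda_runtime.h>
#include <stdio.h>
// Kernel: runs on the device (GPU)
__global__ void cuda_hello() {
    printf("Hello World from GPU!\n");
}
int main() {
    // Kernel launch: a grid of 1 block containing 1 thread
    cuda_hello<<<1, 1>>>();
    // Wait for the kernel to finish before the program exits
    cudaDeviceSynchronize();
    return 0;
}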
Following is an example of vector addition implemented in C (./vector_add.c). The example computes the addition of two vectors stored in arrays a and b and puts the result in the array out; a sketch is shown below.
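A sketch of what vector_add.c might look like (the vector length N and the initial values are assumptions; the array names a, b, and out follow the text):
#include <stdlib.h>
#define N 10000000
// Sequential vector addition: out[i] = a[i] + b[i]
void vector_add(float *out, float *a, float *b, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
int main() {
    float *a   = (float*)malloc(sizeof(float) * N);
    float *b   = (float*)malloc(sizeof(float) * N);
    float *out = (float*)malloc(sizeof(float) * N);
    // Initialize the input vectors
    for (int i = 0; i < N; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }
    vector_add(out, a, b, N);
    free(a); free(b); free(out);
    return 0;
}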
• We will convert vector_add.c to the CUDA program vector_add.cu, using the hello-world program above as a guide.
• In CUDA terminology, CPU memory is called host memory and GPU memory is called device
memory. Pointers to CPU and GPU memory are called host pointer and device pointer, respectively.
• For data to be accessible by the GPU, it must be present in device memory. CUDA provides APIs for allocating device memory and for transferring data between host and device memory.
• Following is the common workflow of CUDA programs:
1. Allocate host memory and initialize host data
2. Allocate device memory
3. Transfer input data from host to device memory
4. Execute kernels
5. Transfer output from device memory to host memory
• CUDA provides several functions for allocating device memory. The most common ones are cudaMalloc() and cudaFree().
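Their declarations in the CUDA Runtime API are:
cudaError_t cudaMalloc(void **devPtr, size_t size);  // allocates size bytes of device memory; devPtr receives the device pointer
cudaError_t cudaFree(void *devPtr);                  // frees device memory previously allocated with cudaMalloc()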
Memory transfer:
• Transferring data between host and device memory is done through the cudaMemcpy function, which is similar to memcpy in C.
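Its declaration in the CUDA Runtime API is:
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind);
// kind specifies the direction of the copy, e.g. cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost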
Putting these pieces together, the structure of a typical CUDA program is:
1. Initialization:
o Initialize the input data in host memory.
2. Memory Allocation:
o Allocate device memory with cudaMalloc() and copy the input data to it with cudaMemcpy().
3. Kernel Launch:
o Define Kernel Function: Write the function that will run on the GPU.
o Configure Execution Parameters: Determine the grid and block size for execution.
4. Synchronization:
o Synchronize: Ensure all threads have completed execution before proceeding.
5. Data Transfer and Memory Cleanup:
o Transfer Data Back: Copy results from the device back to the host.
o Free Memory: Release device memory with cudaFree() to prevent leaks.
#include <cuda_runtime.h>
#include <iostream>
// Kernel function to be executed on the GPU
__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x;            // each thread handles one array element
    c[index] = a[index] + b[index];
}

int main() {
    // Initialize host data
    const int arraySize = 5;
    int hostA[arraySize] = {1, 2, 3, 4, 5};
    int hostB[arraySize] = {10, 20, 30, 40, 50};
    int hostC[arraySize];

    // Allocate device memory
    int *deviceA, *deviceB, *deviceC;
    cudaMalloc((void**)&deviceA, arraySize * sizeof(int));
    cudaMalloc((void**)&deviceB, arraySize * sizeof(int));
    cudaMalloc((void**)&deviceC, arraySize * sizeof(int));

    // Copy data from host to device
    cudaMemcpy(deviceA, hostA, arraySize * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, arraySize * sizeof(int), cudaMemcpyHostToDevice);

    // Launch kernel with one block of arraySize threads
    add<<<1, arraySize>>>(deviceA, deviceB, deviceC);

    // Copy result from device to host (this call also waits for the kernel to finish)
    cudaMemcpy(hostC, deviceC, arraySize * sizeof(int), cudaMemcpyDeviceToHost);

    // Print result
    std::cout << "Result: ";
    for (int i = 0; i < arraySize; i++) {
        std::cout << hostC[i] << " ";
    }
    std::cout << std::endl;

    // Free device memory
    cudaFree(deviceA);
    cudaFree(deviceB);
    cudaFree(deviceC);
    return 0;
}
Explanation:
• Kernel Function: The add function runs on the GPU and performs addition on arrays.
• Memory Allocation: Memory is allocated on both the host and device.
• Data Transfer: Data is copied between the host and device.
• Kernel Launch: The kernel is launched with execution parameters.
• Synchronization: The device-to-host cudaMemcpy waits until all threads of the kernel have completed before copying the results back to the host.
• Memory Cleanup: Allocated memory is freed to prevent leaks.
CUDA provides a flexible and powerful way to harness the GPU’s capabilities, enabling significant
performance improvements for parallelizable tasks.
• The CUDA programming model provides an abstraction of the GPU architecture that acts as a bridge between an application and its possible implementation on GPU hardware.
• This section outlines the main concepts of the CUDA programming model and how they are exposed in general-purpose programming languages like C/C++.
• Two keywords are used widely in the CUDA programming model: host and device.
• The host is the CPU available in the system.
• The system memory associated with the CPU is called host memory.
• The GPU is called the device, and GPU memory is likewise called device memory.
• A typical sequence of operations in a CUDA program is:
o Copy the input data from host memory to device memory, also known as a host-to-device transfer.
o Load the GPU program and execute it, caching data on-chip for performance.
o Copy the results from device memory to host memory, also called a device-to-host transfer.
CUDA kernel and thread hierarchy:
Figure 1 shows that a CUDA kernel is a function that gets executed on the GPU. The parallel portion of your application is executed K times in parallel by K different CUDA threads, as opposed to only once like a regular C/C++ function.
• Every CUDA kernel starts with a __global__ declaration specifier. Programmers give each thread a unique global ID by using built-in variables (see the index sketch after this list).
• A group of threads is called a CUDA block.
• CUDA blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads (Figure 2).
• Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in the GPU (except during preemption, debugging, or CUDA dynamic parallelism).
• One SM can run several concurrent CUDA blocks, depending on the resources the blocks need.
• Each kernel is executed on one device, and CUDA supports running multiple kernels on a device at the same time. Figure 3 shows the kernel execution and its mapping onto the hardware resources available in the GPU.
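A minimal sketch (the kernel name and launch sizes are illustrative) showing how the built-in variables threadIdx, blockIdx, and blockDim combine into a unique global thread ID, and how the grid and block dimensions are chosen at launch time:
// Each thread computes one element of y = a*x + y (a simple data-parallel kernel)
__global__ void saxpy(int n, float a, float *x, float *y) {
    // Unique global thread ID: block offset plus thread position within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {   // guard: the last block may contain more threads than remaining elements
        y[i] = a * x[i] + y[i];
    }
}
// Host-side launch (x_d and y_d are assumed to be device pointers to n floats):
//     int blockSize = 256;                              // threads per CUDA block
//     int gridSize  = (n + blockSize - 1) / blockSize;  // enough blocks to cover n elements
//     saxpy<<<gridSize, blockSize>>>(n, 2.0f, x_d, y_d);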
Memory Handling:
• Memory handling in CUDA is a crucial aspect of achieving optimal performance. Proper management of memory can significantly impact the efficiency and speed of CUDA programs. The key concepts and types of memory in CUDA are described below.
Types of Memory:
1. Global Memory:
o The largest but slowest memory on the device; accessible by all threads and used for data transfer between the host (CPU) and the device (GPU).
2. Shared Memory:
o Fast on-chip memory shared by the threads of a block; ideal for data that needs to be frequently accessed by multiple threads within a block.
3. Registers:
o The fastest memory; private to each thread and used for its local variables.
4. Constant Memory:
o Read-only memory cached on chip; optimized for broadcast operations (when many threads read the same value).
5. Texture Memory:
o Read-only memory accessed through the texture cache; optimized for access patterns with 2D spatial locality.
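A brief sketch (all names are illustrative) of how these memory spaces appear in device code:
// Constant memory: declared at file scope, written by the host with cudaMemcpyToSymbol()
__constant__ float coeffs[16];
__global__ void memory_spaces_demo(const float *input, float *output) {
    // Shared memory: one tile per block, visible to all threads in the block
    __shared__ float tile[256];
    // Registers: ordinary local variables live in per-thread registers
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = input[i];                // read from global memory
    __syncthreads();
    output[i] = tile[threadIdx.x] * coeffs[0];   // combine shared and constant memory
}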
Memory Management Steps:
1. Allocate Memory on the Device:
• Use cudaMalloc to allocate memory on the GPU.
int *d_array;
cudaMalloc((void**)&d_array, size);
2. Copy Data from Host to Device:
• Use cudaMemcpy to transfer data from the CPU to the GPU.
cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice);
3. Kernel Execution:
• Perform computations on the GPU using the allocated memory.
kernel<<<gridSize, blockSize>>>(d_array);
4. Copy Data from Device to Host:
• Use cudaMemcpy to transfer data back from the GPU to the CPU.
cudaMemcpy(h_array, d_array, size, cudaMemcpyDeviceToHost);
5. Free Memory on the Device:
• Use cudaFree to deallocate memory on the GPU.
cudaFree(d_array);
Example Code:
#include <cuda_runtime.h>
#include <stdio.h>
// Kernel: element-wise vector addition (one thread per element)
__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
int main() {
    const int arraySize = 5;
    const int size = arraySize * sizeof(int);
    int h_a[arraySize] = {1, 2, 3, 4, 5};
    int h_b[arraySize] = {10, 20, 30, 40, 50};
    int h_c[arraySize];
    // Allocate device memory
    int *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, size);
    cudaMalloc((void**)&d_b, size);
    cudaMalloc((void**)&d_c, size);
    // Copy input data from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
    // Launch kernel
    add<<<1, arraySize>>>(d_a, d_b, d_c);
    // Copy result from device to host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
    // Print result
    printf("Result: ");
    for (int i = 0; i < arraySize; i++) printf("%d ", h_c[i]);
    printf("\n");
    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
Best Practices:
• Minimize Data Transfer: Data transfer between host and device is slow. Minimize the frequency and
size of these transfers.
• Use Shared Memory: Leverage shared memory for data that needs to be frequently accessed by
multiple threads.
• Optimize Memory Access Patterns: Ensure coalesced memory access patterns for global memory to improve performance (see the sketch after this list).
• Free Unused Memory: Always free device memory when it’s no longer needed to avoid memory leaks.
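A minimal sketch (kernel names are illustrative) contrasting a coalesced access pattern with a strided one:
// Coalesced: consecutive threads read consecutive addresses, so the accesses
// of a warp are combined into a small number of memory transactions.
__global__ void coalesced_copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}
// Strided: consecutive threads read addresses that are 'stride' elements apart,
// so each access may require its own memory transaction and bandwidth is wasted.
__global__ void strided_copy(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}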