
Image Processing on NVIDIA GPU

A Major Project Report

Submitted in partial fulfillment of the requirements

For the Degree of

Bachelor of Technology
In
ELECTRONICS & COMMUNICATION ENGINEERING

By

Jaymeen Aseem (09BEC003)


Shivang Ghetia (09BEC019)

Under the Guidance of

Prof. N.P.Gajjar

Department of Electrical Engineering


Electronics & Communication Engineering Program
Institute of Technology, Nirma University
Ahmedabad-382 481
May 2013

Certificate

This is to certify that the Major Project Report entitled Image Processing on
NVIDIA GPU, submitted by Aseem Jaymeen Bharatkumar (09BEC003) and Ghetia
Shivang Arvindbhai (09BEC019) in partial fulfillment of the requirements for the
award of the degree of Bachelor of Technology in Electronics & Communication
Engineering, Institute of Technology, Nirma University, Ahmedabad, is the record of
work carried out by them under our supervision and guidance. The work submitted
has, in our opinion, reached a level required for being accepted for examination.

Date: Place: Ahmedabad

Prof. N.P.Gajjar
Sr. Assoc. Professor,
Elect. & Comm. Engineering,
Institute of Technology,
Nirma University, Ahmedabad.

Prof. (Dr.) D.K.Kothari
Section Head,
Dept. of Elect. & Comm. Engineering,
Institute of Technology,
Nirma University, Ahmedabad.

Prof. (Dr.) P.N.Tekwani
Head of Department,
Dept. of Electrical Engineering,
Institute of Technology,
Nirma University, Ahmedabad.

Acknowledgement

We would like to thank our professors for their support, encouragement and guidance
during this major project. It is not possible for us to name and thank them all
individually, but we must make special mention of some of them and acknowledge our
sincere indebtedness to them.

We express deep and sincere gratitude to our guide Prof. N.P.Gajjar for his constant
encouragement, valuable guidance and constructive suggestions during all the stages
of the project work. We are deeply indebted to Prof. Vijay Savani for his valuable
suggestions and support.

We are also thankful to all the faculty members of the Department of Electronics &
Communication Engineering and to our colleagues for providing the necessary
guidance throughout the term, which helped greatly in the course of our project
work. We are also thankful to the authors whose works we have consulted and quoted
in this work. We gratefully acknowledge Mr. Prasann Shukla for providing us full
laboratory support at the PG-High Performance Computing Lab.

Jaymeen B. Aseem (09BEC003)


Shivang A. Ghetia (09BEC019)

Abstract

The aim of this project is to explore the potential performance improvements that
can be gained through the use of the CUDA architecture and GPU processing
techniques for image processing. There are various algorithms available for
processing an image, and various platforms on which image processing can be carried
out. These algorithms include image compression, noise removal, deblurring, edge
detection etc. MATLAB is commonly used as a tool for processing images, but for
this project we use a Graphics Processing Unit (GPU) provided by NVIDIA, having
CUDA support. CUDA, the Compute Unified Device Architecture, is used mainly for
parallel processing and high performance computing. With the processing capabilities
of the CUDA architecture (software) and the GPU (hardware), great improvement
can be seen in image processing techniques.

In this project, certain image processing algorithms have been implemented on the
GPU with the use of the CUDA architecture. This includes DFT and DCT
computation; IDFT and IDCT computation have also been implemented. Thereafter,
the DCT is applied to each and every pixel of the image. Quantization can be
implemented at various levels. Then, the image is regenerated by applying the IDCT.
The execution time has been measured, the algorithm has been applied to various
images, and the results are compared. The CPU version of the algorithm has also
been compared with the project results. The observations and comparisons show
that device-based image processing yields better results with less processing time.
This can be very useful in processing very high resolution images.

Motivation

There are various platforms available for image processing; MATLAB is a commonly
used tool. These platforms have their own advantages and disadvantages, and new
techniques have been invented to overcome the limitations of some programming
languages. NVIDIA CUDA is such a technique. It is an architecture based on parallel
processing. Combined with the computational power of the GPU, the parallel
processing capabilities of CUDA make NVIDIA graphics processing devices very
suitable for complex image processing algorithms. These GPUs accelerate processing
techniques and possess many benefits over conventional programming tools. They
can process a high resolution image in just a fraction of a second, and they provide
solutions to many complex graphics-related problems.

The execution time taken by general processing tools is considerably larger for high
resolution images than for low resolution images, so for complex image processing
algorithms these tools are not adequate. GPUs provide an alternative solution by
applying the algorithm to each pixel and processing the pixels in parallel. Hence,
GPUs are very useful for image processing.

Outline

The chapters of this report are organized as follows:


Chapter 1 illustrates the project development environment, i.e. the hardware and
software needed to carry out the project work.

Chapter 2 explains the basic fundamentals of GPUs and CUDA. It explains the need
for GPUs and the parallel processing architecture CUDA in high performance
computing.

Chapter 3 gives examples of general purpose computing with GPU and data
processing codes.

Chapter 4 explains how to implement the DCT and DFT with CUDA C.

Chapter 5 explains the image processing workflow and algorithm, with project
results.

Chapter 6 concludes the project and discusses the future scope.


Contents

Certificate

Acknowledgement

Abstract

Motivation

Outline

List of Figures

List of Tables

Nomenclature

1 Project Development Environment

1.1 Development Environment
1.2 Hardware and Software Specifications
1.2.1 Hardware Specifications

2 Introduction to GPU and CUDA

2.1 Graphics Processing Unit
2.1.1 Introduction and Architecture
2.1.2 General Purpose Computing With GPU
2.2 CUDA Architecture
2.2.1 Need of Parallel Processing
2.2.2 Basics of CUDA Architecture

3 Data Processing Codes and Examples

3.1 Hello, World Program: Code-1
3.1.1 CPU (Host) Version
3.1.2 GPU (Device) Version
3.2 Addition Program: Code-2
3.3 CUDA Device Properties: Code-3
3.4 Array Addition: Code-4

4 DFT and DCT Implementation

4.1 Discrete Fourier Transform (DFT)
4.2 Discrete Cosine Transform (DCT)

5 Image Processing Algorithm

5.1 Image Processing
5.2 Image Reading with MATLAB
5.3 Image Processing Code
5.4 Image Regeneration
5.5 PSNR Calculation

6 Conclusion and Future Scope

6.1 Conclusion
6.2 Future Scope

References
List of Figures

2.1 Graphics Processing Unit
2.2 GPU cores
2.3 Hardware architecture of GPU
2.4 GPU vs CPU architecture
2.5 Chip Design
2.6 Processing Capabilities of GPUs and Bandwidth Comparison
2.7 GPU Computing Model
2.8 Getting performance in two ways
2.9 SIMD Architecture
2.10 SIMD operation
2.11 NVIDIA CUDA
2.12 CUDA Hierarchy of threads, blocks and grids
2.13 Warp Scheduler
2.14 Streaming Multiprocessor

3.1 Processing Flow on CUDA
3.2 Output: Code-1
3.3 Output: Code-2
3.4 Output: Code-3
3.5 Output: Code-4(1)
3.6 Output: Code-4(2)
3.7 Output: Code-4(3)

4.1 Output: 4-Point DFT
4.2 Time Plot for DFT Code
4.3 Output of 2D DFT code
4.4 DCT on image
4.5 DCT on CPU
4.6 DCT on GPU
4.7 DCT and IDCT code output
4.8 Convert2N kernel
4.9 ToDCT kernel
4.10 Transpose kernel
4.11 IDCT kernel

5.1 Image Processing Workflow
5.2 Output of image processing code
5.3 Original values
5.4 Execution time for 1024x768 image
5.5 Original Image and Regenerated Image: 128x128
5.6 Original Image and Regenerated Image: 176x144
5.7 Original Image and Regenerated Image: 256x256
5.8 Original Image and Regenerated Image: 512x512
5.9 Original Image and Regenerated Image: 800x600
5.10 Original Image and Regenerated Image: 1024x768
List of Tables

4.1 DCT on CPU
4.2 DCT on GPU

5.1 Execution time for different images
5.2 PSNR Values for Different Images
Nomenclature

Abbreviations

GPU Graphics Processing Unit


CPU Central Processing Unit
GPGPU General Purpose Computing with GPU
CUDA Compute Unified Device Architecture
SM Streaming Multiprocessor
SIMD Single Instruction Multiple Data
ALU Arithmetic Logical Unit
FPU Floating Point Unit
SFU Special Function Unit
DFT Discrete Fourier Transform
DCT Discrete Cosine Transform
IDCT Inverse Discrete Cosine Transform
FFT Fast Fourier Transform
JPEG Joint Photographic Experts Group
C2C Complex to Complex
R2C Real to Complex
C2R Complex to Real
PSNR Peak Signal-to-Noise Ratio
bin Binary File
Symbols
µ Micro
β Beta
G Giga
M Mega
m Milli

Chapter 1

Project Development Environment

1.1 Development Environment

For Image Processing on NVIDIA GPU, we need to set up an environment in which
we can develop the necessary code. The prerequisites for developing code in
CUDA C are as follows:

I . A CUDA-enabled Graphics Processor

II . An NVIDIA Device Driver

III . A CUDA Development Toolkit

IV . A standard C Compiler

For this project, we have used an NVIDIA GeForce GTX 480 GPU. For the code
development, Microsoft Visual Studio 2010 has been used.


1.2 Hardware and Software Specifications

1.2.1 Hardware Specifications

The GPU specifications are as follows:

Device : NVIDIA GeForce GTX 480

Total Global Memory : 1.6 GB

Processor Clock : 1215 MHz

CUDA cores : 480

Memory Clock : 1674 MHz

Memory Bandwidth : 133.9 GB/s

Multiprocessor Count : 15

Maximum Thread Dimensions : (1024, 1024, 64)

Maximum Grid Dimensions : (65535, 65535, 65535)

CPU and software specifications are as follows:

Processor : Intel(R) Core i5-2400

Clock Rate : 3.1 GHz

RAM : 8 GB

OS : 64-bit operating system

MATLAB version : version 2010

Microsoft Visual Studio Version : version 2010 professional

CUDA version : CUDA 4.2

NVIDIA Visual Profiler version : version 2008


Chapter 2

Introduction to GPU and CUDA

2.1 Graphics Processing Unit

2.1.1 Introduction and Architecture

A graphics processing unit, commonly known as a GPU, is a specialized electronic
device designed to rapidly manipulate and alter memory to accelerate graphics-related
processing. The GPU is also known as a visual processing unit (VPU). GPUs were
originally designed for graphics processing; they are used for enhancing the visual
information on the output display. They differ from the central processing unit (CPU)
in terms of architecture and processing power. GPUs are used in embedded systems,
mobile phones, personal computers, workstations and game consoles [13].

The term GPU was popularized by NVIDIA Corporation in 1999. NVIDIA made
the GeForce 256 and marketed it as the world's first GPU. Graphics processing units
are manufactured mainly by NVIDIA and ATI. ATI calls its GPUs VPUs and released
the Radeon 9700 in 2002 [13]. Modern GPUs are very efficient at manipulating
computer graphics. Their highly parallel structure makes them more effective than
general-purpose CPUs when processing high resolution images. A GPU may be
available as a graphics card in many computers, or embedded on the motherboard.


Modern GPUs have the architecture and hardware to do the calculations related
to 3D computer graphics. They were initially used to accelerate the memory-intensive
work of texture mapping and rendering polygons. Thereafter, units were added to
accelerate geometric calculations in different coordinate systems [13]. Most of these
calculations need matrix and vector operations, and GPUs along with the CUDA
architecture provide strong support for these operations through rich predefined
functions and libraries. GPUs are used for image processing, video decoding and
stream processing.

A graphics processing unit is shown in figure 2.1 [10]; the GPU shown is the
NVIDIA GeForce GTX 280. Normal CPUs have either a single core or multiple
cores, but the number of cores in a CPU is small compared to a GPU. CPUs are used
for general computing, while GPUs are used for high performance computing, so
GPUs need the support of parallel processing. This is achieved with many cores.
Figure 2.2 compares the number of cores in CPUs and GPUs [9].

Figure 2.1: Graphics Processing Unit



Figure 2.2: GPU cores

The architecture of the GPU is shown in figure 2.3 [9]. It differs from the CPU
architecture: the GPU devotes more transistors to data processing. The GPU has a
scalable array of streaming multiprocessors and multiple memory spaces. On-chip
memory includes shared memory and registers; off-chip memory includes global
memory. A PCI interface is used for interfacing with the CPU [10].

Figure 2.3: Hardware architecture of GPU



The difference between the GPU and CPU architectures is shown in figure 2.4 [9].
Figure 2.5 shows the chip design of a GPU [9].

Figure 2.4: GPU vs CPU architecture

Figure 2.5: Chip Design

As discussed earlier, GPUs are designed to accelerate graphics operations and are
extensively used for high performance computing. So, the GPU is hardware specially
designed for highly parallel applications like graphics. The processing capabilities of
GPUs can be seen from the graph shown in figure 2.6 [9]. These processing capabilities
require large memory bandwidth, so GPUs come with high bandwidth; this comparison
can also be seen in figure 2.6. GPUs use many cores and a huge number of transistors,
so they need an adequate cooling mechanism.

Figure 2.6: Processing Capabilities of GPUs and Bandwidth Comparison

General purpose computing with GPU is discussed in the next section.



2.1.2 General Purpose Computing With GPU

GPUs can also be used for general purpose computing because of their massive
parallel processing capabilities. This is known as general purpose computing with
GPU, or GPGPU.

Earlier, GPU computing was limited to graphics processing. After the introduction
of CUDA in NVIDIA GPUs, the programming capabilities were enhanced, and GPUs
are now also used for general computing. Large data handling, complex algorithms,
matrix-related operations: all such processing and programming have been simplified
by GPUs.

The model for GPU computing is to use a CPU and GPU together in a heterogeneous
co-processing computing model. The sequential part of the application runs on the
CPU, while the computationally intensive part is accelerated by the GPU; the
application runs faster because it uses the high performance of the GPU to boost
performance. GPGPU takes the benefit of the parallel architecture of the GPU,
which has hundreds of processor cores that operate together. In this programming
model, the application developer modifies the application to take the compute-intensive
kernels and map them to the GPU; the rest of the application remains on the CPU.
Mapping a function to the GPU involves rewriting the function to expose its
parallelism and adding C keywords to move data to and from the GPU. The developer
is tasked with launching tens of thousands of threads simultaneously [2]. The GPU
hardware manages the threads and does the thread scheduling.

The GPGPU computing model is shown in figure 2.7. It shows the hardware and
software parts of the computing process. The NVIDIA GPU with the CUDA
architecture is the hardware, while the CUDA development environment is the
software, which can be C, C++ or another programming interface such as OpenCL,
DirectX, OpenGL or Python. Using these two parts, many applications can be
designed.

Figure 2.7: GPU Computing Model

GPGPU includes complex matrix operations, vector operations, floating point
scientific applications etc. For all these applications, the GPU provides an easier
way to perform the manipulations. With CPU programming, these operations require
huge execution time and become tedious and complex. With GPU programming, an
array of very large length, or a matrix of very large size, can easily be added to or
multiplied with another array or matrix. So, GPUs are not only useful for graphics
operations; they are also useful in general purpose computing.

2.2 CUDA Architecture

2.2.1 Need of Parallel Processing

CUDA is the architecture that supports parallel processing, which is an important
part of high performance computing. To increase the throughput of computer
processing, there are two ways: either increase the processor speed or implement
parallel execution. Since processor speed can be increased only up to a limit, the
second choice is obviously the better one. Today, software engineers and developers
need to cope with a variety of parallel computing platforms and technologies in order
to provide novel and rich experiences for an increasingly sophisticated base of users
[2]. Multicore programming and parallel processing have marked an evolution in the
computing market.

The difference between increasing the processor clock rate and increasing the number
of processors can be seen from figure 2.8 given below:

Figure 2.8: Getting performance in two ways



As seen from the above figure, one can conclude that parallel processing is more
beneficial. The two important parameters for determining performance are latency
and throughput. Latency is the time to complete a single task, while throughput is
the number of tasks completed in a fixed time; for example, a device that finishes
each individual task slowly but runs a thousand of them concurrently can still have
very high throughput. GPUs are designed for maximum throughput. The GPU core
is the stream processor, and stream processors are grouped into streaming
multiprocessors. A stream processor is basically a SIMD (Single Instruction, Multiple
Data) processor [9]. The SIMD architecture is shown in figure 2.9, and SM operation
is shown in figure 2.10.

Figure 2.9: SIMD Architecture

Figure 2.10: SIMD operation



2.2.2 Basics of CUDA Architecture

CUDA Architecture

Figure 2.11: NVIDIA CUDA

Compute Unified Device Architecture (CUDA) is a parallel computing platform and
programming model created by NVIDIA. It was first implemented in the GeForce
8800 GPU; now all NVIDIA GPUs come with the CUDA architecture. CUDA gives
developers access to the virtual instruction set and memory of the parallel
computational elements in CUDA GPUs [10]. Using CUDA, the latest NVIDIA GPUs
become accessible for computation like CPUs. Unlike CPUs, however, GPUs have a
parallel throughput architecture that emphasizes executing many concurrent threads
slowly, rather than executing a single thread very quickly.

A CUDA program basically calls efficiently written parallel kernels. A kernel is a
function that executes on the GPU, in parallel across a set of parallel threads. The
programmer or compiler organizes these threads into thread blocks and grids of
thread blocks. The GPU instantiates a kernel program on a grid of parallel thread
blocks; each thread within a thread block executes an instance of the kernel, and
each thread has a thread ID within its thread block, a program counter, registers,
per-thread private memory, inputs, and output results. By using this indexing of
blocks and threads, one can keep track of each and every instance of work.
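As an illustrative sketch of this indexing (the kernel name kernel2d and its parameters
are invented for the example, not taken from this report), the block and thread IDs
can be combined into 2D pixel coordinates when working on an image:

// Sketch: block and thread IDs combined into 2D coordinates.
__global__ void kernel2d(float *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index in the grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index in the grid
    if (x < width && y < height)
        img[y * width + x] += 1.0f;                 // one pixel per thread
}

// Host side: a 2D grid of 2D blocks covering a width x height image.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// kernel2d<<<grid, block>>>(dev_img, width, height);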

Concept of Thread, Block and Grid

A thread block is a set of concurrently executing threads that can work coherently
among themselves through synchronization and shared memory (at L1/L2-cache
speed). A thread block has a block ID within its grid, and each block has its own
dimensions; a block's dimension is simply the number of threads within that block.
CUDA provides built-in variables through which one can directly access the index
and dimensions of a block [1].

A grid is an array of thread blocks that execute the same kernel, read inputs from
global memory, write results to global memory, and synchronize between dependent
kernel calls. It is also very important to keep all threads within a grid synchronized.
In the CUDA parallel programming model, each thread has a per-thread private
memory space used for register spills, function calls, and C automatic array variables
[4]. Each thread block has a per-block shared memory space used for inter-thread
communication, data sharing, and result sharing in parallel algorithms; threads of
the same block share this memory. Grids of thread blocks share results in the global
memory space after kernel-wide global synchronization [12]. The CUDA hierarchy is
shown in figure 2.12 [2].
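A minimal sketch of such per-block cooperation through shared memory follows
(illustrative only; it assumes the data length is a multiple of the block size of 256):

// Reverse each 256-element segment in place using per-block shared memory.
__global__ void reverseBlock(int *d)
{
    __shared__ int s[256];            // per-block shared memory
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    s[t] = d[i];                      // stage this block's segment
    __syncthreads();                  // barrier: whole block finishes writing first
    d[i] = s[blockDim.x - 1 - t];     // read data written by other threads
}
// Launched, for example, as reverseBlock<<<n / 256, 256>>>(dev)
// with n a multiple of 256.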

Figure 2.12: CUDA Hierarchy of threads, blocks and grids

Warp Scheduler

The SM (Streaming Multiprocessor) schedules threads in groups of 32 parallel threads
called warps [2]. Each SM features two warp schedulers and two instruction dispatch
units (the concept is similar to parallel pipelining for instruction dispatch), allowing
two warps to be issued and executed concurrently [9]. The dual warp scheduler
selects two warps, and issues one instruction from each warp to a group of sixteen
cores, sixteen load/store units, or four SFUs (Special Function Units). Because warps
execute independently, Fermi's scheduler does not need to check for dependencies
within the instruction stream. Using this elegant model of dual issue, Fermi achieves
near-peak hardware performance. A schematic view of the warp scheduler is shown
in figure 2.13 [9].

Most instructions can be dual-issued: two integer instructions, two floating-point
instructions, or a mix of integer, floating point, load, store, and SFU instructions can
be issued concurrently. Double precision instructions do not support dual dispatch
with any other operation.

Figure 2.13: Warp Scheduler

One of the key architectural innovations that efficiently improved both the
programmability (making it easier for programmers to obtain speedups) and the
performance of GPU applications is on-chip shared memory. Shared memory enables
threads within the same thread block to cooperate, facilitates extensive reuse of
on-chip data, and greatly reduces off-chip traffic. All blocks in a grid share the
common global memory. Shared memory is a key enabler for many high-performance
CUDA applications.

Streaming Multiprocessor

Each SM features 32 CUDA processors, a fourfold increase over prior SM designs.
Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and
floating point unit (FPU). The Fermi architecture implements the fused multiply-add
(FMA) instruction for both single and double precision arithmetic. FMA improves
over a multiply-add (MAD) instruction by doing the multiplication and addition with
a single final rounding step, with no loss of precision in the addition; FMA is therefore
more accurate than performing the operations separately.
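As a small illustration (not from this report), CUDA device code exposes the fused
operation through the standard fmaf() intrinsic; note that nvcc contracts a*b+c into
an FMA by default, controlled by its --fmad option:

// Sketch: separate multiply-add versus fused multiply-add.
__global__ void fmaDemo(float a, float b, float c, float *out)
{
    out[0] = a * b + c;       // multiply rounds, then the add rounds (two roundings,
                              // unless the compiler contracts it into an FMA)
    out[1] = fmaf(a, b, c);   // fused multiply-add: one final rounding step
}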

Each SM has 16 load/store units, allowing source and destination addresses to be
calculated for sixteen threads per clock. Supporting units load and store the data at
each address to cache or DRAM. Special Function Units (SFUs) execute transcendental
instructions such as sine, cosine, reciprocal, and square root. Each SFU executes one
instruction per thread, per clock; a warp executes over eight clocks. The SFU pipeline
is decoupled from the dispatch unit, allowing the dispatch unit to issue to other
execution units while the SFU is occupied. A schematic view of the SM is shown in
figure 2.14 [9].

Some of the applications of CUDA are as follows [1]:

Medical Imaging

Computational Fluid Dynamics

Image Processing

Matrix Manipulations

High Performance Computing

Environmental Science

Figure 2.14: Streaming Multiprocessor

Video Decoding

3D Graphics

Gaming
Chapter 3

Data Processing Codes and Examples

The code processing flow on CUDA is shown in figure 3.1.

Figure 3.1: Processing Flow on CUDA


3.1 Hello, World Program: Code-1

3.1.1 CPU (Host) Version

#include <stdio.h>

int main(void)
{
    printf("Hello, World!\n");
    return 0;
}

3.1.2 GPU (Device) Version

#include <stdio.h>

__global__ void kernel(void)
{
}

int main(void)
{
    kernel<<<1, 1>>>();   // launch an empty kernel on the device
    printf("Hello, World!\n");
    return 0;
}

The first listing runs on the host (CPU), while the second runs on the device (GPU)
[1], [15]. The output is shown in figure 3.2; in the run captured there, the string
"World" was replaced by "Good Morning".

Figure 3.2: Output: Code-1

3.2 Addition Program: Code-2

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

__global__ void add(int a, int b, int *c)
{
    *c = a + b;    // runs on the device; the result lands in device memory
}

int main(void)
{
    int c;
    int a;
    int b;
    int *dev_c;
    cudaMalloc((void**)&dev_c, sizeof(int));

    printf("Enter two values:\n");

    printf("1st Value:");
    scanf("%d", &a);

    printf("2nd Value:");
    scanf("%d", &b);
    add<<<1,1>>>(a, b, dev_c);

    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("Result:\n");
    printf("%d + %d = %d\n", a, b, c);
    cudaFree(dev_c);

    system("PAUSE");
    return 0;
}

Code-2 demonstrates the addition of two integer numbers [1], [15]. The output of
this code is shown in figure 3.3.

Figure 3.3: Output: Code-2

3.3 CUDA Device Properties: Code-3

Code-3 queries the GPU device properties. Using the structure cudaDeviceProp, we
can query the properties of the GPU; knowing these properties tells us the limitations
and the programming capabilities of the device. The output of this code is shown in
figure 3.4.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

int main(void)
{
    cudaDeviceProp prop;
    int count;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++)
    {
        cudaGetDeviceProperties(&prop, i);
        printf("General Information for device %d\n", i);
        printf("Name: %s\n", prop.name);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Clock rate: %d\n", prop.clockRate);
        printf("Device copy overlap: ");
        if (prop.deviceOverlap)
            printf("Enabled\n");
        else
            printf("Disabled\n");
        printf("Kernel execution timeout: ");
        if (prop.kernelExecTimeoutEnabled)
            printf("Enabled\n");
        else
            printf("Disabled\n");
        printf("Memory Information for device %d\n", i);
        printf("Total global mem: %ld\n", prop.totalGlobalMem);
        printf("Total constant mem: %ld\n", prop.totalConstMem);
        printf("Max mem pitch: %ld\n", prop.memPitch);
        printf("Texture alignment: %ld\n", prop.textureAlignment);
        printf("MP Information for device %d\n", i);
        printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
        printf("Shared mem per mp: %ld\n", prop.sharedMemPerBlock);
        printf("Registers per mp: %d\n", prop.regsPerBlock);
        printf("Threads in warp: %d\n", prop.warpSize);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max thread dimensions: (%d, %d, %d)\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max grid dimensions: (%d, %d, %d)\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    }
    system("PAUSE");
    return 0;
}

Figure 3.4: Output: Code-3

3.4 Array Addition: Code-4

Code-4 demonstrates long-array addition. Here, two arrays of length 1024*16 are
added element-wise into a third array. This addition takes very little time with the
GPU implementation. The output of the code is shown in figures 3.5, 3.6 and 3.7.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <conio.h>

#define size 1024*16

__global__ void addKernel(int *c, int *a, int *b)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // global element index
    c[i] = a[i] + b[i];
}

int main()
{
    int *a, *b, *c;

    a = (int*)malloc(sizeof(int)*size);
    if (!a)
    {
        printf("A allocation error");
        getch();
        return 1;
    }

    b = (int*)malloc(sizeof(int)*size);
    if (!b)
    {
        printf("B allocation error");
        getch();
        return 1;
    }

    c = (int*)malloc(sizeof(int)*size);
    if (!c)
    {
        printf("C allocation error");
        getch();
        return 1;
    }

    for (int i = 0; i < size; i++)
    {
        a[i] = 1;
        b[i] = 2;
    }

    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaMalloc((void**)&dev_c, size * sizeof(int));
    cudaMalloc((void**)&dev_a, size * sizeof(int));
    cudaMalloc((void**)&dev_b, size * sizeof(int));
    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);

    addKernel<<<16,1024>>>(dev_c, dev_a, dev_b);   // 16 blocks x 1024 threads = size

    cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    for (int i = 0; i < size; i++)
    {
        printf("%d ", c[i]);
    }
    return 0;
}

Figure 3.5: Output: Code-4(1)



Figure 3.6: Output: Code-4(2)

Figure 3.7: Output: Code-4(3)


Chapter 4

DFT and DCT Implementation

4.1 Discrete Fourier Transform (DFT)

The Discrete Fourier Transform (DFT), or Fast Fourier Transform (FFT), can be
implemented on the GPU with the use of libraries available in CUDA. CUDA supports
the FFT and DFT through the CUFFT library. With the functions available in the
CUFFT library, we can easily apply the DFT to an input data sequence, and to an
image as well. This requires knowledge of certain function definitions and their input
parameters.

CUFFT library supports three types of Fourier Transforms:

Complex-to-Complex (C2C)

Real-to-Complex (R2C)

Complex-to-Real (C2R)

The C2C Fourier transform is generally used to carry out the FFT and DFT. In a
C2C transform, both the input and output sequences are complex. In an R2C
transform, the input sequence is real but the output sequence is complex. Likewise,
C2R has a complex input sequence and a real output sequence. The CUFFT library
provides 1D, 2D and 3D Fourier transforms, with single-precision (32-bit floating
point) as well as double-precision (64-bit floating point) operations [3].

The FFT is a divide-and-conquer algorithm for efficiently computing Discrete Fourier
Transforms of complex or real-valued data sets. It is one of the most important and
widely used numerical algorithms in computational physics and general signal
processing [11]. The CUFFT library provides a simple interface for computing parallel
FFTs on an NVIDIA GPU, which allows users to quickly leverage the floating-point
power and parallelism of the GPU in a highly optimized and tested FFT library [3].

The most important functions related to DFT programming are as follows [3]:

cufftPlan1d()

cufftDestroy()

cufftExecC2C()

cufftExecR2C()

cufftExecC2R()

cufftPlan1d() : Creates a 1D FFT plan configuration for a specified signal size and
data type. The batch input parameter tells CUFFT how many 1D transforms to
configure. Similarly, 2D and 3D transforms can be configured.
definition: cufftPlan1d(cufftHandle *plan, int nx, cufftType type, int batch)
Here, plan is a pointer to a cufftHandle object. The NX parameter gives the transform
size (e.g. 8 for an 8-point DFT). The type parameter tells the library the type of
transform (R2C, C2R or C2C). BATCH indicates the number of transforms of size
NX. The output of this function is the plan object value.
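For instance (an illustrative call, not taken from this project's code), a single plan
can describe four independent 8-point complex-to-complex transforms:

cufftHandle plan;
// nx = 8, type = CUFFT_C2C, batch = 4: one cufftExecC2C() call on this plan
// will then transform all four 8-point sequences stored back to back.
cufftPlan1d(&plan, 8, CUFFT_C2C, 4);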

cufftDestroy() : Frees all GPU resources associated with a CUFFT plan and destroys
the internal plan data structure. This function should be called once a plan is no
longer needed, to avoid wasting GPU memory.
definition: cufftDestroy(cufftHandle plan)
The input parameter is the plan that is to be destroyed.

cufftExecC2C() : Executes a single-precision complex-to-complex transform plan in
the direction specified by the direction parameter. CUFFT uses the GPU memory
pointed to by the idata parameter as input data, and stores the Fourier coefficients
in the odata array. If idata and odata are the same, this method does an in-place
transform.
definition : cufftExecC2C(cufftHandle plan, cufftComplex *idata, cufftComplex
*odata, int direction)
The direction parameter specifies whether the transform is in the forward or the
inverse direction. Similarly, cufftExecR2C and cufftExecC2R can be used; in these
functions the direction parameter is absent.

The data types available in CUFFT libraries are as follows:

cufftHandle

cufftComplex

cufftReal

cufftDoubleReal

cufftDoubleComplex

The code given below illustrates the 4-point DFT of a real input sequence. It also
applies the inverse DFT to recover the input sequence. Its output is shown in
figure 4.1.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <conio.h>
#include <cufft.h>

#define NX 4
#define BATCH 1

int main(void)
{
    cufftHandle plan;
    int i;
    cufftComplex *device;
    cufftComplex host[NX*BATCH];
    cufftComplex temp[NX*BATCH];

    cudaMalloc((void **)&device, sizeof(cufftComplex)*NX*BATCH);

    printf("Welcome to FFT C2C Program\n");

    printf("INPUT DATA:\n");

    host[0].x = 1;
    host[1].x = 2;
    host[2].x = 0;
    host[3].x = 1;

    host[0].y = 0;
    host[1].y = 0;
    host[2].y = 0;
    host[3].y = 0;

    for (i = 0; i < 4; i++)
    {
        printf("%f + i%f\n", host[i].x, host[i].y);
    }

    cudaMemcpy(device, host, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyHostToDevice);
    if (cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
    }

    if (cufftExecC2C(plan, device, device, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Forward failed\n");
    }

    cudaMemcpy(host, device, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyDeviceToHost);

    printf("OUTPUT DATA:\n");

    for (i = 0; i < 4; i++)
    {
        printf("%f + i%f\n", host[i].x, host[i].y);
    }

    if (cufftExecC2C(plan, device, device, CUFFT_INVERSE) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Inverse failed\n");
    }

    cudaMemcpy(temp, device, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyDeviceToHost);

    printf("INVERSE FFT:\n");

    for (i = 0; i < 4; i++)
    {
        // CUFFT's inverse transform is unnormalized, so divide by NX
        printf("%f + i%f\n", temp[i].x/NX, temp[i].y/NX);
    }
    cufftDestroy(plan);
    cudaFree(device);   // host and temp are stack arrays; only the device buffer is freed
    system("PAUSE");
    return 0;
}

Figure 4.1: Output: 4-Point DFT

The time plot and kernel execution timing details are shown in figure 4.2.

In the DFT code we do not write any kernels; the CUFFT library itself contains the
kernels for the execution of the DFT and IDFT, and uses two important kernels for
the cufftPlan1d and cufftExecC2C functions. According to the time plot, the kernel
time is 2.23% of the total GPU time; the time axis (Y-axis) is in units of milliseconds.
The kernel vectorRadix2 takes the maximum time [14], but the overall computation
time is quite small. When the DFT is calculated for a large number of points, CUDA
C programming provides better performance.

Figure 4.2: Time Plot for DFT Code

Similarly, the 2D DFT can be calculated and applied to a 2D matrix or to an image.
Since an image is 2D in nature, the 2D DFT and 2D DCT are the more suitable
transforms. The DFT is applied along both the length and the width: the NX
parameter gives the number of points in the x-direction, while the NY parameter
gives the number of points in the y-direction.

The code given below illustrates 2D DFT. It contains the core part only.
#define NX 8
#define NY 8

int main(void)
{
    cufftHandle plan;
    cufftComplex *device;
    cufftComplex host[NX][NY];
    cufftComplex temp[NX][NY];

    cudaMalloc((void **)&device, sizeof(cufftComplex)*NX*NY);
    cudaMemcpy(device, host, sizeof(cufftComplex)*NX*NY, cudaMemcpyHostToDevice);

    if (cufftPlan2d(&plan, NX, NY, CUFFT_C2C) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
    }

    if (cufftExecC2C(plan, device, device, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Forward failed\n");
    }

    cudaMemcpy(host, device, sizeof(cufftComplex)*NX*NY, cudaMemcpyDeviceToHost);
    if (cufftExecC2C(plan, device, device, CUFFT_INVERSE) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Inverse failed\n");
    }

    cudaMemcpy(temp, device, sizeof(cufftComplex)*NX*NY, cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
}

The output of this code is shown in figure 4.3. The input matrix is of order 3x3 and
its values are: [1 2 1; 1 3 1; 1 4 1].

Figure 4.3: Output of 2D DFT code



4.2 Discrete Cosine Transform (DCT)

The discrete cosine transform (DCT) and discrete sine transform (DST) are members
of a family of sinusoidal unitary transforms. They are real, orthogonal, and separable,
with fast algorithms for their computation, and they have great relevance to data
compression.

The DCT is a Fourier-related transform similar to the discrete Fourier transform
(DFT), but using only real numbers. DCTs are equivalent to DFTs of roughly twice
the length, operating on real data with even symmetry. The obvious distinction
between a DCT and a DFT is that the former uses only cosine functions, while the
latter uses both cosines and sines (in the form of complex exponentials) [5].

Compared with DFT, DCT has two main advantages [5]:

It is a real transform, with better computational efficiency than the DFT, which
by definition is a complex transform.

It does not introduce discontinuity while imposing periodicity on the time signal.
In the DFT, as the time signal is truncated and assumed periodic, discontinuity is
introduced in the time domain and corresponding artifacts are introduced in the
frequency domain. But as even symmetry is assumed while truncating the time
signal, no discontinuity and related artifacts are introduced in the DCT [7].

The DCT has four different definitions; DCT-II is generally used for data compression.
The importance of DCT-II is further accentuated by its:

Superiority in bandwidth compression (redundancy reduction) of a wide range
of signals.
Powerful performance in the bit-rate reduction.

Existence of fast algorithms for its implementation.
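For reference, the standard unnormalized N-point DCT-II of a sequence x[n], which
matches the form computed by the CPU code later in this chapter, is:

C[k] \;=\; 2\sum_{n=0}^{N-1} x[n]\,\cos\!\left(\frac{\pi k\,(2n+1)}{2N}\right),
\qquad k = 0, 1, \ldots, N-1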

So, the DCT is preferred over the DFT for image processing algorithms. When the
DCT is applied to an image, it introduces no loss to the source image samples [7].
The image pixels have integer brightness values ranging from 0 to 255. These are real
numbers, and the DCT is suitable for real numbers. When the DCT is applied, the
first coefficient is the DC coefficient, while all others are AC. The DC component has
the highest value, and the DCT coefficient values decrease from the low-order
coefficients to the higher ones. The high frequency coefficients can be neglected since
their values are small. The application of the DCT to an image is shown in figure 4.4.

Figure 4.4: DCT on image

The DCT algorithm can also be implemented using the CUFFT library in CUDA.
The problem is that direct DCT computation functions are not available in the
CUFFT library, so the DCT must be derived from a DFT computation. For an
N-point DCT computation, a 2N-point DFT computation is required; similarly, for
an N-point IDCT computation, a 2N-point IDFT computation is required.

The algorithm for computing the DCT from the DFT is given below [6]:

1 . Take any N-point input signal, say x[n].

2 . Form the 2N-point signal y[n] from x[n] as follows:
y[n] = x[n] for 0 ≤ n ≤ N-1,
y[n] = x[2N-1-n] for N ≤ n ≤ 2N-1.

3 . Calculate Y[k], the 2N-point DFT of y[n].

4 . For 0 ≤ k ≤ N-1,
C[k] = Re{ W_{2N}^{k/2} · Y[k] }, where W_{2N} = e^{-j2π/(2N)}.

C[k] denotes the N-point DCT of the N-point input x[n]. The algorithm for computing
the IDCT from the IDFT is given below [6]:

1 . Construct Y[k] from the N-point DCT C[k] as follows:
Y[k] = W_{2N}^{-k/2} · C[k] for 0 ≤ k ≤ N-1;
Y[k] = 0 for k = N;
Y[k] = -W_{2N}^{-k/2} · C[2N-k] for N+1 ≤ k ≤ 2N-1.

2 . Calculate y[n], the 2N-point inverse DFT of Y[k].

3 . For 0 ≤ n ≤ N-1, let x[n] = y[n].



Here, x[n] is the N-point IDCT of C[k]. The CPU (host) code for the DCT-IDCT
computation is given below:

#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <time.h>
#include <stdlib.h>

#define length 10240
#define PI 3.1428

int main()
{
    clock_t start, stop;
    start = clock();
    float data[length], xdct[length] = {0}, idct[2*length] = {0};

    for (int i = 0; i < length; i++)
    {
        data[i] = i;
    }
    for (int i = 0; i < length; i++)
    {
        printf("element in %d is: %f\n", i, data[i]);
    }
    // N-point DCT-II computed directly as a double loop: O(N^2)
    for (int i = 0; i < length; i++)
    {
        for (int j = 0; j < length; j++)
        {
            xdct[i] = xdct[i] + 2*data[j]*cos(PI*i*(2*j+1)/2/length);
        }
    }
    // Inverse DCT
    for (int i = 0; i < 2*length; i++)
    {
        for (int j = 1; j < length; j++)
        {
            idct[i] = idct[i] + xdct[j]*cos(PI*j*(2*i+1)/2/length)/length;
        }
        idct[i] = idct[i] + xdct[0]/2/length;
    }
    stop = clock();
    printf("Execution Time in milliseconds (DCT on CPU): %f\n", (double)(stop-start));
    system("PAUSE");
    return 0;
}

The output of the above code is shown in figure 4.5. It gives an idea of the execution
time that the CPU takes to calculate the DCT for a large number of input points.
For a small number of points, the CPU takes very little time, but for a large number
of points such as 10240, it takes about 11.5 seconds. The GPU version of this code,
on the other hand, takes only about 0.174 seconds, as shown in figure 4.6.

Figure 4.5: DCT on CPU

When this code runs on the GPU, it takes less execution time for a large number of
points. The GPU version contains kernels for the DFT computation, the IDFT
computation, twiddle-factor multiplication, 2N-point conversion, etc. The comparison
between the CPU and GPU versions of the code is given in tables 4.1 and 4.2.

Table 4.1: DCT on CPU

DCT on CPU
The length of the code is comparatively smaller.
It executes entirely on the CPU; there is no use of CUDA programming.
Execution time is comparatively less for a small number of points (e.g. 100).
Execution time is comparatively more for a large number of points (e.g. 100000).
Not suitable for the image processing algorithm.
Regenerated values are less accurate.

Table 4.2: DCT on GPU

DCT on GPU
The length of the code is comparatively larger.
It executes through different kernel calls and takes the benefits of CUDA programming.
Execution time is comparatively more for a small number of points (e.g. 100).
Execution time is comparatively less for a large number of points (e.g. 100000).
Highly suitable for the image processing algorithm.
Regenerated values are very accurate compared to the CPU-generated values, due to the SFUs.

Figure 4.6: DCT on GPU

The DCT code for GPU is shown below.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <conio.h>
#include <math.h>
#include <cufft.h>

#define NX 4
#define PI 3.14285714
#define BATCH 1

int main(void)
{
    int i;
    cufftComplex host1[NX*BATCH];
    cufftComplex host3[NX*BATCH];
    cufftComplex host2[NX*2*BATCH];
    cufftHandle plan;
    cufftComplex *device;
    cufftComplex host[NX*2*BATCH];
    cudaMalloc((void **)&device, sizeof(cufftComplex)*NX*2*BATCH);
    printf("N-point DCT computation using 2N-point DFT\n");
    printf("Enter the input data:\n");
    for (i = 0; i < NX; i++)
    {
        printf("Enter value of host[%d].x: ", i);
        scanf("%f", &host1[i].x);
        host1[i].y = 0;
    }

    // Build the symmetric 2N-point sequence y[n] from the N-point input
    for (i = 0; i < NX; i++)
    {
        host[i].x = host1[i].x;
        host[i].y = 0;
    }

    for (i = NX; i < 2*NX; i++)
    {
        host[i] = host1[2*NX-1-i];
        host[i].y = 0;
    }

    cudaMemcpy(device, host, sizeof(cufftComplex)*NX*2*BATCH, cudaMemcpyHostToDevice);
    if (cufftPlan1d(&plan, 2*NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
    }

    if (cufftExecC2C(plan, device, device, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Forward failed\n");
    }

    cudaMemcpy(host, device, sizeof(cufftComplex)*NX*2*BATCH, cudaMemcpyDeviceToHost);
    printf("OUTPUT DATA:\n");
    for (i = 0; i < 2*NX; i++)
    {
        printf("%f + i%f\n", host[i].x, host[i].y);
    }

    // Twiddle-factor multiplication: C[k] = Re{ W^(k/2) * Y[k] }
    for (i = 0; i < NX; i++)
    {
        host[i].x = host[i].x * cos(PI*i/2/NX) + host[i].y * sin(PI*i/2/NX);
        host[i].y = 0;
    }

    printf("OUTPUT DCT\n");

    for (i = 0; i < NX; i++)
    {
        printf("%f ", host[i].x);
    }

    // Reconstruct the 2N-point spectrum Y[k] from the DCT coefficients
    for (i = 0; i < NX; i++)
    {
        host2[i].x = host[i].x * cos(PI*i/2/NX);
        host2[i].y = host[i].x * sin(PI*i/2/NX);
    }
    if (i == NX)    // i equals NX after the loop above; Y[N] = 0
    {
        host2[i].x = 0;
        host2[i].y = 0;
    }
    for (i = NX+1; i < 2*NX; i++)
    {
        host2[i].x = -host[2*NX-i].x * cos(PI*i/2/NX);
        host2[i].y = -host[2*NX-i].x * sin(PI*i/2/NX);
    }
    cudaMemcpy(device, host2, sizeof(cufftComplex)*NX*2*BATCH, cudaMemcpyHostToDevice);

    if (cufftPlan1d(&plan, 2*NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
    }
    if (cufftExecC2C(plan, device, device, CUFFT_INVERSE) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Inverse failed\n");
    }
    cudaMemcpy(host3, device, sizeof(cufftComplex)*NX*2*BATCH, cudaMemcpyDeviceToHost);
    printf("INVERSE DCT OUTPUT:\n");
    for (i = 0; i < NX; i++)
    {
        printf("%f ", host3[i].x/2/NX);
    }
    system("PAUSE");
    return 0;
}

This code uses the CUFFT library and the DFT-to-DCT conversion algorithm. Its
output is shown in figure 4.7 for the input data [1 2 3 4]. Still, this code is not
completely implemented on the GPU, because it uses for loops for the twiddle-factor
multiplication and the 2N-point conversion; kernels are called only for the DFT
computation. An all-kernel version is therefore given below. Here, only the kernel
definitions are shown; the complete code for image processing is illustrated in
chapter 5.

Figure 4.7: DCT and IDCT code output

__global__ void convert2N(cufftComplex *fromhostdata, cufftComplex *gpu2Ndata)
{
    unsigned long int i = blockIdx.x * (blockDim.x*2) + threadIdx.x;
    unsigned long int j = blockIdx.x * (blockDim.x) + threadIdx.x;
    unsigned long int p = blockIdx.x * (blockDim.x);
    unsigned long int k = threadIdx.x;
    gpu2Ndata[i].x = fromhostdata[j].x;             // first half: copy of the input
    gpu2Ndata[NX+i].x = fromhostdata[p+NX-k-1].x;   // second half: mirrored input
    gpu2Ndata[i].y = 0;
    gpu2Ndata[NX+i].y = 0;
}

Figure 4.8: Convert2N kernel

The Convert2N kernel is used to form the 2N-point sequence from the N-point input
sequence, as shown in figure 4.8. The twiddle multiplication kernel is given below.

__global__ void twiddle(cufftComplex *gpu2Ndata)
{
    unsigned long int i = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned long int k = threadIdx.x;
    // C[k] = Re{ W^(k/2) * Y[k] }: combine the real and imaginary parts
    gpu2Ndata[i].x = gpu2Ndata[i].x * cos(PI*k/2/NX) + gpu2Ndata[i].y * sin(PI*k/2/NX);
    gpu2Ndata[i].y = 0;
}

The ToDCT kernel is illustrated in figure 4.9 and in the code given below. It is used
to collect the transformed data row-wise.

__global__ void todct(cufftComplex *xdct, cufftComplex *data2N)
{
    unsigned long int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    unsigned long int j = blockIdx.x*(blockDim.x) + threadIdx.x;
    xdct[j].x = data2N[i].x;   // keep the first N points of each 2N-point block
    xdct[j].y = 0;
}

Figure 4.9: ToDCT kernel

Figure 4.10: Transpose kernel

__global__ void transpose(cufftComplex *data, cufftComplex *ct)
{
    unsigned long int id = threadIdx.x;
    for (int i = 0; i < BATCH; i++)
    {
        ct[id*BATCH + i] = data[i*NX + id];   // swap row and column indices
    }
}
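The transpose-based approach rests on the separability of the 2D DCT: writing R(·)
for the 1D DCT applied to every row of a matrix, the 2D transform can be computed
as (a standard identity, stated here for clarity)

\mathrm{DCT2D}(X) \;=\; R\bigl(R(X)^{T}\bigr)^{T}

that is, 1D DCTs along the rows, a transpose, 1D DCTs along the former columns,
and a final transpose back.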

The transpose kernel is explained in figure 4.10. It is central to the 2D DCT
implementation, which obtains the 2D DCT from 1D DCTs by this transpose method.
Finally, the kernel for the IDCT computation is shown in figure 4.11 and in the code
given below:

Figure 4.11: IDCT kernel

__global__ void idct2N(cufftComplex *dctdata, cufftComplex *idct2Ndata)
{
    unsigned long int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    unsigned long int j = blockIdx.x*(blockDim.x) + threadIdx.x;
    unsigned long int k = threadIdx.x;
    unsigned long int q = blockIdx.x*(blockDim.x*2);
    unsigned long int p = blockIdx.x*(blockDim.x);
    idct2Ndata[i].x = dctdata[j].x*cos(PI*(i+q)/2/NX);
    idct2Ndata[i].y = dctdata[j].x*sin(PI*(i+q)/2/NX);
    if (k == 0)
    {
        idct2Ndata[NX+q].x = 0;   // the Nth spectral point of each block is zero
        idct2Ndata[NX+q].y = 0;
    }
    else
    {
        idct2Ndata[NX+i].x = -dctdata[p+NX-k].x*cos(PI*(q+NX+i)/2/NX);
        idct2Ndata[NX+i].y = -dctdata[p+NX-k].x*sin(PI*(q+NX+i)/2/NX);
    }
}

These kernels are called for the DCT implementation on a signal as well as on an
image. The application of this algorithm to image processing is explained in the next
chapter.
Chapter 5

Image Processing Algorithm

5.1 Image Processing

Image processing is any form of signal processing for which the input is an image.
It generally refers to the processing of a 2D picture by a computer. The output of
image processing may be either an image or a set of parameters related to the image.
There are different types of image processing: digital image processing, analog image
processing and optical image processing. Image processing refers to the application
of different algorithms to images.

There are different algorithms available to process the image. Some of them are listed
below:

Image Enhancement

Image Restoration

Image Compression


De-Blurring

De-Noising

Edge Detection

Image Smoothing

Convolution Operation

In this project, the algorithm has been applied to grayscale images; only the red
component of each image is taken. The same algorithm can be applied to color
images, in which the red, green and blue components are considered separately.

5.2 Image Reading with MATLAB

To read an image in CUDA C, it must be available in BIN format. Using the
MATLAB code given below, an image can be converted into a .bin file containing
the pixel values. This file is then read in CUDA C through FILE operations.

c = imread('Koala.jpg');            % read the image
r = c(:,:,1);                       % keep the red component only
t = r(:);                           % flatten to a column vector
p = t;
fid = fopen('koala-bin.bin','w');
fwrite(fid,p);                      % write the raw pixel values
fclose(fid);

The image processing workflow is illustrated in figure 5.1.

Figure 5.1: Image Processing Workflow

5.3 Image Processing Code

The image processing code is illustrated below. It does not repeat the kernel
definitions already discussed in chapter 4; the additional normalization kernels are
included.

#define NX 768
#define BATCH 1024
#define PI 3.14285

unsigned long int i;

const unsigned long int buffer_size = 786432;   // 1024 x 768 pixels

FILE *source;
unsigned long int count = 0;
unsigned long int written = 0;

__global__ void devideNX(cufftComplex *xdct)
{
    unsigned long int id = threadIdx.x + blockIdx.x*blockDim.x;
    xdct[id].x = xdct[id].x/2/NX;      // normalize the unscaled inverse transform
}

__global__ void devideBATCH(cufftComplex *xdct)
{
    unsigned long int id = threadIdx.x + blockIdx.x*blockDim.x;
    xdct[id].x = xdct[id].x/2/BATCH;
}

int main(void)
{
    cudaError_t cudaStatus;
    cudaDeviceReset();
    cufftComplex host[NX*BATCH], host2N[2*NX*BATCH], temp[NX*BATCH];
    cufftHandle plan2N;
    cufftHandle plan2N2;
    clock_t start, stop, start1, stop1;

    if (cufftPlan1d(&plan2N, 2*NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
        getch();
    }

    if (cufftPlan1d(&plan2N2, 2*BATCH, CUFFT_C2C, NX) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
        getch();
    }

    unsigned char buffer;

    source = fopen("img1024x768.bin", "r");
    unsigned long int i = 0, n = 0;
    while (i < buffer_size)
    {
        fseek(source, i, 0);
        n = fread(&buffer, 1, 1, source);
        host[i].x = buffer;     // the pixel value becomes the real part
        host[i].y = 0;
        i++;
    }
    printf("Reading is done.....\n");
    fclose(source);
    start = clock();

    cufftComplex *device1N, *device1Ncpy, *device2N, *devicecpy2, *ansdct, *ansidct;
    cudaMalloc((void **)&device1N, sizeof(cufftComplex)*NX*BATCH);
    cudaMalloc((void **)&device1Ncpy, sizeof(cufftComplex)*NX*BATCH);
    cudaMemcpy(device1N, host, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&device2N, sizeof(cufftComplex)*NX*2*BATCH);
    cudaMalloc((void **)&devicecpy2, sizeof(cufftComplex)*NX*2*BATCH);

    cudaMalloc((void **)&ansidct, sizeof(cufftComplex)*NX*BATCH);
    convert2N<<<BATCH,NX>>>(device1N, device2N);    // mirror each row to 2N points

    if (cufftExecC2C(plan2N, device2N, devicecpy2, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Forward failed\n");
        getch();
    }

    cudaMalloc((void **)&ansdct, sizeof(cufftComplex)*NX*BATCH);
    twiddle<<<(2*BATCH),NX>>>(devicecpy2);          // row-wise DCT from 2N-point DFT
    todct<<<BATCH,NX>>>(ansdct, devicecpy2);

    transpose<<<BATCH,NX>>>(ansdct, device1Ncpy);   // switch to the column direction

    convert2N2<<<NX,BATCH>>>(device1Ncpy, device2N);

    if (cufftExecC2C(plan2N2, device2N, device2N, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Forward failed\n");
        getch();
    }

    twiddle<<<(2*NX),BATCH>>>(device2N);            // column-wise DCT
    todct<<<NX,BATCH>>>(ansdct, device2N);

    idct2N2<<<NX,BATCH>>>(ansdct, devicecpy2);      // start the inverse path

    if (cufftExecC2C(plan2N2, devicecpy2, devicecpy2, CUFFT_INVERSE) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Inverse failed\n");
    }

    todct<<<NX,BATCH>>>(ansidct, devicecpy2);
    devideBATCH<<<NX,BATCH>>>(ansidct);

    transpose2<<<NX,BATCH>>>(ansidct, ansdct);

    idct2N<<<BATCH,NX>>>(ansdct, device2N);

    if (cufftExecC2C(plan2N, device2N, device2N, CUFFT_INVERSE) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Inverse failed\n");
    }

    todct<<<BATCH,NX>>>(ansidct, device2N);
    devideNX<<<BATCH,NX>>>(ansidct);
    stop = clock();
    printf("IDCT done\n");
    start1 = clock();
    cudaStatus = cudaMemcpy(host, ansidct, sizeof(cufftComplex)*NX*BATCH,
                            cudaMemcpyDeviceToHost);
    stop1 = clock();
    if (cudaStatus != cudaSuccess)
    {
        printf("cudaMemcpy failed\n");
        getch();
    }

    printf("INVERSE DCT OUTPUT:\n");

    for (i = 0; i < NX*BATCH; i++)
    {
        printf("IDCT in %d is %f\n", i, host[i].x);
    }

    source = fopen("img1024x768binIDCT.bin", "w");
    i = 0;
    while (i < buffer_size)
    {
        buffer = abs(host[i].x);    // regenerated pixel value
        fseek(source, i, 0);
        fwrite(&buffer, 1, 1, source);
        i++;
    }
    printf("writing is done.....\n");
    getch();
    fclose(source);

    printf("EXECUTION TIME in milliseconds (DCT on GPU) for %d * %d points : %f\n",
           NX, BATCH, (double)(stop-start));
    printf("Cudamemcpy in milliseconds (DCT on GPU) for %d * %d points : %f\n",
           NX, BATCH, (double)(stop1-start1));
    system("PAUSE");
    cufftDestroy(plan2N);
    cufftDestroy(plan2N2);
    cudaFree(device1N);
    cudaFree(ansdct);
    cudaFree(ansidct);
    cudaFree(device2N);
    cudaFree(devicecpy2);
    cudaDeviceReset();
    return 0;
}

The output of the above code is shown in figure 5.2. The original image data are
shown in figure 5.3; the regenerated values are nearly the same. The execution time
output is shown in figure 5.4.

Figure 5.2: Output of image processing code

Comparing the results, the regenerated pixel values are nearly the same as the
original pixel values. There is a slight difference in some of the values, which is
the cause of the pixel scattering seen in some high-resolution images. This
algorithm was then applied to different images.

Figure 5.3: Original values

Figure 5.4: Execution time for 1024x768 image



This algorithm has been applied to images of different sizes. The smallest
resolution taken is 128x128 and the largest is 1024x768. On the GTX 480 GPU used in
this project, the maximum number of threads per block is 1024, so the largest image
that can be processed by this implementation is limited to 1024 pixels in the
horizontal direction. The DCT is applied to every pixel, and the NX and BATCH
parameters are varied according to the image dimensions; the process and flow
remain the same for all images. The per-block thread limit can be confirmed at run
time, as shown in the sketch below.
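
A small sketch (not part of the project code) that queries the device limits
through the CUDA runtime API:

/* Query the GPU's per-block thread limit instead of hard-coding 1024 */
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                 /* device 0 */
printf("Max threads per block : %d\n", prop.maxThreadsPerBlock);
printf("Max block dim (x)     : %d\n", prop.maxThreadsDim[0]);

On the GTX 480 (compute capability 2.0), both values are 1024, which is what limits
the horizontal image size in this implementation.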

5.4 Image Regeneration

The regenerated .bin file produced by the CUDA code is read back in MATLAB and
reshaped into the image matrix. Here row and clm are the image height and width in
pixels (768 and 1024 for the 1024x768 image):

fid = fopen('img1024x768binIDCT.bin','r');
p = fread(fid);                      % one byte per pixel
fclose(fid);
row = 768;                           % image height in pixels
clm = 1024;                          % image width in pixels
for i = 1:clm
    for j = 1:row
        q(j,i) = p(j + (i-1)*row);   % rebuild the matrix column-wise
    end
end
imwrite(uint8(q),'img1024x768-regenerated.jpg');   % cast to uint8 before writing

The results of the image processing code are shown in figures 5.5, 5.6, 5.7, 5.8,
5.9 and 5.10.

Figure 5.5: Original Image and Regenerated Image:128x128

Figure 5.6: Original Image and Regenerated Image:176x144

Figure 5.7: Original Image and Regenerated Image:256x256



Figure 5.8: Original Image and Regenerated Image:512x512

Figure 5.9: Original Image and Regenerated Image:800x600



Figure 5.10: Original Image and Regenerated Image:1024x768

Table 5.1: Execution time for different images

Image Size Processing Time cudaMemcpy Time


128x128 Less than 1 ms Less than 1 ms
176x144 Less than 1 ms Less than 1 ms
256x256 Less than 1 ms 15 ms
512x512 Less than 1 ms 249 ms
800x600 15 ms 826 ms
1024x768 16 ms 1762 ms

The execution times for the complete processing of different images, and the
cudaMemcpy times (the time required to copy the final values from device to host),
are compared in table 5.1.
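
A side note on the measurements: clock() records elapsed host time, while CUDA
events give a device-side measurement in milliseconds. A minimal sketch (not part
of the project code, reusing host and ansidct from the listing in section 5.3) of
timing the final copy with events:

/* Time the device-to-host copy with CUDA events instead of clock() */
cudaEvent_t evStart, evStop;
float ms = 0.0f;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);
cudaEventRecord(evStart, 0);
cudaMemcpy(host, ansidct, sizeof(cufftComplex) * NX * BATCH,
           cudaMemcpyDeviceToHost);
cudaEventRecord(evStop, 0);
cudaEventSynchronize(evStop);                /* wait until the copy finishes */
cudaEventElapsedTime(&ms, evStart, evStop);  /* elapsed time in milliseconds */
printf("cudaMemcpy took %f ms", ms);
cudaEventDestroy(evStart);
cudaEventDestroy(evStop);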

5.5 PSNR Calculation

Peak Signal-to-Noise Ratio, often abbreviated PSNR, is an engineering term for the
ratio between the maximum possible power of a signal and the power of corrupting
noise that affects the fidelity of its representation. Because many signals have a
very wide dynamic range, PSNR is usually expressed on the logarithmic decibel
scale. PSNR is most commonly used to measure the quality of reconstruction of lossy
compression codecs (e.g., for image compression). The signal in this case is the
original data, and the noise is the error introduced by compression. When comparing
compression codecs, PSNR is an approximation to human perception of reconstruction
quality.
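
In equation form, for an original M x N image I1, its regenerated version I2 and
peak pixel value P:

    MSE  = (1 / (M*N)) * sum over i = 1..M, j = 1..N of [ I1(i,j) - I2(i,j) ]^2
    PSNR = 10 * log10( P^2 / MSE )  dB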

PSNR calculation code is given below:

I1 = double(imread('img1024x768-original.jpg'));
I2 = double(imread('img1024x768-regenerated.jpg'));
P = 255;                                  % peak value for 8-bit images
MSE = mean((I1(:) - I2(:)).^2);           % element-wise squared error
PSNR = 10*log10(P*P/MSE);

Here, P is the maximum possible pixel value of the image, which is 255 for 8-bit
data. The original and regenerated images are compared and the PSNR is calculated
in MATLAB. The PSNR values for different images are given in table 5.2.

Table 5.2: PSNR Values for Different Images

Image Size PSNR Value(in dB)


128x128 34.5625
176x144 35.6318
256x256 32.4320
512x512 36.4364
800x600 25.8664
1024x768 26.2030

PSNR values above 30 dB are generally considered acceptable. The low PSNR values
for the 800x600 and 1024x768 images are due to the black portions that appear in
the regenerated images; this arises because the GPU machine used is not powerful
enough for these sizes, and it can be addressed by using a more powerful GPU.
Chapter 6

Conclusion and Future Scope

6.1 Conclusion

As fast computation of massive data is a very important problem nowadays, parallel
processing using a Graphics Processing Unit can provide a solution. Image
processing is also a very wide field that offers solutions to many real-time
problems, but processing images on a CPU takes too much time and can make a system
non-real-time.

With the purpose of making image processing near real time, an algorithm named the
Discrete Cosine Transform has been successfully implemented on the GPU for image
processing. The output results obtained from the project are very promising and
much more effective compared to the results of CPU-processed data. For large
images, the results show a very impressive speed-up. The only problem noticed is a
little distortion of the image, present only in very large images; even so, the
regenerated image retains much of its information value. This project can further
be extended with the implementation of quantization, which will enable compression
of the data once a proper compression technique is applied.


Apart from this, there are many other image processing and filtering algorithms
that can be implemented on the GPU and that can eventually lead to a speed-up of
processing. The only bottleneck of processing with a GPU is the time consumed in
transferring data from CPU memory to GPU memory or vice versa; research is ongoing
to overcome this obstruction.
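
One widely used mitigation, although not applied in this project, is pinned
(page-locked) host memory, which lets host-device copies run as DMA transfers at
higher bandwidth. A minimal sketch, assuming the same NX, BATCH and device1N as in
the listing of section 5.3:

/* Allocate pinned (page-locked) host memory for faster PCIe transfers */
cufftComplex *pinnedHost;
cudaMallocHost((void **)&pinnedHost, sizeof(cufftComplex) * NX * BATCH);
/* ... fill pinnedHost with pixel data, then copy as usual ... */
cudaMemcpy(device1N, pinnedHost, sizeof(cufftComplex) * NX * BATCH,
           cudaMemcpyHostToDevice);
/* ... */
cudaFreeHost(pinnedHost);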

6.2 Future Scope

We have implemented the DCT and IDCT algorithms on the CUDA architecture and
applied the image processing algorithm to different images; the algorithm makes use
of the DFT, DCT and IDCT. Using these codes, one can process different images by
applying a compression algorithm: with the various quantization tables available,
it is easy to compress images, and one only needs to add the quantization logic to
the existing codes (a hedged sketch is given at the end of this section). The
algorithm can also be applied to images of various resolutions and sizes, and one
can compare the execution time needed to process an image on a CPU or another
platform against the GPU platform with the CUDA architecture. With this algorithm,
images can be compressed or processed with less execution time, which makes it
especially useful for large (high-resolution) images.
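
As a starting point, a hedged sketch of such quantization logic is given below. The
quantize kernel and the qtable array are hypothetical illustrations, not part of
the project code; qtable would hold one quantization step per coefficient (for
example, a JPEG-style 8x8 table tiled across the image), and the kernel could be
launched as quantize<<<BATCH, NX>>>(ansdct, qtable) right after the forward DCT:

/* Hypothetical sketch: quantize the DCT coefficients in place.
   qtable is a hypothetical array of NX*BATCH step sizes. Rounding each
   coefficient to a multiple of its step size is what discards information
   and makes later entropy coding effective. */
__global__ void quantize(cufftComplex *dct, const float *qtable)
{
    unsigned long int id = threadIdx.x + blockIdx.x * blockDim.x;
    dct[id].x = rintf(dct[id].x / qtable[id]) * qtable[id];
}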
