
Image Processing on NVIDIA GPU

A Major Project Report

Submitted in partial fulfillment of the requirements

For the Degree of

Bachelor of Technology
In
ELECTRONICS & COMMUNICATION ENGINEERING

By

Jaymeen Aseem (09BEC003)


Shivang Ghetia (09BEC019)

Under the Guidance of

Prof. N.P.Gajjar

Department of Electrical Engineering


Electronics & Communication Engineering Program
Institute of Technology, Nirma University
Ahmedabad-382 481
May 2013

Certificate

This is to certify that the Major Project Report entitled Image Processing on
NVIDIA GPU, submitted by Aseem Jaymeen Bharatkumar (09BEC003) and Ghetia
Shivang Arvindbhai (09BEC019) in partial fulfillment of the requirements for the
award of the degree of Bachelor of Technology in Electronics & Communication
Engineering, Institute of Technology, Nirma University, Ahmedabad, is the record of
work carried out by them under our supervision and guidance. The work submitted
has, in our opinion, reached a level required for being accepted for examination.

Date: Place: Ahmedabad

Prof. N.P.Gajjar
Sr. Assoc. Professor,
Elect. & Comm. Engineering,
Institute of Technology,
Nirma University, Ahmedabad.

Prof. (Dr.) D.K.Kothari
Section Head,
Dept. of Elect. & Comm. Engineering,
Institute of Technology,
Nirma University, Ahmedabad.

Prof. (Dr.) P.N.Tekwani
Head of Department,
Dept. of Electrical Engineering,
Institute of Technology,
Nirma University, Ahmedabad.

Acknowledgement

We would like to thank our professors for their support, encouragement and guidance
during this major project. It is not possible for us to name and thank them all
individually, but we must make special mention of some of them and acknowledge our
sincere indebtedness to them.

We express deep and sincere gratitude to our guide Prof. N.P.Gajjar for his constant
encouragement, valuable guidance and constructive suggestions during all the stages
of the project work. We are deeply indebted to Prof. Vijay Savani for his valuable
suggestions and support.

We are also thankful to all the faculty members of the Department of Electronics &
Communication Engineering and to our colleagues for providing the necessary
guidance throughout the term, which helped greatly in the course of our project
work. We are also thankful to the authors whose works we have consulted and quoted
in this work. We gratefully acknowledge Mr. Prasann Shukla for providing us full
laboratory support at the PG-High Performance Computing Lab.

Jaymeen B. Aseem (09BEC003)


Shivang A. Ghetia (09BEC019)

Abstract

The aim of this project is to explore the potential performance improvements that
can be gained through the use of the CUDA architecture and GPU processing
techniques for image processing. There are various algorithms available for
processing an image, and various platforms on which image processing can be carried
out. These algorithms include image compression, noise removal, deblurring, edge
detection etc. MATLAB is commonly used as a tool for processing images, but for
this project we use a Graphics Processing Unit (GPU) provided by NVIDIA, having
CUDA support. CUDA, the Compute Unified Device Architecture, is used mainly for
parallel processing and high performance computing. With the processing capabilities
of the CUDA architecture (software) and the GPU (hardware), great improvement
can be seen in image processing techniques.

In this project, certain image processing algorithms have been implemented on the
GPU with the use of the CUDA architecture. This includes DFT and DCT
computation; IDFT and IDCT computation have also been implemented. Thereafter,
the DCT is applied to each and every pixel of the image. Quantization can be
implemented at various levels. Then, the image is regenerated by applying the IDCT.
The execution time has been measured, the algorithm has been applied to various
images, and the results are compared. The CPU version of the algorithm has also
been compared with the project results. The observations and comparisons show
that device-based image processing yields better results with less processing time.
This can be very useful in processing very high resolution images.

Motivation

There are various platforms available for image processing; MATLAB is a commonly
used tool. These platforms have their own advantages and disadvantages, and new
techniques have been invented to overcome the limitations of some programming
languages. NVIDIA CUDA is such a technique. It is an architecture based on parallel
processing. Combined with the computational power of the GPU, the parallel
processing capabilities of CUDA make NVIDIA graphics processing devices very
suitable for complex image processing algorithms. These GPUs accelerate processing
techniques and possess many benefits over conventional programming tools. They
can process a high resolution image in just a fraction of a second, and they provide
solutions to many complex graphics-related problems.

The execution time taken by general processing tools is considerably larger for high
resolution images than for low resolution images, so for complex image processing
algorithms these tools are not adequate. GPUs provide an alternative solution by
applying the algorithm to each pixel and processing the pixels in parallel. Hence,
GPUs are very useful for image processing.

Outline

The chapters of this report are organized as follows:


Chapter 1 illustrates the project development environment, i.e. the hardware and
software needed to carry out the project work.

Chapter 2 explains the basic fundamentals of GPUs and CUDA. It explains the need
for GPUs and the parallel processing architecture CUDA in high performance
computing.

Chapter 3 gives examples of general purpose computing with GPU and data
processing codes.

Chapter 4 explains how to implement the DCT and DFT with CUDA C.

Chapter 5 explains the image processing workflow and algorithm, with project
results.

Chapter 6 concludes the project and discusses the future scope.


Contents

Certificate

Acknowledgement

Abstract

Motivation

Outline

List of Figures

List of Tables

Nomenclature

1 Project Development Environment

1.1 Development Environment
1.2 Hardware and Software Specifications
1.2.1 Hardware Specifications

2 Introduction to GPU and CUDA

2.1 Graphics Processing Unit
2.1.1 Introduction and Architecture
2.1.2 General Purpose Computing With GPU
2.2 CUDA Architecture
2.2.1 Need of Parallel Processing
2.2.2 Basics of CUDA Architecture

3 Data Processing Codes and Examples

3.1 Hello, World Program: Code-1
3.1.1 CPU (Host) Version
3.1.2 GPU (Device) Version
3.2 Addition Program: Code-2
3.3 CUDA Device Properties: Code-3
3.4 Array Addition: Code-4

4 DFT and DCT Implementation

4.1 Discrete Fourier Transform (DFT)
4.2 Discrete Cosine Transform (DCT)

5 Image Processing Algorithm

5.1 Image Processing
5.2 Image Reading with MATLAB
5.3 Image Processing Code
5.4 Image Regeneration
5.5 PSNR Calculation

6 Conclusion and Future Scope

6.1 Conclusion
6.2 Future Scope

References
List of Figures

2.1 Graphics Processing Unit
2.2 GPU cores
2.3 Hardware architecture of GPU
2.4 GPU vs CPU architecture
2.5 Chip Design
2.6 Processing Capabilities of GPUs and Bandwidth Comparison
2.7 GPU Computing Model
2.8 Getting performance in two ways
2.9 SIMD Architecture
2.10 SIMD operation
2.11 NVIDIA CUDA
2.12 CUDA Hierarchy of threads, blocks and grids
2.13 Warp Scheduler
2.14 Streaming Multiprocessor

3.1 Processing Flow on CUDA
3.2 Output: Code-1
3.3 Output: Code-2
3.4 Output: Code-3
3.5 Output: Code-4(1)
3.6 Output: Code-4(2)
3.7 Output: Code-4(3)

4.1 Output: 4-Point DFT
4.2 Time Plot for DFT Code
4.3 Output of 2D DFT code
4.4 DCT on image
4.5 DCT on CPU
4.6 DCT on GPU
4.7 DCT and IDCT code output
4.8 Convert2N kernel
4.9 ToDCT kernel
4.10 Transpose kernel
4.11 IDCT kernel

5.1 Image Processing Workflow
5.2 Output of image processing code
5.3 Original values
5.4 Execution time for 1024x768 image
5.5 Original Image and Regenerated Image: 128x128
5.6 Original Image and Regenerated Image: 176x144
5.7 Original Image and Regenerated Image: 256x256
5.8 Original Image and Regenerated Image: 512x512
5.9 Original Image and Regenerated Image: 800x600
5.10 Original Image and Regenerated Image: 1024x768
List of Tables

4.1 DCT on CPU
4.2 DCT on GPU

5.1 Execution time for different images
5.2 PSNR Values for Different Images
Nomenclature

Abbreviations

GPU Graphics Processing Unit


CPU Central Processing Unit
GPGPU General Purpose Computing with GPU
CUDA Compute Unified Device Architecture
SM Streaming Multiprocessor
SIMD Single Instruction Multiple Data
ALU Arithmetic Logical Unit
FPU Floating Point Unit
SFU Special Function Unit
DFT Discrete Fourier Transform
DCT Discrete Cosine Transform
IDCT Inverse Discrete Cosine Transform
FFT Fast Fourier Transform
JPEG Joint Photographic Experts Group
C2C Complex to Complex
R2C Real to Complex
C2R Complex to Real
PSNR Peak Signal-to-Noise Ratio
bin Binary File
Symbols
µ Micro
β Beta
G Giga
M Mega
m Milli

Chapter 1

Project Development Environment

1.1 Development Environment

For Image Processing on NVIDIA GPU, we need to set up an environment in which
we can develop the necessary code. The prerequisites for developing code in
CUDA C are as follows:

I . A CUDA-enabled Graphics Processor

II . An NVIDIA Device Driver

III . A CUDA Development Toolkit

IV . A standard C Compiler

For this project, we have used an NVIDIA GeForce GTX 480 GPU. For the code
development, Microsoft Visual Studio 2010 has been used.


1.2 Hardware and Software Specifications

1.2.1 Hardware Specifications

The GPU specifications are as follows:

Device : NVIDIA GeForce GTX 480

Total Global Memory : 1.6 GB

Processor Clock : 1215 MHz

CUDA cores : 480

Memory Clock : 1674 MHz

Memory Bandwidth : 133.9 GB/s

Multiprocessor Count : 15

Maximum Thread Dimensions : (1024, 1024, 64)

Maximum Grid Dimensions : (65535, 65535, 65535)

CPU and software specifications are as follows:

Processor : Intel(R) Core i5-2400

Clock Rate : 3.1 GHz

RAM : 8 GB

OS : 64-bit operating system

MATLAB version : version 2010

Microsoft Visual Studio Version : version 2010 professional

CUDA version : CUDA 4.2

NVIDIA Visual Profiler version : version 2008


Chapter 2

Introduction to GPU and CUDA

2.1 Graphics Processing Unit

2.1.1 Introduction and Architecture

A graphics processing unit, commonly known as a GPU, is a specialized electronic
device designed to rapidly manipulate and alter memory to accelerate graphics-related
processing. The GPU is also known as a visual processing unit (VPU). GPUs were
originally designed for graphics processing; they are used for enhancing the visual
information on the output display. They differ from the central processing unit (CPU)
in terms of architecture and processing power. GPUs are used in embedded systems,
mobile phones, personal computers, workstations and game consoles [13].

The term GPU was popularized by NVIDIA Corporation in 1999. NVIDIA made
the GeForce 256 and marketed it as the world's first GPU. Graphics processing units
are manufactured mainly by NVIDIA and ATI. ATI calls its GPUs VPUs and released
the Radeon 9700 in 2002 [13]. Modern GPUs are very efficient at manipulating
computer graphics. Their highly parallel structure makes them more effective than
general-purpose CPUs when processing high resolution images. A GPU may be
available as a graphics card in many computers, or embedded on the motherboard.


Modern GPUs have the architecture and hardware to do the calculations related
to 3D computer graphics. They were initially used to accelerate the memory-intensive
work of texture mapping and rendering polygons. Thereafter, units were added to
accelerate geometric calculations in different coordinate systems [13]. Most of these
calculations need matrix and vector operations, and GPUs along with the CUDA
architecture provide strong support for these operations through rich predefined
functions and libraries. GPUs are used for image processing, video decoding and
stream processing.

A graphics processing unit is shown in figure 2.1 [10]; the GPU shown is the
NVIDIA GeForce GTX 280. Normal CPUs have either a single core or multiple
cores, but the number of cores in a CPU is small compared to a GPU. CPUs are used
for general computing, while GPUs are used for high performance computing, so
GPUs need the support of parallel processing. This is achieved with many cores.
Figure 2.2 compares the number of cores in CPUs and GPUs [9].

Figure 2.1: Graphics Processing Unit



Figure 2.2: GPU cores

The architecture of the GPU is shown in figure 2.3 [9]. It differs from the CPU
architecture: the GPU devotes more transistors to data processing. The GPU has a
scalable array of streaming multiprocessors and multiple memory spaces. On-chip
memory includes shared memory and registers; off-chip memory includes global
memory. A PCI interface is used for interfacing with the CPU [10].

Figure 2.3: Hardware architecture of GPU



The difference between the GPU and CPU architectures is shown in figure 2.4 [9].
Figure 2.5 shows the chip design of a GPU [9].

Figure 2.4: GPU vs CPU architecture

Figure 2.5: Chip Design

As discussed earlier, GPUs are designed to accelerate graphics operations and are
extensively used for high performance computing. So, the GPU is hardware specially
designed for highly parallel applications like graphics. The processing capabilities of
GPUs can be seen from the graph shown in figure 2.6 [9]. These processing capabilities
require large memory bandwidth, so GPUs come with high bandwidth; this comparison
can also be seen in figure 2.6. GPUs use many cores and a huge number of transistors,
so they need an adequate cooling mechanism.

Figure 2.6: Processing Capabilities of GPUs and Bandwidth Comparison

General purpose computing with GPU is discussed in the next section.



2.1.2 General Purpose Computing With GPU

GPUs can also be used for general purpose computing because of their massive
parallel processing capabilities. This is known as general purpose computing with
GPU, or GPGPU.

Earlier, GPU computing was limited to graphics processing. After the introduction
of CUDA in NVIDIA GPUs, the programming capabilities were enhanced, and GPUs
are now also used for general computing. Large data handling, complex algorithms,
matrix-related operations: all such processing and programming have been simplified
by GPUs.

The model for GPU computing is to use a CPU and GPU together in a heterogeneous
co-processing computing model. The sequential part of the application runs on the
CPU, while the computationally intensive part is accelerated by the GPU; the
application runs faster because it uses the high performance of the GPU to boost
performance. GPGPU takes the benefit of the parallel architecture of the GPU,
which has hundreds of processor cores that operate together. In this programming
model, the application developer modifies the application to take the compute-intensive
kernels and map them to the GPU; the rest of the application remains on the CPU.
Mapping a function to the GPU involves rewriting the function to expose its
parallelism and adding C keywords to move data to and from the GPU. The developer
is tasked with launching tens of thousands of threads simultaneously [2]. The GPU
hardware manages the threads and does the thread scheduling.

The GPGPU computing model is shown in figure 2.7. It shows the hardware and
software parts of the computing process. The NVIDIA GPU with the CUDA
architecture is the hardware, while the CUDA development environment is the
software, which can be C, C++ or another programming interface such as OpenCL,
DirectX, OpenGL or Python. Using these two parts, many applications can be
designed.

Figure 2.7: GPU Computing Model

GPGPU includes complex matrix operations, vector operations, floating point
scientific applications etc. For all these applications, the GPU provides an easier
way to perform the manipulations. With CPU programming, these operations require
huge execution time and become tedious and complex. With GPU programming, an
array of very large length, or a matrix of very large size, can easily be added to or
multiplied with another array or matrix. So, GPUs are not only useful for graphics
operations; they are also useful in general purpose computing.

2.2 CUDA Architecture

2.2.1 Need of Parallel Processing

CUDA is the architecture that supports parallel processing, which is an important
part of high performance computing. To increase the throughput of computer
processing, there are two ways: either increase the processor speed or implement
parallel execution. Since processor speed can be increased only up to a limit, the
second choice is obviously the better one. Today, software engineers and developers
need to cope with a variety of parallel computing platforms and technologies in order
to provide novel and rich experiences for an increasingly sophisticated base of users
[2]. Multicore programming and parallel processing have marked an evolution in the
computing market.

The difference between increasing the processor clock rate and increasing the number
of processors can be seen from figure 2.8 given below:

Figure 2.8: Getting performance in two ways



As seen from the above figure, one can conclude that parallel processing is more
beneficial. The two important parameters for determining performance are latency
and throughput. Latency is the time to complete a single task, while throughput is
the number of tasks completed in a fixed time; for example, a device that finishes
each individual task slowly but runs a thousand of them concurrently can still have
very high throughput. GPUs are designed for maximum throughput. The GPU core
is the stream processor, and stream processors are grouped into streaming
multiprocessors. A stream processor is basically a SIMD (Single Instruction, Multiple
Data) processor [9]. The SIMD architecture is shown in figure 2.9, and SM operation
is shown in figure 2.10.

Figure 2.9: SIMD Architecture

Figure 2.10: SIMD operation



2.2.2 Basics of CUDA Architecture

CUDA Architecture

Figure 2.11: NVIDIA CUDA

Compute Unified Device Architecture (CUDA) is a parallel computing platform and
programming model created by NVIDIA. It was first implemented in the GeForce
8800 GPU; now all NVIDIA GPUs come with the CUDA architecture. CUDA gives
developers access to the virtual instruction set and memory of the parallel
computational elements in CUDA GPUs [10]. Using CUDA, the latest NVIDIA GPUs
become accessible for computation like CPUs. Unlike CPUs, however, GPUs have a
parallel throughput architecture that emphasizes executing many concurrent threads
slowly, rather than executing a single thread very quickly.

A CUDA program basically calls efficiently written parallel kernels. A kernel is a
function that executes on the GPU, in parallel across a set of parallel threads. The
programmer or compiler organizes these threads into thread blocks and grids of
thread blocks. The GPU instantiates a kernel program on a grid of parallel thread
blocks; each thread within a thread block executes an instance of the kernel, and
each thread has a thread ID within its thread block, a program counter, registers,
per-thread private memory, inputs, and output results. By using this indexing of
blocks and threads, one can keep track of each and every instance of work.
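As an illustrative sketch of this indexing (the kernel name kernel2d and its parameters
are invented for the example, not taken from this report), the block and thread IDs
can be combined into 2D pixel coordinates when working on an image:

// Sketch: block and thread IDs combined into 2D coordinates.
__global__ void kernel2d(float *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index in the grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index in the grid
    if (x < width && y < height)
        img[y * width + x] += 1.0f;                 // one pixel per thread
}

// Host side: a 2D grid of 2D blocks covering a width x height image.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// kernel2d<<<grid, block>>>(dev_img, width, height);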

Concept of Thread, Block and Grid

A thread block is a set of concurrently executing threads that can work coherently
among themselves through synchronization and shared memory (at L1/L2-cache
speed). A thread block has a block ID within its grid, and each block has its own
dimensions; a block's dimension is simply the number of threads within that block.
CUDA provides built-in variables through which one can directly access the index
and dimensions of a block [1].

A grid is an array of thread blocks that execute the same kernel, read inputs from
global memory, write results to global memory, and synchronize between dependent
kernel calls. It is also very important to keep all threads within a grid synchronized.
In the CUDA parallel programming model, each thread has a per-thread private
memory space used for register spills, function calls, and C automatic array variables
[4]. Each thread block has a per-block shared memory space used for inter-thread
communication, data sharing, and result sharing in parallel algorithms; threads of
the same block share this memory. Grids of thread blocks share results in the global
memory space after kernel-wide global synchronization [12]. The CUDA hierarchy is
shown in figure 2.12 [2].
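A minimal sketch of such per-block cooperation through shared memory follows
(illustrative only; it assumes the data length is a multiple of the block size of 256):

// Reverse each 256-element segment in place using per-block shared memory.
__global__ void reverseBlock(int *d)
{
    __shared__ int s[256];            // per-block shared memory
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    s[t] = d[i];                      // stage this block's segment
    __syncthreads();                  // barrier: whole block finishes writing first
    d[i] = s[blockDim.x - 1 - t];     // read data written by other threads
}
// Launched, for example, as reverseBlock<<<n / 256, 256>>>(dev)
// with n a multiple of 256.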

Figure 2.12: CUDA Hierarchy of threads, blocks and grids

Warp Scheduler

The SM (Streaming Multiprocessor) schedules threads in groups of 32 parallel threads
called warps [2]. Each SM features two warp schedulers and two instruction dispatch
units (the concept is similar to parallel pipelining for instruction dispatch), allowing
two warps to be issued and executed concurrently [9]. The dual warp scheduler
selects two warps, and issues one instruction from each warp to a group of sixteen
cores, sixteen load/store units, or four SFUs (Special Function Units). Because warps
execute independently, Fermi's scheduler does not need to check for dependencies
within the instruction stream. Using this elegant model of dual issue, Fermi achieves
near-peak hardware performance. A schematic view of the warp scheduler is shown
in figure 2.13 [9].

Most instructions can be dual-issued: two integer instructions, two floating-point
instructions, or a mix of integer, floating point, load, store, and SFU instructions can
be issued concurrently. Double precision instructions do not support dual dispatch
with any other operation.

Figure 2.13: Warp Scheduler

One of the key architectural innovations that efficiently improved both the
programmability (making it easier for programmers to obtain speedups) and the
performance of GPU applications is on-chip shared memory. Shared memory enables
threads within the same thread block to cooperate, facilitates extensive reuse of
on-chip data, and greatly reduces off-chip traffic. All blocks in a grid share the
common global memory. Shared memory is a key enabler for many high-performance
CUDA applications.

Streaming Multiprocessor

Each SM features 32 CUDA processors, a fourfold increase over prior SM designs.
Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and
floating point unit (FPU). The Fermi architecture implements the fused multiply-add
(FMA) instruction for both single and double precision arithmetic. FMA improves
over a multiply-add (MAD) instruction by doing the multiplication and addition with
a single final rounding step, with no loss of precision in the addition; FMA is therefore
more accurate than performing the operations separately.
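As a small illustration (not from this report), CUDA device code exposes the fused
operation through the standard fmaf() intrinsic; note that nvcc contracts a*b+c into
an FMA by default, controlled by its --fmad option:

// Sketch: separate multiply-add versus fused multiply-add.
__global__ void fmaDemo(float a, float b, float c, float *out)
{
    out[0] = a * b + c;       // multiply rounds, then the add rounds (two roundings,
                              // unless the compiler contracts it into an FMA)
    out[1] = fmaf(a, b, c);   // fused multiply-add: one final rounding step
}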

Each SM has 16 load/store units, allowing source and destination addresses to be
calculated for sixteen threads per clock. Supporting units load and store the data at
each address to cache or DRAM. Special Function Units (SFUs) execute transcendental
instructions such as sine, cosine, reciprocal, and square root. Each SFU executes one
instruction per thread, per clock; a warp executes over eight clocks. The SFU pipeline
is decoupled from the dispatch unit, allowing the dispatch unit to issue to other
execution units while the SFU is occupied. A schematic view of the SM is shown in
figure 2.14 [9].

Some of the applications of CUDA are as follows [1]:

Medical Imaging

Computational Fluid Dynamics

Image Processing

Matrix Manipulations

High Performance Computing

Environmental Science

Figure 2.14: Streaming Multiprocessor

Video Decoding

3D Graphics

Gaming
Chapter 3

Data Processing Codes and Examples

The code processing flow on CUDA is shown in figure 3.1.

Figure 3.1: Processing Flow on CUDA


3.1 Hello, World Program: Code-1

3.1.1 CPU (Host) Version

#include <stdio.h>

int main(void)
{
    printf("Hello, World!\n");
    return 0;
}

3.1.2 GPU (Device) Version

#include <stdio.h>

__global__ void kernel(void)
{
}

int main(void)
{
    kernel<<<1, 1>>>();   // launch an empty kernel on the device
    printf("Hello, World!\n");
    return 0;
}

The first listing runs on the host (CPU), while the second runs on the device (GPU)
[1], [15]. The output is shown in figure 3.2; in the run captured there, the string
"World" was replaced by "Good Morning".

Figure 3.2: Output: Code-1

3.2 Addition Program: Code-2

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

__global__ void add(int a, int b, int *c)
{
    *c = a + b;    // runs on the device; the result lands in device memory
}

int main(void)
{
    int c;
    int a;
    int b;
    int *dev_c;
    cudaMalloc((void**)&dev_c, sizeof(int));

    printf("Enter two values:\n");

    printf("1st Value:");
    scanf("%d", &a);

    printf("2nd Value:");
    scanf("%d", &b);
    add<<<1,1>>>(a, b, dev_c);

    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("Result:\n");
    printf("%d + %d = %d\n", a, b, c);
    cudaFree(dev_c);

    system("PAUSE");
    return 0;
}

Code-2 demonstrates the addition of two integer numbers [1], [15]. The output of
this code is shown in figure 3.3.

Figure 3.3: Output: Code-2

3.3 CUDA Device Properties: Code-3

Code-3 queries the GPU device properties. Using the structure cudaDeviceProp, we
can query the properties of the GPU; knowing these properties tells us the limitations
and the programming capabilities of the device. The output of this code is shown in
figure 3.4.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

int main(void)
{
    cudaDeviceProp prop;
    int count;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++)
    {
        cudaGetDeviceProperties(&prop, i);
        printf("General Information for device %d\n", i);
        printf("Name: %s\n", prop.name);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Clock rate: %d\n", prop.clockRate);
        printf("Device copy overlap: ");
        if (prop.deviceOverlap)
            printf("Enabled\n");
        else
            printf("Disabled\n");
        printf("Kernel execution timeout: ");
        if (prop.kernelExecTimeoutEnabled)
            printf("Enabled\n");
        else
            printf("Disabled\n");
        printf("Memory Information for device %d\n", i);
        printf("Total global mem: %ld\n", prop.totalGlobalMem);
        printf("Total constant mem: %ld\n", prop.totalConstMem);
        printf("Max mem pitch: %ld\n", prop.memPitch);
        printf("Texture alignment: %ld\n", prop.textureAlignment);
        printf("MP Information for device %d\n", i);
        printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
        printf("Shared mem per mp: %ld\n", prop.sharedMemPerBlock);
        printf("Registers per mp: %d\n", prop.regsPerBlock);
        printf("Threads in warp: %d\n", prop.warpSize);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max thread dimensions: (%d, %d, %d)\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max grid dimensions: (%d, %d, %d)\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    }
    system("PAUSE");
    return 0;
}

Figure 3.4: Output: Code-3

3.4 Array Addition: Code-4

Code-4 demonstrates long-array addition. Here, two arrays of length 1024*16 are
added element-wise into a third array. This addition takes very little time with the
GPU implementation. The output of the code is shown in figures 3.5, 3.6 and 3.7.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <conio.h>

#define size 1024*16

__global__ void addKernel(int *c, int *a, int *b)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // global element index
    c[i] = a[i] + b[i];
}

int main()
{
    int *a, *b, *c;

    a = (int*)malloc(sizeof(int)*size);
    if (!a)
    {
        printf("A allocation error");
        getch();
        return 1;
    }

    b = (int*)malloc(sizeof(int)*size);
    if (!b)
    {
        printf("B allocation error");
        getch();
        return 1;
    }

    c = (int*)malloc(sizeof(int)*size);
    if (!c)
    {
        printf("C allocation error");
        getch();
        return 1;
    }

    for (int i = 0; i < size; i++)
    {
        a[i] = 1;
        b[i] = 2;
    }

    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaMalloc((void**)&dev_c, size * sizeof(int));
    cudaMalloc((void**)&dev_a, size * sizeof(int));
    cudaMalloc((void**)&dev_b, size * sizeof(int));
    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);

    addKernel<<<16,1024>>>(dev_c, dev_a, dev_b);   // 16 blocks x 1024 threads = size

    cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    for (int i = 0; i < size; i++)
    {
        printf("%d ", c[i]);
    }
    return 0;
}

Figure 3.5: Output: Code-4(1)



Figure 3.6: Output: Code-4(2)

Figure 3.7: Output: Code-4(3)


Chapter 4

DFT and DCT Implementation

4.1 Discrete Fourier Transform (DFT)

The Discrete Fourier Transform (DFT), or Fast Fourier Transform (FFT), can be
implemented on the GPU with the use of libraries available in CUDA. CUDA supports
the FFT and DFT through the CUFFT library. With the functions available in the
CUFFT library, we can easily apply the DFT to an input data sequence, and to an
image as well. This requires knowledge of certain function definitions and their input
parameters.

CUFFT library supports three types of Fourier Transforms:

Complex-to-Complex (C2C)

Real-to-Complex (R2C)

Complex-to-Real (C2R)

The C2C Fourier transform is generally used to carry out the FFT and DFT. In a
C2C transform, both the input and output sequences are complex. In an R2C
transform, the input sequence is real but the output sequence is complex. Likewise,
C2R has a complex input sequence and a real output sequence. The CUFFT library
provides 1D, 2D and 3D Fourier transforms, with single-precision (32-bit floating
point) as well as double-precision (64-bit floating point) operations [3].

The FFT is a divide-and-conquer algorithm for efficiently computing Discrete Fourier
Transforms of complex or real-valued data sets. It is one of the most important and
widely used numerical algorithms in computational physics and general signal
processing [11]. The CUFFT library provides a simple interface for computing parallel
FFTs on an NVIDIA GPU, which allows users to quickly leverage the floating-point
power and parallelism of the GPU in a highly optimized and tested FFT library [3].

The most important functions related to DFT programming are as follows [3]:

cufftPlan1d()

cufftDestroy()

cufftExecC2C()

cufftExecR2C()

cufftExecC2R()

cufftPlan1d() : Creates a 1D FFT plan configuration for a specified signal size and
data type. The batch input parameter tells CUFFT how many 1D transforms to
configure. Similarly, 2D and 3D transforms can be configured.
definition: cufftPlan1d(cufftHandle *plan, int nx, cufftType type, int batch)
Here, plan is a pointer to a cufftHandle object. The NX parameter gives the transform
size (e.g. 8 for an 8-point DFT). The type parameter tells the library the type of
transform (R2C, C2R or C2C). BATCH indicates the number of transforms of size
NX. The output of this function is the plan object value.
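For instance (an illustrative call, not taken from this project's code), a single plan
can describe four independent 8-point complex-to-complex transforms:

cufftHandle plan;
// nx = 8, type = CUFFT_C2C, batch = 4: one cufftExecC2C() call on this plan
// will then transform all four 8-point sequences stored back to back.
cufftPlan1d(&plan, 8, CUFFT_C2C, 4);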

cufftDestroy() : Frees all GPU resources associated with a CUFFT plan and destroys
the internal plan data structure. This function should be called once a plan is no
longer needed, to avoid wasting GPU memory.
definition: cufftDestroy(cufftHandle plan)
The input parameter is the plan that is to be destroyed.

cufftExecC2C() : Executes a single-precision complex-to-complex transform plan in
the direction specified by the direction parameter. CUFFT uses the GPU memory
pointed to by the idata parameter as input data, and stores the Fourier coefficients
in the odata array. If idata and odata are the same, this method does an in-place
transform.
definition : cufftExecC2C(cufftHandle plan, cufftComplex *idata, cufftComplex
*odata, int direction)
The direction parameter specifies whether the transform is in the forward or the
inverse direction. Similarly, cufftExecR2C and cufftExecC2R can be used; in these
functions the direction parameter is absent.

The data types available in CUFFT libraries are as follows:

cufftHandle

cufftComplex

cufftReal

cufftDoubleReal

cufftDoubleComplex

The code given below illustrates the 4-point DFT of a real input sequence. It also
applies the inverse DFT to recover the input sequence. Its output is shown in
figure 4.1.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <conio.h>
#include <cufft.h>

#define NX 4
#define BATCH 1

int main(void)
{
    cufftHandle plan;
    int i;
    cufftComplex *device;
    cufftComplex host[NX*BATCH];
    cufftComplex temp[NX*BATCH];

    cudaMalloc((void **)&device, sizeof(cufftComplex)*NX*BATCH);

    printf("Welcome to FFT C2C Program\n");

    printf("INPUT DATA:\n");

    host[0].x = 1;
    host[1].x = 2;
    host[2].x = 0;
    host[3].x = 1;

    host[0].y = 0;
    host[1].y = 0;
    host[2].y = 0;
    host[3].y = 0;

    for (i = 0; i < 4; i++)
    {
        printf("%f + i%f\n", host[i].x, host[i].y);
    }

    cudaMemcpy(device, host, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyHostToDevice);
    if (cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
    }

    if (cufftExecC2C(plan, device, device, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Forward failed\n");
    }

    cudaMemcpy(host, device, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyDeviceToHost);

    printf("OUTPUT DATA:\n");

    for (i = 0; i < 4; i++)
    {
        printf("%f + i%f\n", host[i].x, host[i].y);
    }

    if (cufftExecC2C(plan, device, device, CUFFT_INVERSE) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Inverse failed\n");
    }

    cudaMemcpy(temp, device, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyDeviceToHost);

    printf("INVERSE FFT:\n");

    for (i = 0; i < 4; i++)
    {
        // CUFFT's inverse transform is unnormalized, so divide by NX
        printf("%f + i%f\n", temp[i].x/NX, temp[i].y/NX);
    }
    cufftDestroy(plan);
    cudaFree(device);   // host and temp are stack arrays; only the device buffer is freed
    system("PAUSE");
    return 0;
}

Figure 4.1: Output: 4-Point DFT

The time plot and kernel execution timing details are shown in figure 4.2.

In the DFT code we do not write any kernels; the CUFFT library itself contains the
kernels for the execution of the DFT and IDFT, and uses two important kernels for
the cufftPlan1d and cufftExecC2C functions. According to the time plot, the kernel
time is 2.23% of the total GPU time; the time axis (Y-axis) is in units of milliseconds.
The kernel vectorRadix2 takes the maximum time [14], but the overall computation
time is quite small. When the DFT is calculated for a large number of points, CUDA
C programming provides better performance.

Figure 4.2: Time Plot for DFT Code

Similarly, the 2D DFT can be calculated and applied to a 2D matrix or to an image.
Since an image is 2D in nature, the 2D DFT and 2D DCT are the more suitable
transforms. The DFT is applied along both the length and the width: the NX
parameter gives the number of points in the x-direction, while the NY parameter
gives the number of points in the y-direction.

The code given below illustrates 2D DFT. It contains the core part only.
#define NX 8
#define NY 8

int main(void)
{
    cufftHandle plan;
    cufftComplex *device;
    cufftComplex host[NX][NY];
    cufftComplex temp[NX][NY];

    cudaMalloc((void **)&device, sizeof(cufftComplex)*NX*NY);
    cudaMemcpy(device, host, sizeof(cufftComplex)*NX*NY, cudaMemcpyHostToDevice);

    if (cufftPlan2d(&plan, NX, NY, CUFFT_C2C) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
    }

    if (cufftExecC2C(plan, device, device, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Forward failed\n");
    }

    cudaMemcpy(host, device, sizeof(cufftComplex)*NX*NY, cudaMemcpyDeviceToHost);
    if (cufftExecC2C(plan, device, device, CUFFT_INVERSE) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Inverse failed\n");
    }

    cudaMemcpy(temp, device, sizeof(cufftComplex)*NX*NY, cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
}

The output of this code is shown in figure 4.3. The input matrix is of order 3x3 and
its values are: [1 2 1; 1 3 1; 1 4 1].

Figure 4.3: Output of 2D DFT code



4.2 Discrete Cosine Transform (DCT)

The discrete cosine transform (DCT) and discrete sine transform (DST) are members
of a family of sinusoidal unitary transforms. They are real, orthogonal, and separable,
with fast algorithms for their computation, and they have great relevance to data
compression.

The DCT is a Fourier-related transform similar to the discrete Fourier transform
(DFT), but using only real numbers. DCTs are equivalent to DFTs of roughly twice
the length, operating on real data with even symmetry. The obvious distinction
between a DCT and a DFT is that the former uses only cosine functions, while the
latter uses both cosines and sines (in the form of complex exponentials) [5].

Compared with DFT, DCT has two main advantages [5]:

It is a real transform, with better computational efficiency than the DFT, which
by definition is a complex transform.

It does not introduce discontinuity while imposing periodicity on the time signal.
In the DFT, as the time signal is truncated and assumed periodic, discontinuity is
introduced in the time domain and corresponding artifacts are introduced in the
frequency domain. But as even symmetry is assumed while truncating the time
signal, no discontinuity and related artifacts are introduced in the DCT [7].

The DCT has four different definitions; DCT-II is generally used for data compression.
The importance of DCT-II is further accentuated by its:

Superiority in bandwidth compression (redundancy reduction) of a wide range
of signals.
Powerful performance in the bit-rate reduction.

Existence of fast algorithms for its implementation.
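For reference, the standard unnormalized N-point DCT-II of a sequence x[n], which
matches the form computed by the CPU code later in this chapter, is:

C[k] \;=\; 2\sum_{n=0}^{N-1} x[n]\,\cos\!\left(\frac{\pi k\,(2n+1)}{2N}\right),
\qquad k = 0, 1, \ldots, N-1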

So, the DCT is preferred over the DFT for image processing algorithms. When the
DCT is applied to an image, it introduces no loss to the source image samples [7].
The image pixels have integer brightness values ranging from 0 to 255. These are real
numbers, and the DCT is suitable for real numbers. When the DCT is applied, the
first coefficient is the DC coefficient, while all others are AC. The DC component has
the highest value, and the DCT coefficient values decrease from the low-order
coefficients to the higher ones. The high frequency coefficients can be neglected since
their values are small. The application of the DCT to an image is shown in figure 4.4.

Figure 4.4: DCT on image

The DCT algorithm can also be implemented using the CUFFT library in CUDA.
The problem is that direct DCT computation functions are not available in the
CUFFT library, so the DCT must be derived from a DFT computation. For an
N-point DCT computation, a 2N-point DFT computation is required; similarly, for
an N-point IDCT computation, a 2N-point IDFT computation is required.

The algorithm for computing the DCT from the DFT is given below [6]:

1 . Take any N-point input signal, say x[n].

2 . Form the 2N-point signal y[n] from x[n] as follows:
y[n] = x[n] for 0 ≤ n ≤ N-1,
y[n] = x[2N-1-n] for N ≤ n ≤ 2N-1.

3 . Calculate Y[k], the 2N-point DFT of y[n].

4 . For 0 ≤ k ≤ N-1,
C[k] = Re{ W_{2N}^{k/2} · Y[k] }, where W_{2N} = e^{-j2π/(2N)}.

C[k] denotes the N-point DCT of the N-point input x[n]. The algorithm for computing
the IDCT from the IDFT is given below [6]:

1 . Construct Y[k] from the N-point DCT C[k] as follows:
Y[k] = W_{2N}^{-k/2} · C[k] for 0 ≤ k ≤ N-1;
Y[k] = 0 for k = N;
Y[k] = -W_{2N}^{-k/2} · C[2N-k] for N+1 ≤ k ≤ 2N-1.

2 . Calculate y[n], the 2N-point inverse DFT of Y[k].

3 . For 0 ≤ n ≤ N-1, let x[n] = y[n].



Here, x[n] is the N-point IDCT of C[k]. The CPU (host) code for the DCT-IDCT
computation is given below:

#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <time.h>
#include <stdlib.h>

#define length 10240
#define PI 3.1428

int main()
{
    clock_t start, stop;
    start = clock();
    float data[length], xdct[length] = {0}, idct[2*length] = {0};

    for (int i = 0; i < length; i++)
    {
        data[i] = i;
    }
    for (int i = 0; i < length; i++)
    {
        printf("element in %d is: %f\n", i, data[i]);
    }
    // N-point DCT-II computed directly as a double loop: O(N^2)
    for (int i = 0; i < length; i++)
    {
        for (int j = 0; j < length; j++)
        {
            xdct[i] = xdct[i] + 2*data[j]*cos(PI*i*(2*j+1)/2/length);
        }
    }
    // Inverse DCT
    for (int i = 0; i < 2*length; i++)
    {
        for (int j = 1; j < length; j++)
        {
            idct[i] = idct[i] + xdct[j]*cos(PI*j*(2*i+1)/2/length)/length;
        }
        idct[i] = idct[i] + xdct[0]/2/length;
    }
    stop = clock();
    printf("Execution Time in milliseconds (DCT on CPU): %f\n", (double)(stop-start));
    system("PAUSE");
    return 0;
}

The output of the above code is shown in figure 4.5. It gives an idea of the execution
time that the CPU takes to calculate the DCT for a large number of input points.
For a small number of points, the CPU takes very little time, but for a large number
of points such as 10240, it takes about 11.5 seconds. The GPU version of this code,
on the other hand, takes only about 0.174 seconds, as shown in figure 4.6.

Figure 4.5: DCT on CPU

When this code runs on the GPU, it takes less execution time for a large number of
points. The GPU version contains kernels for the DFT computation, the IDFT
computation, twiddle-factor multiplication, 2N-point conversion, etc. The comparison
between the CPU and GPU versions of the code is given in tables 4.1 and 4.2.

Table 4.1: DCT on CPU

DCT on CPU
The length of the code is comparatively smaller.
It executes entirely on the CPU; there is no use of CUDA programming.
Execution time is comparatively less for a small number of points (e.g. 100).
Execution time is comparatively more for a large number of points (e.g. 100000).
Not suitable for the image processing algorithm.
Regenerated values are less accurate.

Table 4.2: DCT on GPU

DCT on GPU
The length of the code is comparatively larger.
It executes through different kernel calls and takes the benefits of CUDA programming.
Execution time is comparatively more for a small number of points (e.g. 100).
Execution time is comparatively less for a large number of points (e.g. 100000).
Highly suitable for the image processing algorithm.
Regenerated values are very accurate compared to the CPU-generated values, due to the SFUs.

Figure 4.6: DCT on GPU

The DCT code for GPU is shown below.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <conio.h>
#include <math.h>
#include <cufft.h>

#define NX 4
#define PI 3.14285714
#define BATCH 1

int main(void)
{
    int i;
    cufftComplex host1[NX*BATCH];
    cufftComplex host3[NX*BATCH];
    cufftComplex host2[NX*2*BATCH];
    cufftHandle plan;
    cufftComplex *device;
    cufftComplex host[NX*2*BATCH];
    cudaMalloc((void **)&device, sizeof(cufftComplex)*NX*2*BATCH);
    printf("N-point DCT computation using 2N-point DFT\n");
    printf("Enter the input data:\n");
    for (i = 0; i < NX; i++)
    {
        printf("Enter value of host[%d].x: ", i);
        scanf("%f", &host1[i].x);
        host1[i].y = 0;
    }

    // Build the symmetric 2N-point sequence y[n] from the N-point input
    for (i = 0; i < NX; i++)
    {
        host[i].x = host1[i].x;
        host[i].y = 0;
    }

    for (i = NX; i < 2*NX; i++)
    {
        host[i] = host1[2*NX-1-i];
        host[i].y = 0;
    }

    cudaMemcpy(device, host, sizeof(cufftComplex)*NX*2*BATCH, cudaMemcpyHostToDevice);
    if (cufftPlan1d(&plan, 2*NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
    }

    if (cufftExecC2C(plan, device, device, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Forward failed\n");
    }

    cudaMemcpy(host, device, sizeof(cufftComplex)*NX*2*BATCH, cudaMemcpyDeviceToHost);
    printf("OUTPUT DATA:\n");
    for (i = 0; i < 2*NX; i++)
    {
        printf("%f + i%f\n", host[i].x, host[i].y);
    }

    // Twiddle-factor multiplication: C[k] = Re{ W^(k/2) * Y[k] }
    for (i = 0; i < NX; i++)
    {
        host[i].x = host[i].x * cos(PI*i/2/NX) + host[i].y * sin(PI*i/2/NX);
        host[i].y = 0;
    }

    printf("OUTPUT DCT\n");

    for (i = 0; i < NX; i++)
    {
        printf("%f ", host[i].x);
    }

    // Reconstruct the 2N-point spectrum Y[k] from the DCT coefficients
    for (i = 0; i < NX; i++)
    {
        host2[i].x = host[i].x * cos(PI*i/2/NX);
        host2[i].y = host[i].x * sin(PI*i/2/NX);
    }
    if (i == NX)    // i equals NX after the loop above; Y[N] = 0
    {
        host2[i].x = 0;
        host2[i].y = 0;
    }
    for (i = NX+1; i < 2*NX; i++)
    {
        host2[i].x = -host[2*NX-i].x * cos(PI*i/2/NX);
        host2[i].y = -host[2*NX-i].x * sin(PI*i/2/NX);
    }
    cudaMemcpy(device, host2, sizeof(cufftComplex)*NX*2*BATCH, cudaMemcpyHostToDevice);

    if (cufftPlan1d(&plan, 2*NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
    }
    if (cufftExecC2C(plan, device, device, CUFFT_INVERSE) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Inverse failed\n");
    }
    cudaMemcpy(host3, device, sizeof(cufftComplex)*NX*2*BATCH, cudaMemcpyDeviceToHost);
    printf("INVERSE DCT OUTPUT:\n");
    for (i = 0; i < NX; i++)
    {
        printf("%f ", host3[i].x/2/NX);
    }
    system("PAUSE");
    return 0;
}

This code uses the CUFFT library and the DFT-to-DCT conversion algorithm. Its
output is shown in figure 4.7 for the input data [1 2 3 4]. Still, this code is not
completely implemented on the GPU, because it uses for loops for the twiddle-factor
multiplication and the 2N-point conversion; kernels are called only for the DFT
computation. An all-kernel version is therefore given below. Here, only the kernel
definitions are shown; the complete code for image processing is illustrated in
chapter 5.

Figure 4.7: DCT and IDCT code output

__global__ void convert2N(cufftComplex *fromhostdata, cufftComplex *gpu2Ndata)
{
    unsigned long int i = blockIdx.x * (blockDim.x*2) + threadIdx.x;
    unsigned long int j = blockIdx.x * (blockDim.x) + threadIdx.x;
    unsigned long int p = blockIdx.x * (blockDim.x);
    unsigned long int k = threadIdx.x;
    gpu2Ndata[i].x = fromhostdata[j].x;             // first half: copy of the input
    gpu2Ndata[NX+i].x = fromhostdata[p+NX-k-1].x;   // second half: mirrored input
    gpu2Ndata[i].y = 0;
    gpu2Ndata[NX+i].y = 0;
}

Figure 4.8: Convert2N kernel

The Convert2N kernel is used to form the 2N-point sequence from the N-point input
sequence, as shown in figure 4.8. The twiddle multiplication kernel is given below.

__global__ void twiddle(cufftComplex *gpu2Ndata)
{
    unsigned long int i = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned long int k = threadIdx.x;
    // C[k] = Re{ W^(k/2) * Y[k] }: combine the real and imaginary parts
    gpu2Ndata[i].x = gpu2Ndata[i].x * cos(PI*k/2/NX) + gpu2Ndata[i].y * sin(PI*k/2/NX);
    gpu2Ndata[i].y = 0;
}

The ToDCT kernel is illustrated in figure 4.9 and in the code given below. It is used
to collect the transformed data row-wise.

__global__ void todct(cufftComplex *xdct, cufftComplex *data2N)
{
    unsigned long int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    unsigned long int j = blockIdx.x*(blockDim.x) + threadIdx.x;
    xdct[j].x = data2N[i].x;   // keep the first N points of each 2N-point block
    xdct[j].y = 0;
}

Figure 4.9: ToDCT kernel

Figure 4.10: Transpose kernel

__global__ void transpose(cufftComplex *data, cufftComplex *ct)
{
    unsigned long int id = threadIdx.x;
    for (int i = 0; i < BATCH; i++)
    {
        ct[id*BATCH + i] = data[i*NX + id];   // swap row and column indices
    }
}
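The transpose-based approach rests on the separability of the 2D DCT: writing R(·)
for the 1D DCT applied to every row of a matrix, the 2D transform can be computed
as (a standard identity, stated here for clarity)

\mathrm{DCT2D}(X) \;=\; R\bigl(R(X)^{T}\bigr)^{T}

that is, 1D DCTs along the rows, a transpose, 1D DCTs along the former columns,
and a final transpose back.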

The transpose kernel is explained in figure 4.10. It is central to the 2D DCT
implementation, which obtains the 2D DCT from 1D DCTs by this transpose method.
Finally, the kernel for the IDCT computation is shown in figure 4.11 and in the code
given below:

Figure 4.11: IDCT kernel

__global__ void idct2N(cufftComplex *dctdata, cufftComplex *idct2Ndata)
{
    unsigned long int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    unsigned long int j = blockIdx.x*(blockDim.x) + threadIdx.x;
    unsigned long int k = threadIdx.x;
    unsigned long int q = blockIdx.x*(blockDim.x*2);
    unsigned long int p = blockIdx.x*(blockDim.x);
    idct2Ndata[i].x = dctdata[j].x*cos(PI*(i+q)/2/NX);
    idct2Ndata[i].y = dctdata[j].x*sin(PI*(i+q)/2/NX);
    if (k == 0)
    {
        idct2Ndata[NX+q].x = 0;   // the Nth spectral point of each block is zero
        idct2Ndata[NX+q].y = 0;
    }
    else
    {
        idct2Ndata[NX+i].x = -dctdata[p+NX-k].x*cos(PI*(q+NX+i)/2/NX);
        idct2Ndata[NX+i].y = -dctdata[p+NX-k].x*sin(PI*(q+NX+i)/2/NX);
    }
}

These kernels are called for the DCT implementation on a signal as well as on an
image. The application of this algorithm to image processing is explained in the next
chapter.
Chapter 5

Image Processing Algorithm

5.1 Image Processing

Image processing is any form of signal processing for which the input is an image.
It generally refers to the processing of a 2D picture by a computer. The output of
image processing may be either an image or a set of parameters related to the image.
There are different types of image processing: digital image processing, analog image
processing and optical image processing. Image processing refers to the application
of different algorithms to images.

There are different algorithms available to process the image. Some of them are listed
below:

Image Enhancement

Image Restoration

Image Compression


De-Blurring

De-Noising

Edge Detection

Image Smoothing

Convolution Operation

In this project, the algorithm has been applied to grayscale images; only the red
component of each image is taken. The same algorithm can be applied to color
images, in which the red, green and blue components are considered separately.

5.2 Image Reading with MATLAB

To read an image in CUDA C, it must be available in BIN format. Using the
MATLAB code given below, an image can be converted into a .bin file containing
the pixel values. This file is then read in CUDA C through FILE operations.

c = imread('Koala.jpg');            % read the image
r = c(:,:,1);                       % keep the red component only
t = r(:);                           % flatten to a column vector
p = t;
fid = fopen('koala-bin.bin','w');
fwrite(fid,p);                      % write the raw pixel values
fclose(fid);

The image processing workflow is illustrated in figure 5.1.

Figure 5.1: Image Processing Workflow

5.3 Image Processing Code

The image processing code is illustrated below. It does not repeat the kernel
definitions already discussed in chapter 4; the additional normalization kernels are
included.

#define NX 768
#define BATCH 1024
#define PI 3.14285

unsigned long int i;

const unsigned long int buffer_size = 786432;   // 1024 x 768 pixels

FILE *source;
unsigned long int count = 0;
unsigned long int written = 0;

__global__ void devideNX(cufftComplex *xdct)
{
    unsigned long int id = threadIdx.x + blockIdx.x*blockDim.x;
    xdct[id].x = xdct[id].x/2/NX;      // normalize the unscaled inverse transform
}

__global__ void devideBATCH(cufftComplex *xdct)
{
    unsigned long int id = threadIdx.x + blockIdx.x*blockDim.x;
    xdct[id].x = xdct[id].x/2/BATCH;
}

int main(void)
{
    cudaError_t cudaStatus;
    cudaDeviceReset();
    cufftComplex host[NX*BATCH], host2N[2*NX*BATCH], temp[NX*BATCH];
    cufftHandle plan2N;
    cufftHandle plan2N2;
    clock_t start, stop, start1, stop1;

    if (cufftPlan1d(&plan2N, 2*NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
        getch();
    }

    if (cufftPlan1d(&plan2N2, 2*BATCH, CUFFT_C2C, NX) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
        getch();
    }

    unsigned char buffer;

    source = fopen("img1024x768.bin", "r");
    unsigned long int i = 0, n = 0;
    while (i < buffer_size)
    {
        fseek(source, i, 0);
        n = fread(&buffer, 1, 1, source);
        host[i].x = buffer;     // the pixel value becomes the real part
        host[i].y = 0;
        i++;
    }
    printf("Reading is done.....\n");
    fclose(source);
    start = clock();

    cufftComplex *device1N, *device1Ncpy, *device2N, *devicecpy2, *ansdct, *ansidct;
    cudaMalloc((void **)&device1N, sizeof(cufftComplex)*NX*BATCH);
    cudaMalloc((void **)&device1Ncpy, sizeof(cufftComplex)*NX*BATCH);
    cudaMemcpy(device1N, host, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&device2N, sizeof(cufftComplex)*NX*2*BATCH);
    cudaMalloc((void **)&devicecpy2, sizeof(cufftComplex)*NX*2*BATCH);

    cudaMalloc((void **)&ansidct, sizeof(cufftComplex)*NX*BATCH);
    convert2N<<<BATCH,NX>>>(device1N, device2N);    // mirror each row to 2N points

    if (cufftExecC2C(plan2N, device2N, devicecpy2, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Forward failed\n");
        getch();
    }

    cudaMalloc((void **)&ansdct, sizeof(cufftComplex)*NX*BATCH);
    twiddle<<<(2*BATCH),NX>>>(devicecpy2);          // row-wise DCT from 2N-point DFT
    todct<<<BATCH,NX>>>(ansdct, devicecpy2);

    transpose<<<BATCH,NX>>>(ansdct, device1Ncpy);   // switch to the column direction

    convert2N2<<<NX,BATCH>>>(device1Ncpy, device2N);

    if (cufftExecC2C(plan2N2, device2N, device2N, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Forward failed\n");
        getch();
    }

    twiddle<<<(2*NX),BATCH>>>(device2N);            // column-wise DCT
    todct<<<NX,BATCH>>>(ansdct, device2N);

    idct2N2<<<NX,BATCH>>>(ansdct, devicecpy2);      // start the inverse path

    if (cufftExecC2C(plan2N2, devicecpy2, devicecpy2, CUFFT_INVERSE) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Inverse failed\n");
    }

    todct<<<NX,BATCH>>>(ansidct, devicecpy2);
    devideBATCH<<<NX,BATCH>>>(ansidct);

    transpose2<<<NX,BATCH>>>(ansidct, ansdct);

    idct2N<<<BATCH,NX>>>(ansdct, device2N);

    if (cufftExecC2C(plan2N, device2N, device2N, CUFFT_INVERSE) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C Inverse failed\n");
    }

    todct<<<BATCH,NX>>>(ansidct, device2N);
    devideNX<<<BATCH,NX>>>(ansidct);
    stop = clock();
    printf("IDCT done\n");
    start1 = clock();
    cudaStatus = cudaMemcpy(host, ansidct, sizeof(cufftComplex)*NX*BATCH,
                            cudaMemcpyDeviceToHost);
    stop1 = clock();
    if (cudaStatus != cudaSuccess)
    {
        printf("cudaMemcpy failed\n");
        getch();
    }

    printf("INVERSE DCT OUTPUT:\n");

    for (i = 0; i < NX*BATCH; i++)
    {
        printf("IDCT in %d is %f\n", i, host[i].x);
    }

    source = fopen("img1024x768binIDCT.bin", "w");
    i = 0;
    while (i < buffer_size)
    {
        buffer = abs(host[i].x);    // regenerated pixel value
        fseek(source, i, 0);
        fwrite(&buffer, 1, 1, source);
        i++;
    }
    printf("writing is done.....\n");
    getch();
    fclose(source);

    printf("EXECUTION TIME in milliseconds (DCT on GPU) for %d * %d points : %f\n",
           NX, BATCH, (double)(stop-start));
    printf("Cudamemcpy in milliseconds (DCT on GPU) for %d * %d points : %f\n",
           NX, BATCH, (double)(stop1-start1));
    system("PAUSE");
    cufftDestroy(plan2N);
    cufftDestroy(plan2N2);
    cudaFree(device1N);
    cudaFree(ansdct);
    cudaFree(ansidct);
    cudaFree(device2N);
    cudaFree(devicecpy2);
    cudaDeviceReset();
    return 0;
}

The output of the above code is shown in figure 5.2. The original image data are
shown in figure 5.3; the regenerated values are nearly the same. The execution time
output is shown in figure 5.4.

Figure 5.2: Output of image processing code

Comparing the results, the regenerated pixel values are nearly the same as the
original pixel values. There is a slight difference in some of the values, which is
the cause of the pixel scattering seen in some high-resolution images. This
algorithm was then applied to different images.

Figure 5.3: Original values

Figure 5.4: Execution time for 1024x768 image



This algorithm has been applied to images of different sizes. The smallest
resolution taken is 128x128 and the largest is 1024x768. On the GTX 480 GPU used in
this project, the maximum number of threads per block is 1024, so the largest image
that can be processed by this implementation is limited to 1024 pixels in the
horizontal direction. The DCT is applied to every pixel, and the NX and BATCH
parameters are varied according to the image dimensions; the process and flow
remain the same for all images. The per-block thread limit can be confirmed at run
time, as shown in the sketch below.
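
A small sketch (not part of the project code) that queries the device limits
through the CUDA runtime API:

/* Query the GPU's per-block thread limit instead of hard-coding 1024 */
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                 /* device 0 */
printf("Max threads per block : %d\n", prop.maxThreadsPerBlock);
printf("Max block dim (x)     : %d\n", prop.maxThreadsDim[0]);

On the GTX 480 (compute capability 2.0), both values are 1024, which is what limits
the horizontal image size in this implementation.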

5.4 Image Regeneration

The regenerated .bin file produced by the CUDA code is read back in MATLAB and
reshaped into the image matrix. Here row and clm are the image height and width in
pixels (768 and 1024 for the 1024x768 image):

fid = fopen('img1024x768binIDCT.bin','r');
p = fread(fid);                      % one byte per pixel
fclose(fid);
row = 768;                           % image height in pixels
clm = 1024;                          % image width in pixels
for i = 1:clm
    for j = 1:row
        q(j,i) = p(j + (i-1)*row);   % rebuild the matrix column-wise
    end
end
imwrite(uint8(q),'img1024x768-regenerated.jpg');   % cast to uint8 before writing

The results of the image processing code are shown in figures 5.5, 5.6, 5.7, 5.8,
5.9 and 5.10.

Figure 5.5: Original Image and Regenerated Image:128x128

Figure 5.6: Original Image and Regenerated Image:176x144

Figure 5.7: Original Image and Regenerated Image:256x256



Figure 5.8: Original Image and Regenerated Image:512x512

Figure 5.9: Original Image and Regenerated Image:800x600



Figure 5.10: Original Image and Regenerated Image:1024x768

Table 5.1: Execution time for different images

Image Size Processing Time cudaMemcpy Time


128x128 Less than 1 ms Less than 1 ms
176x144 Less than 1 ms Less than 1 ms
256x256 Less than 1 ms 15 ms
512x512 Less than 1 ms 249 ms
800x600 15 ms 826 ms
1024x768 16 ms 1762 ms

The execution times for the complete processing of different images, and the
cudaMemcpy times (the time required to copy the final values from device to host),
are compared in table 5.1.
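
A side note on the measurements: clock() records elapsed host time, while CUDA
events give a device-side measurement in milliseconds. A minimal sketch (not part
of the project code, reusing host and ansidct from the listing in section 5.3) of
timing the final copy with events:

/* Time the device-to-host copy with CUDA events instead of clock() */
cudaEvent_t evStart, evStop;
float ms = 0.0f;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);
cudaEventRecord(evStart, 0);
cudaMemcpy(host, ansidct, sizeof(cufftComplex) * NX * BATCH,
           cudaMemcpyDeviceToHost);
cudaEventRecord(evStop, 0);
cudaEventSynchronize(evStop);                /* wait until the copy finishes */
cudaEventElapsedTime(&ms, evStart, evStop);  /* elapsed time in milliseconds */
printf("cudaMemcpy took %f ms", ms);
cudaEventDestroy(evStart);
cudaEventDestroy(evStop);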

5.5 PSNR Calculation

Peak Signal-to-Noise Ratio, often abbreviated PSNR, is an engineering term for the
ratio between the maximum possible power of a signal and the power of corrupting
noise that affects the fidelity of its representation. Because many signals have a
very wide dynamic range, PSNR is usually expressed on the logarithmic decibel
scale. PSNR is most commonly used to measure the quality of reconstruction of lossy
compression codecs (e.g., for image compression). The signal in this case is the
original data, and the noise is the error introduced by compression. When comparing
compression codecs, PSNR is an approximation to human perception of reconstruction
quality.
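
In equation form, for an original M x N image I1, its regenerated version I2 and
peak pixel value P:

    MSE  = (1 / (M*N)) * sum over i = 1..M, j = 1..N of [ I1(i,j) - I2(i,j) ]^2
    PSNR = 10 * log10( P^2 / MSE )  dB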

PSNR calculation code is given below:

I1 = double(imread('img1024x768-original.jpg'));
I2 = double(imread('img1024x768-regenerated.jpg'));
P = 255;                                  % peak value for 8-bit images
MSE = mean((I1(:) - I2(:)).^2);           % element-wise squared error
PSNR = 10*log10(P*P/MSE);

Here, P is the maximum possible pixel value of the image, which is 255 for 8-bit
data. The original and regenerated images are compared and the PSNR is calculated
in MATLAB. The PSNR values for different images are given in table 5.2.

Table 5.2: PSNR Values for Different Images

Image Size PSNR Value(in dB)


128x128 34.5625
176x144 35.6318
256x256 32.4320
512x512 36.4364
800x600 25.8664
1024x768 26.2030

PSNR values above 30 dB are generally considered acceptable. The low PSNR values
for the 800x600 and 1024x768 images are due to the black portions that appear in
the regenerated images; this arises because the GPU machine used is not powerful
enough for these sizes, and it can be addressed by using a more powerful GPU.
Chapter 6

Conclusion and Future Scope

6.1 Conclusion

As fast computation of massive data is a very important problem nowadays, parallel
processing using a Graphics Processing Unit can provide a solution. Image
processing is also a very wide field that offers solutions to many real-time
problems, but processing images on a CPU takes too much time and can make a system
non-real-time.

With the purpose of making image processing near real time, an algorithm named the
Discrete Cosine Transform has been successfully implemented on the GPU for image
processing. The output results obtained from the project are very promising and
much more effective compared to the results of CPU-processed data. For large
images, the results show a very impressive speed-up. The only problem noticed is a
little distortion of the image, present only in very large images; even so, the
regenerated image retains much of its information value. This project can further
be extended with the implementation of quantization, which will enable compression
of the data once a proper compression technique is applied.


Apart from this, there are many other image processing and filtering algorithms
that can be implemented on the GPU and that can eventually lead to a speed-up of
processing. The only bottleneck of processing with a GPU is the time consumed in
transferring data from CPU memory to GPU memory or vice versa; research is ongoing
to overcome this obstruction.
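
One widely used mitigation, although not applied in this project, is pinned
(page-locked) host memory, which lets host-device copies run as DMA transfers at
higher bandwidth. A minimal sketch, assuming the same NX, BATCH and device1N as in
the listing of section 5.3:

/* Allocate pinned (page-locked) host memory for faster PCIe transfers */
cufftComplex *pinnedHost;
cudaMallocHost((void **)&pinnedHost, sizeof(cufftComplex) * NX * BATCH);
/* ... fill pinnedHost with pixel data, then copy as usual ... */
cudaMemcpy(device1N, pinnedHost, sizeof(cufftComplex) * NX * BATCH,
           cudaMemcpyHostToDevice);
/* ... */
cudaFreeHost(pinnedHost);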

6.2 Future Scope

We have implemented the DCT and IDCT algorithms on the CUDA architecture and
applied the image processing algorithm to different images; the algorithm makes use
of the DFT, DCT and IDCT. Using these codes, one can process different images by
applying a compression algorithm: with the various quantization tables available,
it is easy to compress images, and one only needs to add the quantization logic to
the existing codes (a hedged sketch is given at the end of this section). The
algorithm can also be applied to images of various resolutions and sizes, and one
can compare the execution time needed to process an image on a CPU or another
platform against the GPU platform with the CUDA architecture. With this algorithm,
images can be compressed or processed with less execution time, which makes it
especially useful for large (high-resolution) images.
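
As a starting point, a hedged sketch of such quantization logic is given below. The
quantize kernel and the qtable array are hypothetical illustrations, not part of
the project code; qtable would hold one quantization step per coefficient (for
example, a JPEG-style 8x8 table tiled across the image), and the kernel could be
launched as quantize<<<BATCH, NX>>>(ansdct, qtable) right after the forward DCT:

/* Hypothetical sketch: quantize the DCT coefficients in place.
   qtable is a hypothetical array of NX*BATCH step sizes. Rounding each
   coefficient to a multiple of its step size is what discards information
   and makes later entropy coding effective. */
__global__ void quantize(cufftComplex *dct, const float *qtable)
{
    unsigned long int id = threadIdx.x + blockIdx.x * blockDim.x;
    dct[id].x = rintf(dct[id].x / qtable[id]) * qtable[id];
}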
