Image Processing on NVIDIA GPU

Bachelor of Technology
in
Electronics & Communication Engineering

By
Aseem Jaymeen Bharatkumar (09BEC003)
Ghetia Shivang Arvindbhai (09BEC019)

Under the guidance of
Prof. N.P.Gajjar
Certificate
This is to certify that the Major Project Report entitled Image Processing on
NVIDIA GPU submitted by Aseem Jaymeen Bharatkumar (09BEC003)
and Ghetia Shivang Arvindbhai (09BEC019), as the partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology in Electronics &
Communication Engineering, Institute of Technology, Nirma University, Ahmedabad
is the record of work carried out by them under my supervision and guidance. The
work submitted has, in my opinion, reached a level required for being accepted for
examination.
Prof. N.P.Gajjar
Sr. Assoc. Professor,
Elect. & Comm. Engineering,
Institute of Technology,
Nirma University, Ahmedabad.
Acknowledgement
We would like to thank our professors for their support, encouragement and guidance
during this major project. It is not possible to name and thank them all individually,
but we must make special mention of some of them and acknowledge our sincere
indebtedness.

We express deep and sincere gratitude to our guide Prof. N.P.Gajjar for his constant
encouragement, valuable guidance and constructive suggestions during all stages of
the project work. We are deeply indebted to Prof. Vijay Savani for his valuable
suggestions and support. We are also thankful to all the faculty members of the
Department of Electronics & Communication Engineering and to our colleagues for
providing the necessary guidance throughout the term, which helped greatly in the
course of our project work. We are likewise thankful to the authors whose works
we have consulted and quoted in this work, and we gratefully acknowledge
Mr. Prasann Shukla for providing full laboratory support at the PG High Performance
Computing Lab.
Abstract
The aim of this project is to explore the potential performance improvements that
can be gained through the use of the CUDA architecture and GPU processing
techniques for image processing. There are various algorithms available for processing
an image, and various platforms on which image processing can be carried out. These
algorithms include image compression, noise removal, deblurring, edge detection, etc.
MATLAB is commonly used as a tool for processing images, but for this project we
use a Graphics Processing Unit (GPU) from NVIDIA with CUDA support. CUDA,
the Compute Unified Device Architecture, is used mainly for parallel processing and
high performance computing. With the processing capabilities of the CUDA
architecture (software) and the GPU (hardware), great improvements can be seen in
image processing techniques.
In this project, certain image processing algorithms have been implemented on the
GPU using the CUDA architecture. This includes DFT and DCT computation, as
well as IDFT and IDCT computation. The DCT is applied to every pixel of the
image, quantization can be implemented with various levels, and the image is then
regenerated by applying the IDCT. The execution time has been measured, the
algorithm has been applied to various images, and the results are compared. The
CPU version of the algorithm has also been compared with the project results. The
observations and comparisons show that device-based image processing yields better
results with less processing time, which can be very useful in processing very high
resolution images.
Motivation
There are various platforms available for image processing; MATLAB is a commonly
used tool. These platforms have their own advantages and disadvantages, and new
techniques have been invented to overcome the limitations of some programming
languages. NVIDIA CUDA is such a technique: an architecture based on parallel
processing. Combined with the computational power of the GPU, the parallel
processing capabilities of CUDA make NVIDIA graphics processing devices very
suitable for complex image processing algorithms. These GPUs accelerate processing
and possess many benefits over conventional programming tools: they can process a
high-resolution image in a fraction of a second and provide solutions to many complex
graphics-related problems.

The execution time taken by general processing tools is much larger for high-resolution
images than for low-resolution ones, so for complex image processing algorithms these
tools are inadequate. GPUs provide an alternative by applying the algorithm to each
pixel and processing the pixels in parallel. Hence, GPUs are very useful for image
processing.
Outline
Chapter 2 explains the fundamentals of GPUs and CUDA, and the need for the GPU
and the CUDA parallel processing architecture in high performance computing.
Chapter 3 gives examples of general purpose computing with the GPU and data
processing codes.
Chapter 4 covers the implementation of the DFT and the DCT on the GPU using
the CUFFT library.
Chapter 5 explains the image processing workflow and algorithm, with project results.
Chapter 6 concludes the work and discusses its future scope.
Chapter 1
Project Development Environment

To develop and run the CUDA programs in this project, the following components
are required:
I . A CUDA-enabled graphics processor
II . An NVIDIA device driver
III . The CUDA development toolkit
IV . A standard C compiler
For this project, we have used an NVIDIA GeForce GTX 480 GPU, and Microsoft
Visual Studio 2010 has been used for code development. The relevant machine
specifications are:

Multiprocessor Count : 15
RAM : 8 GB
Chapter 2
Introduction to GPU and CUDA

The term GPU was popularized by NVIDIA Corporation in 1999, when NVIDIA
released the GeForce 256 and marketed it as the world's first GPU. Graphics
processing units are manufactured mainly by NVIDIA and ATI; ATI called its part
a VPU and released the Radeon 9700 in 2002 [13]. Modern GPUs are very efficient
at manipulating computer graphics. Their highly parallel structure makes them more
effective than general-purpose CPUs when processing high-resolution images. GPUs
are available as discrete graphics cards in many computers as well as embedded on
motherboards.
Modern GPUs have the architecture and hardware to perform the calculations
required for 3D computer graphics. They were initially used to accelerate the
memory-intensive work of texture mapping and rendering polygons; units were later
added to accelerate geometric calculations, such as transformations between
coordinate systems [13]. Most of these calculations require matrix and vector
operations, and GPUs together with the CUDA architecture provide excellent support
for them through rich predefined functions and libraries. GPUs are used for image
processing, video decoding and stream processing.
A graphics processing unit is shown in figure 2.1 [10]; the GPU shown is an NVIDIA
GeForce GTX 280. Conventional CPUs have a single core or a small number of cores,
far fewer than GPUs. CPUs are used for general computing, while GPUs are used for
high performance computing, which requires parallel processing and therefore many
cores. Figure 2.2 shows the difference between the number of cores in CPUs and
GPUs [9].
The architecture of the GPU is shown in figure 2.3 [9]. It differs from the CPU
architecture: the GPU devotes more transistors to data processing. The GPU has a
scalable array of streaming multiprocessors and multiple memory spaces. On-chip
memory includes shared memory and registers; off-chip memory includes global
memory. A PCI interface is used for interfacing with the CPU [10]. The difference
between the GPU and CPU architectures is shown in figure 2.4 [9], and figure 2.5
shows the chip design of a GPU [9].
As discussed earlier, GPUs are designed to accelerate graphics operations, and they
are extensively used for high performance computing. The GPU is thus hardware
specially designed for highly parallel applications like graphics. The processing
capability of the GPU can be seen from the graph shown in figure 2.6 [9]. Such
processing capability requires large memory bandwidth, so GPUs come with high
bandwidth; this comparison can also be seen in figure 2.6. Because GPUs use many
cores and a huge number of transistors, they need an adequate cooling mechanism.
GPUs can also be used for general purpose computing because of their massive
parallel processing capabilities. This is known as general purpose computing on GPU,
or GPGPU. Earlier, GPU computing was limited to graphics processing. After the
introduction of CUDA in NVIDIA GPUs, the programming capabilities were
enhanced, and GPUs are now also used for general computing: large data handling,
complex algorithms and matrix-related operations have all been simplified by GPUs.
The model for GPU computing is to use a CPU and GPU together in a heterogeneous
co-processing model. The sequential part of the application runs on the CPU and the
computationally intensive part is accelerated by the GPU, so the application runs
faster. GPGPU takes advantage of the parallel architecture of the GPU, which has
hundreds of processor cores that operate together. In this programming model, the
application developer modifies the application to extract the compute-intensive
kernels and map them to the GPU; the rest of the application remains on the CPU.
Mapping a function to the GPU involves rewriting the function to expose its
parallelism and adding C keywords to move data to and from the GPU. The developer
is tasked with launching tens of thousands of threads simultaneously [2]; the GPU
hardware manages the threads and performs thread scheduling.
The GPGPU computing model is shown in figure 2.7. It comprises a hardware part
and a software part: the NVIDIA GPU with CUDA architecture is the hardware,
while the CUDA development environment is the software, programmed in C, C++
or through other languages and APIs such as OpenCL, DirectX, OpenGL and
Python. Using these two parts, many applications can be designed.
GPGPU workloads include complex matrix operations, vector operations, floating
point scientific applications, and so on. For all these applications, the GPU provides
an easier way to perform the manipulations; with CPU programming, the same
operations require a huge execution time and become tedious and complex. With
GPU programming, a very long array or a very large matrix can easily be added to
or multiplied with another array or matrix. So, GPUs are not only useful for graphics
operations, but also for general purpose computing.
The difference between increasing the processor clock rate and increasing the
processor count can be seen in figure 2.8. From the figure, one can conclude that
parallel processing is more beneficial. The two important parameters for determining
performance are latency and throughput: latency is the time to complete a task, and
throughput is the number of tasks completed in a fixed time. GPUs are designed for
maximum throughput. The GPU core is the stream processor, and stream processors
are grouped into streaming multiprocessors (SMs). A stream processor is basically a
SIMD (Single Instruction Multiple Data) processor [9]. The SIMD architecture is
shown in figure 2.9, and SM operation in figure 2.10.
CUDA Architecture
A CUDA program calls parallel kernels. A kernel is a function that executes on the
GPU, in parallel, across a set of parallel threads. The programmer or compiler
organizes these threads into thread blocks and grids of thread blocks. The GPU
instantiates a kernel program on a grid of parallel thread blocks; each thread within
a thread block executes an instance of the kernel and has its own thread ID within
its block, program counter, registers, per-thread private memory, inputs, and output
results. By using this indexing of blocks and threads, one can keep track of each
instance of work.
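As a minimal sketch (the kernel and variable names here are illustrative, not taken
from the project code), the built-in index variables combine into a unique global
index for each thread:

__global__ void scale(float *data, int n)
{
    /* unique global index built from the block ID and thread ID */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                      /* guard threads beyond the data */
        data[idx] = 2.0f * data[idx];
}

A launch such as scale<<<(n + 255) / 256, 256>>>(dev_data, n) creates enough
256-thread blocks to cover all n elements.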
A thread block is a set of concurrently executing threads that can cooperate through
barrier synchronization and shared memory. A thread block has a block ID within
its grid, and each block has its own dimensions, that is, the number of threads within
a single block. CUDA provides built-in variables through which one can directly
access the index and dimensions of a block [1].
A grid is an array of thread blocks that execute the same kernel, read inputs from
global memory, write results to global memory, and synchronize between dependent
kernel calls. In the CUDA parallel programming model, each thread has a per-thread
private memory space used for register spills, function calls, and C automatic array
variables [4]. Each thread block has a per-block shared memory space used for
inter-thread communication, data sharing, and result sharing in parallel algorithms;
threads of the same block share this memory. Grids of thread blocks share results in
the global memory space after kernel-wide global synchronization [12]. The CUDA
hierarchy is shown in figure 2.12 [2].
Warp Scheduler
The SM schedules threads in groups of 32 parallel threads called warps. Dual warp
schedulers allow two warps to be issued and executed concurrently, keeping the
execution units busy and achieving near peak hardware performance. A schematic
view of the warp scheduler is shown in figure 2.13 [9].
Most instructions can be dual issued: two integer instructions, two floating point
instructions, or a mix of integer, floating point, load, store, and SFU instructions can
be issued concurrently. Double precision instructions do not support dual dispatch
with any other operation.
One of the key architectural innovations that improved both the programmability
and the performance of GPU applications is on-chip shared memory. Shared memory
enables threads within the same thread block to cooperate, facilitates extensive reuse
of on-chip data, and greatly reduces off-chip traffic, making it a key enabler for many
high-performance CUDA applications.
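As an illustrative sketch (not part of the project code), threads of one block can stage
data in shared memory, synchronize, and then reuse it without touching off-chip
memory:

__global__ void reverseBlock(float *d)
{
    __shared__ float s[256];         /* per-block on-chip buffer */
    int t = threadIdx.x;
    s[t] = d[t];                     /* stage data in shared memory */
    __syncthreads();                 /* wait for the whole block */
    d[t] = s[blockDim.x - 1 - t];    /* reuse data loaded by another thread */
}

Launched as reverseBlock<<<1, 256>>>(dev_d), each thread reads a value written by
a different thread of the same block, which would otherwise require a round trip
through global memory.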
Streaming Multiprocessor

GPU computing is applied in areas such as:
Medical Imaging
Image Processing
Matrix Manipulations
Environmental Science
Video Decoding
3D Graphics
Gaming
Chapter 3
Data Processing Codes and Examples
#include <stdio.h>

int main(void)
{
    printf("Hello, World\n");
    return 0;
}
#include <stdio.h>

__global__ void kernel(void)
{
}

int main(void)
{
    kernel<<<1, 1>>>();
    printf("Hello, World\n");
    return 0;
}
Code-1 runs on the host (CPU), while code-2 launches an empty kernel on the device
(GPU) [1], [15]. The output of the above code is shown in figure 3.2; there, "World"
was replaced by "Good Morning".
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

/* kernel: adds two integers on the device */
__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

int main(void)
{
    int c;
    int a;
    int b;
    int *dev_c;
    cudaMalloc((void**)&dev_c, sizeof(int));
    printf("1st Value: ");
    scanf("%d", &a);
    printf("2nd Value: ");
    scanf("%d", &b);
    add<<<1,1>>>(a, b, dev_c);
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("Result: ");
    printf("%d + %d = %d\n", a, b, c);
    cudaFree(dev_c);
    system("PAUSE");
    return 0;
}
The code above adds two integers on the GPU [1], [15]. Its output is shown in
figure 3.3.
Code-3 queries the GPU device properties. Using the cudaDeviceProp structure, we
can query the properties of the GPU and thereby learn the limitations and
programming capabilities of the device. The output of this code is shown in figure 3.4.
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

int main(void)
{
    cudaDeviceProp prop;
    int count;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++)
    {
        cudaGetDeviceProperties(&prop, i);
        printf("General Information for device %d\n", i);
        printf("Name: %s\n", prop.name);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Clock rate: %d\n", prop.clockRate);
        printf("Device copy overlap: ");
        if (prop.deviceOverlap)
            printf("Enabled\n");
        else
            printf("Disabled\n");
        printf("Kernel execution timeout: ");
        if (prop.kernelExecTimeoutEnabled)
            printf("Enabled\n");
        else
            printf("Disabled\n");
        printf("Memory Information for device %d\n", i);
        printf("Total global mem: %ld\n", prop.totalGlobalMem);
        printf("Total constant mem: %ld\n", prop.totalConstMem);
        printf("Max mem pitch: %ld\n", prop.memPitch);
        printf("Texture alignment: %ld\n", prop.textureAlignment);
        printf("MP Information for device %d\n", i);
        printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
        printf("Shared mem per block: %ld\n", prop.sharedMemPerBlock);
        printf("Registers per block: %d\n", prop.regsPerBlock);
        printf("Threads in warp: %d\n", prop.warpSize);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max thread dimensions: (%d, %d, %d)\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1],
               prop.maxThreadsDim[2]);
    }
    return 0;
}
Code-4 performs long-array addition: two arrays of length 1024*16 are added into a
third array. This addition takes very little time with the GPU implementation. The
output of the code is shown in figures 3.5, 3.6 and 3.7.
#include <stdio.h>
#include <conio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

#define size (1024*16)

/* kernel: one thread adds one pair of elements */
__global__ void add(int *c, const int *a, const int *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)
        c[i] = a[i] + b[i];
}

int main()
{
    int *a, *b, *c;
    a = (int*)malloc(sizeof(int)*size);
    if(!a)
    {
        printf("A allocation error");
        getch();
        return 1;
    }
    b = (int*)malloc(sizeof(int)*size);
    if(!b)
    {
        printf("B allocation error");
        getch();
        return 1;
    }
    c = (int*)malloc(sizeof(int)*size);
    if(!c)
    {
        printf("C allocation error");
        getch();
        return 1;
    }
    for(int i = 0; i < size; i++)   /* initialize inputs (values assumed) */
    {
        a[i] = i;
        b[i] = 2*i;
    }
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaMalloc((void**)&dev_c, size * sizeof(int));
    cudaMalloc((void**)&dev_a, size * sizeof(int));
    cudaMalloc((void**)&dev_b, size * sizeof(int));
    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    /* 16 blocks of 1024 threads cover all 1024*16 elements */
    add<<<size/1024, 1024>>>(dev_c, dev_a, dev_b);
    cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    free(a); free(b); free(c);
    return 0;
}
Chapter 4
DFT and DCT Implementation
The Discrete Fourier Transform (DFT), via the Fast Fourier Transform (FFT), can
be implemented on the GPU using the libraries available in CUDA. CUDA supports
the FFT and DFT through the CUFFT library; with its functions, we can easily
apply the DFT to an input data sequence, and to an image as well. Doing so requires
knowledge of certain function definitions and their input parameters. CUFFT
supports three types of transforms:
Complex-to-Complex (C2C)
Real-to-Complex (R2C)
Complex-to-Real (C2R)
The C2C transform is generally used to carry out the FFT and DFT. In a C2C
transform, both the input and output sequences are complex. In an R2C transform,
the input sequence is real but the output sequence is complex; likewise, C2R has a
complex input sequence and a real output sequence. The CUFFT library provides
1D, 2D and 3D transforms.
The most important functions related to DFT programming are as follows [3]:
cufftPlan1d()
cufftDestroy()
cufftExecC2C()
cufftExecR2C()
cufftExecC2R()
cufftPlan1d() : Creates a 1D FFT plan configuration for a specified signal size and
data type. The batch input parameter tells CUFFT how many 1D transforms to
configure. 2D and 3D transforms can be configured similarly.
definition: cufftPlan1d(cufftHandle *plan, int nx, cufftType type, int batch)
Here, plan is a pointer to a cufftHandle object, and the NX parameter gives the
transform size (e.g. 8 for an 8-point DFT). The type parameter selects the type of
transform (R2C, C2R or C2C), and BATCH indicates the number of transforms of
size NX. The output is the configured plan handle.
cufftDestroy() : Frees all GPU resources associated with a CUFFT plan and de-
stroys the internal plan data structure. This function should be called once a plan is
no longer needed, to avoid wasting GPU memory.
definition: cufftDestroy(cufftHandle plan)
The input parameter is the plan that is to be destroyed.
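Putting these functions together, a minimal plan/execute/destroy sequence looks like
the following sketch, where dev_data is an illustrative device buffer of NX*BATCH
cufftComplex values:

cufftHandle plan;
cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);               /* configure */
cufftExecC2C(plan, dev_data, dev_data, CUFFT_FORWARD);  /* in-place DFT */
cufftExecC2C(plan, dev_data, dev_data, CUFFT_INVERSE);  /* unscaled IDFT */
cufftDestroy(plan);                                     /* free resources */

Note that the CUFFT inverse transform is unscaled, so the result must be divided by
NX to recover the original sequence, as done in the code below. CUFFT also defines
the following data types, used throughout these codes: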
cufftHandle
cufftComplex
cufftReal
cufftDoubleReal
cufftDoubleComplex
The code given below illustrates a 4-point DFT on a real input sequence. It also
applies the inverse DFT to recover the input sequence. Its output is shown in
figure 4.1.
#include <stdio.h>
#include <conio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cufft.h>

#define NX 4
#define BATCH 1

int main(void)
{
    cufftHandle plan;
    int i;
    cufftComplex *device;
    cufftComplex host[NX*BATCH];
    cufftComplex temp[NX*BATCH];
    cudaMalloc((void **)&device, sizeof(cufftComplex)*NX*BATCH);
    printf("INPUT DATA:\n");
    host[0].x = 1;
    host[1].x = 2;
    host[2].x = 0;
    host[3].x = 1;
    host[0].y = 0;
    host[1].y = 0;
    host[2].y = 0;
    host[3].y = 0;
    for(i=0;i<4;i++)
    {
        printf("%f + i%f\n", host[i].x, host[i].y);
    }
    cudaMemcpy(device, host, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyHostToDevice);
    if(cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
    }
    /* forward transform (assumed: implied by the output in figure 4.1) */
    cufftExecC2C(plan, device, device, CUFFT_FORWARD);
    cudaMemcpy(host, device, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyDeviceToHost);
    printf("OUTPUT DATA:\n");
    for(i=0;i<4;i++)
    {
        printf("%f + i%f\n", host[i].x, host[i].y);
    }
    /* inverse transform (assumed: implied by the printed results) */
    cufftExecC2C(plan, device, device, CUFFT_INVERSE);
    cudaMemcpy(temp, device, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyDeviceToHost);
    printf("INVERSE FFT:\n");
    for(i=0;i<4;i++)
    {
        /* CUFFT's inverse is unscaled, so divide by NX */
        printf("%f + i%f\n", temp[i].x/NX, temp[i].y/NX);
    }
    cufftDestroy(plan);
    cudaFree(device);   /* host and temp are stack arrays; only the
                           device buffer needs freeing */
    system("PAUSE");
    return 0;
}
The time plot and kernel execution timing details are shown in figure 4.2. In the
DFT code we do not write any kernels; the CUFFT library itself contains the kernels
executed for the DFT and IDFT, using two important kernels behind the cufftPlan1d
and cufftExecC2C functions. According to the time plot, kernel time is 2.23% of the
total GPU time, with the time axis (Y-axis) in milliseconds. The vectorRadix2 kernel
takes the maximum time [14], but the overall computation time is quite small. When
the DFT is calculated for a large number of points, CUDA C programming provides
better performance.
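Kernel timings such as those in figure 4.2 can also be measured directly with CUDA
events; the following is a generic sketch, not the project's exact measurement code:

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
cufftExecC2C(plan, device, device, CUFFT_FORWARD);  /* work to be timed */
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                         /* wait for completion */
cudaEventElapsedTime(&ms, start, stop);             /* elapsed milliseconds */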
The code given below illustrates the 2D DFT. It contains the core part only.

#define NX 8
#define NY 8

int main(void)
{
    cufftHandle plan;
    cufftComplex *device;
    cufftComplex host[NX][NY];
    cufftComplex temp[NX][NY];
    cudaMalloc((void **)&device, sizeof(cufftComplex)*NX*NY);
    cudaMemcpy(device, host, sizeof(cufftComplex)*NX*NY, cudaMemcpyHostToDevice);
    /* 2D plan and forward transform (assumed: implied core steps) */
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);
    cufftExecC2C(plan, device, device, CUFFT_FORWARD);
    cudaMemcpy(host, device, sizeof(cufftComplex)*NX*NY, cudaMemcpyDeviceToHost);
    if(cufftExecC2C(plan, device, device, CUFFT_INVERSE) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: ExecC2C inverse failed\n");
    }
    cudaMemcpy(temp, device, sizeof(cufftComplex)*NX*NY, cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
}
The output of this code is shown in figure 4.3. The input matrix is of order 3x3, with
values [1 2 1; 1 3 1; 1 4 1].
The discrete cosine transform (DCT) and discrete sine transform (DST) are members
of a family of sinusoidal unitary transforms. They are real, orthogonal, and separable,
with fast algorithms for their computation, and they have great relevance to data
compression. The DCT has two particular advantages:

It is a real transform with better computational efficiency than the DFT, which by
definition is a complex transform.

It does not introduce discontinuity while imposing periodicity on the time signal. In
the DFT, as the time signal is truncated and assumed periodic, a discontinuity is
introduced in the time domain and corresponding artifacts appear in the frequency
domain. Because even symmetry is assumed while truncating the time signal, no
discontinuity and related artifacts are introduced in the DCT [7].
The DCT has four different definitions; DCT-II is generally used for data
compression. The importance of DCT-II is further accentuated by its near-optimal
energy compaction for highly correlated signals [5].
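For reference, the unnormalized DCT-II convention used in this project's code, and
its inverse, are:

C[k] = 2 \sum_{n=0}^{N-1} x[n] \cos\left(\frac{\pi k (2n+1)}{2N}\right), \qquad 0 \le k \le N-1

x[n] = \frac{1}{N}\left(\frac{C[0]}{2} + \sum_{k=1}^{N-1} C[k] \cos\left(\frac{\pi k (2n+1)}{2N}\right)\right)

These match the factor of 2 and the C[0]/2 term in the CPU code given later in this
chapter.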
So, the DCT is preferred over the DFT for image processing algorithms. When the
DCT is applied to an image, it introduces no loss to the source image samples [7].
Image pixels have integer brightness values ranging from 0 to 255; these are real
numbers, and the DCT is suitable for real numbers. When the DCT is applied, the
first coefficient is the DC term, while all others are AC terms. The DC component
has the highest value, and the DCT coefficient values decrease from low-order to
high-order coefficients. The high frequency coefficients can be neglected since they
have smaller values. The application of the DCT to an image is shown in figure 4.4.
The DCT algorithm can also be implemented using the CUFFT library in CUDA.
The problem is that direct DCT computation functions are not available in the
CUFFT library, so the DCT is derived from a DFT computation. For an N-point
DCT, a 2N-point DFT is required; similarly, for an N-point IDCT, a 2N-point IDFT
is required.
The algorithm for DCT computation from the DFT is given below [6]:
1 . Form the 2N-point symmetric extension of the N-point input: y[n] = x[n] for
0 <= n <= N-1, and y[n] = x[2N-1-n] for N <= n <= 2N-1.
2 . Compute the 2N-point DFT Y[k] of y[n].
3 . Multiply by the twiddle factor: C[k] = W_2N^(k/2) * Y[k], where
W_2N = e^(-j2*pi/2N).
4 . For 0 <= k <= N-1, keep the real part. C[k] then denotes the N-point DCT of
the N-point input x[n].

The algorithm for IDCT computation from the IDFT is given below [6]:
1 . From the N-point DCT C[k], form the 2N-point sequence
Y[k] = W_2N^(-k/2) * C[k] for 0 <= k <= N-1, Y[N] = 0, and
Y[k] = -W_2N^(-k/2) * C[2N-k] for N+1 <= k <= 2N-1.
2 . Compute the 2N-point IDFT of Y[k].
3 . The real parts of the first N points give x[n], the N-point IDCT of C[k].

The code for DCT-IDCT computation on the CPU (host) is given below:
#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <time.h>
#include <stdlib.h>

#define PI 3.14159265
#define length 1024   /* number of input points (assumed; varied in the experiments) */

int main(void)
{
    static double data[length], xdct[length], idct[length];
    clock_t start, stop;
    for(int i=0;i<length;i++)
    {
        data[i]=i;
    }
    for(int i=0;i<length;i++)
    {
        printf("element in %d is: %f\n", i, data[i]);
    }
    start = clock();
    /* N-point DCT (unnormalized DCT-II) */
    for(int i=0;i<length;i++)
    {
        for(int j=0;j<length;j++)
        {
            xdct[i]=xdct[i]+2*data[j]*cos(PI*i*(2*j+1)/2/length);
        }
    }
    /* N-point IDCT (loop runs over the N output points) */
    for(int i=0;i<length;i++)
    {
        for(int j=1;j<length;j++)
        {
            idct[i]=idct[i]+xdct[j]*cos(PI*j*(2*i+1)/2/length)/length;
        }
        idct[i]=idct[i]+xdct[0]/2/length;
    }
    stop = clock();
    printf("Execution Time in milliseconds (DCT on CPU) : %f\n",(double)(stop-start));
    system("PAUSE");
    return 0;
}
The output of the above code is shown in figure 4.5. It gives an idea of the execution
time that the CPU takes to calculate the DCT for a large number of input points.
For a small number of points, the CPU takes very little time; but for a large number
of points, such as 10240, it takes about 11.5 seconds. The GPU version of this code,
on the other hand, takes only about 0.174 seconds, as shown in figure 4.6.
When this code runs on the GPU, it takes less execution time for a large number of
points. It contains kernels called for the DFT computation, the IDFT computation,
the twiddle-factor multiplication, the 2N-point conversion, and so on. The CPU and
GPU versions of the code are compared in tables 4.1 and 4.2.
Table 4.1: DCT on CPU

The code is comparatively shorter.
It executes entirely on the CPU; CUDA programming is not used.
Execution time is comparatively low for a small number of points (e.g. 100).
Execution time is comparatively high for a large number of points (e.g. 100,000).
It is not suitable for image processing algorithms.
The regenerated values are not very accurate.
Table 4.2: DCT on GPU

The code is comparatively longer.
It executes through different kernel calls and takes advantage of CUDA programming.
Execution time is comparatively high for a small number of points (e.g. 100).
Execution time is comparatively low for a large number of points (e.g. 100,000).
It is highly suitable for image processing algorithms.
The regenerated values are very accurate compared to the CPU-generated values,
due to the SFUs.
#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cufft.h>

#define NX 4
#define PI 3.14159265   /* pi (the original listing used the 22/7 approximation) */
#define BATCH 1
int main(void)
{
    int i;
    cufftComplex host1[NX*BATCH];
    cufftComplex host3[NX*BATCH];
    cufftComplex host2[NX*2*BATCH];
    cufftHandle plan;
    cufftComplex *device;
    cufftComplex host[NX*2*BATCH];
    cudaMalloc((void **)&device, sizeof(cufftComplex)*NX*2*BATCH);
    printf("N-point DCT computation using 2N-point DFT\n");
    printf("Enter the input data:\n");
    for(i=0;i<NX;i++)
    {
        printf("Enter value of host[%d].x: ", i);
        scanf("%f", &host1[i].x);
        host1[i].y = 0;
    }
    /* build the 2N-point symmetric extension */
    for(i=0;i<NX;i++)
    {
        host[i].x = host1[i].x;
        host[i].y = 0;
    }
    for(i=NX;i<2*NX;i++)
    {
        host[i] = host1[2*NX-1-i];
        host[i].y = 0;
    }
    cudaMemcpy(device, host, sizeof(cufftComplex)*NX*2*BATCH, cudaMemcpyHostToDevice);
    if(cufftPlan1d(&plan, 2*NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT ERROR: Plan creation failed\n");
    }
    /* 2N-point forward DFT (assumed: implied by the printed output) */
    cufftExecC2C(plan, device, device, CUFFT_FORWARD);
    cudaMemcpy(host, device, sizeof(cufftComplex)*NX*2*BATCH, cudaMemcpyDeviceToHost);
    printf("OUTPUT DATA:\n");
    for(i=0;i<2*NX;i++)
    {
        printf("%f + i%f\n", host[i].x, host[i].y);
    }
    /* multiply by the twiddle factor and keep the real part: the DCT C[k] */
    for(i=0;i<NX;i++)
    {
        host[i].x = host[i].x * cos(PI*i/2/NX) + host[i].y * sin(PI*i/2/NX);
        host[i].y = 0;
    }
    printf("OUTPUT DCT:\n");
    for(i=0;i<NX;i++)
    {
        printf("%f ", host[i].x);
    }
    /* build the 2N-point input for the IDCT */
    for(i=0;i<NX;i++)
    {
        host2[i].x = host[i].x * cos(PI*i/2/NX);
        host2[i].y = host[i].x * sin(PI*i/2/NX);
    }
    host2[NX].x = 0;
    host2[NX].y = 0;
    for(i=NX+1;i<2*NX;i++)
    {
        host2[i].x = -host[2*NX-i].x * cos(PI*i/2/NX);
        host2[i].y = -host[2*NX-i].x * sin(PI*i/2/NX);
    }
This code uses the CUFFT library and the DFT-to-DCT conversion algorithm; the
listing ends after the construction of the 2N-point IDCT input, and the remaining
steps (the 2N-point IDFT and scaling) follow the algorithm given above. The output
of this code is shown in figure 4.7; the input data is [1 2 3 4]. Still, this code is not
completely implemented on the GPU, because it uses for loops for the twiddle-factor
multiplication and the 2N-point conversion; kernels are called only for the DFT
computation. An all-kernel version is therefore used; only the kernel definitions are
shown here. The complete code for image processing is illustrated in chapter 5.
The convert2N kernel forms the 2N-point sequence from the N-point input sequence;
it is shown in figure 4.8. The twiddle-multiplication kernel and the todct kernel
(figure 4.9), which collects the DCT results row-wise, are sketched below.
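Only fragments of these two kernel listings survive, so the following sketches
reconstruct them from the host-side loops shown earlier; the grid and row index
arithmetic is an assumption based on the launch configurations used in chapter 5:

__global__ void twiddle(cufftComplex *data)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;  /* global element index */
    int k = p % (2 * NX);                           /* position within a 2N-point row */
    /* multiply by the twiddle factor and keep the real part */
    float re = data[p].x * cos(PI*k/2/NX) + data[p].y * sin(PI*k/2/NX);
    data[p].x = re;
    data[p].y = 0;
}

__global__ void todct(cufftComplex *dst, cufftComplex *src)
{
    int i = threadIdx.x;
    /* collect the first NX values of each 2N-point row */
    dst[blockIdx.x * NX + i] = src[blockIdx.x * 2 * NX + i];
}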
The transpose kernel is explained in figure 4.10. It is very useful for the 2D DCT
implementation: the 2D DCT is computed by applying the 1D DCT along one
dimension, transposing, and applying it again. Finally, the kernel for IDCT
computation is shown in figure 4.11; a reconstructed sketch (index arithmetic
assumed) is given below:
__global__ void idct2N(cufftComplex *dctdata, cufftComplex *idct2Ndata)
{
    int i = threadIdx.x;                               /* position within a row (assumed) */
    int p = blockIdx.x * NX, q = blockIdx.x * 2 * NX;  /* row offsets (assumed) */
    idct2Ndata[q+i].x = dctdata[p+i].x * cos(PI*i/2/NX);
    idct2Ndata[q+i].y = dctdata[p+i].x * sin(PI*i/2/NX);
    if(i == 0)
    {
        idct2Ndata[q+NX].x = idct2Ndata[q+NX].y = 0;   /* middle sample is zero */
    }
    else
    {
        idct2Ndata[q+NX+i].x = -dctdata[p+NX-i].x * cos(PI*(NX+i)/2/NX);
        idct2Ndata[q+NX+i].y = -dctdata[p+NX-i].x * sin(PI*(NX+i)/2/NX);
    }
}
These kernels are called for the DCT implementation on a signal as well as on an
image. The application of this algorithm to image processing is explained in the next
chapter.
Chapter 5
Image Processing Algorithm
Image processing is any form of signal processing for which the input is an image. It
generally refers to the processing of a 2D picture by a computer, and the output may
be either an image or a set of parameters related to the image. There are different
types of image processing: digital, analog, and optical. In all cases, image processing
means applying different algorithms to images.
There are different algorithms available to process the image. Some of them are listed
below:
Image Enhancement
Image Restoration
Image Compression
De-Blurring
De-Noising
Edge Detection
Image Smoothing
Convolution Operation
In this project, the algorithm has been applied to grayscale images; only the red
component of each image is taken. The same algorithm can be applied to color
images, in which the red, green and blue components are processed separately.
To read an image in CUDA C, it must be available in binary (.bin) format. The
MATLAB code given below converts an image into a .bin file containing the pixel
values; this file is then read in CUDA C using FILE operations.
c = imread('Koala.jpg');
r = c(:,:,1);      % red component only
t = r(:);          % flatten column-wise
p = t;
fid = fopen('koala-bin.bin','w');
fwrite(fid,p);
fclose(fid);
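On the CUDA C side, the pixel bytes can be read back into the cufftComplex host
array with standard FILE operations; this is a sketch with an assumed file name and
layout (one unsigned byte per pixel, as written by the MATLAB code above):

FILE *fp = fopen("koala-bin.bin", "rb");
unsigned char pix;
for (int i = 0; i < NX*BATCH; i++)
{
    fread(&pix, 1, 1, fp);        /* one pixel byte */
    host[i].x = (float)pix;       /* real part = pixel value */
    host[i].y = 0;                /* imaginary part = 0 */
}
fclose(fp);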
The image processing code is illustrated below. It omits the kernel definitions already
discussed in chapter 4; the kernels not discussed there are included.
#define NX 768          /* image width in pixels */
#define BATCH 1024      /* image height in pixels (assumed from the 1024x768 image) */

int main(void)
{
    cudaError_t cudaStatus;
    cudaDeviceReset();
    cufftComplex host[NX*BATCH], host2N[2*NX*BATCH], temp[NX*BATCH];
    cufftHandle plan2N;
    cufftHandle plan2N2;
    cufftComplex *device1N, *device1Ncpy, *device2N, *devicecpy2, *ansdct, *ansidct;
    cudaMalloc((void **)&device1N, sizeof(cufftComplex)*NX*BATCH);
    cudaMalloc((void **)&device1Ncpy, sizeof(cufftComplex)*NX*BATCH);
    /* the pixel values are read from the .bin file into host before this copy */
    cudaMemcpy(device1N, host, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&device2N, sizeof(cufftComplex)*NX*2*BATCH);
    cudaMalloc((void **)&devicecpy2, sizeof(cufftComplex)*NX*2*BATCH);
    cudaMalloc((void **)&ansidct, sizeof(cufftComplex)*NX*BATCH);
    /* row-wise DCT: extend each row to 2N points, 2N-point DFT (via plan2N,
       assumed executed into devicecpy2), twiddle, collect */
    convert2N<<<BATCH,NX>>>(device1N, device2N);
    cudaMalloc((void **)&ansdct, sizeof(cufftComplex)*NX*BATCH);
    twiddle<<<(2*BATCH),NX>>>(devicecpy2);
    todct<<<BATCH,NX>>>(ansdct, devicecpy2);
    /* transpose and repeat the 1D DCT along the other dimension */
    transpose<<<BATCH,NX>>>(ansdct, device1Ncpy);
    convert2N2<<<NX,BATCH>>>(device1Ncpy, device2N);
    twiddle<<<(2*NX),BATCH>>>(device2N);
    todct<<<NX,BATCH>>>(ansdct, device2N);
    /* inverse path: IDCT along each dimension with scaling and a transpose back */
    idct2N2<<<NX,BATCH>>>(ansdct, devicecpy2);
    todct<<<NX,BATCH>>>(ansidct, devicecpy2);
    devideBATCH<<<NX,BATCH>>>(ansidct);
    transpose2<<<NX,BATCH>>>(ansidct, ansdct);
    idct2N<<<BATCH,NX>>>(ansdct, device2N);
    todct<<<BATCH,NX>>>(ansidct, device2N);
    devideNX<<<BATCH,NX>>>(ansidct);
    stop = clock();                /* stop, start1, stop1: clock_t, declared above */
    printf("IDCT done\n");
    start1 = clock();
    cudaStatus = cudaMemcpy(host, ansidct, sizeof(cufftComplex)*NX*BATCH,
                            cudaMemcpyDeviceToHost);
    stop1 = clock();
    if (cudaStatus != cudaSuccess)
    {
        printf("cudaMemcpy failed");
        getch();
    }
    source = fopen("img1024x768binIDCT.bin", "wb");   /* source: FILE* */
    i = 0;
    while(i < buffer_size)                   /* buffer_size: NX*BATCH pixels */
    {
        buffer = (unsigned char)fabs(host[i].x);      /* buffer: one byte */
        fseek(source, i, 0);
        fwrite(&buffer, 1, 1, source);
        i++;
    }
    printf("writing is done.....\n");
    getch();
    fclose(source);
}
The output of the above code is shown in figure 5.2, and the original image data in
figure 5.3. The regenerated values are nearly the same as the originals; the execution
time output is shown in figure 5.4. Comparing the results, the regenerated pixel
values closely match the original pixel values, with slight differences in some of them.
These differences are the reason for the scattering of pixels seen in some
high-resolution images.
This algorithm has been applied to different images. The smallest resolution taken is
128x128 and the largest is 1024x768. On the GTX 480, the maximum number of
threads that can run in parallel per block is 1024, so the largest image that can be
processed is limited to 1024 pixels in the horizontal direction. The DCT is applied to
every pixel, and the NX and BATCH parameters are varied according to the image
dimensions; the process and flow remain the same for different images.
The regenerated image is reconstructed from the .bin output with the following
MATLAB code:

row = 768; clm = 1024;   % image dimensions (assumed from the file name)
fid = fopen('img1024x768binIDCT.bin','r');
p = fread(fid);
fclose(fid);
for i = 1:clm
    for j = 1:row
        q(j,i) = p(j+(i-1)*row);
    end
end
imwrite(uint8(q),'img1024x768-regenerated.jpg');
The results of the image processing code are shown in figures 5.5, 5.6, 5.7, 5.8, 5.9
and 5.10.
The execution time for the complete processing of different images, and the
cudaMemcpy time (the time required to copy the final values from device to host),
are compared in table 5.1.
Peak Signal-to-Noise Ratio (PSNR) is an engineering term for the ratio between the
maximum possible power of a signal and the power of the corrupting noise that
affects the fidelity of its representation. Because many signals have a very wide
dynamic range, PSNR is usually expressed on the logarithmic decibel scale. PSNR is
most commonly used to measure the quality of reconstruction of lossy compression
codecs (e.g., for image compression). The signal in this case is the original data, and
the noise is the error introduced by compression. When comparing compression
codecs, PSNR is an approximation to the human perception of reconstruction quality.
I1 = double(imread('img1024x768-original.jpg'));
I2 = double(imread('img1024x768-regenerated.jpg'));
P = 255;
MSE = mean((I1(:)-I2(:)).^2);    % element-wise squared error
PSNR = 10*log10(P*P/MSE);
Here, P is the maximum pixel value of the image, i.e. 255. The original and
regenerated images are compared and the PSNR is calculated in MATLAB. The
PSNR values for different images are given in table 5.2.
PSNR values above 30 dB are acceptable. The low PSNR values for the 800x600 and
1024x768 images are due to the black portions in the regenerated images, a limitation
of the GPU machine used; this can be addressed by using a more powerful GPU.
Chapter 6
Conclusion and Future Scope

6.1 Conclusion
As the problem of fast computation on massive data is very important nowadays,
parallel processing using the Graphics Processing Unit can provide a solution. Image
processing is a very wide field that provides solutions to many real-time problems,
but processing an image on the CPU takes too much time and can make a system
non-real-time.
With the purpose of making image processing near real time, an algorithm based on
the Discrete Cosine Transform has been successfully implemented on the GPU. The
results obtained from the project are very encouraging and markedly better than the
results of CPU-processed data; for large images, they show a very impressive
speed-up. The only problem noticed is a little distortion of the image, present only
in very large images, and even then the regenerated image retains most of its
information value. This project can be expanded further with the implementation of
quantization, which would enable data compression once a proper compression
technique is applied.
Apart from this, many other image processing and filtering algorithms can be
implemented on the GPU, which can eventually speed up processing further. The
only bottleneck of processing with the GPU is the time consumed in transferring
data between CPU memory and GPU memory; research is ongoing to overcome this
obstruction.
References

[5] K.R. Rao and P.C. Yip, The Transform and Data Compression Handbook, 2009.
[6] Image Processing and Computer Vision: Relationship between DCT and DFT,
EEL 6562 course notes.
[7] Anton Obukhov and Alexander Kharlamov, Discrete Cosine Transform for 8x8
Blocks with CUDA, NVIDIA Corporation, October 2008.
[8] Pranit Patel, Jeff Wong, Manisha Tatikonda and Jarek Marczewski, JPEG
Compression Algorithm Using CUDA, October 2009.
[10] Dana Schaa and Byunghyun Jang, Programming with CUDA and OpenCL,
Northeastern University, July 2010.
[12] en.wikipedia.org/wiki/CUDA
[13] en.wikipedia.org/wiki/Graphics_processing_unit
[14] stackoverflow.com