Parallel Programming With CUDA - Architecture, Analysis

Agilent Technologies Deutschland GmbH
Waldbronn

by
cand. inform. David Münch

Advisors:
Prof. Dr. W. Tichy
Dr. Victor Pankratius
3 Matrix Multiplication
3.1 Approaches
3.1.1 Sequential CPU implementation
3.1.2 OpenMP Optimised CPU implementation
3.1.3 GPU implementation
3.1.4 CUBLAS SGEMM Library Function
3.2 Environment for Performance Evaluations
3.2.1 Hardware
3.2.2 Software
3.2.3 Testing Process
3.3 Performance Evaluation
3.4 Summary
4 Discrete Convolution
4.1 Sequential Algorithm
4.2 Designing the Parallel Algorithm
4.3 Transform the Parallel Algorithm to the GPU - First
4.4 Transform the Parallel Algorithm to the GPU - Second
4.5 Performance Evaluation
4.6 Summary
5 Rolling Ball
5.1 Sequential Algorithm
5.2 Designing the Parallel Algorithm
5.3 Transform the Parallel Algorithm to the GPU
5.4 Performance Evaluation
5.5 Summary
6 Limitations of CUDA
6.1 Kernel Call Overhead
6.2 Memory Copying Overhead
6.3 Upper Bound of Performance
6.4 IEEE-754 Precision
6.5 CUDA depends on NVIDIA
6.6 Other Problems
6.7 Summary
7 Discussion
7.1 Comparison with Multi-core Processors
7.2 Consequences for Software Engineering
7.3 CUDA worth the effort
Bibliography
1. Introduction to NVIDIA's CUDA
Over the last few years Parallel Programming has turned into a major area of computer science. The theoretical foundations of Parallel Programming have been developed since the 1950s [Gil58][Wil94], but no affordable parallel hardware was available for the consumer market. Times changed in 2005 [inta], when Intel released its first mainstream multi-core CPU; this was the advent of mainstream Parallel Programming. Since Graphics Processing Units (GPUs) already are many-core processors, Nvidia introduced its Compute Unified Device Architecture (CUDA) for them in 2007. There are three reasons why Parallel Programming with CUDA is becoming more and more popular: the hardware is now available, it is comparatively cheap, and a great number of consumer computers already contain a CUDA-capable Nvidia GPU.
A modern GPU is no longer only a memory controller and display generator as it used to be in the 1990s. Instead, it is a highly parallel and multithreaded multiprocessor. Being both a programmable graphics processor and a scalable programming platform, a modern GPU breaks the mould concerning the variety of its capabilities. To take advantage of this, it was necessary to add some processor instructions and memory hardware to the GPU and to provide a more general API. With these modifications the data-parallel GPU can be used as a general-purpose, programmable many-core processor with its own benefits and limitations. The modern GPU is characterised by its large amount of floating-point processing power, which can also be used for non-graphical problems. This was the birth of the programming model CUDA, which bypasses the graphics API of the GPU and allows writing programs in C. Single-Program, Multiple-Data (SPMD) is the underlying abstraction used to achieve high parallelism on the thread level. In the SPMD style of parallel programming all threads execute the same code on different portions of the data, see [Ata99]. Coordination is done with a barrier synchronisation method. In summary, the three key abstractions of the CUDA programming model are:
• a hierarchy of thread groups,
• shared memories,
• barrier synchronisation.
The two components of the programming system are the host (= CPU) and at least one device (= GPU), which the host uses as a coprocessor:

host --- uses as a coprocessor ---> device
The host calls and controls functions that run massively parallel on the device. The host code has a few extensions of a programming language or API (see Figure 1.1) to specify the execution parameters for device functions, to control the device, and to handle memory and context management. Currently, the functions callable on the device are limited to those provided by the high- and low-level CUDA APIs. They comprise some mathematical, texture and memory functions as well as barrier synchronisation.

Figure 1.1: The CUDA architecture with the language and API front ends built on top of it, such as C, OpenCL, Fortran, C++ and DirectX 11 Compute.

Nvidia's marketing department is successfully advertising their CUDA-capable GPUs and promising an easy, quickly learnable programming model with speedups of 10 to 200 [nvic]. A closer inspection, however, reveals a rudimentary programming model with major limitations compared to standard programming, such as missing recursion and an incomplete implementation of the IEEE 754 standard, see Chapter 6.
The following chapters provide details on the GPU's architecture and the CUDA programming model, present our test configuration, investigate an existing matrix multiplication and develop a discrete convolution algorithm in order to become familiar with CUDA. Finally, a morphological filter for a real-life product is developed, the limitations of CUDA are evaluated and its programming model is discussed.
2. Hardware Structure & Programming Model
This chapter takes a closer look at the CUDA programming model and the underlying hardware structure. The first part introduces the hardware implementation of the CUDA programming model, the second presents a CUDA-capable GPU and some CUDA basics, and the third explains the thread and memory hierarchy.
A warp of 32 threads is executed on the eight Streaming Processors (SPs) of a Streaming Multiprocessor, so that each SP gets four threads for four clock cycles. Such an architecture is called a Single-Instruction, Multiple-Thread (SIMT) architecture. The physical limits of the GPU are 24 warps per Streaming Multiprocessor, 768 threads per Streaming Multiprocessor, 8 thread blocks (see Section 2.3) per Streaming Multiprocessor and 16 kB of shared memory per Streaming Multiprocessor. The programmer defines the execution parameters via the size of the thread block. For example, dimBlock(256,1,1) implies three active thread blocks per Streaming Multiprocessor, together consisting of 24 active warps of 32 threads each. CUDA maps the different thread blocks to the Streaming Multiprocessors. It is difficult to find the best execution parameters for a particular application. Most of the time it is better not to have 100% occupancy of each Streaming Multiprocessor, but more shared memory per thread block. These parameters have to be tested with every application; they cannot be predicted, at best estimated. The CUDA Toolkit provides the CUDA GPU Occupancy Calculator and the CUDA Visual Profiler to help tune these parameters.
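The arithmetic behind these limits can be illustrated with a small helper function (a sketch of our own, not part of the CUDA toolkit or of the thesis code; it uses the compute-capability 1.x limits quoted above and ignores the register and shared-memory constraints that the Occupancy Calculator also considers):

#include <stdio.h>

/* Per-SM limits of the GPU described above (compute capability 1.x). */
#define MAX_WARPS_PER_SM    24
#define MAX_THREADS_PER_SM  768
#define MAX_BLOCKS_PER_SM   8
#define WARP_SIZE           32

/* Rough occupancy estimate from the block size alone. */
static void estimate_occupancy(int threads_per_block)
{
    int warps_per_block   = (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
    int blocks_by_threads = MAX_THREADS_PER_SM / threads_per_block;
    int blocks_by_warps   = MAX_WARPS_PER_SM / warps_per_block;
    int blocks = blocks_by_threads < blocks_by_warps ? blocks_by_threads : blocks_by_warps;
    if (blocks > MAX_BLOCKS_PER_SM) blocks = MAX_BLOCKS_PER_SM;

    int active_warps = blocks * warps_per_block;
    printf("blockDim=%d: %d active blocks, %d active warps, occupancy %.0f%%\n",
           threads_per_block, blocks, active_warps,
           100.0 * active_warps / MAX_WARPS_PER_SM);
}

int main(void)
{
    estimate_occupancy(256);   /* 3 blocks, 24 warps, 100% - the example in the text */
    estimate_occupancy(384);   /* 2 blocks, 24 warps, 100% - used in Chapter 4 */
    return 0;
}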
Figure: Schematic comparison of CPU and GPU. The CPU devotes much of its chip area to control logic and cache, the GPU to a large number of ALUs; each has its own DRAM.
As seen in Chapter 1, the CUDA programming model consists of a host and at least one device working as the host's coprocessor and running C code. The programmer can choose between the CUDA Runtime API and the CUDA Driver API. The Driver API is closer to the hardware of the GPU and more difficult to use than the simpler Runtime API. On top of the CUDA APIs there are some libraries, such as the CUDA-adapted BLAS library CUBLAS [nvia]. The different layers of the application, the libraries and the APIs are shown in Figure 2.3.
An example of host code using the Runtime API can be seen in Listing 2.1. The truncated host function main calls in line 5 the device function pMul, also called the kernel function. Each thread computes exactly one element of the array C, with all threads running in parallel. The elements in line 13 are computed in parallel, one element per thread.
Figure 2.3: Layers of the CUDA programming system: the application running on the host uses the CUDA libraries and APIs, which in turn drive the device.
1 // Host function, calling the kernel function pMul
2 int main()
3 {
4   // Kernel invocation. |A| = |B| = |C| = N
5   pMul<<<1, N>>>(A, B, C);
6 }
In the following section the execution parameters and the thread hierarchy of the
device will be examined.
Figure: Block(1,1) of a grid of thread blocks, containing the threads Thread(0,0) to Thread(4,0).
Typically, the programmer derives the execution parameters from the problem size and not from the number of cores on the GPU.
Adding memory management to Listing 2.1 yields Listing 2.2. In the host function, lines 5-10 and 16 are added to allocate memory on the host and on the device and to copy data from the host to the device and back. In the device function, it is necessary to allocate sufficient shared memory and to copy the data from global to shared memory and back. Fortunately, this can be done in parallel, which is the reason why the barrier synchronisation in lines 29 and 37 is needed: it ensures that the copying has finished.
1 // Host function, calling the kernel function pMul
2 // assert |A| = |B| = |C| = N < (shared memory capacity / 3)
3 int main()
4 {
5   initialize host_A, host_B and host_C;
6   initialize A, B and C on the device;
7

15   // Copy the result C back to the host
16   cudaMemcpy(C_host, C, sizeof(C), cudaMemcpyDeviceToHost);
17 }
18

25   unsigned int i = threadIdx.x;
26

31   // do some arithmetic operations
32   // in practice many more operations!
33   s_C[i] = s_A[i] * s_B[i];
34
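To make the fragments above concrete, here is a minimal, self-contained sketch of such a host/kernel pair. The kernel name pMul and the shared-memory staging follow Listings 2.1 and 2.2; the problem size N = 256, the element values and the omission of error checking are our own simplifications.

#include <cuda_runtime.h>
#include <cstdio>

#define N 256   // problem size, assumed to fit into a single thread block

// Each thread stages one element of A and B into shared memory, multiplies
// them and writes its result back to global memory.
__global__ void pMul(const float* A, const float* B, float* C)
{
    __shared__ float s_A[N], s_B[N], s_C[N];
    unsigned int i = threadIdx.x;
    s_A[i] = A[i];
    s_B[i] = B[i];
    __syncthreads();               // barrier: all copies to shared memory are done
    s_C[i] = s_A[i] * s_B[i];      // the arithmetic part (cf. line 33 above)
    __syncthreads();               // barrier: all results are computed
    C[i] = s_C[i];                 // copy the result back to global memory
}

int main()
{
    float h_A[N], h_B[N], h_C[N];
    for (int i = 0; i < N; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    // allocate device memory and copy the input data to the device
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, N * sizeof(float));
    cudaMalloc((void**)&d_B, N * sizeof(float));
    cudaMalloc((void**)&d_C, N * sizeof(float));
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

    // kernel invocation: one block of N threads
    pMul<<<1, N>>>(d_A, d_B, d_C);

    // copy the result back to the host
    cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", h_C[0]);   // expected: 2.0

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}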
2.5 Summary
In this chapter a general overview of the differences between a CPU and a GPU has been given. Currently there are major differences, especially in the rudimentary memory management. Additionally, the basic structure of the CUDA programming model is simple but not very powerful. The whole programming model is very close to the hardware, which entails considerable programming effort.
3. Matrix Multiplication
In the following chapter, matrix multiplication is examined. First the different approaches used are described, second the evaluation environment, and finally the results are compared.
3.1 Approaches
Figure 3.1: Visualisation of Algorithm 3. Each thread block computes one block_size × block_size submatrix C_sub of C; each thread within the block computes one element of C_sub. See [nvib].
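In formulas, the tiling of Figure 3.1 computes each submatrix C_sub as a sum over the corresponding tiles of A and B (our notation, following the scheme in [nvib]):

C_sub = Σ_{k=1..width_A/block_size} A_sub,k · B_k,sub,

so every tile of A and B is loaded into shared memory once per thread block instead of once per thread.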
3.2.1 Hardware
All the following examinations of the algorithms are performed on a DELL Precision Workstation T5400, which is equipped with an Intel Xeon E5430 running at 2.66 GHz, 8 GB of Random Access Memory (RAM) and an Nvidia Quadro FX 3700 GPU. This kind of workstation is appropriate for the needs of the algorithms and is comparable to the machines available to the chemical engineers who will apply the examined algorithms.
3.2.2 Software
Microsoft Windows XP Professional 32Bit with Service Pack 3 is used as the operat-
ing system. As the x86 address size of 32bits cannot address the entire 8GB RAM,
it is necessary to enable the Physical Address Extension. The decision for using
the 32Bit operating system instead of 64Bit is due to of well-engineered software
libraries instead of experimental beta releases. It has been a requirement to use the
Microsoft Visual Studio 2008 Development Edition for the implementation of the
algorithms. The applied key components of CUDA 2.1 are:
The Profiler is only compatible with Windows XP. The CUDA resources are free to
download from the web[nvie].
The speedup is defined as

S = T1 / TP,

where T1 is the runtime on one core and TP the runtime on P cores. It is a wrong assumption that the speedup increases linearly with the number of cores. Instead, it is limited by the sequential part of every program, as Amdahl describes in [Rod85].
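For reference, Amdahl's bound can be stated explicitly (f denotes the parallelisable fraction of the program and P the number of processors; the notation is ours, not taken from [Rod85]):

S(P) ≤ 1 / ((1 − f) + f / P),   and for P → ∞ the speedup is limited by 1 / (1 − f).

Even with f = 0.95, for example, no number of cores can push the speedup beyond 20.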
3.4 Summary
The analysis of the different matrix multiplication approaches leads to the conclusion
that one algorithm cannot be preferred over another in general. Each one has its
field of application, depending on the problem size. Whereas small problems are
solved faster on the CPU, bigger problems are solved faster on the GPU.
4. Discrete Convolution
In Figure 4.1 the discrete convolution is visualised. Each output value is the sum of the signal data weighted with the filter. Thus, first |N| multiplications and then |N| − 1 additions are needed per output element.
Figure 4.1: Discrete convolution: the filter is slid over the signal data, and each output value is the sum of the overlapping products.
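Written out, with the indexing used in the sequential implementation in Appendix A.1 and the signal zero-padded at its borders, each output element is

P(p) = Σ_{k=0..|N|−1} M(p + k) · N(|N| − 1 − k),   p = 0, …, |M| + |N| − 2,

so every output element costs |N| multiplications and |N| − 1 additions.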
The following paragraph will show that it is not as simple as that with a parallel
algorithm on the GPU.
4.3 Transform the Parallel Algorithm to the GPU - First
The first attempt tries to parallelise the outer loop of Algorithm 5 (line 3 and the following lines) by distributing parts of it to different blocks of the GPU. Each block computes its elements stepwise. Additionally, the whole block also parallelises the reduction part in line 5; see Figure 4.2. In an example configuration with 256 threads per block and 3 active thread blocks per multiprocessor, 2 · 3 · 14 = 84 elements of the result are computed at the same time. But the speedup was poor: 1.5 times faster than Algorithm 4. Looking at CUDA's architecture, three major reasons were identified. First, considering the load: a reduction with n threads takes log2 n time. There are 2n − 1 active and n log2 n − n + 1 idle units of thread time. With n = 256, a load of only 28.5% is achieved, thus wasting computing time; as already shown above, a high computational component is needed to amortise the expensive memory transfers. Second, the profiler revealed slow memory transactions and divergent branches, which cause the threads of a warp to be serialised. Third, the fast but small 16 kB shared memory limits the execution time. Additionally, the Windows watchdog causes a runtime problem for large amounts of data, as it terminates the device function after it has run for about 5 seconds.
Figure 4.2: First attempt: a thread block multiplies the signal data with the filter (step 1) and then sums the partial products in successive reduction steps (steps 2 and 3).
• Unroll the last 32 threads of the reduction part, which correspond to one warp, because a warp is the smallest entity executing in parallel.
Afterwards a performance gain of 5.2 and an overall speedup of 7.8 were achieved. However, this result was not satisfying, so a second attempt to transform Algorithm 5 to the GPU was started.
In this second attempt the direction of parallelisation is changed: instead of parallelising the reduction for one or two output elements, many output elements are computed in parallel. Now, knowing some of the problems with CUDA, every design decision was made very carefully.
One aim of Algorithms 6 and 7 is to be scalable, another is to deal with the various problems of CUDA such as the Windows watchdog, and the last is not to overflow the shared memory, which would force the data into slow global memory. To achieve this, the whole convolution is not computed with one kernel call; instead, the filter of size |N| is split up into parts of 384 elements each. Thus, Algorithm 7 calls the device function, Algorithm 6, several times until the whole filter has been processed. The runtime of one kernel call then only depends on the input data M, resulting in a runtime of milliseconds. Our concrete implementation of the host and device function is shown in Appendix A.3 and A.4.
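The splitting can be summarised in a few lines (a condensed sketch of the loop in Appendix A.3; error checking, timing and the boundary handling via last_loop_offset are omitted):

// DIM_BLOCK = 384: the number of filter elements one kernel call can handle
for (int i = 0; i < N_length; i += DIM_BLOCK)
{
  // stage the current part of the filter in cached constant memory
  cutilSafeCall(cudaMemcpyToSymbol("c_d_N", &N[i], DIM_BLOCK * sizeof(float), 0, cudaMemcpyHostToDevice));
  // accumulate its contribution to all output elements
  convolutionKernel<<<grid, threads>>>(d_M_apron, i, d_P);
  cutilSafeCall(cudaThreadSynchronize());
}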
Thus, O(|N|) arithmetic operations have to be performed per thread. As the filter is the same for all thread blocks and does not change, it is stored in the fast, cached constant memory. This also saves shared memory, which is limited to 16 kB per multiprocessor and was a bottleneck in the first attempt; see Figure 4.3. In the example configuration of 384 threads per block, 384 elements are computed in parallel. A block size of 384 implies 8 kB of shared memory per thread block, two active thread blocks per SM and an occupancy of 100% of the SM.
Figure 4.3: Memory hierarchy of the convolution kernel: each block (Block 1, Block 2, ...) holds its signal data s_M and its output s_P in shared memory, the current part of the filter c_N resides in constant memory, and intermediate results live in global memory.
Figure 4.4: Device function of the discrete convolution: each of the n threads of a block (Thread 1 to Thread n) accumulates the weighted sums for one output element.
The device function of the discrete convolution is visualised in Figure 4.4. With each thread block of size n, n elements are computed in parallel; every thread works on one output element. The CUDA implementation of the device function can be seen in Algorithm 6. First, it is necessary to allocate shared memory for the input signal data and the result, see lines 1f. The filter remains in the fast cached constant memory. Second, in lines 6f. the data dedicated to each thread block is copied from global to shared memory; the subsequent barrier synchronisation ensures that the copying is complete. Third, in the computing part of the algorithm, lines 9-11 loop over every element of the currently considered part of the filter. Finally, the intermediate result is written back to global memory, ready for further use. The memory hierarchy described above can be seen in Figure 4.3.
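The structure just described can be condensed into a short sketch (the full implementation is Appendix A.4; the index arithmetic for the apron and the filter offset is simplified here, and BLOCK_SIZE stands for the 384 threads per block):

#define BLOCK_SIZE 384

// Current part of the filter, copied by the host into cached constant memory.
__constant__ float c_d_N[BLOCK_SIZE];

__global__ void convolutionStep(const float* d_M_apron, int filterOffset, float* d_P)
{
    // 1. Shared memory for the block's window of signal data and its partial result.
    __shared__ float s_M[2 * BLOCK_SIZE];
    __shared__ float s_P[BLOCK_SIZE];

    unsigned int tid  = threadIdx.x;
    unsigned int base = blockIdx.x * blockDim.x;

    // 2. Copy the data dedicated to this block from global to shared memory.
    s_M[tid]              = d_M_apron[base + filterOffset + tid];
    s_M[tid + blockDim.x] = d_M_apron[base + filterOffset + tid + blockDim.x];
    s_P[tid]              = d_P[base + tid];      // running partial sum from earlier calls
    __syncthreads();                              // the copy must be complete before computing

    // 3. Every thread accumulates the contribution of the current filter part
    //    to its single output element.
    for (int i = 0; i < BLOCK_SIZE; ++i)
        s_P[tid] += s_M[tid + i] * c_d_N[i];

    // 4. Write the intermediate result back to global memory for the next call.
    d_P[base + tid] = s_P[tid];
}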
To verify that the implementations A.3 and A.4 are close to optimal, the CUDA profiler was used; for the profiler output, see Table 4.1. 100% occupancy, enough shared memory and registers, no uncoalesced global stores and loads, no local stores and loads, no divergent branches and no serialised warps indicate a well-thought-out CUDA implementation.
4.6 Summary
In this chapter an algorithm for discrete convolution was transformed into a parallel algorithm, followed by the presentation and evaluation of two attempts to adapt it to the GPU. The result is a speedup of 80, but only for large data sizes.
Figure 4.10: Overhead time of the GPU, such as memory transfer and allocation
Recorded profiler counters: blockSizeX, blockSizeY, blockSizeZ, timestamp, occupancy, gridSizeX, gridSizeY, gputime, cputime, method, registerPerThread, memTransferSize, divergent branch, memTransferDir, warp serialize, cta launched, gld coherent, gst coherent, instructions, streamID, branch.

memcopy (host to device):  411056 0
memcopy (host to device):  407064 0
kernel call 1:             0 4636 6 0 2736 3648 175788 0 400499 0 38
kernel call 2:             0 4636 6 0 2736 3648 175788 0 400372 0 38
kernel call 3:             0 4636 6 0 2736 3648 175788 0 400827 0 38
memcopy (device to host):  403992 1

Table 4.1: Profiler output for N = 100.000, M = 999. The first two rows are the memory transfers from host to device and the last row the transfer from device to host. Rows three to five are the three kernel calls (⌈M / DIM_BLOCK⌉ = 3). The numbers are the counters of the profiler.
5. Rolling Ball
In this chapter Rolling Ball (RB, see [DN00]) will be examined, a parallel algorithm
will be developed from a sequential algorithm and then be adapted to the CUDA
programming model. Finally, the different algorithms will be evaluated.
The RB is "a method for processing [...] measuring values, such as chromatograms" used in chemical laboratories. As the measurements are "disturbed by an underlying drifting and noisy baseline", it is "difficult to localize the peaks in the chromatogram". RB is a first, preprocessing step that applies a morphological filter; the filter used here is a structuring element. The second step, which is not part of the rolling ball algorithm, is an analysis to detect "any peaks corresponding to peaks in said representation of measuring values".
RB is a binary morphological filter operation, called opening, which consists of an erosion followed by a dilation. For one-dimensional data M, the erosion with the structuring element L is defined as

(M ⊖ L)(x) = min_{j ∈ L} ( M(x + j) − L(j) ),   (x + j) ∈ M,

and the dilation as

(M ⊕ L)(x) = max_{j ∈ L} ( M(x − j) + L(j) ),   (x − j) ∈ M.

The opening

M ◦ L = (M ⊖ L) ⊕ L

is called the rolling ball algorithm. In Figure 5.1 the rolling ball algorithm is visualised.
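As a small worked example (the values are chosen here for illustration and do not come from the thesis data), take M = (3, 7, 5) and a flat structuring element L of length two with L(j) = 0:

M ⊖ L = (min{3, 7}, min{7, 5}, 5) = (3, 5, 5),   (M ⊖ L) ⊕ L = (3, max{3, 5}, max{5, 5}) = (3, 5, 5).

The opening removes the narrow peak at 7 while leaving the rest of the data untouched.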
By skilfully transforming the definition above instead of just coding it, Algorithm 4 was adapted, resulting in Algorithm 8, which is visualised in Figure 5.2.
Figure 5.1: Rolling ball on the signal data: step 1 (erosion) takes the minimum over the differences between the signal data and the filter, step 2 (dilation) the maximum over their sums.
Figure 5.2: Parallel rolling ball on the GPU: every thread (Thread 1 to Thread n) of a block performs the erosion for one element (step 1a), using erosion results from the other thread block where needed, followed by the dilation (step 2). As in Figure 4.3, each block keeps its signal data s_M and its output s_P in shared memory, the current part of the filter c_N in constant memory, and intermediate results in global memory.
the main part is possible, but this is not realistic, as the increasing overhead diminishes the speedup.
In Figure 5.12 the speedup of the GPU including the overhead, relative to Algorithm 9 on a quad-core CPU, is visualised. This figure describes the expected speedup in a real application. The instance has to be large enough, i.e. datasize · filtersize > 100.000.000, to reach a speedup of up to 50. A typical instance in practice has datasize ≈ 100.000 and filtersize ≈ 10.000. Bearing in mind Amdahl's law, a speedup of 50 with 114 cores is quite a success.
To verify that the implementations A.7 and A.8 are close to optimal, the CUDA profiler was used; for the profiler output, see Table 5.1. 100% occupancy, enough shared memory and registers, no uncoalesced global stores and loads, no local stores and loads, no divergent branches and no serialised warps indicate a well-thought-out CUDA implementation.
5.5 Summary
In this chapter the RB method has been introduced and a sequential algorithm has been developed by reducing the discrete convolution from Chapter 4 to RB. Similar to the discrete convolution, a parallel and a GPU version of RB were developed. The evaluation clearly shows that for large problems a realistic speedup of up to 50 can be reached on the GPU. Even when used from a C# application, the speedup is still measurable.
Figure 5.10: Overhead time of the GPU, such as memory transfer and allocation
Recorded profiler counters: registerPerThread, memTransferSize, gld coherent, gst coherent, instructions, blockSizeX, timestamp, gridSizeY, method, branch.

 6320.42  memcopy    407064
 8638.56  memcopy    403072
 9031.92  memcopy    407064
 9716.72  rbKernel   261 384 4640 7 3648 3648 175788 388215
13005.2   rbKernel   261 384 4640 7 3648 3648 175788 386697
16706     rbKernel   261 384 4640 7 3648 3648 175788 386963
19815.2   rbKernel2  261 384 20 2 912 3648 228 1093
20465.9   memcopy    400000
20613.3   rbKernel3  261 384 4636 7 2664 3552 175560 374566
23478.6   rbKernel3  261 384 4636 7 2664 3552 175560 369858
26338.3   rbKernel3  261 384 4636 7 2736 3648 175560 369299
29189.6   memcopy    400000

Table 5.1: Profiler output for N = 100.000, M = 999
6. Limitations of CUDA
In this chapter, the various limitations of the CUDA programming model are presented. First, the invocation time of kernel functions on the GPU is determined; second, the bandwidth of memory transactions is measured; third, the roofline model is introduced as a model of performance; and finally, the floating-point issues and other major problems are presented.
The first time a program using CUDA is executed, there is a minimum initialisation overhead of about 40-90 ms, as CUDA has to be initialised and the program has to be loaded from disk. The overhead increases if shared libraries have to be loaded as well.
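A minimal sketch of how the per-call overhead can be measured with CUDA events (an illustration of the idea, not the measurement harness used in this work):

#include <cuda_runtime.h>
#include <cstdio>

// An empty kernel: any time measured around its launch is pure call overhead.
__global__ void emptyKernel() {}

int main()
{
    const int runs = 1000;
    cudaFree(0);                       // force context creation (the one-time initialisation overhead)

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < runs; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);        // wait until all launches have completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average kernel call overhead: %f ms\n", ms / runs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}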
From approximately 1 MB of transferred data on, the upper bound of the bandwidth is reached. The upper bound of the host-to-device and device-to-host bandwidth of about 4 GB/s can be a bottleneck in an application.
The time of the memory transfers in Figure 6.4 is directly computed from Figure 6.3:

time [s] = transferred data [B] / (bandwidth [MB/s] · 1024²).

Despite irregularities for small sizes of data, the time increases linearly. According to Figure 6.4, it is even faster to copy 4 kB-1500 kB than to copy less. The Nvidia employee Tim Murray gives an answer to this surprising result, claiming that "it's almost certainly a BIOS issue." Others who ran the bandwidth test observed a strictly linear growth of the transfer time.
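As a rough illustration of this formula (a back-of-the-envelope figure, not a measurement from this work), transferring the 100.000-element float signal of Chapter 4, i.e. about 400.000 B, at the 4 GB/s upper bound takes

time = 400.000 B / (4096 MB/s · 1024²) ≈ 0.09 ms;

the measured curve in Figure 6.4 additionally contains the irregularities for small data sizes discussed above.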
• A low share of floating-point operations and divergent warps lower the upper bound,
Figure 6.3: Bandwidth of memory transfers from host to device, device to host and device to device. The vertical lines show an increasing increment of transferred data.
Figure 6.4: Time of memory transfers from host to device, device to host and device to device.
Both the hardware architecture and the program play the crucial part in the roofline model. An algorithm with a low arithmetic intensity will never reach the peak floating-point performance of the GPU, as it is limited by the memory bandwidth.
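Following [WP08], this bound can be summarised in one formula (the concrete peak values of the Quadro FX 3700 are not inserted here):

attainable performance [FLOP/s] = min( peak floating-point performance, peak memory bandwidth [B/s] · arithmetic intensity [FLOP/B] ).

An algorithm whose arithmetic intensity lies to the left of the ridge point is memory-bound; only beyond it can the peak floating-point performance be reached.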
• Signalling NaNs and some of the rounding modes are not supported,
• the precision of division and square root is below the standard's requirement.
Bearing in mind these limitations, it is almost impossible to get the same results on the GPU as on the CPU. In the worst case, these errors can lead to cancellation, for example when a problem is solved partly on the CPU and partly on the GPU. Computing with CUDA can still be useful as long as a program does not depend on high-precision numbers.
• There is no recursion and there are no function pointers in CUDA. Recursive algorithms therefore have to be redesigned, if that is possible at all.
• Only one kernel at a time can run on the device, so the device functions have to be strictly modular.
• It is not possible to write directly into the GPU's memory via DMA, therefore the memory transfer time increases.
• The host code is C++, while the device code is a subset of C.
• A mode switch of the screen can be critical and crash the GPU.
• Only Microsoft Windows XP, Microsoft Windows Vista, Mac OS X and some Linux operating systems are supported.
• A debugger is only available for Linux, which increases implementation time.
• In Microsoft Windows Vista the profiler does not work properly, as counters are not supported.
• In Microsoft Windows the Timeout Detection and Recovery mechanism, a watchdog, kills kernel calls on GPUs with a display attached after 2-5 s. CUDA claims the GPU for its computations and the watchdog treats this as a graphics driver crash. The expensive solution is to buy a second GPU and attach the display to it; the other, more challenging solution is to build scalable kernels that run for less than two seconds.
6.7 Summary
As seen above, the CUDA programming model has major limitations and architecture-related time overhead. To amortise the time overhead, a program should have a high arithmetic intensity, i.e. many more arithmetic operations than memory operations. The non-standard floating-point implementation forces the program either to waste time in software-simulated floating-point operations or to accept the inaccuracies and limitations. Restrictions such as the Microsoft Windows Timeout Detection and Recovery mechanism and the other, mostly software-based limitations listed above are annoying, as the programmer has to find a workaround, if one is possible at all.
7. Discussion
The discussion following the examinations in this work covers three fields. First, CUDA-capable GPUs are discussed in the context of their classification among multi-core processors. Second, the consequences of using parallel CUDA programming languages in software engineering are considered. Finally, we deal with the question of whether programming with the CUDA programming model is worth the effort.
will take much longer, since every algorithm has to be analysed with a view to data parallelism. Sometimes, however, the runtime cannot be improved at all.
What makes things worse is that there are only few libraries the programmer can use [nvid]. Unfortunately, they are difficult to use and require deep knowledge of the GPU's hardware. As seen above, the libraries are not applicable in all cases: when the amount of input data was too large, library calls failed.
Apart from this, the CUDA approach is completely new, which means that the programmer has to rethink and restructure his algorithms. This is a great effort, as software engineering primitives dating back to the sixties need to be overcome. Additionally, CUDA does not offer any object-oriented techniques, and the available wrappers for higher-level languages do not provide additional object orientation either; they only pass the CUDA commands through. Some wrappers are jCUDA for Java, CUDA.NET for the .NET platform and FORTRAN CUDA [gas]. The programmer has to deal with parallelism explicitly again.
If CUDA becomes object-oriented, software engineers will demand patterns for massively parallel designs usable with CUDA. Perhaps master-worker patterns will gain more performance than fine-grained in-code parallelisation. Additionally, there might be an asynchronous run pattern where an event is fired once the computation on the GPU is done. It could also be a good idea to organise the data exchange in a kind of parallel queue. Probably the users of CUDA will invent their own patterns, as CUDA has its very own programming paradigms.
The intelligence should move from the programmer to the system. Humans are prone to making the same errors again and again, but a system can learn permanently. As an example, a class should decide whether it is worth computing on the GPU, according to the current amount and type of input values. Even detecting the existence, capability and number of GPUs and delegating work to them should be done by the system itself. NVIDIA has not implemented anything in this direction yet.
During the process of programming, a developer wants to be able to debug his code. In the CUDA programming model he is faced with the problem that, for the time being, debugging is only possible within a Linux operating system environment. Without debugging, however, it is more difficult to find and identify errors. Additionally, race conditions and synchronisation errors can occur in parallel programs. Thus, the question is who will use CUDA, as debugging is only possible on Linux.
Many software developers are not able to program in parallel at all, as it was neither part of their education nor part of their job challenges. Thus, it is even more unlikely that they will use a close-to-hardware parallel programming tool like CUDA in the near future. Furthermore, automatic parallelisation of code is not realistic at all, as it would require taking the definition of a sequential algorithm and generating a parallel algorithm from it. One solution to this dilemma would be to add parallel programming lectures and tutorials to the schedules of both young and experienced software developers. They should learn how to deal with each level of parallelisation.
Recent software engineering research claims that parallel programming cannot be delegated to compilers and libraries alone, which means that new programming tools are needed in the near future. They comprise new programming languages, parallel design patterns, better detection of concurrency and synchronisation errors, and new methods of testing.
8. Conclusion & Future Work
In this work, the CUDA programming model has been investigated and the three sample algorithms matrix multiplication, discrete convolution and rolling ball have been implemented. The results are consistent: a speedup of more than 100 is possible, but only for large instances. The CUBLAS library is not easy to use, as the programmer has to allocate memory manually, and it is not completely optimised, since small problems run slowly. If the application to be transformed to the GPU is memory-intensive, only a low speedup can be expected because of the memory latency. An advanced algorithm with complex memory management is a challenge for every experienced programmer even on the CPU; thus, a big speedup is not really realistic.
A kernel with a high arithmetic intensity and few memory transactions is therefore the best candidate for impressive speedups. The problem to solve has to be large enough to amortise the GPU's overhead, and accurate knowledge of the GPU's hardware architecture is a must to gain runtime benefits.
In all cases, better tools are necessary to specify the runtime structure of the kernels for best performance. Research on automated optimisations for the GPU architecture is needed. A higher-level API is required to simplify programming with CUDA; it should include high-level data structures that manage concurrency, communication and synchronisation. The libraries for CUDA, such as CUBLAS, should be analysed, their weak points identified and their performance improved. The bandwidth of one GPU may be sufficient, but we have to think about big clusters of GPUs, where bandwidth will probably turn out to be a bottleneck. Finally, double precision and a standard-conforming implementation of IEEE-754 floating-point numbers should be a short-term goal, so that the GPU can be used as a reliable numerical co-processor.
A. Appendix - Source Code
 1 // ///////////////////////////////////////////////////////////
 2 // computes simple 1D discrete convolution on the CPU
 3 //
 4 // M2        signal data input array, type: float
 5 // N         filter data input array, type: float
 6 // M_length  length of array M2
 7 // N_length  length of array N
 8 //
 9 // P         output result array, type: float
10 // P is of size M_length+N_length-1
11 // ///////////////////////////////////////////////////////////
12 float* simple_convolution(float* M2, float* N, int M_length, int N_length)
13 {
14   // output array
15   float* P = (float*) malloc((M_length+N_length-1)*sizeof(float));
16   // initialise output array
17   init_array_with_zero(P, M_length+N_length-1);
18   float sum=0;
19   for (int p=0; p<=M_length+N_length-2; p++)
20   {
21     sum=0;
22     for (int k=0; k<N_length; k++)
23     {
24       sum+=M2[p+k]*N[N_length-k-1];
25     }
26     P[p]=sum;
27   }
28   return P;
29 }
Listing A.1: sequential discrete convolution algorithm
 1 // ///////////////////////////////////////////////////////////
 2 // computes simple 1D discrete convolution on the CPU,
 3 // using the OpenMP library for parallel execution
 4 //
 5 // M2        signal data input array, type: float
 6 // N         filter data input array, type: float
 7 // M_length  length of array M2
 8 // N_length  length of array N
 9 //
10 // P         output result array, type: float
11 // P is of size M_length+N_length-1
12 // ///////////////////////////////////////////////////////////
13 float* simple_convolution_omp(float* M2, float* N, int M_length, int N_length)
14 {
15   // output array
16   float* P = (float*) malloc((M_length+N_length-1)*sizeof(float));
17   // initialize output array
18   init_array_with_zero(P, M_length+N_length-1);
19   // set the number of threads
20   omp_set_num_threads(4);
21   float sum=0;
22   int k;
23   int p;
24   #pragma omp parallel for private(k, p) reduction(+:sum)
25   for (p=0; p<=M_length+N_length-2; p++)
26   {
27     sum=0;
28     for (k=0; k<N_length; k++)
29     {
30       sum+=M2[p+k]*N[N_length-k-1];
31     }
32     P[p]=sum;
33   }
34   return P;
35 }
Listing A.2: OpenMP-parallelised discrete convolution algorithm
 1 // ///////////////////////////////////////////////////////////
 2 // host program to manage the kernel calls, which compute
 3 // 1D discrete convolution on GPU with CUDA
 4 //
 5 // M          signal data input array, type: float
 6 // M_length   length of array M
 7 // N          filter data input array, type: float
 8 // N_length   length of array N
 9 // P          output result array, type: float
10 // P is of size M_length+N_length-1
11 // timer_pure time in ms for the kernel call
12 // ///////////////////////////////////////////////////////////

26   // allocate device memory
27   float* d_M_apron;
28   cutilSafeCall(cudaMalloc((void**) &d_M_apron, (M_apron_length+last_loop_offset)*sizeof(float)));
29
30   // copy host memory to device
31   cutilSafeCall(cudaMemcpy(d_M_apron, M_apron, (M_apron_length+last_loop_offset)*sizeof(float), cudaMemcpyHostToDevice));
32
33   // allocate device memory for result
34   float* d_P;
35   cutilSafeCall(cudaMalloc((void**) &d_P, (M_length+N_length-1+last_loop_offset)*sizeof(float)));
36
37   // copy host memory to device
38   cutilSafeCall(cudaMemcpy(d_P, P, (M_length+N_length-1+last_loop_offset)*sizeof(float), cudaMemcpyHostToDevice));
39
40   // compute execution parameters
41   unsigned int num_blocks = ((M_length+N_length-1)/num_threads)+1;
42   // grid configuration
43   dim3 grid(num_blocks, 1, 1);
44   // block configuration
45   dim3 threads(num_threads, 1, 1);
46
47   // start the timer for the pure kernel execution time
48   cutilCheckError(cutStartTimer(*timer_pure));
49
50   // execute the kernel stepwise, as it is divided into parts
51   for (int i=0; i<N_length; i+=num_threads)
52   {
53     // copy the currently needed part of the filter to the fast cached constant memory, as shared memory is limited and needed for other data

59   // stop the timer for the pure kernel execution time
60   cutilCheckError(cutStopTimer(*timer_pure));
61
62   // copy result from device to host
63   cutilSafeCall(cudaMemcpy(P, d_P, sizeof(float)*(M_length+N_length-1), cudaMemcpyDeviceToHost));
64
 1 // ///////////////////////////////////////////////////////////
 2 // kernel, which computes 1D discrete convolution on GPU
 3 // with CUDA. Each kernel can handle max. 384 filter elements.
 4 //
 5 // d_M_apron  global signal data input array, type: float
 6 // fo         is the current offset of the filter being used
 7 // d_P        output data array in global memory
 8 // c_d_N      part of the filter available in constant memory
 9 // ///////////////////////////////////////////////////////////
10 __global__ void convolutionKernel(float* d_M_apron, int fo, float* d_P)
11 {
12   // Initialize memory
13   // for signal data in shared memory
14   __shared__ float s_d_M_apron[384*2];
15   // for result in shared memory
16   __shared__ float s_d_P[384];
17   // current thread identifier
18   unsigned int tid=threadIdx.x;
19   // initialise with zero
20   s_d_M_apron[tid]=0;
21   s_d_M_apron[tid+blockDim.x]=0;
22   s_d_P[tid]=0;
23

32   // loop in parallel over every computed output value
33   for (int i=0; i<blockDim.x; i++)
34   {
35     /* cutilBankChecker(s_d_P, tid) = s_d_P[tid]+(s_d_M_apron[tid+i]*c_d_N[i]); */
36     s_d_P[tid]=s_d_P[tid]+(s_d_M_apron[tid+i]*c_d_N[i]);
37     __syncthreads();
38   }
39
 1 // ///////////////////////////////////////////////////////////
 2 // computes rolling ball algorithm on the CPU
 3 // rolling ball consists of two steps:
 4 // 1. Erosion
 5 // 2. Dilation
 6 //
 7 // M2        signal data input array, type: float
 8 // N         filter data input array, type: float
 9 // M_length  length of array M2
10 // N_length  length of array N
11 //
12 // R         output result array, type: float
13 // ///////////////////////////////////////////////////////////
14 float* simple_rolling_ball(float* M2, float* N, int M_length, int N_length)
15 {
16   // intermediate result array
17   float* P = (float*) malloc((M_length+N_length-1)*sizeof(float));
18   // output array
19   float* R = (float*) malloc((M_length+N_length-1)*sizeof(float));
20   // initialise intermediate result array with infinity
21   init_array_with_inf(P, M_length+N_length-1);
22   // temporary variables
23   float sum=0, temp=0;
24   int p=0, k=0;
25   // minus infinity
26   float infi=log((float)0);
27
28   // Erosion

41   // intermediate step to copy the erosion's result into a second array
42   for (p=0; p<M_length+N_length-1; p++)
43     R[p]=P[p];
44
45   // Dilation
46   for (p=0; p<=M_length+N_length-2; p++)
47   {
48
 1 // ///////////////////////////////////////////////////////////
 2 // computes rolling ball algorithm on the CPU,
 3 // using the OpenMP library for parallel execution.
 4 // rolling ball consists of two steps:
 5 // 1. Erosion
 6 // 2. Dilation
 7 //
 8 // M2        signal data input array, type: float
 9 // N         filter data input array, type: float
10 // M_length  length of array M2
11 // N_length  length of array N
12 //
13 // R         output result array, type: float
14 // ///////////////////////////////////////////////////////////

29   // Erosion
30   #pragma omp parallel for private(k, sum)
31   for (p=0; p<=M_length+N_length-2; p++)
32   {
33     sum=infi;
34     float temp;
35     for (k=0; k<N_length; k++)
36     {
37       // optimisation of: sum=max(sum, N[N_length-k-1]-M2[p+k]);
38       temp=N[N_length-k-1]-M2[p+k];
39       if (temp>sum) sum=temp;
40     }
41     P[p]=-sum;
42   }
43
44   // intermediate step to copy the erosion's result into a second array
45   // (more expensive with an OpenMP parallel for)
46   for (p=0; p<M_length+N_length-1; p++)
47     R[p]=P[p];
48
49   // Dilation
50   #pragma omp parallel for private(k, p)
51   for (p=0; p<=M_length+N_length-2; p++)
52   {
53     float temp;
54     for (k=0; k<N_length; k++)
55     {
56       // optimisation of: R[p+k]=max(R[p+k], N[N_length-k-1]+P[p]);
57       temp=N[N_length-k-1]+P[p];
58       if (temp>R[p+k]) R[p+k]=temp;
59     }
60   }
61
 1 // ///////////////////////////////////////////////////////////
 2 // host program to manage the kernel calls, which compute
 3 // the rolling ball algorithm on the GPU.
 4 // rolling ball consists of two steps:
 5 // 1. Erosion
 6 // 2. Dilation
 7 //
 8 // M          signal data input array, type: float
 9 // N          filter data input array, type: float
10 // M_length   length of array M
11 // N_length   length of array N
12 // P          output result array, type: float
13 // length of array P is of course M_length+offset
14 // timer_pure time in ms for the kernel call
15 // ///////////////////////////////////////////////////////////
16 __host__ void runConvolutionGPU(float* M, int M_length, float* N, int N_length, float* P, unsigned int* timer_pure)
17 {
18   // to consider boundary conditions and avoid if-branches use a new array M_apron, see below
19   int M_apron_length=M_length+(N_length-1);
20   float* M_apron = (float*) malloc((M_apron_length+last_loop_offset)*sizeof(float));
21

29   // allocate device memory
30   float* d_M_apron;
31   cutilSafeCall(cudaMalloc((void**) &d_M_apron, (M_apron_length+last_loop_offset)*sizeof(float)));
32
33   // copy host memory to device
34   cutilSafeCall(cudaMemcpy(d_M_apron, M_apron, (M_apron_length+last_loop_offset)*sizeof(float), cudaMemcpyHostToDevice));
35

39
40   // allocate device memory for result
41   float* d_P;
42   float* d_R;
43   cutilSafeCall(cudaMalloc((void**) &d_P, (M_length+last_loop_offset)*sizeof(float)));
44   cutilSafeCall(cudaMalloc((void**) &d_R, (M_length+N_length-1+last_loop_offset)*sizeof(float)));
45
46   // copy host memory to device
47   cutilSafeCall(cudaMemcpy(d_P, P, (M_length+last_loop_offset)*sizeof(float), cudaMemcpyHostToDevice));
48   cutilSafeCall(cudaMemcpy(d_R, R, (M_length+N_length-1+last_loop_offset)*sizeof(float), cudaMemcpyHostToDevice));
49
50   // compute execution parameters
51   unsigned int num_blocks = (M_length/num_threads)+1;
52   // grid configuration
53   dim3 grid(num_blocks, 1, 1);
54   // block configuration
55   dim3 threads(num_threads, 1, 1);
56
57   // start the timer for the pure kernel execution time
58   cutilCheckError(cutStartTimer(*timer_pure));
59
60   // execute the kernel stepwise, as it is divided into parts
61   for (int i=0; i<N_length; i+=num_threads)
62   {
63     // copy the currently needed part of the filter to the fast cached constant memory, as shared memory is limited and needed for other data
64     cutilSafeCall(cudaMemcpyToSymbol("c_d_N", &N[i], num_threads*sizeof(float), 0, cudaMemcpyHostToDevice));
65     rbKernel<<<grid, threads>>>(d_M_apron, i, d_P, d_R);
66     cutilSafeCall(cudaThreadSynchronize());
67   }
68   cutilSafeCall(cudaThreadSynchronize());
69
70   // execute a helper kernel to invert the intermediate data
71   rbKernel2<<<grid, threads>>>(d_P);
72   cutilSafeCall(cudaThreadSynchronize());
73   // copy d_P into the middle of d_R. d_R is a helper array
74   cutilSafeCall(cudaMemcpy(&d_R[(N_length-1)/2], d_P, M_length*sizeof(float), cudaMemcpyDeviceToDevice));
75   cutilSafeCall(cudaThreadSynchronize());
76
77   // execute the kernel stepwise, as it is divided into parts
78   for (int i=0; i<N_length; i+=num_threads)
79   {
80     // copy the currently needed part of the filter to the fast cached constant memory, as shared memory is limited and needed for other data
81     cutilSafeCall(cudaMemcpyToSymbol("c_d_N", &N[i], 2*num_threads*sizeof(float), 0, cudaMemcpyHostToDevice));
82     rbKernel3<<<grid, threads>>>(d_R, i, d_P);
83     cutilSafeCall(cudaThreadSynchronize());
84   }
85
86   // stop the timer for the pure kernel execution time
87   cutilCheckError(cutStopTimer(*timer_pure));
88
89   // copy result from device to host
90   cutilSafeCall(cudaMemcpy(P, d_P, sizeof(float)*(M_length), cudaMemcpyDeviceToHost));
91
 1 // ///////////////////////////////////////////////////////////
 2 // Erosion kernel, which computes the erosion of the rolling
 3 // ball algorithm on GPU with CUDA.
 4 // Each kernel can handle max. 384 filter elements.
 5 //
 6 // d_M_apron  global signal data input array, type: float
 7 // fo         is the current offset of the filter being used
 8 // d_P        output data array in global memory
 9 // c_d_N      part of the filter available in constant memory
10 // ///////////////////////////////////////////////////////////
11 __global__ void rbKernel(float* d_M_apron, int fo, float* d_P, float* d_R)
12 {
13   // Initialize memory
14   // for signal data in shared memory
15   __shared__ float s_d_M_apron[384*2];
16   // for result in shared memory
17   __shared__ float s_d_P[384];
18   // current thread identifier
19   unsigned int tid=threadIdx.x;
20

27   // loop in parallel over every computed output value

39 // ///////////////////////////////////////////////////////////
40 // Helper kernel. Inverts an array of type float.
41 // d_P is a pointer to the data of type float in device memory
42 // ///////////////////////////////////////////////////////////
43 __global__ void rbKernel2(float* d_P)
44 {
45   unsigned int id;
46   // current global thread identifier
47   id=blockIdx.x*blockDim.x+threadIdx.x;
48   d_P[id]=-d_P[id];
49   __syncthreads();
50 }
51
52 // ///////////////////////////////////////////////////////////
53 // Dilation kernel, which computes the dilation of the rolling
54 // ball algorithm on GPU with CUDA.
55 // Each kernel can handle max. 384 filter elements.
56 //
57 // d_R    global signal data array, type: float
58 // fo     is the current offset of the filter being used
59 // d_P    output data array of type float in global memory
60 // c_d_N  part of the filter available in constant memory
61 // ///////////////////////////////////////////////////////////
62 __global__ void rbKernel3(float* d_R, int fo, float* d_P)
63 {
64   // current thread identifier
65   unsigned int tid=threadIdx.x;
66   // for signal data in shared memory
67   __shared__ float s_d_P[384];
68   // for temp signal data in shared memory
69   __shared__ float s_d_R[384*2];
70

77   // loop in parallel over every computed output value
78   for (int i=0; i<384; i++)
79   {
80     s_d_P[tid]=max(s_d_P[tid], (c_d_N[i]+s_d_R[i+tid]));
81     __syncthreads();
82   }
83
B. Appendix - Additional Runtime Measurements
Performance Evaluation
In this appendix, the evaluation of Algorithms 8, 9 and 10 & 11 is presented. Different filter widths from 9 to 99.999 and signal data sets with 10 to 1.000.000 elements have been used.
In Figure B.1 the runtime of the sequential Algorithm 8 on the CPU is shown. It grows as O(n³) without irregularities.
Figure B.2 shows the runtime of the parallelised Algorithm 9 on the CPU with the
OpenMP library.
Figure B.3 shows the relation between the sequential and the parallel algorithm. For
small instances, it is counterproductive to use Algorithm 9 because of its overhead.
However, for instances with datasize · f iltersize > 100.000 the speedup is about
1.75.
Figure B.4 visualises the runtime of Algorithm 10 and 11 with all the memory
transfers from and to the device.
In contrast, Figure B.5 visualises the same without the memory overhead.
Figure B.6 explicitly shows the overhead, which never drops below approximately 20 ms.
In Figure B.7 the speedup of the GPU without the overhead, relative to the sequential single-threaded Algorithm 8, is visualised. Theoretically, a speedup of up to 35 in the main part is possible, but this is not realistic, as the increasing overhead diminishes the speedup.
In Figure B.8 the speedup of the GPU including the overhead, relative to Algorithm 9 on a dual-core CPU, is visualised. This figure describes the expected speedup in a real application. The instance has to be large enough, i.e. datasize · filtersize > 100.000.000, to reach a speedup of up to 20. A typical instance in practice has datasize ≈ 100.000 and filtersize ≈ 10.000. Bearing in mind Amdahl's law, a speedup of 20 with 32 cores is quite a success.
In Figure B.9 the speedup of the Nvidia FX 3700 over the FX 1700 for Algorithms A.7 and A.8 is visualised. As the FX 3700 runs at 1.24 GHz with 14 SMs and the FX 1700 at 0.92 GHz with 4 SMs, the expected speedup is 4.7; the measured speedup is 4.6. Thus the CUDA program scales linearly across both GPUs.
According to Section 6.2, the bandwidth test results are visualised in Figure B.10. From approximately 1 MB on, the upper bound of the bandwidth is reached. The upper bound of the host-to-device bandwidth of about 2.5 GB/s and of the device-to-host bandwidth of about 2.9 GB/s can be a bottleneck in an application. The upper bound of the device-internal bandwidth is about 9.5 GB/s. In Figure B.11 the time of copying the data according to Figure B.10 is visualised.
Figure B.6: Overhead time of the GPU, such as memory transfer and allocation
Figure B.9: Speedup of Nvidia FX 3700 vs. FX 1700 according to Algorithm A.7
and A.8
Figure B.10: Bandwidth of memory transfers from host to device, device to host and device to device. The vertical lines show an increasing increment of transferred data.
Figure B.11: Time of memory transfers from host to device, device to host and
device to device.
C. Appendix - Runtime Measurement Data
Figure C.6: Overhead time of the GPU, such as memory transfer and allocation
Figure C.13: Overhead time of the GPU, such as memory transfer and allocation
Bibliography
[BCI+08] S. Barrachina, M. Castillo, F. D. Igual, R. Mayo, and E. S. Quintana-Orti. Evaluation and tuning of the level 3 CUBLAS for graphics processors. In Parallel and Distributed Processing, 2008 (IPDPS 2008), IEEE International Symposium on, pages 1-8, April 2008.
[DN00] Helene Desmartis and Bernd Nawracala. A method for processing measuring values, January 2000.
[gas] http://www.gass-ltd.co.il/en/products/default.aspx
[Har07] Mark Harris. Optimizing parallel reduction in CUDA, page 38, 2007.
[iee] http://754r.ucbtest.org/standards/754.pdf
[inta] http://www.intel.com/pressroom/archive/releases/20050418comp.htm
[intb] http://www.intel.com/pressroom/archive/releases/20070204comp.htm
[nvia] http://developer.download.nvidia.com/compute/cuda/1_1/cublas_library_1.1.pdf
[nvic] http://forums.nvidia.com/index.php?showtopic=84440&view=findpost&p=478583
[ope] http://www.khronos.org/news/press/releases/the_khronos_group_releases_opencl_1.0_specification
[rap] http://www.rapidmind.net
[WP08] Samuel Williams and David Patterson. The roofline model: A pedagogical tool for program analysis and optimization, 2008.