
Automatically converting C/C++ to OpenCL/CUDA
Introduction by David Williams
Overview
} This presentation provides an introduction to autoparallelisation, focusing on our GPSME toolkit.
} We will cover:
◦ What autoparallelisation is and why we want it.
◦ How the autoparallelisation process is performed.
◦ An introduction to using our toolkit.
◦ Benchmarking the toolkit and performance considerations.
◦ A demonstration of using the toolkit and frontend.
} The toolkit is available.
Who are we?
} The GPSME project is a collaboration between
industry and academia.
◦ Multiple partners across Europe.
◦ All with different problems to solve.
} Our research project aims to make GPU
computing more accessible.
◦ Reduce need for expert knowledge.
◦ Eliminate need for specialised languages.
◦ Avoid rewriting existing code.
Using the GPU
[Diagram: a spectrum of approaches to using the GPU, running from flexibility/performance (Assembly, then OpenCL/CUDA) through Libraries and Autoparallelisation to ease of use/speed of development (Drag-and-drop).]
Why autoparallelisation?
} Automatically converting C/C++ to OpenCL/CUDA has a number of advantages:
◦ Single codebase – simplifies the process of targeting machines both with and without GPUs.
◦ Reuse existing code.
◦ Target a wide range of hardware.
◦ Achieve independence from specific backend technologies.
◦ Avoid lengthy boilerplate code.
How autoparallelisation works
} At its heart, the GPSME toolkit converts C/C++
code into OpenCL/CUDA by following compiler
#pragmas.
◦ Transfer required data to the GPU
◦ Copy the body of a loop into an OpenCL/CUDA program.
◦ Execute the program on each core simultaneously.
} This is built on a framework called ROSE, by
extending a tool called Mint.
◦ See www.rosecompiler.org for more information.
How autoparallelisation works
A simple example
} Keep in mind that the GPU has two key architectural
differences compared to the CPU:
◦ Multiple cores operating in parallel.
◦ Separate memory space.
A simple example
} The code below performs a simple low-pass filter (blur) from a source to a destination.

for (y = 1; y < imageHeight-1; y++)
{
    for (x = 1; x < imageWidth-1; x++)
    {
        float sum = 0.0f;
        for (offsetY = -1; offsetY <= 1; offsetY++)
        {
            for (offsetX = -1; offsetX <= 1; offsetX++)
            {
                int finalX = x + offsetX;
                int finalY = y + offsetY;
                sum += srcImage[finalY * imageWidth + finalX];
            }
        }
        dstImage[y * imageWidth + x] = sum / 9.0f;
    }
}
A simple example
} We can augment this with GPSME directives:

#pragma GPSME copy( srcImage, toDevice, imageWidth, imageHeight )
#pragma GPSME copy( dstImage, toDevice, imageWidth, imageHeight )
#pragma GPSME parallel
{
    #pragma GPSME for nest(2) tile( 16, 16 )
    for (y = 1; y < imageHeight-1; y++)
    {
        for (x = 1; x < imageWidth-1; x++)
        {
            float sum = 0.0f;
            for (offsetY = -1; offsetY <= 1; offsetY++)
            {
                for (offsetX = -1; offsetX <= 1; offsetX++)
                {
                    // Removed code for brevity
                }
            }
            dstImage[y * imageWidth + x] = sum / 9.0f;
        }
    }
}
#pragma GPSME copy( srcImage, fromDevice, imageWidth, imageHeight )
#pragma GPSME copy( dstImage, fromDevice, imageWidth, imageHeight )
A simple example
} The translator is a command line tool which runs under Linux:

gpsme inputFile.cpp [options]

} Generates output C++ and CUDA in a single file.
} Additional command line options can be provided:
◦ --shared
◦ --register
} For people who don't run a Linux system, the translator can be run via a web interface.
A simple example
} The resulting code can be quite large, but here are some core snippets:

cudaMemcpy3DParms param_1_dev_1_srcImage = {0};
param_1_dev_1_srcImage.srcPtr = make_cudaPitchedPtr(((void *)srcImage),(imageWidth) * sizeof(float ),(imageWidth),(imageHeight));
param_1_dev_1_srcImage.dstPtr = dev_1_srcImage;
param_1_dev_1_srcImage.extent = ext_dev_1_srcImage;
param_1_dev_1_srcImage.kind = cudaMemcpyHostToDevice;
stat_dev_1_srcImage = cudaMemcpy3D(&param_1_dev_1_srcImage);

if (_gidy >= 0 && _gidy <= imageHeight - 1) {{{
    if (_gidx >= 0 && _gidx <= imageWidth - 1) {{
        if ((((_gidx > 0) && (_gidx < (imageWidth - 1))) && (_gidy > 0)) && (_gidy < (imageHeight - 1))) {
            float sum = 0.0f;
            for (_p_offsetY = -1; _p_offsetY <= 1; _p_offsetY++) {
                _index1D = _gidx;
                for (_p_offsetX = -1; _p_offsetX <= 1; _p_offsetX++) {
                    int finalX = (_gidx + _p_offsetX);
                    int finalY = (_gidy + _p_offsetY);
                    sum += srcImage[(finalY * imageWidth) + finalX];
A simple example
} Within your project you can now replace the
original C/C++ file with the generated one.
} Also set up your project for OpenCL/CUDA
◦ Install software development kit
◦ Set up include/linker paths in your project
◦ Install runtime/drivers
– This must also be done on target machines.
} Watch out for naming conflicts if you keep the old
code as well.
A simple example
} Several GPSME directives are available (a combined usage sketch follows this list):
◦ #pragma GPSME parallel
– Marks the region to be parallelised.
◦ #pragma GPSME for
– A 'for' loop to be transferred to the GPU. Options are available to control the way this is split across threads.
◦ #pragma GPSME barrier
– Inserts a synchronisation point.
◦ #pragma GPSME single
– Marks a region to be executed serially.
◦ #pragma GPSME copy
– Performs a memory transfer.
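
As a combined illustration, here is a short sketch of our own (not toolkit output; the array name, the pass counter and the exact placement of barrier and single are assumptions – consult the toolkit documentation for their precise semantics):

#pragma GPSME copy( image, toDevice, imageWidth, imageHeight )
#pragma GPSME parallel
{
    #pragma GPSME for nest(2) tile( 16, 16 )
    for (y = 0; y < imageHeight; y++)
    {
        for (x = 0; x < imageWidth; x++)
        {
            image[y * imageWidth + x] *= 0.5f;   // darken the image
        }
    }

    #pragma GPSME barrier   // synchronisation point: all threads finish the loop first

    #pragma GPSME single
    {
        passCount++;         // executed serially, not once per thread
    }
}
#pragma GPSME copy( image, fromDevice, imageWidth, imageHeight )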
A real world example
.
.
int iter = 0;
int iX, iY, iZ;
CPU_FLOAT_TYPE* pTemp;

#pragma GPSME copy(pInputData, toDevice, width, height, depth)
#pragma GPSME copy(pOutputData, toDevice, width, height, depth)
#pragma GPSME copy(pFullMaskData, toDevice, width, height, depth)

#pragma GPSME parallel
{
    for (iter = 0; iter < 50; iter++)
    {
        #pragma GPSME for nest(all) tile(8,8,8)
        for (iZ = 0; iZ < depth; iZ++)
        {
            .
            .
            E = 1.0f + first[0] * first[0] / (first[2] * first[2]);
            F = first[0] * first[1] / (first[2] * first[2]);
            G = 1.0f + first[1] * first[1] / (first[2] * first[2]);
            L = (2.0f*first[0]*first[2]*second[0 * 3 + 2] - first[0]...
            M = (first[0]*first[2]*second[1 * 3 + 2] + first[1]*first[2]...
            N = (2.0f*first[1]*first[2]*second[1 * 3 + 2] - first[1]...
            .
            .
        }
    }
}

#pragma GPSME copy(pInputData, fromDevice, width, height, depth)
#pragma GPSME copy(pOutputData, fromDevice, width, height, depth)
#pragma GPSME copy(pFullMaskData, fromDevice, width, height, depth)
.
.
Practical concerns
} The GPSME toolkit can create huge speedups
◦ Depends on underlying code structure.
} The code should:
◦ Include (nested) for loops which can be moved to the GPU.
◦ Avoid interloop dependencies.
◦ Avoid function calls and recursion.
◦ Avoid conditional logic.
◦ Avoid system operations (allocations, disk access, etc)
◦ Avoid dependencies on external libraries.
} The performance increase from parallelism must
outweigh the cost of start up and memory transfers.
Interloop dependencies
} What if we want to apply multiple passes of our previous filter?

for (count = 0; count < 1000; count++)
{
    for (y = 1; y < imageHeight-1; y++)
    {
        for (x = 1; x < imageWidth-1; x++)
        {
            float sum = 0.0f;
            for (offsetY = -1; offsetY <= 1; offsetY++)
            {
                for (offsetX = -1; offsetX <= 1; offsetX++)
                {
                    int finalX = x + offsetX;
                    int finalY = y + offsetY;
                    sum += srcImage[finalY * imageWidth + finalX];
                }
            }
            dstImage[y * imageWidth + x] = sum / 9.0f;
        }
    }
    swap(srcImage, dstImage);
}
Interloop dependencies
} In general such interloop dependencies are problematic for all GPUification approaches, as they break parallelism.
◦ Techniques exist to reduce them, but they are limited.
} You should consider whether you can revise your code to remove the dependencies (one possible restructuring is sketched below).
} In some cases it would help to add synchronisation primitives to the toolkit. We're investigating this.
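
One possible restructuring is to keep the outer pass loop (and the swap) on the host and only parallelise the inner image loops. The sketch below is our own, reusing the layout from the previous slide; in this naive form both images are copied on every pass, so the transfer cost may well dominate:

for (count = 0; count < 1000; count++)
{
    #pragma GPSME copy( srcImage, toDevice, imageWidth, imageHeight )
    #pragma GPSME copy( dstImage, toDevice, imageWidth, imageHeight )
    #pragma GPSME parallel
    {
        #pragma GPSME for nest(2) tile( 16, 16 )
        for (y = 1; y < imageHeight-1; y++)
        {
            for (x = 1; x < imageWidth-1; x++)
            {
                // blur body as before
            }
        }
    }
    #pragma GPSME copy( dstImage, fromDevice, imageWidth, imageHeight )
    swap(srcImage, dstImage);   // the dependency is now handled on the host, between passes
}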
Function calls
} Proper function calls are not supported on all GPU hardware.
◦ Functions are usually inlined in the compiled code.
◦ The GPSME toolkit only supports functions which can be inlined.
◦ Recursion is not possible.
} Possible workarounds (the first is sketched after this list):
◦ Make sure the function can be inlined and contains code appropriate for the GPU.
◦ Bring the function call outside the loop if it doesn't really need to be executed every iteration.
◦ Split the loop into two loops – one following the other. Only parallelise one of them.
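
A hedged sketch of the first workaround (the function, array names and sizes below are our own; it assumes the toolkit inlines small, self-contained functions as described above):

// Small, branch-light function: a good candidate for inlining into the kernel.
inline float clampToUnit(float v)
{
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

#pragma GPSME for nest(2) tile( 16, 16 )
for (int y = 0; y < imageHeight; y++)
{
    for (int x = 0; x < imageWidth; x++)
    {
        dstImage[y * imageWidth + x] = clampToUnit(srcImage[y * imageWidth + x]);
    }
}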
Conditional logic
} GPUs have a Single Instruction Multiple Data
(SIMD) architecture.
} All threads follow the same execution path.
◦ Relevant when testing boundary conditions (e.g. at edge
of image)
} Conditional logic is possible but might not deliver
the expected benefits.
◦ This was relevant for the MedicSight code.
Conditional logic
#pragma GPSME for nest(2) tile(16,16)
for (int x = 0; x < 128; x++)
{
    for (int y = 0; y < 128; y++)
    {
        float val = someArray[x][y];
        if (val < 0.001f)
        {
            continue; // Optimisation
        }
        else
        {
            // Some expensive code here
        }
    }
}
Memory transfers
} GPUs typically have memory which is physically
separate from the main system memory.
◦ The #pragma GPSME copy directive performs transfers.
} Transfers must be performed immediately before
execution of the parallel region.
◦ The GPSME toolkit will enforce this.
Memory Transfers
} You should consider (a rough cost model is sketched below):
◦ Bandwidth: There is a limit to the rate at which data can be transferred to the GPU. This rate varies between cards (typically 10-200 Gb/sec).
◦ Latency: There is a small delay between requesting a memory transfer and it actually happening. Therefore one large transfer is faster than several small ones.
◦ Memory size: GPUs typically have between 128 MB and 2 GB of memory, and some is reserved for rendering processes.
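
A rough back-of-the-envelope model (our own, with purely illustrative latency and bandwidth figures) showing why one large transfer beats many small ones:

#include <cstdio>

// Hedged sketch: estimated transfer time = latency + bytes / bandwidth.
int main()
{
    const double latencySec   = 10e-6;              // ~10 microseconds per transfer (assumed)
    const double bandwidthBps = 5.0e9;              // ~5 GB/s across the bus (assumed)
    const double totalBytes   = 1024.0 * 1024 * 4;  // one 1024x1024 float image

    double oneLargeTransfer = latencySec + totalBytes / bandwidthBps;
    double manySmallOnes    = 1024 * (latencySec + (totalBytes / 1024) / bandwidthBps);

    printf("one transfer   : %.2f ms\n", oneLargeTransfer * 1000.0);  // ~0.85 ms
    printf("1024 transfers : %.2f ms\n", manySmallOnes * 1000.0);     // ~11 ms: latency dominates
    return 0;
}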
Use of External Libraries
} It is common (and generally good practice) to build applications on third-party libraries.
} Unfortunately this causes some problems for parallelisation toolkits:
◦ The toolkit must be able to see the source code of the libraries being used.
◦ Libraries must be available on Linux.
◦ Libraries cannot be used within parallel regions.
◦ The webserver adds some extra complications.
} How can we work around these issues?
Use of External Libraries
} This is a problem case:

#include <windows.h>
.
.
.
someWindowsFunction();
.
.
.
#pragma GPSME for nest(2) tile(16,16)
for (int x = 0; x < 128; x++)
{
    for (int y = 0; y < 128; y++)
    {
        // Some code here
    }
}
Use of External Libraries
} Solve it by splitting the file in two (a sketch of the glue code follows the snippet):

// In 'parallelisable.cpp' (for example)
#pragma GPSME for nest(2) tile(16,16)
for (int x = 0; x < 128; x++)
{
    for (int y = 0; y < 128; y++)
    {
        // some code here
    }
}

// In main.cpp
#include <windows.h>
#include "parallelisable.h"
.
.
someWindowsFunction();
.
.
// Now call the parallelised function in parallelisable.cpp
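
The slides stop at the call site; one way the missing glue might look (our sketch – the header contents, function name and signature are assumptions, not part of the original example):

// In 'parallelisable.h' (our sketch)
void processBlock(float someArray[128][128]);

// In 'parallelisable.cpp' (our sketch, wrapping the loop shown above in a function)
#include "parallelisable.h"

void processBlock(float someArray[128][128])
{
    #pragma GPSME for nest(2) tile(16,16)
    for (int x = 0; x < 128; x++)
    {
        for (int y = 0; y < 128; y++)
        {
            someArray[x][y] *= 2.0f;   // stands in for "some code here"
        }
    }
}

main.cpp then includes only "parallelisable.h" and calls processBlock(), so <windows.h> never reaches the translator.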
Use of External Libraries
} A more difficult scenario:

#pragma GPSME for nest(2) tile(16,16)
for (int x = 0; x < 128; x++)
{
    for (int y = 0; y < 128; y++)
    {
        .
        .
        // External function call
        cvSomeFunction();
        .
        .
    }
}
Use of External Libraries
} When working through the webserver:
◦ Make sure the required dependencies are installed.
◦ Upload all project-specific headers which are needed.

#include "OpenCV.h"
#include "VTK.h"
.
.
#include "MyHeader1.h" // Upload this one
#include "MyHeader2.h" // Upload this one
.
.
int main(int argc, char** argv)
{
    // Some code here
}
Now let’s see how this works on
some harder problems…
Polybench benchmark suite
} Collection of micro-benchmarks.
} Originally developed for the CPU.
} CUDA/OpenCL versions were developed recently.
} We implemented OpenMP, OpenACC and GPSME versions.
} Recently submitted a paper that presents the results.
Polybench benchmark suite
} Convolution:
2DCONV - 2D convolutional filter
3DCONV - 3D convolutional filter

} Linear Algebra:
2MM - 2 Matrix Multiplications (D=A*B; E=C*D)
3MM - 3 Matrix Multiplications (E=A*B; F=C*D; G=E*F)
ATAX - Matrix Transpose and Vector Multiplication
BICG - BiCG Sub Kernel of BiCGStab Linear Solver
GEMM - Matrix-multiply C=alpha.A.B+beta.C
GESUMMV - Scalar, Vector and Matrix Multiplication
GRAMSCHMIDT - Gram-Schmidt decomposition
MVT - Matrix Vector Product and Transpose
SYR2K - Symmetric rank-2k operations
SYRK - Symmetric rank-k operations

} Datamining:
CORRELATION - Correlation Computation
COVARIANCE - Covariance Computation

} Stencils:
FDTD-2D - 2-D Finite Difference Time Domain Kernel
Open standards
} OpenMP
◦ Open standard for directive-based multi-core programming
◦ Most compilers support it by now
◦ Easy to harness shared memory multi-core parallelism
} OpenACC
◦ Open standard for directive-based GPU computing
◦ Announced at SC11 [November 2011]
◦ Caps, Cray, and PGI are currently providing OpenACC compilers
◦ Version 2.0 is to be released soon…
Polybench initial results
} Most tests benefit from speed-ups compared to
the OpenMP version.
Example – GEMM OpenACC
#pragma acc data copyin(A[NI*NJ],B[NI*NJ]) copyout(C[NI*NJ])
{
    #pragma acc kernels loop independent vector(32)
    for (i = 0; i < NI; i++) {
        #pragma acc loop independent vector(32)
        for (j = 0; j < NJ; j++) {
            C[i*NJ + j] = 0.0;
            for (k = 0; k < NK; ++k) {
                C[i*NJ + j] += A[i*NK + k] * B[k*NJ + j];
            }
        }
    }
}
Example – GEMM GPSME
#pragma GPSME copy(A, toDevice, NI, NJ)
#pragma GPSME copy(B, toDevice, NI, NJ)
#pragma GPSME parallel
{
    #pragma GPSME for nest(2) tile(32,32)
    for (i = 0; i < NI; i++) {
        for (j = 0; j < NJ; j++) {
            C[i*NJ + j] = 0.0;
            for (k = 0; k < NK; ++k) {
                C[i*NJ + j] += A[i*NK + k] * B[k*NJ + j];
            }
        }
    }
}
#pragma GPSME copy(C, fromDevice, NI, NJ)
Example – GRAMSCHMIDT
#pragma GPSME copy(A, toDevice, N, M)
#pragma GPSME copy(R, toDevice, N, M)
#pragma GPSME copy(Q, toDevice, N, M)
#pragma GPSME parallel
{
    #pragma GPSME for nest(1) tile(128)
    for (k = 0; k < N; k++) {
        nrm = 0;
        for (i = 0; i < M; i++) {          // Reduction limits 2nd-level parallelisation
            nrm += A[i*N + k] * A[i*N + k];
        }
        R[k*N + k] = sqrt(nrm);
        for (i = 0; i < M; i++) {
            Q[i*N + k] = A[i*N + k] / R[k*N + k];
        }
        for (j = k + 1; j < N; j++) {
            R[k*N + j] = 0;
            for (i = 0; i < M; i++) {
                R[k*N + j] += Q[i*N + k] * A[i*N + j];
            }
            for (i = 0; i < M; i++) {
                A[i*N + j] = A[i*N + j] - Q[i*N + k] * R[k*N + j];
            }
        }
    }
}
#pragma GPSME copy(A, fromDevice, N, M)
Example – GRAMSCHMIDT
for (k = 0; k < N; k++) {
    nrm = 0;
    for (i = 0; i < M; i++) {
        nrm += A[i*N + k] * A[i*N + k];
    }
    R[k*N + k] = sqrt(nrm);
    for (i = 0; i < M; i++) {
        Q[i*N + k] = A[i*N + k] / R[k*N + k];
    }
}
#pragma GPSME copy(A, toDevice, N, M)
#pragma GPSME copy(R, toDevice, N, M)
#pragma GPSME copy(Q, toDevice, N, M)
#pragma GPSME parallel
{
    #pragma GPSME for nest(2) tile(16,16)
    for (k = 0; k < N; k++) {              // Triangular loop limits 2nd-level parallelisation
        for (j = k + 1; j < N; j++) {
            R[k*N + j] = 0;
            for (i = 0; i < M; i++) {
                R[k*N + j] += Q[i*N + k] * A[i*N + j];
            }
            for (i = 0; i < M; i++) {
                A[i*N + j] = A[i*N + j] - Q[i*N + k] * R[k*N + j];
            }
        }
    }
}
Triangular loop support
} Thread blocks can be (see the sketch below):
◦ Full: All threads are part of the iteration space. Resources are not wasted.
◦ Empty: No thread is part of the iteration space. Resources are not wasted.
◦ Half-full: This creates divergent branch behaviour. Some threads are to be executed, and some are not.
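
Roughly what the generated guard might look like (a CUDA-style sketch of our own, not actual toolkit output; it assumes the triangular space k < j < N from the GRAMSCHMIDT example):

// Each thread handles one (k, j) pair of the triangular iteration space.
__global__ void triangularKernel(float *R, int N)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int k = blockIdx.y * blockDim.y + threadIdx.y;

    // Guard: threads outside the triangle do nothing.
    //  - blocks entirely inside the triangle are full: every thread passes the test
    //  - blocks entirely outside are empty: they exit immediately
    //  - blocks crossing the diagonal are half-full: the test diverges within the block
    if (k < N && j > k && j < N)
    {
        R[k * N + j] = 0.0f;   // stands in for the real loop body
    }
}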
Polybench benchmark suite
} Triangular support increases performance by more than 30 times.
} Outperforms OpenACC by a good margin on these tests.
Future work – Multi-dimensional arrays
} Tests have been modified to access memory in a 2D manner: a[i][j], as opposed to a[i*M+j].
} GPSME finds extra optimization opportunities by exploiting the 2D access pattern.
} 25% performance increase when using explicit 2D arrays.
Arithmetic intensity
} Arithmetic intensity is defined as the ratio between
computation and memory load/store
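
As a worked example (our own count, based on the 3x3 blur shown earlier, considering only the inner-loop arithmetic and global memory traffic):

#include <cstdio>

int main()
{
    // Per output pixel of the 3x3 blur (our count):
    const double flops = 9 + 1;           // 9 additions + 1 division
    const double bytes = (9 + 1) * 4.0;   // 9 loads + 1 store of 4-byte floats
    printf("arithmetic intensity = %.2f flop/byte\n", flops / bytes);  // 0.25 -> memory bound
    return 0;
}

An intensity this low means the kernel is limited by memory bandwidth rather than by compute, which is one reason the memory-space bandwidth figures later in the talk matter so much.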
Float vs. double
} GPSME is equal to or better than OpenACC in all cases.

Conclusions on Polybench
} GPSME outperforms OpenACC in the majority of cases:
◦ Better register usage
◦ Cleaner output code
Memory Space Bandwidth

Memory space        Bandwidth
Register memory     ≈ 8,000 GB/s
Shared memory       ≈ 1,600 GB/s
Global memory       ≈ 177 GB/s
Mapped memory       ≈ 8 GB/s

Source: Rob Farber, "CUDA Application Design and Development"
Rotasoft Evaluation
} The ASIFT algorithm for feature extraction
◦ Keypoint matching
} Rotasoft have successfully evaluated the ASIFT
implementations
◦ On their own dataset
◦ On a dataset provided by the RTD performers
} Matching accuracy is almost the same as with the
CPU version
◦ Highly invariant to camera viewpoint change
} Main modification: Replaced Array of Structures with
Structure of Arrays
Array of Structures vs Structures of Arrays
} GPU global memory is accessed in chunks and aligned (the resulting access patterns are sketched after the code).

// Array of Structures
struct key_aos
{
    int angle;
    int scale;
    int descriptor[128];
};

key_aos *d_keys;

cudaMalloc((void**)&d_keys, ...);

// Structure of Arrays
struct key_soa
{
    int * angle;
    int * scale;
    int * descriptor[128];
};

key_soa d_keys;

cudaMalloc((void**)&d_keys.angle, ...);
cudaMalloc((void**)&d_keys.scale, ...);
cudaMalloc((void**)&d_keys.descriptor, ...);
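
Why this matters inside a kernel (our sketch; only the SoA layout lets neighbouring threads read neighbouring addresses):

// AoS: d_keys[i].angle  -> consecutive threads stride by sizeof(key_aos) bytes (poorly coalesced)
// SoA: d_keys.angle[i]  -> consecutive threads read consecutive ints (fully coalesced)
__global__ void readAngles(const int *angle, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        out[i] = angle[i];   // SoA access: thread i reads element i
    }
}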
Rotasoft Evaluation – Keypoint matching
} Tested on an 800x600 image:
◦ Computes matches between two sets of around 11,000 keypoints.

Rotasoft workstation: Core i3 @ 2.1GHz + GT520M; Groningen workstation: Core i7 @ 3.4GHz + GTX680

              Rotasoft workstation    Groningen workstation
              (time in seconds)       (time in seconds)
Original      69.5                    25.9
OpenMP        25.7                    6.7
Manual GPU    12.5                    1.9
Auto GPU      14.6                    3.2

• Speed-up of 6x for a lower grade system
• Speed-up of up to 13.6x for a high-performance system
Rotasoft Evaluation – Keypoint
matching
Rotasoft Evaluation
} We continue by evaluating parts of the ASIFT keypoint detection, starting with convolution.
◦ Convolution is about 45-50% of the detection stage.
Convolution - GPSME
#pragma GPSME copy (A, toDevice,N,M)
#pragma GPSME copy (B, toDevice,N,M)
#pragma GPSME copy (c, toDevice,3,3)
#pragma GPSME parallel
{
#pragma GPSME for nest(2) tile(32,16)
for (int i = 1; i < M - 1; ++i) {
for (int j = 1; j < N - 1; ++j) {
B[i][j] = c[0][0] * A[i - 1][j - 1] + c[0][1] * A[i + 0][j - 1] +
c[0][2] * A[i + 1][j - 1] + c[1][0] * A[i - 1][j + 0] +
c[1][1] * A[i + 0][j + 0] + c[1][2] * A[i + 1][j + 0] +
c[2][0] * A[i - 1][j + 1] + c[2][1] * A[i + 0][j + 1] +
c[2][2] * A[i + 1][j + 1];
}
}
}
#pragma GPSME copy (B, fromDevice,N,M)
Convolution performance
• Intel i7 @ 3.4GHz; NVidia GTX680

              Small data model*    Small data model*    Big data model**    Big data model**
              3x3 kernel [Hz]      5x5 kernel [Hz]      3x3 kernel [Hz]     5x5 kernel [Hz]
CPU – GCC     486                  64.5                 2.94                0.44
PGI OpenACC   4629                 2127                 26.17               12.33
GPSME         4901                 2785                 34.6                16.28

• Speed-up between 10x and 43x vs. CPU code
• Between 5% and 30% faster than PGI's OpenACC

* 1024x1024 image
** 12288x12288 image
OpenACC vs. GPSME
} OpenACC advantages:
◦ It’s an open standard implemented by compiler vendors.
◦ Flexibility
– Synchronisation, memory and device management, caching.
◦ Ease of use (integrated into Visual Studio)
} GPSME advantages:
◦ Simplicity
◦ Generates cleaner output code
– CUDA, as well as OpenCL code
◦ Doesn’t incur performance penalties for the above advantages
◦ Full access to source code makes it easily extendable
Conclusions
} The GPSME toolkit can deliver large performance gains for some classes of problems.
} Better than or equal to the PGI OpenACC compiler on Polybench.
} For real-world code, some revision is usually needed:
◦ Isolate the code you wish to parallelise.
◦ Try to eliminate library and loop dependencies.
◦ Consider memory transfers, especially inside loops.
◦ Use SoA instead of AoS.
