NVIDIA OpenCL JumpStart Guide
Technical Brief
Version 0.9
April 2009
Introduction
The purpose of this document is to help developers get started writing applications that will use
OpenCL even before OpenCL v1.0 conformant implementations are available.
This guide will help you to start developing GPU accelerated applications today, using C for CUDA
compute kernels and the CUDA Driver API in ways that will make it easy to transition to
OpenCL when you are ready.
Overview
OpenCL (Open Computing Language) is an open standard for parallel programming of
heterogeneous systems, managed by the Khronos Group. OpenCL supports a wide range of
applications, from embedded and consumer software to HPC solutions, through a low-level, high-
performance, portable abstraction. By creating an efficient, close-to-the-metal programming
interface, OpenCL will form the foundation layer of a parallel computing ecosystem of platform-
independent tools, middleware and applications.
CUDA is NVIDIA's technology for GPU Computing. With the CUDA architecture and tools,
developers are achieving dramatic speedups in fields such as medical imaging and natural resource
exploration, and creating breakthrough applications in areas such as image recognition and real-time
HD video playback and encoding.
Leveraging the massively parallel processing power of NVIDIA GPUs, OpenCL running on the
CUDA architecture extends NVIDIA’s world-renowned graphics processor technology into the
realm of parallel computing. Applications that run on the CUDA architecture can take advantage of
an installed base of over one hundred million CUDA-enabled GPUs in desktop and notebook
computers, professional workstations, and supercomputer clusters. NVIDIA GPUs enable this
unprecedented performance via standard APIs such as OpenCL and DirectX Compute, and high
level programming languages such as C/C++, Fortran, Java, Python, and .NET.
The NVIDIA CUDA Driver API allows programmers to develop applications for the CUDA
architecture and is the predecessor of OpenCL. As such, the CUDA Driver API is very similar to
OpenCL with a high correspondence between functions. Using the CUDA Driver API and the
guidelines explained in this document will allow a smooth transition to OpenCL in the future, while
letting you start learning GPU computing and parallel programming concepts today.
Getting Started
To get started, follow the steps in the CUDA Quickstart Guide for your operating system, and read
through the rest of this document. CUDA Quickstart Guides are available at:
http://www.nvidia.com/object/cuda_develop.html
Note: You must have a CUDA-enabled GPU in your system. All recent NVIDIA GPUs
have the necessary support, and a full list is available here:
http://www.nvidia.com/object/cuda_learn_products.html
Pointer Traversal
Multiple pointer traversals must be avoided in OpenCL; the behavior of such operations is
undefined by the specification. Pointer traversals are allowed in C for CUDA.
To achieve the same effect in OpenCL, pointers must be converted into offsets relative to the buffer
base pointer and may only refer to data within the buffer itself (no pointers between OpenCL
buffers are allowed).
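For illustration, where a C for CUDA structure might store a pointer to another element, the
OpenCL version stores an integer offset into the same buffer and indexes from the kernel
argument. The sketch below assumes a hypothetical Node structure and followChain kernel;
the names are illustrative only.
typedef struct
{
    float value;
    int nextIndex;   // offset into the same buffer, instead of a Node * pointer
} Node;
__kernel void followChain(__global Node * nodes, __global float * out)
{
    unsigned int i = get_global_id(0);
    // index relative to the buffer base pointer; no stored pointer is dereferenced
    int next = nodes[i].nextIndex;
    out[i] = nodes[i].value + nodes[next].value;
}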
Kernel Programs
Using C for CUDA, kernel programs are precompiled into a binary format and there are function
calls for dealing with module and function loading. In OpenCL, the compiler is built into the
runtime and can be invoked on the raw text or a binary can be built and saved for later load. There
are slight differences in keywords and syntax of the languages used for kernels.
The remainder of this guide uses a simple vector addition program, vectorAdd, as the running
example. The program adds two arrays of floats. The basic components of this program are identical
in C for CUDA and OpenCL:
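A minimal version of the vectorAdd kernel in each language might look like the following sketch
(the parameter names a, b, c and the index variable are illustrative):
// C for CUDA kernel
__global__ void vectorAdd(const float * a, const float * b, float * c)
{
    // each thread computes one element of the output vector
    int nIndex = blockIdx.x * blockDim.x + threadIdx.x;
    c[nIndex] = a[nIndex] + b[nIndex];
}
// OpenCL kernel
__kernel void vectorAdd(__global const float * a,
                        __global const float * b,
                        __global float * c)
{
    // get_global_id(0) returns this work-item's global index
    int nIndex = get_global_id(0);
    c[nIndex] = a[nIndex] + b[nIndex];
}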
Conceptually both languages are very similar. For this program, the differences are mostly in the
syntax. Let’s look at these differences in detail.
Kernel declaration specifier
CUDA kernel functions are declared using the “__global__” function modifier, while OpenCL
kernel functions are declared using “__kernel”.
Pointer declaration specifiers
With OpenCL, it is mandatory to specify the address space for any pointers passed as arguments to
kernel functions. This kernel has three parameters a, b, and c that are pointers to global device
memory. These arrays must be declared using the __global specifier in OpenCL.
Global thread index computation
In C for CUDA, all index and thread-block size information is available to kernels through four
built-in variables: threadIdx.{x|y|z}, blockIdx.{x|y|z}, blockDim.{x|y|z}, and
gridDim.{x|y|z}. The kernel developer is responsible for implementing the index
computations necessary for the kernel to operate on its data.
In contrast, OpenCL provides index information to kernels via built-in functions. For example,
get_global_id() returns a work-item's global index, computed from the work-group index,
the work-group size, and the local index within the work-group. OpenCL also provides
get_local_id() to query a work-item's index within its work-group, get_work_dim() to query
the number of dimensions of the index space the kernel was launched with, and
get_global_size() to query the total number of work-items in each dimension.
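Host Code
The host program selects a device, loads and builds the kernel, allocates and uploads memory,
launches the kernel, and reads the result back. The listing that follows sketches these steps with the
CUDA Driver API; the block and grid sizes below are illustrative and are the quantities referred to
later in the Kernel Launch section.
const size_t cnBlockSize = 512;
const size_t cnBlocks = 3;
const size_t cnDimension = cnBlocks * cnBlockSize;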
CUdevice hDevice;
CUcontext hContext;
CUmodule hModule;
CUfunction hFunction;
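// initialize the driver API, select the first device, and create a context on it
cuInit(0);
cuDeviceGet(&hDevice, 0);
cuCtxCreate(&hContext, 0, hDevice);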
cuModuleLoad(&hModule, "vectorAdd.cubin");
cuModuleGetFunction(&hFunction, hModule, "vectorAdd");
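// allocate and initialize host memory
// (randomInit is an assumed helper that fills an array with random values)
float * pA = new float[cnDimension];
float * pB = new float[cnDimension];
float * pC = new float[cnDimension];
randomInit(pA, cnDimension);
randomInit(pB, cnDimension);
// allocate device memory
CUdeviceptr pDeviceMemA, pDeviceMemB, pDeviceMemC;
cuMemAlloc(&pDeviceMemA, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemB, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemC, cnDimension * sizeof(float));
// copy the host input vectors to the device
cuMemcpyHtoD(pDeviceMemA, pA, cnDimension * sizeof(float));
cuMemcpyHtoD(pDeviceMemB, pB, cnDimension * sizeof(float));
// set the kernel parameters (offsets assume 4-byte device pointers)
// and the thread-block shape
cuParamSeti(hFunction, 0, pDeviceMemA);
cuParamSeti(hFunction, 4, pDeviceMemB);
cuParamSeti(hFunction, 8, pDeviceMemC);
cuParamSetSize(hFunction, 12);
cuFuncSetBlockShape(hFunction, cnBlockSize, 1, 1);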
// execute kernel
cuLaunchGrid(hFunction, cnBlocks, 1);
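// copy the result from device memory back to host memory
cuMemcpyDtoH(pC, pDeviceMemC, cnDimension * sizeof(float));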
delete[] pA;
delete[] pB;
delete[] pC;
cuMemFree(pDeviceMemA);
cuMemFree(pDeviceMemB);
cuMemFree(pDeviceMemC);
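The OpenCL host program performs the same steps. The sketch below reuses the host arrays and
sizes from the CUDA listing above and assumes sProgramSource points at the kernel source text;
each call is discussed in detail in the API Differences section.
// create an OpenCL context for all GPU devices in the system
cl_context hContext;
hContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, 0, 0, 0);
// query the devices in the context and create a command queue on the first one
size_t nContextDescriptorSize;
clGetContextInfo(hContext, CL_CONTEXT_DEVICES, 0, 0, &nContextDescriptorSize);
cl_device_id * aDevices = (cl_device_id *) malloc(nContextDescriptorSize);
clGetContextInfo(hContext, CL_CONTEXT_DEVICES,
                 nContextDescriptorSize, aDevices, 0);
cl_command_queue hCmdQueue;
hCmdQueue = clCreateCommandQueue(hContext, aDevices[0], 0, 0);
// create device buffers; A and B are initialized from the host arrays
cl_mem hDeviceMemA, hDeviceMemB, hDeviceMemC;
hDeviceMemA = clCreateBuffer(hContext,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             cnDimension * sizeof(cl_float), pA, 0);
hDeviceMemB = clCreateBuffer(hContext,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             cnDimension * sizeof(cl_float), pB, 0);
hDeviceMemC = clCreateBuffer(hContext,
                             CL_MEM_WRITE_ONLY,
                             cnDimension * sizeof(cl_float), 0, 0);
// compile the kernel source at runtime
cl_program hProgram;
hProgram = clCreateProgramWithSource(hContext, 1,
                                     sProgramSource, 0, 0);
clBuildProgram(hProgram, 0, 0, 0, 0, 0);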
// create kernel
cl_kernel hKernel;
hKernel = clCreateKernel(hProgram, "vectorAdd", 0);
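// set the kernel arguments: the three buffer objects
clSetKernelArg(hKernel, 0, sizeof(cl_mem), (void *)&hDeviceMemA);
clSetKernelArg(hKernel, 1, sizeof(cl_mem), (void *)&hDeviceMemB);
clSetKernelArg(hKernel, 2, sizeof(cl_mem), (void *)&hDeviceMemC);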
// execute kernel
clEnqueueNDRangeKernel(hCmdQueue, hKernel, 1, 0,
&cnDimension, 0, 0, 0, 0);
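// read the result back to host memory; CL_TRUE makes the call block until the copy is done
clEnqueueReadBuffer(hCmdQueue, hDeviceMemC, CL_TRUE, 0,
                    cnDimension * sizeof(cl_float),
                    pC, 0, 0, 0);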
delete[] pA;
delete[] pB;
delete[] pC;
clReleaseMemObject(hDeviceMemA);
clReleaseMemObject(hDeviceMemB);
clReleaseMemObject(hDeviceMemC);
API Differences
Both C for CUDA and OpenCL implementations perform the same steps conceptually. The main
differences are the naming schemes and how data gets passed to the API. Both OpenCL and the
CUDA Driver API require the developer to manage the contexts and parameter passing.
One noteworthy difference is that C for CUDA programs are compiled with an external tool (the
NVCC compiler) before executing on the final application. This compilation step is typically
performed when the actual application is built. Typically, the OpenCL compiler is invoked at
runtime and the programmer needs to create or obtain the strings with the kernel programs. It is
also possible to compile OpenCL source offline, in a similar fashion to C for CUDA.
The following sections cover the API differences per program section.
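Initialization
Using the CUDA Driver API:
The driver API must be initialized explicitly; a device is then selected and a context is created on it: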
cuInit(0);
cuDeviceGet(&hDevice, 0);
cuCtxCreate(&hContext, 0, hDevice);
Using OpenCL:
OpenCL does not require global initialization of the library. One can directly proceed to context
creation. OpenCL allows creation of a context directly for a certain type of compute device. In this
example we choose GPUs.
cl_context hContext;
hContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
0, 0, 0);
After the context is created, all devices for this context can be queried:
size_t nContextDescriptorSize;
clGetContextInfo(hContext, CL_CONTEXT_DEVICES,
0, 0, &nContextDescriptorSize);
cl_device_id * aDevices = (cl_device_id *) malloc(nContextDescriptorSize);
clGetContextInfo(hContext, CL_CONTEXT_DEVICES,
nContextDescriptorSize, aDevices, 0);
After executing the above code the aDevices array contains entries with information about the
devices available for this context.
OpenCL introduces an additional concept: Command Queues. Commands launching kernels and
reading or writing memory are always issued for a specific command queue. A command queue is
created on a specific device in a context. The following code creates a command queue for the
device and context created so far:
cl_command_queue hCmdQueue;
hCmdQueue = clCreateCommandQueue(hContext, aDevices[0], 0, 0);
With this the program has progressed to the point where data can be uploaded to the device’s
memory and processed by launching compute kernels on the device.
Kernel Creation
The following sections discuss how kernels are created using the CUDA Driver API and OpenCL.
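Using the CUDA Driver API:
Kernels are compiled offline with the NVCC compiler into a binary (.cubin) file, which the
application loads at runtime to obtain a handle to the kernel function: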
CUmodule hModule;
cuModuleLoad(&hModule, "vectorAdd.cubin");
cuModuleGetFunction(&hFunction, hModule, "vectorAdd");
Using OpenCL:
OpenCL is different from C for CUDA in that OpenCL does not provide a standalone compiler for
creating device ready binary code. The OpenCL interface provides methods for compiling kernels
given a string containing the kernel code (clCreateProgramWithSource()) at runtime.
Once a kernel is compiled it can be launched on the device.
Note: The OpenCL API also provides methods to access a program’s binaries after
successful compilation, as well as methods to create program objects from such
binaries. The OpenCL specification describes a scenario where applications avoid
lengthy compiles on every launch by caching the kernel binaries on disk and
recompiling only if the binaries for a specific device are not already in the cache.
Given the richness of the OpenCL API, it is possible in principle for a developer to
recreate a workflow like the CUDA one, where a separate compiler (implemented on
top of the OpenCL library) builds binaries that the application loads at runtime.
In summary, the most straightforward approach is to compile the kernels at runtime, and this is what
the following code does:
cl_program hProgram;
hProgram = clCreateProgramWithSource(hContext, 1,
sProgramSource, 0, 0);
clBuildProgram(hProgram, 0, 0, 0, 0, 0);
// create kernel
cl_kernel hKernel;
hKernel = clCreateKernel(hProgram, "vectorAdd", 0);
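Memory Allocation
Using the CUDA Driver API:
Device memory is allocated with cuMemAlloc() and input data is uploaded with explicit copies; the
following sketch mirrors the host listing above:
cuMemAlloc(&pDeviceMemA, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemB, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemC, cnDimension * sizeof(float));
cuMemcpyHtoD(pDeviceMemA, pA, cnDimension * sizeof(float));
cuMemcpyHtoD(pDeviceMemB, pB, cnDimension * sizeof(float));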
Using OpenCL:
OpenCL’s device memory is managed via “buffer objects”. Buffer objects are created with the
clCreateBuffer() function, which offers a richer set of parameters than the CUDA memory
management functions: buffer objects can be flagged as read-only or write-only, and it is even
possible to specify a host memory region to be used by the device directly.
OpenCL buffer creation also allows a host pointer to be passed so that the data is copied into the
new buffer in the same call. The following code shows the buffer creation for the three device
memory regions for vectors A, B, and C. A and B are filled with data from the host, pointed to by
pA and pB. Since vector C only receives results, it is not prefilled with data.
hDeviceMemA = clCreateBuffer(hContext,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
cnDimension * sizeof(cl_float),
pA,
0);
hDeviceMemB = clCreateBuffer(hContext,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
cnDimension * sizeof(cl_float),
pB,
0);
hDeviceMemC = clCreateBuffer(hContext,
CL_MEM_WRITE_ONLY,
cnDimension * sizeof(cl_float),
0, 0);
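Kernel Parameter Specification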
Parameter setting in both APIs is a fairly low-level affair. It requires knowledge of the total number,
order, and types of a given kernel’s parameters. The order and types of the parameters determine
each parameter’s offset inside the data block made up of all parameters. The offset in bytes of the
n-th parameter is essentially the sum of the sizes of all n-1 preceding parameters.
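Using the CUDA Driver API:
Each parameter is written into the parameter block at its byte offset with cuParamSeti(); the
offsets 0, 4, and 8 and the total size of 12 bytes below assume 4-byte device pointers: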
cuParamSeti(hFunction, 0, pDeviceMemA);
cuParamSeti(hFunction, 4, pDeviceMemB);
cuParamSeti(hFunction, 8, pDeviceMemC);
cuParamSetSize(hFunction, 12);
Using OpenCL:
In OpenCL parameter setting is done via a single function that takes a pointer to the location of the
parameter to be set.
clSetKernelArg(hKernel, 0, sizeof(cl_mem), (void *)&hDeviceMemA);
clSetKernelArg(hKernel, 1, sizeof(cl_mem), (void *)&hDeviceMemB);
clSetKernelArg(hKernel, 2, sizeof(cl_mem), (void *)&hDeviceMemC);
Kernel Launch
Launching a kernel requires the specification of the dimension and size of the “thread-grid”. The
CUDA Programming Guide and the OpenCL specification contain details about the structure of
those grids. For NVIDIA GPUs the permissible structures are the same for CUDA and OpenCL.
For the vectorAdd sample we need to start one thread per vector-element (of the output vector).
The number of elements in the vector is given in the cnDimension variable. It is defined to be
cnDimension = cnBlockSize * cnBlocks. This means that cnDimension threads
need to be executed. The threads are structured into cnBlocks one-dimensional thread blocks of
size cnBlockSize.
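Using the CUDA Driver API:
The thread-block size is set on the function handle and a one-dimensional grid of cnBlocks blocks
is launched; the following sketch is consistent with the host listing above:
cuFuncSetBlockShape(hFunction, cnBlockSize, 1, 1);
// launch a one-dimensional grid of cnBlocks thread blocks
cuLaunchGrid(hFunction, cnBlocks, 1);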
Using OpenCL:
The OpenCL equivalent of kernel launching is to “enqueue” a kernel for execution into a command
queue. The enqueue function takes parameters for both the work group size (a work group is the
OpenCL equivalent of a CUDA thread-block), and the global work size, which is the size of the
global array of threads.
Note: Whereas in CUDA the global grid size is specified as a number of thread
blocks, in OpenCL the global work size is given as a total number of work-items (threads).
Both the work group size and the global work size are potentially one-, two-, or three-dimensional
arrays. The function expects pointers to size_t values for the global and local work sizes in its fifth
and sixth parameters.
For the vectorAdd example, the work groups and the total work size form a one-dimensional grid of threads.
clEnqueueNDRangeKernel(hCmdQueue, hKernel, 1, 0,
&cnDimension, &cnBlockSize, 0, 0, 0);
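Result Retrieval
After the kernel has executed, the output vector C must be copied from device memory back into
host memory.
Using the CUDA Driver API:
A simple synchronous copy, consistent with the host listing above:
cuMemcpyDtoH(pC, pDeviceMemC, cnDimension * sizeof(float));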
Using OpenCL:
OpenCL’s clEnqueueReadBuffer() function allows the user to specify whether a read is to
be synchronous or asynchronous (third argument). For the simple vectorAdd sample a
synchronizing read is used, which results in the same behavior as the simple synchronous CUDA
memory copy above:
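// blocking read of the result vector; the call returns once the copy has completed
clEnqueueReadBuffer(hCmdQueue, hDeviceMemC, CL_TRUE, 0,
                    cnDimension * sizeof(cl_float),
                    pC, 0, 0, 0);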
When used for asynchronous reads, OpenCL has an event mechanism that allows the host
application to query the status or wait for the completion of a given call.
Additional Resources
Resource URL
Khronos OpenCL Homepage http://www.khronos.org/opencl
OpenCL 1.0 Specification http://www.khronos.org/registry/cl
OpenCL at NVIDIA http://www.nvidia.com/object/cuda_opencl.html
CUDA Driver http://www.nvidia.com/object/cuda_get.html
CUDA Toolkit http://www.nvidia.com/object/cuda_get.html
CUDA SDK http://www.nvidia.com/object/cuda_get.html
CUDA Reference Guide http://www.nvidia.com/object/cuda_develop.html
CUDA Programming Guide http://www.nvidia.com/object/cuda_develop.html
CUDA Zone http://www.nvidia.com/cuda
Developer Forums http://forums.nvidia.com/index.php?showforum=62
CUDA Visual Profiler http://www.nvidia.com/object/cuda_get.html
CUDA GDB http://www.nvidia.com/object/cuda_get.html
For more information about GPU Computing with OpenCL and other technologies, please visit www.nvidia.com/cuda.
Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND
OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA
MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE
MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT,
MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no
responsibility for the consequences of use of such information or for any infringement of patents or other
rights of third parties that may result from its use. No license is granted by implication or otherwise under any
patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to
change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA
Corporation products are not authorized for use as critical components in life support devices or systems
without express written approval of NVIDIA Corporation.
Trademarks
NVIDIA, the NVIDIA logo, and GeForce are trademarks or registered trademarks of NVIDIA Corporation.
OpenCL is an Apple Trademark licensed by Khronos. Other company and product names may be trademarks
of the respective companies with which they are associated.
Copyright
© 2009 by NVIDIA Corporation. All rights reserved.
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara, CA 95050
www.nvidia.com