
CUDA BINARY UTILITIES

DA-06762-001_v7.0 | August 2014

Application Note
TABLE OF CONTENTS

Chapter 1. Overview
1.1. What is a CUDA Binary?
1.2. Differences between cuobjdump and nvdisasm
Chapter 2. cuobjdump
2.1. Usage
2.2. Command-line Options
Chapter 3. nvdisasm
3.1. Usage
3.2. Command-line Options
Chapter 4. Instruction Set Reference
4.1. GT200 Instruction Set
4.2. Fermi Instruction Set
4.3. Kepler Instruction Set
4.4. Maxwell Instruction Set
Chapter 5. nvprune
5.1. Usage
5.2. Command-line Options

LIST OF FIGURES

Figure 1 Control Flow Graph

LIST OF TABLES

Table 1 Comparison of cuobjdump and nvdisasm

Table 2 cuobjdump Command-line Options

Table 3 nvdisasm Command-line Options

Table 4 GT200 Instruction Set

Table 5 Fermi Instruction Set

Table 6 Kepler Instruction Set

Table 7 Maxwell Instruction Set

Table 8 nvprune Command-line Options

Chapter 1.
OVERVIEW

This document introduces cuobjdump, nvdisasm, and nvprune, three CUDA binary
tools for Linux (x86 and ARM), Windows, Mac OS and Android.

1.1. What is a CUDA Binary?


A CUDA binary (also referred to as cubin) file is an ELF-formatted file which consists of
CUDA executable code sections as well as other sections containing symbols, relocators,
debug info, etc. By default, the CUDA compiler driver nvcc embeds cubin files into the
host executable file. But they can also be generated separately by using the "-cubin"
option of nvcc. cubin files are loaded at run time by the CUDA driver API.
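
For example, the following command (the source file name add.cu and the sm_20 target are placeholders) generates a standalone cubin that the tools described in this document can then examine:
nvcc -cubin -arch=sm_20 add.cu -o add.cubin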

For more details on cubin files or the CUDA compilation trajectory, refer to NVIDIA
CUDA Compiler Driver NVCC.

1.2. Differences between cuobjdump and nvdisasm

CUDA provides two binary utilities for examining and disassembling cubin files and
host executables: cuobjdump and nvdisasm. Basically, cuobjdump accepts both cubin
files and host binaries while nvdisasm only accepts cubin files; but nvdisasm provides
richer output options.
Here's a quick comparison of the two tools:

Table 1 Comparison of cuobjdump and nvdisasm

                                                           cuobjdump   nvdisasm
Disassemble cubin                                          Yes         Yes
Extract ptx and extract and disassemble cubin from the
following input files:                                     Yes         No
‣ Host binaries
‣ Executables
‣ Object files
‣ Static libraries
‣ External fatbinary files
Control flow analysis and output                           No          Yes
Advanced display options                                   No          Yes

Chapter 2.
CUOBJDUMP

cuobjdump extracts information from CUDA binary files (both standalone and those
embedded in host binaries) and presents it in human-readable format. The output of
cuobjdump includes CUDA assembly code for each kernel, CUDA ELF section headers,
string tables, relocators and other CUDA-specific sections. It also extracts embedded
PTX text from host binaries.
For a list of the CUDA assembly instructions for each GPU architecture, see the
Instruction Set Reference.

2.1. Usage
cuobjdump accepts a single input file each time it's run. The basic usage is as follows:

cuobjdump [options] <file>

To disassemble a standalone cubin or cubins embedded in a host executable and show
the CUDA assembly of the kernels, use the following command:
cuobjdump -sass <input file>

To dump CUDA ELF sections in human-readable format from a cubin file, use the
following command:
cuobjdump -elf <cubin file>

To extract PTX text from a host binary, use the following command:
cuobjdump -ptx <host binary>

Here's a sample output of cuobjdump:


$ cuobjdump a.out -ptx -sass


Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
identifier = add.cu

code for sm_20


Function : _Z3addPiS_S_
.headerflags @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ MOV R6, c[0x0][0x20]; /* 0x2800400080019de4 */
/*0010*/ MOV R7, c[0x0][0x24]; /* 0x280040009001dde4 */
/*0018*/ MOV R2, c[0x0][0x28]; /* 0x28004000a0009de4 */
/*0020*/ MOV R3, c[0x0][0x2c]; /* 0x28004000b000dde4 */
/*0028*/ LDU.E R0, [R6]; /* 0x8c00000000601c85 */
/*0030*/ MOV R4, c[0x0][0x30]; /* 0x28004000c0011de4 */
/*0038*/ LDU.E R2, [R2]; /* 0x8c00000000209c85 */
/*0040*/ MOV R5, c[0x0][0x34]; /* 0x28004000d0015de4 */
/*0048*/ IADD R0, R2, R0; /* 0x4800000000201c03 */
/*0050*/ ST.E [R4], R0; /* 0x9400000000401c85 */
/*0058*/ EXIT; /* 0x8000000000001de7 */
.............................

Fatbin ptx code:


================
arch = sm_20
code version = [4,0]
producer = cuda
host = linux
compile_size = 64bit
compressed
identifier = add.cu

.version 4.0
.target sm_20
.address_size 64

.visible .entry _Z3addPiS_S_(


.param .u64 _Z3addPiS_S__param_0,
.param .u64 _Z3addPiS_S__param_1,
.param .u64 _Z3addPiS_S__param_2
)
{
.reg .s32 %r<4>;
.reg .s64 %rd<7>;

ld.param.u64 %rd1, [_Z3addPiS_S__param_0];


ld.param.u64 %rd2, [_Z3addPiS_S__param_1];
ld.param.u64 %rd3, [_Z3addPiS_S__param_2];
cvta.to.global.u64 %rd4, %rd3;
cvta.to.global.u64 %rd5, %rd2;
cvta.to.global.u64 %rd6, %rd1;
ldu.global.u32 %r1, [%rd6];
ldu.global.u32 %r2, [%rd5];
add.s32 %r3, %r2, %r1;
st.global.u32 [%rd4], %r3;
ret;
}

As shown in the output, the a.out host binary contains cubin and ptx code for sm_20.
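
The host binaries used in the remaining examples embed device code for more than one architecture. Such a binary could be built with a command along these lines (add_new.cu and add_old.cu are hypothetical source files, chosen to match the listing below):
nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 add_new.cu add_old.cu -o a.out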


To list the cubin files in the host binary, use the -lelf option:

$ cuobjdump a.out -lelf


ELF file 1: add_new.sm_20.cubin
ELF file 2: add_new.sm_30.cubin
ELF file 3: add_old.sm_20.cubin
ELF file 4: add_old.sm_30.cubin

To extract all the cubins as files from the host binary, use the -xelf all option:

$ cuobjdump a.out -xelf all


Extracting ELF file 1: add_new.sm_20.cubin
Extracting ELF file 2: add_new.sm_30.cubin
Extracting ELF file 3: add_old.sm_20.cubin
Extracting ELF file 4: add_old.sm_30.cubin

To extract the cubin named add_new.sm_30.cubin:

$ cuobjdump a.out -xelf add_new.sm_30.cubin


Extracting ELF file 1: add_new.sm_30.cubin

To extract only the cubins containing _old in their names:

$ cuobjdump a.out -xelf _old


Extracting ELF file 1: add_old.sm_20.cubin
Extracting ELF file 2: add_old.sm_30.cubin

You can pass any substring to the -xelf and -xptx options. Only the files whose names
contain the substring will be extracted from the input binary.
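
The same pattern works for PTX: to extract all the PTX files from the host binary, use the -xptx option, for example:
$ cuobjdump a.out -xptx all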

2.2. Command-line Options


Table 2 contains supported command-line options of cuobjdump, along with a
description of what each option does. Each option has a long name and a short name,
which can be used interchangeably.

Table 2 cuobjdump Command-line Options

--all-fatbin (-all)
    Dump all fatbin sections. By default will only dump contents of executable
    fatbin (if exists), else relocatable fatbin if no executable fatbin.

--dump-elf (-elf)
    Dump ELF Object sections.

--dump-elf-symbols (-symbols)
    Dump ELF symbol names.

--dump-ptx (-ptx)
    Dump PTX for all listed device functions.

--dump-sass (-sass)
    Dump CUDA assembly for a single cubin file or all cubin files embedded in
    the binary.

--extract-elf <partial file name>,... (-xelf)
    Extract ELF file(s) with name containing <partial file name> and save as
    file(s). Use 'all' to extract all files. To get the list of ELF files use
    the -lelf option. Works with host executable/object/library and external
    fatbin. All 'dump' and 'list' options are ignored with this option.

--extract-ptx <partial file name>,... (-xptx)
    Extract PTX file(s) with name containing <partial file name> and save as
    file(s). Use 'all' to extract all files. To get the list of PTX files use
    the -lptx option. Works with host executable/object/library and external
    fatbin. All 'dump' and 'list' options are ignored with this option.

--function <function name>,... (-fun)
    Specify names of device functions whose fat binary structures must be
    dumped.

--gpu-architecture <gpu architecture name> (-arch)
    Specify GPU Architecture for which information should be dumped. Allowed
    values for this option: 'sm_20','sm_21','sm_30','sm_32','sm_35','sm_50',
    'sm_52'.

--help (-h)
    Print this help information on this tool.

--list-elf (-lelf)
    List all the ELF files available in the fatbin. Works with host
    executable/object/library and external fatbin. All other options are
    ignored with this flag. This can be used to select a particular ELF with
    the -xelf option later.

--list-ptx (-lptx)
    List all the PTX files available in the fatbin. Works with host
    executable/object/library and external fatbin. All other options are
    ignored with this flag. This can be used to select a particular PTX with
    the -xptx option later.

--options-file <file>,... (-optf)
    Include command line options from specified file.

--sort-functions (-sort)
    Sort functions when dumping sass.

--version (-V)
    Print version information on this tool.
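
For example, combining options from the table above, the following restricts the SASS dump to the sm_20 code embedded in a host binary (a.out as in the earlier examples):
cuobjdump -sass -arch sm_20 a.out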

Chapter 3.
NVDISASM

nvdisasm extracts information from standalone cubin files and presents it in human-
readable format. The output of nvdisasm includes CUDA assembly code for each
kernel, a listing of ELF data sections and other CUDA-specific sections. Output style and
options are controlled through nvdisasm command-line options. nvdisasm also does
control flow analysis to annotate jump/branch targets and make the output easier to
read.

nvdisasm requires complete relocation information to do control flow analysis. If
this information is missing from the CUDA binary, either use the nvdisasm option
"-ndf" to turn off control flow analysis, or use the ptxas and nvlink option
"-preserve-relocs" to re-generate the cubin file.
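
For instance, a cubin with relocation information preserved could be regenerated from PTX along these lines (add.ptx is a placeholder, and the exact spelling of the flag may differ between toolkit versions):
ptxas -arch=sm_30 -preserve-relocs add.ptx -o add.cubin
nvdisasm add.cubin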

For a list of the CUDA assembly instructions for each GPU architecture, see the
Instruction Set Reference.

3.1. Usage
nvdisasm accepts a single input file each time it's run. The basic usage is as follows:

nvdisasm [options] <input cubin file>

To get the control flow graph of a kernel, use the following:


nvdisasm -cfg <input cubin file>

Here's a sample output of nvdisasm:


.headerflags @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"

//--------------------- .nv.info --------------------------


.section .nv.info,"",@"SHT_CUDA_INFO "
.align 4

......

//--------------------- .text._Z4addXPii --------------------------


.section .text._Z4addXPii,"ax",@progbits
.sectioninfo @"SHI_REGISTERS=11 "
.align 4
.global _Z4addXPii
.type _Z4addXPii,@function
.size _Z4addXPii,(.L_19 - _Z4addXPii)
.other _Z4addXPii,@"STO_CUDA_ENTRY STV_DEFAULT "
_Z4addXPii:
.text._Z4addXPii:
/*0008*/ MOV R1, c[0x0][0x44];
/*0010*/ ISUB R1, R1, 0x8;
/*0018*/ MOV R0, c[0x0][0x148];
/*0020*/ IADD R6.CC, R1, c[0x0][0x24];
/*0028*/ ISETP.LT.AND P0, PT, R0, 0x1, PT;
/*0030*/ MOV R8, c[0x0][0x140];
/*0038*/ MOV R9, c[0x0][0x144];
/*0048*/ IADD.X R7, RZ, RZ;
/*0050*/ ISUB R10, R6, c[0x0][0x24];
/*0058*/ @P0 BRA `(.L_2);
/*0060*/ LD.E R0, [R8];
/*0068*/ MOV R2, RZ;
/*0070*/ NOP;
/*0078*/ NOP;
.L_3:
/*0088*/ IADD R2, R2, 0x1;
/*0090*/ MOV R3, R0;
/*0098*/ IADD R0, R0, 0x1;
/*00a0*/ ISETP.LT.AND P0, PT, R2, c[0x0][0x148], PT;
/*00a8*/ @P0 BRA `(.L_3);
/*00b0*/ IADD R0, R3, 0x1;
/*00b8*/ ST.E [R8], R0;
.L_2:
/*00c8*/ S2R R0, SR_TID.X;
/*00d0*/ ISETP.NE.AND P0, PT, R0, RZ, PT;
/*00d8*/ @P0 EXIT ;
/*00e0*/ LD.E R0, [R8];
/*00e8*/ MOV R4, c[0x0][0xf0];
/*00f0*/ MOV R5, c[0x0][0xf4];
/*00f8*/ STL [R10], R0;
/*0108*/ JCAL `(vprintf);
/*0110*/ EXIT ;
.L_4:
/*0118*/ BRA `(.L_4);
.L_19:

//--------------------- SYMBOLS --------------------------

.type vprintf,@function


nvdisasm is capable of generating the control flow of CUDA assembly in the DOT
graph description language. The output of the control flow from nvdisasm can be
directly imported into a DOT graph visualization tool such as Graphviz.

This feature is only supported on cubins generated for Compute Capability 3.0 and
later.

Here's how you can generate a PNG image (cfg.png) of the control flow of the above
cubin (a.cubin) with nvdisasm and Graphviz:
nvdisasm -cfg a.cubin | dot -ocfg.png -Tpng

Here's the generated graph:

Figure 1 Control Flow Graph


nvdisasm is capable of showing the register (CC, general and predicate) liveness
range information. For each line of CUDA assembly, nvdisasm displays whether a
given device register was assigned, accessed, live or re-assigned. It also shows the total
number of registers used. This is useful if the user is interested in the life range of any
particular register, or register usage in general.

This feature is only supported on cubins generated for Compute Capability 3.0 and
later.
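
This information is requested with the --print-life-ranges (-plr) option, for example:
nvdisasm -plr a.cubin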

Here's a sample output (left columns are omitted):


// +------+---------------+-----+
// | CC | GPR |PRED |
// | | 0000000000 | |
// | # 01 | # 0123456789 | # 0 |
// +------+---------------+-----+
_main10acosParams // | | | |
_main10acosParams,@function // | | | |
_main10acosParams,(.L_17 - _Z9acos_main10acosParams) // | | | |
_main10acosParams,@"STO_CUDA_ENTRY STV_DEFAULT" // | | | |
// | | | |
// | | | |
MOV R1, c[0x0][0x44]; // | | 1 ^ | |
S2R R0, SR_CTAID.X; // | | 2 ^: | |
S2R R3, SR_TID.X; // | | 3 :: ^ | |
IMAD R3, R0, c[0x0][0x28], R3; // | | 3 v: x | |
MOV R0, c[0x0][0x28]; // | | 3 ^: : | |
ISETP.GE.AND P0, PT, R3, c[0x0][0x150], PT; // | | 3 :: v | 1 ^ |
IMUL R0, R0, c[0x0][0x34]; // | | 3 x: : | 1 : |
@P0 EXIT; // | | 3 :: : | 1 v |
MOV32I R8, 0x4; // | | 4 :: : ^ | |
MOV32I R9, 0x3c94d2e9; // | | 5 :: : :^ | |
NOP; // | | 5 :: : :: | |
NOP; // | | 5 :: : :: | |
NOP; // | | 5 :: : :: | |
NOP; // | | 5 :: : :: | |
// | | 5 :: : :: | |
IMAD R6.CC, R3, R8, c[0x0][0x140]; // | 1 ^ | 6 :: v ^ v: | |
IMAD.HI.X R7, R3, R8, c[0x0][0x144]; // | 1 v | 7 :: v :^v: | |
LD.E R2, [R6]; // | | 8 ::^: vv:: | |
FADD.FTZ R4, -|R2|, 1; // | | 7 ::v:^ :: | |
FSETP.GT.FTZ.AND P0, PT, |R2|, c[0x2][0x0], PT; // | | 7 ::v:: :: | 1 ^ |
FMUL.FTZ R4, R4, 0.5; // | | 7 ::::x :: | 1 : |
F2F.FTZ.F32.F32 R5, |R2|; // | | 8 ::v::^ :: | 1 : |
MUFU.RSQ R4, R4; // | | 8 ::::x: :: | 1 : |
@P0 MUFU.RCP R5, R4; // | | 8 ::::v^ :: | 1 v |
FMUL.FTZ R4, R5, R5; // | | 8 ::::^v :: | 1 : |
IMAD R6.CC, R3, R8, c[0x0][0x148]; // | 1 ^ | 9 :::v::^ v: | 1 : |
FFMA.FTZ R7, R4, c[0x2][0x4], R9; // | 1 : | 10 ::::v::^:v | 1 : |
FFMA.FTZ R7, R7, R4, c[0x2][0x8]; // | 1 : | 10 ::::v::x:: | 1 : |
FFMA.FTZ R7, R7, R4, c[0x2][0xc]; // | 1 : | 10 ::::v::x:: | 1 : |
FFMA.FTZ R7, R7, R4, c[0x2][0x10]; // | 1 : | 10 ::::v::x:: | 1 : |
FMUL.FTZ R4, R7, R4; // | 1 : | 10 ::::x::v:: | 1 : |
IMAD.HI.X R7, R3, R8, c[0x0][0x14c]; // | 1 v | 10 :::v:::^v: | 1 : |
FFMA.FTZ R4, R4, R5, R5; // | | 10 ::::xv:::: | 1 : |
IADD R3, R3, R0; // | | 9 v::x: :::: | 1 : |
FADD32I.FTZ R5, -R4, 1.5707963705062866211; // | | 10 ::::v^:::: | 1 : |
@P0 FADD.FTZ R5, R4, R4; // | | 10 ::::v^:::: | 1 v |
ISETP.LT.AND P0, PT, R3, c[0x0][0x150], PT; // | | 9 :::v ::::: | 1 ^ |
FADD32I.FTZ R4, -R5, 3.1415927410125732422; // | | 10 ::::^v:::: | 1 : |
FCMP.LT.FTZ R2, R4, R5, R2; // | | 10 ::x:vv:::: | 1 : |
ST.E [R6], R2; // | | 8 ::v: vv:: | 1 : |
@P0 BRA `(.L_1); // | | 5 :: : :: | 1 v |
MOV RZ, RZ; // | | 1 : | |
EXIT; // | | 1 : | |
// +......+...............+.....+
BRA `(.L_2); // | | | |
// +------+---------------+-----+
// Legend:
// ^ : Register assignment
// v : Register usage
// x : Register usage and reassignment
// : : Register in use
// <space> : Register not in use
// # : Number of occupied registers

3.2. Command-line Options


Table 3 contains the supported command-line options of nvdisasm, along with a
description of what each option does. Each option has a long name and a short name,
which can be used interchangeably.


Table 3 nvdisasm Command-line Options

--base-address <value> (-base)
    Specify the logical base address of the image to disassemble. This option
    is only valid when disassembling a raw instruction binary (see option
    '--binary'), and is ignored when disassembling an ELF file. Default
    value: 0.

--binary <SMxy> (-b)
    When this option is specified, the input file is assumed to contain a raw
    instruction binary, that is, a sequence of binary instruction encodings as
    they occur in instruction memory. The value of this option must be the
    asserted architecture of the raw binary. Allowed values for this option:
    'SM20','SM21','SM30','SM32','SM35','SM50','SM52'.

--help (-h)
    Print this help information on this tool.

--life-range-mode (-lrm)
    This option implies option '--print-life-ranges', and determines how
    register live range info should be printed. 'count': Not at all, leaving
    only the '#' column (number of live registers); 'wide': Columns spaced out
    for readability (default); 'narrow': A one-character column for each
    register, economizing on table width. Allowed values for this option:
    'count','narrow','wide'.

--no-dataflow (-ndf)
    Disable dataflow analyzer after disassembly. Dataflow analysis is normally
    enabled to perform branch stack analysis and annotate all instructions
    that jump via the GPU branch stack with inferred branch target labels.
    However, it may occasionally fail when certain restrictions on the input
    nvelf/cubin are not met.

--options-file <file>,... (-optf)
    Include command line options from specified file.

--output-control-flow-graph (-cfg)
    When specified, output the control flow graph in a format consumable by
    graphviz tools (such as dot).

--print-code (-c)
    Only print code sections.

--print-instruction-encoding (-hex)
    When specified, print the encoding bytes after each disassembled
    operation.

--print-life-ranges (-plr)
    Print register life range information in a trailing column in the produced
    disassembly.

--print-line-info (-g)
    Annotate disassembly with source line information obtained from the
    .debug_line section, if present.

--print-raw (-raw)
    Print the disassembly without any attempt to beautify it.

--separate-functions (-sf)
    Separate the code corresponding with function symbols by some new lines to
    let them stand out in the printed disassembly.

--version (-V)
    Print version information on this tool.
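
For example, the following combines several of these options to print only the code sections, with instruction encodings and source line annotations (a.cubin as before):
nvdisasm -c -hex -g a.cubin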

Chapter 4.
INSTRUCTION SET REFERENCE

This is an instruction set reference for the NVIDIA® GPU architectures GT200, Fermi,
Kepler and Maxwell.

4.1. GT200 Instruction Set


The GT200 architecture (Compute Capability 1.x) has the following instruction set
format:
(instruction) (destination) (source1), (source2) ...

Valid destination and source locations include:


‣ RX for registers
‣ AX for address registers
‣ SRX for special system-controlled registers
‣ CX for condition registers
‣ global14 r[X] for global memory referenced by an address in a register
‣ g[X] for shared memory
‣ c[X][Y] for constant memory
‣ local[X] for local memory
Table 4 lists valid instructions for the GT200 GPUs.

Table 4 GT200 Instruction Set

Opcode Description
A2R Move address register to data register
ADA Add immediate to address register
BAR CTA-wide barrier synchronization
BRA Conditional branch
BRK Conditional break from a loop
BRX Fetch an address from constant memory and branch to it

C2R Conditional code to data register
CAL Unconditional subroutine call
COS Cosine
DADD Double-precision floating point addition
DFMA Double-precision floating point fused multiply-add
DMAX Double-precision floating point maximum
DMIN Double-precision floating point minimum
DMUL Double-precision floating point multiply
DSET Double-precision floating point conditional set
EX2 Exponential base two function
F2F Copy floating-point value with conversion to a different floating-point type
F2I Copy floating-point value with conversion to integer
FADD/FADD32/FADD32I Single-precision floating point addition
FCMP Single-precision floating point compare
FMAD/FMAD32/FMAD32I Single-precision floating point multiply-add
FMAX Single-precision floating point maximum
FMIN Single-precision floating point minimum
FMUL/FMUL32/FMUL32I Single-precision floating point multiply
FSET Single-precision floating point conditional set
G2R Move from shared memory to register. A .LCK suffix indicates that
the bank is locked until a R2G.UNL has been performed; this is used
to implement shared memory atomics.
GATOM.IADD/EXCH/CAS/IMIN/IMAX/INC/DEC/IAND/IOR/IXOR Global memory atomic operations; performs both an atomic operation and returns the original value
GLD Load from global memory
GRED.IADD/IMIN/IMAX/INC/DEC/IAND/IOR/IXOR Global memory reduction operations; performs only an atomic operation with no return value
GST Store to global memory
I2F Copy integer value to floating-point with conversion
I2I Copy integer value to integer with conversion
IADD/IADD32/IADD32I Integer addition
IMAD/IMAD32/IMAD32I Integer multiply-add
IMAX Integer maximum
IMIN Integer minimum
IMUL/IMUL32/IMUL32I Integer multiply
ISAD/ISAD32 Sum of absolute difference

ISET Integer conditional set
LG2 Floating point logarithm base 2
LLD Load from local memory
LST Store to local memory
LOP Logical operation (AND/OR/XOR)
MOV/MOV32 Move source to destination
MVC Move from constant memory to destination
MVI Move immediate to destination
NOP No operation
R2A Move register to address register
R2C Move data register to conditional code
R2G Store to shared memory. When used with the .UNL suffix, releases a
previously held lock on that shared memory bank
RCP Single-precision floating point reciprocal
RET Conditional return from subroutine
RRO Range reduction operator
RSQ Reciprocal square root
S2R Move special register to register
SHL Shift left
SHR Shift right
SIN Sine
SSY Set synchronization point; used before potentially divergent
instructions
TEX/TEX32 Texture fetch
VOTE Warp-vote primitive

4.2. Fermi Instruction Set


The Fermi architecture (Compute Capability 2.x) has the following instruction set
format:
(instruction) (destination) (source1), (source2) ...

Valid destination and source locations include:


‣ RX for registers
‣ SRX for special system-controlled register
‣ PX for condition register
‣ c[X][Y] for constant memory
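
For example, the instruction IADD R0, R2, R0; in the sm_20 disassembly shown in the cuobjdump chapter follows this format: IADD is the instruction, the first R0 is the destination register, R2 and R0 are the source registers, and an operand such as c[0x1][0x100] refers to constant memory (bank 0x1, offset 0x100).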


Table 5 lists valid instructions for the Fermi GPUs.

Table 5 Fermi Instruction Set

Opcode Description
Floating Point Instructions
FFMA FP32 Fused Multiply Add
FADD FP32 Add
FCMP FP32 Compare
FMUL FP32 Multiply
FMNMX FP32 Minimum/Maximum
FSWZ FP32 Swizzle
FSET FP32 Set
FSETP FP32 Set Predicate
RRO FP Range Reduction Operator
MUFU FP Multi-Function Operator
DFMA FP64 Fused Multiply Add
DADD FP64 Add
DMUL FP64 Multiply
DMNMX FP64 Minimum/Maximum
DSET FP64 Set
DSETP FP64 Set Predicate
Integer Instructions
IMAD Integer Multiply Add
IMUL Integer Multiply
IADD Integer Add
ISCADD Integer Scaled Add
ISAD Integer Sum Of Abs Diff
IMNMX Integer Minimum/Maximum
BFE Integer Bit Field Extract
BFI Integer Bit Field Insert
SHR Integer Shift Right
SHL Integer Shift Left
LOP Integer Logic Op
FLO Integer Find Leading One
ISET Integer Set
ISETP Integer Set Predicate

ICMP Integer Compare and Select
POPC Population Count
Conversion Instructions
F2F Float to Float
F2I Float to Integer
I2F Integer to Float
I2I Integer to Integer
Movement Instructions
MOV Move
SEL Conditional Select/Move
PRMT Permute
Predicate/CC Instructions
P2R Predicate to Register
R2P Register to Predicate
CSET CC Set
CSETP CC Set Predicate
PSET Predicate Set
PSETP Predicate Set Predicate
Texture Instructions
TEX Texture Fetch
TLD Texture Load
TLD4 Texture Load 4 Texels
TXQ Texture Query
Compute Load/Store Instructions
LDC Load from Constant
LD Load from Memory
LDU Load Uniform
LDL Load from Local Memory
LDS Load from Shared Memory
LDLK Load and Lock
LDSLK Load from Shared Memory and Lock
LD_LDU LD_LDU is a combination of a generic load LD with a load uniform LDU
LDS_LDU LDS_LDU is a combination of a Shared window load LDS with a load uniform LDU
ST Store to Memory

STL Store to Local Memory
STUL Store and Unlock
STS Store to Shared Memory
STSUL Store to Shared Memory and Unlock
ATOM Atomic Memory Operation
RED Atomic Memory Reduction Operation
CCTL Cache Control
CCTLL Cache Control (Local)
MEMBAR Memory Barrier
Surface Memory Instructions
SULD Surface Load
SULEA Surface Load Effective Address
SUST Surface Store
SURED Surface Reduction
SUQ Surface Query
Control Instructions
BRA Branch to Relative Address
BRX Branch to Relative Indexed Address
JMP Jump to Absolute Address
JMX Jump to Absolute Indexed Address
CAL Call to Relative Address
JCAL Call to Absolute Address
RET Return from Call
BRK Break from Loop
CONT Continue in Loop
LONGJMP Long Jump
SSY Set Sync Relative Address
PBK Pre-Break Relative Address
PCNT Pre-Continue Relative Address
PRET Pre-Return Relative Address
PLONGJMP Pre-Long-Jump Relative Address
BPT Breakpoint/Trap
EXIT Exit Program
Miscellaneous Instructions
NOP No Operation

S2R Special Register to Register
B2R Barrier to Register
LEPC Load Effective PC
BAR Barrier Synchronization
VOTE Query condition across threads

4.3. Kepler Instruction Set


The Kepler architecture (Compute Capability 3.x) has the following instruction set
format:
(instruction) (destination) (source1), (source2) ...

Valid destination and source locations include:


‣ RX for registers
‣ SRX for special system-controlled registers
‣ PX for condition registers
‣ c[X][Y] for constant memory
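
For example, in the sm_30 disassembly shown in the nvdisasm chapter, ISETP.LT.AND P0, PT, R0, 0x1, PT; writes the condition register P0, and the subsequent @P0 BRA `(.L_2); uses that predicate to guard the branch.
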
Table 6 lists valid instructions for the Kepler GPUs.

Table 6 Kepler Instruction Set

Opcode Description
Floating Point Instructions
FFMA FP32 Fused Multiply Add
FADD FP32 Add
FCMP FP32 Compare
FMUL FP32 Multiply
FMNMX FP32 Minimum/Maximum
FSWZ FP32 Swizzle
FSET FP32 Set
FSETP FP32 Set Predicate
FCHK FP32 Division Test
RRO FP Range Reduction Operator
MUFU FP Multi-Function Operator
DFMA FP64 Fused Multiply Add
DADD FP64 Add
DMUL FP64 Multiply

DMNMX FP64 Minimum/Maximum
DSET FP64 Set
DSETP FP64 Set Predicate
Integer Instructions
IMAD Integer Multiply Add
IMADSP Integer Extract Multiply Add
IMUL Integer Multiply
IADD Integer Add
ISCADD Integer Scaled Add
ISAD Integer Sum Of Abs Diff
IMNMX Integer Minimum/Maximum
BFE Integer Bit Field Extract
BFI Integer Bit Field Insert
SHR Integer Shift Right
SHL Integer Shift Left
SHF Integer Funnel Shift
LOP Integer Logic Op
FLO Integer Find Leading One
ISET Integer Set
ISETP Integer Set Predicate
ICMP Integer Compare and Select
POPC Population Count
Conversion Instructions
F2F Float to Float
F2I Float to Integer
I2F Integer to Float
I2I Integer to Integer
Movement Instructions
MOV Move
SEL Conditional Select/Move
PRMT Permute
SHFL Warp Shuffle
Predicate/CC Instructions
P2R Predicate to Register
R2P Register to Predicate

CSET CC Set
CSETP CC Set Predicate
PSET Predicate Set
PSETP Predicate Set Predicate
Texture Instructions
TEX Texture Fetch
TLD Texture Load
TLD4 Texture Load 4 Texels
TXQ Texture Query
Compute Load/Store Instructions
LDC Load from Constant
LD Load from Memory
LDG Non-coherent Global Memory Load
LDL Load from Local Memory
LDS Load from Shared Memory
LDSLK Load from Shared Memory and Lock
ST Store to Memory
STL Store to Local Memory
STS Store to Shared Memory
STSCUL Store to Shared Memory Conditionally and Unlock
ATOM Atomic Memory Operation
RED Atomic Memory Reduction Operation
CCTL Cache Control
CCTLL Cache Control (Local)
MEMBAR Memory Barrier
Surface Memory Instructions
SUCLAMP Surface Clamp
SUBFM Surface Bit Field Merge
SUEAU Surface Effective Address
SULDGA Surface Load Generic Address
SUSTGA Surface Store Generic Address
Control Instructions
BRA Branch to Relative Address
BRX Branch to Relative Indexed Address
JMP Jump to Absolute Address

JMX Jump to Absolute Indexed Address
CAL Call to Relative Address
JCAL Call to Absolute Address
RET Return from Call
BRK Break from Loop
CONT Continue in Loop
SSY Set Sync Relative Address
PBK Pre-Break Relative Address
PCNT Pre-Continue Relative Address
PRET Pre-Return Relative Address
BPT Breakpoint/Trap
EXIT Exit Program
Miscellaneous Instructions
NOP No Operation
S2R Special Register to Register
B2R Barrier to Register
BAR Barrier Synchronization
VOTE Query condition across threads

4.4. Maxwell Instruction Set


The Maxwell architecture (Compute Capability 5.x) has the following instruction set
format:
(instruction) (destination) (source1), (source2) ...

Valid destination and source locations include:


‣ RX for registers
‣ SRX for special system-controlled registers
‣ PX for condition registers
‣ c[X][Y] for constant memory
Table 7 lists valid instructions for the Maxwell GPUs.

Table 7 Maxwell Instruction Set

Opcode Description
Floating Point Instructions
FADD FP32 Add

FCHK Single Precision FP Divide Range Check
FCMP FP32 Compare to Zero and Select Source
FFMA FP32 Fused Multiply and Add
FMNMX FP32 Minimum/Maximum
FMUL FP32 Multiply
FSET FP32 Compare And Set
FSETP FP32 Compare And Set Predicate
FSWZADD FP32 Add used for FSWZ emulation
MUFU Multi Function Operation
RRO Range Reduction Operator FP
DADD FP64 Add
DFMA FP64 Fused Multiply Add
DMNMX FP64 Minimum/Maximum
DMUL FP64 Multiply
DSET FP64 Compare And Set
DSETP FP64 Compare And Set Predicate
Integer Instructions
BFE Bit Field Extract
BFI Bit Field Insert
FLO Find Leading One
IADD Integer Addition
IADD3 3-input Integer Addition
ICMP Integer Compare to Zero and Select Source
IMAD Integer Multiply And Add
IMADSP Extracted Integer Multiply And Add.
IMNMX Integer Minimum/Maximum
IMUL Integer Multiply
ISCADD Scaled Integer Addition
ISET Integer Compare And Set
ISETP Integer Compare And Set Predicate
LEA Compute Effective Address
LOP Logic Operation
LOP3 3-input Logic Operation
POPC Population count
SHF Funnel Shift

SHL Shift Left
SHR Shift Right
XMAD Integer Short Multiply Add
Conversion Instructions
F2F Floating Point To Floating Point Conversion
F2I Floating Point To Integer Conversion
I2F Integer To Floating Point Conversion
I2I Integer To Integer Conversion
Movement Instructions
MOV Move
PRMT Permute Register Pair
SEL Select Source with Predicate
SHFL Warp Wide Register Shuffle
Predicate/CC Instructions
CSET Test Condition Code And Set
CSETP Test Condition Code and Set Predicate
PSET Combine Predicates and Set
PSETP Combine Predicates and Set Predicate
P2R Move Predicate Register To Register
R2P Move Register To Predicate/CC Register
Texture Instructions
TEX Texture Fetch
TLD Texture Load
TLD4 Texture Load 4
TXQ Texture Query
TEXS Texture Fetch with scalar/non-vec4 source/destinations
TLD4S Texture Load 4 with scalar/non-vec4 source/destinations
TLDS Texture Load with scalar/non-vec4 source/destinations
Compute Load/Store Instructions
LD Load from generic Memory
LDC Load Constant
LDG Load from Global Memory
LDL Load within Local Memory Window
LDS Load within Shared Memory Window
ST Store to generic Memory

STG Store to global Memory
STL Store within Local or Shared Window
STS Store within Local or Shared Window
ATOM Atomic Operation on generic Memory
ATOMS Atomic Operation on Shared Memory
RED Reduction Operation on generic Memory
CCTL Cache Control
CCTLL Cache Control
MEMBAR Memory Barrier
CCTLT Texture Cache Control
Surface Memory Instructions
SUATOM Surface Reduction
SULD Surface Load
SURED Atomic Reduction on surface memory
SUST Surface Store
Control Instructions
BRA Relative Branch
BRX Relative Branch Indirect
JMP Absolute Jump
JMX Absolute Jump Indirect
SSY Set Synchronization Point
SYNC Converge threads after conditional branch
CAL Relative Call
JCAL Absolute Call
PRET Pre-Return From Subroutine
RET Return From Subroutine
BRK Break
PBK Pre-Break
CONT Continue
PCNT Pre-continue
EXIT Exit Program
PEXIT Pre-Exit
BPT BreakPoint/Trap
Miscellaneous Instructions
NOP No Operation

CS2R Move Special Register to Register
S2R Move Special Register to Register
B2R Move Barrier To Register
BAR Barrier Synchronization
R2B Move Register to Barrier
VOTE Vote Across SIMD Thread Group

Chapter 5.
NVPRUNE

nvprune prunes host object files and libraries to only contain device code for the
specified targets.

5.1. Usage
nvprune accepts a single input file each time it's run, emitting a new output file. The
basic usage is as follows:
nvprune [options] -o <outfile> <infile>

The input file must be either a relocatable host object or static library (not a host
executable), and the output file will be the same format.
Either the --arch or --generate-code option must be used to specify the target(s) to keep.
All other device code is discarded from the file. The targets can be either a sm_NN arch
(cubin) or compute_NN arch (ptx).
For example, the following will prune libcublas_static.a to only contain sm_35 cubin
rather than all the targets which normally exist:
nvprune -arch sm_35 libcublas_static.a -o libcublas_static35.a

Note that this means that libcublas_static35.a will not run on any other architecture, so it
should only be used when you are building for a single architecture.
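
To keep more than one target, the --generate-code option can be given several times; a sketch (the output name is arbitrary, and the code values must match targets that actually exist in the library):
nvprune -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 libcublas_static.a -o libcublas_static_35_50.a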

5.2. Command-line Options


Table 8 contains supported command-line options of nvprune, along with a description
of what each option does. Each option has a long name and a short name, which can be
used interchangeably.


Table 8 nvprune Command-line Options

--arch <gpu architecture name>,... (-arch)
    Specify the name of the NVIDIA GPU architecture which will remain in the
    object or library.

--generate-code (-gencode)
    This option has the same format as the nvcc --generate-code option, and
    provides a way to specify multiple architectures which should remain in
    the object or library. Only the 'code' values are used as targets to
    match. Allowed keywords for this option: 'arch','code'.

--output-file (-o)
    Specify name and location of the output file.

--help (-h)
    Print this help information on this tool.

--options-file <file>,... (-optf)
    Include command line options from specified file.

--version (-V)
    Print version information on this tool.

Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS,
DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY,
"MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES,
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE
MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF
NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR
PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA
Corporation assumes no responsibility for the consequences of use of such
information or for any infringement of patents or other rights of third parties
that may result from its use. No license is granted by implication or otherwise
under any patent rights of NVIDIA Corporation. Specifications mentioned in this
publication are subject to change without notice. This publication supersedes and
replaces all other information previously supplied. NVIDIA Corporation products
are not authorized as critical components in life support devices or systems
without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA
Corporation in the U.S. and other countries. Other company and product names
may be trademarks of the respective companies with which they are associated.

Copyright
© 2014 NVIDIA Corporation. All rights reserved.

