CUDA UR Reference Document
This document gives general guidelines on how to use UR to load and build programs, and execute kernels on a CUDA device.
Device code
A CUDA device image may be made of PTX and/or SASS, two different kinds of device code for NVIDIA GPUs.
CUDA device images can be generated by a CUDA-capable compiler toolchain. Most CUDA compiler toolchains are capable of generating PTX, SASS and/or bundles of PTX and SASS.
When generating device code to be launched using Unified Runtime, it is recommended to use a programming model with explicit kernel parameters, such as OpenCL or CUDA. This is because kernels generated by a programming model with implicit kernel parameters, such as SYCL, cannot guarantee any specific number or ordering of kernel parameters. It has been observed that kernel signatures for the same SYCL kernel may vary significantly when compiled for different architectures.
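For illustration, a kernel with an explicit parameter list might look like the following minimal CUDA sketch (the vadd kernel and file names are illustrative, not from this document); extern "C" avoids C++ name mangling, which simplifies kernel lookup later:

// vadd.cu: a kernel with an explicit, fixed parameter list.
extern "C" __global__ void vadd(const float *a, const float *b,
                                float *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    c[i] = a[i] + b[i];
}

PTX for such a kernel can then be generated with, for example:

$ nvcc -ptx -arch=sm_52 vadd.cu -o vadd.ptx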
PTX
PTX is a high-level NVIDIA ISA which can be JIT compiled at runtime by the CUDA driver. In UR, this JIT compilation happens at urProgramBuild, where PTX is assembled into device-specific SASS, which can then run on the device.
PTX is forward compatible, so PTX generated for .target sm_52 will be JIT compiled without issue for devices with a greater compute capability than sm_52, whereas PTX generated for sm_80 cannot be JIT compiled for an sm_60 device.
An advantage of using PTX over SASS is that one module can run on multiple devices. However, PTX generated for an older architecture may not give access to newer hardware instructions, such as new atomic operations or tensor core instructions.
JIT compilation has some overhead at urProgramBuild, especially if the program being loaded contains multiple kernels. The ptxjitcompiler keeps a JIT cache, however, so this overhead is only paid the first time a program is built. JIT caching may be turned off by setting the environment variable CUDA_CACHE_DISABLE=1.
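For example, to disable the JIT cache for a single run (my_ur_app is a placeholder application name):

$ CUDA_CACHE_DISABLE=1 ./my_ur_app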
SASS
SASS is a device-specific binary which may be produced by ptxas or some other tool. SASS is specific to an individual architecture and is not portable across architectures.
A SASS file may be stored as a .cubin file by NVIDIA tools.
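As a sketch, SASS for a specific architecture can be assembled from PTX with ptxas (file names are illustrative):

$ ptxas -arch=sm_80 vadd.ptx -o vadd.cubin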
UR Programs
A ur_program_handle_t has a one-to-one mapping with the CUDA driver object CUmodule.
In UR for CUDA, a ur_program_handle_t can be created using urProgramCreateWithBinary with:

- A single PTX module, stored as a null-terminated uint8_t buffer.
- A single SASS module, stored as an opaque uint8_t buffer.
- A mixed PTX/SASS module, where the SASS module is the assembled PTX module.
A ur_program_handle_t is valid only for a single architecture. If a CUDA-compatible binary contains device code for multiple NVIDIA architectures, it is the user’s responsibility to split it into separate device images so that urProgramCreateWithBinary is only called with a device binary for a single device architecture.
If a program is large and contains many kernels, loading and/or JIT compiling the program may have a high overhead. This can be mitigated by splitting a program into multiple smaller programs (corresponding to PTX/SASS files). In this way, an application will only pay the overhead of loading/compiling kernels that it will likely use.
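As a minimal sketch, a PTX module could be loaded and built like this, assuming the single-device urProgramCreateWithBinary signature (context, device, binary size, binary, optional properties, out-program; check ur_api.h for your UR version) and pre-existing context and device handles; error handling is mostly omitted:

#include <ur_api.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read a PTX file and create + build a UR program from it. */
ur_program_handle_t load_ptx_program(ur_context_handle_t hContext,
                                     ur_device_handle_t hDevice,
                                     const char *path) {
  FILE *f = fopen(path, "rb");
  fseek(f, 0, SEEK_END);
  long size = ftell(f);
  fseek(f, 0, SEEK_SET);
  uint8_t *buf = (uint8_t *)malloc(size + 1);
  fread(buf, 1, size, f);
  buf[size] = '\0'; /* PTX must be passed as a null-terminated buffer */
  fclose(f);

  ur_program_handle_t hProgram = NULL;
  urProgramCreateWithBinary(hContext, hDevice, (size_t)size + 1, buf,
                            NULL, &hProgram);
  /* JIT compilation of the PTX to SASS happens here. */
  if (urProgramBuild(hContext, hProgram, NULL) != UR_RESULT_SUCCESS)
    fprintf(stderr, "urProgramBuild failed\n");
  free(buf);
  return hProgram;
}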
Using PTX Modules in UR
A PTX module will be loaded and JIT compiled for the necessary architecture at
urProgramBuild. If the PTX module has been generated for a compute capability
greater than the compute capability of the device, then urProgramBuild will
fail with the error CUDA_ERROR_NO_BINARY_FOR_GPU.
A PTX module passed to urProgramBuild must contain only one PTX file; separate PTX files must be handled as separate UR programs.
Arguments may be passed to the ptxjitcompiler via urProgramBuild. Currently maxrregcount is the only supported argument:
urProgramBuild(ctx, program, "maxrregcount=128");
Using SASS Modules in UR
A SASS module will be loaded and checked for compatibility at urProgramBuild. If the SASS module is incompatible with the device architecture then urProgramBuild will fail with the error CUDA_ERROR_NO_BINARY_FOR_GPU.
Using Mixed PTX/SASS Bundles in UR
Mixed PTX/SASS modules can be used to make a program with
urProgramCreateWithBinary. At urProgramBuild the CUDA driver will check
whether the bundled SASS is compatible with the active device. If the SASS is
compatible then the ur_program_handle_t will be built from the SASS, and if
not then the PTX will be used as a fallback and JIT compiled by the CUDA
driver. If both PTX and SASS are incompatible with the active device then
urProgramBuild will fail with the error CUDA_ERROR_NO_BINARY_FOR_GPU.
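One way to produce such a mixed bundle with nvcc (a sketch; file names are illustrative) is to request both SASS and PTX for the same virtual architecture:

$ nvcc -fatbin -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80 vadd.cu -o vadd.fatbin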
UR Kernels
Once urProgramCreateWithBinary and urProgramBuild have succeeded, kernels can be fetched from programs with urKernelCreate. urKernelCreate must be called with the exact name of the kernel in the PTX/SASS module. This name will depend on the mangling used when compiling the kernel, so it is recommended to examine the symbols in the PTX/SASS module before trying to extract kernels in UR.
$ cuobjdump --dump-elf-symbols hello.cubin | grep mykernel
_Z13mykernelv
At present it is not possible to query the names of the kernels in a UR program for CUDA, so it is necessary to know the (mangled or otherwise) names of kernels in advance or to obtain them by some other means.
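Given the mangled name found above, a kernel handle can be fetched with urKernelCreate, as in this sketch (hProgram is assumed to be a successfully built program):

ur_kernel_handle_t hKernel = NULL;
urKernelCreate(hProgram, "_Z13mykernelv", &hKernel);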
UR kernels can be dispatched with urEnqueueKernelLaunch. The argument pGlobalWorkOffset can only be used if the kernels have been instrumented to take the extra global offset argument. Use of the global offset is not recommended for non-SYCL compiler toolchains. This parameter can be ignored if the user does not wish to use the global offset.
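A minimal launch sketch, assuming the urEnqueueKernelLaunch signature from older ur_api.h releases (queue, kernel, work dimensions, offset/size pointers, then the event wait list; check your UR version) and a pre-existing hQueue:

size_t globalSize = 1024;
size_t localSize = 128;
urEnqueueKernelLaunch(hQueue, hKernel, 1,
                      NULL,        /* pGlobalWorkOffset: unused here */
                      &globalSize, /* pGlobalWorkSize */
                      &localSize,  /* pLocalWorkSize */
                      0, NULL, NULL);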
Local Memory Arguments
In UR local memory is a region of memory shared by all the work-items in a work-group. A kernel function signature can include local memory address space pointer arguments, which are set by the user with urKernelSetArgLocal, passing the number of bytes of local memory to allocate and make available from the pointer argument.
The CUDA adapter implements local memory in a kernel as a single __shared__ memory allocation; each individual local memory argument is a u32 byte-offset kernel parameter which is combined inside the kernel with the __shared__ memory allocation. Therefore, for N local arguments that need to be set on a kernel with urKernelSetArgLocal, the total aligned size across the N calls to urKernelSetArgLocal is calculated for the __shared__ memory allocation by the CUDA adapter and passed as the sharedMemBytes argument to cuLaunchKernel (or variants like cuLaunchCooperativeKernel or cuGraphAddKernelNode).
For each kernel u32 local memory offset parameter, aligned offsets into the single memory location are calculated and passed at runtime by the adapter via kernelParams when launching the kernel (or adding the kernel as a graph node). When a user calls urKernelSetArgLocal with an argument index that has already been set on the kernel, the adapter recalculates the size of the __shared__ memory allocation and the offset for that index, as well as the offsets of any local memory arguments at following indices.
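For example (a sketch, assuming urKernelSetArgLocal takes the kernel, argument index, byte size, and an optional properties pointer; check ur_api.h for your UR version), two local memory arguments might be set like this, with the adapter combining them into one __shared__ allocation and passing each as an aligned byte offset:

urKernelSetArgLocal(hKernel, 0, 256 * sizeof(float), NULL); /* offset 0 */
urKernelSetArgLocal(hKernel, 1, 64 * sizeof(int), NULL);    /* aligned offset after arg 0 */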
Warning
The CUDA UR adapter implementation of local memory assumes the kernel has been created by DPC++, which instruments the device code so that local memory arguments are offsets rather than pointers.
Other Notes
- The environment variable SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE can be set in order to exceed the default maximum dynamic local memory size (see the example after this list).
- The size of primitive datatypes may differ in host and device code. For instance, NVCC treats long double as 8 bytes for device and 16 bytes for host.
- In-kernel printf for NVPTX targets does not support the %z modifier.
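For example, to raise the dynamic local memory limit to 64 KiB for one run (my_ur_app is a placeholder application name):

$ SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE=65536 ./my_ur_app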
Contributors
Hugh Delaney hugh.delaney@codeplay.com
Ewan Crawford ewan@codeplay.com