==========================
CUDA UR Reference Document
==========================

This document gives general guidelines on how to use Unified Runtime (UR) to
load and build programs, and to execute kernels on a CUDA device.

Device code
===========

A CUDA device image may be made of PTX and/or SASS, two different kinds of
device code for NVIDIA GPUs.

CUDA device images can be generated by a CUDA-capable compiler toolchain. Most
CUDA compiler toolchains are capable of generating PTX, SASS and/or bundles of
PTX and SASS.

When generating device code to be launched using Unified Runtime, it is
recommended to use a programming model with explicit kernel parameters, such as
OpenCL or CUDA. This is because kernels generated by a programming model with
implicit kernel parameters, such as SYCL, cannot guarantee any specific number
or ordering of kernel parameters. It has been observed that kernel signatures
for the same SYCL kernel may vary significantly when compiled for different
architectures.

PTX
---

PTX is a high-level NVIDIA ISA that can be JIT compiled at runtime by the CUDA
driver. In UR, this JIT compilation happens at :ref:`urProgramBuild`\, where PTX is
assembled into device-specific SASS which can then run on the device.

PTX is forward compatible, so PTX generated for ``.target sm_52`` will be JIT
compiled without issue for devices with a compute capability of ``sm_52`` or
greater. PTX generated for ``sm_80``, however, cannot be JIT compiled for an
``sm_60`` device.
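
The forward-compatibility rule above can be sketched as a small host-side
check. This is only an illustration and not part of UR: ``ComputeCapability``
and ``ptxCanJitFor`` are hypothetical names, and the check models just the
ordering rule described here (it ignores architecture-specific targets, such
as those with an ``a`` suffix).

.. code-block:: cpp

    #include <tuple>

    // Hypothetical helper type: a compute capability, e.g. sm_52 -> {5, 2}.
    struct ComputeCapability {
      int Major;
      int Minor;
    };

    // Returns true if PTX generated for `PtxTarget` can be JIT compiled for a
    // device of capability `Device`, i.e. the device is at least as new as
    // the PTX target.
    inline bool ptxCanJitFor(ComputeCapability PtxTarget,
                             ComputeCapability Device) {
      return std::tie(PtxTarget.Major, PtxTarget.Minor) <=
             std::tie(Device.Major, Device.Minor);
    }

With this sketch, ``ptxCanJitFor({5, 2}, {8, 0})`` is true, while
``ptxCanJitFor({8, 0}, {6, 0})`` is false.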

An advantage of using PTX over SASS is that the same code can run on multiple
devices. However, PTX generated for an older architecture may not give access
to newer hardware instructions, such as new atomic operations or tensor core
instructions.

JIT compilation has some overhead at :ref:`urProgramBuild`\, especially if the program
that is being loaded contains multiple kernels. The ``ptxjitcompiler`` keeps a
JIT cache, however, so this overhead is only paid the first time that a program
is built. JIT caching may be turned off by setting the environment variable
``CUDA_CACHE_DISABLE=1``.
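
When measuring JIT overhead it can be convenient to disable the cache from
within the application rather than from the shell. The following is a minimal
sketch under two assumptions: a POSIX system (for ``setenv``), and that the
variable is set before the CUDA driver is first initialized in the process.
``disableCudaJitCache`` is a hypothetical helper, not a UR or CUDA function.

.. code-block:: cpp

    #include <cstdlib>

    // Sets CUDA_CACHE_DISABLE=1 for the current process. Assumption: this is
    // called before any UR or CUDA call initializes the driver, since the
    // variable is expected to be read at initialization time.
    inline bool disableCudaJitCache() {
      // setenv is POSIX; it returns 0 on success.
      return setenv("CUDA_CACHE_DISABLE", "1", /*overwrite=*/1) == 0;
    }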

SASS
----

SASS is a device-specific binary which may be produced by ``ptxas`` or some
other tool. SASS is specific to an individual architecture and is not portable
across architectures.

A SASS file may be stored as a ``.cubin`` file by NVIDIA tools.

UR Programs
===========

A ``ur_program_handle_t`` has a one-to-one mapping with the CUDA driver object
`CUmodule <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MODULE.html#group__CUDA__MODULE>`_.

In UR for CUDA, a ``ur_program_handle_t`` can be created using
:ref:`urProgramCreateWithBinary` with:

* A single PTX module, stored as a null-terminated ``uint8_t`` buffer.
* A single SASS module, stored as an opaque ``uint8_t`` buffer.
* A mixed PTX/SASS module, where the SASS module is the assembled PTX module.
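
For the PTX case, the buffer requirement can be sketched as follows.
``loadPtxImage`` is a hypothetical helper, not part of UR. It only prepares the
bytes: the resulting pointer and size would then be handed to
:ref:`urProgramCreateWithBinary`\, whose exact parameters depend on the UR
version in use and are therefore not shown.

.. code-block:: cpp

    #include <cstdint>
    #include <fstream>
    #include <iterator>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Reads a PTX text file into a null-terminated byte buffer.
    inline std::vector<std::uint8_t> loadPtxImage(const std::string &Path) {
      std::ifstream File(Path, std::ios::binary);
      if (!File)
        throw std::runtime_error("cannot open " + Path);
      std::vector<std::uint8_t> Image((std::istreambuf_iterator<char>(File)),
                                      std::istreambuf_iterator<char>());
      // Append the terminator only if the file did not already end with one.
      if (Image.empty() || Image.back() != 0)
        Image.push_back(0);
      return Image;
    }

A SASS (``.cubin``) image could be read in the same way without the terminator,
since it is treated as an opaque buffer.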

A ``ur_program_handle_t`` is valid only for a single architecture. If a
CUDA-compatible binary contains device code for multiple NVIDIA architectures,
it is the user's responsibility to split it into separate device images so that
:ref:`urProgramCreateWithBinary` is only called with a device binary for a
single device architecture.

If a program is large and contains many kernels, loading and/or JIT compiling
the program may have a high overhead. This can be mitigated by splitting a
program into multiple smaller programs (corresponding to PTX/SASS files). In
this way, an application will only pay the overhead of loading/compiling
kernels that it will likely use.

Using PTX Modules in UR
-----------------------

A PTX module will be loaded and JIT compiled for the necessary architecture at
:ref:`urProgramBuild`\. If the PTX module has been generated for a compute capability
greater than the compute capability of the device, then :ref:`urProgramBuild` will
fail with the error ``CUDA_ERROR_NO_BINARY_FOR_GPU``.

A PTX module passed to :ref:`urProgramBuild` must contain only one PTX file.
Separate PTX files are to be handled separately.

Arguments may be passed to the ``ptxjitcompiler`` via :ref:`urProgramBuild`\.
Currently ``maxrregcount`` is the only supported argument.

.. parsed-literal::

   :ref:`urProgramBuild`\(ctx, program, "maxrregcount=128");


Using SASS Modules in UR
------------------------

A SASS module will be loaded and checked for compatibility at :ref:`urProgramBuild`\.
If the SASS module is incompatible with the device architecture then
:ref:`urProgramBuild` will fail with the error ``CUDA_ERROR_NO_BINARY_FOR_GPU``.

Using Mixed PTX/SASS Bundles in UR
----------------------------------

Mixed PTX/SASS modules can be used to make a program with
:ref:`urProgramCreateWithBinary`\. At :ref:`urProgramBuild` the CUDA driver will check
whether the bundled SASS is compatible with the active device. If the SASS is
compatible then the ``ur_program_handle_t`` will be built from the SASS, and if
not then the PTX will be used as a fallback and JIT compiled by the CUDA
driver. If both PTX and SASS are incompatible with the active device then
:ref:`urProgramBuild` will fail with the error ``CUDA_ERROR_NO_BINARY_FOR_GPU``.
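
The selection order described above can be modelled with a short sketch. This
is not the CUDA driver's implementation: ``DeviceImage`` and ``selectImage``
are hypothetical names, architectures are written as plain integers (``80``
for ``sm_80``), and SASS is treated as compatible only on an exact architecture
match, which is a simplifying assumption.

.. code-block:: cpp

    #include <vector>

    enum class ImageKind { PTX, SASS };

    // Hypothetical description of one device image inside a mixed bundle.
    struct DeviceImage {
      ImageKind Kind;
      int Arch; // e.g. 80 for sm_80
    };

    // Prefer compatible SASS, otherwise fall back to PTX that can be JIT
    // compiled for the device, otherwise report that nothing is usable.
    inline const DeviceImage *
    selectImage(const std::vector<DeviceImage> &Bundle, int DeviceArch) {
      for (const DeviceImage &Image : Bundle)
        if (Image.Kind == ImageKind::SASS && Image.Arch == DeviceArch)
          return &Image;
      for (const DeviceImage &Image : Bundle)
        if (Image.Kind == ImageKind::PTX && Image.Arch <= DeviceArch)
          return &Image;
      // Corresponds to the CUDA_ERROR_NO_BINARY_FOR_GPU case.
      return nullptr;
    }

For a bundle holding ``sm_70`` PTX and ``sm_80`` SASS, this sketch picks the
SASS on an ``sm_80`` device, the PTX on an ``sm_75`` device, and nothing on an
``sm_60`` device.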

UR Kernels
==========

Once :ref:`urProgramCreateWithBinary` and :ref:`urProgramBuild` have succeeded, kernels
can be fetched from programs with :ref:`urKernelCreate`\. :ref:`urKernelCreate` must be
called with the exact name of the kernel in the PTX/SASS module. This name will
depend on the mangling used when compiling the kernel, so it is recommended to
examine the symbols in the PTX/SASS module before trying to extract kernels in
UR.

.. code-block:: console

    $ cuobjdump --dump-elf-symbols hello.cubin | grep mykernel
    _Z8mykernelv

At present it is not possible to query the names of the kernels in a UR program
for CUDA, so it is necessary to know the (mangled or otherwise) names of kernels
in advance or by some other means.
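
For the simplest kernels the expected symbol can be worked out by hand. The
sketch below assumes the toolchain uses Itanium C++ ABI name mangling and
covers only a global-namespace kernel that takes no arguments.
``mangleNullaryKernel`` is a hypothetical helper, not a UR function. Kernels
with parameters, namespaces or templates need the full mangling scheme, so
inspecting the module remains the more reliable approach.

.. code-block:: cpp

    #include <string>

    // Itanium C++ ABI mangling for a global-namespace function with no
    // parameters: "_Z", the length of the name, the name itself, then "v"
    // for the empty parameter list.
    inline std::string mangleNullaryKernel(const std::string &Name) {
      return "_Z" + std::to_string(Name.size()) + Name + "v";
    }

Here ``mangleNullaryKernel("mykernel")`` returns ``_Z8mykernelv``. A kernel
declared ``extern "C"`` in CUDA source is not mangled at all, so its name can
be passed to :ref:`urKernelCreate` unchanged.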

UR kernels can be dispatched with :ref:`urEnqueueKernelLaunch`\. The argument
``pGlobalWorkOffset`` can only be used if the kernels have been instrumented to
take the extra global offset argument. Use of the global offset is not
recommended for non-SYCL compiler toolchains. This parameter can be ignored if
the user does not wish to use the global offset.

Other Notes
===========

- The environment variable ``SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE`` can be set in
  order to exceed the default max dynamic local memory size. More information
  can be found in the
  `DPC++ environment variable documentation <https://intel.github.io/llvm-docs/EnvironmentVariables.html#controlling-dpc-cuda-plugin>`_.
- The size of primitive datatypes may differ in host and device code. For
  instance, NVCC treats ``long double`` as 8 bytes for device and 16 bytes for
  host.
- In-kernel ``printf`` for NVPTX targets does not support the ``%z`` modifier.
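
One way to guard against the datatype size mismatch noted above is to restrict
data shared between host and device to types whose size is the same on both
sides, and to check the layout at compile time. A minimal sketch follows;
``KernelParams`` is a hypothetical struct used only for illustration.

.. code-block:: cpp

    #include <cstdint>

    // Hypothetical argument block shared between host and device code. Only
    // types with the same size on both sides are used.
    struct KernelParams {
      std::uint64_t NumElements; // fixed width, unlike long or size_t
      double Scale;              // double rather than long double
    };

    // Fails to compile if the host layout is not the expected 16 bytes.
    static_assert(sizeof(KernelParams) == 16,
                  "unexpected size or padding in KernelParams");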

Contributors
------------

* Hugh Delaney `hugh.delaney@codeplay.com <hugh.delaney@codeplay.com>`_