AMD HIP UR Reference Document
This document gives general guidelines on how to use UR to execute kernels on an AMD HIP device.
Device code
Unlike the NVPTX platform, AMDGPU does not use a device IR that can be JIT compiled at runtime. Therefore, all device binaries must be precompiled for a particular arch.
The naming of AMDGPU device code files may vary across different generations of devices. .hsa or .hsaco are common extensions as of 2023.
HIPCC can generate device code for a particular arch using the --genco flag:
$ hipcc --genco hello.cu --amdgpu-target=gfx906 -o hello.hsaco
When generating device code to be launched using Unified Runtime, it is recommended to use a programming model with explicit kernel parameters, such as OpenCL or HIP. This is because kernels generated by a programming model with implicit kernel parameters, such as SYCL, cannot guarantee any specific number or ordering of kernel parameters. It has been observed that kernel signatures for the same SYCL kernel may vary significantly when compiled for different architectures.
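As a minimal sketch, a HIP kernel with explicit kernel parameters suitable for this workflow might look like the following. The kernel name and its arguments are illustrative placeholders rather than part of any UR interface.

// hello.cu -- a minimal HIP kernel with explicit kernel parameters.
// The kernel name and arguments are placeholders for illustration.
#include <hip/hip_runtime.h>

__global__ void scale(float *out, const float *in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = 2.0f * in[i];
  }
}

A kernel such as this can be compiled to a single-arch device code module with the hipcc invocation shown above.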
UR Programs
A ur_program_handle_t has a one-to-one mapping with the HIP runtime object hipModule_t.
In UR for HIP, a ur_program_handle_t can be created using urProgramCreateWithBinary with a single device code module.
A ur_program_handle_t is valid only for a single architecture. If a HIP compatible binary contains device code for multiple AMDGPU architectures, it is the user’s responsibility to split these separate device images so that urProgramCreateWithBinary is only called with a device binary for a single device arch.
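As a rough sketch, and assuming a ur_context_handle_t and ur_device_handle_t have already been obtained and that pBinary/binarySize refer to the contents of a single-arch module such as hello.hsaco, program creation and building could look like the following. The exact urProgramCreateWithBinary signature varies between UR versions, so check ur_api.h for the version you are targeting.

// Illustrative sketch only: hContext, hDevice, pBinary and binarySize are
// assumed to exist already. The signatures shown here may differ between
// UR versions; consult ur_api.h.
ur_program_handle_t hProgram = nullptr;
ur_result_t err = urProgramCreateWithBinary(
    hContext, hDevice, binarySize, pBinary,
    /*pProperties=*/nullptr, &hProgram);
if (err == UR_RESULT_SUCCESS) {
  // Loads the module for the device architecture; this is where an
  // incompatible binary is reported.
  err = urProgramBuild(hContext, hProgram, /*pOptions=*/nullptr);
}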
If the AMDGPU module is incompatible with the device arch, then urProgramBuild will fail with the error hipErrorNoBinaryForGpu.
If a program is large and contains many kernels, loading the program may have a high overhead. This can be mitigated by splitting a program into multiple smaller programs. In this way, an application will only pay the overhead of loading kernels that it will likely use.
Kernels
Once urProgramCreateWithBinary and urProgramBuild have succeeded, kernels can be fetched from programs with urKernelCreate. urKernelCreate must be called with the exact name of the kernel in the AMDGPU device code module. This name will depend on the mangling used when compiling the kernel, so it is recommended to examine the symbols in the AMDGPU device code module before trying to extract kernels in UR code.
llvm-objdump or readelf may not correctly display the symbols in an AMDGPU device module. It may be necessary to call clang-offload-bundler first in order to extract the ELF file that can be passed to readelf.
$ clang-offload-bundler --unbundle --input=hello.hsaco --output=hello.o --targets=hipv4-amdgcn-amd-amdhsa--gfx906 --type=o
$ readelf hello.o -s | grep mykernel
_Z13mykernelv
At present it is not possible to query the names of the kernels in a UR program for HIP, so it is necessary to know the (mangled or otherwise) names of kernels in advance or by some other means.
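With a symbol name in hand, a kernel can be created from a built program. The following is a minimal sketch, reusing the hProgram handle from the earlier sketch and the mangled name from the readelf output above.

// Illustrative sketch: fetch a kernel from a built program by its exact
// (mangled) symbol name.
ur_kernel_handle_t hKernel = nullptr;
ur_result_t err = urKernelCreate(hProgram, "_Z13mykernelv", &hKernel);
if (err != UR_RESULT_SUCCESS) {
  // The name may not match any symbol in the device code module.
}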
UR kernels can be dispatched with urEnqueueKernelLaunch. The argument pGlobalWorkOffset can only be used if the kernels have been instrumented to take the extra global offset argument. Use of the global offset is not recommended for non-SYCL compiler toolchains. This parameter can be ignored if the user does not wish to use the global offset.
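A sketch of a one-dimensional launch with no global offset is shown below; hQueue and hKernel are assumed to exist already, and the parameter layout follows the urEnqueueKernelLaunch entry point in ur_api.h at the time of writing.

// Illustrative sketch: 1D launch of 1024 work-items in work-groups of 64,
// with the global offset left unused.
const size_t globalSize[1] = {1024};
const size_t localSize[1] = {64};
ur_event_handle_t hEvent = nullptr;
ur_result_t err = urEnqueueKernelLaunch(
    hQueue, hKernel, /*workDim=*/1,
    /*pGlobalWorkOffset=*/nullptr,
    globalSize, localSize,
    /*numEventsInWaitList=*/0, /*phEventWaitList=*/nullptr, &hEvent);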
Local Memory Arguments
In UR, local memory is a region of memory shared by all the work-items in a work-group. A kernel function signature can include local memory address space pointer arguments, which are set by the user with urKernelSetArgLocal with the number of bytes of local memory to allocate and make available from the pointer argument.
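For example, the following sketch reserves space for 64 floats of local memory for the kernel argument at index 0; the index and size are placeholders, and whether urKernelSetArgLocal takes a trailing properties parameter depends on the UR version.

// Illustrative sketch: request 64 floats of work-group local memory for
// argument index 0. The trailing properties parameter may not exist in
// older UR versions.
ur_result_t err = urKernelSetArgLocal(hKernel, /*argIndex=*/0,
                                      /*argSize=*/64 * sizeof(float),
                                      /*pProperties=*/nullptr);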
The HIP adapter implements local memory in a kernel as a single __shared__ memory allocation, and each individual local memory argument is a u32 byte offset kernel parameter which is combined inside the kernel with the __shared__ memory allocation. Therefore, for N local arguments that need to be set on a kernel with urKernelSetArgLocal, the total aligned size across the N calls to urKernelSetArgLocal is calculated for the __shared__ memory allocation by the HIP adapter and passed as the sharedMemBytes argument to hipModuleLaunchKernel or hipGraphAddKernelNode.
For each kernel u32 local memory offset parameter, aligned offsets into the single memory location are calculated and passed at runtime by the adapter via kernelParams when launching the kernel (or adding the kernel as a graph node). When a user calls urKernelSetArgLocal with an argument index that has already been set on the kernel, the adapter recalculates the size of the __shared__ memory allocation and offset for the index, as well as the offsets of any local memory arguments at following indices.
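The following standalone sketch, which is not the adapter's actual code, illustrates how such aligned offsets and the total __shared__ size could be computed; the power-of-two alignment policy is an assumption made for illustration.

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: given the byte sizes requested by each call to
// urKernelSetArgLocal, compute a u32 offset per argument and the total
// size that would be passed as sharedMemBytes. Assumes a power-of-two
// alignment, which may not match the adapter's real policy.
static std::vector<uint32_t> computeLocalOffsets(const std::vector<size_t> &sizes,
                                                 size_t alignment,
                                                 size_t &totalSize) {
  std::vector<uint32_t> offsets;
  size_t offset = 0;
  for (size_t size : sizes) {
    // Round the running offset up to the next multiple of the alignment.
    offset = (offset + alignment - 1) & ~(alignment - 1);
    offsets.push_back(static_cast<uint32_t>(offset));
    offset += size;
  }
  totalSize = offset;
  return offsets;
}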
Warning
The HIP UR adapter implementation of local memory assumes the kernel has been created by DPC++, which instruments the device code so that local memory arguments are offsets rather than pointers.
HIP kernels that are generated for DPC++ kernels with SYCL local accessors contain extra value arguments on top of the local memory argument for the local accessor. For each urKernelSetArgLocal argument, a user needs to make 3 calls to urKernelSetArgValue with each of the next 3 consecutive argument indices. This represents a 3-dimensional offset into the local accessor.
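A minimal sketch of this call pattern follows, assuming the local accessor occupies argument index 0 and its three extra value arguments follow at indices 1 to 3; the sizes and zero values here are placeholders, since the exact contents of the extra arguments are determined by the DPC++ ABI.

// Illustrative sketch of the call pattern only. The local memory size, the
// argument indices and the zero placeholder values are assumptions; the
// real values depend on the kernel produced by DPC++.
size_t offsetComponent = 0;
urKernelSetArgLocal(hKernel, /*argIndex=*/0, /*argSize=*/1024, nullptr);
for (uint32_t dim = 0; dim < 3; ++dim) {
  // One value argument per dimension at the next three consecutive indices.
  urKernelSetArgValue(hKernel, /*argIndex=*/1 + dim, sizeof(size_t), nullptr,
                      &offsetComponent);
}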
Other Notes
In-kernel printf may not work for certain ROCm versions.
Contributors
Hugh Delaney hugh.delaney@codeplay.com
Ewan Crawford ewan@codeplay.com