Programming Guide#

Platforms and Devices#

The oneAPI Unified Runtime API architecture exposes both physical and logical abstractions of the underlying devices' capabilities. Devices, sub-devices and memory are exposed at the physical level, while command queues, events and synchronization methods are defined as logical entities. All logical entities are bound to device-level physical capabilities.

Device discovery APIs enumerate an accelerator's functional features. These APIs provide an interface to query information such as the number of compute units within a device or sub-device, the available memory and its affinity to the compute units, the user-managed cache size, and the work-submission command queues.

Platforms#

A platform object represents a collection of physical devices in the system accessed by the same driver.

  • The application may query the number of platforms installed on the system, and their respective handles, using urPlatformGet.

  • More than one platform may be available in the system. For example, one platform may support two GPUs from one vendor, another platform supports a GPU from a different vendor, and finally a different platform may support an FPGA.

  • Platform objects are read-only, global constructs; i.e. multiple calls to urPlatformGet will return identical platform handles.

  • A platform handle is primarily used during device discovery and during creation and management of contexts.

Device#

A device object represents a physical device in the system that supports the platform.

  • The application may query the number of devices supported by a platform, and their respective handles, using urDeviceGet.

  • Device objects are read-only, global constructs; i.e. multiple calls to urDeviceGet will return identical device handles.

  • A device handle is primarily used during creation and management of resources that are specific to a device.

  • A device may expose sub-devices that allow finer-grained control of physical or logical partitions of the device.

The following diagram illustrates the relationship between the platform, device, context and other objects described in this document.

../_images/runtime_object_hier.png

Initialization and Discovery#

// Discover all available adapters
uint32_t adapterCount = 0;
urAdapterGet(0, nullptr, &adapterCount);
std::vector<ur_adapter_handle_t> adapters(adapterCount);
urAdapterGet(adapterCount, adapters.data(), nullptr);

// Discover all the platform instances
uint32_t platformCount = 0;
urPlatformGet(adapters.data(), adapterCount, 0, nullptr, &platformCount);

std::vector<ur_platform_handle_t> platforms(platformCount);
urPlatformGet(adapters.data(), adapterCount, platformCount, platforms.data(), nullptr);

// Get the number of GPU devices in the platform
uint32_t deviceCount = 0;
urDeviceGet(platforms[0], UR_DEVICE_TYPE_GPU, 0, nullptr, &deviceCount);

// Get handles of all GPU devices in the platform
std::vector<ur_device_handle_t> devices(deviceCount);
urDeviceGet(platforms[0], UR_DEVICE_TYPE_GPU, deviceCount, devices.data(), nullptr);

Device handle lifetime#

Device objects are reference-counted using urDeviceRetain and urDeviceRelease. The ref-count of a device is automatically incremented when the device is obtained by urDeviceGet. Once the application no longer needs a device, it must call urDeviceRelease. When the ref-count of the underlying device handle becomes zero, that device object is deleted. Note that a Unified Runtime adapter may internally increment and decrement a device's ref-count, so after the call to urDeviceRelease below the device may stay active until other objects using it, such as a command-queue, are deleted. However, an application may not use the device after it releases its last reference.

// Get the handle of the first GPU device in the platform
ur_device_handle_t hDevice;
urDeviceGet(hPlatform, UR_DEVICE_TYPE_GPU, 1, &hDevice, nullptr);
urDeviceRelease(hDevice);

Retrieve info about device#

urDeviceGetInfo can return various information about the device. When the size of the info is only known at runtime, two calls are needed: the first retrieves the size, the second retrieves the info itself.

// Size is known beforehand
ur_device_type_t deviceType;
urDeviceGetInfo(hDevice, UR_DEVICE_INFO_TYPE, sizeof(ur_device_type_t), &deviceType, nullptr);

// Size is only known at runtime
size_t infoSize;
urDeviceGetInfo(hDevice, UR_DEVICE_INFO_NAME, 0, nullptr, &infoSize);

std::string deviceName;
deviceName.resize(infoSize);
urDeviceGetInfo(hDevice, UR_DEVICE_INFO_NAME, infoSize, deviceName.data(), nullptr);

Device partitioning into sub-devices#

urDevicePartition partitions a device into sub-devices. The exact representation and characteristics of the sub-devices are device specific, but normally each represents a fixed part of the parent device that can be programmed explicitly and individually.

ur_device_handle_t hDevice;
ur_device_partition_property_t prop;
prop.type = UR_DEVICE_PARTITION_BY_AFFINITY_DOMAIN;
prop.value.affinity_domain = UR_DEVICE_AFFINITY_DOMAIN_FLAG_NEXT_PARTITIONABLE;

ur_device_partition_properties_t properties{
    UR_STRUCTURE_TYPE_DEVICE_PARTITION_PROPERTIES,
    nullptr,
    &prop,
    1,
};

uint32_t count = 0;
std::vector<ur_device_handle_t> subDevices;
urDevicePartition(hDevice, &properties, 0, nullptr, &count);

if (count > 0) {
    subDevices.resize(count);
    urDevicePartition(hDevice, &properties, count, subDevices.data(), nullptr);
}

The returned sub-devices may be requested for further partitioning into sub-sub-devices, and so on. An implementation will return “0” in the count if no further partitioning is supported.

uint32_t count = 0;
urDevicePartition(subDevices[0], &properties, 0, nullptr, &count);
if (count == 0) {
    // no further partitioning allowed
}

Contexts#

Contexts serve the purpose of resource sharing (between devices in the same context), and resource isolation (ensuring that resources do not cross context boundaries). Resources such as memory allocations, events, and programs are explicitly created against a context.

// Get the handle of the first GPU device in the platform
ur_device_handle_t hDevice;
urDeviceGet(hPlatform, UR_DEVICE_TYPE_GPU, 1, &hDevice, nullptr);

// Create a context
ur_context_handle_t hContext;
urContextCreate(1, &hDevice, nullptr, &hContext);

// Operations on this context
// ...

// Release the context handle
urContextRelease(hContext);

Programs and Kernels#

Two constructs are needed to prepare code for execution on the device:

  • Programs serve as containers for device code. They typically encapsulate a collection of functions and global variables represented in an intermediate language, and one or more device-native binaries compiled from that collection.

  • Kernels represent a handle to a function within a program that can be launched on a device.

Programs#

Programs can be constructed with an intermediate language binary or a device-native binary. Programs constructed with IL must be further compiled through either urProgramCompile and urProgramLink or urProgramBuild before they can be used to create a kernel object.

// Create a program with IL
ur_program_handle_t hProgram;
urProgramCreateWithIL(hContext, ILBin, ILBinSize, nullptr, &hProgram);

// Build the program.
urProgramBuild(hContext, hProgram, nullptr);

The diagram below shows the possible paths to obtaining a program that can be used to create a kernel:

../_images/programs.png

Kernels#

A kernel object is a reference to a kernel function within a program; it holds both explicit and implicit kernel arguments along with the data needed for launch.

// Create kernel object from program
ur_kernel_handle_t hKernel;
urKernelCreate(hProgram, "addVectors", &hKernel);
urKernelSetArgMemObj(hKernel, 0, nullptr, A);
urKernelSetArgMemObj(hKernel, 1, nullptr, B);
urKernelSetArgMemObj(hKernel, 2, nullptr, C);

Queue and Enqueue#

Queue objects are used to submit work to a given device. Kernels and commands are submitted to a queue for execution using enqueue entry points such as urEnqueueKernelLaunch and urEnqueueMemBufferWrite. Enqueued kernels and commands execute in order or out of order depending on whether the UR_QUEUE_FLAG_OUT_OF_ORDER_EXEC_MODE_ENABLE property was set when the queue was created. An out-of-order queue may internally schedule work to achieve concurrency on the device, while honouring the event dependencies passed to each enqueue command.

// Create an out-of-order queue for hDevice in hContext
ur_queue_properties_t props{
    UR_STRUCTURE_TYPE_QUEUE_PROPERTIES,
    nullptr,
    UR_QUEUE_FLAG_OUT_OF_ORDER_EXEC_MODE_ENABLE,
};
ur_queue_handle_t hQueue;
urQueueCreate(hContext, hDevice, &props, &hQueue);

// Launch a kernel with 3D workspace partitioning
const uint32_t nDim = 3;
const size_t gWorkOffset[] = {0, 0, 0};
const size_t gWorkSize[] = {128, 128, 128};
const size_t lWorkSize[] = {1, 8, 8};
urEnqueueKernelLaunch(hQueue, hKernel, nDim, gWorkOffset, gWorkSize, lWorkSize, 0, nullptr, nullptr);

Queue object lifetime#

Queue objects are reference-counted. If an application or thread needs to retain access to a queue created by another application or thread, it can call urQueueRetain. An application must call urQueueRelease when a queue object is no longer needed. When a queue object’s reference count becomes zero, it is deleted by the runtime.
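As a sketch, assuming a queue handle hQueue obtained earlier from urQueueCreate and shared with a second component, the retain/release pattern looks like:

```cpp
// Hand the queue to another component, which takes its own reference
urQueueRetain(hQueue);  // ref-count is now 2

// ... the other component submits work to hQueue ...

urQueueRelease(hQueue); // the other component drops its reference
urQueueRelease(hQueue); // ref-count reaches zero, the queue object is deleted
```

Each holder of the queue pairs one urQueueRetain (or the creating urQueueCreate) with exactly one urQueueRelease.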

Native Driver Access#

The runtime API provides accessors for native handles. For example, given a ur_program_handle_t, we can call urProgramGetNativeHandle to retrieve a ur_native_handle_t, which can then be converted to the underlying driver handle; on an OpenCL adapter this corresponds to a cl_program. The inverse entry point, urProgramCreateWithNativeHandle, creates a ur_program_handle_t from an existing native handle.
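A minimal sketch of this flow, assuming hProgram is backed by an OpenCL adapter:

```cpp
// Retrieve the adapter-native handle backing the UR program
ur_native_handle_t hNative;
urProgramGetNativeHandle(hProgram, &hNative);

// On an OpenCL adapter the native handle corresponds to a cl_program
// (backend-specific; verify the platform's backend before casting)
cl_program clProgram = reinterpret_cast<cl_program>(hNative);
```

The cast is only valid for the backend that produced the handle; other adapters return their own driver types.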

Memory#

UR Mem Handles#

A ur_mem_handle_t can represent an untyped memory buffer object, created by urMemBufferCreate, or a memory image object, created by urMemImageCreate. A ur_mem_handle_t manages the internal allocation and deallocation of native memory objects across all devices in a ur_context_handle_t. A ur_mem_handle_t may only be used by queues that share the same ur_context_handle_t.

If multiple queues in the same ur_context_handle_t use the same ur_mem_handle_t across dependent commands, a dependency must be defined by the user using the enqueue entry point’s phEventWaitList parameter. Provided that dependencies are explicitly passed to UR entry points, a UR adapter will manage memory migration of native memory objects across all devices in a context, if memory migration is indeed necessary in the backend API.

// Q1 and Q2 are both in hContext; size, pSrc and pDst are placeholders
ur_mem_handle_t hBuffer;
urMemBufferCreate(hContext, UR_MEM_FLAG_READ_WRITE, size, nullptr, &hBuffer);
ur_event_handle_t outEv;
urEnqueueMemBufferWrite(Q1, hBuffer, false, 0, size, pSrc, 0, nullptr, &outEv);
urEnqueueMemBufferRead(Q2, hBuffer, true, 0, size, pDst, 1, &outEv /*phEventWaitList*/, nullptr);

As such, the buffer written to in urEnqueueMemBufferWrite can be successfully read using urEnqueueMemBufferRead from another queue in the same context, since the event associated with the write operation has been passed as a dependency to the read operation.

Memory Pooling#

The urUSMPoolCreate function explicitly creates a memory pool and returns a ur_usm_pool_handle_t, which can then be passed to urUSMDeviceAlloc, urUSMHostAlloc and urUSMSharedAlloc. Allocations that specify different pool handles must be isolated and must not reside on the same page. A memory pool is subject to the limits specified during pool creation.

Even if no ur_usm_pool_handle_t is provided to an allocation function, each adapter may still perform memory pooling.
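A sketch of explicit pool usage, assuming hContext and hDevice from earlier sections (the zero flags value and the 4 KiB allocation size are illustrative only):

```cpp
// Create a pool in hContext
ur_usm_pool_desc_t poolDesc{UR_STRUCTURE_TYPE_USM_POOL_DESC, nullptr, 0};
ur_usm_pool_handle_t hPool;
urUSMPoolCreate(hContext, &poolDesc, &hPool);

// Allocate device memory from the pool
void *ptr = nullptr;
urUSMDeviceAlloc(hContext, hDevice, nullptr /*pUSMDesc*/, hPool, 4096, &ptr);

// Free the allocation, then release the pool
urUSMFree(hContext, ptr);
urUSMPoolRelease(hPool);
```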