CUDA* and SYCL* Programming Model Comparison#
This section compares the CUDA* and SYCL* programming models and shows how to map concepts and APIs from CUDA to SYCL.
Execution Model#
Kernel Function#
In CUDA*, a kernel function is defined using the `__global__` declaration specifier and is executed concurrently across multiple threads on the GPU. Functions that are called by a CUDA kernel must be qualified with the `__device__` specifier. The `__host__` declaration specifier is used to qualify functions that can be called from host code running on the CPU.
In contrast, a SYCL* kernel is a function that is executed on SYCL-capable devices, such as CPUs, GPUs, or FPGAs. These kernels are launched from host code and are executed concurrently on SYCL devices. Unlike CUDA, SYCL kernel functions do not require special declaration specifiers and are defined using standard C++ syntax.
The following table shows CUDA and SYCL equivalencies for defining kernel functions:
| CUDA | SYCL |
|---|---|
| `__global__ void foo() {}` | `void foo() {}` |
| `__device__ void foo() {}` | `void foo() {}` |
| `__host__ void foo() {}` | `void foo() {}` |
| `__host__ __device__ void foo() {}` | `void foo() {}` |
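For example, the following is a minimal SYCL-only sketch showing that the same unqualified C++ function can be called from host code and from inside a kernel submitted with `parallel_for` (the function name `twice` and the trivial kernel are illustrative, not part of the original example):

```cpp
#include <sycl/sycl.hpp>

// No __host__, __device__, or __global__ specifiers are needed:
// an ordinary C++ function can be used on the host and inside a kernel.
int twice(int x) { return 2 * x; }

int main() {
  int host_result = twice(21);  // host call

  sycl::queue q;
  int *out = sycl::malloc_shared<int>(1, q);
  // The lambda passed to parallel_for is the SYCL kernel.
  q.parallel_for(sycl::range<1>(1), [=](sycl::id<1>) { out[0] = twice(21); });
  q.wait();
  int device_result = out[0];
  sycl::free(out, q);
  return host_result - device_result;  // 0 if both calls agree
}
```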
If the macro `__CUDA_ARCH__` is used to differentiate the code path in a CUDA `__host__ __device__` function, you can use the macro `__SYCL_DEVICE_ONLY__` in SYCL to achieve similar functionality.
The following shows CUDA and SYCL equivalencies for specifying device functions:
CUDA:

    __host__ __device__ int foo(int i) {
    #ifdef __CUDA_ARCH__
      return i + 1;
    #else
      return 0;
    #endif
    }

    __global__ void kernel() {
      foo(0);
    }

    int main() {
      return foo(0);
    }

SYCL:

    int foo(int i) {
    #ifdef __SYCL_DEVICE_ONLY__
      return i + 1;
    #else
      return 0;
    #endif
    }

    void kernel() {
      foo(0);
    }

    int main() {
      return foo(0);
    }
Execution Hierarchy#
In both CUDA and SYCL programming models, the kernel execution instances are organized hierarchically to exploit parallelism effectively. In CUDA, these instances are called threads; in SYCL, they are referred to as work-items. CUDA threads can be organized into blocks, which in turn can be organized into grids. SYCL work-items can be organized into work-groups, which in turn can be organized into ND-ranges.
From a hardware perspective, when a CUDA kernel is executed on a GPU, the Streaming Multiprocessors (SMs) create, manage, schedule, and execute threads in groups of 32 threads, known as warps. In comparison, in SYCL a sub-group represents a collection of related work-items within a work-group that execute concurrently.
To migrate CUDA code to SYCL code, the CUDA execution hierarchy can be mapped to the SYCL hierarchy as shown in the following table:
| CUDA | SYCL |
|---|---|
| thread | work-item |
| warp | sub-group |
| block | work-group |
| grid | ND-range |
Thread Indexing#
As previously discussed, CUDA threads and SYCL work-items are organized in a hierarchical manner.
CUDA contains built-in variables to support thread indexing:

- Thread ID: `threadIdx.x/y/z`
- Block ID: `blockIdx.x/y/z`
- Block dimensions: `blockDim.x/y/z`
- Grid dimensions: `gridDim.x/y/z`

SYCL provides the equivalent values through member functions of `sycl::nd_item`:

- Work-item ID: `nd_item.get_local_id(0/1/2)`
- Work-group ID: `nd_item.get_group(0/1/2)`
- Work-group dimensions: `nd_item.get_local_range().get(0/1/2)`
- ND-range dimensions: `nd_item.get_group_range(0/1/2)`
According to the execution space index linearization sections of the CUDA C++ Programming Guide and the SYCL 2020 Specification, given a block or work-group with shape (Dx, Dy, Dz) and an element with ID (x, y, z), the linear index is x + y * Dx + z * Dx * Dy in CUDA and z + y * Dz + x * Dz * Dy in SYCL. This discrepancy is due in part to SYCL's better alignment with C++'s multidimensional indexing. Because of this difference in index computation, the CUDA execution space right-most dimension z is mapped to the SYCL left-most dimension x, as shown in the following table:
| CUDA | SYCL |
|---|---|
| `threadIdx.x/y/z` | `nd_item.get_local_id(2/1/0)` |
| `blockIdx.x/y/z` | `nd_item.get_group(2/1/0)` |
| `blockDim.x/y/z` | `nd_item.get_local_range().get(2/1/0)` |
| `gridDim.x/y/z` | `nd_item.get_group_range(2/1/0)` |
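For example, the common CUDA expression for a one-dimensional global index, `blockIdx.x * blockDim.x + threadIdx.x`, corresponds to the following sketch (the function name `index_example` is illustrative only):

```cpp
#include <sycl/sycl.hpp>

// CUDA: size_t i = blockIdx.x * blockDim.x + threadIdx.x;
// SYCL equivalent using dimension 2 of a 3-D nd_item:
void index_example(sycl::nd_item<3> item) {
  size_t i = item.get_group(2) * item.get_local_range().get(2) +
             item.get_local_id(2);
  // SYCL also exposes the same value directly:
  size_t j = item.get_global_id(2);
  (void)i;
  (void)j;
}
```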
Kernel Launch#
CUDA uses the `<<<...>>>` execution configuration syntax and the `dim3` type to specify the dimensions and sizes of grids and blocks. The function call operator, in conjunction with the execution configuration syntax, is used to submit a kernel function to a stream for execution. When a kernel is submitted, an index space is defined, with each thread and block receiving a unique thread ID and block ID. These IDs are used to compute the index within the index space, and they are accessible within the kernel through built-in variables.
SYCL uses the `parallel_for` member function of `sycl::queue` and `sycl::range` to specify the dimensions and sizes of ND-ranges and work-groups. Users can also apply the SYCL kernel attribute `[[sycl::reqd_sub_group_size(dim)]]` to indicate that a kernel must be compiled and executed with the specified sub-group size. Each device supports only certain sub-group sizes, as defined by `info::device::sub_group_sizes`.
The following shows the original CUDA code for a kernel launch example and its migration to SYCL.

Original CUDA Code:

    __global__ void foo() {
      int a = threadIdx.x;
    }

    int main() {
      dim3 size_1(100, 200, 300);
      dim3 size_2(5, 10, 20);
      foo<<<size_1, size_2>>>();
    }

Migrated SYCL Code:

    void foo(sycl::nd_item<3> item) {
      int a = item.get_local_id(2);
    }

    int main() {
      sycl::queue q;
      sycl::range<3> size_1(300, 200, 100);
      sycl::range<3> size_2(20, 10, 5);
      q.parallel_for(
          sycl::nd_range<3>(size_1 * size_2, size_2),
          [=](sycl::nd_item<3> item) [[sycl::reqd_sub_group_size(32)]] {
            foo(item);
          });
    }
Make note of the following details in the migrated SYCL code:

- In the constructor of `sycl::nd_range`, the first parameter is the global size expressed in work-items, not in work-groups. Thus the global size should be the product of `size_1` and `size_2` to align with CUDA.
- As described in the Thread Indexing section, the CUDA execution space right-most dimension z maps to the SYCL left-most dimension x. Thus in this example, the order of the dimension sizes must be reversed. A one-dimensional launch follows the same mapping, as shown in the sketch below.
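The following minimal sketch shows the same mapping for a one-dimensional launch, where no dimension reversal is needed (the kernel name `foo1d` and the sizes are illustrative only):

```cpp
#include <sycl/sycl.hpp>

// Illustrative 1-D kernel: each work-item reads its global index.
void foo1d(sycl::nd_item<1> item) {
  size_t i = item.get_global_id(0);
  (void)i;
}

int main() {
  sycl::queue q;
  // CUDA equivalent: foo1d<<<100, 256>>>();
  sycl::range<1> grid(100);   // number of work-groups (CUDA grid)
  sycl::range<1> block(256);  // work-items per work-group (CUDA block)
  q.parallel_for(sycl::nd_range<1>(grid * block, block),
                 [=](sycl::nd_item<1> item) { foo1d(item); });
  q.wait();
}
```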
Memory Model#
The CUDA and SYCL memory models are hierarchical structures. In CUDA, the model comprises multiple memory spaces, such as shared memory, global memory, constant memory, and unified memory. Shared memory allows for efficient communication within a thread block. Global memory is accessible to all threads across the device, providing a larger storage space but slower access compared to shared memory. Constant memory is a read-only space, suitable for storing unchanging data such as constants or lookup tables. Unified memory can be accessed on both the host and the device.
Similarly, in SYCL, local memory is shared within a work-group, global memory is accessible by all work-items, and shared memory is accessible on both the host and the device. Constant memory no longer appears in the device memory model of the SYCL 2020 Specification.
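CUDA shared memory (the `__shared__` qualifier) corresponds to SYCL local memory, which is typically requested with a `sycl::local_accessor` when the kernel is submitted. The following is a minimal sketch of this pattern; the sizes and the name `tile` are chosen only for illustration:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  q.submit([&](sycl::handler &cgh) {
    // Local (work-group) memory: the SYCL counterpart of a CUDA __shared__ array.
    sycl::local_accessor<float, 1> tile(sycl::range<1>(256), cgh);
    cgh.parallel_for(
        sycl::nd_range<1>(sycl::range<1>(1024), sycl::range<1>(256)),
        [=](sycl::nd_item<1> item) {
          size_t lid = item.get_local_id(0);
          tile[lid] = static_cast<float>(item.get_global_id(0));
          // Equivalent of __syncthreads(): make the tile visible to the group.
          sycl::group_barrier(item.get_group());
        });
  });
  q.wait();
}
```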
Global, Constant, and Unified Memory#
CUDA global memory and constant memory can be mapped to SYCL global memory. CUDA unified memory can be mapped to SYCL shared memory. To perform this migration, allocate memory through `sycl::malloc_device` or `sycl::malloc_shared`. For example:
Original CUDA Code:

    void foo() {
      int *mem1, *mem2;
      cudaMalloc(&mem1, 10);
      cudaMallocManaged(&mem2, 10);
    }

Migrated SYCL Code:

    void foo() {
      sycl::queue q;
      int *mem1, *mem2;
      mem1 = sycl::malloc_device<int>(10, q);
      mem2 = sycl::malloc_shared<int>(10, q);
    }
Note that in CUDA, the type specifiers `__device__`, `__constant__`, and `__managed__` can be used to declare a variable resident in global memory or unified memory. There is no direct equivalent in SYCL, but you can implement similar functionality with helper classes. A reference implementation for `dpct::global_memory`, `dpct::constant_memory`, and `dpct::shared_memory` is provided in the SYCLomatic project.
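Without those helper classes, one possible way to migrate a CUDA `__constant__` or `__device__` variable is to allocate device memory and copy the data into it before launching the kernel. The sketch below shows this approach under that assumption; it is not the dpct implementation, and the names `coeffs` and `kernel_using_coeffs` are illustrative only:

```cpp
#include <sycl/sycl.hpp>

// CUDA: __constant__ float coeffs[4];
// One possible SYCL mapping: a device allocation initialized from the host.
void kernel_using_coeffs(const float *coeffs, sycl::nd_item<1> item) {
  float c = coeffs[item.get_local_id(0) % 4];
  (void)c;
}

int main() {
  sycl::queue q;
  const float host_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
  float *coeffs = sycl::malloc_device<float>(4, q);
  // Analogue of cudaMemcpyToSymbol: fill the allocation before use.
  q.memcpy(coeffs, host_coeffs, sizeof(host_coeffs)).wait();
  q.parallel_for(sycl::nd_range<1>(sycl::range<1>(64), sycl::range<1>(64)),
                 [=](sycl::nd_item<1> item) { kernel_using_coeffs(coeffs, item); });
  q.wait();
  sycl::free(coeffs, q);
}
```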
CUDA Device API Mapping#
Synchronization API#
In CUDA, synchronization functions are used to coordinate the execution of different threads in a CUDA kernel. `__syncthreads()` blocks the execution of all threads in a thread block until all threads in that block have reached the function. Additionally, all global and shared memory accesses made by these threads prior to `__syncthreads()` are visible to all threads in the block. The CUDA `__syncthreads` function can be mapped to `sycl::group_barrier` (with a `sycl::group` object passed in). For the CUDA `__syncthreads_and`, `__syncthreads_or`, and `__syncthreads_count` function migration, an additional group algorithm is needed after `sycl::group_barrier`. The CUDA `__syncwarp` function can be mapped to `sycl::group_barrier` (with a `sycl::sub_group` object passed in).
The following table shows CUDA to SYCL mapping for synchronization functions:
| CUDA | SYCL |
|---|---|
| `__syncthreads()` | `sycl::group_barrier(group)` |
| `__syncthreads_and(pred)` | `sycl::group_barrier(group)` followed by `sycl::all_of_group(group, pred)` |
| `__syncthreads_or(pred)` | `sycl::group_barrier(group)` followed by `sycl::any_of_group(group, pred)` |
| `__syncthreads_count(pred)` | `sycl::group_barrier(group)` followed by `sycl::reduce_over_group(group, pred ? 1 : 0, sycl::plus<>())` |
| `__syncwarp()` | `sycl::group_barrier(sub_group)` |
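As a concrete illustration of the barrier-plus-group-algorithm pattern, the following sketch counts how many work-items in a work-group satisfy a predicate, similar to CUDA `__syncthreads_count` (the predicate and the function name are illustrative only):

```cpp
#include <sycl/sycl.hpp>

// Work-group-wide count of work-items whose predicate is true,
// analogous to CUDA __syncthreads_count(pred).
void count_example(sycl::nd_item<1> item) {
  auto grp = item.get_group();
  int pred = (item.get_local_id(0) % 2 == 0) ? 1 : 0;
  // Barrier: prior memory accesses become visible to the whole work-group.
  sycl::group_barrier(grp);
  // Collective reduction over the work-group.
  int count = sycl::reduce_over_group(grp, pred, sycl::plus<int>());
  (void)count;
}
```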
Memory Fence API#
Memory fence functions can be used to enforce some ordering on memory accesses. The memory fence functions differ in the scope in which the orderings are enforced. CUDA memory fence functions can be mapped to `sycl::atomic_fence` with different memory scopes.
The following table shows CUDA to SYCL mapping for fence functions:
| CUDA | SYCL |
|---|---|
| `__threadfence_block()` | `sycl::atomic_fence(sycl::memory_order::acq_rel, sycl::memory_scope::work_group)` |
| `__threadfence()` | `sycl::atomic_fence(sycl::memory_order::acq_rel, sycl::memory_scope::device)` |
| `__threadfence_system()` | `sycl::atomic_fence(sycl::memory_order::acq_rel, sycl::memory_scope::system)` |
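For example, the CUDA pattern of writing data, calling `__threadfence()`, and then setting a flag might be migrated as in the sketch below (the pointer names and values are illustrative only):

```cpp
#include <sycl/sycl.hpp>

// Device-scope fence between two global-memory stores,
// analogous to calling __threadfence() in CUDA.
void fence_example(int *data, int *flag, sycl::nd_item<1> item) {
  size_t i = item.get_global_id(0);
  data[i] = 42;
  // Order the data store before the flag store for all work-items on the device.
  sycl::atomic_fence(sycl::memory_order::acq_rel, sycl::memory_scope::device);
  flag[i] = 1;
}
```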
Warp Intrinsic API#
CUDA warp intrinsic functions, including warp vote and shuffle functions, can be mapped to the SYCL group algorithm API.
Warp Vote API#
The CUDA warp vote API can map to the SYCL group algorithm API, as shown in the following table:
| CUDA | SYCL |
|---|---|
| `__all_sync` | `sycl::all_of_group` |
| `__any_sync` | `sycl::any_of_group` |
| `__ballot_sync` | `sycl::reduce_over_group` |
For the sync versions of the CUDA warp intrinsic functions, a mask is passed that specifies the threads participating in the call. The equivalent SYCL API does not support a mask directly. Refer to the reference implementation of the mask-version APIs in the SYCLomatic project.
The following shows the original CUDA code for a warp vote example and its migration to SYCL.

Original CUDA Code:

    __device__ void foo() {
      __all_sync(mask, predicate);
    }

Migrated SYCL Code:

    void foo(sycl::nd_item<3> item) {
      auto g = item.get_sub_group();
      sycl::all_of_group(
          g, (~mask & (0x1 << g.get_local_linear_id())) || predicate);
    }
Warp Shuffle API#
CUDA warp shuffle functions can be mapped to the following SYCL group algorithms:
| CUDA | SYCL |
|---|---|
| `__shfl_sync` | `sycl::select_from_group` |
| `__shfl_up_sync` | `sycl::shift_group_right` |
| `__shfl_down_sync` | `sycl::shift_group_left` |
| `__shfl_xor_sync` | `sycl::permute_group_by_xor` |
CUDA shuffle functions can operate on a subset of the threads in a warp. The equivalent SYCL API does not support operations on a subset of a sub-group directly. Refer to the helper implementation in the SYCLomatic project.
The following shows the original CUDA code for a warp shuffle example and its migration to SYCL.

Original CUDA Code:

    __device__ void foo() {
      __shfl_sync(mask, val, r_id, 16);
    }

Migrated SYCL Code:

    void foo(sycl::nd_item<3> item) {
      auto g = item.get_sub_group();
      unsigned int start_index = (g.get_local_linear_id() / 16) * 16;
      sycl::select_from_group(g, val, start_index + r_id % 16);
    }
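A common CUDA pattern built on `__shfl_down_sync` is a warp-level reduction. The following is a minimal sketch of the same pattern using `sycl::shift_group_left`; in SYCL the whole loop can also be replaced by a single `sycl::reduce_over_group` call, and the function name is illustrative only:

```cpp
#include <sycl/sycl.hpp>

// Sub-group sum, analogous to a CUDA warp reduction that repeatedly
// calls __shfl_down_sync to halve the number of active lanes.
float subgroup_sum(float val, sycl::nd_item<1> item) {
  auto sg = item.get_sub_group();
  for (unsigned int offset = sg.get_local_range().get(0) / 2; offset > 0;
       offset /= 2) {
    val += sycl::shift_group_left(sg, val, offset);
  }
  // One-call alternative: sycl::reduce_over_group(sg, val, sycl::plus<float>());
  return val;  // work-item 0 of the sub-group holds the sum
}
```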
CUDA Host API Mapping#
Device Management#
The CUDA device management API can map to the SYCL device class and its member functions as follows:
| CUDA | SYCL |
|---|---|
| `cudaGetDeviceCount` | `sycl::device::get_devices().size()` |
| `cudaSetDevice` | Construct a `sycl::queue` on the selected `sycl::device` |
| `cudaGetDeviceProperties` | `sycl::device::get_info()` |
| `cudaDeviceGetAttribute` | `sycl::device::get_info()` |
The following shows the original CUDA code for a device management example and its migration to SYCL.

Original CUDA Code:

    int device_count;
    cudaGetDeviceCount(&device_count);
    for (int i = 0; i < device_count; i++) {
      cudaSetDevice(i);
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, i);
      int warp_size = prop.warpSize;
      ...
      kernel<<<size_1, size_2>>>();
    }

Migrated SYCL Code:

    auto devices = sycl::device::get_devices(sycl::info::device_type::gpu);
    for (auto &device : devices) {
      sycl::queue q(device);
      auto sub_group_sizes =
          device.get_info<sycl::info::device::sub_group_sizes>();
      ...
      q.parallel_for(sycl::nd_range<3>(size_1 * size_2, size_2),
                     [=](sycl::nd_item<3> item) {
                       kernel(item);
                     });
    }
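When the CUDA code simply calls `cudaSetDevice` once to pick a GPU, the usual SYCL counterpart is to construct the queue with a device selector or with one device from `get_devices`. A minimal sketch, with the device index chosen only for illustration:

```cpp
#include <sycl/sycl.hpp>

int main() {
  // Option 1: let a built-in selector pick a GPU (SYCL 2020).
  sycl::queue q1{sycl::gpu_selector_v};

  // Option 2: pick an explicit device, similar to cudaSetDevice(index).
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  if (!gpus.empty()) {
    sycl::queue q2{gpus[0]};  // index 0 is illustrative
  }
}
```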
Stream Management#
The CUDA stream management API can map to the SYCL queue class and its member functions as follows:
| CUDA | SYCL |
|---|---|
| `cudaStreamCreate` | Constructor of `sycl::queue` |
| `cudaStreamDestroy` | Destructor of `sycl::queue` |
| `cudaStreamSynchronize` | `sycl::queue::wait()` |
| `cudaStreamWaitEvent` | `sycl::queue::ext_oneapi_submit_barrier()` |
| `cudaStreamAddCallback` | `std::async()` |
The following shows the original CUDA code for a stream management example and its migration to SYCL.

Original CUDA Code:

    void callback(cudaStream_t st, cudaError_t status, void *vp) { ... }

    void test() {
      cudaStream_t stream;
      cudaEvent_t event;
      cudaStreamCreate(&stream);
      cudaStreamAddCallback(stream, callback, 0, 0);
      cudaStreamSynchronize(stream);
      cudaStreamWaitEvent(stream, event, 0);
      cudaStreamDestroy(stream);
    }

Migrated SYCL Code:

    void callback(sycl::queue st, int status, void *vp) { ... }

    void test() {
      sycl::queue q;
      sycl::event event;
      std::async([&]() {
        q.wait();
        callback(q, 0, 0);
      });
      q.wait();
      q.ext_oneapi_submit_barrier({event});
    }
Note that the constructor of `sycl::queue` creates a queue with the out-of-order property by default. Pass `sycl::property::queue::in_order` to the queue constructor to create an in-order queue, as shown in the sketch below.
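A minimal sketch of both constructions:

```cpp
#include <sycl/sycl.hpp>

int main() {
  // Out-of-order queue (default): independent commands may overlap.
  sycl::queue q_default;

  // In-order queue: commands execute in submission order, which is closer
  // to the semantics of a single CUDA stream.
  sycl::queue q_in_order{sycl::property::queue::in_order{}};
}
```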
Memory Management#
The CUDA memory management API can map to the SYCL USM pointer-based memory management API as follows:
| CUDA | SYCL |
|---|---|
| `cudaMalloc` | `sycl::malloc_device` |
| `cudaMallocHost` | `sycl::malloc_host` |
| `cudaMallocManaged` | `sycl::malloc_shared` |
| `cudaMemset` | `sycl::queue::memset` |
| `cudaMemcpy` | `sycl::queue::memcpy` |
| `cudaFree`, `cudaFreeHost` | `sycl::free` |
The following shows the original CUDA code for a memory management example and its migration to SYCL.

Original CUDA Code:

    void test() {
      int *dev_ptr, *host_ptr, *shared_ptr;
      int size;
      ...
      cudaMalloc(&dev_ptr, size);
      cudaMallocHost(&host_ptr, size);
      cudaMallocManaged(&shared_ptr, size);
      cudaMemset(dev_ptr, 0, size);
      cudaMemcpy(host_ptr, dev_ptr, size, cudaMemcpyDeviceToHost);
      cudaMemcpy(shared_ptr, host_ptr, size, cudaMemcpyHostToDevice);
      ...
      int a = shared_ptr[0];
      ...
      cudaFree(dev_ptr);
      cudaFreeHost(host_ptr);
      cudaFree(shared_ptr);
    }

Migrated SYCL Code:

    void test() {
      sycl::queue q;
      int *dev_ptr, *host_ptr, *shared_ptr;
      int size;
      ...
      dev_ptr = (int *)sycl::malloc_device(size, q);
      host_ptr = (int *)sycl::malloc_host(size, q);
      shared_ptr = (int *)sycl::malloc_shared(size, q);
      q.memset(dev_ptr, 0, size).wait();
      q.memcpy(host_ptr, dev_ptr, size).wait();
      q.memcpy(shared_ptr, host_ptr, size).wait();
      ...
      int a = shared_ptr[0];
      ...
      sycl::free(dev_ptr, q);
      sycl::free(host_ptr, q);
      sycl::free(shared_ptr, q);
    }
Error Handling#
In the CUDA runtime library, error handling relies mainly on the error codes returned by API calls. In SYCL, synchronous errors are reported immediately by the runtime throwing an exception; use try-catch statements to catch and process these exceptions. For example:
Original CUDA Code:

    void test() {
      int *ptr;
      if (cudaMalloc(&ptr, sizeof(int))) {
        std::cout << "error" << std::endl;
      }
    }

Migrated SYCL Code:

    void test() try {
      int *ptr;
      sycl::queue q;
      ptr = sycl::malloc_device<int>(1, q);
    } catch (sycl::exception const &exc) {
      std::cerr << exc.what() << "Exception caught at file:" << __FILE__
                << ", line:" << __LINE__ << std::endl;
      std::exit(1);
    }
In SYCL, asynchronous errors are not reported immediately as they occur. The queue can optionally take an asynchronous handler at construction, with an `exception_list` as a parameter. Invocation of an `async_handler` may be triggered by the queue member functions `queue::wait_and_throw()`, `queue::throw_asynchronous()`, or automatically on destruction of a queue that contains unconsumed asynchronous errors. When invoked, an `async_handler` receives an `exception_list` argument containing a list of exception objects representing any unconsumed asynchronous errors associated with the queue or context. The following example shows one implementation of an asynchronous exception handler, migrated to SYCL:
Original CUDA Code:

    void test() {
      int *ptr;
      kernel<<<1, 1>>>();
      if (cudaDeviceSynchronize()) {
        std::cout << "error" << std::endl;
      }
    }

Migrated SYCL Code:

    auto exception_handler = [](sycl::exception_list exceptions) {
      for (std::exception_ptr const &e : exceptions) {
        try {
          std::rethrow_exception(e);
        } catch (sycl::exception const &e) {
          std::cerr << "Caught asynchronous SYCL exception:" << std::endl
                    << e.what() << std::endl
                    << "Exception caught at file:" << __FILE__
                    << ", line:" << __LINE__ << std::endl;
        }
      }
    };

    void test() {
      sycl::queue q{exception_handler};
      q.parallel_for(sycl::nd_range<3>(size_1 * size_2, size_2),
                     [=](sycl::nd_item<3> item) {
                       kernel(item);
                     }).wait_and_throw();
    }