Radix Sort#
radix_sort Function Templates#
The radix_sort
function sorts data using the radix sort algorithm.
The sorting is stable, preserving the relative order of elements with equal keys.
Both in-place and out-of-place overloads are provided. Out-of-place overloads do not alter the input sequence.
The functions implement a Onesweep* 1 algorithm variant.
A synopsis of the radix_sort
function is provided below:
// defined in <oneapi/dpl/experimental/kernel_templates>
namespace oneapi::dpl::experimental::kt::gpu::esimd {
// Sort in-place
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename Iterator>
sycl::event
radix_sort (sycl::queue q, Iterator first, Iterator last,
KernelParam param); // (1)
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename Range>
sycl::event
radix_sort (sycl::queue q, Range&& r, KernelParam param); // (2)
// Sort out-of-place
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename Iterator1,
typename Iterator2>
sycl::event
radix_sort (sycl::queue q, Iterator1 first, Iterator1 last,
Iterator2 first_out, KernelParam param); // (3)
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename Range1, typename Range2>
sycl::event
radix_sort (sycl::queue q, Range1&& r, Range2&& r_out,
KernelParam param); // (4)
}
Note
The radix_sort
is currently available only for Intel® Data Center GPU Max Series,
and requires Intel® oneAPI DPC++/C++ Compiler 2023.2 or newer.
Template Parameters#
Name |
Description |
---|---|
|
The sort order. Ascending: |
|
The number of bits to sort for each radix sort algorithm pass. |
Parameters#
Name |
Description |
---|---|
|
The SYCL* queue where kernels are submitted. |
|
The sequences to apply the algorithm to. Supported sequence types:
|
|
A kernel_param object.
Its |
Type Requirements:
The element type of sequence(s) to sort must be a C++ integral or floating-point type other than
bool
with a width of up to 64 bits.
Note
Current limitations:
Number of elements to sort must not exceed 2^30.
RadixBits
can only be 8.param.workgroup_size
can only be 64.
Return Value#
A sycl::event
object representing the status of the algorithm execution.
Usage Examples#
In-Place Example#
// possible build and run commands:
// icpx -fsycl radix_sort.cpp -o radix_sort -I /path/to/oneDPL/include && ./radix_sort
#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>
#include <oneapi/dpl/experimental/kernel_templates>
namespace kt = oneapi::dpl::experimental::kt;
int main()
{
std::size_t n = 6;
sycl::queue q{sycl::gpu_selector_v};
std::uint32_t* keys = sycl::malloc_shared<std::uint32_t>(n, q);
// initialize
keys[0] = 3, keys[1] = 2, keys[2] = 1, keys[3] = 5, keys[4] = 3, keys[5] = 3;
// sort
auto e = kt::gpu::esimd::radix_sort<false, 8>(q, keys, keys + n, kt::kernel_param<416, 64>{}); // (1)
e.wait();
// print
for(std::size_t i = 0; i < n; ++i)
std::cout << keys[i] << ' ';
std::cout << '\n';
sycl::free(keys, q);
return 0;
}
Output:
5 3 3 3 2 1
Out-of-Place Example#
// possible build and run commands:
// icpx -fsycl radix_sort.cpp -o radix_sort -I /path/to/oneDPL/include && ./radix_sort
#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>
#include <oneapi/dpl/experimental/kernel_templates>
namespace kt = oneapi::dpl::experimental::kt;
int main()
{
std::size_t n = 6;
sycl::queue q{sycl::gpu_selector_v};
std::uint32_t* keys = sycl::malloc_shared<std::uint32_t>(n, q);
std::uint32_t* keys_out = sycl::malloc_shared<std::uint32_t>(n, q);
// initialize
keys[0] = 3, keys[1] = 2, keys[2] = 1, keys[3] = 5, keys[4] = 3, keys[5] = 3;
// sort
auto e = kt::gpu::esimd::radix_sort<false, 8>(q, keys, keys + n, keys_out, kt::kernel_param<416, 64>{}); // (3)
e.wait();
// print
for(std::size_t i = 0; i < n; ++i)
std::cout << keys[i] << ' ';
std::cout << '\n';
for(std::size_t i = 0; i < n; ++i)
std::cout << keys_out[i] << ' ';
std::cout << '\n';
sycl::free(keys, q);
sycl::free(keys_out, q);
return 0;
}
Output:
3 2 1 5 3 3
5 3 3 3 2 1
Memory Requirements#
The algorithm uses global and local device memory (see SYCL 2020 Specification)
for intermediate data storage. For the algorithm to operate correctly, there must be enough memory on the device.
If there is not enough global device memory, a std::bad_alloc
exception is thrown.
The behavior is undefined if there is not enough local memory.
The amount of memory that is required depends on input data and configuration parameters, as described below.
Global Memory Requirements#
Global memory is used for copying the input sequence(s) and storing internal data such as radix value counters. The used amount depends on many parameters; below is an upper bound approximation:
Nkeys + C * Nkeys
where the sequence with keys takes Nkeys space, and the additional space is C * Nkeys.
The value of C depends on param.data_per_workitem
, param.workgroup_size
, and RadixBits
.
For param.data_per_workitem
set to 32, param.workgroup_size
to 64, and RadixBits
to 8,
C approximately equals to 1.
Incrementing RadixBits
increases C up to twice, while doubling either
param.data_per_workitem
or param.workgroup_size
leads to a halving of C.
Note
If the number of elements to sort does not exceed param.data_per_workitem * param.workgroup_size
,
radix_sort
is executed by a single work-group and does not use any global memory.
Local Memory Requirements#
Local memory is used for reordering keys within a work-group, and for storing internal data such as radix value counters. The used amount depends on many parameters; below is an upper bound approximation:
Nkeys_per_workgroup + C
where Nkeys_per_workgroup is the amount of memory to store keys. C is some additional space for storing internal data.
Nkeys_per_workgroup equals to sizeof(key_type) * param.data_per_workitem * param.workgroup_size
,
C does not exceed 4KB.
Recommended Settings for Best Performance#
The general advice is to choose kernel parameters based on performance measurements and profiling information. The initial configuration may be selected according to these high-level guidelines:
When the number of elements to sort (N) is small (~16K or less) and the algorithm is
radix_sort
, generally sorting is done more efficiently by a single work-group. Increase theparam
values to makeN <= param.data_per_workitem * param.workgroup_size
.When the number of elements to sort
N
is between 16K and 1M, utilizing all available compute cores is key for better performance. Allow creating enough work chunks to feed all Xe-cores 2 on a GPU:param.data_per_workitem * param.workgroup_size ≈ N / xe_core_count
.When the number of elements to sort is large (more than ~1M), maximizing the number of elements processed by a work-group, which equals to
param.data_per_workitem * param.workgroup_size
, reduces synchronization overheads between work-groups and usually benefits the overall performance.
Warning
Avoid setting too large param.data_per_workitem
and param.workgroup_size
values.
Make sure that Memory requirements are satisfied.
Note
param.data_per_workitem
is the only available parameter to tune the performance,
since param.workgroup_size
currently supports only one value (64).
- 1
Andy Adinets and Duane Merrill (2022). Onesweep: A Faster Least Significant Digit Radix Sort for GPUs. https://arxiv.org/abs/2206.01784.
- 2
The Xe-core term is described in the oneAPI GPU Optimization Guide. Check the number of cores in the device specification, such as Intel® Data Center GPU Max specification.