Radix Sort#
radix_sort and radix_sort_by_key Function Templates#
The radix_sort
and radix_sort_by_key
functions sort data using the radix sort algorithm.
The sorting is stable, ensuring the preservation of the relative order of elements with equal keys.
The functions implement a Onesweep* 1 algorithm variant. Both in-place and out-of-place
overloads are provided. For out-of-place overloads, the input data order is preserved.
A synopsis of the radix_sort
and radix_sort_by_key
functions is provided below:
// defined in <oneapi/dpl/experimental/kernel_templates>
namespace oneapi::dpl::experimental::kt::gpu::esimd {
// Sort a single sequence
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename Iterator>
sycl::event
radix_sort (sycl::queue q, Iterator first, Iterator last,
KernelParam param); // (1)
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename Range>
sycl::event
radix_sort (sycl::queue q, Range&& r, KernelParam param); // (2)
// Sort a single sequence out-of-place
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename Iterator1,
typename Iterator2>
sycl::event
radix_sort (sycl::queue q, Iterator1 first, Iterator1 last,
Iterator2 first_out, KernelParam param) // (3)
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename Range1, typename Range2>
sycl::event
radix_sort (sycl::queue q, Range1&& r, Range2&& r_out,
KernelParam param) // (4)
// Sort a sequence of keys and apply the same order to a sequence of values
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename Iterator1, typename Iterator2>
sycl::event
radix_sort_by_key (sycl::queue q, Iterator1 keys_first, Iterator1 keys_last,
Iterator2 values_first, KernelParam param); // (5)
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename KeysRng, typename ValuesRng>
sycl::event
radix_sort_by_key (sycl::queue q, KeysRng&& keys,
ValuesRng&& values, KernelParam param); // (6)
// Sort a sequence of keys and values out-of-place
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename KeysIterator1,
typename ValsIterator1, typename KeysIterator2,
typename ValsIterator2>
sycl::event
radix_sort_by_key (sycl::queue q, KeysIterator1 keys_first,
KeysIterator1 keys_last, ValsIterator1 vals_first,
KeysIterator2 keys_out_first, ValsIterator2 vals_out_first,
KernelParam param) // (7)
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename KeysRng1, typename ValsRng1,
typename KeysRng2, typename ValsRng2>
sycl::event
radix_sort_by_key (sycl::queue q, KeysRng1&& keys, ValsRng1&& values,
KeysRng2&& keys_out, ValsRng2&& vals_out,
KernelParam param) // (8)
}
Template Parameters#
Name |
Description |
---|---|
|
The sort order. Ascending: |
|
The number of bits to sort for each radix sort algorithm pass. |
Parameters#
Name |
Description |
---|---|
|
The SYCL* queue where kernels are submitted. |
|
The sequences to apply the algorithm to. Supported sequence types:
|
|
A kernel_param object.
Its |
Type Requirements:
The element type of sequence(s) to sort must be a C++ integral or floating-point type other than
bool
with a width of up to 64 bits.
Note
Current limitations:
Number of elements to sort must not exceed 2^30.
RadixBits
can only be 8.param.workgroup_size
can only be 64.
Return Value#
A sycl::event
object representing the status of the algorithm execution.
Usage Examples#
radix_sort In-Place Example#
// possible build and run commands:
// icpx -fsycl radix_sort.cpp -o radix_sort -I /path/to/oneDPL/include && ./radix_sort
#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>
#include <oneapi/dpl/experimental/kernel_templates>
namespace kt = oneapi::dpl::experimental::kt;
int main()
{
std::size_t n = 6;
sycl::queue q{sycl::gpu_selector_v};
std::uint32_t* keys = sycl::malloc_shared<std::uint32_t>(n, q);
// initialize
keys[0] = 3, keys[1] = 2, keys[2] = 1, keys[3] = 5, keys[4] = 3, keys[5] = 3;
// sort
auto e = kt::gpu::esimd::radix_sort<false, 8>(q, keys, keys + n, kt::kernel_param<416, 64>{}); // (1)
e.wait();
// print
for(std::size_t i = 0; i < n; ++i)
std::cout << keys[i] << ' ';
std::cout << '\n';
sycl::free(keys, q);
return 0;
}
Output:
5 3 3 3 2 1
radix_sort_by_key In-Place Example#
// possible build and run commands:
// icpx -fsycl radix_sort_by_key.cpp -o radix_sort_by_key -I /path/to/oneDPL/include && ./radix_sort_by_key
#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>
#include <oneapi/dpl/experimental/kernel_templates>
namespace kt = oneapi::dpl::experimental::kt;
int main()
{
std::size_t n = 6;
sycl::queue q{sycl::gpu_selector_v};
sycl::buffer<std::uint32_t> keys{sycl::range<1>(n)};
sycl::buffer<char> values{sycl::range<1>(n)};
// initialize
{
sycl::host_accessor k_acc{keys, sycl::write_only};
k_acc[0] = 3, k_acc[1] = 2, k_acc[2] = 1, k_acc[3] = 5, k_acc[4] = 3, k_acc[5] = 3;
sycl::host_accessor v_acc{values, sycl::write_only};
v_acc[0] = 'r', v_acc[1] = 'o', v_acc[2] = 's', v_acc[3] = 'd', v_acc[4] = 't', v_acc[5] = 'e';
}
// sort
auto e = kt::gpu::esimd::radix_sort_by_key<true, 8>(q, keys, values, kt::kernel_param<96, 64>{}); // (6)
e.wait();
// print
{
sycl::host_accessor k_acc{keys, sycl::read_only};
for(std::size_t i = 0; i < n; ++i)
std::cout << k_acc[i] << ' ';
std::cout << '\n';
sycl::host_accessor v_acc{values, sycl::read_only};
for(std::size_t i = 0; i < n; ++i)
std::cout << v_acc[i] << ' ';
std::cout << '\n';
}
return 0;
}
Output:
1 2 3 3 3 5
s o r t e d
radix_sort Out-of-Place Example#
// possible build and run commands:
// icpx -fsycl radix_sort.cpp -o radix_sort -I /path/to/oneDPL/include && ./radix_sort
#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>
#include <oneapi/dpl/experimental/kernel_templates>
namespace kt = oneapi::dpl::experimental::kt;
int main()
{
std::size_t n = 6;
sycl::queue q{sycl::gpu_selector_v};
std::uint32_t* keys = sycl::malloc_shared<std::uint32_t>(n, q);
std::uint32_t* keys_out = sycl::malloc_shared<std::uint32_t>(n, q);
// initialize
keys[0] = 3, keys[1] = 2, keys[2] = 1, keys[3] = 5, keys[4] = 3, keys[5] = 3;
// sort
auto e = kt::gpu::esimd::radix_sort<false, 8>(q, keys, keys + n, keys_out, kt::kernel_param<416, 64>{}); // (3)
e.wait();
// print
for(std::size_t i = 0; i < n; ++i)
std::cout << keys[i] << ' ';
std::cout << '\n';
for(std::size_t i = 0; i < n; ++i)
std::cout << keys_out[i] << ' ';
std::cout << '\n';
sycl::free(keys, q);
sycl::free(keys_out, q);
return 0;
}
Output:
3 2 1 5 3 3
5 3 3 3 2 1
radix_sort_by_key Out-of-Place Example#
// possible build and run commands:
// icpx -fsycl radix_sort_by_key.cpp -o radix_sort_by_key -I /path/to/oneDPL/include && ./radix_sort_by_key
#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>
#include <oneapi/dpl/experimental/kernel_templates>
namespace kt = oneapi::dpl::experimental::kt;
int main()
{
std::size_t n = 6;
sycl::queue q{sycl::gpu_selector_v};
sycl::buffer<std::uint32_t> keys{sycl::range<1>(n)};
sycl::buffer<std::uint32_t> keys_out{sycl::range<1>(n)};
sycl::buffer<char> values{sycl::range<1>(n)};
sycl::buffer<char> values_out{sycl::range<1>(n)};
// initialize
{
sycl::host_accessor k_acc{keys, sycl::write_only};
k_acc[0] = 3, k_acc[1] = 2, k_acc[2] = 1, k_acc[3] = 5, k_acc[4] = 3, k_acc[5] = 3;
sycl::host_accessor v_acc{values, sycl::write_only};
v_acc[0] = 'r', v_acc[1] = 'o', v_acc[2] = 's', v_acc[3] = 'd', v_acc[4] = 't', v_acc[5] = 'e';
}
// sort
auto e = kt::gpu::esimd::radix_sort_by_key<true, 8>(q, keys, values, keys_out, values_out,
kt::kernel_param<96, 64>{}); // (8)
e.wait();
// print
{
sycl::host_accessor k_acc{keys, sycl::read_only};
for(std::size_t i = 0; i < n; ++i)
std::cout << k_acc[i] << ' ';
std::cout << '\n';
sycl::host_accessor v_acc{values, sycl::read_only};
for(std::size_t i = 0; i < n; ++i)
std::cout << v_acc[i] << ' ';
std::cout << "\n\n";
sycl::host_accessor k_out_acc{keys_out, sycl::read_only};
for(std::size_t i = 0; i < n; ++i)
std::cout << k_out_acc[i] << ' ';
std::cout << '\n';
sycl::host_accessor v_out_acc{values_out, sycl::read_only};
for(std::size_t i = 0; i < n; ++i)
std::cout << v_out_acc[i] << ' ';
std::cout << '\n';
}
return 0;
}
Output:
3 2 1 5 3 3
r o s d t e
1 2 3 3 3 5
s o r t e d
Memory Requirements#
The algorithms use global and local device memory (see SYCL 2020 Specification) for intermediate data storage. For the algorithms to operate correctly, there must be enough memory on the device; otherwise, the behavior is undefined. The amount of memory that is required depends on input data and configuration parameters, as described below.
Global Memory Requirements#
Global memory is used for copying the input sequence(s) and storing internal data such as radix value counters. The used amount depends on many parameters; below is an upper bound approximation:
radix_sort
Nkeys + C * Nkeys
radix_sort_by_key
Nkeys + Nvalues + C * Nkeys
where the sequence with keys takes Nkeys space, the sequence with values takes Nvalues space, and the additional space is C * Nkeys.
The value of C depends on param.data_per_workitem
, param.workgroup_size
, and RadixBits
.
For param.data_per_workitem
set to 32, param.workgroup_size
to 64, and RadixBits
to 8,
C approximately equals to 1.
Incrementing RadixBits
increases C up to twice, while doubling either
param.data_per_workitem
or param.workgroup_size
leads to a halving of C.
Note
If the number of elements to sort does not exceed param.data_per_workitem * param.workgroup_size
,
radix_sort
is executed by a single work-group and does not use any global memory.
For radix_sort_by_key
there is no single work-group implementation yet.
Local Memory Requirements#
Local memory is used for reordering keys or key-value pairs within a work-group, and for storing internal data such as radix value counters. The used amount depends on many parameters; below is an upper bound approximation:
radix_sort
Nkeys_per_workgroup + C
radix_sort_by_key
Nkeys_per_workgroup + Nvalues_per_workgroup + C
where Nkeys_per_workgroup and Nvalues_per_workgroup are the amounts of memory to store keys and values, respectively. C is some additional space for storing internal data.
Nkeys_per_workgroup equals to sizeof(key_type) * param.data_per_workitem * param.workgroup_size
,
Nvalues_per_workgroup equals to sizeof(value_type) * param.data_per_workitem * param.workgroup_size
,
C does not exceed 4KB.
Recommended Settings for Best Performance#
The general advice is to choose kernel parameters based on performance measurements and profiling information. The initial configuration may be selected according to these high-level guidelines:
When the number of elements to sort (N) is small (~16K or less) and the algorithm is
radix_sort
, generally sorting is done more efficiently by a single work-group. Increase theparam
values to makeN <= param.data_per_workitem * param.workgroup_size
.When the number of elements to sort
N
is between 16K and 1M, utilizing all available compute cores is key for better performance. Allow creating enough work chunks to feed all Xe-cores 2 on a GPU:param.data_per_workitem * param.workgroup_size ≈ N / xe_core_count
.When the number of elements to sort is large (more than ~1M), maximizing the number of elements processed by a work-group, which equals to
param.data_per_workitem * param.workgroup_size
, reduces synchronization overheads between work-groups and usually benefits the overall performance.
Warning
Avoid setting too large param.data_per_workitem
and param.workgroup_size
values.
Make sure that Memory requirements are satisfied.
Note
param.data_per_workitem
is the only available parameter to tune the performance,
since param.workgroup_size
currently supports only one value (64).
- 1
Andy Adinets and Duane Merrill (2022). Onesweep: A Faster Least Significant Digit Radix Sort for GPUs. Retrieved from https://arxiv.org/abs/2206.01784.
- 2
The Xe-core term is described in the oneAPI GPU Optimization Guide. Check the number of cores in the device specification, such as Intel® Data Center GPU Max specification.