Radix Sort By Key#

radix_sort_by_key Function Templates#

The radix_sort_by_key function sorts keys using the radix sort algorithm, applying the same order to the corresponding values. The sorting is stable, preserving the relative order of elements with equal keys. Both in-place and out-of-place overloads are provided. Out-of-place overloads do not alter the input sequences.
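The semantics of a stable sort by key can be sketched on the host with the C++ standard library. The snippet below is only an illustration: it does not call oneDPL, it says nothing about how the kernel template is implemented, and the helper name reference_sort_by_key is hypothetical. It assumes that both vectors have the same length.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical host-side reference, not part of oneDPL: reorders `keys` and
// `values` so that keys are sorted and each value stays with its key.
// Precondition (assumed): keys.size() == values.size().
template <bool IsAscending = true, typename Key, typename Value>
void reference_sort_by_key(std::vector<Key>& keys, std::vector<Value>& values)
{
    // Sort a permutation of indices instead of the data itself.
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t{0});

    // std::stable_sort keeps equal keys in their original relative order.
    std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return IsAscending ? keys[a] < keys[b] : keys[b] < keys[a];
    });

    // Apply the same permutation to both sequences.
    std::vector<Key> sorted_keys(keys.size());
    std::vector<Value> sorted_values(values.size());
    for (std::size_t i = 0; i < order.size(); ++i)
    {
        sorted_keys[i] = keys[order[i]];
        sorted_values[i] = values[order[i]];
    }
    keys = std::move(sorted_keys);
    values = std::move(sorted_values);
}
```

For the keys 3 2 1 5 3 3 and the values r o s d t e, this reference produces the keys 1 2 3 3 3 5 and the values s o r t e d: the three values attached to key 3 keep their input order (r, t, e).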

The functions implement a variant of the Onesweep* [1] algorithm.

A synopsis of the radix_sort_by_key function is provided below:

// defined in <oneapi/dpl/experimental/kernel_templates>

namespace oneapi::dpl::experimental::kt::gpu::esimd {

// Sort in-place
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename Iterator1, typename Iterator2>
sycl::event
radix_sort_by_key (sycl::queue q, Iterator1 keys_first, Iterator1 keys_last,
                   Iterator2 values_first, KernelParam param); // (1)

template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename KeysRng, typename ValuesRng>
sycl::event
radix_sort_by_key (sycl::queue q, KeysRng&& keys,
                   ValuesRng&& values, KernelParam param); // (2)

// Sort out-of-place
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename KeysIterator1,
          typename ValuesIterator1, typename KeysIterator2,
          typename ValuesIterator2>
sycl::event
radix_sort_by_key (sycl::queue q, KeysIterator1 keys_first,
                   KeysIterator1 keys_last, ValuesIterator1 values_first,
                   KeysIterator2 keys_out_first, ValuesIterator2 values_out_first,
                   KernelParam param); // (3)

template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename KeysRng1, typename ValuesRng1,
          typename KeysRng2, typename ValuesRng2>
sycl::event
radix_sort_by_key (sycl::queue q, KeysRng1&& keys, ValuesRng1&& values,
                   KeysRng2&& keys_out, ValuesRng2&& values_out,
                   KernelParam param); // (4)

}

radix_sort_by_key is currently available only for Intel® Data Center GPU Max Series and requires Intel® oneAPI DPC++/C++ Compiler 2023.2 or newer.

Template Parameters#



bool IsAscending

The sort order. Ascending: true; Descending: false.

std::uint8_t RadixBits

The number of bits to sort for each radix sort algorithm pass.
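A plain sequential least-significant-digit radix sort can illustrate what RadixBits controls. The sketch below is not the Onesweep variant used by the library and does not run on a GPU; the function name lsd_radix_sort_by_key is hypothetical, and the sketch is limited to ascending order and 32-bit unsigned keys.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sequential illustration, not the library's GPU implementation.
// Sorts 32-bit unsigned keys in ascending order, RadixBits bits per pass,
// moving each value together with its key.
// Precondition (assumed): keys.size() == values.size().
template <std::uint8_t RadixBits = 8, typename Value>
void lsd_radix_sort_by_key(std::vector<std::uint32_t>& keys, std::vector<Value>& values)
{
    static_assert(RadixBits > 0 && RadixBits <= 16, "range chosen for this illustration only");
    constexpr std::size_t bucket_count = std::size_t{1} << RadixBits;
    constexpr std::uint32_t mask = static_cast<std::uint32_t>(bucket_count - 1);

    std::vector<std::uint32_t> keys_tmp(keys.size());
    std::vector<Value> values_tmp(values.size());

    // One pass per group of RadixBits bits, least significant group first.
    for (unsigned shift = 0; shift < 32; shift += RadixBits)
    {
        // 1. Count how many keys fall into each bucket for the current digit.
        std::vector<std::size_t> offsets(bucket_count + 1, 0);
        for (std::uint32_t k : keys)
            ++offsets[((k >> shift) & mask) + 1];

        // 2. A prefix sum turns the counts into the start offset of each bucket.
        for (std::size_t b = 0; b < bucket_count; ++b)
            offsets[b + 1] += offsets[b];

        // 3. Scatter in input order, which keeps every pass stable.
        for (std::size_t i = 0; i < keys.size(); ++i)
        {
            const std::size_t dst = offsets[(keys[i] >> shift) & mask]++;
            keys_tmp[dst] = keys[i];
            values_tmp[dst] = values[i];
        }
        keys.swap(keys_tmp);
        values.swap(values_tmp);
    }
}
```

In this sketch, 32-bit keys with RadixBits equal to 8 take 32 / 8 = 4 passes. A larger RadixBits means fewer passes but more buckets (2^RadixBits counters) per pass.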





Parameters#

q

The SYCL* queue where kernels are submitted.

  • keys_first, keys_last, values_first (1),

  • keys, values (2),

  • keys_first, keys_last, values_first, keys_out_first, values_out_first (3),

  • keys, values, keys_out, values_out (4).

The sequences to apply the algorithm to. Supported sequence types:


param

A kernel_param object. Its data_per_workitem must be a positive multiple of 32.

Type Requirements:

  • The element type of sequence(s) to sort must be a C++ integral or floating-point type other than bool with a width of up to 64 bits.


Current limitations:

  • Number of elements to sort must not exceed 2^30.

  • RadixBits can only be 8.

  • param.workgroup_size can only be 64.
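The constraints above can be checked before a sort is submitted. The following is a minimal sketch under the limits listed in this section; the helper is_supported_config is hypothetical and not part of oneDPL, and it takes the configuration values directly rather than a kernel_param object.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of oneDPL: mirrors the documented constraints
// so that an unsupported configuration is caught before any kernel is submitted.
template <std::uint8_t RadixBits, std::uint16_t DataPerWorkItem, std::uint16_t WorkGroupSize>
constexpr bool is_supported_config(std::size_t element_count)
{
    // Compile-time parameters are rejected at compile time.
    static_assert(RadixBits == 8, "RadixBits can only be 8");
    static_assert(DataPerWorkItem > 0 && DataPerWorkItem % 32 == 0,
                  "data_per_workitem must be a positive multiple of 32");
    static_assert(WorkGroupSize == 64, "workgroup_size can only be 64");

    // The element count is usually known only at run time.
    constexpr std::size_t max_elements = std::size_t{1} << 30;
    return element_count <= max_elements;
}
```

For instance, is_supported_config<8, 96, 64>(n) corresponds to the kernel_param<96, 64> configuration used in the usage examples and returns false only when n exceeds 2^30.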

Return Value#

A sycl::event object representing the status of the algorithm execution.

Usage Examples#

In-Place Example#

// possible build and run commands:
//    icpx -fsycl radix_sort_by_key.cpp -o radix_sort_by_key -I /path/to/oneDPL/include && ./radix_sort_by_key

#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>

#include <oneapi/dpl/experimental/kernel_templates>

namespace kt = oneapi::dpl::experimental::kt;

int main()
{
   std::size_t n = 6;
   sycl::queue q{sycl::gpu_selector_v};
   sycl::buffer<std::uint32_t> keys{sycl::range<1>(n)};
   sycl::buffer<char> values{sycl::range<1>(n)};

   // initialize; the host accessors are released at the end of the scope
   {
      sycl::host_accessor k_acc{keys, sycl::write_only};
      k_acc[0] = 3, k_acc[1] = 2, k_acc[2] = 1, k_acc[3] = 5, k_acc[4] = 3, k_acc[5] = 3;

      sycl::host_accessor v_acc{values, sycl::write_only};
      v_acc[0] = 'r', v_acc[1] = 'o', v_acc[2] = 's', v_acc[3] = 'd', v_acc[4] = 't', v_acc[5] = 'e';
   }

   // sort
   auto e = kt::gpu::esimd::radix_sort_by_key<true, 8>(q, keys, values, kt::kernel_param<96, 64>{}); // (2)

   // print; creating the host accessors waits for the sort to complete
   {
      sycl::host_accessor k_acc{keys, sycl::read_only};
      for(std::size_t i = 0; i < n; ++i)
            std::cout << k_acc[i] << ' ';
      std::cout << '\n';

      sycl::host_accessor v_acc{values, sycl::read_only};
      for(std::size_t i = 0; i < n; ++i)
            std::cout << v_acc[i] << ' ';
      std::cout << '\n';
   }

   return 0;
}

Output:

1 2 3 3 3 5
s o r t e d

Out-of-Place Example#

// possible build and run commands:
//    icpx -fsycl radix_sort_by_key.cpp -o radix_sort_by_key -I /path/to/oneDPL/include && ./radix_sort_by_key

#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>

#include <oneapi/dpl/experimental/kernel_templates>

namespace kt = oneapi::dpl::experimental::kt;

int main()
{
   std::size_t n = 6;
   sycl::queue q{sycl::gpu_selector_v};
   sycl::buffer<std::uint32_t> keys{sycl::range<1>(n)};
   sycl::buffer<std::uint32_t> keys_out{sycl::range<1>(n)};
   sycl::buffer<char> values{sycl::range<1>(n)};
   sycl::buffer<char> values_out{sycl::range<1>(n)};

   // initialize; the host accessors are released at the end of the scope
   {
      sycl::host_accessor k_acc{keys, sycl::write_only};
      k_acc[0] = 3, k_acc[1] = 2, k_acc[2] = 1, k_acc[3] = 5, k_acc[4] = 3, k_acc[5] = 3;

      sycl::host_accessor v_acc{values, sycl::write_only};
      v_acc[0] = 'r', v_acc[1] = 'o', v_acc[2] = 's', v_acc[3] = 'd', v_acc[4] = 't', v_acc[5] = 'e';
   }

   // sort
   auto e = kt::gpu::esimd::radix_sort_by_key<true, 8>(q, keys, values, keys_out, values_out,
                                                       kt::kernel_param<96, 64>{}); // (4)

   // print; creating the host accessors waits for the sort to complete
   {
      // the input sequences are left unchanged
      sycl::host_accessor k_acc{keys, sycl::read_only};
      for(std::size_t i = 0; i < n; ++i)
            std::cout << k_acc[i] << ' ';
      std::cout << '\n';

      sycl::host_accessor v_acc{values, sycl::read_only};
      for(std::size_t i = 0; i < n; ++i)
            std::cout << v_acc[i] << ' ';
      std::cout << "\n\n";

      // the output sequences hold the sorted result
      sycl::host_accessor k_out_acc{keys_out, sycl::read_only};
      for(std::size_t i = 0; i < n; ++i)
            std::cout << k_out_acc[i] << ' ';
      std::cout << '\n';

      sycl::host_accessor v_out_acc{values_out, sycl::read_only};
      for(std::size_t i = 0; i < n; ++i)
            std::cout << v_out_acc[i] << ' ';
      std::cout << '\n';
   }

   return 0;
}

Output:

3 2 1 5 3 3
r o s d t e

1 2 3 3 3 5
s o r t e d

Memory Requirements#

The algorithm uses global and local device memory (see SYCL 2020 Specification) for intermediate data storage. For the algorithm to operate correctly, there must be enough memory on the device. If there is not enough global device memory, a std::bad_alloc exception is thrown. The behavior is undefined if there is not enough local memory. The amount of memory that is required depends on input data and configuration parameters, as described below.

Global Memory Requirements#

Global memory is used for copying the input sequence(s) and storing internal data such as radix value counters. The used amount depends on many parameters; below is an upper bound approximation:

Nkeys + Nvalues + C * Nkeys

where the sequence with keys takes Nkeys space, the sequence with values takes Nvalues space, and the additional space is C * Nkeys.

The value of C depends on param.data_per_workitem, param.workgroup_size, and RadixBits. With param.data_per_workitem set to 32, param.workgroup_size set to 64, and RadixBits set to 8, C is approximately 1. Incrementing RadixBits can increase C by up to a factor of two, while doubling either param.data_per_workitem or param.workgroup_size halves C.
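As a rough aid, the global memory bound can be turned into a small calculation. The sketch below assumes RadixBits equal to 8 and models C as inversely proportional to both data_per_workitem and workgroup_size, anchored at C ≈ 1 for the values 32 and 64. That proportional model is an extrapolation of the scaling described in this section, not a documented formula, and the function name estimate_global_memory_bytes is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical estimator, not part of oneDPL. Assumes RadixBits == 8 and
// models C as inversely proportional to data_per_workitem and workgroup_size,
// anchored at C ~= 1 for data_per_workitem == 32 and workgroup_size == 64.
template <typename Key, typename Value>
constexpr double estimate_global_memory_bytes(std::size_t n, std::size_t data_per_workitem,
                                              std::size_t workgroup_size)
{
    const double keys_bytes = static_cast<double>(n) * sizeof(Key);     // Nkeys
    const double values_bytes = static_cast<double>(n) * sizeof(Value); // Nvalues
    const double c = (32.0 / static_cast<double>(data_per_workitem)) *
                     (64.0 / static_cast<double>(workgroup_size));
    return keys_bytes + values_bytes + c * keys_bytes; // Nkeys + Nvalues + C * Nkeys
}
```

Under these assumptions, sorting 2^20 std::uint32_t keys with char values and data_per_workitem equal to 32 gives 4194304 + 1048576 + 4194304 = 9437184 bytes. Doubling data_per_workitem to 64 lowers the estimate to 7340032 bytes.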

Local Memory Requirements#

Local memory is used for reordering key-value pairs within a work-group, and for storing internal data such as radix value counters. The used amount depends on many parameters; below is an upper bound approximation:

Nkeys_per_workgroup + Nvalues_per_workgroup + C

where Nkeys_per_workgroup and Nvalues_per_workgroup are the amounts of memory to store keys and values, respectively. C is some additional space for storing internal data.

Nkeys_per_workgroup equals sizeof(key_type) * param.data_per_workitem * param.workgroup_size, Nvalues_per_workgroup equals sizeof(value_type) * param.data_per_workitem * param.workgroup_size, and C does not exceed 4 KB.
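The local memory bound can be evaluated the same way. This is a minimal sketch that takes C at its stated maximum and interprets 4 KB as 4096 bytes, which is an assumption; the function name estimate_local_memory_bytes is hypothetical and not part of oneDPL.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical estimator, not part of oneDPL: upper bound in bytes, taking the
// documented 4 KB limit as the value of C (assumed to mean 4096 bytes).
template <typename Key, typename Value>
constexpr std::size_t estimate_local_memory_bytes(std::size_t data_per_workitem,
                                                  std::size_t workgroup_size)
{
    const std::size_t elements_per_workgroup = data_per_workitem * workgroup_size;
    const std::size_t keys_bytes = sizeof(Key) * elements_per_workgroup;     // Nkeys_per_workgroup
    const std::size_t values_bytes = sizeof(Value) * elements_per_workgroup; // Nvalues_per_workgroup
    constexpr std::size_t internal_bytes = 4096;                             // C
    return keys_bytes + values_bytes + internal_bytes;
}

// Configuration from the usage examples: kernel_param<96, 64>,
// std::uint32_t keys and char values: 24576 + 6144 + 4096 bytes.
static_assert(estimate_local_memory_bytes<std::uint32_t, char>(96, 64) == 34816,
              "worked example for the usage-example configuration");
```

An estimate like this can be compared with the value that the device reports for the sycl::info::device::local_mem_size query before choosing data_per_workitem, because the behavior is undefined when local memory is insufficient.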