Radix Sort#

radix_sort and radix_sort_by_key Function Templates#

The radix_sort and radix_sort_by_key functions sort data using the radix sort algorithm. The sorting is stable, ensuring the preservation of the relative order of elements with equal keys. The functions implement a Onesweep* 1 algorithm variant. Both in-place and out-of-place overloads are provided. For out-of-place overloads, the input data order is preserved.

A synopsis of the radix_sort and radix_sort_by_key functions is provided below:

// defined in <oneapi/dpl/experimental/kernel_templates>

namespace oneapi::dpl::experimental::kt::gpu::esimd {

// Sort a single sequence
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename Iterator>
sycl::event
radix_sort (sycl::queue q, Iterator first, Iterator last,
            KernelParam param); // (1)

template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename Range>
sycl::event
radix_sort (sycl::queue q, Range&& r, KernelParam param); // (2)


// Sort a single sequence out-of-place
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename Iterator1,
          typename Iterator2>
sycl::event
radix_sort (sycl::queue q, Iterator1 first, Iterator1 last,
            Iterator2 first_out, KernelParam param) // (3)

template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename Range1, typename Range2>
sycl::event
radix_sort (sycl::queue q, Range1&& r, Range2&& r_out,
            KernelParam param) // (4)


// Sort a sequence of keys and apply the same order to a sequence of values
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename Iterator1, typename Iterator2>
sycl::event
radix_sort_by_key (sycl::queue q, Iterator1 keys_first, Iterator1 keys_last,
                   Iterator2 values_first, KernelParam param); // (5)

template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename KeysRng, typename ValuesRng>
sycl::event
radix_sort_by_key (sycl::queue q, KeysRng&& keys,
                   ValuesRng&& values, KernelParam param); // (6)


// Sort a sequence of keys and values out-of-place
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename KeysIterator1,
          typename ValsIterator1, typename KeysIterator2,
          typename ValsIterator2>
sycl::event
radix_sort_by_key (sycl::queue q, KeysIterator1 keys_first,
                   KeysIterator1 keys_last, ValsIterator1 vals_first,
                   KeysIterator2 keys_out_first, ValsIterator2 vals_out_first,
                   KernelParam param) // (7)

template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename KeysRng1, typename ValsRng1,
          typename KeysRng2, typename ValsRng2>
sycl::event
radix_sort_by_key (sycl::queue q, KeysRng1&& keys, ValsRng1&& values,
                   KeysRng2&& keys_out, ValsRng2&& vals_out,
                   KernelParam param) // (8)
}

Template Parameters#

Name

Description

bool IsAscending

The sort order. Ascending: true; Descending: false.

std::uint8_t RadixBits

The number of bits to sort for each radix sort algorithm pass.

Parameters#

Name

Description

q

The SYCL* queue where kernels are submitted.

  • first, last (1),

  • r (2),

  • first, last, first_out (3),

  • r, r_out (4),

  • keys_first, keys_last, values_first (5),

  • keys, values (6),

  • keys_first, keys_last, vals_first, keys_out_first, values_out_first (7)

  • keys, values, keys_out, values_out (8),

The sequences to apply the algorithm to. Supported sequence types:

param

A kernel_param object. Its data_per_workitem must be a positive multiple of 32.

Type Requirements:

  • The element type of sequence(s) to sort must be a C++ integral or floating-point type other than bool with a width of up to 64 bits.

Note

Current limitations:

  • Number of elements to sort must not exceed 2^30.

  • RadixBits can only be 8.

  • param.workgroup_size can only be 64.

Return Value#

A sycl::event object representing the status of the algorithm execution.

Usage Examples#

radix_sort In-Place Example#

// possible build and run commands:
//    icpx -fsycl radix_sort.cpp -o radix_sort -I /path/to/oneDPL/include && ./radix_sort

#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>

#include <oneapi/dpl/experimental/kernel_templates>

namespace kt = oneapi::dpl::experimental::kt;

int main()
{
   std::size_t n = 6;
   sycl::queue q{sycl::gpu_selector_v};
   std::uint32_t* keys = sycl::malloc_shared<std::uint32_t>(n, q);

   // initialize
   keys[0] = 3, keys[1] = 2, keys[2] = 1, keys[3] = 5, keys[4] = 3, keys[5] = 3;

   // sort
   auto e = kt::gpu::esimd::radix_sort<false, 8>(q, keys, keys + n, kt::kernel_param<416, 64>{}); // (1)
   e.wait();

   // print
   for(std::size_t i = 0; i < n; ++i)
      std::cout << keys[i] << ' ';
   std::cout << '\n';

   sycl::free(keys, q);
   return 0;
}

Output:

5 3 3 3 2 1

radix_sort_by_key In-Place Example#

// possible build and run commands:
//    icpx -fsycl radix_sort_by_key.cpp -o radix_sort_by_key -I /path/to/oneDPL/include && ./radix_sort_by_key

#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>

#include <oneapi/dpl/experimental/kernel_templates>

namespace kt = oneapi::dpl::experimental::kt;

int main()
{
   std::size_t n = 6;
   sycl::queue q{sycl::gpu_selector_v};
   sycl::buffer<std::uint32_t> keys{sycl::range<1>(n)};
   sycl::buffer<char> values{sycl::range<1>(n)};

   // initialize
   {
      sycl::host_accessor k_acc{keys, sycl::write_only};
      k_acc[0] = 3, k_acc[1] = 2, k_acc[2] = 1, k_acc[3] = 5, k_acc[4] = 3, k_acc[5] = 3;

      sycl::host_accessor v_acc{values, sycl::write_only};
      v_acc[0] = 'r', v_acc[1] = 'o', v_acc[2] = 's', v_acc[3] = 'd', v_acc[4] = 't', v_acc[5] = 'e';
   }

   // sort
   auto e = kt::gpu::esimd::radix_sort_by_key<true, 8>(q, keys, values, kt::kernel_param<96, 64>{}); // (6)
   e.wait();

   // print
   {
      sycl::host_accessor k_acc{keys, sycl::read_only};
      for(std::size_t i = 0; i < n; ++i)
            std::cout << k_acc[i] << ' ';
      std::cout << '\n';

      sycl::host_accessor v_acc{values, sycl::read_only};
      for(std::size_t i = 0; i < n; ++i)
            std::cout << v_acc[i] << ' ';
      std::cout << '\n';
   }

   return 0;
}

Output:

1 2 3 3 3 5
s o r t e d

radix_sort Out-of-Place Example#

// possible build and run commands:
//    icpx -fsycl radix_sort.cpp -o radix_sort -I /path/to/oneDPL/include && ./radix_sort

#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>

#include <oneapi/dpl/experimental/kernel_templates>

namespace kt = oneapi::dpl::experimental::kt;

int main()
{
   std::size_t n = 6;
   sycl::queue q{sycl::gpu_selector_v};
   std::uint32_t* keys = sycl::malloc_shared<std::uint32_t>(n, q);
   std::uint32_t* keys_out = sycl::malloc_shared<std::uint32_t>(n, q);

   // initialize
   keys[0] = 3, keys[1] = 2, keys[2] = 1, keys[3] = 5, keys[4] = 3, keys[5] = 3;

   // sort
   auto e = kt::gpu::esimd::radix_sort<false, 8>(q, keys, keys + n, keys_out, kt::kernel_param<416, 64>{}); // (3)
   e.wait();

   // print
   for(std::size_t i = 0; i < n; ++i)
      std::cout << keys[i] << ' ';
   std::cout << '\n';
   for(std::size_t i = 0; i < n; ++i)
      std::cout << keys_out[i] << ' ';
   std::cout << '\n';

   sycl::free(keys, q);
   sycl::free(keys_out, q);
   return 0;
}

Output:

3 2 1 5 3 3
5 3 3 3 2 1

radix_sort_by_key Out-of-Place Example#

// possible build and run commands:
//    icpx -fsycl radix_sort_by_key.cpp -o radix_sort_by_key -I /path/to/oneDPL/include && ./radix_sort_by_key

#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>

#include <oneapi/dpl/experimental/kernel_templates>

namespace kt = oneapi::dpl::experimental::kt;

int main()
{
   std::size_t n = 6;
   sycl::queue q{sycl::gpu_selector_v};
   sycl::buffer<std::uint32_t> keys{sycl::range<1>(n)};
   sycl::buffer<std::uint32_t> keys_out{sycl::range<1>(n)};
   sycl::buffer<char> values{sycl::range<1>(n)};
   sycl::buffer<char> values_out{sycl::range<1>(n)};


   // initialize
   {
      sycl::host_accessor k_acc{keys, sycl::write_only};
      k_acc[0] = 3, k_acc[1] = 2, k_acc[2] = 1, k_acc[3] = 5, k_acc[4] = 3, k_acc[5] = 3;

      sycl::host_accessor v_acc{values, sycl::write_only};
      v_acc[0] = 'r', v_acc[1] = 'o', v_acc[2] = 's', v_acc[3] = 'd', v_acc[4] = 't', v_acc[5] = 'e';
   }

   // sort
   auto e = kt::gpu::esimd::radix_sort_by_key<true, 8>(q, keys, values, keys_out, values_out,
                                                       kt::kernel_param<96, 64>{}); // (8)
   e.wait();

   // print
   {
      sycl::host_accessor k_acc{keys, sycl::read_only};
      for(std::size_t i = 0; i < n; ++i)
            std::cout << k_acc[i] << ' ';
      std::cout << '\n';

      sycl::host_accessor v_acc{values, sycl::read_only};
      for(std::size_t i = 0; i < n; ++i)
            std::cout << v_acc[i] << ' ';
      std::cout << "\n\n";

      sycl::host_accessor k_out_acc{keys_out, sycl::read_only};
      for(std::size_t i = 0; i < n; ++i)
            std::cout << k_out_acc[i] << ' ';
      std::cout << '\n';

      sycl::host_accessor v_out_acc{values_out, sycl::read_only};
      for(std::size_t i = 0; i < n; ++i)
            std::cout << v_out_acc[i] << ' ';
      std::cout << '\n';
   }

   return 0;
}

Output:

3 2 1 5 3 3
r o s d t e

1 2 3 3 3 5
s o r t e d

Memory Requirements#

The algorithms use global and local device memory (see SYCL 2020 Specification) for intermediate data storage. For the algorithms to operate correctly, there must be enough memory on the device; otherwise, the behavior is undefined. The amount of memory that is required depends on input data and configuration parameters, as described below.

Global Memory Requirements#

Global memory is used for copying the input sequence(s) and storing internal data such as radix value counters. The used amount depends on many parameters; below is an upper bound approximation:

radix_sort

Nkeys + C * Nkeys

radix_sort_by_key

Nkeys + Nvalues + C * Nkeys

where the sequence with keys takes Nkeys space, the sequence with values takes Nvalues space, and the additional space is C * Nkeys.

The value of C depends on param.data_per_workitem, param.workgroup_size, and RadixBits. For param.data_per_workitem set to 32, param.workgroup_size to 64, and RadixBits to 8, C approximately equals to 1. Incrementing RadixBits increases C up to twice, while doubling either param.data_per_workitem or param.workgroup_size leads to a halving of C.

Note

If the number of elements to sort does not exceed param.data_per_workitem * param.workgroup_size, radix_sort is executed by a single work-group and does not use any global memory. For radix_sort_by_key there is no single work-group implementation yet.

Local Memory Requirements#

Local memory is used for reordering keys or key-value pairs within a work-group, and for storing internal data such as radix value counters. The used amount depends on many parameters; below is an upper bound approximation:

radix_sort

Nkeys_per_workgroup + C

radix_sort_by_key

Nkeys_per_workgroup + Nvalues_per_workgroup + C

where Nkeys_per_workgroup and Nvalues_per_workgroup are the amounts of memory to store keys and values, respectively. C is some additional space for storing internal data.

Nkeys_per_workgroup equals to sizeof(key_type) * param.data_per_workitem * param.workgroup_size, Nvalues_per_workgroup equals to sizeof(value_type) * param.data_per_workitem * param.workgroup_size, C does not exceed 4KB.