Pass Data to Algorithms#
For an algorithm to access data, the execution policy used must be compatible with the type of data storage. The following table shows which execution policies can be used with various data storage types.
Data Storage | Device Policies | Host Policies
---|---|---
SYCL buffers (passed via oneapi::dpl::begin and oneapi::dpl::end) | Yes | No
Device-allocated unified shared memory (USM) | Yes | No
Shared and host-allocated USM | Yes | Yes
std::vector with a USM allocator | Yes | Yes
std::vector with data in host memory | See Use std::vector | Yes
Other data in host memory | No | Yes
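For instance, device-allocated USM can be passed to an algorithm as a pair of pointers when a device policy is used. The following is a minimal sketch; it assumes the default policy's device supports USM device allocations, and error handling is omitted:

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>

#include <sycl/sycl.hpp>

int main(){
    auto policy = oneapi::dpl::execution::dpcpp_default;
    sycl::queue q = policy.queue();

    const int n = 1000;
    // Raw device USM: accessible on the device, not dereferenceable on the host
    int* data = sycl::malloc_device<int>(n, q);

    // USM pointers can be passed to the algorithm directly
    oneapi::dpl::fill(policy, data, data + n, 7);

    sycl::free(data, q);
    return 0;
}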
When using the standard-aligned (or host) execution policies, oneDPL supports data being passed to its algorithms as specified in the C++ standard (C++17 for algorithms working with iterators, C++20 for parallel range algorithms), with known restrictions and limitations.
According to the standard, the calling code must prevent data races when using algorithms with parallel execution policies.
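For example, data in an ordinary std::vector can be processed with a standard-aligned policy such as oneapi::dpl::execution::par; a minimal sketch:

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>

#include <vector>

int main(){
    std::vector<int> vec(1000, 1);
    // par is a standard-aligned (host) policy; it works directly on host memory,
    // and the caller must prevent data races, as required by the C++ standard
    oneapi::dpl::sort(oneapi::dpl::execution::par, vec.begin(), vec.end());
    return 0;
}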
Note
Implementations of std::vector<bool>
are not required to avoid data races for concurrent modifications
of vector elements. Some implementations may optimize multiple bool
elements into a bitfield, making it unsafe
for multithreading. For this reason, it is recommended to avoid std::vector<bool>
for anything but a read-only
input with the standard-aligned execution policies.
The following subsections describe proper ways to pass data to an algorithm invoked with a device execution policy.
Use oneapi::dpl::begin and oneapi::dpl::end Functions#
oneapi::dpl::begin
and oneapi::dpl::end
are special helper functions that
allow you to pass SYCL buffers to parallel algorithms. These functions accept
a SYCL buffer and return an object of an unspecified type that provides the following API:
- It satisfies the CopyConstructible and CopyAssignable C++ named requirements and is comparable with operator== and operator!=.
- The following expressions are valid: a + n, a - n, and a - b, where a and b are objects of the type and n is an integer value. The effect of those operations is the same as for a type that satisfies the LegacyRandomAccessIterator C++ named requirement.
- It provides the get_buffer method, which returns the buffer passed to the begin and end functions, as shown in the sketch below.
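For illustration, a small sketch (assuming a buffer of 1000 elements) of the iterator-like arithmetic and the get_buffer method:

sycl::buffer<int> buf{sycl::range<1>(1000)};
auto first = oneapi::dpl::begin(buf);
auto last  = oneapi::dpl::end(buf);

auto n   = last - first;   // element count, as for a random access iterator
auto mid = first + n / 2;  // advance by an integer offset

sycl::buffer<int> same = mid.get_buffer();  // the buffer originally passed to begin/end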
The begin
and end
functions can take SYCL 2020 deduction tags and sycl::no_init
as arguments
to explicitly control which access mode should be applied to a particular buffer when submitting
a SYCL kernel to a device:
sycl::buffer<int> buf{/*...*/};
auto first_ro = oneapi::dpl::begin(buf, sycl::read_only);                  // read-only access
auto first_wo = oneapi::dpl::begin(buf, sycl::write_only, sycl::no_init);  // write-only access, previous contents discarded
auto first_ni = oneapi::dpl::begin(buf, sycl::no_init);                    // default access mode, previous contents discarded
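As a usage sketch, a write-only pair with sycl::no_init suits an algorithm that overwrites the entire buffer, because the buffer's previous contents need not be copied to the device:

sycl::buffer<int> buf{sycl::range<1>(1000)};
auto first = oneapi::dpl::begin(buf, sycl::write_only, sycl::no_init);
auto last  = oneapi::dpl::end(buf, sycl::write_only, sycl::no_init);

// fill overwrites every element of the buffer
oneapi::dpl::fill(oneapi::dpl::execution::dpcpp_default, first, last, 0);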
To use the functions, add #include <oneapi/dpl/iterator>
to your code. For example:
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/iterator>

#include <algorithm>
#include <random>
#include <vector>

#include <sycl/sycl.hpp>

int main(){
    std::vector<int> vec(1000);
    std::generate(vec.begin(), vec.end(), std::minstd_rand{});

    // The buffer manages the data while it is alive; the sorted result is
    // written back to vec when buf is destroyed
    sycl::buffer<int> buf{ vec.data(), vec.size() };
    auto buf_begin = oneapi::dpl::begin(buf);
    auto buf_end   = oneapi::dpl::end(buf);

    oneapi::dpl::sort(oneapi::dpl::execution::dpcpp_default, buf_begin, buf_end);
    return 0;
}
Use std::vector#
You can use iterators to an ordinary std::vector
with data in host memory, as shown in the following example:
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>

#include <algorithm>
#include <random>
#include <vector>

int main(){
    std::vector<int> vec(1000);
    std::generate(vec.begin(), vec.end(), std::minstd_rand{});

    oneapi::dpl::sort(oneapi::dpl::execution::dpcpp_default, vec.begin(), vec.end());
    return 0;
}
In this case, a temporary SYCL buffer is created, the data is copied into it, and the buffer is processed according to the algorithm's semantics. After processing on the device completes, the modified data is copied from the temporary buffer back to the host container.
Note
For parallel range algorithms with device execution policies, the use of ordinary std::vector objects is not supported.
While convenient, direct use of an ordinary std::vector
can lead to unintended copying between the host
and the device. We recommend working with SYCL buffers or with USM to reduce data copying.
Note
For specialized memory algorithms that begin or end the lifetime of data objects, that is, the
uninitialized_* and destroy* families of functions, the data to initialize or destroy
must be accessible on the device without extra copying. Therefore, these algorithms cannot be used
with data storage on the host when invoked with device execution policies.
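A minimal sketch with shared USM, which is device accessible without copying (it uses oneDPL's uninitialized_fill and destroy from <oneapi/dpl/memory>):

#include <oneapi/dpl/execution>
#include <oneapi/dpl/memory>

#include <sycl/sycl.hpp>

int main(){
    auto policy = oneapi::dpl::execution::dpcpp_default;
    sycl::queue q = policy.queue();

    const int n = 1000;
    // Raw shared USM: device accessible, no objects constructed yet
    int* data = sycl::malloc_shared<int>(n, q);

    oneapi::dpl::uninitialized_fill(policy, data, data + n, 42);  // begin lifetimes
    oneapi::dpl::destroy(policy, data, data + n);                 // end lifetimes

    sycl::free(data, q);
    return 0;
}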
You can also use std::vector
with a sycl::usm_allocator
, as shown in the following example.
Make sure that the allocator and the execution policy use the same SYCL queue:
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>

#include <algorithm>
#include <random>
#include <vector>

#include <sycl/sycl.hpp>

int main(){
    const int n = 1000;
    auto policy = oneapi::dpl::execution::dpcpp_default;

    // The allocator must use the same queue as the execution policy
    sycl::usm_allocator<int, sycl::usm::alloc::shared> alloc(policy.queue());
    std::vector<int, decltype(alloc)> vec(n, alloc);
    std::generate(vec.begin(), vec.end(), std::minstd_rand{});

    // Recommended: pass USM pointers
    oneapi::dpl::sort(policy, vec.data(), vec.data() + vec.size());

    /*
    // Not recommended: iterators to a USM-allocated vector might require extra copying
    oneapi::dpl::sort(policy, vec.begin(), vec.end());
    */
    return 0;
}
For std::vector with a USM allocator, we recommend using std::vector::data() in
combination with std::vector::size(), as shown in the example above, rather than iterators to
std::vector. That is because for some implementations of the C++ Standard Library, oneDPL might
not be able to detect that the iterators point to USM-allocated data. In that case, the data is
treated as if it were in host memory, with an extra copy made to a SYCL buffer.
Retrieving USM pointers from std::vector as shown guarantees that no unintended copying occurs.
Use Range Views#
For parallel range algorithms with device execution policies, place the data in USM or a
USM-allocated std::vector, and pass it to an algorithm via a device-copyable range or view object
such as std::ranges::subrange or std::span.
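For instance, a minimal sketch that passes a std::span over a USM-allocated vector to a parallel range algorithm (it assumes for_each is among the supported parallel range algorithms in oneapi::dpl::ranges):

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>

#include <span>
#include <vector>

#include <sycl/sycl.hpp>

int main(){
    auto policy = oneapi::dpl::execution::dpcpp_default;

    sycl::usm_allocator<int, sycl::usm::alloc::shared> alloc(policy.queue());
    std::vector<int, decltype(alloc)> vec(1000, alloc);

    // std::span is device-copyable and views the USM-allocated elements directly
    oneapi::dpl::ranges::for_each(policy, std::span(vec), [](int& x){ x += 1; });
    return 0;
}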
Note
Use of std::ranges::views::all
is not supported for algorithms with device execution policies.
These data ranges, as well as supported range adaptors and factories, may be combined into data transformation pipelines that can also be used with parallel range algorithms. For example:
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>

#include <algorithm>
#include <functional>
#include <random>
#include <ranges>
#include <span>
#include <vector>

#include <sycl/sycl.hpp>

int main(){
    const int n = 1000;
    auto policy = oneapi::dpl::execution::dpcpp_default;
    sycl::queue q = policy.queue();

    int* d_head = sycl::malloc_host<int>(n, q);
    std::generate(d_head, d_head + n, std::minstd_rand{});

    sycl::usm_allocator<int, sycl::usm::alloc::shared> alloc(q);
    std::vector<int, decltype(alloc)> vec(n, alloc);

    // Negate the input elements while copying them into the USM-allocated vector
    oneapi::dpl::ranges::copy(policy,
        std::ranges::subrange(d_head, d_head + n) | std::views::transform(std::negate{}),
        std::span(vec));
    oneapi::dpl::ranges::sort(policy, std::span(vec));

    sycl::free(d_head, q);
    return 0;
}