DPCT1114#
Message#
cudaMemcpy is migrated to asynchronization memcpy, assuming in the original code the source host memory is pageable memory. If the memory is not pageable, call wait() on event return by memcpy API to ensure synchronization behavior.
Detailed Help#
The cudaMemcpy function typically operates synchronously. However, when copying from host to device using pageable host memory, it can behave asynchronously: the call may return before the transfer to the device has completed. If the --optimize-migration option is used during migration, the migration tool assumes the host memory is pageable and migrates cudaMemcpy into an asynchronous memcpy from host to device, which can improve performance by allowing the memory transfer to overlap with other tasks. However, if the source memory is pinned (page-locked) host memory, the original call was synchronous with respect to the host, so the user needs to call wait() on the event returned by the memcpy API to preserve that synchronization behavior.
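The reason for the wait() can be illustrated outside SYCL. The following is a minimal sketch in standard C++ only; it is not SYCL code and not tool output. The names async_copy and copy_then_reuse_source are hypothetical, and std::future stands in for the event returned by the memcpy API. It assumes the hazard of interest is host code reusing the source buffer after the copy has been submitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

// Hypothetical stand-in for an asynchronous memcpy: starts the copy on
// another thread and returns a handle the caller can wait on.
std::future<void> async_copy(float *dst, const float *src, std::size_t count) {
    assert(dst != nullptr && src != nullptr);
    return std::async(std::launch::async, [=] {
        std::memcpy(dst, src, count * sizeof(float));
    });
}

// Fills a source buffer, copies it, waits for the copy, then overwrites
// the source. Returns the destination buffer.
std::vector<float> copy_then_reuse_source(std::size_t n) {
    std::vector<float> src(n), dst(n);
    for (std::size_t i = 0; i < n; ++i) {
        src[i] = static_cast<float>(i);
    }

    std::future<void> done = async_copy(dst.data(), src.data(), n);

    // Analogous to calling wait() on the event returned by memcpy:
    // block until the copy has finished reading from src.
    done.wait();

    // Safe only because the copy has completed.
    for (float &v : src) {
        v = -1.0f;
    }
    return dst;
}
```

Without the done.wait() call, the loop that overwrites src could run while the copy is still in flight. The original synchronous cudaMemcpy on pinned memory ruled out that kind of race, which is why the migrated code may need an explicit wait().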
Suggestions to Fix#
For example, this original CUDA* code:
int N = 100;
float *src, *dst;
cudaMalloc(&dst, sizeof(float) * N);
cudaMallocHost(&src, sizeof(float) * N); // pinned (page-locked) host memory
for (int i = 0; i < N; i++) {
  src[i] = i;
}
cudaMemcpy(dst, src, sizeof(float) * N, cudaMemcpyHostToDevice);
results in the following migrated SYCL* code:
sycl::device dev_ct1;
sycl::queue q_ct1(dev_ct1, sycl::property_list{sycl::property::queue::in_order()});
int N = 100;
float *src, *dst;
dst = sycl::malloc_device<float>(N, q_ct1);
src = sycl::malloc_host<float>(N, q_ct1);
for (int i = 0; i < N; i++) {
  src[i] = i;
}
/*
DPCT1114:1: cudaMemcpy is migrated to asynchronization memcpy, assuming in the original code the source host memory is pageable memory. If the memory is not pageable, call wait() on event return by memcpy API to ensure synchronization behavior.
*/
q_ct1.memcpy(dst, src, sizeof(float) * N);
which is rewritten to:
sycl::device dev_ct1;
sycl::queue q_ct1(dev_ct1, sycl::property_list{sycl::property::queue::in_order()});
int N = 100;
float *src, *dst;
dst = sycl::malloc_device<float>(N, q_ct1);
src = sycl::malloc_host<float>(N, q_ct1);
for (int i = 0; i < N; i++) {
  src[i] = i;
}
// In the original code, src was allocated by cudaMallocHost (page-locked
// host memory), so call wait() on the event returned by memcpy.
q_ct1.memcpy(dst, src, sizeof(float) * N).wait();