DPCT1114

Message

cudaMemcpy is migrated to asynchronization memcpy, assuming in the original code the source host memory is pageable memory. If the memory is not pageable, call wait() on event return by memcpy API to ensure synchronization behavior.

Detailed Help

The cudaMemcpy function is normally synchronous: it returns only after the copy has completed. The exception is a host-to-device copy from pageable host memory, where the call may return before the data has reached the device. If the --optimize-migration option is used during migration, the migration tool assumes that the source host memory is pageable and migrates a host-to-device cudaMemcpy to an asynchronous memcpy, which can improve performance by letting the transfer overlap with other work. If the source memory is actually pinned (page-locked) host memory, the original call was synchronous, so call wait() on the event returned by the memcpy API to preserve that behavior.

Suggestions to Fix

For example, this original CUDA* code:

int N = 100;
float *src, *dst;
cudaMalloc(&dst, sizeof(float) * N);
// cudaMallocHost returns pinned (page-locked) host memory.
cudaMallocHost(&src, sizeof(float) * N);
for(int i = 0; i < N; i++){
  src[i] = i;
}
cudaMemcpy(dst, src, sizeof(float) * N, cudaMemcpyHostToDevice);

results in the following migrated SYCL* code:

sycl::device dev_ct1;
sycl::queue q_ct1(dev_ct1, sycl::property_list{sycl::property::queue::in_order()});
int N = 100;
float *src, *dst;
dst = sycl::malloc_device<float>(N, q_ct1);
src = sycl::malloc_host<float>(N, q_ct1);
for(int i = 0; i < N; i++){
  src[i] = i;
}
/*
DPCT1114:1: cudaMemcpy is migrated to asynchronization memcpy, assuming in the original code the source host memory is pageable memory. If the memory is not pageable, call wait() on event return by memcpy API to ensure synchronization behavior.
*/
q_ct1.memcpy(dst, src, sizeof(float) * N);

which is rewritten to:

sycl::device dev_ct1;
sycl::queue q_ct1(dev_ct1, sycl::property_list{sycl::property::queue::in_order()});
int N = 100;
float *src, *dst;
dst = sycl::malloc_device<float>(N, q_ct1);
src = sycl::malloc_host<float>(N, q_ct1);
for(int i = 0; i < N; i++){
  src[i] = i;
}
// In the original code src was allocated with cudaMallocHost (pinned host
// memory), so cudaMemcpy was synchronous; wait() preserves that behavior.
q_ct1.memcpy(dst, src, sizeof(float) * N).wait();