DPCT1110

Message

The total declared local variable size in device function <function name> exceeds 128 bytes and may cause high register pressure. Consult with your hardware vendor to find the total register size available and adjust the code, or use smaller sub-group size to avoid high register pressure.

Detailed Help

On some hardware, the number of registers available to each work-item is limited. For example, on the Intel Xe-LP GPU architecture each hardware thread has 4 KB of registers, so with a sub-group size of 32 each work-item can use 128 bytes of registers (4 KB / 32). If the total size of the local variables declared in a device function exceeds 128 bytes, some of those variables may be stored in local or global memory instead, which can reduce performance when they are accessed frequently. To address this, either decrease the sub-group size so that more registers are available to each work-item, or follow the recommendations in the Registerization and Avoid Register Spills section of the oneAPI GPU Optimization Guide. For other hardware, consult your hardware vendor for the register configuration.
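The per-work-item budget described above is a simple division, sketched below in plain C++. The 4 KB figure is the Xe-LP value quoted in this section; the names `kRegisterBytesPerThread` and `register_bytes_per_work_item` are hypothetical and are not part of any SYCL or tool API. Treat the result as a rough upper bound only, since the registers a compiler actually allocates also depend on temporaries and optimization.

```cpp
#include <cstddef>

// Assumption: 4 KB of registers per hardware thread (the Intel Xe-LP
// figure quoted in the text). Other hardware may differ.
constexpr std::size_t kRegisterBytesPerThread = 4 * 1024;

// Bytes of register space available to one work-item when a hardware
// thread executes `sub_group_size` work-items.
constexpr std::size_t register_bytes_per_work_item(std::size_t sub_group_size) {
  return kRegisterBytesPerThread / sub_group_size;
}

// Halving the sub-group size doubles the per-work-item budget.
static_assert(register_bytes_per_work_item(32) == 128, "4 KB / 32");
static_assert(register_bytes_per_work_item(16) == 256, "4 KB / 16");
static_assert(register_bytes_per_work_item(8) == 512, "4 KB / 8");
```

Under these assumptions, the 128-byte threshold in the message corresponds to a sub-group size of 32.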

Suggestions to Fix

For example, this original CUDA* code:

__global__ void Kernel() {
  int result[50];  // 50 ints of per-thread local storage
  ...
}
int main() {
  ...
  Kernel<<<1, 100>>>();
}

results in the following migrated SYCL* code:

/*
DPCT1110:0: The total declared local variable size in device function "Kernel" exceeds 128 bytes and may cause high register pressure. Consult with your hardware vendor to find the total register size available and adjust the code or use smaller sub-group size to avoid high register pressure.
*/
void Kernel() {
  int result[50];
  ...
}
int main() {
  ...
  q.parallel_for(
      sycl::nd_range<3>(sycl::range<3>(1, 1, 100), sycl::range<3>(1, 1, 100)),
      [=](sycl::nd_item<3> item) [[intel::reqd_sub_group_size(32)]] { Kernel(); });
}

which is rewritten to:

void Kernel() {
  int result[50];
  ...
}

int main() {
  ...
  /*
   Reduce the sub-group size to make more registers available for each
   work-item, which may help to avoid high register pressure.
  */
  q.parallel_for(
      sycl::nd_range<3>(sycl::range<3>(1, 1, 100), sycl::range<3>(1, 1, 100)),
      [=](sycl::nd_item<3> item) [[intel::reqd_sub_group_size(16)]] { Kernel(); });
}
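As a rough check of why a sub-group size of 16 may help in this example, the sketch below compares the size of `int result[50]` against the per-work-item budget. It assumes the 4 KB per hardware thread figure for Xe-LP from the Detailed Help section and a 4-byte `int`; `fits_in_registers` and the constants are hypothetical helper names, not part of SYCL or the migration tool. It considers only the declared array, so it is an estimate rather than a guarantee that no spilling occurs.

```cpp
#include <cstddef>

// Assumption: 4 KB of registers per hardware thread (Intel Xe-LP figure).
constexpr std::size_t kRegisterBytesPerThread = 4 * 1024;

// Declared local storage in Kernel(): 200 bytes where int is 4 bytes.
constexpr std::size_t kLocalBytes = sizeof(int[50]);

// True if `local_bytes` of declared locals fit in the register budget of
// one work-item at the given sub-group size.
constexpr bool fits_in_registers(std::size_t local_bytes,
                                 std::size_t sub_group_size) {
  return local_bytes <= kRegisterBytesPerThread / sub_group_size;
}

// With 4-byte int: 200 bytes exceeds the 128-byte budget at size 32,
// but fits in the 256-byte budget at size 16.
static_assert(!fits_in_registers(200, 32), "200 > 128");
static_assert(fits_in_registers(200, 16), "200 <= 256");
```

A larger array may need a still smaller sub-group size, or the restructuring techniques from the oneAPI GPU Optimization Guide, and the sub-group size chosen must be one the target device supports.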