Environment Variables#
Collective Algorithms Selection#
oneCCL supports collective operations for host (CPU) memory buffers and device (GPU) memory buffers. In addition, oneCCL provides two different paths to support collectives with GPU buffers: one directly uses Level Zero, and the other uses SYCL. The SYCL path is new code under development, and not all collectives are supported yet.
For the Level Zero implementation with GPU buffers, oneCCL collectives are optimized to execute a hierarchical algorithm composed of an optimized scale-up phase (communication between ranks/processes in the same node) and a scaleout phase (communication between ranks/processes on different nodes). For CPU buffers, the current collective algorithms do not support scale-up and scaleout phases; only a non-hierarchical algorithm can be chosen.
With CCL_<coll_name>=<algo_name>, you can select the algorithm for the collective in <coll_name>. For GPU buffers, the default algorithm is topo, which refers to the scale-up algorithm. If you select an algorithm different from topo, oneCCL uses a non-hierarchical algorithm: it copies the GPU buffers to the host (CPU) and runs the specified algorithm there. For CPU buffers, topo is not available; you can only select one of the other algorithms in the table for a given collective.
If the collective uses GPU buffers, you can select whether the implementation of the scale-up algorithm should use copy engines or compute kernels. There is also the option to select the scaleout algorithm with CCL_<coll_name>_SCALEOUT=<algo_name>.
Next, environment variables for collective algorithm selection are explained based on the code path (Level Zero or SYCL), the collective being called, and the type of buffer (GPU or CPU).
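For example (illustrative values only), to keep the default topo scale-up for ALLREDUCE on GPU buffers while selecting its scaleout algorithm, and to force a non-hierarchical host-side algorithm for ALLTOALL:
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner
export CCL_ALLTOALL=naive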
Level Zero Path#
ALLGATHER/ALLGATHERV#
CCL_ALLGATHER/CCL_ALLGATHERV#
Syntax
For the whole message size:
CCL_ALLGATHER=<algo_name>
CCL_ALLGATHERV=<algo_name>
For a specific message size range:
CCL_ALLGATHER="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
CCL_ALLGATHERV="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
Where:
<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.
Example
CCL_ALLGATHER="direct:0-8192;ring:8193-max"
CCL_ALLGATHERV="direct:0-8192;ring:8193-max"
Arguments
| <algo_name> | Description |
|---|---|
| topo | Topology-aware algorithm for scale-up. The default for GPU buffers. Not available for CPU buffers. |
| direct | Based on MPI_Iallgatherv. |
| naive | Send to all, receive from all. |
| flat | Alltoall-based algorithm. |
| multi_bcast | Series of broadcast operations with different root ranks. |
| ring | Ring-based algorithm. |
Description
Use this environment variable to specify the algorithm for ALLGATHER / ALLGATHERV.
If using GPU buffers, select CCL_ALLGATHER=topo or CCL_ALLGATHERV=topo (the default) to use a hierarchical algorithm for scale-up data transfer across GPUs in the same node.
For GPU buffers, when selecting an algorithm different from topo, oneCCL copies the data to the host and follows the specified CPU algorithm.
CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL#
Syntax
CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=<value>
Arguments
| <value> | Description |
|---|---|
| 1 | Uses compute kernels to transfer data across GPUs for the allgather phase of the collective. |
| 0 | Uses copy engines to transfer data across GPUs for the allgather phase of the collective. |
Description
Set this environment variable, when using GPU buffers, to specify how to perform the scale-up phase of the algorithm for ALLGATHER / ALLGATHERV.
This environment variable allows you to choose between using compute kernels or copy engines.
This option is only available if CCL_ALLGATHER=topo or CCL_ALLGATHERV=topo (the default for GPU buffers).
Note
This environment variable applies to both ALLGATHER and ALLGATHERV.
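For example, a minimal sketch that keeps the default topo algorithm but switches the scale-up allgather phase to copy engines (illustrative values):
export CCL_ALLGATHERV=topo
export CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=0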
CCL_ALLGATHER_SCALEOUT/CCL_ALLGATHERV_SCALEOUT#
Syntax
For the whole message size:
CCL_ALLGATHER_SCALEOUT=<algo_name>
CCL_ALLGATHERV_SCALEOUT=<algo_name>
For a specific message size range:
CCL_ALLGATHER_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
CCL_ALLGATHERV_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
Where:
<algo_name> is selected from the list of available scaleout collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.
Example
CCL_ALLGATHER_SCALEOUT="direct:0-8192;ring:8193-max"
CCL_ALLGATHERV_SCALEOUT="direct:0-8192;ring:8193-max"
Arguments
| <algo_name> | Description |
|---|---|
| direct | Based on MPI_Iallgatherv. |
| naive | Send to all, receive from all. |
| flat | Alltoall-based algorithm. |
| multi_bcast | Series of broadcast operations with different root ranks. |
| ring | Ring-based algorithm. The default value. |
Description
Set this environment variable, when using GPU buffers, to specify the scaleout phase of the algorithm for ALLGATHER / ALLGATHERV.
This option is only available if CCL_ALLGATHER=topo or CCL_ALLGATHERV=topo (the default for GPU buffers).
oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.
To see the actual table values, set CCL_LOG_LEVEL=info.
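For example, to keep topo for scale-up while splitting the scaleout algorithm by message size (the size ranges are illustrative, not tuned recommendations):
export CCL_ALLGATHERV=topo
export CCL_ALLGATHERV_SCALEOUT="naive:0-8192;ring:8193-max"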
ALLREDUCE#
CCL_ALLREDUCE#
Syntax
For the whole message size:
CCL_ALLREDUCE=<algo_name>
For a specific message size range:
CCL_ALLREDUCE="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
Where:
<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.
Example
CCL_ALLREDUCE="recursive_doubling:0-8192;rabenseifner:8193-1048576;ring:1048577-max"
Arguments
| <algo_name> | Description |
|---|---|
| topo | Topology-aware algorithm for scale-up. The default for GPU buffers. Not available for CPU buffers. |
| direct | Based on MPI_Iallreduce. |
| rabenseifner | Rabenseifner algorithm. |
| nreduce | May be beneficial for imbalanced workloads. |
| ring | reduce_scatter + allgather ring. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining on the reduce_scatter phase. |
| double_tree | Double-tree algorithm. |
| recursive_doubling | Recursive doubling algorithm. |
| 2d | Two-dimensional algorithm (reduce_scatter + allreduce + allgather). |
Description
Use this environment variable to specify the algorithm for ALLREDUCE.
If using GPU buffers, select CCL_ALLREDUCE=topo (the default) to use a hierarchical algorithm for scale-up data transfer across GPUs in the same node.
For GPU buffers, when selecting an algorithm different from topo, oneCCL copies the data to the host and follows the specified CPU algorithm.
oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.
To see the actual table values, set CCL_LOG_LEVEL=info.
CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL (GPU buffers only)#
Syntax
CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=<value>
Arguments
| <value> | Description |
|---|---|
| 1 | Uses compute kernels to transfer data across GPUs for the reduce_scatter phase of the scale-up ALLREDUCE collective. |
| 0 | Uses copy engines to transfer data across GPUs for the reduce_scatter phase of the scale-up ALLREDUCE collective. |
Description
Set this environment variable, when using GPU buffers, to specify how to perform the reduce_scatter portion of the scale-up ALLREDUCE collective.
This variable allows you to choose between using compute kernels or copy engines.
This option is only available if CCL_ALLREDUCE=topo (the default for GPU buffers).
CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL (GPU buffers only)#
Syntax
CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=<value>
Arguments
| <value> | Description |
|---|---|
| 1 | Uses compute kernels to transfer data across GPUs for the allgather phase of the scale-up ALLREDUCE collective. |
| 0 | Uses copy engines to transfer data across GPUs for the allgather phase of the scale-up ALLREDUCE collective. |
Description
ALLREDUCE is implemented as a reduce_scatter phase followed by an allgather phase.
Set this environment variable, when using GPU buffers, to specify how to perform the allgather portion of the scale-up ALLREDUCE collective. This environment variable allows you to choose between using compute kernels or copy engines. This option is only available if CCL_ALLGATHERV=topo (the default for GPU buffers).
CCL_ALLREDUCE_SCALEOUT (GPU buffers only)#
Syntax
For the whole message size:
CCL_ALLREDUCE_SCALEOUT=<algo_name>
For a specific message size range:
CCL_ALLREDUCE_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
Where:
<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.
Example
CCL_ALLREDUCE_SCALEOUT="recursive_doubling:0-8192;rabenseifner:8193-1048576;ring:1048577-max
Arguments
| <algo_name> | Description |
|---|---|
| direct | Based on MPI_Iallreduce. |
| rabenseifner | Rabenseifner algorithm. |
| nreduce | May be beneficial for imbalanced workloads. |
| ring | reduce_scatter + allgather ring. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining on the reduce_scatter phase. |
| double_tree | Double-tree algorithm. |
| recursive_doubling | Recursive doubling algorithm. |
Description
Set this environment variable, when using GPU buffers, to specify the scaleout algorithm for ALLREDUCE.
This option is only available if CCL_ALLREDUCE=topo (the default for GPU buffers).
oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.
To see the actual table values, set CCL_LOG_LEVEL=info.
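Putting the ALLREDUCE controls together, a possible GPU-buffer configuration could look like the following sketch (the algorithm choices and size ranges are illustrative):
export CCL_ALLREDUCE=topo
export CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=1
export CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=1
export CCL_ALLREDUCE_SCALEOUT="recursive_doubling:0-8192;ring:8193-max"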
ALLTOALL, ALLTOALLV#
CCL_ALLTOALL, CCL_ALLTOALLV#
Syntax
For the whole message size:
CCL_ALLTOALL=<algo_name> or CCL_ALLTOALLV=<algo_name>
For a specific message size range:
CCL_ALLTOALL="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
or
CCL_ALLTOALLV="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
Where:
<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.
Example
CCL_ALLTOALL="naive:0-8192;scatter:8193-max"
or
CCL_ALLTOALLV="naive:0-8192;scatter:8193-max"
Arguments
| <algo_name> | Description |
|---|---|
| topo | Topology-aware algorithm. The default for GPU buffers. Not available for CPU buffers. |
| direct | Based on MPI_Ialltoall / MPI_Ialltoallv. |
| naive | Send to all, receive from all. |
| scatter | Scatter-based algorithm. |
CCL_ALLTOALLV_MONOLITHIC_KERNEL#
Syntax
CCL_ALLTOALLV_MONOLITHIC_KERNEL=<value>
Arguments
| <value> | Description |
|---|---|
| 1 | Uses compute kernels to transfer data across GPUs for the ALLTOALL / ALLTOALLV collectives. |
| 0 | Uses copy engines to transfer data across GPUs for the ALLTOALL / ALLTOALLV collectives. |
Description
Set this environment variable, when using GPU buffers, to specify the scale-up algorithm for ALLTOALL or ALLTOALLV.
This environment variable allows you to choose between using compute kernels or copy engines.
This option is only available if CCL_ALLTOALL=topo or CCL_ALLTOALLV=topo (the default for GPU buffers).
CCL_ALLTOALL_SCALEOUT, CCL_ALLTOALLV_SCALEOUT#
Syntax
For the whole message size:
CCL_ALLTOALL_SCALEOUT=<algo_name> or CCL_ALLTOALLV_SCALEOUT=<algo_name>
For a specific message size range:
CCL_ALLTOALL_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
or
CCL_ALLTOALLV_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
Where:
<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.
Example
CCL_ALLTOALL_SCALEOUT="naive:0-8192;scatter:8193-max"
or
CCL_ALLTOALLV_SCALEOUT="naive:0-8192;scatter:8193-max"
Arguments
| <algo_name> | Description |
|---|---|
| naive | Send to all, receive from all. |
| scatter | Scatter-based algorithm. The default value. |
Description
Set this environment variable, when using GPU buffers, to specify the scaleout algorithm for ALLTOALL or ALLTOALLV.
This option is only available if CCL_ALLTOALL=topo or CCL_ALLTOALLV=topo (the default for GPU buffers).
oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.
To see the actual table values, set CCL_LOG_LEVEL=info.
BARRIER#
CCL_BARRIER#
Syntax
CCL_BARRIER=<algo_name>
Arguments
| <algo_name> | Description |
|---|---|
| direct | Based on MPI_Ibarrier. |
| ring | Ring-based algorithm. |
Description
Use this environment variable to select the barrier algorithm.
BROADCAST#
CCL_BCAST#
Syntax
CCL_BCAST=<algo_name>
Arguments
| <algo_name> | Description |
|---|---|
| topo | Topology-aware algorithm. The default for GPU buffers. Not available for CPU buffers. |
| direct | Based on MPI_Ibcast. |
| ring | Ring-based algorithm. |
| double_tree | Double-tree algorithm. |
| naive | Send to all from root rank. |
Description
Use this environment variable to select the algorithm used for broadcast.
Note
The BCAST algorithm does not yet support the CCL_BCAST_SCALEOUT environment variable. To change the algorithm for BCAST, use the CCL_BCAST environment variable.
REDUCE#
CCL_REDUCE#
Syntax
For the whole message size:
CCL_REDUCE=<algo_name>
For a specific message size range:
CCL_REDUCE="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
Where:
<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.
Example
CCL_REDUCE="direct:0-8192;double_tree:1048577-max"
Arguments
| <algo_name> | Description |
|---|---|
| topo | Topology-aware algorithm for scale-up. The default for GPU buffers. Not available for CPU buffers. |
| direct | Based on MPI_Ireduce. |
| rabenseifner | Rabenseifner algorithm. |
| tree | Tree algorithm. |
| double_tree | Double-tree algorithm. |
Description
Set this environment variable to specify the algorithm for REDUCE.
If using GPU buffers, select CCL_REDUCE=topo (the default) to use a hierarchical algorithm for scale-up data transfer across GPUs in the same node.
For GPU buffers, when selecting an algorithm different from topo, oneCCL copies the data to the host and follows the specified CPU algorithm.
oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.
To see the actual table values, set CCL_LOG_LEVEL=info.
CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL (GPU buffers only)#
Syntax
CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=<value>
Arguments
| <value> | Description |
|---|---|
| 1 | Uses compute kernels to transfer data across GPUs for the reduce_scatter phase of the collective. |
| 0 | Uses copy engines to transfer data across GPUs for the reduce_scatter phase of the collective. |
Description
Set this environment variable, when using GPU buffers, to specify how to perform the reduce_scatter phase of the scale-up algorithm.
This environment variable allows you to choose between using compute kernels or copy engines.
This option is only available if CCL_REDUCE=topo (the default for GPU buffers).
CCL_REDUCE_SCALEOUT (GPU buffers only)#
Syntax
For the whole message size:
CCL_REDUCE_SCALEOUT=<algo_name>
For a specific message size range:
CCL_REDUCE_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
Where:
<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.
Example
CCL_REDUCE_SCALEOUT="direct:0-8192;double_tree:1048577-max"
Arguments
| <algo_name> | Description |
|---|---|
| direct | Based on MPI_Ireduce. |
| rabenseifner | Rabenseifner algorithm. |
| tree | Tree algorithm. |
| double_tree | Double-tree algorithm. The default value. |
Description
Set this environment variable, when using GPU buffers, to specify the scaleout algorithm for REDUCE.
This option is only available if CCL_REDUCE=topo (the default for GPU buffers).
oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.
To see the actual table values, set CCL_LOG_LEVEL=info.
REDUCE_SCATTER#
CCL_REDUCE_SCATTER#
Syntax
For the whole message size:
CCL_REDUCE_SCATTER=<algo_name>
For a specific message size range:
CCL_REDUCE_SCATTER="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
Where:
<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.
Example
CCL_REDUCE_SCATTER="direct:0-8192;ring:1048577-max"
Arguments
| <algo_name> | Description |
|---|---|
| topo | Topology-aware algorithm for scale-up. The default for GPU buffers. Not available for CPU buffers. |
| direct | Based on MPI_Ireduce_scatter_block. |
| naive | Send to all, receive and reduce from all. |
| ring | Ring-based algorithm. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining. |
Description
Use this environment variable to specify the algorithm for REDUCE_SCATTER. If using GPU buffers, select CCL_REDUCE_SCATTER=topo (the default) to use a hierarchical algorithm for scale-up data transfer across GPUs in the same node. For GPU buffers, when selecting an algorithm different from topo, oneCCL copies the data to the host and follows the specified CPU algorithm.
oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.
To see the actual table values, set CCL_LOG_LEVEL=info.
CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL (GPU buffers only)#
Syntax
CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=<value>
Arguments
| <value> | Description |
|---|---|
| 1 | Uses compute kernels to transfer data across GPUs for the reduce_scatter phase of the collective. |
| 0 | Uses copy engines to transfer data across GPUs for the reduce_scatter phase of the collective. |
Description
Set this environment variable, when using GPU buffers, to specify how to perform the reduce_scatter portion of the scale-up REDUCE_SCATTER collective.
This environment variable allows you to choose between using compute kernels or copy engines.
This option is only available if CCL_REDUCE_SCATTER=topo (the default for GPU buffers).
CCL_REDUCE_SCATTER_SCALEOUT (GPU buffers only)#
Syntax
For the whole message size:
CCL_REDUCE_SCATTER_SCALEOUT=<algo_name>
For a specific message size range:
CCL_REDUCE_SCATTER_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
Where:
<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.
Example
CCL_REDUCE_SCATTER_SCALEOUT="direct:0-8192;double_tree:1048577-max"
Arguments
| <algo_name> | Description |
|---|---|
| direct | Based on MPI_Ireduce_scatter_block. |
| naive | Send to all, receive and reduce from all. The default value. |
| ring | Ring-based algorithm. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining. |
Description
Set this environment variable, when using GPU buffers, to specify the scaleout algorithm for REDUCE_SCATTER.
This option is only available if CCL_REDUCE_SCATTER=topo (the default for GPU buffers).
oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.
To see the actual table values, set CCL_LOG_LEVEL=info.
SYCL Path (Default with 2021.14)#
All collectives#
CCL_ENABLE_SYCL_KERNELS#
Syntax
CCL_ENABLE_SYCL_KERNELS=<value>
Arguments
| <value> | Description |
|---|---|
| 1 | Enable SYCL kernels. The default value. |
| 0 | Disable SYCL kernels. |
Description
Setting this environment variable to 1 enables SYCL kernel-based implementations for ALLGATHERV, ALLREDUCE, and REDUCE_SCATTER.
This optimization covers all message sizes and supports the following data types (with sum operations):
int32
fp32
fp16
bf16
oneCCL falls back to other implementations when SYCL kernel support is unavailable, so you can set this environment variable safely.
Note
The name of this variable in 2021.12 was CCL_SKIP_SCHEDULER. Starting with 2021.13, the variable has been renamed to CCL_ENABLE_SYCL_KERNELS.
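For example, to enable the SYCL kernels explicitly for a two-rank run (the mpiexec arguments are illustrative, and ./my_app is a placeholder for your application binary):
export CCL_ENABLE_SYCL_KERNELS=1
mpiexec -np 2 -ppn 2 ./my_app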
ALLGATHER/ALLGATHERV#
CCL_SYCL_ALLGATHERV_TMP_BUF#
Syntax
CCL_SYCL_ALLGATHERV_TMP_BUF=<value>
Arguments
| <value> | Description |
|---|---|
| 1 | Uses a persistent temporary buffer to perform the collective. |
| 0 | Performs an IPC handle exchange, avoiding copies to temporary buffers. The default value. |
Description
Specifies whether the ALLGATHER / ALLGATHERV implementation should use a persistent temporary buffer. The implementation with temporary buffers makes the collective fully asynchronous but adds some overhead due to the extra copy of the user buffer to a (persistent) temporary buffer. The current default uses Level Zero IPC to avoid the copies to the temporary buffer.
CCL_SYCL_ALLGATHERV_SMALL_THRESHOLD#
Syntax
CCL_SYCL_ALLGATHERV_SMALL_THRESHOLD=<value>
Arguments
| <value> | Description |
|---|---|
| SIZE | Threshold in bytes to specify the small-size algorithm. |
Description
ALLGATHER / ALLGATHERV collectives with message sizes smaller than the specified threshold will use an algorithm specialized for small-sized messages.
CCL_SYCL_ALLGATHERV_SCALEOUT_THRESHOLD#
Syntax
CCL_SYCL_ALLGATHERV_SCALEOUT_THRESHOLD=<value>
Arguments
| <value> | Description |
|---|---|
| SIZE | Threshold in bytes to specify when the scale-out ALLGATHER / ALLGATHERV uses the SYCL kernel-based implementation. |
Description
For ALLGATHER / ALLGATHERV collectives with total message sizes below this threshold in bytes, the SYCL path is chosen to execute the collective operation. For message sizes exceeding this threshold, the implementation switches to the Level Zero path. The total message size is the number of bytes received from all participating processes.
ALLREDUCE#
CCL_SYCL_ALLREDUCE_TMP_BUF#
Syntax
CCL_SYCL_ALLREDUCE_TMP_BUF=<value>
Arguments
| <value> | Description |
|---|---|
| 1 | Uses a persistent temporary buffer to perform the collective. |
| 0 | Performs an IPC handle exchange, avoiding copies to temporary buffers. The default value. |
Description
Specifies whether the ALLREDUCE implementation should use a persistent temporary buffer. The implementation with temporary buffers makes the collective fully asynchronous but adds some overhead due to the extra copy of the user buffer to a (persistent) temporary buffer. The current default uses Level Zero IPC support to avoid the copies to the temporary buffer.
CCL_SYCL_ALLREDUCE_SMALL_THRESHOLD#
Syntax
CCL_SYCL_ALLREDUCE_SMALL_THRESHOLD=<value>
Arguments
| <value> | Description |
|---|---|
| SIZE | Threshold in bytes to specify the small-size algorithm. |
Description
ALLREDUCE collectives with message sizes smaller than the specified threshold will use an algorithm specialized for small-sized messages.
CCL_SYCL_ALLREDUCE_SCALEOUT_THRESHOLD#
Syntax
CCL_SYCL_ALLREDUCE_SCALEOUT_THRESHOLD=<value>
Arguments
| <value> | Description |
|---|---|
| SIZE | Threshold in bytes to specify when the scale-out ALLREDUCE uses the SYCL kernel-based implementation. |
Description
For ALLREDUCE collectives with message sizes below this threshold in bytes, the SYCL path is chosen to execute the collective operation. For message sizes exceeding this threshold, the implementation switches to the Level Zero path.
CCL_SYCL_ALLREDUCE_SCALEOUT_DIRECT_THRESHOLD#
Syntax
CCL_SYCL_ALLREDUCE_SCALEOUT_DIRECT_THRESHOLD=<value>
Arguments
| <value> | Description |
|---|---|
| SIZE | Threshold in bytes to specify when the ALLREDUCE collective selects the direct MPI_Iallreduce-based scale-out algorithm. |
Description
For ALLREDUCE collectives with message sizes below this threshold in bytes, the direct MPI_Iallreduce algorithm is selected for the scale-out phase of the collective. For message sizes above this threshold and below CCL_SYCL_ALLREDUCE_SCALEOUT_THRESHOLD, the default algorithm (ring) is selected.
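As an illustration of how these thresholds interact (the byte values below are made up for the example, not defaults): with the following settings, messages up to 16384 bytes use the direct MPI_Iallreduce scale-out algorithm, messages up to 1048576 bytes use the default SYCL scale-out algorithm (ring), and larger messages switch to the Level Zero path:
export CCL_SYCL_ALLREDUCE_SCALEOUT_DIRECT_THRESHOLD=16384
export CCL_SYCL_ALLREDUCE_SCALEOUT_THRESHOLD=1048576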
REDUCE_SCATTER#
CCL_SYCL_REDUCE_SCATTER_TMP_BUF#
Syntax
CCL_SYCL_REDUCE_SCATTER_TMP_BUF=<value>
Arguments
| <value> | Description |
|---|---|
| 1 | Uses a persistent temporary buffer to perform the collective. |
| 0 | Performs an IPC handle exchange, avoiding copies to temporary buffers. The default value. |
Description
Specifies whether the REDUCE_SCATTER implementation should use a persistent temporary buffer. The implementation with temporary buffers makes the collective fully asynchronous but adds some overhead due to the extra copy of the user buffer to a (persistent) temporary buffer. The current default uses Level Zero IPC support to avoid the copies to the temporary buffer.
CCL_SYCL_REDUCE_SCATTER_SCALEOUT_DIRECT_THRESHOLD#
Syntax
CCL_SYCL_REDUCE_SCATTER_SCALEOUT_DIRECT_THRESHOLD=<value>
Arguments
| <value> | Description |
|---|---|
| SIZE | Threshold in bytes to specify when the REDUCE_SCATTER collective selects the direct MPI_Ireduce_scatter-based scale-out algorithm. |
Description
For REDUCE_SCATTER collectives with message sizes below this threshold in bytes, the direct MPI_Ireduce_scatter algorithm is selected for the scale-out phase of the collective. For message sizes above this threshold and below CCL_SYCL_REDUCE_SCATTER_SCALEOUT_THRESHOLD, the default algorithm (ring) is selected.
CCL_SYCL_REDUCE_SCATTER_SCALEOUT_THRESHOLD#
Syntax
CCL_SYCL_REDUCE_SCATTER_SCALEOUT_THRESHOLD=<value>
Arguments
<value> |
Description |
|
Threshold in bytes to specify when scale-out |
Description
For REDUCE_SCATTER collectives with message sizes below this threshold in bytes, the SYCL path is chosen to execute the collective operation. For message sizes exceeding this threshold, the implementation switches to the Level Zero path.
CCL_SYCL_REDUCE_SCATTER_SMALL_THRESHOLD#
Syntax
CCL_SYCL_REDUCE_SCATTER_SMALL_THRESHOLD=<value>
Arguments
| <value> | Description |
|---|---|
| SIZE | Threshold in bytes to specify the small-size algorithm. |
Description
REDUCE_SCATTER collectives with message sizes smaller than the specified threshold will use an algorithm specialized for small-sized messages.
Workers#
The group of environment variables to control worker threads.
CCL_WORKER_COUNT#
Syntax
CCL_WORKER_COUNT=<value>
Arguments
| <value> | Description |
|---|---|
| N | The number of worker threads for a oneCCL rank. The default value is 1. |
Description
Set this environment variable to specify the number of oneCCL worker threads. For GPU buffers, we currently do not recommend setting this variable to values larger than 1.
CCL_WORKER_AFFINITY#
Syntax
CCL_WORKER_AFFINITY=<cpulist>
Arguments
| <cpulist> | Description |
|---|---|
| auto | Workers are automatically pinned to the last cores of the pin domain. The pin domain depends on the process launcher. |
| <cpulist> | A comma-separated list of core numbers and/or ranges of core numbers for all local workers, one number per worker. The i-th local worker is pinned to the i-th core in the list. For example, 3,4,5,6 pins four local workers to cores 3, 4, 5, and 6. |
Description
Set this environment variable to specify the CPU affinity for oneCCL worker threads.
CCL_WORKER_MEM_AFFINITY#
Syntax
CCL_WORKER_MEM_AFFINITY=<nodelist>
Arguments
<nodelist> |
Description |
---|---|
|
Workers are automatically pinned to NUMA nodes that correspond to CPU affinity of workers. |
|
A comma-separated list of NUMA node numbers for all local workers, one number per worker. The i-th local worker is pinned to the i-th NUMA node in the list. The number should not exceed the number of NUMA nodes available on the system. |
Description
Set this environment variable to specify memory affinity for oneCCL worker threads.
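For example, to pin two local workers to specific cores and NUMA nodes (the core and node numbers are illustrative and system-dependent):
export CCL_WORKER_COUNT=2
export CCL_WORKER_AFFINITY=3,4
export CCL_WORKER_MEM_AFFINITY=0,1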
KVS#
CCL_KVS_MODE#
Syntax
CCL_KVS_MODE=<value>
Arguments
| <value> | Description |
|---|---|
| pmi | PMI transport (the default). |
| mpi | MPI transport. |
Description
Set this environment variable to specify the transport used to establish a connection between ranks during oneCCL communicator creation. Currently, the mpi value is only supported when the MPI transport is used (see CCL_ATL_TRANSPORT). For large-scale runs, we recommend setting CCL_KVS_MODE to mpi.
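For example, a large-scale run could switch both the transport and the connection setup to MPI (illustrative):
export CCL_ATL_TRANSPORT=mpi
export CCL_KVS_MODE=mpi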
ATL#
The group of environment variables to control ATL (abstract transport layer).
CCL_ATL_TRANSPORT#
Syntax
CCL_ATL_TRANSPORT=<value>
Arguments
| <value> | Description |
|---|---|
| mpi | MPI transport (the default). |
| ofi | OFI (libfabric*) transport. |
Description
Set this environment variable to select the transport for inter-process communications.
CCL_ATL_HMEM#
Syntax
CCL_ATL_HMEM=<value>
Arguments
<value> |
Description |
---|---|
|
Enable heterogeneous memory support on the transport layer. |
|
Disable heterogeneous memory support on the transport layer (default). |
Description
Set this environment variable to enable handling of HMEM/GPU buffers by the transport layer. The actual HMEM support depends on the limitations on the transport level and system configuration.
CCL_ATL_SHM#
Syntax
CCL_ATL_SHM=<value>
Arguments
<value> |
Description |
---|---|
|
Disables the OFI shared memory provider. The default value. |
|
Enables the OFI shared memory provider. |
Description
Set this environment variable to enable the OFI shared memory provider to communicate between ranks in the same node for host (CPU) buffers. This capability requires OFI as the transport (CCL_ATL_TRANSPORT=ofi).
The OFI/SHM provider can utilize the Intel(R) Data Streaming Accelerator* (DSA). To run it with DSA, you need:
* Linux* OS kernel support for the DSA shared work queues
* Libfabric* 1.17 or later
To enable DSA, set the following environment variables:
FI_SHM_DISABLE_CMA=1
FI_SHM_USE_DSA_SAR=1
Refer to the Libfabric* Programmer's Manual for additional details about DSA support in the SHM provider: https://ofiwg.github.io/libfabric/main/man/fi_shm.7.html.
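A complete, illustrative setup that enables the OFI shared memory provider with DSA (assuming a DSA-capable system and Libfabric 1.17 or later):
export CCL_ATL_TRANSPORT=ofi
export CCL_ATL_SHM=1
export FI_SHM_DISABLE_CMA=1
export FI_SHM_USE_DSA_SAR=1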
CCL_PROCESS_LAUNCHER#
Syntax
CCL_PROCESS_LAUNCHER=<value>
Arguments
| <value> | Description |
|---|---|
| hydra | Uses the MPI hydra job launcher. The default value. |
| torchrun | Uses torchrun (https://pytorch.org/docs/stable/elastic/run.html) as the job launcher. |
| pmix | Used with the PALS job launcher, which uses the PMIx API, for example: CCL_PROCESS_LAUNCHER=pmix CCL_ATL_TRANSPORT=mpi mpiexec -np 2 -ppn 2 --pmi=pmix ... |
| none | No job launcher is used. You should specify the values for CCL_LOCAL_SIZE and CCL_LOCAL_RANK. |
Description
Set this environment variable to specify the job launcher.
CCL_LOCAL_SIZE#
Syntax
CCL_LOCAL_SIZE=<value>
Arguments
<value> |
Description |
---|---|
|
A total number of ranks on the local host. |
Description
Set this environment variable to specify the total number of ranks on the local host.
CCL_LOCAL_RANK#
Syntax
CCL_LOCAL_RANK=<value>
Arguments
| <value> | Description |
|---|---|
| RANK | The rank number of the current process on the local host. |
Description
Set this environment variable to specify the rank number of the current process on the local host.
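For example, a sketch of starting two ranks on one host without a job launcher (./my_app is a placeholder for your application binary):
export CCL_PROCESS_LAUNCHER=none
export CCL_LOCAL_SIZE=2
CCL_LOCAL_RANK=0 ./my_app &
CCL_LOCAL_RANK=1 ./my_app &
wait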
Multi-NIC#
CCL_MNIC, CCL_MNIC_NAME, and CCL_MNIC_COUNT define filters to select multiple NICs. oneCCL workers will be pinned to the selected NICs in a round-robin way.
CCL_MNIC#
Syntax
CCL_MNIC=<value>
Arguments
| <value> | Description |
|---|---|
| global | Select all NICs available on the node. |
| local | Select all NICs local to the NUMA node that corresponds to the process pinning. |
| none | Disable special NIC selection and use a single default NIC (the default). |
Description
Set this environment variable to control multi-NIC selection by NIC locality.
CCL_MNIC_NAME#
Syntax
CCL_MNIC_NAME=<namelist>
Arguments
| <namelist> | Description |
|---|---|
| <namelist> | A comma-separated list of NIC full names or prefixes to filter NICs. Use the ^ symbol to exclude NICs starting with the specified prefixes. For example, mlx5_0,mlx5_1,^mlx5_2 selects mlx5_0 and mlx5_1 while excluding mlx5_2. |
Description
Set this environment variable to control multi-NIC selection by NIC names.
CCL_MNIC_COUNT#
Syntax
CCL_MNIC_COUNT=<value>
Arguments
<value> |
Description |
---|---|
|
The maximum number of NICs that should be selected for oneCCL workers. If not specified then equal to the number of oneCCL workers. |
Description
Set this environment variable to specify the maximum number of NICs to be selected. The actual number of NICs selected may be smaller due to limitations on the transport level or system configuration.
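For example, to restrict workers to at most two NUMA-local NICs whose names start with mlx5 (the NIC name prefix is illustrative):
export CCL_MNIC=local
export CCL_MNIC_NAME=mlx5
export CCL_MNIC_COUNT=2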
Inter Process Communication (IPC)#
CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD#
Syntax
CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=<value>
Arguments
| <value> | Description |
|---|---|
| N | The number of IPC handles in the receiver cache. The default value is 1000. |
Description
Use this environment variable to change the number of IPC handles opened with zeMemOpenIpcHandle() that oneCCL maintains in its receiving cache. IPC handles refer to Level Zero memory IPC handles.
The IPC handles opened with zeMemOpenIpcHandle() are stored by oneCCL in the receiving cache. When the number of opened IPC handles exceeds the specified threshold, the cache evicts a handle using an LRU (Least Recently Used) policy. Starting with version 2021.10, the default value is 1000.
CCL_ZE_IPC_EXCHANGE#
Syntax
CCL_ZE_IPC_EXCHANGE=<value>
Arguments
| <value> | Description |
|---|---|
| drmfd | Uses the DRM mechanism for Level Zero IPC exchange. This is an experimental mechanism used with OS kernels prior to SP4. The default value for 2021.13 and before. |
| pidfd | Uses the pidfd mechanism for Level Zero IPC exchange. Requires a Linux kernel that supports pidfd. The default value starting with 2021.14. |
| sockets | Uses the socket mechanism for Level Zero IPC exchange. |
| none | This mode is used by oneCCL when built on a system without support for the other mechanisms. |
Description
Set this environment variable to specify the mechanism to use for Level Zero IPC exchange.
CCL_ZE_CACHE_GET_IPC_HANDLES_THRESHOLD#
Syntax
CCL_ZE_CACHE_GET_IPC_HANDLES_THRESHOLD=<value>
Arguments
| <value> | Description |
|---|---|
| N | The number of IPC handles in the sender cache. The default value is 1000. |
Description
Use this environment variable to change the number of IPC handles obtained with zeMemGetIpcHandle() that oneCCL maintains in its sender cache. IPC handles refer to Level Zero memory IPC handles.
The IPC handles obtained with zeMemGetIpcHandle() are stored by oneCCL in the sender cache. When the number of obtained IPC handles exceeds the specified threshold, the cache evicts a handle using an LRU (Least Recently Used) policy. The default value is 1000.
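For example, to shrink both IPC handle caches on a memory-constrained system (the value is illustrative):
export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=500
export CCL_ZE_CACHE_GET_IPC_HANDLES_THRESHOLD=500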
Low-precision Data Types#
The group of environment variables to control processing of low-precision data types.
CCL_BF16#
Syntax
CCL_BF16=<value>
Arguments
| <value> | Description |
|---|---|
| avx512f | Select the implementation based on AVX512F instructions. |
| avx512bf | Select the implementation based on AVX512_BF16 instructions. |
Description
Set this environment variable to select the implementation for BF16 <-> FP32 conversion in the reduction phase of collective operations. The default value depends on the instruction set support of the specific CPU. The AVX512_BF16-based implementation takes precedence over the AVX512F-based one.
CCL_FP16#
Syntax
CCL_FP16=<value>
Arguments
| <value> | Description |
|---|---|
| f16c | Select the implementation based on F16C instructions. |
| avx512f | Select the implementation based on AVX512F instructions. |
| avx512fp16 | Select the implementation based on AVX512FP16 instructions. |
Description
Set this environment variable to select the FP16 implementation used in the reduction phase of collective operations.
AVX512FP16 uses native FP16 numeric operations for reduction. AVX512F and F16C use FP16 <-> FP32 conversion operations to perform the reduction.
The default value depends on the instruction set support of the specific CPU. The AVX512FP16-based implementation takes precedence over the AVX512F- and F16C-based ones.
CCL_ATL_MPI_FP16#
Syntax
CCL_ATL_MPI_FP16=<value>
Arguments
| <value> | Description |
|---|---|
| 0 | Disables the Intel MPI native FP16 support. |
| 1 | Enables the Intel MPI native FP16 support (the default for version 2021.14). |
Description
Set this environment variable to enable or disable Intel MPI native FP16 support. Requires Intel MPI newer than 2021.13. This variable can be enabled with an MPI implementation other than Intel MPI, such as MPICH, but it will have no effect.
CCL_ATL_MPI_BF16#
Syntax
CCL_ATL_MPI_BF16=<value>
Arguments
| <value> | Description |
|---|---|
| 0 | Disables the Intel MPI native BF16 support. |
| 1 | Enables the Intel MPI native BF16 support (the default for version 2021.14). |
Description
Set this environment variable to enable or disable Intel MPI native BF16 support. Requires Intel MPI newer than 2021.13. This variable can be enabled with an MPI implementation other than Intel MPI, such as MPICH, but it will have no effect.
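For example, to bypass Intel MPI native BF16 support and select the AVX512_BF16-based conversion implementation instead (illustrative):
export CCL_ATL_MPI_BF16=0
export CCL_BF16=avx512bf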
CCL_LOG_LEVEL#
Syntax
CCL_LOG_LEVEL=<value>
Arguments
| <value> |
|---|
| error |
| warn |
| info |
| debug |
| trace |
Description
Set this environment variable to control the logging level.
CCL_ITT_LEVEL#
Syntax
CCL_ITT_LEVEL=<value>
Arguments
| <value> | Description |
|---|---|
| 1 | Enable support for ITT profiling. |
| 0 | Disable support for ITT profiling (the default). |
Description
Set this environment variable to specify Intel® Instrumentation and Tracing Technology (ITT) profiling level. Once the environment variable is enabled (value > 0), it is possible to collect and display profiling data for oneCCL using tools such as Intel® VTune™ Profiler.
Fusion#
The group of environment variables to control fusion of collective operations.
CCL_FUSION#
Syntax
CCL_FUSION=<value>
Arguments
| <value> | Description |
|---|---|
| 1 | Enable fusion of collective operations. |
| 0 | Disable fusion of collective operations (the default). |
Description
Set this environment variable to control fusion of collective operations. Whether fusion actually occurs depends on the additional settings described below.
CCL_FUSION_BYTES_THRESHOLD#
Syntax
CCL_FUSION_BYTES_THRESHOLD=<value>
Arguments
| <value> | Description |
|---|---|
| SIZE | Bytes threshold for a collective operation. If the size of a communication buffer in bytes is less than or equal to this threshold, the operation is fused. |
Description
Set this environment variable to specify the threshold of the number of bytes for a collective operation to be fused.
CCL_FUSION_COUNT_THRESHOLD#
Syntax
CCL_FUSION_COUNT_THRESHOLD=<value>
Arguments
| <value> | Description |
|---|---|
| COUNT | The threshold for the number of collective operations. oneCCL can fuse together no more than this number of operations at a time. |
Description
Set this environment variable to specify the count threshold for a collective operation to be fused.
CCL_FUSION_CYCLE_MS#
Syntax
CCL_FUSION_CYCLE_MS=<value>
Arguments
| <value> | Description |
|---|---|
| MS | The frequency of checking for collective operations to be fused, in milliseconds. |
Description
Set this environment variable to specify the frequency of checking for collective operations to be fused.
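For example, an illustrative fusion configuration that fuses small operations and checks for fusion candidates every 0.2 milliseconds (the values are not tuned recommendations):
export CCL_FUSION=1
export CCL_FUSION_BYTES_THRESHOLD=16384
export CCL_FUSION_COUNT_THRESHOLD=256
export CCL_FUSION_CYCLE_MS=0.2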
CCL_PRIORITY#
Syntax
CCL_PRIORITY=<value>
Arguments
| <value> | Description |
|---|---|
| direct | You have to explicitly specify the priority using the operation attribute. |
| lifo | Priority is implicitly increased on each collective call. You do not have to specify the priority. |
| none | Disable prioritization (the default). |
Description
Set this environment variable to control the priority mode of collective operations.
CCL_MAX_SHORT_SIZE#
Syntax
CCL_MAX_SHORT_SIZE=<value>
Arguments
| <value> | Description |
|---|---|
| SIZE | Bytes threshold for a collective operation. If the size of a communication buffer in bytes is less than or equal to this threshold, the operation is not split. |
Description
Set this environment variable to specify the threshold of the number of bytes for a collective operation to be split.
CCL_SYCL_OUTPUT_EVENT#
Syntax
CCL_SYCL_OUTPUT_EVENT=<value>
Arguments
| <value> | Description |
|---|---|
| 1 | Enable support for a SYCL output event (the default). |
| 0 | Disable support for a SYCL output event. |
Description
Set this environment variable to control support for a SYCL output event.
Once the support is enabled, you can retrieve a SYCL output event from a oneCCL event using the get_native() method. The oneCCL event must be associated with a oneCCL communication operation.
CCL_ZE_LIBRARY_PATH#
Syntax
CCL_ZE_LIBRARY_PATH=<value>
Arguments
| <value> | Description |
|---|---|
| <path> | Specify the name and full path to the Level-Zero library for dynamic loading. |
Description
Set this environment variable to specify the name and full path to the Level-Zero library. The path should be absolute and validated. Set this variable if Level-Zero is not located in the default path. By default, oneCCL uses the libze_loader.so name for dynamic loading.
Point-To-Point Operations#
CCL_RECV#
Syntax
CCL_RECV=<value>
Arguments
| <value> | Description |
|---|---|
| direct | Based on the MPI*/OFI* transport layer. |
| topo | Uses Intel(R) Xe Link technology across GPUs in a multi-GPU node. The default for GPU buffers. |
| offload | Based on the MPI*/OFI* transport layer and GPU RDMA when supported by the hardware. |
CCL_SEND#
Syntax
CCL_SEND=<value>
Arguments
| <value> | Description |
|---|---|
| direct | Based on the MPI*/OFI* transport layer. |
| topo | Uses Intel(R) Xe Link technology across GPUs in a multi-GPU node. The default for GPU buffers. |
| offload | Based on the MPI*/OFI* transport layer and GPU RDMA when supported by the hardware. |
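For example, to route both point-to-point directions through the transport layer with GPU RDMA where the hardware supports it (illustrative):
export CCL_SEND=offload
export CCL_RECV=offload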
CCL_ZE_TMP_BUF_SIZE#
Syntax
CCL_ZE_TMP_BUF_SIZE=<value>
Arguments
| <value> | Description |
|---|---|
| SIZE | Size of the temporary buffer (in bytes) oneCCL uses to perform collective operations with the topo algorithm in the Level Zero path. The default is 536870912. |
Description
Set this environment variable to change the size of the temporary buffer used by the topo algorithm in the Level Zero path. The value is specified in bytes. The default value is 536870912.
You can tune this value depending on the system memory available, the memory the application requires, and the message sizes of the collectives used. With larger values, oneCCL consumes more memory but can provide higher performance. Similarly, smaller values reduce memory utilization but can degrade performance.
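For example, to halve the temporary buffer relative to its default (268435456 bytes is 256 MB, half of the 536870912-byte default):
export CCL_ZE_TMP_BUF_SIZE=268435456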