Environment Variables#

Collective Algorithms Selection#

oneCCL supports collective operations on host (CPU) memory buffers and device (GPU) memory buffers. For GPU buffers, oneCCL provides two code paths: one uses Level Zero directly, and the other uses SYCL. The SYCL path is newer, still under development, and does not yet support all collectives.

For the Level Zero implementation with GPU buffers, oneCCL collectives execute a hierarchical algorithm composed of an optimized scale-up phase (communication between ranks/processes in the same node) and a scaleout phase (communication between ranks/processes on different nodes). With CPU buffers, the current collective algorithms do not support separate scale-up and scaleout phases; only a non-hierarchical algorithm can be chosen.

With CCL_<coll_name>=<algo_name>, you can select the algorithm for the collective <coll_name>. For GPU buffers, the default algorithm is topo, the topology-aware scale-up algorithm. If you select an algorithm other than topo, oneCCL runs a non-hierarchical algorithm: it copies the GPU buffers to the host (CPU) and runs the specified algorithm there.

For CPU buffers, topo is not available; you can only select one of the other algorithms in the table for a given collective.

If the collective uses GPU buffers, you can select whether the implementation of the scale-up algorithm should use copy engines or kernels. There is also the option to select the scaleout algorithm using CCL_<coll_name>_SCALEOUT=<algo_name>.
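As a minimal sketch of the naming pattern described above, the snippet below builds the variable names for one collective. The choice of ALLREDUCE and of ring for the scaleout phase is only an illustration, not a recommendation.

```shell
# Hypothetical choice of collective; any <coll_name> from this section could be used.
coll="ALLREDUCE"

# Scale-up phase: keep the topology-aware algorithm (the default for GPU buffers).
export "CCL_${coll}=topo"
# Scaleout phase: pick an algorithm listed for this collective (ring is an example).
export "CCL_${coll}_SCALEOUT=ring"

# Show the resulting settings.
env | grep "^CCL_${coll}" | sort
```

Variables set this way apply to every process started from the same shell, so they need to be exported before the application is launched.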

Next, environment variables for collective algorithm selection are explained based on the code path (Level Zero or SYCL), the collective being called, and the type of buffer (GPU or CPU).

Level Zero Path#

ALLGATHER/ALLGATHERV#

CCL_ALLGATHER/CCL_ALLGATHERV#

Syntax

For the whole message size:

CCL_ALLGATHER=<algo_name>
CCL_ALLGATHERV=<algo_name>

For a specific message size range:

CCL_ALLGATHER="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
CCL_ALLGATHERV="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

  • <algo_name> is selected from the list of the available collective algorithms.

  • <size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_ALLGATHER="direct:0-8192;ring:8193-max"
CCL_ALLGATHERV="direct:0-8192;ring:8193-max"

Arguments

<algo_name>

Description

topo

Topology-aware algorithm for scale-up. The default for GPU buffers. Not available for CPU buffers.

direct

Based on MPI_Iallgather/MPI_Iallgatherv.

naive

Send to all, receive from all.

flat

alltoall-based algorithm.

multi_bcast

Series of broadcast operations with different root ranks.

ring

ring-based algorithm.

Description

Use this environment variable to specify the algorithm for ALLGATHER/ALLGATHERV.

If using GPU buffers, select CCL_ALLGATHER=topo or CCL_ALLGATHERV=topo (the default) to use a hierarchical algorithm for scale-up data transfer across GPUs in the same node. For GPU buffers, when selecting an algorithm different from topo, oneCCL copies the data to the host and follows the specified CPU algorithm.

CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL#

Syntax

CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=<value>

Arguments

<value>

Description

1

Uses compute kernels to transfer data across GPUs for the allgather phase of ALLGATHER / ALLGATHERV. The default value.

0

Uses copy engines to transfer data across GPUs for the allgather phase of the ALLGATHER / ALLGATHERV collectives.

Description

When using GPU buffers, set this environment variable to specify how the scale-up phase of ALLGATHER/ALLGATHERV transfers data: with compute kernels or with copy engines.

This option is only available if CCL_ALLGATHER=topo or CCL_ALLGATHERV=topo (the default for GPU buffers).

Note

This environment variable applies to both ALLGATHER and ALLGATHERV.
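For illustration, the following sketch switches the scale-up phase of ALLGATHERV from compute kernels to copy engines. Setting topo explicitly is redundant for GPU buffers, since it is the default, and is shown only to make the dependency visible.

```shell
# topo is already the default for GPU buffers; set here only for clarity.
export CCL_ALLGATHERV=topo
# 0 selects copy engines; 1 (the default) selects compute kernels.
export CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=0

echo "CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=${CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL}"
```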

CCL_ALLGATHER_SCALEOUT/CCL_ALLGATHERV_SCALEOUT#

Syntax

For the whole message size:

CCL_ALLGATHER_SCALEOUT=<algo_name>
CCL_ALLGATHERV_SCALEOUT=<algo_name>

For a specific message size range:

CCL_ALLGATHER_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
CCL_ALLGATHERV_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

  • <algo_name> is selected from the list of the available scaleout collective algorithms.

  • <size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_ALLGATHER_SCALEOUT="direct:0-8192;ring:8193-max"
CCL_ALLGATHERV_SCALEOUT="direct:0-8192;ring:8193-max"

Arguments

<algo_name>

Description

direct

Based on MPI_Iallgather/MPI_Iallgatherv.

naive

Send to all, receive from all.

flat

alltoall-based algorithm.

multi_bcast

Series of broadcast operations with different root ranks.

ring

ring-based algorithm. The default value.

Description

When using GPU buffers, set this environment variable to specify the algorithm for the scaleout phase of ALLGATHER/ALLGATHERV. This option is only available if CCL_ALLGATHER=topo or CCL_ALLGATHERV=topo (the default for GPU buffers).

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.
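A possible way to combine these settings is sketched below: a per-size scaleout selection together with info-level logging so that the resulting selection table can be inspected. The 8192-byte boundary is an arbitrary example value.

```shell
# Scale-up stays on the topology-aware algorithm (default for GPU buffers).
export CCL_ALLGATHERV=topo
# Scaleout: direct for messages up to 8192 bytes, ring for larger ones (example split).
export CCL_ALLGATHERV_SCALEOUT="direct:0-8192;ring:8193-max"
# Log the algorithm selection table that oneCCL actually uses.
export CCL_LOG_LEVEL=info

echo "CCL_ALLGATHERV_SCALEOUT=${CCL_ALLGATHERV_SCALEOUT}"
```

The double quotes matter: without them the shell would treat the semicolon as a command separator.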

ALLREDUCE#

CCL_ALLREDUCE#

Syntax

For the whole message size:

CCL_ALLREDUCE=<algo_name>

For a specific message size range:

CCL_ALLREDUCE="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

  • <algo_name> is selected from the list of available collective algorithms.

  • <size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_ALLREDUCE="recursive_doubling:0-8192;rabenseifner:8193-1048576;ring:1048577-max"

Arguments

<algo_name>

Description

topo

Topology-aware algorithm for scale-up. The default for GPU buffers. Not available for CPU buffers.

direct

Based on MPI_Iallreduce.

rabenseifner

Rabenseifner algorithm.

nreduce

May be beneficial for imbalanced workloads.

ring

reduce_scatter + allgather ring. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining on reduce_scatter phase.

double_tree

double-tree algorithm.

recursive_doubling

Recursive doubling algorithm.

2d

Two-dimensional algorithm (reduce_scatter + allreduce + allgather).

Description

Use this environment variable to specify the algorithm for ALLREDUCE.

If using GPU buffers, select CCL_ALLREDUCE=topo (the default) to use a hierarchical algorithm for scale-up data transfer across GPUs in the same node. For GPU buffers, when selecting an algorithm different from topo, oneCCL copies the data to the host and follows the specified CPU algorithm.

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL (GPU buffers only)#

Syntax

CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=<value>

Arguments

<value>

Description

1

Uses compute kernels to transfer data across GPUs for the reduce-scatter phase of the ALLREDUCE collectives. The default value.

0

Uses copy engines to transfer data across GPUs for the reduce-scatter phase of the ALLREDUCE.

Description

When using GPU buffers, set this environment variable to specify how to perform the reduce-scatter portion of the scale-up ALLREDUCE collective: with compute kernels or with copy engines.

This option is only available if CCL_ALLREDUCE=topo (the default for GPU buffers).

CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL (GPU buffers only)#

Syntax

CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=<value>

Arguments

<value>

Description

1

Uses compute kernels to transfer data across GPUs for the allgather phase of ALLREDUCE. The default value.

0

Uses copy engines to transfer data across GPUs for the allgather phase of the ALLREDUCE collective.

Description

ALLREDUCE is implemented as a reduce-scatter phase followed by an allgather phase.

When using GPU buffers, set this environment variable to specify how to perform the allgather portion of the scale-up ALLREDUCE collective: with compute kernels or with copy engines. This option is only available if CCL_ALLREDUCE=topo (the default for GPU buffers).
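Because the two phases of ALLREDUCE are controlled separately, they can be mixed. The sketch below keeps compute kernels for the reduce-scatter phase and uses copy engines for the allgather phase; whether such a mix helps on a given system is something to measure, not something this example claims.

```shell
# topo is the default for GPU buffers; set explicitly for clarity.
export CCL_ALLREDUCE=topo
# Reduce-scatter phase: compute kernels (the default).
export CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=1
# Allgather phase: copy engines.
export CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=0

echo "reduce-scatter kernels=${CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL}, allgather kernels=${CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL}"
```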

CCL_ALLREDUCE_SCALEOUT (GPU buffers only)#

Syntax

For the whole message size:

CCL_ALLREDUCE_SCALEOUT=<algo_name>

For a specific message size range:

CCL_ALLREDUCE_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

  • <algo_name> is selected from the list of available collective algorithms.

  • <size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_ALLREDUCE_SCALEOUT="recursive_doubling:0-8192;rabenseifner:8193-1048576;ring:1048577-max"

Arguments

<algo_name>

Description

direct

Based on MPI_Iallreduce.

rabenseifner

Rabenseifner algorithm.

nreduce

May be beneficial for imbalanced workloads.

ring

reduce_scatter + allgather ring. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining on reduce_scatter phase. The default value.

double_tree

double-tree algorithm.

recursive_doubling

Recursive doubling algorithm.

Description

When using GPU buffers, set this environment variable to specify the scaleout algorithm for ALLREDUCE. This option is only available if CCL_ALLREDUCE=topo (the default for GPU buffers).

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

ALLTOALL, ALLTOALLV#

CCL_ALLTOALL, CCL_ALLTOALLV#

Syntax

For the whole message size:

CCL_ALLTOALL=<algo_name>  or CCL_ALLTOALLV=<algo_name>

For a specific message size range:

CCL_ALLTOALL="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

or

CCL_ALLTOALLV="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

  • <algo_name> is selected from the list of available collective algorithms.

  • <size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_ALLTOALL="naive:0-8192;scatter:8193-max"

or

CCL_ALLTOALLV="naive:0-8192;scatter:8193-max"

Arguments

<algo_name>

Description

topo

Topology-aware algorithm. The default for GPU buffers. Not available for CPU buffers.

direct

Based on MPI_Ialltoall.

naive

Send to all, receive from all.

scatter

scatter-based algorithm.

CCL_ALLTOALLV_MONOLITHIC_KERNEL#

Syntax

CCL_ALLTOALLV_MONOLITHIC_KERNEL=<value>

Arguments

<value>

Description

1

Uses compute kernels to transfer data across GPUs for the ALLTOALL and ALLTOALLV collectives. The default value.

0

Uses copy engines to transfer data across GPUs for the ALLTOALL and ALLTOALLV collectives.

Description

When using GPU buffers, set this environment variable to specify how the scale-up algorithm for ALLTOALL or ALLTOALLV transfers data: with compute kernels or with copy engines.

This option is only available if CCL_ALLTOALL=topo or CCL_ALLTOALLV=topo (the default for GPU buffers).

CCL_ALLTOALL_SCALEOUT, CCL_ALLTOALLV_SCALEOUT#

Syntax

For the whole message size:

CCL_ALLTOALL_SCALEOUT=<algo_name>  or CCL_ALLTOALLV_SCALEOUT=<algo_name>

For a specific message size range:

CCL_ALLTOALL_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

or

CCL_ALLTOALLV_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

  • <algo_name> is selected from the list of available collective algorithms.

  • <size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_ALLTOALL_SCALEOUT="naive:0-8192;scatter:8193-max"

or

CCL_ALLTOALLV_SCALEOUT="naive:0-8192;scatter:8193-max"

Arguments

<algo_name>

Description

naive

Send to all, receive from all.

scatter

scatter-based algorithm. The default value.

Description

When using GPU buffers, set this environment variable to specify the scaleout algorithm for ALLTOALL or ALLTOALLV. This option is only available if CCL_ALLTOALL=topo or CCL_ALLTOALLV=topo (the default for GPU buffers).

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

BARRIER#

CCL_BARRIER#

Syntax

CCL_BARRIER=<algo_name>

Arguments

<algo_name>

Description

direct

Based on MPI_Ibarrier.

ring

Ring-based algorithm.

Description

Use this environment variable to select the barrier algorithm.

BROADCAST#

CCL_BCAST#

Syntax

CCL_BCAST=<algo_name>

Arguments

<algo_name>

Description

topo

Topology-aware algorithm. The default for GPU buffers. Not available for CPU buffers.

direct

Based on MPI_Ibcast.

ring

ring-based algorithm.

double_tree

double-tree algorithm.

naive

Send to all from root rank.

Description

Use this environment variable to select the algorithm used for broadcast.

Note

BCAST does not yet support the CCL_BCAST_SCALEOUT environment variable. To change the algorithm for BCAST, use the CCL_BCAST environment variable.

REDUCE#

CCL_REDUCE#

Syntax

For the whole message size:

CCL_REDUCE=<algo_name>

For a specific message size range:

CCL_REDUCE="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

  • <algo_name> is selected from the list of available collective algorithms.

  • <size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_REDUCE="direct:0-8192;double_tree:1048577-max"

Arguments

<algo_name>

Description

topo

Topology-aware algorithm for scale-up. The default for GPU buffers. Not available for CPU buffers.

direct

Based on MPI_Ireduce.

rabenseifner

Rabenseifner algorithm.

tree

tree algorithm.

double_tree

double-tree algorithm.

Description

Set this environment variable to specify the algorithm for REDUCE.

If using GPU buffers, select CCL_REDUCE=topo (the default) to use a hierarchical algorithm for scale-up data transfer across GPUs in the same node. For GPU buffers, when selecting an algorithm different from topo, oneCCL copies the data to the host and follows the specified CPU algorithm.

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL (GPU buffers only)#

Syntax

CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=<value>

Arguments

<value>

Description

1

Uses compute kernels to transfer data across GPUs for the reduce-scatter phase of the REDUCE collective. The default value.

0

Uses copy engines to transfer data across GPUs for the reduce-scatter phase of the REDUCE collective.

Description

When using GPU buffers, set this environment variable to specify how the scale-up algorithm for REDUCE transfers data: with compute kernels or with copy engines.

This option is only available if CCL_REDUCE=topo (the default for GPU buffers).

CCL_REDUCE_SCALEOUT (GPU buffers only)#

Syntax

For the whole message size:

CCL_REDUCE_SCALEOUT=<algo_name>

For a specific message size range:

CCL_REDUCE_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

  • <algo_name> is selected from the list of available collective algorithms.

  • <size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_REDUCE_SCALEOUT="direct:0-8192;double_tree:1048577-max"

Arguments

<algo_name>

Description

direct

Based on MPI_Ireduce.

rabenseifner

Rabenseifner algorithm.

tree

tree algorithm.

double_tree

double-tree algorithm. The default value.

Description

When using GPU buffers, set this environment variable to specify the scaleout algorithm for REDUCE. This option is only available if CCL_REDUCE=topo (the default for GPU buffers).

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

REDUCE_SCATTER#

CCL_REDUCE_SCATTER#

Syntax

For the whole message size:

CCL_REDUCE_SCATTER=<algo_name>

For a specific message size range:

CCL_REDUCE_SCATTER="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

  • <algo_name> is selected from the list of available collective algorithms.

  • <size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_REDUCE_SCATTER="direct:0-8192;ring:1048577-max"

Arguments

<algo_name>

Description

topo

Topology-aware algorithm for scale-up. The default for GPU buffers. Not available for CPU buffers.

direct

Based on MPI_Ireduce_scatter_block.

naive

Send to all, receive, and reduce from all.

ring

ring-based algorithm. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining.

Description

Use this environment variable to specify the algorithm for REDUCE_SCATTER. If using GPU buffers, select CCL_REDUCE_SCATTER=topo (the default) to use a hierarchical algorithm for scale-up data transfer across GPUs in the same node. For GPU buffers, when selecting an algorithm different from topo, oneCCL copies the data to the host and follows the specified CPU algorithm.

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL (GPU buffers only)#

Syntax

CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=<value>

Arguments

<value>

Description

1

Uses compute kernels to transfer data across GPUs for the reduce-scatter phase of the REDUCE_SCATTER collective. The default value.

0

Uses copy engines to transfer data across GPUs for the reduce-scatter phase of the REDUCE_SCATTER collective.

Description

When using GPU buffers, set this environment variable to specify how to perform the reduce-scatter portion of the scale-up REDUCE_SCATTER collective: with compute kernels or with copy engines.

This option is only available if CCL_REDUCE_SCATTER=topo (the default for GPU buffers).

CCL_REDUCE_SCATTER_SCALEOUT (GPU buffers only)#

Syntax

For the whole message size:

CCL_REDUCE_SCATTER_SCALEOUT=<algo_name>

For a specific message size range:

CCL_REDUCE_SCATTER_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

  • <algo_name> is selected from the list of available collective algorithms.

  • <size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_REDUCE_SCATTER_SCALEOUT="direct:0-8192;ring:1048577-max"

Arguments

<algo_name>

Description

direct

Based on MPI_Ireduce_scatter_block.

naive

Send to all, receive, and reduce from all. The default value.

ring

Ring-based algorithm. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining.

Description

When using GPU buffers, set this environment variable to specify the scaleout algorithm for REDUCE_SCATTER. This option is only available if CCL_REDUCE_SCATTER=topo (the default for GPU buffers).

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

SYCL Path (Default with 2021.14)#

All collectives#

CCL_ENABLE_SYCL_KERNELS#

Syntax

CCL_ENABLE_SYCL_KERNELS=<value>

Arguments

<value>

Description

1

Enable SYCL kernels. The default value.

0

Disable SYCL kernels.

Description

Setting this environment variable to 1 enables SYCL kernel-based implementations for ALLGATHERV, ALLREDUCE, and REDUCE_SCATTER.

This optimization covers all message sizes and supports sum operations on the following data types:

  • int32

  • fp32

  • fp16

  • bf16

oneCCL falls back to other implementations when the SYCL kernels do not support a given case, so it is safe to set this environment variable.

Note

The name of this variable in 2021.12 was CCL_SKIP_SCHEDULER. Starting with 2021.13, the variable has been renamed to CCL_ENABLE_SYCL_KERNELS.
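As a small illustration, the variable can be set explicitly to document intent, or set to 0 to compare against the other implementations. The comparison workflow is a suggestion of this example, not something the text above prescribes.

```shell
# SYCL kernels are enabled by default; setting the variable makes the choice explicit.
export CCL_ENABLE_SYCL_KERNELS=1

# To compare against the other implementations, disable the SYCL kernels instead:
# export CCL_ENABLE_SYCL_KERNELS=0

echo "CCL_ENABLE_SYCL_KERNELS=${CCL_ENABLE_SYCL_KERNELS}"
```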

ALLGATHER/ALLGATHERV#

CCL_SYCL_ALLGATHERV_TMP_BUF#

Syntax

CCL_SYCL_ALLGATHERV_TMP_BUF=<value>

Arguments

<value>

Description

1

Uses a persistent temporary buffer to perform the ALLGATHER/ALLGATHERV.

0

Performs an IPC handle exchange, avoiding copies to temporary buffers. Default value.

Description

Specifies whether the ALLGATHER/ALLGATHERV implementation should use a persistent temporary buffer. The implementation with temporary buffers makes the collective fully asynchronous, but adds some overhead due to the extra copy of the user buffer to a (persistent) temporary buffer. The current default uses Level Zero IPC to avoid the copies to the temporary buffer.

CCL_SYCL_ALLGATHERV_SMALL_THRESHOLD#

Syntax

CCL_SYCL_ALLGATHERV_SMALL_THRESHOLD=<value>

Arguments

<value>

Description

>=0

Threshold in bytes to specify the small size algorithm. Default value is 131072.

Description

ALLGATHER/ALLGATHERV collectives with message sizes smaller than the specified threshold will use an algorithm specialized for small-sized messages.

CCL_SYCL_ALLGATHERV_SCALEOUT_THRESHOLD#

Syntax

CCL_SYCL_ALLGATHERV_SCALEOUT_THRESHOLD=<value>

Arguments

<value>

Description

>=0

Threshold in bytes to specify when scale-out ALLGATHER/ALLGATHERV uses SYCL kernel-based implementation. Default value is 1048576.

Description

For ALLGATHER/ALLGATHERV collectives with total message sizes below this threshold (in bytes), the SYCL path is chosen to execute the collective operation. For message sizes exceeding this threshold, the implementation switches to the Level Zero path. The total message size is the number of bytes received from all participating processes.

ALLREDUCE#

CCL_SYCL_ALLREDUCE_TMP_BUF#

Syntax

CCL_SYCL_ALLREDUCE_TMP_BUF=<value>

Arguments

<value>

Description

1

Uses a persistent temporary buffer to perform the ALLREDUCE operation.

0

Performs an IPC handle exchange, avoiding copies to temporary buffers. Default value.

Description

Specifies whether the ALLREDUCE implementation should use a persistent temporary buffer. The implementation with temporary buffers makes the collective fully asynchronous, but adds some additional overhead due to the extra copy of the user buffer to a (persistent) temporary buffer. The current default uses Level Zero IPC support to avoid the copies to the temporary buffer.

CCL_SYCL_ALLREDUCE_SMALL_THRESHOLD#

Syntax

CCL_SYCL_ALLREDUCE_SMALL_THRESHOLD=<value>

Arguments

<value>

Description

>=0

Threshold in bytes to specify the small size algorithm. Default value is 524288.

Description

ALLREDUCE collective with message sizes smaller than the specified threshold will use an algorithm specialized for small-sized messages.

CCL_SYCL_ALLREDUCE_SCALEOUT_THRESHOLD#

Syntax

CCL_SYCL_ALLREDUCE_SCALEOUT_THRESHOLD=<value>

Arguments

<value>

Description

>=0

Threshold in bytes to specify when scale-out allreduce uses SYCL kernel-based implementation. Default value is 1048576.

Description

For ALLREDUCE collectives, with message sizes below this threshold in bytes, the SYCL path is chosen to execute the collective operation. For message sizes exceeding this threshold, the implementation will switch to the Level Zero Path.

CCL_SYCL_ALLREDUCE_SCALEOUT_DIRECT_THRESHOLD#

Syntax

CCL_SYCL_ALLREDUCE_SCALEOUT_DIRECT_THRESHOLD=<value>

Arguments

<value>

Description

>=0

Threshold in bytes to specify when allreduce collective selects direct MPI_Iallreduce for the scale-out phase of the collective. Default value is 1048576.

Description

For allreduce collectives with message sizes below this threshold in bytes, the MPI_Iallreduce direct algorithm is selected as the scale-out phase of the collective. For message sizes above this threshold and under the CCL_SYCL_ALLREDUCE_SCALEOUT_THRESHOLD, the default algorithm (ring) is selected.
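The interaction of the two scale-out thresholds can be sketched with a small helper. The function name and its output labels are hypothetical, the threshold values are illustrative, and the behavior exactly at a threshold is an assumption; the helper mirrors the description above, not oneCCL's actual code.

```shell
# Illustrative values only; the documented defaults are 1048576 for both.
export CCL_SYCL_ALLREDUCE_SCALEOUT_DIRECT_THRESHOLD=65536
export CCL_SYCL_ALLREDUCE_SCALEOUT_THRESHOLD=4194304

# Hypothetical helper: reports which scale-out choice the description implies
# for a message of the given size in bytes.
allreduce_scaleout_choice() {
    size=$1
    if [ "$size" -ge "$CCL_SYCL_ALLREDUCE_SCALEOUT_THRESHOLD" ]; then
        echo "level_zero_path"      # above the SYCL scale-out threshold
    elif [ "$size" -lt "$CCL_SYCL_ALLREDUCE_SCALEOUT_DIRECT_THRESHOLD" ]; then
        echo "sycl_direct"          # direct MPI_Iallreduce for scale-out
    else
        echo "sycl_ring"            # default ring algorithm for scale-out
    fi
}

allreduce_scaleout_choice 1024
allreduce_scaleout_choice 1048576
allreduce_scaleout_choice 8388608
```

With the documented defaults both thresholds are equal, so under this reading the ring range would be empty unless one of the two values is changed.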

REDUCE_SCATTER#

CCL_SYCL_REDUCE_SCATTER_TMP_BUF#

Syntax

CCL_SYCL_REDUCE_SCATTER_TMP_BUF=<value>

Arguments

<value>

Description

1

Uses a persistent temporary buffer to perform the REDUCE_SCATTER operation.

0

Performs an IPC handle exchange, avoiding copies to temporary buffers. Default value.

Description

Specifies if the REDUCE_SCATTER implementation should use a persistent temporary buffer or not. The implementation with temporary buffers makes the collective fully asynchronous, but adds some additional overhead due to the extra copy of the user buffer to a (persistent) temporary buffer. The current default uses Level Zero IPC support to avoid the copies to the temporary buffer.

CCL_SYCL_REDUCE_SCATTER_SCALEOUT_DIRECT_THRESHOLD#

Syntax

CCL_SYCL_REDUCE_SCATTER_SCALEOUT_DIRECT_THRESHOLD=<value>

Arguments

<value>

Description

>=0

Threshold in bytes to specify when reduce-scatter collective selects direct MPI_Ireduce_scatter as scale-out phase algorithm. Default value is 1048576.

Description

For reduce-scatter collectives with message sizes below this threshold in bytes, MPI_Ireduce_scatter direct algorithm is selected for the scale-out phase of the collective. For message sizes above this threshold and under the CCL_SYCL_REDUCE_SCATTER_SCALEOUT_THRESHOLD, the default algorithm (ring) is selected.

CCL_SYCL_REDUCE_SCATTER_SCALEOUT_THRESHOLD#

Syntax

CCL_SYCL_REDUCE_SCATTER_SCALEOUT_THRESHOLD=<value>

Arguments

<value>

Description

>=0

Threshold in bytes to specify when scale-out REDUCE_SCATTER uses SYCL kernel-based implementation. Default value is 4294967296.

Description

For REDUCE_SCATTER collectives with message sizes below this threshold in bytes, the SYCL path is chosen to execute the collective operation. For message sizes exceeding this threshold, the implementation will switch to the Level Zero Path.

CCL_SYCL_REDUCE_SCATTER_SMALL_THRESHOLD#

Syntax

CCL_SYCL_REDUCE_SCATTER_SMALL_THRESHOLD=<value>

Arguments

<value>

Description

>=0

Threshold in bytes to specify the small-size algorithm. Default value is 2097152.

Description

REDUCE_SCATTER collectives with message sizes smaller than the specified threshold will use an algorithm specialized for small-sized messages.
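The three REDUCE_SCATTER thresholds can be set together, as in the sketch below. The values are illustrative assumptions rather than tuning advice, and the consistency check is a hypothetical convenience based on the descriptions above.

```shell
# Illustrative values, not tuning advice.
export CCL_SYCL_REDUCE_SCATTER_SMALL_THRESHOLD=1048576
export CCL_SYCL_REDUCE_SCATTER_SCALEOUT_DIRECT_THRESHOLD=262144
export CCL_SYCL_REDUCE_SCATTER_SCALEOUT_THRESHOLD=16777216

# Sanity check: the ring algorithm only has a non-empty size range when the
# direct threshold is below the scale-out threshold.
if [ "$CCL_SYCL_REDUCE_SCATTER_SCALEOUT_DIRECT_THRESHOLD" -lt "$CCL_SYCL_REDUCE_SCATTER_SCALEOUT_THRESHOLD" ]; then
    ring_range="nonempty"
else
    ring_range="empty"
fi
echo "ring size range is ${ring_range}"
```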

Workers#

This group of environment variables controls worker threads.

CCL_WORKER_COUNT#

Syntax

CCL_WORKER_COUNT=<value>

Arguments

<value>

Description

N

The number of worker threads per oneCCL rank (1 if not specified).

Description

Set this environment variable to specify the number of oneCCL worker threads. For GPU buffers, currently we do not recommend setting this variable to values larger than 1.

CCL_WORKER_AFFINITY#

Syntax

CCL_WORKER_AFFINITY=<cpulist>

Arguments

<cpulist>

Description

auto

Workers are automatically pinned to the last cores of the pin domain. The pin domain depends on the process launcher. If mpirun from the oneCCL package is used, the pin domain is the MPI process pin domain. Otherwise, the pin domain is all cores on the node.

<cpulist>

A comma-separated list of core numbers and/or ranges of core numbers for all local workers, one number per worker. The i-th local worker is pinned to the i-th core in the list. For example, <a>,<b>-<c> defines a list containing the core with number <a> and the range of cores with numbers from <b> to <c>. Core numbers should not exceed the number of cores available on the system. The length of the list should be equal to the number of workers.

Description

Set this environment variable to specify CPU affinity for oneCCL worker threads.
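A minimal sketch of pairing the worker count with an explicit affinity list follows. Core numbers 6 and 7 are placeholders, and the length check is a hypothetical helper that only handles plain comma-separated numbers, not ranges.

```shell
export CCL_WORKER_COUNT=2
# One core per worker: the i-th worker is pinned to the i-th core in the list.
# Cores 6 and 7 are placeholders; choose cores that exist on your system.
export CCL_WORKER_AFFINITY="6,7"

# Hypothetical sanity check: count the comma-separated entries
# (assumes the list contains no ranges such as 4-7).
entries=$(echo "$CCL_WORKER_AFFINITY" | tr ',' '\n' | wc -l | tr -d ' ')
if [ "$entries" -ne "$CCL_WORKER_COUNT" ]; then
    echo "warning: affinity list has $entries entries for $CCL_WORKER_COUNT workers" >&2
fi
```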

CCL_WORKER_MEM_AFFINITY#

Syntax

CCL_WORKER_MEM_AFFINITY=<nodelist>

Arguments

<nodelist>

Description

auto

Workers are automatically pinned to NUMA nodes that correspond to CPU affinity of workers.

<nodelist>

A comma-separated list of NUMA node numbers for all local workers, one number per worker. The i-th local worker is pinned to the i-th NUMA node in the list. The number should not exceed the number of NUMA nodes available on the system.

Description

Set this environment variable to specify memory affinity for oneCCL worker threads.

KVS#

CCL_KVS_MODE#

Syntax

CCL_KVS_MODE=<value>

Arguments

<value>

Description

pmi

PMI transport (default)

mpi

MPI transport

Description

Set this environment variable to specify the transport used to establish a connection between ranks during oneCCL communicator creation. Currently, the mpi value is only supported when the MPI transport is used (see CCL_ATL_TRANSPORT). For large-scale runs, we recommend setting CCL_KVS_MODE to mpi.
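For a large-scale run, the recommendation above could be applied as in this sketch. The transport is set explicitly only to make the dependency visible.

```shell
# The mpi KVS mode is only supported together with the MPI transport.
export CCL_ATL_TRANSPORT=mpi
export CCL_KVS_MODE=mpi

echo "CCL_KVS_MODE=${CCL_KVS_MODE} (transport: ${CCL_ATL_TRANSPORT})"
```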

ATL#

This group of environment variables controls the ATL (abstract transport layer).

CCL_ATL_TRANSPORT#

Syntax

CCL_ATL_TRANSPORT=<value>

Arguments

<value>

Description

mpi

MPI transport (default).

ofi

OFI (libfabric*) transport.

Description

Set this environment variable to select the transport for inter-process communications.

CCL_ATL_HMEM#

Syntax

CCL_ATL_HMEM=<value>

Arguments

<value>

Description

1

Enable heterogeneous memory support on the transport layer.

0

Disable heterogeneous memory support on the transport layer (default).

Description

Set this environment variable to enable handling of HMEM/GPU buffers by the transport layer. The actual HMEM support depends on the limitations on the transport level and system configuration.

CCL_ATL_SHM#

Syntax

CCL_ATL_SHM=<value>

Arguments

<value>

Description

0

Disables the OFI shared memory provider. The default value.

1

Enables the OFI shared memory provider.

Description

Set this environment variable to enable the OFI shared memory provider for communication between ranks in the same node when using host (CPU) buffers. This capability requires OFI as the transport (CCL_ATL_TRANSPORT=ofi).

The OFI/SHM provider can use the Intel(R) Data Streaming Accelerator* (DSA). To run it with DSA*, you need:

  • Linux* OS kernel support for the DSA* shared work queues

  • Libfabric* 1.17 or later

To enable DSA, set the following environment variables:

FI_SHM_DISABLE_CMA=1
FI_SHM_USE_DSA_SAR=1

Refer to the Libfabric* Programmer’s Manual for additional details about DSA* support in the SHM provider: https://ofiwg.github.io/libfabric/main/man/fi_shm.7.html.
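Putting the pieces of this section together, a complete environment for the OFI shared memory provider with DSA might look like the sketch below. It assumes the kernel and Libfabric requirements listed above are already met.

```shell
# The shared memory provider requires OFI as the transport.
export CCL_ATL_TRANSPORT=ofi
export CCL_ATL_SHM=1

# Libfabric SHM provider settings that enable DSA.
export FI_SHM_DISABLE_CMA=1
export FI_SHM_USE_DSA_SAR=1

env | grep -E '^(CCL_ATL_|FI_SHM_)' | sort
```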

CCL_PROCESS_LAUNCHER#

Syntax

CCL_PROCESS_LAUNCHER=<value>

Arguments

<value>

Description

hydra

Uses the MPI hydra job launcher. The default value.

torchrun

Uses torchrun (https://pytorch.org/docs/stable/elastic/run.html) as a job launcher.

pmix

Used with the PALS job launcher, which uses the pmix API. The mpiexec command should be similar to:

CCL_PROCESS_LAUNCHER=pmix CCL_ATL_TRANSPORT=mpi mpiexec -np 2 -ppn 2 --pmi=pmix ...

none

No job launcher is used. You should specify the values for CCL_LOCAL_SIZE and CCL_LOCAL_RANK.

Description

Set this environment variable to specify the job launcher.

CCL_LOCAL_SIZE#

Syntax

CCL_LOCAL_SIZE=<value>

Arguments

<value>

Description

SIZE

The total number of ranks on the local host.

Description

Set this environment variable to specify the total number of ranks on the local host.

CCL_LOCAL_RANK#

Syntax

CCL_LOCAL_RANK=<value>

Arguments

<value>

Description

RANK

Rank number of the current process on the local host.

Description

Set this environment variable to specify the rank number of the current process on the local host.
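When CCL_PROCESS_LAUNCHER=none is used, each process needs its own CCL_LOCAL_RANK together with a shared CCL_LOCAL_SIZE. The sketch below shows one way to pass them per process. The helper name run_local_rank is hypothetical, and the echo command stands in for the real application.

```shell
# Hypothetical helper: run a command as local rank $1 out of $2 ranks on this host.
run_local_rank() {
    rank="$1"
    size="$2"
    shift 2
    # The variables are set only in the environment of the launched command.
    env CCL_PROCESS_LAUNCHER=none \
        CCL_LOCAL_SIZE="$size" \
        CCL_LOCAL_RANK="$rank" \
        "$@"
}

# Start two placeholder "ranks"; replace the sh -c command with your application.
LOCAL_SIZE=2
rank=0
while [ "$rank" -lt "$LOCAL_SIZE" ]; do
    run_local_rank "$rank" "$LOCAL_SIZE" \
        sh -c 'echo "local rank $CCL_LOCAL_RANK of $CCL_LOCAL_SIZE"'
    rank=$((rank + 1))
done
```

In a real run the processes would be started concurrently (for example with `&` followed by `wait`); they are run one after another here only to keep the sketch simple.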

Multi-NIC#

CCL_MNIC, CCL_MNIC_NAME and CCL_MNIC_COUNT define filters to select multiple NICs. oneCCL workers will be pinned on selected NICs in a round-robin way.

CCL_MNIC#

Syntax

CCL_MNIC=<value>

Arguments

<value>

Description

global

Select all NICs available on the node.

local

Select all NICs local for the NUMA node that corresponds to process pinning.

none

Disable special NIC selection, use a single default NIC (default).

Description

Set this environment variable to control multi-NIC selection by NIC locality.

CCL_MNIC_NAME#

Syntax

CCL_MNIC_NAME=<namelist>

Arguments

<namelist>

Description

<namelist>

A comma-separated list of NIC full names or prefixes to filter NICs. Use the ^ symbol to exclude NICs starting with the specified prefixes. For example, if you provide a list mlx5_0,mlx5_1,^mlx5_2, NICs with the names mlx5_0 and mlx5_1 will be selected, while mlx5_2 will be excluded from the selection.

Description

Set this environment variable to control multi-NIC selection by NIC names.
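The include and exclude rule for the name list can be illustrated with a small shell function. This is only a sketch of the filter semantics as described above, not oneCCL's implementation. The function name nic_selected is hypothetical, and the behavior for a list that contains only exclusions (everything else passes) is an assumption.

```shell
# Sketch: return 0 if NIC name $1 passes the comma-separated filter list $2.
# Entries starting with ^ are exclusion prefixes; other entries are inclusion prefixes.
nic_selected() {
    nic="$1"
    has_include=0
    matched=0
    old_ifs="$IFS"
    IFS=','
    for entry in $2; do
        case "$entry" in
            '^'*)
                prefix="${entry#'^'}"
                case "$nic" in
                    "$prefix"*) IFS="$old_ifs"; return 1 ;;  # excluded by prefix
                esac
                ;;
            *)
                has_include=1
                case "$nic" in
                    "$entry"*) matched=1 ;;                  # included by prefix
                esac
                ;;
        esac
    done
    IFS="$old_ifs"
    # Assumption: with no inclusion entries, any non-excluded NIC passes.
    [ "$has_include" -eq 0 ] || [ "$matched" -eq 1 ]
}

# The list from the example above, applied to three NIC names.
for nic in mlx5_0 mlx5_1 mlx5_2; do
    if nic_selected "$nic" 'mlx5_0,mlx5_1,^mlx5_2'; then
        echo "$nic: selected"
    else
        echo "$nic: not selected"
    fi
done
```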

CCL_MNIC_COUNT#

Syntax

CCL_MNIC_COUNT=<value>

Arguments

<value>

Description

N

The maximum number of NICs that should be selected for oneCCL workers. If not specified, it equals the number of oneCCL workers.

Description

Set this environment variable to specify the maximum number of NICs to be selected. The actual number of NICs selected may be smaller due to limitations on transport level or system configuration.

Inter Process Communication (IPC)#

CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD#

Syntax

CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=<value>

Arguments

<value>

Description

N

The number of IPC handles in the receiver cache. The default value is 1000.

Description

Use this environment variable to change the number of IPC handles opened with zeMemOpenIpcHandle() that oneCCL maintains in its receiving cache. IPC handles refer to Level Zero Memory IPCs.

The IPC handles opened with zeMemOpenIpcHandle() are stored by oneCCL in the receiving cache. When the number of opened IPC handles exceeds the specified threshold, the cache evicts a handle using an LRU (Least Recently Used) policy. Starting with version 2021.10, the default value is 1000.

CCL_ZE_IPC_EXCHANGE#

Syntax

CCL_ZE_IPC_EXCHANGE=<value>

Arguments

<value>

Description

drmfd

Uses the DRM mechanism for Level Zero IPC exchange. This is an experimental mechanism that is used with OS kernels earlier than SP4. It is the default value in version 2021.13 and earlier. To use the DRM mechanism, the libdrm and drm headers must be available on the system.

pidfd

Uses the pidfd mechanism for Level Zero IPC exchange. It requires OS kernel SP4 or later because it needs Linux kernel 5.6 or later. It is the default in version 2021.14.

sockets

Uses the socket mechanism for Level Zero IPC exchange.

none

This mode is used by oneCCL when built on a system without drmfd support.

Description

Set this environment variable to specify the mechanism to use for Level Zero IPC exchange.

CCL_ZE_CACHE_GET_IPC_HANDLES_THRESHOLD#

Syntax

CCL_ZE_CACHE_GET_IPC_HANDLES_THRESHOLD=<value>

Arguments

<value>

Description

N

The number of IPC handles in the sender cache. The default value is 1000.

Description

Use this environment variable to change the number of IPC handles obtained with zeMemGetIpcHandle() that oneCCL maintains in its sender cache. IPC handles refer to Level Zero Memory IPCs.

The IPC handles obtained with zeMemGetIpcHandle() are stored by oneCCL in the sender cache. When the number of obtained IPC handles exceeds the specified threshold, the cache evicts a handle using an LRU (Least Recently Used) policy. The default value is 1000.
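The following fragment is a minimal sketch that sets the three IPC-related variables from this section together. The threshold value 2000 and the choice of the sockets mechanism are illustrative assumptions, not recommendations.

```shell
# Keep up to 2000 opened handles in the receiving cache (illustrative value;
# the documented default is 1000).
export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=2000
# Keep up to 2000 obtained handles in the sender cache (illustrative value).
export CCL_ZE_CACHE_GET_IPC_HANDLES_THRESHOLD=2000
# Exchange Level Zero IPC handles over sockets instead of the default mechanism.
export CCL_ZE_IPC_EXCHANGE=sockets
```

A larger threshold keeps more handles cached before LRU eviction starts, so fewer handles have to be reopened, but more of them stay open at once.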

Low-precision Data Types#

This group of environment variables controls the processing of low-precision data types.

CCL_BF16#

Syntax

CCL_BF16=<value>

Arguments

<value>

Description

avx512f

Select implementation based on AVX512F instructions.

avx512bf

Select implementation based on AVX512_BF16 instructions.

Description

Set this environment variable to select the implementation of the BF16 <-> FP32 conversion used in the reduction phase of a collective operation. The default value depends on the instruction set supported by the specific CPU. The AVX512_BF16-based implementation takes precedence over the AVX512F-based one.
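oneCCL chooses a default on its own, so setting CCL_BF16 is only needed when you want to pin the implementation explicitly. The sketch below mirrors the stated precedence (AVX512_BF16 over AVX512F) by inspecting CPU flags. It assumes a Linux system where /proc/cpuinfo lists the flags avx512_bf16 and avx512f; the helper name pick_bf16_impl is hypothetical.

```shell
# Hypothetical helper: map a space-separated list of CPU flags to a CCL_BF16 value.
# AVX512_BF16 is checked first, mirroring the precedence described above.
pick_bf16_impl() {
    case " $1 " in
        *" avx512_bf16 "*) echo "avx512bf" ;;
        *" avx512f "*)     echo "avx512f" ;;
        *)                 echo "" ;;
    esac
}

# Assumption: Linux exposes CPU flags on the "flags" line of /proc/cpuinfo.
cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2 | tr '\t' ' ')
bf16_impl=$(pick_bf16_impl "$cpu_flags")

# Export only when a supported instruction set was found.
if [ -n "$bf16_impl" ]; then
    export CCL_BF16="$bf16_impl"
fi
```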

CCL_FP16#

Syntax

CCL_FP16=<value>

Arguments

<value>

Description

f16c

Select implementation based on F16C instructions.

avx512f

Select implementation based on AVX512F instructions.

avx512fp16

Select implementation based on AVX512FP16 instructions.

Description

Set this environment variable to select the implementation used for FP16 data in the reduction phase of a collective operation. AVX512FP16 uses native FP16 numeric operations for the reduction. AVX512F and F16C use FP16 <-> FP32 conversion operations to perform the reduction. The default value depends on the instruction set supported by the specific CPU. The AVX512FP16-based implementation takes precedence over the AVX512F-based and F16C-based ones.

CCL_ATL_MPI_FP16#

Syntax

CCL_ATL_MPI_FP16=<value>

Arguments

<value>

Description

0

Disables the Intel MPI native FP16 support.

1

Enables the Intel MPI native FP16 support (default for version 2021.14).

Description

Set this environment variable to enable or disable Intel MPI native FP16 support. It requires an Intel MPI version newer than 2021.13. The variable can be set with an MPI implementation other than Intel MPI, such as MPICH, but it has no effect.

CCL_ATL_MPI_BF16#

Syntax

CCL_ATL_MPI_BF16=<value>

Arguments

<value>

Description

0

Disables the Intel MPI native BF16 support.

1

Enables the Intel MPI native BF16 support (default for version 2021.14).

Description

Set this environment variable to enable or disable Intel MPI native BF16 support. It requires an Intel MPI version newer than 2021.13. The variable can be set with an MPI implementation other than Intel MPI, such as MPICH, but it has no effect.

CCL_LOG_LEVEL#

Syntax

CCL_LOG_LEVEL=<value>

Arguments

<value>

error

warn (default)

info

debug

trace

Description

Set this environment variable to control the logging level.

CCL_ITT_LEVEL#

Syntax

CCL_ITT_LEVEL=<value>

Arguments

<value>

Description

1

Enable support for ITT profiling.

0

Disable support for ITT profiling (default).

Description

Set this environment variable to specify the Intel® Instrumentation and Tracing Technology (ITT) profiling level. Once the environment variable is enabled (value > 0), you can collect and display profiling data for oneCCL using tools such as Intel® VTune™ Profiler.

Fusion#

This group of environment variables controls the fusion of collective operations.

CCL_FUSION#

Syntax

CCL_FUSION=<value>

Arguments

<value>

Description

1

Enable fusion of collective operations

0

Disable fusion of collective operations (default)

Description

Set this environment variable to control fusion of collective operations. Whether operations are actually fused depends on the additional settings described below.

CCL_FUSION_BYTES_THRESHOLD#

Syntax

CCL_FUSION_BYTES_THRESHOLD=<value>

Arguments

<value>

Description

SIZE

Bytes threshold for a collective operation. If the size of a communication buffer in bytes is less than or equal to SIZE, then oneCCL fuses this operation with the other ones.

Description

Set this environment variable to specify the threshold of the number of bytes for a collective operation to be fused.

CCL_FUSION_COUNT_THRESHOLD#

Syntax

CCL_FUSION_COUNT_THRESHOLD=<value>

Arguments

<value>

Description

COUNT

The threshold for the number of collective operations. oneCCL can fuse together no more than COUNT operations at a time.

Description

Set this environment variable to specify the maximum number of collective operations that can be fused at a time.

CCL_FUSION_CYCLE_MS#

Syntax

CCL_FUSION_CYCLE_MS=<value>

Arguments

<value>

Description

MS

The frequency of checking for collective operations to be fused, in milliseconds:

  • A small MS value can improve latency.

  • A large MS value can help fuse a larger number of operations at a time.

Description

Set this environment variable to specify the frequency of checking for collective operations to be fused.
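The fusion variables are typically set together. The fragment below is a minimal sketch with illustrative values (none of them are documented defaults here). The helper is_fusion_candidate is hypothetical and only restates the size rule from CCL_FUSION_BYTES_THRESHOLD.

```shell
# Enable fusion and set illustrative thresholds.
export CCL_FUSION=1
export CCL_FUSION_BYTES_THRESHOLD=16384   # fuse operations of at most 16 KiB
export CCL_FUSION_COUNT_THRESHOLD=256     # fuse at most 256 operations at a time
export CCL_FUSION_CYCLE_MS=1              # check for fusable operations every 1 ms

# Hypothetical helper: succeeds if a buffer of $1 bytes is small enough to be fused,
# that is, its size is less than or equal to the bytes threshold.
is_fusion_candidate() {
    [ "$1" -le "$CCL_FUSION_BYTES_THRESHOLD" ]
}

if is_fusion_candidate 4096; then
    echo "a 4096-byte operation is within the fusion size threshold"
fi
```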

CCL_PRIORITY#

Syntax

CCL_PRIORITY=<value>

Arguments

<value>

Description

direct

You have to explicitly specify the priority using the priority operation attribute.

lifo

Priority is implicitly increased on each collective call. You do not have to specify priority.

none

Disable prioritization (default).

Description

Set this environment variable to control priority mode of collective operations.

CCL_MAX_SHORT_SIZE#

Syntax

CCL_MAX_SHORT_SIZE=<value>

Arguments

<value>

Description

SIZE

Bytes threshold for a collective operation (0 if not specified). If the size of a communication buffer in bytes is less than or equal to SIZE, then oneCCL does not split operation between workers. Applicable for ALLREDUCE, REDUCE and BROADCAST.

Description

Set this environment variable to specify the byte threshold at or below which a collective operation is not split between workers.

CCL_SYCL_OUTPUT_EVENT#

Syntax

CCL_SYCL_OUTPUT_EVENT=<value>

Arguments

<value>

Description

1

Enable support for SYCL output event (default).

0

Disable support for SYCL output event.

Description

Set this environment variable to control support for the SYCL output event. Once support is enabled, you can retrieve the SYCL output event from a oneCCL event using the get_native() method. The oneCCL event must be associated with a oneCCL communication operation.

CCL_ZE_LIBRARY_PATH#

Syntax

CCL_ZE_LIBRARY_PATH=<value>

Arguments

<value>

Description

PATH/NAME

Specify the name and full path to the Level Zero library for dynamic loading by oneCCL.

Description

Set this environment variable to specify the name and full path to the Level Zero library. The path should be absolute and validated. Set this variable if Level Zero is not located in the default path. By default, oneCCL uses the libze_loader.so name for dynamic loading.
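Because the path should be absolute and validated, it can help to check it before exporting the variable. The following is a minimal sketch; the helper name set_ze_library_path and the example path in the final comment are hypothetical.

```shell
# Hypothetical helper: export CCL_ZE_LIBRARY_PATH only if $1 is an absolute path
# to a readable regular file.
set_ze_library_path() {
    case "$1" in
        /*) ;;  # absolute path, continue with the checks
        *)  echo "error: '$1' is not an absolute path" >&2
            return 1 ;;
    esac
    if [ ! -f "$1" ] || [ ! -r "$1" ]; then
        echo "error: '$1' is not a readable file" >&2
        return 1
    fi
    export CCL_ZE_LIBRARY_PATH="$1"
}

# Example call (the location below is an assumption; adjust it to your system):
# set_ze_library_path /opt/level-zero/lib/libze_loader.so
```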

Point-To-Point Operations#

CCL_RECV#

Syntax

CCL_RECV=<value>

Arguments

<value>

Description

direct

Based on the MPI*/OFI* transport layer.

topo

Uses Intel(R) Xe Link technology across GPUs in a multi-GPU node. The default for GPU buffers.

offload

Based on the MPI*/OFI* transport layer and GPU RDMA when supported by the hardware.

CCL_SEND#

Syntax

CCL_SEND=<value>

Arguments

<value>

Description

direct

Based on the MPI*/OFI* transport layer.

topo

Uses Intel(R) Xe Link technology across GPUs in a multi-GPU node. The default for GPU buffers.

offload

Based on the MPI*/OFI* transport layer and GPU RDMA when supported by the hardware.

CCL_ZE_TMP_BUF_SIZE#

Syntax

CCL_ZE_TMP_BUF_SIZE=<value>

Arguments

<value>

Description

N

Size of the temporary buffer (in bytes) that oneCCL uses to perform collective operations with the topo algorithm in the Level Zero path. The default is 536870912, that is, 512 MB.

Description

Set this environment variable to change the size of the temporary buffer used by the topo algorithm in the Level Zero path. The value is specified in bytes. The default value is 536870912.

You can tune the value of this variable depending on the system memory available, the memory the application requires, and the message size of the collectives used. With larger values, oneCCL consumes more memory but can provide higher performance. Similarly, small values will reduce memory utilization, but can degrade performance.
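Since the value is given in bytes, shell arithmetic is a convenient way to derive it from a size in megabytes. The sketch below halves the buffer to 256 MB; that figure is only an illustration, not a recommendation.

```shell
# The documented default: 512 * 1024 * 1024 = 536870912 bytes.
default_bytes=$((512 * 1024 * 1024))

# Illustrative smaller setting: 256 * 1024 * 1024 = 268435456 bytes.
export CCL_ZE_TMP_BUF_SIZE=$((256 * 1024 * 1024))

echo "default: $default_bytes bytes, configured: $CCL_ZE_TMP_BUF_SIZE bytes"
```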