oneCCL Benchmark Point-to-Point User Guide#

oneCCL provides two benchmarks for measuring the performance of its point-to-point operations:

  • ccl_latency measures latency

  • ccl_bw measures bandwidth

The benchmarks are distributed with the oneCCL package. You can find them in the examples directory within the oneCCL installation path.

Build oneCCL Benchmark#

CPU-Only#

To build the benchmark, complete the following steps:

  1. Configure your environment. Source the installed oneCCL library for CPU-only support:

    source <oneCCL install dir>/ccl/latest/env/vars.sh --ccl-configuration=cpu
    
  2. Navigate to <oneCCL install dir>/share/doc/ccl/examples.

  3. Build the benchmark using the following command:

    cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$(pwd)/build/_install && cmake --build build -j $(nproc) -t install
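
For reference, the complete CPU-only sequence under an assumed default oneAPI prefix (/opt/intel/oneapi is an assumption; substitute your actual installation directory) looks like this:

    source /opt/intel/oneapi/ccl/latest/env/vars.sh --ccl-configuration=cpu
    cd /opt/intel/oneapi/ccl/latest/share/doc/ccl/examples
    cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$(pwd)/build/_install && cmake --build build -j $(nproc) -t install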
    

CPU-GPU#

  1. Configure your environment.

    • Source the Intel(R) oneAPI DPC++/C++ Compiler environment. See the compiler documentation for instructions.

    • Source the installed oneCCL library for CPU-GPU support:

      source <oneCCL install dir>/ccl/latest/env/vars.sh --ccl-configuration=cpu_gpu_dpcpp
      
  2. Navigate to <oneCCL install dir>/share/doc/ccl/examples.

  3. Build the SYCL benchmark with the following command:

    cmake -S . -B build -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCOMPUTE_BACKEND=dpcpp -DCMAKE_INSTALL_PREFIX=$(pwd)/build/_install && cmake --build build -j $(nproc) -t install
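
As a point of reference, a complete CPU-GPU sequence under an assumed default oneAPI layout (/opt/intel/oneapi and the compiler/latest component path are assumptions; adjust both to your system) could look like this:

    source /opt/intel/oneapi/compiler/latest/env/vars.sh
    source /opt/intel/oneapi/ccl/latest/env/vars.sh --ccl-configuration=cpu_gpu_dpcpp
    cd /opt/intel/oneapi/ccl/latest/share/doc/ccl/examples
    cmake -S . -B build -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCOMPUTE_BACKEND=dpcpp -DCMAKE_INSTALL_PREFIX=$(pwd)/build/_install && cmake --build build -j $(nproc) -t install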
    

Run oneCCL Point-to-Point Benchmark#

To run ccl_latency or ccl_bw, use the following commands:

mpirun -n 2 -ppn <P> ccl_latency [arguments]

mpirun -n 2 -ppn <P> ccl_bw [arguments]

Where:

  • 2 is the number of processes (this benchmark runs only with two processes).

  • <P> is the number of processes within a node. For this benchmark, <P> can only be 1 or 2. When <P> is 1, there is a single process on each node and the benchmark runs across two nodes. When <P> is 2, both processes run on the same node.
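
For example, assuming the benchmark binaries are on PATH, the two placements look like this (the hostfile name is illustrative; node placement can also be handled by your job scheduler):

mpirun -n 2 -ppn 2 ccl_latency

mpirun -n 2 -ppn 1 -hostfile <hostfile> ccl_latency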

The benchmark reports:

  • #bytes - the message size in the number of bytes

  • elem_count - the message size in the number of elements

  • #repetitions - the number of iterations

  • Latency (for the ccl_latency benchmark) - the time to send data from the sender process to the receiver process, where both the send and the receive operations are blocking. The time is reported in μsec.

  • Bandwidth (for the ccl_bw benchmark) - the achieved transfer rate, reported in Mbytes/second.

Both benchmarks always transfer elements of type int32.
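
For example, an elem_count of 1024 is reported as a #bytes value of 4096, since each int32 element occupies 4 bytes.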

Point-to-Point Benchmark Arguments#

ccl_latency and ccl_bw accept the same arguments, described below. To display them at the command line, use the --help argument.

-b, --backend
    Specify the backend. The possible values are cpu and gpu. For a CPU-only build, the backend is automatically set to cpu and only the cpu option is available. For a CPU-GPU build, both cpu and gpu are available, and gpu is the default. The cpu value allocates buffers in host (CPU) memory, while the gpu value allocates buffers in device (GPU) memory.
    Default: gpu for a CPU-GPU build, cpu for a CPU-only build

-i, --iters
    Specify the number of iterations executed by the benchmark.
    Default: 16

-w, --warmup_iters
    Specify the number of warmup iterations, that is, the number of iterations the benchmark runs before it starts timing the iterations specified with the -i argument.
    Default: 16

-p, --cache
    Specify whether to use persistent collectives (p=1) or not (p=0). Note: the benchmark currently does not support persistent collectives.
    Default: 0

-e, --sycl_queue_type
    Specify the type of SYCL queue. The possible values are 0 (out_order) and 1 (in_order).
    Default: 0 (out_order)

-s, --wait
    Specify the synchronization model, that is, whether the point-to-point operation is 1 (blocking) or 0 (non_blocking). Note: the benchmark currently supports only blocking point-to-point operations.
    Default: 1 (blocking)

-f, --min_elem_count
    Specify the minimum number of elements used for the operation.
    Default: 1

-t, --max_elem_count
    Specify the maximum number of elements used for the operation. Note: the -t and -f options specify the count in number of elements, so the total number of bytes is the number of elements multiplied by the size of the data type. For instance, -f 128 with the int32 data type gives 512 bytes (128 elements * 4 bytes). ccl_latency and ccl_bw run and report performance for the message sizes that correspond to the -f and -t arguments and for every power of two between those two values.
    Default: 33554432

-y, --elem_counts
    Specify an explicit list of element counts used for the operation, for example -y 4,8,32,131072.
    Default: 1 to 33554432, with all powers of two in between

-c, --check
    Check for correctness. The possible values are off (disable checking), last (check the last iteration), and all (check all iterations).
    Default: last

-h, --help
    Show all of the supported options.
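
For illustration, the following invocation combines several of these options (the values are examples, not recommendations): it runs 32 timed iterations after 8 warmup iterations, restricts the run to an explicit list of element counts, and checks every iteration for correctness:

mpirun -n 2 -ppn 2 ccl_latency -y 4,8,32,131072 -i 32 -w 8 -c all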

Examples#

GPU#

The following example shows how to run ccl_latency and ccl_bw with GPU buffers:

mpirun -n 2 -ppn <P> ccl_latency -b gpu -i 20 -f 1024 -t 67108864 -e 1
mpirun -n 2 -ppn <P> ccl_bw -b gpu -i 20 -f 1024 -t 67108864 -e 1

The above commands:

  • Run the ccl_latency or the ccl_bw benchmark

  • Contain a total of two processes (this benchmark only supports two processes)

  • Use P processes per node, where P can be 1 if running on two different nodes or 2 when running on a single node

  • Use GPU buffers

  • Use 20 iterations

  • Use element counts from 1024 to 67108864 (ccl_latency or ccl_bw run with all powers of two in that range)

  • Use an in-order SYCL queue (-e 1)
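
With int32 elements, this element-count range corresponds to message sizes from 4096 bytes (1024 * 4) up to 268435456 bytes (67108864 * 4).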

CPU#

The following example shows how to run ccl_latency and ccl_bw with CPU buffers:

mpirun -n 2 -ppn <P> ccl_latency -b cpu -i 20 -f 1024 -t 67108864
mpirun -n 2 -ppn <P> ccl_bw -b cpu -i 20 -f 1024 -t 67108864

The preceding commands:

  • Runs the ccl_latency/ccl_bw benchmark

  • Contains a total of two processes (this benchmark only supports two processes)

  • Contains P processes per node, where P can be 1 if running on two different nodes or 2 when running on a single node

  • Uses CPU buffers

  • Uses 20 iterations

  • Uses element counts from 1024 to 67108864 (ccl_latency or ccl_bw run with all powers of two in that range)