oneCCL Benchmark Point-to-Point User Guide#
oneCCL provides two benchmarks for measuring the performance of point-to-point operations:
ccl_latency measures latency
ccl_bw measures bandwidth
The benchmarks are distributed with the oneCCL package. You can find them in the examples directory within the oneCCL installation path.
Build oneCCL Benchmark#
CPU-Only#
To build the benchmark, complete the following steps:
Configure your environment. Source the installed oneCCL library for CPU-only support:
source <oneCCL install dir>/ccl/latest/env/vars.sh --ccl-configuration=cpu
Navigate to
<oneCCL install dir>/share/doc/ccl/examples
Build the benchmark using the following command:
cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$(pwd)/build/_install && cmake --build build -j $(nproc) -t install
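Put together, the CPU-only build amounts to the following shell sequence (a sketch that simply consolidates the steps above; substitute your actual installation path for <oneCCL install dir>):
# CPU-only build of the oneCCL examples, including ccl_latency and ccl_bw
source <oneCCL install dir>/ccl/latest/env/vars.sh --ccl-configuration=cpu
cd <oneCCL install dir>/share/doc/ccl/examples
cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$(pwd)/build/_install
cmake --build build -j $(nproc) -t install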
CPU-GPU#
Configure your environment.
Source the Intel(R) oneAPI DPC++/C++ Compiler environment. See the compiler documentation for instructions.
Source the installed oneCCL library for CPU-GPU support:
source <oneCCL install dir>/ccl/latest/env/vars.sh --ccl-configuration=cpu_gpu_dpcpp
Navigate to
<oneCCL install dir>/share/doc/ccl/examples
Build the SYCL benchmark with the following command:
cmake -S . -B build -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCOMPUTE_BACKEND=dpcpp -DCMAKE_INSTALL_PREFIX=$(pwd)/build/_install && cmake --build build -j $(nproc) -t install
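Put together, the CPU-GPU build amounts to the following shell sequence (a sketch; the compiler vars.sh path assumes a standard oneAPI directory layout, and <oneAPI install dir> and <oneCCL install dir> stand in for your actual installation paths):
# CPU-GPU (SYCL) build of the oneCCL examples, including ccl_latency and ccl_bw
source <oneAPI install dir>/compiler/latest/env/vars.sh   # assumed location of the compiler environment script
source <oneCCL install dir>/ccl/latest/env/vars.sh --ccl-configuration=cpu_gpu_dpcpp
cd <oneCCL install dir>/share/doc/ccl/examples
cmake -S . -B build -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCOMPUTE_BACKEND=dpcpp -DCMAKE_INSTALL_PREFIX=$(pwd)/build/_install
cmake --build build -j $(nproc) -t install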
Run oneCCL Point-to-Point Benchmark#
To run ccl_latency or ccl_bw, use the following commands:
mpirun -n 2 -ppn <P> ccl_latency [arguments]
mpirun -n 2 -ppn <P> ccl_bw [arguments]
Where:
2 is the total number of processes (this benchmark runs only with two processes).
<P> is the number of processes within a node. For this benchmark, <P> can only be 1 or 2. When <P> is 1, there is a single process on each node and the benchmark runs across two nodes. When <P> is 2, both processes run on the same node.
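For example, with Intel MPI the two placements could look like the following sketch (the hostnames node1 and node2 are placeholders, the -hosts option assumes the Hydra launcher, and the benchmark runs with its default arguments):
# Both processes on the same node (<P> = 2)
mpirun -n 2 -ppn 2 ccl_latency
# One process on each of two nodes (<P> = 1); node1 and node2 are example hostnames
mpirun -n 2 -ppn 1 -hosts node1,node2 ccl_latency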
The benchmarks report:
#bytes - the message size in bytes
elem_count - the message size in number of elements
#repetitions - the number of iterations
Latency (for ccl_latency) - the time to send data from the sender process to the receiving process, where both the send and receive operations are blocking, reported in μsec
Bandwidth (for ccl_bw) - the achieved bandwidth, reported in Mbytes/second
Both benchmarks always transfer elements of type int32.
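Since each int32 element occupies 4 bytes, #bytes corresponds to elem_count × 4; for example, an elem_count of 1024 corresponds to a message size of 4096 bytes.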
Point-to-Point Benchmark Arguments#
ccl_latency and ccl_bw accept the same arguments. To see the benchmark arguments, use the --help argument.
ccl_latency and ccl_bw accept the following arguments:
| Option | Description | Default Value |
|---|---|---|
| -b | Specify the backend. The possible values are cpu and gpu. For a CPU-only build, the backend is automatically set to cpu, and only the cpu option is available. For a CPU-GPU build, both cpu and gpu options are available, and gpu is the default value. The cpu value allocates buffers in host (CPU) memory, while the gpu value allocates buffers in device (GPU) memory. | gpu for a CPU-GPU build, cpu for a CPU-only build |
| -i | Specify the number of iterations executed by the benchmark. | |
| | Specify the number of warmup iterations, that is, the number of iterations the benchmark runs before it starts timing the iterations specified with the iterations option. | |
| | Specify whether to use persistent collectives. | |
| -e | Specify the type of SYCL queue. The possible values are 0 (out_order) or 1 (in_order). | |
| | Specify the synchronization model, that is, whether the point-to-point operation is 1 (blocking) or 0 (non_blocking). Currently, the benchmark only supports blocking point-to-point operations. | 1 (blocking) |
| -f | Specify the minimum number of elements used for the operation. | |
| -t | Specify the maximum number of elements used for the operation. | |
| | Specify a list with the numbers of elements used for the operation. | Powers of two between the minimum and maximum element counts |
| | Check for correctness. | |
| --help | Show all of the supported options. | |
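To see the exact flag names and default values for your build, you can also run either binary with the --help argument (a sketch; the path to the installed ccl_latency binary depends on your install prefix, and depending on your MPI setup you may need to launch it through mpirun as shown above):
# Print all supported options for the point-to-point benchmarks
<path to ccl_latency> --help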
Examples#
GPU#
The following examples show how to run ccl_latency and ccl_bw with GPU buffers:
mpirun -n 2 -ppn <P> ccl_latency -b gpu -i 20 -f 1024 -t 67108864 -e 1
mpirun -n 2 -ppn <P> ccl_bw -b gpu -i 20 -f 1024 -t 67108864 -e 1
The above commands:
Run the ccl_latency or the ccl_bw benchmark
Use a total of two processes (this benchmark only supports two processes)
Use <P> processes per node, where <P> can be 1 if running on two different nodes or 2 when running on a single node
Use GPU buffers
Use 20 iterations
Use element counts from 1024 to 67108864 (ccl_latency or ccl_bw will run with the powers of two in that range)
Use an in-order SYCL queue
CPU#
mpirun -n 2 -ppn <P> ccl_latency -b cpu -i 20 -f 1024 -t 67108864
mpirun -n 2 -ppn <P> ccl_bw -b cpu -i 20 -f 1024 -t 67108864
The preceding commands:
Run the ccl_latency or the ccl_bw benchmark
Use a total of two processes (this benchmark only supports two processes)
Use <P> processes per node, where <P> can be 1 if running on two different nodes or 2 when running on a single node
Use CPU buffers
Use 20 iterations
Use element counts from 1024 to 67108864 (ccl_latency or ccl_bw will run with the powers of two in that range)