Low-precision Datatypes

oneCCL supports collective operations on the low-precision (LP) datatypes bfloat16 and float16.

Reduction of LP buffers (for example, as a phase of ccl::allreduce) consists of three steps: conversion from LP to FP32 format, reduction of the FP32 values, and conversion from FP32 back to LP format.
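The convert-reduce-convert sequence can be illustrated with a minimal scalar sketch for a BF16 sum. This is not the oneCCL implementation, which uses CPU vector instructions. The names bf16_t, bf16_to_fp32, fp32_to_bf16 and reduce_sum_bf16 are hypothetical, and the round-to-nearest-even down-conversion is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical storage type: a bfloat16 value kept as its raw 16-bit pattern,
// which is the upper half of the corresponding IEEE-754 binary32 value.
using bf16_t = std::uint16_t;

// LP -> FP32: exact, the 16 bits become the upper half of the float.
inline float bf16_to_fp32(bf16_t v) {
    std::uint32_t bits = static_cast<std::uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// FP32 -> LP: lossy, drops the lower 16 mantissa bits.
// Assumption: round-to-nearest-even; NaN is mapped to a quiet NaN.
inline bf16_t fp32_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        // NaN: keep sign and exponent, force a quiet-NaN mantissa bit.
        return static_cast<bf16_t>((bits >> 16) | 0x0040u);
    }
    // Add 0x7FFF plus the lowest kept bit so that ties round to even.
    std::uint32_t bias = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<bf16_t>((bits + bias) >> 16);
}

// Element-wise sum of src into inout, accumulating in FP32.
void reduce_sum_bf16(const bf16_t* src, bf16_t* inout, std::size_t count) {
    assert(count == 0 || (src != nullptr && inout != nullptr));
    for (std::size_t i = 0; i < count; ++i) {
        float a = bf16_to_fp32(src[i]);    // step 1: LP -> FP32
        float b = bf16_to_fp32(inout[i]);
        float sum = a + b;                 // step 2: reduce in FP32
        inout[i] = fp32_to_bf16(sum);      // step 3: FP32 -> LP
    }
}
```

Because every reduction step ends with a lossy FP32 -> LP conversion, the rounding behaviour of the down-conversion affects how error accumulates across repeated up-down conversions.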

oneCCL utilizes CPU vector instructions for FP32 <-> LP conversion.

For BF16 <-> FP32 conversion, oneCCL provides AVX512F-based and AVX512_BF16-based implementations. The AVX512F-based implementation requires GCC 4.9 or higher. The AVX512_BF16-based implementation requires GCC 10.0 or higher and GNU binutils 2.33 or higher. The AVX512_BF16-based implementation may lose less accuracy after multiple up-down conversions.

For FP16 <-> FP32 conversion, oneCCL provides F16C-based and AVX512F-based implementations. Both implementations require GCC 4.9 or higher.

Refer to Low-precision datatypes for details about relevant environment variables.