Reduction of LP buffers (for example, as a phase of ccl::allreduce) consists of three steps:
conversion from LP to FP32 format, reduction of the FP32 values, and conversion from FP32 back to LP format.
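The three steps can be sketched in scalar C++ for a BF16 sum. This is only an illustration, not oneCCL code: the helper names (bf16_to_fp32, fp32_to_bf16, reduce_sum_bf16) are hypothetical, the round-to-nearest-even narrowing is an assumption, and oneCCL itself uses vector instructions rather than a scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: widen a BF16 bit pattern to FP32.
// BF16 is the upper 16 bits of an FP32 value, so this step is exact.
inline float bf16_to_fp32(std::uint16_t v) {
    std::uint32_t bits = static_cast<std::uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Hypothetical helper: narrow FP32 to BF16 with round-to-nearest-even.
// Assumption: NaN inputs are not handled specially in this sketch.
inline std::uint16_t fp32_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>(bits >> 16);
}

// inout[i] = inout[i] + in[i], with the addition performed in FP32.
inline void reduce_sum_bf16(const std::vector<std::uint16_t>& in,
                            std::vector<std::uint16_t>& inout) {
    assert(in.size() == inout.size());
    for (std::size_t i = 0; i < inout.size(); ++i) {
        float a = bf16_to_fp32(in[i]);     // step 1: LP -> FP32
        float b = bf16_to_fp32(inout[i]);
        float sum = a + b;                 // step 2: reduce in FP32
        inout[i] = fp32_to_bf16(sum);      // step 3: FP32 -> LP
    }
}
```

For instance, reducing the BF16 patterns 0x3F80 (1.0) and 0x4000 (2.0) with this sketch leaves 0x4040 (3.0) in the output buffer.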
oneCCL utilizes CPU vector instructions for FP32 <-> LP conversion.
For BF16 <-> FP32 conversion, oneCCL provides AVX512F-based and AVX512_BF16-based implementations.
The AVX512F-based implementation requires GCC 4.9 or higher.
The AVX512_BF16-based implementation requires GCC 10.0 or higher and GNU binutils 2.33 or higher.
The AVX512_BF16-based implementation may lose less accuracy after repeated up-down conversions.
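To see why the choice of narrowing method can matter over repeated conversions, the sketch below compares two common FP32 -> BF16 strategies: truncation and round-to-nearest-even. The text above does not state which rounding either oneCCL implementation uses, so this is an assumption-laden illustration of the general effect, not a description of oneCCL internals. All function names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: widen a BF16 bit pattern to FP32 (exact).
inline float widen_bf16(std::uint16_t v) {
    std::uint32_t bits = static_cast<std::uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Narrow by dropping the low 16 bits (rounds toward zero).
inline std::uint16_t narrow_truncate(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<std::uint16_t>(bits >> 16);
}

// Narrow with round-to-nearest-even (NaN not handled in this sketch).
inline std::uint16_t narrow_rne(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>(bits >> 16);
}

// Add `step` to a BF16 accumulator `n` times. Every iteration widens
// to FP32, adds, and narrows again with the supplied strategy.
template <typename Narrow>
float accumulate_bf16(float start, float step, int n, Narrow narrow) {
    std::uint16_t acc = narrow(start);
    for (int i = 0; i < n; ++i) {
        acc = narrow(widen_bf16(acc) + step);
    }
    return widen_bf16(acc);
}
```

Calling accumulate_bf16(1.0f, 0.006f, 10, narrow_truncate) returns 1.0, because each increment is smaller than the BF16 spacing near 1.0 (2^-7) and is discarded every time. With narrow_rne the result is 1.078125, which is closer to the exact sum of 1.06.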
For FP16 <-> FP32 conversion, oneCCL provides two implementations.
Both implementations require GCC 4.9 or higher.
Refer to Low-precision datatypes for details about relevant environment variables.