Data Types

DNNL functionality supports a number of numerical data types. IEEE single precision floating point (fp32) is considered the gold standard in deep learning applications and is supported by all library functions. The purpose of supporting low precision data types is to improve the performance of compute-intensive operations, such as convolutions, inner product, and recurrent neural network cells, in comparison with fp32.

Data type   Description
f32         IEEE single precision floating point
bf16        non-IEEE 16-bit floating point
f16         IEEE half precision floating point
s8/u8       signed/unsigned 8-bit integer
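
As a minimal illustration of how these types are selected in code (a sketch using the dnnl.hpp C++ API; the CPU engine and tensor shape below are assumed for the example, not taken from this page), the data type is chosen when a memory descriptor is created:

    #include <dnnl.hpp>

    int main() {
        using namespace dnnl;

        // Engine and tensor shape used for all descriptors below (assumed values).
        engine eng(engine::kind::cpu, 0);
        memory::dims dims = {1, 3, 224, 224}; // NCHW

        // The same logical tensor described with different data types.
        memory::desc md_f32(dims, memory::data_type::f32, memory::format_tag::nchw);
        memory::desc md_bf16(dims, memory::data_type::bf16, memory::format_tag::nchw);
        memory::desc md_f16(dims, memory::data_type::f16, memory::format_tag::nchw);
        memory::desc md_s8(dims, memory::data_type::s8, memory::format_tag::nchw);
        memory::desc md_u8(dims, memory::data_type::u8, memory::format_tag::nchw);

        // Backing memory objects are created the same way regardless of data type.
        memory m_f32(md_f32, eng);
        return 0;
    }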

Inference and Training

DNNL supports training and inference with the following data types:

Usage mode   CPU                GPU
Inference    f32, bf16, s8/u8   f32, f16
Training     f32, bf16          f32
Note
Using lower precision arithmetic may require changes in the deep learning model implementation.
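
For example, int8 inference typically requires the model to carry explicit quantization steps. The sketch below (assuming the dnnl.hpp C++ API, a CPU engine, and activations known to lie roughly in [-64, 64], so 127/64 is used as the quantization scale) converts f32 data to s8 with a reorder primitive:

    #include <dnnl.hpp>
    #include <vector>

    int main() {
        using namespace dnnl;

        engine eng(engine::kind::cpu, 0);
        stream s(eng);

        // Assumed 2D activation tensor.
        memory::dims dims = {8, 16};
        memory::desc f32_md(dims, memory::data_type::f32, memory::format_tag::nc);
        memory::desc s8_md(dims, memory::data_type::s8, memory::format_tag::nc);

        // In a real model, f32_mem would hold the tensor to be quantized.
        memory f32_mem(f32_md, eng);
        memory s8_mem(s8_md, eng);

        // Quantization scale chosen outside the library (assumed value:
        // maps [-64, 64] onto the s8 range).
        primitive_attr attr;
        attr.set_output_scales(/*mask=*/0, {127.f / 64});

        // Reorder converts f32 data to s8 while applying the scale.
        reorder::primitive_desc rpd(eng, f32_md, eng, s8_md, attr);
        reorder(rpd).execute(s, f32_mem, s8_mem);
        s.wait();
        return 0;
    }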

See the dedicated developer guide topics for details on the corresponding data types.

Individual primitives may have additional limitations with respect to data type support based on the precision requirements. The list of data types supported by each primitive is included in the corresponding sections of the developer guide.
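
One way for an application to discover such limitations at runtime is to attempt to create a primitive descriptor and treat a thrown dnnl::error as "not supported for this data type combination". A sketch, assuming an inner product with bf16 source and weights, f32 destination, and made-up shapes:

    #include <dnnl.hpp>
    #include <iostream>

    int main() {
        using namespace dnnl;
        engine eng(engine::kind::cpu, 0);

        try {
            // Assumed shapes: batch of 2, 16 inputs, 8 output channels.
            memory::desc src_md({2, 16}, memory::data_type::bf16, memory::format_tag::nc);
            memory::desc wei_md({8, 16}, memory::data_type::bf16, memory::format_tag::oi);
            memory::desc dst_md({2, 8}, memory::data_type::f32, memory::format_tag::nc);

            inner_product_forward::desc d(
                    prop_kind::forward_inference, src_md, wei_md, dst_md);
            // Throws if this data type combination is not implemented on this engine.
            inner_product_forward::primitive_desc pd(d, eng);
            std::cout << "bf16 inner product is supported on this engine\n";
        } catch (const dnnl::error &e) {
            std::cout << "bf16 inner product is not supported: " << e.what() << "\n";
        }
        return 0;
    }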

Hardware Limitations

While all platforms supported by DNNL provide hardware acceleration for fp32 arithmetic, that is not the case for the other data types. Since performance is the main purpose of low precision data type support, DNNL implements this functionality only on platforms that have hardware acceleration for these data types. The table below summarizes the current support matrix:

Data type   CPU                             GPU
f32         any                             any
bf16        Intel DL Boost with bfloat16    not supported
f16         not supported                   any
s8, u8      Intel AVX512, Intel DL Boost    not supported
Note
For validation purposes, DNNL provides functional bfloat16 support on processors with Intel AVX512 Byte and Word Instructions (AVX512BW). On platforms without hardware acceleration for bfloat16, the performance of bfloat16 primitives is 3-4x lower than that of the same operations on fp32 data.
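
Hardware acceleration for bfloat16 can also be detected outside the library. A Linux-only sketch (assuming /proc/cpuinfo is available; it advertises the avx512_bf16 flag on processors with Intel DL Boost bfloat16 support):

    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    // Returns true if the CPU advertises AVX512_BF16 instructions,
    // i.e. bfloat16 primitives are expected to run with hardware acceleration.
    static bool cpu_has_avx512_bf16() {
        std::ifstream cpuinfo("/proc/cpuinfo");
        std::stringstream ss;
        ss << cpuinfo.rdbuf();
        return ss.str().find("avx512_bf16") != std::string::npos;
    }

    int main() {
        std::cout << (cpu_has_avx512_bf16()
                        ? "bf16 is hardware accelerated\n"
                        : "bf16 would run through the slower functional path\n");
        return 0;
    }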