oneDNN Graph provides low precision support with the int8 (signed or unsigned 8-bit integer), bf16, and f16 data types. oneDNN Graph API expects the computation graph to be converted to a low precision representation, with the data’s precision and quantization parameters specified explicitly. The oneDNN Graph API implementation strictly respects the numeric precision of the computation.
oneDNN Graph API provides the following two operations to support quantized models with static quantization: Dequantize and Quantize.
The Dequantize operation takes an integer tensor with its associated scale and zero point and returns an f32 tensor. The Quantize operation takes an f32 tensor, a scale, and a zero point, and returns an integer tensor. The scale and zero point are one-dimensional tensors that contain a single value for per-tensor quantization or multiple values for per-channel quantization. The integer tensor can be represented in the unsigned int8 or signed int8 data type. The zero point is zero for a symmetric quantization scheme and can be non-zero for an asymmetric quantization scheme.
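The arithmetic behind these two operations can be sketched in plain Python. This is a minimal illustration, not the oneDNN Graph implementation: it assumes the conventional affine mapping (quantize as round(x / scale) + zero point with saturation, dequantize as (q - zero point) * scale), uses Python's round-half-to-even rounding, and treats each row of a nested list as one channel. The helper names quantize, dequantize, param_for, and quantize_tensor are hypothetical.

```python
# Representable ranges of the two int8 flavours.
INT_RANGES = {"u8": (0, 255), "s8": (-128, 127)}


def quantize(x, scale, zp, dtype="u8"):
    """Map one f32 value to an integer, saturating to the target range."""
    lo, hi = INT_RANGES[dtype]
    q = round(x / scale) + zp  # assumed affine formula, round-half-to-even
    return max(lo, min(hi, q))


def dequantize(q, scale, zp):
    """Map one integer value back to a float."""
    return (q - zp) * scale


def param_for(params, channel):
    """One entry means per-tensor; several entries mean per-channel."""
    return params[0] if len(params) == 1 else params[channel]


def quantize_tensor(rows, scales, zps, dtype="u8"):
    """Quantize a 2-D list whose rows are channels."""
    return [
        [quantize(x, param_for(scales, c), param_for(zps, c), dtype) for x in row]
        for c, row in enumerate(rows)
    ]


data = [[0.0, 0.5, 1.0], [-1.0, 0.0, 2.0]]

# Per-tensor, asymmetric (non-zero zero point), unsigned int8.
per_tensor = quantize_tensor(data, scales=[0.5], zps=[2], dtype="u8")

# Per-channel, symmetric (zero point 0), signed int8.
per_channel = quantize_tensor(data, scales=[0.5, 0.25], zps=[0, 0], dtype="s8")
```

With these inputs the per-tensor case uses one scale and zero point for every element, while the per-channel case picks a different scale for each row. Values outside the representable range saturate, so quantize(1000.0, 0.5, 0, "s8") clamps to 127.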
Dequantize and Quantize operations should be inserted manually into the graph as part of the quantization process before the graph is passed to oneDNN Graph. oneDNN Graph honors the data type passed via the logical tensor and faithfully follows the numeric semantics. For example, if the graph has a Quantize operation followed by a Dequantize operation with exactly the same scale and zero point, the oneDNN Graph implementation should not eliminate them, since doing so would implicitly change the numeric precision.
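To see why a Quantize operation followed by a Dequantize operation is not a no-op, consider the following sketch. It again assumes the conventional affine formulas and signed int8 saturation rather than quoting the library's code, and the function names and sample values are made up for illustration.

```python
def quantize_s8(x, scale, zp):
    """Assumed affine quantization to signed int8 with saturation."""
    q = round(x / scale) + zp
    return max(-128, min(127, q))


def dequantize(q, scale, zp):
    """Assumed affine dequantization back to a float."""
    return (q - zp) * scale


scale, zp = 0.05, 0
original = [0.1234, -0.731, 3.0, 100.0]

# Quantize then Dequantize with the same scale and zero point.
round_trip = [dequantize(quantize_s8(x, scale, zp), scale, zp) for x in original]

# Element-wise difference introduced by the pair of operations.
errors = [abs(a - b) for a, b in zip(original, round_trip)]
```

Here 0.1234 snaps to the nearest multiple of the scale (about 0.1), and 100.0 saturates at 127 * 0.05 (about 6.35). Removing the pair would feed the original values downstream, so the graph would compute with different numbers than the ones it specifies.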
The oneDNN Graph partitioning API may return a partition containing a Convolution operation together with the Dequantize and Quantize operations around it. The caller does not need to recognize the subgraph pattern explicitly and convert it to a fused operation. Depending on the capability of the oneDNN Graph implementation, the partition may include more or fewer operations.
oneDNN Graph provides the TypeCast operation, which converts an f32 tensor to bf16 or f16, and vice versa. It is used to support the automatic mixed precision mechanism in popular deep learning frameworks. All oneDNN Graph operations support the bf16 and f16 data types.
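The effect of such a conversion on a single value can be sketched with the Python standard library. This is an illustration under stated assumptions, not the TypeCast implementation: the bf16 path assumes round-to-nearest-even on the upper 16 bits of the f32 pattern and does not handle NaN, and the f16 path relies on the IEEE half-precision format code "e" of the struct module. All function names are hypothetical.

```python
import struct


def f32_to_bf16_bits(x):
    """Round an f32 value to bf16 and return the 16-bit pattern (NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even: add 0x7FFF plus the lowest bit that will be kept.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def bf16_bits_to_f32(b):
    """Widen a bf16 bit pattern back to f32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def via_bf16(x):
    """f32 -> bf16 -> f32 round trip."""
    return bf16_bits_to_f32(f32_to_bf16_bits(x))


def via_f16(x):
    """f32 -> f16 -> f32 round trip using the IEEE half-precision format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


x = 0.1
as_bf16 = via_bf16(x)  # 8 bits of exponent, 7 bits of mantissa
as_f16 = via_f16(x)    # 5 bits of exponent, 10 bits of mantissa
```

Both round trips return a value close to, but not equal to, 0.1, and the two formats disagree with each other: bf16 keeps the f32 exponent range with fewer mantissa bits, while f16 keeps more mantissa bits over a narrower range.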
A TypeCast operation performing down conversion should be inserted explicitly to indicate the use of low numeric precision. The oneDNN Graph implementation fully honors the API-specified numeric precision and performs the computation only in the API-specified or a higher numeric precision.