Introduction

To push higher performance during inference computations, recent work has focused on computing at a lower precision (that is, shrinking the size of data for activations and weights) to achieve higher throughput. Eight-bit computations (referred to as int8) offer improved performance over higher-precision types because they enable packing more data into a single instruction, at the cost of reduced (but acceptable) accuracy.

Int8 Workflow

There are different ways to use lower precision to perform inference. Please go through the Primitive Attributes: Quantization page to get the initial understanding of what kind of quantization model DNNL supports.

Quantization Process

To operate with int8 data types from a higher-precision format (for example, 32-bit floating point), data must first be quantized. The quantization process converts a given input into a lower-precision format. The precision and accuracy factors are determined by the scaling factors.

Scale

The scale is usually obtained from sampling the dataset of previous executions in the original format (for example, the activations and weights from training in fp32) and is formulated as:

\( R_{\{\alpha,w\}} = max(abs(T_{\{\alpha,w\}}))\)

where \(T_{\{\alpha,w\}} {}_{}\) is a tensor corresponding to either the weights \(w\) or the activations \(\alpha\).

The purpose is to establish the range of values used in the computation, where selecting a proper scaling factor prevents over- or underflows during computation of the lower-precision results.

Quantization Factor

The next step is to calculate the quantization factor for converting the values into the corresponding int8 range. This is also known as the scale or scaling factor applied to the original high-precision values and is calculated as:

\( Q_{\alpha} = \frac{255}{R_{\alpha}}\) is the quantization factor for activations with non-negative values.
\( Q_{w} = \frac{127}{R_{w}}\) is the quantization factor for weights.

The low-precision values, known as the quantized activation, weights, and bias values, are calculated as:

\(\alpha_{u8} = \lceil Q_{\alpha} \alpha_{f32} \rceil \in [0,255]\)
\(W_{s8} = \lceil Q_{w} W_{f32} \rceil \in [-127,127]\)
\(b_{s32} = \lceil Q_{\alpha} Q_{w} b_{f32} \rceil \in [-2^{31},2^{31}-1]\)

where the function \( \lceil \rceil \) rounds to the selected rounding mode (typically determined by the MXCSR register; the default value is RoundNearestEven).

When the destination value (for example, from a convolution) is stored as a signed 32-bit integer, the result is bound to the same quantization scaling factors:

\(X_{s32} = W_{s8} \times \alpha{u8} + b_{s32} \approx Q_{\alpha} Q_{\omega} X_{f32}\)
where \(X_{f32} = W_{f32} \times \alpha_{f32} + b_{f32}\)

where the approximated value is due to the rounded values.

Inversely, the dequantized value is calculated as:

\(X_{f32} \approx \frac{1}{Q_{\alpha} Q_{\omega}} X_{s32} \)

Quantization Example

To show how the int8 parameters are obtained, suppose we first start off with a set of arbitrary high-precision input and output values. These values come from sampling a previously executed training run and are in their original 32-bit floating point format as:

activations: \( T_{\alpha} = [15, 14, 15 ... 8, 11 ]\) where \( max(abs(T_{\alpha})) = 15\)
weights: \( T_{\omega} = [-5.1 , 6.8, ... -1.2, 9.8 ]\) where \( max(abs(T_{\omega})) = 9.8\)
bias: \( T_{\alpha} = [ 2.4, -5.2 ... -8 ]\) where \( max(abs(T_{\alpha})) = 8\)

The scaling factors are:

\( Q_{\alpha} = \frac{255}{R_{\alpha}} = \frac{255}{15} = 17 \)
\( Q_{w} = \frac{127}{R_{w}} = \frac{127}{9.8} = 12.96\)

Finally, the quantized input values for the 8-bit operation are calculated as:

\(\alpha_{u8} = \lceil Q_{\alpha} \alpha_{f32} \rceil\) \( = \lceil 17 \times [15, 14, ... 11 ] \rceil = [255, 238, ... 187] \)
\(W_{s8} = \lceil Q_{w} W_{f32} \rceil = \lceil 12.96 \times [-5.1 , 6.8, ... -1.2, 9.8 ] \rceil = [-66, 88, ... -15, 127] \)
\(b_{s32} = \lceil Q_{\alpha} Q_{w} b_{f32} \rceil = \lceil 17 \times 12.96 \times [ 2.4, -5.2 ... -8 ] \rceil = [528, -1145, ... -1762] \)

These arrays are the new inputs for the int8 net.

DNNL Support for Low-Precision int8 Primitives

DNNL supports low-precision computations for inference through the int8 primitives. int8 primitives are ordinary DNNL primitives that have their input and output parameters configured to 8-bit types. int8 primitives are optimized for high performance on the compatible hardware (see Data Types).

DNNL Attributes

DNNL primitive behavior may be extended for additional functionalities involving output data transformation. These additional features are configured via primitive attributes. The primitive attributes definition is an opaque structure for passing extra parameters to a primitive descriptor. These parameters include a scaling factor and fused post-ops. All operation primitives support the attributes structure; however, some configurations are not implemented and result in failed primitive creation.

The scaling factor, as previously described, is known prior to the inference operation where the values are calculated from a set of formulas. In DNNL, the scaling factor is applied to the output of a primitive. Moreover, to perform input transformations (for example, source, bias, and weights), DNNL performs quantizing and dequantizing of data for int8 through the Reorder Primitive.

DNNL has two formats for defining the output scaling factor. Depending on the configuration set by the scaling mask, either the output is scaled uniformly across all the dimensions (mask = 0) or a set of scaling values is applied to specific dimensions, as explained below:

A single floating point value shared across the tensor

Single-value scaling format
An array of floating point values each corresponding to a specific output channel

Multi-value scaling format
The mask parameter determines the dimension to which the scales array is applied, where the i^th-bit(s) of mask selects the dimension or dimensions d_i (where d is an n-dimensional output tensor with logical dimensions as [d0, d1, ..., dn-1]). For example:
The single-scale format always has mask = 0.
For a 5-dimensional tensor T[g0, o1,i2,h3,w4] where the numbering indicates the bit-index:
- A mask = 2 = 2¹ selects the output channel for scaling.
- A mask = 3 = 2⁰ | 2¹ selects the group and output channels.

Mask is always applied to the logical dimension; this is independent of the dimension format that the primitive might select. The dimensions in DNNL are defined as follows:

2D dimensional data the order of dimensions is always: (n, c)
4D dimensional data the order is always: (n, c, h, w)
5D dimensional weights the order is always: (g, oc, ic, kh, kw)

Fused post-ops allow chaining operations during the primitive computation. Note that the resulting output value from post-ops is always affected by the scaling factor. The supported operations are:

Accumulation where the primitive sums the resulting values from previously computed activations as:
- \(dst[ ] \leftarrow scale * dst[] + op(...)\), instead of
- \(dst[ ] \leftarrow op(...)\)
Element-wise (eltwise) operation with kind, alpha and beta parameters as:
- \(dst[ ] \leftarrow scale * eltwise\_op ( op(...) )\), instead of
- \(dst[ ] \leftarrow op(...)\)

The list of supported eltwise operations for int8 is currently limited to ReLU. For instance, post-ops may only configure a convolution with accumulation followed by eltwise (relu).

Example

CNN int8 inference example example walks through the steps of int8 inference.