.. index:: pair: page; Primitive Attributes: Quantization
.. _doxid-dev_guide_attributes_quantization:
Primitive Attributes: Quantization
==================================
:target:`doxid-dev_guide_attributes_quantization_1dgaq_intro`
Introduction
~~~~~~~~~~~~
Some primitives in the library support input and output tensors with the INT8 data type, either signed or unsigned. The primary goal is to support reduced-precision inference on compatible hardware.
Related materials:
* `Lower Numerical Precision Deep Learning Inference and Training `__
* An example with annotations: :ref:`Int8 Inference `
Quantization Model
~~~~~~~~~~~~~~~~~~
The primary quantization model that the library assumes is the following:
.. math::
x_{f32}[:] = scale_{f32} \cdot (x_{int8}[:] - 0_{x\_int8})
where :math:`scale_{f32}` is a scaling factor that is known in advance, :math:`0_{x\_int8}` is the INT8 value that represents the real zero, and :math:`[:]` denotes elementwise application of the formula to the arrays. The process of obtaining the scale factors is typically called calibration. Perhaps counter-intuitively, the library cannot compute any of the scale factors dynamically at run-time. Hence, the model is sometimes called a static quantization model. The main rationale for supporting only static quantization out-of-the-box is higher performance. To use dynamic quantization:
#. Compute the result in higher precision, like :ref:`dnnl::memory::data_type::s32 `.
#. Find the required characteristics, like min and max values, and derive the scale factor.
#. Re-quantize to the lower precision data type.
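The three steps above can be sketched in plain C++. This is a minimal illustration rather than library code: ``requantized_t`` and ``requantize_dynamic`` are hypothetical names, the s32 values are assumed to be already computed, and the scale is derived from the maximum absolute value so that the zero point stays at 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type (not a oneDNN API): re-quantized data plus the
// scale that was derived for it, so that x_f32 ~= scale * x_s8.
struct requantized_t {
    float scale;
    std::vector<int8_t> data;
};

// acc:       results computed in higher precision (step 1), here s32
// acc_scale: factor that maps the s32 values to real (f32) values
requantized_t requantize_dynamic(
        const std::vector<int32_t> &acc, float acc_scale) {
    assert(acc_scale > 0.f);

    // Step 2: find the largest magnitude and derive the scale so that this
    // magnitude maps to 127 (symmetric range, zero point fixed at 0).
    float max_abs = 0.f;
    for (int32_t v : acc)
        max_abs = std::max(
                max_abs, std::fabs(acc_scale * static_cast<float>(v)));
    const float scale = max_abs > 0.f ? max_abs / 127.f : 1.f;

    // Step 3: re-quantize to s8 with rounding and saturation.
    requantized_t out {scale, std::vector<int8_t>(acc.size())};
    for (std::size_t i = 0; i < acc.size(); ++i) {
        const float q = std::nearbyint(
                acc_scale * static_cast<float>(acc[i]) / scale);
        out.data[i] = static_cast<int8_t>(
                std::min(127.f, std::max(-128.f, q)));
    }
    return out;
}
```

The extra pass over the data in step 2 is the cost that static quantization avoids.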
It is also worth mentioning that the library supports fixed zero position. For most of the primitives, real zero value is mapped to zero for quantized values; that is, :math:`0_{x\_int8} = 0`. For example, this is the only model that :ref:`Convolution ` and :ref:`Inner Product ` currently support. The :ref:`RNN ` primitives have limited support of shifted zero (for details, refer to the corresponding section in :ref:`RNN `).
For the rest of this guide, we will assume that :math:`0_{x\_int8} = 0`.
.. warning::
Depending on the architecture, the behavior of int8 computations might slightly vary. For more details, refer to :ref:`Nuances of int8 Computations `.
This guide does not cover how the appropriate scaling factor can be found. Refer to the materials in the :ref:`Introduction `.
Example: Convolution Quantization Workflow
------------------------------------------
Consider a convolution without bias. The tensors are represented as:
* :math:`\src_{f32}[:] = scale_{\src} \cdot \src_{int8}[:]`
* :math:`\weights_{f32}[:] = scale_{\weights} \cdot \weights_{int8}[:]`
* :math:`\dst_{f32}[:] = scale_{\dst} \cdot \dst_{int8}[:]`
Here the :math:`\src_{f32}, \weights_{f32}, \dst_{f32}` tensors are never computed; all the work happens with INT8 tensors. As mentioned above, all the scaling factors :math:`scale_{\src}, scale_{\weights}, scale_{\dst}` are assumed to be known in advance.
So the task is to compute the :math:`\dst_{int8}` tensor.
Mathematically, the computations are straightforward:
.. math::
\dst_{int8}[:] = downconvert\_f32\_to\_int8( output\_scale \cdot conv_{s32}(\src_{int8}, \weights_{int8}) ),
where
* :math:`output\_scale := \frac{scale_{\src} \cdot scale_{\weights}}{scale_{\dst}}`;
* :math:`conv_{s32}` is a regular convolution that takes source and weights with the INT8 data type and computes the result in the INT32 data type (INT32 is chosen to avoid overflows during the computations);
* :math:`downconvert\_f32\_to\_int8()` converts an ``f32`` value to ``s8`` with potential saturation if the value is out of the range of the INT8 data type.
Note that in order to perform the operation, one does not need to know the exact scaling factors for all the tensors; it is enough to know only the :math:`output\_scale`. The library utilizes this fact: a user needs to provide only this one extra parameter to the convolution primitive (see the :ref:`Output Scaling Attribute ` section below).
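The workflow can be illustrated for a single output point, where the convolution reduces to a dot product. The sketch below is plain C++ and does not call the library; ``conv_point_int8`` is a hypothetical name, and the helper ``downconvert_f32_to_int8`` only mirrors the function named in the formula.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts f32 to s8 with saturation (illustrative, not a oneDNN API).
inline int8_t downconvert_f32_to_int8(float x) {
    const float r = std::nearbyint(x); // uses the current rounding mode
    return static_cast<int8_t>(std::min(127.f, std::max(-128.f, r)));
}

// One output point of an int8 convolution: a dot product accumulated in s32,
// scaled once by output_scale, then converted to s8.
inline int8_t conv_point_int8(const std::vector<int8_t> &src,
        const std::vector<int8_t> &wei, float scale_src, float scale_wei,
        float scale_dst) {
    assert(src.size() == wei.size());
    int32_t acc = 0; // s32 accumulator avoids overflow of the s8 products
    for (std::size_t i = 0; i < src.size(); ++i)
        acc += static_cast<int32_t>(src[i]) * static_cast<int32_t>(wei[i]);

    // The only extra parameter the computation needs.
    const float output_scale = scale_src * scale_wei / scale_dst;
    return downconvert_f32_to_int8(output_scale * static_cast<float>(acc));
}
```

Note that the three individual scales enter the computation only through their combination ``output_scale``.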
Per-Channel Scaling
-------------------
Some of the primitives have limited support of multiple scales for a quantized tensor. The most popular use case is the :ref:`Convolution ` primitive that supports per-output-channel scaling factors for the weights, meaning that the actual convolution computations would need to scale different output channels differently. This is possible without significant performance loss because the per-output-channel re-quantization is only required at the very end of the computations. It seems impossible to implement the same trick for the input channels, since that would require re-quantization for every input data point.
Let :math:`\alpha` denote scales:
* :math:`\src_{f32}(n, ic, ih, iw) = \alpha_{\src} \cdot \src_{int8}(n, ic, ih, iw)`
* :math:`\weights_{f32}(oc, ic, kh, kw) = \alpha_{\weights}(oc) \cdot \weights_{int8}(oc, ic, kh, kw)`
* :math:`\dst_{f32}(n, oc, oh, ow) = \alpha_{\dst} \cdot \dst_{int8}(n, oc, oh, ow)`
Note that now the weights' scaling factor depends on the :math:`oc`.
To compute the :math:`\dst_{int8}` we need to perform the following:
.. math::
\dst_{int8}(n, oc, oh, ow) = downconvert\_f32\_to\_int8( output\_scale(oc) \cdot conv_{s32}(\src_{int8}, \weights_{int8})|_{(n, oc, oh, ow)} ),
where
* :math:`output\_scale(oc) := \frac{\alpha_{\src} \cdot \alpha_{\weights}(oc)}{\alpha_{\dst}}`.
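The expression for :math:`output\_scale(oc)` follows from substituting the tensor representations into an f32 convolution. A short derivation, ignoring rounding and saturation and writing :math:`ih, iw` for the input positions that correspond to :math:`(oh, kh)` and :math:`(ow, kw)`:

.. math::

    \dst_{f32}(n, oc, oh, ow) = \sum_{ic, kh, kw} \alpha_{\src} \cdot \src_{int8}(n, ic, ih, iw) \cdot \alpha_{\weights}(oc) \cdot \weights_{int8}(oc, ic, kh, kw) = \alpha_{\src} \cdot \alpha_{\weights}(oc) \cdot conv_{s32}(\src_{int8}, \weights_{int8})|_{(n, oc, oh, ow)}

Since :math:`\alpha_{\weights}(oc)` does not depend on the summation indices, it moves outside the sum, and dividing both sides by :math:`\alpha_{\dst}` gives the formula for :math:`\dst_{int8}`. A scale that depended on :math:`ic` would stay inside the sum, which is why the same factoring does not apply to per-input-channel scales.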
It is worth mentioning that the user is responsible for preparing quantized weights accordingly. oneDNN provides reorders that can perform per-channel scaling:
.. math::
\weights_{int8}(oc, ic, kh, kw) = downconvert\_f32\_to\_int8( output\_scale(oc) \cdot \weights_{f32}(oc, ic, kh, kw) ),
where
* :math:`output\_scale(oc) := \frac{1}{\alpha_{\weights}(oc)}`.
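For reference, the arithmetic that such a reorder performs can be sketched in plain C++. This sketch assumes dense weights laid out with the output channel as the outermost dimension; ``quantize_weights_per_oc`` is a hypothetical name, and no layout transformation is shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Quantizes f32 weights stored as [OC][IC * KH * KW] (dense, row-major) with
// one scale per output channel:
//   wei_s8(oc, ...) = saturate(round(wei_f32(oc, ...) / alpha_wei[oc]))
// Illustrative helper only; not a oneDNN API.
std::vector<int8_t> quantize_weights_per_oc(
        const std::vector<float> &wei_f32,
        const std::vector<float> &alpha_wei) {
    const std::size_t oc_count = alpha_wei.size();
    assert(oc_count > 0 && wei_f32.size() % oc_count == 0);
    const std::size_t per_oc = wei_f32.size() / oc_count;

    std::vector<int8_t> wei_s8(wei_f32.size());
    for (std::size_t oc = 0; oc < oc_count; ++oc) {
        const float inv_scale = 1.f / alpha_wei[oc]; // output_scale(oc)
        for (std::size_t i = 0; i < per_oc; ++i) {
            const float q
                    = std::nearbyint(inv_scale * wei_f32[oc * per_oc + i]);
            wei_s8[oc * per_oc + i] = static_cast<int8_t>(
                    std::min(127.f, std::max(-128.f, q)));
        }
    }
    return wei_s8;
}
```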
API
~~~
The library API for INT8 support was designed for the model described above. However, it does not require users to follow this model exactly. As long as users can fit their model into the given functionality, everything should work fine. With this in mind, we tried to design a minimal and simple, yet sufficiently powerful, quantization API.
The most common data types for data tensors during INT8 inference are :ref:`dnnl::memory::data_type::s8 ` and :ref:`dnnl::memory::data_type::u8 `. The scaling factors related to tensors are not attached in any way to the oneDNN memory objects and must be maintained by users.
The library essentially extends the ability of the primitives to scale the output before storing the result to memory with the destination data type. That is exactly the minimum needed to support INT8 inference: in the equations above, only :math:`output\_scale` is non-standard.
The scaling happens in the single precision floating point data type (:ref:`dnnl::memory::data_type::f32 `). Before storing, the result is downconverted to the destination data type with saturation if required. The rounding happens according to the current HW setting (for instance, on CPU according to the MXCSR register).
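The scale, round, and saturate sequence can be mimicked in plain C++ as follows. This is only a sketch of the described behavior, not the library's implementation: ``scale_and_downconvert`` is a hypothetical name, it is intended for 8-bit destination types, and ``std::nearbyint`` is used because it rounds according to the current floating-point rounding mode.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Scales in f32, rounds with the current rounding mode, then saturates to
// the range of the destination type T (int8_t or uint8_t).
// Hypothetical helper for illustration; not part of the oneDNN API.
template <typename T>
T scale_and_downconvert(float value, float scale) {
    const float lo = static_cast<float>(std::numeric_limits<T>::lowest());
    const float hi = static_cast<float>(std::numeric_limits<T>::max());
    const float r = std::nearbyint(scale * value);
    assert(!std::isnan(r));
    return static_cast<T>(r < lo ? lo : (r > hi ? hi : r));
}
```

With the default round-to-nearest-even mode, ``scale_and_downconvert<int8_t>(2.5f, 1.f)`` gives 2 and ``scale_and_downconvert<uint8_t>(-5.f, 1.f)`` saturates to 0.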
:target:`doxid-dev_guide_attributes_quantization_1dev_guide_attributes_quantization_output_scale`
Output Scaling Attribute
------------------------
The library uses the :ref:`Primitive Attributes ` API for setting the scaling factors for most of the primitives. The supported attributes can be found in the documentation for each primitive. The unsupported cases are handled according to the :ref:`attributes error handling section `.
API:
* C: dnnl_primitive_attr_set_output_scales
* C++: dnnl::primitive_attr::set_output_scales
Primitives support output scales only when the data type of computation is an integer.
The parameters (C++ API for simplicity):
.. ref-code-block:: cpp
void dnnl::primitive_attr::set_output_scales(
int mask,
const std::vector<float> &scales
);
In the simplest case, when there is only one common scale the attribute changes the op behavior from
.. math::
\dst[:] = Op(...)
to
.. math::
\dst[:] = scale \cdot Op(...).
To support scales per one or several dimensions, users must set the appropriate mask.
Say the destination is a :math:`D_0 \times ... \times D_{n-1}` tensor and we want to have output scales per :math:`d_i` dimension (where :math:`0 \le d_i < n`).
Then the mask should be set to:
* :math:`mask = \sum \limits_{d_i} 2^{d_i}`,
and the number of scales should be:
* ``scales.size()`` = :math:`\prod\limits_{d_i}D_{d_i}`.
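The two formulas can be checked with a small helper. The sketch below is plain C++ and independent of the library; ``scales_layout_t`` and ``make_scales_layout`` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type (not a oneDNN API): the mask to pass and the
// number of scales that must accompany it.
struct scales_layout_t {
    int mask;
    std::int64_t count;
};

// dims:        tensor dimensions D_0, ..., D_{n-1}
// scaled_dims: indices d_i of the dimensions that get their own scales
scales_layout_t make_scales_layout(const std::vector<std::int64_t> &dims,
        const std::vector<int> &scaled_dims) {
    scales_layout_t r {0, 1};
    for (int d : scaled_dims) {
        assert(d >= 0 && static_cast<std::size_t>(d) < dims.size());
        assert((r.mask & (1 << d)) == 0); // each dimension listed once
        r.mask |= 1 << d; // mask = sum of 2^{d_i}
        r.count *= dims[static_cast<std::size_t>(d)]; // product of D_{d_i}
    }
    return r;
}
```

For a ``{G, OC/G, IC/G, KH, KW}`` weights tensor with ``G = 2`` and ``OC/G = 8``, scaling per dimensions 0 and 1 gives ``mask = 3`` and 16 scales. An empty list of dimensions gives ``mask = 0`` and a single common scale.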
Example 1: weights quantization with per-output-channel-and-group scaling
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
.. ref-code-block:: cpp
// weights dimensions
const int G, OC, IC, KH, KW;
// original f32 weights in user's format
:ref:`dnnl::memory::desc ` wei_user_f32_md(
{G, OC/G, IC/G, KH, KW}, // dims
:ref:`dnnl::memory::data_type::f32 `, // the data originally in f32
:ref:`dnnl::memory::format_tag::hwigo ` // the memory format a user uses
);
// the scaling factors for quantized weights
// A unique scale for each group and output-channel.
std::vector<float> wei_scales(G * OC/G) = {...};
// ...
// int8 convolution primitive descriptor (will create it in the next example)
:ref:`dnnl::convolution_forward::primitive_desc ` conv_pd(...);
// query the convolution weights memory descriptor
:ref:`dnnl::memory::desc ` wei_conv_s8_md = conv_pd.weights_desc();
// prepare the inverse of the scales (f32 = scale * int8 --> int8 = 1/scale * f32)
std::vector<float> inv_wei_scales(wei_scales.size());
for (size_t i = 0; i < wei_scales.size(); ++i)
inv_wei_scales[i] = 1.f / wei_scales[i];
// prepare the attributes for the reorder
:ref:`dnnl::primitive_attr ` attr;
const int mask = 0
| (1 << 0) // scale per G dimension, which is the dim #0
| (1 << 1); // scale per OC dimension, which is the dim #1
attr.set_output_scales(mask, inv_wei_scales);
// create reorder that would perform:
// wei_s8(g, oc, ic, kh, kw) <- 1/scale(g, oc) * wei_f32(g, oc, ic, kh, kw)
// including the data format transformation.
auto wei_reorder_pd = :ref:`dnnl::reorder::primitive_desc `(
wei_user_f32_md, engine, // source
wei_conv_s8_md, engine, // destination,
attr);
auto wei_reorder = :ref:`dnnl::reorder `(wei_reorder_pd);
// ...
Example 2: convolution with groups, with per-output-channel quantization
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
This example complements the previous one: the convolution primitive descriptor created here is the one that Example 1 queries for the weights memory descriptor. Let's say we want to have an INT8 convolution with per-output-channel scaling.
.. ref-code-block:: cpp
const float src_scale; // src_f32[:] = src_scale * src_s8[:]
const float dst_scale; // dst_f32[:] = dst_scale * dst_s8[:]
// the scaling factors for quantized weights (as declared above)
// A unique scale for each group and output-channel.
std::vector<float> wei_scales(G * OC/G) = {...};
// Src, weights, and dst memory descriptors for convolution,
// with memory format tag == any to allow a convolution implementation
// to choose the appropriate memory format
:ref:`dnnl::memory::desc ` src_conv_s8_any_md(
{BATCH, IC, IH, IW}, // dims
:ref:`dnnl::memory::data_type::s8 `, // the data originally in s8
:ref:`dnnl::memory::format_tag::any ` // let convolution choose
);
:ref:`dnnl::memory::desc ` wei_conv_s8_any_md(
{G, OC/G, IC/G, KH, KW}, // dims
:ref:`dnnl::memory::data_type::s8 `, // the data originally in s8
:ref:`dnnl::memory::format_tag::any ` // let convolution choose
);
:ref:`dnnl::memory::desc ` dst_conv_s8_any_md(...); // ditto
// prepare the attributes for the convolution
:ref:`dnnl::primitive_attr ` attr;
const int mask = 0
| (1 << 1); // scale per OC dimension, which is the dim #1 on dst tensor:
// (BATCH, OC, OH, OW)
// 0 1 2 3
std::vector<float> conv_output_scales(G * OC/G);
for (int g_oc = 0; g_oc < G * OC/G; ++g_oc)
conv_output_scales[g_oc] = src_scale * wei_scales[g_oc] / dst_scale;
attr.set_output_scales(mask, conv_output_scales);
// create a convolution primitive descriptor with the scaling factors
:ref:`dnnl::convolution_forward::primitive_desc ` conv_pd(
engine,
:ref:`dnnl::prop_kind::forward_inference `,
:ref:`dnnl::algorithm::convolution_direct `,
src_conv_s8_any_md, // what's important is that
wei_conv_s8_any_md, // we specified that we want
dst_conv_s8_any_md, // computations in s8
strides, padding_l, padding_r,
attr // the attributes contain the output scaling
);
// ...
Interplay of output scales with post-ops
++++++++++++++++++++++++++++++++++++++++
In general, the :ref:`post-ops ` are independent of the output scales. The output scales are applied to the result first; then the post-ops take effect.
For details, refer to the :ref:`Tanh -> Sum -> ScaleShift ` example.
That has an implication on the scaling factors passed to the library, however. Consider the following example of a convolution with :math:`\tanh` post-op:
.. math::
\dst_{s8}[:] = \frac{1}{scale_{\dst}} \cdot \tanh( scale_{\src} \cdot scale_{\weights} \cdot conv_{s32}(\src_{s8}, \weights_{s8}) )
As you can see:
* The convolution output scales are now :math:`conv\_output\_scale = scale_{\src} \cdot scale_{\weights}`, i.e. there is no division by :math:`scale_{\dst}`;
* And the post-ops scale for :math:`\tanh` is set to :math:`scale\_tanh\_post\_op = \frac{1}{scale_{\dst}}`.
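The order of operations can be illustrated for a single accumulator value. The sketch below is plain C++ that follows the formula above; it does not use the library's post-ops API, and ``conv_tanh_point`` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Applies to one s32 accumulator value the sequence from the formula:
// output scale -> tanh -> post-op scale -> saturating conversion to s8.
// Illustrative helper only; not a oneDNN API.
inline int8_t conv_tanh_point(
        int32_t acc, float scale_src, float scale_wei, float scale_dst) {
    assert(scale_dst > 0.f);
    // The output scale brings the accumulator back to real values, because
    // tanh is not linear and must see the f32 result. No 1/scale_dst here.
    const float conv_output_scale = scale_src * scale_wei;
    // Quantization of the destination moves to the post-op scale.
    const float scale_tanh_post_op = 1.f / scale_dst;

    const float t = std::tanh(conv_output_scale * static_cast<float>(acc));
    const float r = std::nearbyint(scale_tanh_post_op * t);
    return static_cast<int8_t>(std::min(127.f, std::max(-128.f, r)));
}
```

Folding :math:`\frac{1}{scale_{\dst}}` into the output scale would change the argument of :math:`\tanh` and therefore the result, which is why the two scales are kept separate.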