The convolution primitive computes forward, backward, or weight update for a batched convolution operation on 1D, 2D, or 3D spatial data with bias.
The convolution operation is defined by the following formulas. We show formulas only for 2D spatial data; they generalize straightforwardly to higher and lower dimensions. Variable names follow the standard Naming Conventions.
Let \(src\), \(weights\) and \(dst\) be \(N \times IC \times IH \times IW\), \(OC \times IC \times KH \times KW\), and \(N \times OC \times OH \times OW\) tensors respectively. Let \(bias\) be a 1D tensor with \(OC\) elements.
The following formulas show how Intel MKL-DNN computes convolutions. They are broken down into several types to simplify the exposition, but in reality the convolution types can be combined.
To further simplify the formulas, we assume that \(src(n, ic, ih, iw) = 0\) if \(ih < 0\), or \(ih \geq IH\), or \(iw < 0\), or \(iw \geq IW\).
\[dst(n, oc, oh, ow) = bias(oc) + \\ + \sum_{ic=0}^{IC-1}\sum_{kh=0}^{KH-1}\sum_{kw=0}^{KW-1} src(n, ic, oh \cdot SH + kh - ph_0, ow \cdot SW + kw - pw_0) \cdot weights(oc, ic, kh, kw).\]
Here, \(SH\) and \(SW\) are the convolution strides along the height and width, and \(ph_0\) and \(pw_0\) are the padding values applied at the beginning of the height and width dimensions (top and left), respectively.
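As a concrete illustration, the formula above can be transcribed into a plain (unoptimized) reference loop nest. This is only a sketch: the function name, the dense NCHW / OIHW / NCHW layouts, and all parameter names are our own choices for illustration.

```cpp
// Reference forward 2D convolution with bias: a direct transcription of the
// formula above, with dense NCHW source/destination and OIHW weights buffers.
void conv2d_ref(const float *src, const float *weights, const float *bias,
        float *dst, int N, int IC, int IH, int IW, int OC, int OH, int OW,
        int KH, int KW, int SH, int SW, int ph_0, int pw_0) {
    for (int n = 0; n < N; ++n)
    for (int oc = 0; oc < OC; ++oc)
    for (int oh = 0; oh < OH; ++oh)
    for (int ow = 0; ow < OW; ++ow) {
        float d = bias[oc];
        for (int ic = 0; ic < IC; ++ic)
        for (int kh = 0; kh < KH; ++kh)
        for (int kw = 0; kw < KW; ++kw) {
            const int ih = oh * SH + kh - ph_0;
            const int iw = ow * SW + kw - pw_0;
            // src(n, ic, ih, iw) is treated as zero outside the spatial bounds.
            if (ih < 0 || ih >= IH || iw < 0 || iw >= IW) continue;
            d += src[((n * IC + ic) * IH + ih) * IW + iw]
               * weights[((oc * IC + ic) * KH + kh) * KW + kw];
        }
        dst[((n * OC + oc) * OH + oh) * OW + ow] = d;
    }
}
```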
In the API, Intel MKL-DNN adds a separate groups dimension to memory objects representing weights tensors and represents weights as \(G \times OC_G \times IC_G \times KH \times KW \) 5D tensors for 2D convolutions with groups.
\[ dst(n, g \cdot OC_G + oc_g, oh, ow) = bias(g \cdot OC_G + oc_g) + \\ + \sum_{ic_g=0}^{IC_G-1}\sum_{kh=0}^{KH-1}\sum_{kw=0}^{KW-1} src(n, g \cdot IC_G + ic_g, oh + kh - ph_0, ow + kw - pw_0) \cdot weights(g, oc_g, ic_g, kh, kw), \]
where \(OC_G = OC / G\), \(IC_G = IC / G\), \(g \in [0, G)\), \(oc_g \in [0, OC_G)\), and \(ic_g \in [0, IC_G)\).
The case when \(OC_G = IC_G = 1\) is also known as a depthwise convolution.
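To illustrate the separate groups dimension mentioned above, a grouped weights memory descriptor might be created as follows. This is only a sketch; the sizes are made up and must satisfy the usual divisibility requirement.

```cpp
#include "mkldnn.hpp"
using namespace mkldnn;

// Hypothetical sizes for a grouped 2D convolution; OC and IC must be divisible by G.
const memory::dim G = 2, OC = 32, IC = 16, KH = 3, KW = 3;

// The weights get an explicit leading groups dimension: G x OC_G x IC_G x KH x KW.
memory::desc grouped_weights_md({G, OC / G, IC / G, KH, KW},
        memory::data_type::f32, memory::format_tag::goihw);
```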
\[ dst(n, oc, oh, ow) = bias(oc) + \\ + \sum_{ic=0}^{IC-1}\sum_{kh=0}^{KH-1}\sum_{kw=0}^{KW-1} src(n, ic, oh + kh \cdot dh - ph_0, ow + kw \cdot dw - pw_0) \cdot weights(oc, ic, kh, kw). \]
Here, \(dh\) and \(dw\) are the dilation factors along the height and width, respectively.
Deconvolutions (also called fractionally strided convolutions or transposed convolutions) work by swapping the forward and backward passes of a convolution. In other words, the weights define a convolution, but whether it is a direct or a transposed convolution is determined by how the forward and backward passes are computed.
There is no difference between the mkldnn_forward_training and mkldnn_forward_inference propagation kinds.
The backward propagation computes \(diff\_src\) based on \(diff\_dst\) and \(weights\).
The weights update computes \(diff\_weights\) and \(diff\_bias\) based on \(diff\_dst\) and \(src\).
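As a sketch of how the weights update pass could be set up with the C++ API: the memory descriptors (`src_md`, `diff_weights_md`, `diff_bias_md`, `diff_dst_md`), the `strides`/`padding_l`/`padding_r` dims, the engine `eng`, and the forward hint `conv_fwd_pd` are all assumed to be defined elsewhere.

```cpp
#include "mkldnn.hpp"
using namespace mkldnn;

// Weights update (backward-by-weights) descriptor: consumes src and diff_dst,
// produces diff_weights and diff_bias.
auto bwd_weights_d = convolution_backward_weights::desc(
        algorithm::convolution_direct,
        src_md, diff_weights_md, diff_bias_md, diff_dst_md,
        strides, padding_l, padding_r);

// The forward primitive descriptor is passed as a hint so that the library
// picks an implementation compatible with the forward pass.
auto bwd_weights_pd = convolution_backward_weights::primitive_desc(
        bwd_weights_d, eng, conv_fwd_pd);
```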
The convolution primitive supports the following combinations of data types for the source, destination, weights, and bias memory objects:
Propagation | Source | Weights | Destination | Bias |
---|---|---|---|---|
forward / backward | f32 | f32 | f32 | f32 |
forward | f16 | f16 | f16 | f16 |
forward | u8, s8 | s8 | u8, s8, s32, f32 | u8, s8, s32, f32 |
forward | bf16 | bf16 | f32, bf16 | f32, bf16 |
backward | f32, bf16 | f32, bf16 | bf16 | f32, bf16 |
Like other CNN primitives, the convolution primitive expects the following tensors:
Spatial | Source / Destination | Weights |
---|---|---|
1D | \(N \times C \times W\) | \([G \times ] OC \times IC \times KW\) |
2D | \(N \times C \times H \times W\) | \([G \times ] OC \times IC \times KH \times KW\) |
3D | \(N \times C \times D \times H \times W\) | \([G \times ] OC \times IC \times KD \times KH \times KW\) |
The physical format of the data and weights memory objects is critical for convolution primitive performance. In the Intel MKL-DNN programming model, convolution is one of the few primitives that support the placeholder memory format tag mkldnn::memory::format_tag::any (shortened to `any` from now on) and can choose the format of the data and weights memory objects based on the primitive parameters. When using `any`, it is necessary to first create a convolution primitive descriptor and then query it for the actual data and weights memory formats.
While convolution primitives can be created with memory formats specified explicitly, the performance is likely to be suboptimal.
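For example, a minimal sketch of that flow using the C++ API might look as follows; the problem sizes, strides, and paddings below are made up for illustration.

```cpp
#include "mkldnn.hpp"
using namespace mkldnn;

engine eng(engine::kind::cpu, 0);

// Hypothetical problem sizes.
const memory::dim N = 32, IC = 16, OC = 32, IH = 13, IW = 13, OH = 13, OW = 13,
        KH = 3, KW = 3;

// Use format_tag::any so the primitive is free to pick the best layouts.
memory::desc src_md({N, IC, IH, IW}, memory::data_type::f32, memory::format_tag::any);
memory::desc weights_md({OC, IC, KH, KW}, memory::data_type::f32, memory::format_tag::any);
memory::desc dst_md({N, OC, OH, OW}, memory::data_type::f32, memory::format_tag::any);

auto conv_d = convolution_forward::desc(prop_kind::forward_inference,
        algorithm::convolution_direct, src_md, weights_md, dst_md,
        /* strides */ {1, 1}, /* padding_l */ {1, 1}, /* padding_r */ {1, 1});
auto conv_pd = convolution_forward::primitive_desc(conv_d, eng);

// Query the layouts the implementation actually chose; user data should be
// reordered to these formats before execution if the layouts differ.
memory::desc opt_src_md = conv_pd.src_desc();
memory::desc opt_weights_md = conv_pd.weights_desc();
memory::desc opt_dst_md = conv_pd.dst_desc();
```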
The table below shows the combinations of plain memory formats for which the convolution primitive is optimized.
Spatial | Convolution Type | Data / Weights logical tensor | Implementation optimized for memory formats |
---|---|---|---|
1D, 2D, 3D | | any | optimized |
1D | f32, bf16 | NCW / OIW, GOIW | mkldnn_ncw (mkldnn_abc) / mkldnn_oiw (mkldnn_abc), mkldnn_goiw (mkldnn_abcd) |
1D | int8 | NCW / OIW | mkldnn_nwc (mkldnn_acb) / mkldnn_wio (mkldnn_cba) |
2D | f32, bf16 | NCHW / OIHW, GOIHW | mkldnn_nchw (mkldnn_abcd) / mkldnn_oihw (mkldnn_abcd), mkldnn_goihw (mkldnn_abcde) |
2D | int8 | NCHW / OIHW, GOIHW | mkldnn_nhwc (mkldnn_acdb) / mkldnn_hwio (mkldnn_cdba), mkldnn_hwigo (mkldnn_decab) |
3D | f32, bf16 | NCDHW / OIDHW, GOIDHW | mkldnn_ncdhw (mkldnn_abcde) / mkldnn_oidhw (mkldnn_abcde), mkldnn_goidhw (mkldnn_abcdef) |
3D | int8 | NCDHW / OIDHW | mkldnn_ndhwc (mkldnn_acdeb) / mkldnn_dhwio (mkldnn_cdeba) |
Post-ops and attributes enable you to modify the behavior of the convolution primitive by applying the output scale to the result of the primitive and by chaining certain operations after the primitive. The following attributes and post-ops are supported:
Propagation | Type | Operation | Restrictions | Description |
---|---|---|---|---|
forward | attribute | Output scale | int8 convolutions only | Scales the result of convolution by given scale factor(s) |
forward | post-op | eltwise | | Applies an Eltwise operation to the result (currently only mkldnn_eltwise_relu algorithm is supported) |
forward | post-op | sum | | Adds the operation result to the destination tensor instead of overwriting it |
Note that an eltwise operation fused as a post-op does not produce the additional `workspace` output that is required to compute backward propagation correctly. Hence, in this particular case one should use separate convolution and eltwise primitives for training.

The following post-ops chaining is supported by the library:
Type of convolutions | Post-ops sequence supported |
---|---|
f32 and bf16 convolution | eltwise, sum, sum -> eltwise |
int8 convolution | eltwise, sum, sum -> eltwise, eltwise -> sum |
The attributes and post-ops take effect in the following sequence: the output scale attribute is applied first, followed by the post-ops in the order they were attached.
Attributes and post-ops are applied in single-precision floating point. The conversion to the actual destination data type happens just before the result is stored.
Consider the following pseudo code:
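The fragment below is only a sketch written against the C++ `primitive_attr` and `post_ops` API; the convolution descriptor `conv_d`, the engine `eng`, and the float scales `alpha`, `beta`, and `gamma` are assumed to be defined elsewhere.

```cpp
#include "mkldnn.hpp"
using namespace mkldnn;

primitive_attr attr;

// Output scale attribute: multiply the convolution result by alpha.
attr.set_output_scales(/* mask = */ 0, {alpha});

post_ops ops;
// First post-op: sum, i.e. accumulate beta * dst into the scaled result.
ops.append_sum(beta);
// Second post-op: eltwise tanh scaled by gamma (tanh is used here to match
// the formula below).
ops.append_eltwise(gamma, algorithm::eltwise_tanh, /* alpha = */ 0.f, /* beta = */ 0.f);
attr.set_post_ops(ops);

// The attributes are passed when creating the primitive descriptor.
auto conv_pd = convolution_forward::primitive_desc(conv_d, attr, eng);
```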
This would lead to the following:
\[ dst(\overline{x}) = \gamma \cdot \tanh \left( \alpha \cdot conv(src, weights) + \beta \cdot dst(\overline{x}) \right) \]
Now consider the following pseudo code:
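Again, this is only a sketch with the same assumptions as above; `eta` stands for the ReLU negative slope used in the formula below.

```cpp
#include "mkldnn.hpp"
using namespace mkldnn;

primitive_attr attr;

// Output scale attribute: multiply the convolution result by alpha.
attr.set_output_scales(/* mask = */ 0, {alpha});

post_ops ops;
// First post-op: eltwise ReLU with negative slope eta, scaled by gamma.
ops.append_eltwise(gamma, algorithm::eltwise_relu, /* alpha = */ eta, /* beta = */ 0.f);
// Second post-op: sum, i.e. add beta * dst to the result of the eltwise.
ops.append_sum(beta);
attr.set_post_ops(ops);

auto conv_pd = convolution_forward::primitive_desc(conv_d, attr, eng);
```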
That would lead to the following:
\[ dst(\overline{x}) = \beta \cdot dst(\overline{x}) + \gamma \cdot ReLU \left( \alpha \cdot conv(src, weights), \eta \right) \]
Intel MKL-DNN implements convolution primitives using several different algorithms: the direct algorithm, the Winograd algorithm, and a GEMM-based algorithm.
Intel MKL-DNN supports the direct convolution algorithm on all supported platforms, provided that the data and weights memory formats are defined by the convolution primitive (the user passes `any`). If this constraint is not met, the implementation will silently fall back to an explicit GEMM algorithm.
Intel MKL-DNN supports the Winograd convolution algorithm on systems with Intel AVX-512 support and above, provided that the data and weights memory formats are defined by the convolution primitive (the user passes `any` as the data format). If this constraint is not met, the implementation will silently fall back to the direct algorithm.
The Winograd convolution algorithm implementation additionally chooses the tile size based on the problem shape and propagation kind: for forward_inference, Intel MKL-DNN supports \(F(2 \times 2, 3 \times 3)\) or \(F(4 \times 4, 3 \times 3)\).

The (potential) performance boost achieved from using the Winograd algorithm should be weighed against its side effects.
Create a Winograd convolution by simply creating a convolution descriptor (step 6 in the simple network example) specifying the Winograd algorithm. The rest of the steps are exactly the same.
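A minimal sketch of that descriptor creation, reusing the memory descriptors, strides, paddings, and engine assumed in the earlier sketches:

```cpp
#include "mkldnn.hpp"
using namespace mkldnn;

// Same as a regular convolution descriptor, except the Winograd algorithm is
// requested explicitly instead of the direct one.
auto conv_wino_d = convolution_forward::desc(prop_kind::forward_inference,
        algorithm::convolution_winograd, src_md, weights_md, dst_md,
        strides, padding_l, padding_r);
auto conv_wino_pd = convolution_forward::primitive_desc(conv_wino_d, eng);
```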
Intel MKL-DNN supports the mkldnn::algorithm::convolution_auto algorithm, which instructs the library to automatically select the best algorithm based on heuristics that take into account tensor shapes and the number of logical processors available. (For automatic selection to work as intended, use the same thread affinity settings when creating the convolution as when executing the convolution.)