The Intel(R) MKL-DNN library provides performance-critical primitives that accelerate the operations performed both when training deep learning models and when running trained models for inference.
During inference, the input data is fed into the trained model, which in turn produces a result (e.g., makes a prediction). This process is usually called forward propagation and corresponds to the mkldnn::prop_kind::forward_inference propagation kind in Intel MKL-DNN.
Training usually consists of the following steps:

1. Forward propagation. The propagation kind is mkldnn::prop_kind::forward_training here versus mkldnn::prop_kind::forward_inference mentioned above. The differences are covered in the corresponding section below.

2. Backward propagation with respect to data: computing diff_src from diff_dst (see Naming Conventions). This step corresponds to the mkldnn::prop_kind::backward_data propagation kind.

3. Backward propagation with respect to weights: computing diff_weights from diff_dst. This step makes sense only for the operations that have learnable parameters and corresponds to the mkldnn::prop_kind::backward_weights propagation kind.

Even though, mathematically, the forward propagation that happens during training and inference should be the same, in practice there are some differences, mostly due to performance considerations.
When executing inference, one may not care about the values in the intermediate buffers during model execution; hence one can reuse them as desired. However, during the forward propagation of training, it is beneficial to preserve input data, output data, and sometimes intermediate data that will later be used during backward propagation to compute gradients.
Take max pooling (Pooling with algorithm kind mkldnn::algorithm::pooling_max) as an example. The forward pass consists of computing the maximum values in the sliding window over the source tensor, so the output is just another tensor that contains these maximum values. However, to compute the source gradient during backward propagation, one needs to know the positions of these maximum values in the source tensor. Of course, it is possible to use the original source tensor to locate the maximums again, but this might be more expensive than preserving the positions of the maximum values in another tensor that is then used during backward propagation. Intel MKL-DNN uses the latter approach: when the propagation kind is set to mkldnn::prop_kind::forward_training, the max pooling primitive produces one extra output called Workspace, which is covered later in this document.
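To make this concrete, the following sketch (assuming the MKL-DNN v1.x C++ API; the tensor shapes and pooling parameters are invented for the example) creates the same max pooling operation with both forward propagation kinds and prints the workspace size each one reports:

    #include <iostream>
    #include "mkldnn.hpp"
    using namespace mkldnn;

    int main() {
        engine eng(engine::kind::cpu, 0);

        // Hypothetical 2x2 max pooling, stride 2, over an 8x8 spatial domain.
        memory::desc src_md({1, 16, 8, 8}, memory::data_type::f32, memory::format_tag::nchw);
        memory::desc dst_md({1, 16, 4, 4}, memory::data_type::f32, memory::format_tag::nchw);
        memory::dims strides = {2, 2}, kernel = {2, 2}, padding = {0, 0};

        for (auto kind : {prop_kind::forward_inference, prop_kind::forward_training}) {
            auto pool_d = pooling_forward::desc(kind, algorithm::pooling_max,
                    src_md, dst_md, strides, kernel, padding, padding);
            auto pool_pd = pooling_forward::primitive_desc(pool_d, eng);
            // forward_inference typically reports an empty workspace, while
            // forward_training reports extra memory that keeps the positions
            // of the maximums; its exact size and contents are implementation
            // details.
            std::cout << "workspace bytes: "
                      << pool_pd.workspace_desc().get_size() << "\n";
        }
        return 0;
    }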
As mentioned above, Intel MKL-DNN separates error back-propagation with respect to data and error back-propagation with respect to weights. The former corresponds to mkldnn::prop_kind::backward_data, while the latter corresponds to mkldnn::prop_kind::backward_weights (for example: Convolution).
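As an illustration, the sketch below (assuming the MKL-DNN v1.x C++ API; the convolution shapes and parameters are invented) creates descriptors for all three propagation kinds of a convolution. Note that the backward primitive descriptors take the forward primitive descriptor as a hint:

    #include "mkldnn.hpp"
    using namespace mkldnn;

    int main() {
        engine eng(engine::kind::cpu, 0);

        // Hypothetical 3x3 convolution: 16 -> 32 channels on a 13x13 image.
        // The same memory descriptors describe the gradient (diff_*) tensors,
        // since gradients have the same shapes as the original tensors.
        memory::desc src_md({2, 16, 13, 13}, memory::data_type::f32, memory::format_tag::nchw);
        memory::desc wei_md({32, 16, 3, 3}, memory::data_type::f32, memory::format_tag::oihw);
        memory::desc dst_md({2, 32, 11, 11}, memory::data_type::f32, memory::format_tag::nchw);
        memory::dims strides = {1, 1}, padding = {0, 0};

        // Forward propagation (training flavor).
        auto fwd_d = convolution_forward::desc(prop_kind::forward_training,
                algorithm::convolution_direct, src_md, wei_md, dst_md,
                strides, padding, padding);
        auto fwd_pd = convolution_forward::primitive_desc(fwd_d, eng);

        // Backward propagation with respect to data: diff_src from diff_dst.
        auto bwd_d_d = convolution_backward_data::desc(algorithm::convolution_direct,
                src_md, wei_md, dst_md, strides, padding, padding);
        auto bwd_d_pd = convolution_backward_data::primitive_desc(bwd_d_d, eng, fwd_pd);

        // Backward propagation with respect to weights: diff_weights from diff_dst.
        auto bwd_w_d = convolution_backward_weights::desc(algorithm::convolution_direct,
                src_md, wei_md, dst_md, strides, padding, padding);
        auto bwd_w_pd = convolution_backward_weights::primitive_desc(bwd_w_d, eng, fwd_pd);

        (void)bwd_d_pd; (void)bwd_w_pd;
        return 0;
    }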
The following list outlines the key specifics of running inference with Intel MKL-DNN:
Most of these techniques are shown in the following examples:
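As one concrete illustration of an inference-oriented flow, the sketch below (assuming the MKL-DNN v1.x C++ API; shapes and parameters are invented) creates a convolution with the mkldnn::prop_kind::forward_inference kind and placeholder (format_tag::any) memory descriptors, then reorders the user data into whatever format the primitive prefers:

    #include "mkldnn.hpp"
    using namespace mkldnn;

    int main() {
        engine eng(engine::kind::cpu, 0);
        stream s(eng);

        // User data comes in plain nchw/oihw formats (hypothetical shapes).
        memory::desc user_src_md({1, 16, 13, 13}, memory::data_type::f32, memory::format_tag::nchw);

        // Placeholder descriptors: let the primitive pick the best formats.
        memory::desc any_src_md({1, 16, 13, 13}, memory::data_type::f32, memory::format_tag::any);
        memory::desc any_wei_md({32, 16, 3, 3}, memory::data_type::f32, memory::format_tag::any);
        memory::desc any_dst_md({1, 32, 11, 11}, memory::data_type::f32, memory::format_tag::any);
        memory::dims strides = {1, 1}, padding = {0, 0};

        auto conv_d = convolution_forward::desc(prop_kind::forward_inference,
                algorithm::convolution_direct, any_src_md, any_wei_md, any_dst_md,
                strides, padding, padding);
        auto conv_pd = convolution_forward::primitive_desc(conv_d, eng);

        // Reorder user memory into the format the primitive chose, if they differ.
        memory user_src(user_src_md, eng), conv_src = user_src;
        if (conv_pd.src_desc() != user_src_md) {
            conv_src = memory(conv_pd.src_desc(), eng);
            reorder(user_src, conv_src).execute(s, user_src, conv_src);
        }
        s.wait();
        return 0;
    }

Letting the primitive choose its formats this way typically pays for the one-time reorder with faster execution of the primitive itself.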
The following list outlines the key specifics of running training with Intel MKL-DNN:

- Some primitives compute diff_src and diff_weights at the same time. To highlight this behavior, the propagation kind is set to mkldnn::prop_kind::backward.

- Preserve the data that backward propagation will need. (For example, to compute diff_src one must sometimes pass the diff_dst memory together with the original src memory, which was exactly the intermediate one.)

- The tensors required on backward propagation differ between primitives: for instance, some primitives need diff_dst and src, but to compute backward propagation for Softmax, one needs to pass diff_dst and dst. Check the documentation to see what is required for each particular primitive.

- Do not assume that the src memory format of a convolution on forward propagation will always match the src memory format of the corresponding convolution on backward-by-weights propagation (see the sketch after this list). Of course, the library tries to avoid unnecessary reorders, so in most cases the formats will be the same, but this will by no means always be true.

- Keep the diff_dst memory format the same as the original dst. A mismatch of the formats would be handled correctly, but it might lead to highly suboptimal performance.

- Do not make assumptions about the contents or layout of the workspace, because it might be different for different implementations.

Most of these techniques are shown in the following examples:
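To illustrate the memory-format point from the list above, the sketch below (again assuming the MKL-DNN v1.x C++ API; shapes and parameters are invented) creates the forward and backward-by-weights convolutions with placeholder formats and reorders src only when the two primitives actually chose different formats:

    #include "mkldnn.hpp"
    using namespace mkldnn;

    int main() {
        engine eng(engine::kind::cpu, 0);
        stream s(eng);

        // Placeholder (any) formats let each primitive pick its preferred layout.
        memory::desc src_md({2, 16, 13, 13}, memory::data_type::f32, memory::format_tag::any);
        memory::desc wei_md({32, 16, 3, 3}, memory::data_type::f32, memory::format_tag::any);
        memory::desc dst_md({2, 32, 11, 11}, memory::data_type::f32, memory::format_tag::any);
        memory::dims strides = {1, 1}, padding = {0, 0};

        auto fwd_d = convolution_forward::desc(prop_kind::forward_training,
                algorithm::convolution_direct, src_md, wei_md, dst_md,
                strides, padding, padding);
        auto fwd_pd = convolution_forward::primitive_desc(fwd_d, eng);

        auto bwd_w_d = convolution_backward_weights::desc(algorithm::convolution_direct,
                src_md, wei_md, dst_md, strides, padding, padding);
        auto bwd_w_pd = convolution_backward_weights::primitive_desc(bwd_w_d, eng, fwd_pd);

        memory fwd_src(fwd_pd.src_desc(), eng); // src as laid out for forward
        memory bwd_src = fwd_src;
        if (bwd_w_pd.src_desc() != fwd_pd.src_desc()) {
            // The backward-by-weights convolution chose a different src format:
            // reorder rather than assume the formats match.
            bwd_src = memory(bwd_w_pd.src_desc(), eng);
            reorder(fwd_src, bwd_src).execute(s, fwd_src, bwd_src);
            s.wait();
        }
        return 0;
    }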
Intel MKL-DNN uses the notion of workspace for some very particular cases. Specifically, the workspace is a tensor that the primitive fills in during forward propagation and that will then be used by the corresponding backward propagation operation. The example with max pooling was already discussed above.
The workflow for using a workspace is:

1. When creating a primitive for forward propagation, query the primitive descriptor for the workspace requirement using .workspace_desc().

2. If the returned memory descriptor is essentially empty (that is, it is equal to mkldnn::memory::desc(), or mkldnn::memory::desc::get_size() returns 0 for it), no extra action is required: the workspace is not required for this primitive in this configuration.

3. Otherwise, create a workspace memory based on the obtained memory descriptor and pass it to the execution function with the MKLDNN_ARG_WORKSPACE tag.

4. On backward propagation, pass the same workspace memory to the corresponding backward primitive, again with the MKLDNN_ARG_WORKSPACE tag.

Note that even when the workspace is not required, one can create a workspace memory of zero size and follow the logic where the workspace is indeed required. Such an approach may simplify the integration because the common code path is used.
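Putting these steps together, here is a minimal end-to-end sketch (assuming the MKL-DNN v1.x C++ API; the pooling shapes and parameters are invented for the example) of a max pooling forward/backward pair that follows this workflow:

    #include <unordered_map>
    #include "mkldnn.hpp"
    using namespace mkldnn;

    int main() {
        engine eng(engine::kind::cpu, 0);
        stream s(eng);

        memory::desc src_md({1, 16, 8, 8}, memory::data_type::f32, memory::format_tag::nchw);
        memory::desc dst_md({1, 16, 4, 4}, memory::data_type::f32, memory::format_tag::nchw);
        memory::dims strides = {2, 2}, kernel = {2, 2}, padding = {0, 0};

        // Step 1: create the forward primitive and query its workspace descriptor.
        auto fwd_d = pooling_forward::desc(prop_kind::forward_training,
                algorithm::pooling_max, src_md, dst_md,
                strides, kernel, padding, padding);
        auto fwd_pd = pooling_forward::primitive_desc(fwd_d, eng);

        // Steps 2-3: creating the workspace memory unconditionally is fine; if
        // the descriptor is empty, the memory simply has zero size, so the same
        // code path covers primitives that do and do not need a workspace.
        memory src(src_md, eng), dst(dst_md, eng), ws(fwd_pd.workspace_desc(), eng);

        pooling_forward(fwd_pd).execute(s,
                {{MKLDNN_ARG_SRC, src}, {MKLDNN_ARG_DST, dst},
                 {MKLDNN_ARG_WORKSPACE, ws}});

        // Step 4: the backward primitive consumes the same workspace.
        auto bwd_d = pooling_backward::desc(algorithm::pooling_max,
                src_md, dst_md, strides, kernel, padding, padding);
        auto bwd_pd = pooling_backward::primitive_desc(bwd_d, eng, fwd_pd);

        memory diff_src(src_md, eng), diff_dst(dst_md, eng);
        pooling_backward(bwd_pd).execute(s,
                {{MKLDNN_ARG_DIFF_DST, diff_dst}, {MKLDNN_ARG_DIFF_SRC, diff_src},
                 {MKLDNN_ARG_WORKSPACE, ws}});
        s.wait();
        return 0;
    }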