Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN)  0.95.0 Performance library for Deep Learning

## Why use a different convolution algorithm?

Executing a convolution with the Winograd algorithm often gives a significant performance boost compared with the direct algorithm. Details about the algorithm can be found in "Fast Algorithms for Convolutional Neural Networks" by A. Lavin and S. Gray.
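To give a feel for where the savings come from, the sketch below implements the one-dimensional F(2, 3) case from the Lavin and Gray paper: two outputs of a 3-tap filter computed with 4 multiplications instead of 6. It is an illustration only and says nothing about how Intel(R) MKL-DNN implements Winograd internally; the function names `direct_f23` and `winograd_f23` are hypothetical.

```cpp
#include <array>

// Direct 3-tap filter over 4 inputs: 2 outputs, 6 multiplications.
std::array<double, 2> direct_f23(const std::array<double, 4> &d,
                                 const std::array<double, 3> &g) {
    return {{d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
             d[1] * g[0] + d[2] * g[1] + d[3] * g[2]}};
}

// Winograd F(2, 3): the same 2 outputs with 4 multiplications in the
// elementwise stage (the divisions belong to the filter transform).
std::array<double, 2> winograd_f23(const std::array<double, 4> &d,
                                   const std::array<double, 3> &g) {
    // Filter transform; it depends only on g, so it can be computed once
    // and reused for every input tile.
    const double u0 = g[0];
    const double u1 = (g[0] + g[1] + g[2]) / 2;
    const double u2 = (g[0] - g[1] + g[2]) / 2;
    const double u3 = g[2];

    // Input transform: additions and subtractions only.
    const double v0 = d[0] - d[2];
    const double v1 = d[1] + d[2];
    const double v2 = d[2] - d[1];
    const double v3 = d[1] - d[3];

    // Elementwise stage: the 4 multiplications.
    const double m0 = v0 * u0;
    const double m1 = v1 * u1;
    const double m2 = v2 * u2;
    const double m3 = v3 * u3;

    // Output transform: additions and subtractions only.
    return {{m0 + m1 + m2, m1 - m2 - m3}};
}
```

Both functions return the same values for the same inputs; the Winograd variant trades multiplications for extra additions in the transforms, which is why the benefit depends on the convolution sizes.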

Intel(R) MKL-DNN supports the Winograd algorithm for convolutions that meet the following conditions:

• 2D convolution (i.e. spatial depth `d=1`).
• Kernel sizes `kh=3, kw=3`.
• Strides `sh=sw=1`.
• Inference: based on the convolution sizes, Intel(R) MKL-DNN chooses between two tile sizes, F(2x2, 3x3) or F(4x4, 3x3) (refer to the Winograd paper for more information on tile sizes).
• Training: uses the F(4x4, 3x3) tile size.
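The shape conditions above can be sketched as a small predicate, together with the multiplication counts that motivate the two tile sizes. This is a minimal sketch, not library code: `ConvShape`, `meets_winograd_conditions`, and the two counting helpers are hypothetical names, the predicate checks only the conditions listed here (the library may apply further checks), and the counts cover only the elementwise stage for one tile and one input channel, ignoring the cost of the transforms.

```cpp
// Hypothetical description of a convolution problem. The field names follow
// the notation used in the list above; this is not a library type.
struct ConvShape {
    int d;       // spatial depth (1 for a 2D convolution)
    int kh, kw;  // kernel height and width
    int sh, sw;  // strides along height and width
};

// True when the shape satisfies the conditions listed above.
bool meets_winograd_conditions(const ConvShape &c) {
    return c.d == 1 && c.kh == 3 && c.kw == 3 && c.sh == 1 && c.sw == 1;
}

// Multiplications per output tile for Winograd F(m x m, r x r):
// one per element of the (m + r - 1) x (m + r - 1) transformed tile.
int winograd_tile_multiplies(int m, int r) {
    const int t = m + r - 1;
    return t * t;
}

// Multiplications for the same m x m output tile computed directly:
// r * r per output element.
int direct_tile_multiplies(int m, int r) {
    return m * m * r * r;
}
```

Under these assumptions, F(2x2, 3x3) needs 16 multiplications per tile instead of 36 (a 2.25x reduction), and F(4x4, 3x3) needs 36 instead of 144 (a 4x reduction).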

To create a Winograd convolution, create a convolution descriptor (step 6 in the SimpleNet example) with the Winograd algorithm. The remaining steps for creating the convolution are exactly the same as shown in the example.

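    auto conv1_desc = convolution_forward::desc(
        prop_kind::forward_inference, algorithm::convolution_winograd,
        conv1_src_md, conv1_weights_md, conv1_bias_md, conv1_dst_md,
        /* remaining arguments (strides, padding) as in the SimpleNet example */);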

## Auto dispatching of convolution algorithm

Instead of choosing a convolution algorithm for every convolution in a topology, you can ask Intel(R) MKL-DNN to make the choice.

Creating a convolution with `convolution_auto` lets Intel(R) MKL-DNN dispatch the algorithm it expects to perform best.

    auto conv1_desc = convolution_forward::desc(
        prop_kind::forward_inference, algorithm::convolution_auto,
        conv1_src_md, conv1_weights_md, conv1_bias_md, conv1_dst_md,
        /* remaining arguments (strides, padding) as in the SimpleNet example */);

Intel(R) MKL-DNN chooses the algorithm that is expected to give the best performance based on:

• convolution dimensions
• number of logical processors available (for auto-dispatching to work as intended, use the same thread affinity settings when creating the convolution as when executing it)

The relationship between convolution sizes and the best-performing algorithm is determined empirically from performance observations.

### Example using benchdnn

The following examples use benchdnn to illustrate the performance benefits of using `convolution_auto`.

On a 2-socket Intel Xeon 8180 system with 28 cores per socket and Hyper-Threading off:

    OMP_NUM_THREADS=56 KMP_AFFINITY=granularity=fine,compact numactl -l tests/benchdnn/benchdnn --mode=p --conv -v5 --alg=auto --dir=BWD_WB mb112ic64ih300oc64oh300kh3ph1n"ssd_300_voc0712:conv1_2"
    mkldnn implementation: jit_wino_4x3:avx512_core
    ...
    ...
    perf,ssd_300_voc0712:conv1_2,--dir=BWD_WB --alg=auto mb112ic64ih300oc64oh300kh3ph1nssd_300_voc0712:conv1_2,739.879,0,61.332,12063.5,62.503,11837.5

In the test case above, `convolution_auto` chooses Winograd convolution (using a heuristic based on the convolution sizes and the number of threads) because it is faster than direct convolution in this case. For comparison, forcing the direct algorithm on the same problem with `--alg=direct` gives:

    OMP_NUM_THREADS=56 KMP_AFFINITY=granularity=fine,compact numactl -l tests/benchdnn/benchdnn --mode=p --conv -v5 --alg=direct --dir=BWD_WB mb112ic64ih300oc64oh300kh3ph1n"ssd_300_voc0712:conv1_2"
    mkldnn implementation: jit:avx512_common
    ...
    mkldnn_verbose,exec,convolution,jit:avx512_common,backward_weights,fsrc:nchw fwei:gOhwi16o fbia:x fdst:nChw16c,alg:convolution_direct,mb112_g1ic64oc64_ih300oh300kh3sh1dh0ph1_iw300ow300kw3sw1dw0pw1,176.10
    ...
    perf,ssd_300_voc0712:conv1_2,--dir=BWD_WB mb112ic64ih300oc64oh300kh3ph1nssd_300_voc0712:conv1_2,739.879,0,175.422,4217.7,180.315,4103.26
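To compare two such runs programmatically, the `perf` lines can be parsed and their times divided. The sketch below assumes that the last four comma-separated fields of a `perf` line are the minimum time (ms), the rate at the minimum time, the average time (ms), and the rate at the average time. That reading is an assumption about the benchdnn output template, although it is consistent with the numbers above (739.879 / 0.175422 s is about 4217.7). `PerfSample`, `parse_perf_line`, and `speedup` are hypothetical names, not part of benchdnn.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// The four trailing numeric fields of a benchdnn "perf" line (assumed layout).
struct PerfSample {
    double min_ms;
    double min_rate;
    double avg_ms;
    double avg_rate;
};

// Splits a perf line on commas and reads the last four fields as numbers.
// Throws std::invalid_argument if the line is not a perf line or a field
// is not numeric.
PerfSample parse_perf_line(const std::string &line) {
    std::vector<std::string> fields;
    std::stringstream stream(line);
    std::string field;
    while (std::getline(stream, field, ','))
        fields.push_back(field);

    if (fields.size() < 5 || fields.front() != "perf")
        throw std::invalid_argument("not a benchdnn perf line");

    const std::size_t n = fields.size();
    return PerfSample{std::stod(fields[n - 4]), std::stod(fields[n - 3]),
                      std::stod(fields[n - 2]), std::stod(fields[n - 1])};
}

// Ratio of minimum times: values above 1 mean the candidate is faster.
double speedup(const PerfSample &baseline, const PerfSample &candidate) {
    return baseline.min_ms / candidate.min_ms;
}
```

For the two runs above, with the direct run as the baseline and the `--alg=auto` run as the candidate, this gives a speedup of roughly 2.86 on minimum time (175.422 ms versus 61.332 ms).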

In the following example, `convolution_auto` chooses direct convolution because the Winograd implementation is slower than direct in this case.

    mkldnn implementation: jit:avx512_common
    ...
    mkldnn_verbose,exec,convolution,jit:avx512_common,backward_weights,fsrc:nChw16c fwei:gOIhw16i16o fbia:x fdst:nChw16c,alg:convolution_direct,mb112_g1ic64oc64_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,1.13