Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN)  0.95.0
Performance library for Deep Learning
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
Winograd Convolution

Why use a different convolution algorithm?

Executing convolution using the Winograd algorithm often gives a significant performance boost compared with using the Direct algorithm. Details about the algorithm can be found in Fast Algorithms for Convolutional Neural Networks by A. Lavin and S. Gray.

Winograd in Intel(R) MKL-DNN

Intel(R) MKL-DNN supports the Winograd algorithm for convolutions with the following sizes:

Create a Winograd convolution by simply creating a convolution descriptor (step 6 in SimpleNet Example) with right algorithm. The rest of the steps for creating convolution are exactly the same as shown in the example.

auto conv1_desc = convolution_forward::desc(
prop_kind::forward_inference, algorithm::convolution_winograd,
conv1_src_md, conv1_weights_md, conv1_bias_md, conv1_dst_md,
conv1_strides, conv1_padding);

Auto dispatching of convolution algorithm

Instead of choosing a convolution algorithm for each and every convolution in a topology, a user could simply ask MKLDNN to make the choice.

Creating a convolution by using convolution_auto allows MKLDNN to dispatch the best algorithm.

auto conv1_desc = convolution_forward::desc(
prop_kind::forward_inference, algorithm::convolution_auto,
conv1_src_md, conv1_weights_md, conv1_bias_md, conv1_dst_md,
conv1_strides, conv1_padding);

MKLDNN would choose the algorithm which will potentially give best performance based on

Example using benchdnn

The following examples use benchdnn to illustrate the performance benefits of using convolution_auto.

On a 2 Socket Intel Xeon 8180 processor with 28 cores/socket and HT off:

OMP_NUM_THREADS=56 KMP_AFFINITY=granularity=fine,compact numactl -l tests/benchdnn/benchdnn --mode=p --conv -v5 --alg=auto --dir=BWD_WB mb112ic64ih300oc64oh300kh3ph1n"ssd_300_voc0712:conv1_2"
mkldnn implementation: jit_wino_4x3:avx512_core
...
mkldnn_verbose,exec,convolution,jit_wino_4x3:avx512_core,backward_weights,fsrc:nChw16c fwei:gOIhw16i16o fbia:x fdst:nChw16c,alg:convolution_winograd,mb112_g1ic64oc64_ih300oh300kh3sh1dh0ph1_iw300ow300kw3sw1dw0pw1,61.32
...
perf,ssd_300_voc0712:conv1_2,--dir=BWD_WB --alg=auto mb112ic64ih300oc64oh300kh3ph1nssd_300_voc0712:conv1_2,739.879,0,61.332,12063.5,62.503,11837.5

In the above test-case convolution_auto choses winograd convolution (using a heuristic based on the convolution sizes and number of threads), as winograd convolution is faster than direct in this case.

OMP_NUM_THREADS=56 KMP_AFFINITY=granularity=fine,compact numactl -l tests/benchdnn/benchdnn --mode=p --conv -v5 --alg=direct --dir=BWD_WB mb112ic64ih300oc64oh300kh3ph1n"ssd_300_voc0712:conv1_2"
mkldnn implementation: jit:avx512_common
...
mkldnn_verbose,exec,convolution,jit:avx512_common,backward_weights,fsrc:nchw fwei:gOhwi16o fbia:x fdst:nChw16c,alg:convolution_direct,mb112_g1ic64oc64_ih300oh300kh3sh1dh0ph1_iw300ow300kw3sw1dw0pw1,176.10
...
perf,ssd_300_voc0712:conv1_2,--dir=BWD_WB mb112ic64ih300oc64oh300kh3ph1nssd_300_voc0712:conv1_2,739.879,0,175.422,4217.7,180.315,4103.26


In the following example, convolution_auto chooses direct convolution because the winograd implementation is slower than direct in this case.

OMP_NUM_THREADS=56 KMP_AFFINITY=granularity=fine,compact tests/benchdnn/benchdnn --mode=p --conv -v5 --alg=auto --dir=BWD_WB mb112ic64ih28oc64oh28kh3ph1n"googlenet_v2:inception_3a/3x3"
mkldnn implementation: jit:avx512_common
...
mkldnn_verbose,exec,convolution,jit:avx512_common,backward_weights,fsrc:nChw16c fwei:gOIhw16i16o fbia:x fdst:nChw16c,alg:convolution_direct,mb112_g1ic64oc64_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,1.13
perf,googlenet_v2:inception_3a/3x3,--dir=BWD_WB --alg=auto mb112ic64ih28oc64oh28kh3ph1ngooglenet_v2:inception_3a/3x3,6.1693,0,1.04272,5916.52,1.13284,5445.88
OMP_NUM_THREADS=56 KMP_AFFINITY=granularity=fine,compact tests/benchdnn/benchdnn --mode=p --conv -v5 --alg=wino --dir=BWD_WB mb112ic64ih28oc64oh28kh3ph1n"googlenet_v2:inception_3a/3x3"
mkldnn implementation: jit_wino_4x3:avx512_core
...
mkldnn_verbose,exec,convolution,jit_wino_4x3:avx512_core,backward_weights,fsrc:nChw16c fwei:gOIhw16i16o fbia:x fdst:nChw16c,alg:convolution_winograd,mb112_g1ic64oc64_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,2.15
...
perf,googlenet_v2:inception_3a/3x3,--dir=BWD_WB --alg=wino mb112ic64ih28oc64oh28kh3ph1ngooglenet_v2:inception_3a/3x3,6.1693,0,2.14404,2877.41,2.20445,2798.56

Other considerations when using Winograd

The following side-effects should be weighed against the performance boost achieved when using Winograd: