Example code: cpu_performance_profiling.cpp

This example uses MKLDNN_VERBOSE trace output to tune Intel MKL-DNN code to align with the best practices.

This example assumes knowledge of memory formats and their usage in Intel MKL-DNN; see the Intel MKL-DNN documentation on memory formats for background.

The example has three different implementations of the mathematical operation:

The naive implementation executes 2D convolution followed by ReLU on data in NCHW format. This implementation does not align with Intel MKL-DNN best practices and results in suboptimal performance.
The blocked format implementation executes the same sequence of operations on a blocked format optimized for convolution performance. This implementation uses format_tag=ANY when creating the convolution memory descriptors, which lets the library determine the optimal data format for the convolution implementation. It then propagates the blocked format to the non-intensive ReLU. This implementation results in better overall performance than the naive implementation.
The fused implementation executes convolution fused with ReLU on the blocked data format. This implementation uses format_tag=ANY when creating the convolution memory descriptors, and then adds ReLU as a post-op to the convolution primitive. This version implements all of the best practices for inference, resulting in the best overall performance.

Walkthrough

The program in cpu_performance_profiling.cpp includes all three implementations introduced above. You can select the specific implementation using command line options.

After compilation, you can execute each implementation with:

./program.exe implementation

Before you run the program, set your MKLDNN_VERBOSE environment variable to 1:

export MKLDNN_VERBOSE=1

The program starts by creating Intel MKL-DNN memory objects in NCHW format. These are called user_ because they are meant to represent the user's source data entering Intel MKL-DNN with the NCHW format.

    // set dimensions for synthetic data and weights
    const memory::dim BATCH = 1000;
    const memory::dim IC = 3, OC = 96;
    const memory::dim IH = 227, KH = 11, OH = 55;
    const memory::dim IW = 227, KW = 11, OW = 55;

    // create MKL-DNN memory objects for user's tensors (in nchw and oihw formats)
    // note: here the library allocates memory
    auto user_src = memory({{BATCH, IC, IH, IW}, memory::data_type::f32,
                    memory::format_tag::nchw}, cpu);
    auto user_wei = memory({{OC, IC, KH, KW}, memory::data_type::f32,
                    memory::format_tag::oihw}, cpu);
    auto user_dst = memory({{BATCH, OC, OH, OW}, memory::data_type::f32,
                    memory::format_tag::nchw}, cpu);

Note: You can change the batch size to easily increase/decrease the workload.
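
The output sizes OH and OW above follow from the input size, kernel size, stride, and padding. As a minimal sketch, assuming no dilation, the relation can be checked with a small helper. The name conv_out_dim is hypothetical and not part of Intel MKL-DNN; the stride of 4 and padding of 0 are read off the sh4/sw4 and ph0/pw0 fields in the MKLDNN_VERBOSE output shown later in this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an Intel MKL-DNN API): output size of a
// convolution along one spatial axis, assuming no dilation:
//   out = (in - kernel + pad_left + pad_right) / stride + 1
inline std::int64_t conv_out_dim(std::int64_t in, std::int64_t kernel,
        std::int64_t stride, std::int64_t pad_left, std::int64_t pad_right) {
    assert(stride > 0);
    assert(in + pad_left + pad_right >= kernel);
    return (in - kernel + pad_left + pad_right) / stride + 1;
}
```

With the values in this example, conv_out_dim(227, 11, 4, 0, 0) evaluates to 55, which matches OH and OW.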

The following descriptions of each implementation will reference each other, and are meant to be read in order.

Naive Implementation

This implementation is launched with the following shell code:

./program.exe naive

The program will call the implementation defined in the function conv_relu_naive().

First it sets the dimensions and format for the convolution memory descriptors (_md) to match the user_ values, with one md each for source, destination, and weight data. Then it uses those md to create the convolution descriptor conv_d, which tells Intel MKL-DNN to use plain format (NCHW) for the convolution.

    // copy the dimensions and format from user's memory
    auto conv_src_md = memory::desc(user_src.get_desc());
    auto conv_wei_md = memory::desc(user_wei.get_desc());
    auto conv_dst_md = memory::desc(user_dst.get_desc());

    // create a convolution descriptor
    auto conv_d = convolution_forward::desc(
            prop_kind::forward_inference, algorithm::convolution_direct,
            conv_src_md, conv_wei_md, conv_dst_md,
            strides, padding, padding);

Next the program creates a convolution primitive descriptor conv_pd, which inherits the NCHW format from the md by way of conv_d. Finally it creates the convolution primitive conv, executes it on the stream s, and then calls the create_and_execute_relu(user_dst) function.

    // create a convolution primitive descriptor
    auto conv_pd = convolution_forward::primitive_desc(conv_d, cpu);

    // create convolution primitive
    auto conv = convolution_forward(conv_pd);

    // execute convolution by adding it to the stream s
    conv.execute(s, {
            {MKLDNN_ARG_SRC, user_src},
            {MKLDNN_ARG_WEIGHTS, user_wei},
            {MKLDNN_ARG_DST, user_dst}});

    // execute relu (on convolution's destination format, whatever it is)
    create_and_execute_relu(user_dst);

Note: The function that creates and executes the ReLU primitive is defined elsewhere to keep this example clean. ReLU is a non-intensive operation, so the create_and_execute_relu() function uses whatever the input data format is at the time it is called.

Using NCHW data format may result in suboptimal performance for compute-intensive primitives, as shown in the following MKLDNN_VERBOSE output by the convolution and ReLU execution times of 235.9 and 100.3 milliseconds, respectively.

MKLDNN_VERBOSE output (see configuration notice*):

mkldnn_verbose,exec,convolution,gemm:jit,forward_inference,src_f32::
        blocked:abcd:f0 wei_f32::blocked:abcd:f0 dst_f32::
        blocked:abcd:f0,alg:convolution_direct,
        mb1000_ic3oc96_ih227oh55kh11sh4dh0ph0_iw227ow55kw11sw4dw0pw0,235.86
mkldnn_verbose,exec,eltwise,jit:avx512_common,forward_inference,
        data_f32::blocked:abcd:f0,alg:eltwise_relu,1000x96x55x55,100.264
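
The timings quoted above are the last comma-separated field of each record. As a rough sketch, assuming each record arrives on a single physical line (the listing above is wrapped for readability) and follows the field order seen in this example, a small helper could pull out the primitive kind, the implementation name, and the execution time. VerboseRecord and parse_verbose_line are hypothetical names, not part of Intel MKL-DNN.

```cpp
#include <cassert>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type (not an Intel MKL-DNN API).
struct VerboseRecord {
    std::string primitive; // e.g. "convolution", "eltwise", "reorder"
    std::string impl;      // e.g. "jit:avx512_common"
    double time_ms;        // last field of the record
};

// Parses one "mkldnn_verbose,exec,..." record. Returns false if the line
// does not look like an exec record or the last field is not a number.
inline bool parse_verbose_line(const std::string &line, VerboseRecord &out) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ','))
        fields.push_back(field);

    if (fields.size() < 5 || fields[0] != "mkldnn_verbose"
            || fields[1] != "exec")
        return false;

    try {
        out.time_ms = std::stod(fields.back());
    } catch (const std::exception &) {
        return false; // last field was not numeric
    }
    out.primitive = fields[2];
    out.impl = fields[3];
    return true;
}
```

Feeding the two records above through this helper and summing time_ms per primitive gives the per-implementation totals used in the performance summary.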

The blocked format implementation incorporates the best practice of letting Intel MKL-DNN determine the optimal format for the convolution primitive.

Blocked format implementation

This implementation is launched with the following shell code:

./program.exe blocked

The program will call the implementation defined in the function conv_relu_blocked().

First it creates the md as in the naive implementation. Next it resets the format of each md to any (mkldnn_format_kind_any in the code). Then it uses those md to create the convolution descriptor conv_d, which tells Intel MKL-DNN to use whatever format it recommends for the convolution. Intel MKL-DNN will choose the CPU-friendly blocked format.

    // copy the dimensions and format from user's memory
    auto conv_src_md = memory::desc(user_src.get_desc());
    auto conv_wei_md = memory::desc(user_wei.get_desc());
    auto conv_dst_md = memory::desc(user_dst.get_desc());
    // reset format to "any" to allow convolution to pick the best implementation
    conv_src_md.data.format_kind = mkldnn_format_kind_any;
    conv_wei_md.data.format_kind = mkldnn_format_kind_any;
    conv_dst_md.data.format_kind = mkldnn_format_kind_any;

    // create a convolution descriptor
    auto conv_d = convolution_forward::desc(
            prop_kind::forward_inference, algorithm::convolution_direct,
            conv_src_md, conv_wei_md, conv_dst_md,
            strides, padding, padding);

Next the program creates a convolution primitive descriptor conv_pd as in the naive implementation. However, in this implementation the primitive descriptor selects a blocked format, because the md passed in by way of conv_d leave the format open.

    // create a convolution primitive descriptor
    auto conv_pd = convolution_forward::primitive_desc(conv_d, cpu);

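Since the resulting convolution primitive may expect data in a blocked format, conditional reorders are inserted to convert the user's data to the expected format if required. A reorder is triggered whenever the format reported by conv_pd differs from the user's plain format; in the MKLDNN_VERBOSE output for this implementation, this happens for the weights, while the source stays in NCHW.

Note: The reorders are applied using the Intel MKL-DNN reorder primitive.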

    // prepare convolution source
    memory conv_src = user_src;
    if (conv_pd.src_desc() != user_src.get_desc()) {
        conv_src = memory(conv_pd.src_desc(), cpu);
        auto r_pd = reorder::primitive_desc(user_src, conv_src);
        reorder(r_pd).execute(s, user_src, conv_src);
    }
    // prepare convolution weights
    memory conv_wei = user_wei;
    if (conv_pd.weights_desc() != user_wei.get_desc()) {
        conv_wei = memory(conv_pd.weights_desc(), cpu);
        auto r_pd = reorder::primitive_desc(user_wei, conv_wei);
        reorder(r_pd).execute(s, user_wei, conv_wei);
    }
    // prepare convolution destination
    memory conv_dst = user_dst;
    if (conv_pd.dst_desc() != user_dst.get_desc())
        conv_dst = memory(conv_pd.dst_desc(), cpu);

Finally it creates the convolution primitive conv, executes it on the stream s with the prepared data (conv_src, conv_wei, conv_dst), and then calls the create_and_execute_relu(conv_dst) function.

    // create convolution primitive
    auto conv = convolution_forward(conv_pd);

    // execute convolution by adding it to the stream s
    conv.execute(s, {
            {MKLDNN_ARG_SRC, conv_src},
            {MKLDNN_ARG_WEIGHTS, conv_wei},
            {MKLDNN_ARG_DST, conv_dst}});

    // execute relu (on convolution's destination format, whatever it is)
    create_and_execute_relu(conv_dst);

Blocked memory format is recommended for Intel MKL-DNN primitive execution and provides better performance, as shown in the MKLDNN_VERBOSE output by the convolution and ReLU execution times of 119.6 and 34.4 milliseconds, respectively (down from 235.9 and 100.3 in the naive implementation). In this implementation, additional reorder operations execute before and after the conv + ReLU. This cost is worth the gain from executing in blocked format. In fact, it becomes negligible when chaining together multiple Intel MKL-DNN operations in succession. In these situations, you can do one reorder at the beginning and one at the end of the chain, and only pay the reorder penalty at those points in the execution.

MKLDNN_VERBOSE output (see configuration notice*):

mkldnn_verbose,exec,reorder,jit:uni,undef,src_f32::blocked:abcd:f0
        dst_f32::blocked:Acdb16a:f0,num:1,96x3x11x11,3.71387
mkldnn_verbose,exec,convolution,jit:avx512_common,forward_inference,
        src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb16a:f0
        dst_f32::blocked:aBcd16b:f0,alg:convolution_direct,
        mb1000_ic3oc96_ih227oh55kh11sh4dh0ph0_iw227ow55kw11sw4dw0pw0,119.649
mkldnn_verbose,exec,eltwise,jit:avx512_common,forward_inference,
        data_f32::blocked:aBcd16b:f0,alg:eltwise_relu,1000x96x55x55,34.417
mkldnn_verbose,exec,reorder,jit:uni,undef,src_f32::blocked:aBcd16b:f0
        dst_f32::blocked:abcd:f0,num:1,1000x96x55x55,97.3352
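
The tags abcd and aBcd16b in the log above name the plain and the blocked layout. As an illustrative sketch only (the function names are hypothetical, and the library's internal handling, including channel padding, is not modeled), the following helpers show how a logical (n, c, h, w) index maps to a linear offset in each layout. The blocked variant assumes the channel count is a multiple of 16, which holds for OC = 96.

```cpp
#include <cassert>
#include <cstddef>

// Plain NCHW ("abcd"): w is the fastest-moving index, then h, c, n.
inline std::size_t offset_nchw(std::size_t n, std::size_t c, std::size_t h,
        std::size_t w, std::size_t C, std::size_t H, std::size_t W) {
    return ((n * C + c) * H + h) * W + w;
}

// Blocked "aBcd16b" (also written nChw16c): channels are split into blocks
// of 16, and the position inside a block becomes the fastest-moving index.
// Assumption: C is a multiple of 16; padded channels are not modeled.
inline std::size_t offset_nChw16c(std::size_t n, std::size_t c, std::size_t h,
        std::size_t w, std::size_t C, std::size_t H, std::size_t W) {
    const std::size_t block = 16;
    assert(C % block == 0);
    const std::size_t outer = ((n * (C / block) + c / block) * H + h) * W + w;
    return outer * block + c % block;
}
```

For the destination tensor in this example (C = 96, H = W = 55), the value for channel 1 at a given pixel sits one element after channel 0 in the blocked layout, but 3025 elements after it in NCHW. Keeping 16 channels of one pixel contiguous is what lets a vectorized kernel load them together.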

This inference implementation is closer to best practices than the naive implementation because it uses the Intel MKL-DNN recommended memory format. The fused implementation will further optimize the performance by using a fused version of the conv + ReLU primitive, employing the Intel MKL-DNN post-ops attribute.

Fused Implementation

This implementation is launched with the following shell code:

./program.exe fused

The program will call the implementation defined in the function conv_relu_fused().

First the memory descriptors and convolution descriptor are created as in the blocked format implementation.

Then in preparation for the convolution primitive descriptor, a ReLU post-op is built and added to the primitive attribute attr:

// function to create post-op attribute for fused relu
primitive_attr create_attr_with_relu_post_op() {
    // create a post-op with relu
    post_ops ops;
    ops.append_eltwise(1.f, algorithm::eltwise_relu, 0.f, 0.f);
    // create an attribute and set the corresponding post op
    primitive_attr attr;
    attr.set_post_ops(ops);
    return attr;
}

Then the convolution primitive descriptor is created with the ReLU post-op by way of the attribute attr:

    // create an attribute for fused relu
    auto attr = create_attr_with_relu_post_op();
    // create a convolution primitive descriptor
    auto conv_pd = convolution_forward::primitive_desc(conv_d, attr, cpu);

Then conditional reorders are applied as in the blocked format implementation to convert the user_ NCHW format to blocked where required. Finally, it creates the convolution primitive conv and executes it on the stream s with the prepared data (conv_src, conv_wei, conv_dst).

Note: There is no separate addition to the stream for the ReLU operation because it has been added as a post-op to the conv primitive.

    // create convolution primitive
    auto conv = convolution_forward(conv_pd);

    // execute convolution by adding it to the stream s
    conv.execute(s, {
            {MKLDNN_ARG_SRC, conv_src},
            {MKLDNN_ARG_WEIGHTS, conv_wei},
            {MKLDNN_ARG_DST, conv_dst}});

This implementation complies with best practices for f32 inference by using the Intel MKL-DNN recommended blocked format for convolution and adding ReLU as a post-op to execute a fused version of conv + ReLU. The benefit of following best practices can be seen in the execution time of the fused primitive: 103.9 milliseconds.

MKLDNN_VERBOSE output (see configuration notice*):

mkldnn_verbose,exec,convolution,jit:avx512_common,forward_inference,
        src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb16a:f0
        dst_f32::blocked:aBcd16b:f0,alg:convolution_direct,
        mb1000_ic3oc96_ih227oh55kh11sh4dh0ph0_iw227ow55kw11sw4dw0pw0,103.916
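
To illustrate why fusing helps, here is a toy sketch in plain C++. It is not how Intel MKL-DNN implements convolution; conv1d and relu_inplace are hypothetical names, and the convolution is reduced to one dimension with stride 1 and no padding. In the unfused path every output element is written by the convolution and then read and rewritten by a separate ReLU pass. In the fused path ReLU is applied to the accumulator before the single store, so the extra pass over the destination disappears.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy 1D convolution (stride 1, no padding). When fuse_relu is true, ReLU is
// applied to the accumulator before the result is stored.
inline std::vector<float> conv1d(const std::vector<float> &src,
        const std::vector<float> &wei, bool fuse_relu) {
    assert(!wei.empty() && src.size() >= wei.size());
    std::vector<float> dst(src.size() - wei.size() + 1);
    for (std::size_t o = 0; o < dst.size(); ++o) {
        float acc = 0.f;
        for (std::size_t k = 0; k < wei.size(); ++k)
            acc += src[o + k] * wei[k];
        dst[o] = fuse_relu ? std::max(acc, 0.f) : acc;
    }
    return dst;
}

// Separate ReLU pass: reads and rewrites every element of the buffer.
inline void relu_inplace(std::vector<float> &data) {
    for (float &v : data)
        v = std::max(v, 0.f);
}
```

Calling conv1d(src, wei, false) followed by relu_inplace() gives the same values as conv1d(src, wei, true); only the number of passes over the destination buffer differs.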

Performance summary

Implementation	Time, ms	Cumulative speedup
Naive	336.1	1.0
Blocked format	154.0	2.2
Fused	103.9	3.2
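
The figures in the table can be reproduced from the MKLDNN_VERBOSE timings quoted earlier. The sketch below assumes, as the totals suggest, that each row sums only the convolution and ReLU times and leaves reorder times out; round1 and speedup are hypothetical helper names. The results agree with the table up to rounding.

```cpp
#include <cassert>
#include <cmath>

// Execution times in milliseconds, copied from the verbose output above.
const double naive_ms = 235.86 + 100.264;   // convolution + eltwise
const double blocked_ms = 119.649 + 34.417; // convolution + eltwise
const double fused_ms = 103.916;            // fused convolution only

// Rounds to one decimal place, as in the table.
inline double round1(double x) {
    return std::round(x * 10.0) / 10.0;
}

// Speedup of a measurement relative to a baseline time.
inline double speedup(double baseline_ms, double ms) {
    assert(ms > 0.0);
    return baseline_ms / ms;
}
```

For example, round1(speedup(naive_ms, blocked_ms)) gives 2.2 and round1(speedup(naive_ms, fused_ms)) gives 3.2.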

Configuration Notice

Note: This example is meant to demonstrate Intel MKL-DNN best practices. It is not meant for benchmarking purposes. The platform is not fully optimized, so the primitive execution times are only relevant in relation to the other times in this example.

Runtime Settings:

OMP_NUM_THREADS=14
KMP_AFFINITY=granularity=fine,compact,1,0

Platform:

CPU: Intel(R) Xeon(R) Platinum 8180M CPU @ 2.50GHz
Thread(s) per core: 2
Core(s) per socket: 28
Socket(s): 2
NUMA node(s): 2
RAM (DDR4): 1.45 TB