Example code: performance_profiling.cpp
This example uses DNNL_VERBOSE trace output to tune DNNL code to align with the best practices.
It assumes knowledge of memory formats and their usage in DNNL; the DNNL documentation on memory formats covers this topic in more detail.
The example has three different implementations of the same mathematical operation, a convolution followed by ReLU:
Naive implementation: runs the convolution and ReLU on data in plain NCHW format. It does not follow the best practices and is the slowest of the three.
Blocked format implementation: uses format_tag=ANY to create a convolution memory descriptor to determine the data format optimal for the convolution implementation. It then propagates the blocked format to the non-intensive ReLU. This implementation results in better overall performance than the naive implementation.
Fused implementation: uses format_tag=ANY to create a convolution memory descriptor, and then adds ReLU as a post-op to the convolution primitive. This version implements all of the best practices for inference, resulting in the best overall performance.
The program in performance_profiling.cpp includes all three implementations introduced above. You can select the specific implementation using command line options.
After compilation, you can execute each implementation with:
Before you run the program, set your DNNL_VERBOSE environment variable to 1:
The program starts by creating DNNL memory objects in NCHW format. These carry the user_ prefix because they represent the user's source data entering DNNL in NCHW format.
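As a reminder of what plain NCHW means for the user_ buffers, the sketch below computes the linear offset of one element in a dense, unpadded NCHW buffer. It uses only the C++ standard library; Dims4 and nchw_offset are hypothetical names introduced for illustration and are not part of DNNL or of performance_profiling.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Dimensions of a 4D activation tensor: batch, channels, height, width.
// (Hypothetical helper type, not a DNNL type.)
struct Dims4 {
    std::size_t n, c, h, w;
};

// Linear offset of element (n, c, h, w) in a dense, unpadded NCHW buffer.
// W is the fastest-varying dimension and N the slowest.
std::size_t nchw_offset(const Dims4 &d, std::size_t n, std::size_t c,
                        std::size_t h, std::size_t w) {
    assert(n < d.n && c < d.c && h < d.h && w < d.w);
    return ((n * d.c + c) * d.h + h) * d.w + w;
}
```

For a 2x3x4x5 tensor, moving one step along W changes the offset by 1, while moving one step along C changes it by H*W = 20, so neighbouring channels of the same pixel are far apart in memory.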
The following descriptions of each implementation will reference each other, and are meant to be read in order.
This implementation is launched with the following shell code:
The program will call the implementation defined in the function conv_relu_naive().
First it sets the dimensions and format for the convolution memory descriptors (_md) to match the user_ values, one md each for source, destination, and weight data. Then it uses those md to create the convolution descriptor conv_d, which tells DNNL to use plain format (NCHW) for the convolution.
Next the program creates a convolution primitive descriptor conv_pd, which inherits NCHW format from the md by way of conv_d. Finally it creates the convolution primitive conv from conv_pd, adds it to the stream s, and then executes the create_and_execute_relu(user_dst) function.
The create_and_execute_relu() function uses whatever the input data format is at the time it is called.
Using NCHW data format may result in suboptimal performance for compute-intensive primitives, as shown in the following DNNL_VERBOSE output by the convolution and ReLU execution times of 38.3 and 2.9 milliseconds, respectively.
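When comparing runs, it can help to total the per-primitive times from a captured trace. The following is a minimal sketch, assuming that each execution record starts with "dnnl_verbose,exec," and that its last comma-separated field is the execution time in milliseconds; check this against the trace your DNNL version actually prints. The function names exec_time_ms and total_exec_time_ms are hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Execution time (ms) reported by one verbose line, or 0.0 for lines that
// are not "exec" records or whose last field is not a number.
// Assumption: the time is the last comma-separated field of the line.
double exec_time_ms(const std::string &line) {
    const std::string prefix = "dnnl_verbose,exec,";
    if (line.compare(0, prefix.size(), prefix) != 0) return 0.0;
    const std::size_t last_comma = line.rfind(',');
    try {
        return std::stod(line.substr(last_comma + 1));
    } catch (const std::exception &) {
        return 0.0;  // Empty or non-numeric last field.
    }
}

// Sum of the execution times of every "exec" line in a captured log.
double total_exec_time_ms(const std::string &log) {
    std::istringstream in(log);
    std::string line;
    double total = 0.0;
    while (std::getline(in, line)) total += exec_time_ms(line);
    return total;
}
```

Applied to a trace containing the two times quoted above (38.3 and 2.9 milliseconds), the total is 41.2 milliseconds, which is the figure to compare against the other implementations.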
DNNL_VERBOSE output (see configuration notice*):
The blocked format implementation incorporates the best practice of letting DNNL determine the optimal format for the convolution primitive.
This implementation is launched with the following shell code:
The program will call the implementation defined in the function conv_relu_blocked().
First it creates the md as in the naive implementation. Next it changes the dnnl::memory::format_tag for each md to ANY. Then it uses those md to create the convolution descriptor conv_d, which tells DNNL to use whatever format it recommends for the convolution. DNNL will choose a hardware-friendly blocked format.
Next the program creates a convolution primitive descriptor conv_pd and convolution primitive conv as in the naive implementation. However, in this implementation they inherit the blocked format from the md by way of conv_d.
Since the resulting convolution primitive will expect blocked source data, conditional reorders are inserted to convert input data to blocked format if required. The input data user_src is NCHW, so this conditional will be triggered:
Where needed, the same check is applied to the other convolution arguments, each conversion using its own reorder primitive. Finally it creates the convolution primitive conv and adds it to the stream s with the reordered data (conv_src, conv_wei, conv_dst1) as inputs, and then executes the create_and_execute_relu(conv_dst) function.
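The idea of a conditional reorder can be illustrated without DNNL. The sketch below is a toy model, not the DNNL API: it assumes the blocked layout groups channels in blocks of 8 (the layout DNNL calls nChw8c; the format actually chosen depends on the hardware), and Layout, Tensor, reorder_nchw_to_blocked, and prepare_conv_input are hypothetical names. The data is converted only when the user's layout differs from the one the convolution expects.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Layout { nchw, nChw8c };

// Toy tensor: a layout tag, logical dimensions, and a flat buffer.
struct Tensor {
    Layout layout;
    std::size_t n, c, h, w;
    std::vector<float> data;
};

// Copies an NCHW tensor into an 8-channel blocked layout.
// Assumes the channel count is a multiple of 8 (no padding handled here).
Tensor reorder_nchw_to_blocked(const Tensor &src) {
    const std::size_t B = 8;
    assert(src.layout == Layout::nchw && src.c % B == 0);
    assert(src.data.size() == src.n * src.c * src.h * src.w);
    Tensor dst{Layout::nChw8c, src.n, src.c, src.h, src.w,
               std::vector<float>(src.data.size())};
    for (std::size_t n = 0; n < src.n; ++n)
        for (std::size_t c = 0; c < src.c; ++c)
            for (std::size_t h = 0; h < src.h; ++h)
                for (std::size_t w = 0; w < src.w; ++w) {
                    const std::size_t from =
                        ((n * src.c + c) * src.h + h) * src.w + w;
                    // Channel block is an outer dimension; the position
                    // inside the block becomes the innermost dimension.
                    const std::size_t to =
                        (((n * (src.c / B) + c / B) * src.h + h) * src.w + w)
                            * B + c % B;
                    dst.data[to] = src.data[from];
                }
    return dst;
}

// Conditional reorder: convert only if the user data is not already blocked.
Tensor prepare_conv_input(const Tensor &user, bool &reordered) {
    reordered = (user.layout != Layout::nChw8c);
    return reordered ? reorder_nchw_to_blocked(user) : user;
}
```

Calling prepare_conv_input on NCHW data sets reordered to true and pays for one copy; calling it again on the result is a no-op, which mirrors why a chain of primitives that all accept the blocked format pays the conversion cost only at its ends.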
Blocked memory format is recommended for DNNL primitive execution and provides better performance, as shown in the DNNL_VERBOSE output by the convolution and ReLU execution times of 18.3 and 2.7 milliseconds (down from 38.3 and 2.9 in the naive implementation), respectively. In this implementation, there is an additional reorder operation that executes before and after the conv + ReLU. This small cost is worth the gain from executing in blocked format. In fact, it becomes negligible when chaining together multiple DNNL operations in succession. In these situations, you can do one reorder at the beginning and one at the end of the chain, and only pay the reorder penalty at those points in the execution.
DNNL_VERBOSE output (see configuration notice*):
This inference implementation is closer to best practices than the naive implementation because it uses the DNNL recommended memory format. The fused implementation will further optimize the performance by using a fused version of the conv + ReLU primitive, employing the DNNL post-ops attribute.
This implementation is launched with the following shell code:
The program will call the implementation defined in the function conv_relu_fused().
First the memory descriptors and convolution descriptor are created as in the blocked format implementation, with format_tag set to ANY.
Then, in preparation for the convolution primitive descriptor, a ReLU post-op is built and added to the primitive attribute attr:
The convolution primitive descriptor then picks up the ReLU post-op by way of the attributes attr:
Then conditional reorders are applied as in the blocked format implementation to convert the user_ format NCHW to blocked. Finally, it creates the convolution primitive conv and adds it to the stream s with the reordered data (conv_src, conv_wei, conv_dst1).
There is no separate ReLU step in this implementation because ReLU is fused into the conv primitive.
This implementation complies with best practices for f32 inference by using the DNNL recommended blocked format for convolution and adding ReLU as a post-op to execute a fused version of conv + ReLU. The benefit of following best practices can be seen in the 18.0 millisecond execution time of the fused primitive.
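To see why fusing helps, compare two ways of producing the same result in a toy 1D convolution. This is a conceptual sketch only, with hypothetical names conv1d and relu_inplace; it does not show how DNNL implements post-ops. In the unfused path the output buffer is written by the convolution and then read and rewritten by a separate ReLU pass, while in the fused path ReLU is applied to each value before it is stored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy 1D convolution without padding (cross-correlation form).
// When fuse_relu is true, ReLU is applied to each output value while it is
// still in a local variable, so the output buffer is written only once.
std::vector<float> conv1d(const std::vector<float> &src,
                          const std::vector<float> &wei, bool fuse_relu) {
    assert(!wei.empty() && src.size() >= wei.size());
    std::vector<float> dst(src.size() - wei.size() + 1);
    for (std::size_t i = 0; i < dst.size(); ++i) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < wei.size(); ++k)
            acc += src[i + k] * wei[k];
        dst[i] = fuse_relu ? std::max(acc, 0.0f) : acc;
    }
    return dst;
}

// Separate ReLU step: a second full pass over the output buffer.
void relu_inplace(std::vector<float> &x) {
    for (float &v : x) v = std::max(v, 0.0f);
}
```

Running conv1d with fuse_relu set to false followed by relu_inplace gives the same values as a single call with fuse_relu set to true; the fused call simply avoids the extra pass over memory.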
DNNL_VERBOSE output (see configuration notice*):
Implementation | Time, ms | Cumulative speedup |
---|---|---|
Naive | 41.2 | 1.0 |
Blocked format | 21.0 | 2.0 |
Fused | 18.0 | 2.3 |
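The cumulative speedup column is the naive time divided by each implementation's time, rounded to one decimal place. A small sketch of that arithmetic follows; cumulative_speedup is a hypothetical helper name, and the times are the ones from the table above.

```cpp
#include <cassert>
#include <cmath>

// Speedup of a run relative to the baseline, rounded to one decimal place.
double cumulative_speedup(double baseline_ms, double time_ms) {
    assert(baseline_ms > 0.0 && time_ms > 0.0);
    return std::round(baseline_ms / time_ms * 10.0) / 10.0;
}
```

With a baseline of 41.2 ms (38.3 + 2.9), the blocked format run at 21.0 ms (18.3 + 2.7) gives 41.2 / 21.0 ≈ 1.96, reported as 2.0, and the fused run at 18.0 ms gives 41.2 / 18.0 ≈ 2.29, reported as 2.3.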
Runtime Settings:
Platform: