This C++ API example demonstrates how to run AlexNet's conv3 and relu3 with int8 data type.

Example code: cpu_cnn_inference_int8.cpp

Configure tensor shapes

    // AlexNet: conv3
    // {batch, 256, 13, 13} (x) {384, 256, 3, 3} -> {batch, 384, 13, 13}
    // strides: {1, 1}
    memory::dims conv_src_tz = { batch, 256, 13, 13 };
    memory::dims conv_weights_tz = { 384, 256, 3, 3 };
    memory::dims conv_bias_tz = { 384 };
    memory::dims conv_dst_tz = { batch, 384, 13, 13 };
    memory::dims conv_strides = { 1, 1 };
    memory::dims conv_padding = { 1, 1 };
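
The destination shape above can be checked with the standard convolution output-size formula. The following is a minimal sketch in plain C++; `conv_out_size` is a hypothetical helper introduced for illustration (it is not part of the library), and it assumes no dilation.

```cpp
#include <cassert>

// Hypothetical helper (not a library function): spatial output size of a
// convolution without dilation, given left and right padding.
inline int conv_out_size(int in, int kernel, int stride, int pad_l, int pad_r) {
    // The padded input must be at least as large as the kernel.
    assert(stride > 0 && in + pad_l + pad_r >= kernel);
    return (in + pad_l + pad_r - kernel) / stride + 1;
}
```

With the conv3 values (input 13, kernel 3, stride 1, padding 1 on each side), `conv_out_size(13, 3, 1, 1, 1)` gives 13, which matches the spatial size in conv_dst_tz.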

Next, the example configures the scales used to quantize f32 data into int8. The scaling values in this example are arbitrary; in a realistic scenario they should be calculated from a set of precomputed values, as previously mentioned.

    // Choose scaling factors for input, weight, output and bias quantization
    const std::vector<float> src_scales = { 1.8f };
    const std::vector<float> weight_scales = { 2.0f };
    const std::vector<float> bias_scales = { 1.0f };
    const std::vector<float> dst_scales = { 0.55f };
    // Choose channel-wise scaling factors for convolution
    std::vector<float> conv_scales(384);
    const int scales_half = 384 / 2;
    std::fill(conv_scales.begin(), conv_scales.begin() + scales_half, 0.3f);
    std::fill(conv_scales.begin() + scales_half, conv_scales.end(), 0.8f);
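
One common way to obtain a non-arbitrary scale is to map the largest absolute value observed in the f32 data to the top of the integer range. The sketch below assumes that approach; `scale_from_max_abs` is a hypothetical helper written for illustration and is not part of the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: derive a quantization scale from sample f32 data so
// that the largest magnitude maps to int_max (255 for u8, 127 for s8).
inline float scale_from_max_abs(const std::vector<float> &data, float int_max) {
    assert(!data.empty());
    float max_abs = 0.f;
    for (float v : data)
        max_abs = std::max(max_abs, std::fabs(v));
    assert(max_abs > 0.f); // an all-zero tensor has no meaningful scale
    return int_max / max_abs;
}
```

Under this assumption, passing 255.f as `int_max` would yield a scale suited to u8 data, and 127.f one suited to s8 data.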

The source, weights, bias, and destination data each use a single scale (mask = 0), while the output scales of the convolution (conv_scales) use the array format with mask = 2 (that is, 1 << 1), which selects the output channel dimension.

    const int src_mask = 0;
    const int weight_mask = 0;
    const int bias_mask = 0;
    const int dst_mask = 0;
    const int conv_mask = 2; // 1 << output_channel_dim
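
The relationship between a mask and the number of scale values can be sketched as follows. The helper `scales_count` is hypothetical; it assumes the convention that bit d of the mask being set means a separate scale for each index along dimension d.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of scale values implied by a mask.
// If bit d of the mask is set, the scales vary along dimension d.
inline std::size_t scales_count(const std::vector<int> &dims, int mask) {
    assert(dims.size() < 31); // keep the shift below within int range
    std::size_t count = 1;
    for (std::size_t d = 0; d < dims.size(); ++d)
        if (mask & (1 << d)) count *= static_cast<std::size_t>(dims[d]);
    return count;
}
```

For a destination of shape {batch, 384, 13, 13}, mask 0 gives a single scale and mask 2 gives 384 scales, which is why conv_scales holds 384 entries.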

Create the memory primitives for user data (source, weights, and bias). The user data will be in its original 32-bit floating point format.

    auto user_src_memory = memory({ { conv_src_tz }, dt::f32, tag::nchw },
            cpu_engine, user_src.data());
    auto user_weights_memory
            = memory({ { conv_weights_tz }, dt::f32, tag::oihw }, cpu_engine,
                    conv_weights.data());
    auto user_bias_memory = memory({ { conv_bias_tz }, dt::f32, tag::x },
            cpu_engine, conv_bias.data());

Create a memory descriptor for each convolution parameter. The convolution data uses 8-bit integer values, so the memory descriptors are configured as:

- 8-bit unsigned (u8) for source and destination.
- 8-bit signed (s8) for bias and weights.

Note: The destination type is chosen as unsigned because the convolution applies a ReLU operation, so the resulting data is \(\geq 0\).

    auto conv_src_md = memory::desc({ conv_src_tz }, dt::u8, tag::any);
    auto conv_bias_md = memory::desc({ conv_bias_tz }, dt::s8, tag::any);
    auto conv_weights_md = memory::desc({ conv_weights_tz }, dt::s8, tag::any);
    auto conv_dst_md = memory::desc({ conv_dst_tz }, dt::u8, tag::any);

Create a convolution descriptor passing the int8 memory descriptors as parameters.

    auto conv_desc = convolution_forward::desc(prop_kind::forward,
            algorithm::convolution_direct, conv_src_md, conv_weights_md, conv_bias_md,
            conv_dst_md, conv_strides, conv_padding, conv_padding);

Configuring int8-specific parameters in an int8 primitive is done via the Attributes Primitive. Create an attributes object for the convolution and configure it accordingly.

    primitive_attr conv_attr;
    conv_attr.set_output_scales(conv_mask, conv_scales);

The ReLU layer from AlexNet is executed through the PostOps feature. Create a PostOps object and configure it to execute an eltwise ReLU operation.

    const float ops_scale = 1.f;
    const float ops_alpha = 0.f; // relu negative slope
    const float ops_beta = 0.f;
    post_ops ops;
    ops.append_eltwise(ops_scale, algorithm::eltwise_relu, ops_alpha, ops_beta);
    conv_attr.set_post_ops(ops);
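
To illustrate what the output scale and the ReLU post-op do to a single value, here is a plain C++ sketch that does not use the library. It assumes the 32-bit accumulator is first multiplied by its channel's scale, then passed through ReLU, then rounded and saturated to u8. The function names and the rounding choice (`std::nearbyint`) are assumptions made for illustration, and the library's internal behavior may differ in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// ReLU with a negative slope: alpha = 0 gives the plain ReLU used here.
inline float relu(float x, float alpha) { return x > 0.f ? x : alpha * x; }

// Hypothetical sketch: scale one int32 accumulator, apply ReLU, and
// saturate the rounded result to the u8 range [0, 255].
inline std::uint8_t scale_relu_to_u8(std::int32_t acc, float scale, float alpha) {
    assert(scale > 0.f);
    const float v = relu(scale * static_cast<float>(acc), alpha);
    const float r = std::nearbyint(v);
    return static_cast<std::uint8_t>(std::min(255.f, std::max(0.f, r)));
}
```

Because ReLU with alpha = 0 never yields a negative value, only the upper bound of the saturation matters in this configuration, which is consistent with storing the destination as u8.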

Create a primitive descriptor using the convolution descriptor and passing along the int8 attributes in the constructor. The primitive descriptor for the convolution will contain the specific memory formats for the computation.

    auto conv_prim_desc = convolution_forward::primitive_desc(
            conv_desc, conv_attr, cpu_engine);

Create a memory for each of the convolution's data input parameters (source, bias, weights, and destination). Using the convolution primitive descriptor as the creation parameter enables Intel MKL-DNN to configure the memory formats for the convolution.

Note: Scaling parameters are passed to the reorder primitive via the attributes primitive.

User memory must be transformed into convolution-friendly memory (for int8 and memory format). A reorder layer performs the data transformation from f32 (the original user data) into int8 format (the data used for the convolution). In addition, the reorder transforms the user data into the required memory format (as explained in the simple_net example).

    auto conv_src_memory = memory(conv_prim_desc.src_desc(), cpu_engine);
    primitive_attr src_attr;
    src_attr.set_output_scales(src_mask, src_scales);
    auto src_reorder_pd = reorder::primitive_desc(cpu_engine,
            user_src_memory.get_desc(), cpu_engine,
            conv_src_memory.get_desc(), src_attr);
    auto src_reorder = reorder(src_reorder_pd);
    src_reorder.execute(s, user_src_memory, conv_src_memory);
    auto conv_weights_memory
            = memory(conv_prim_desc.weights_desc(), cpu_engine);
    primitive_attr weight_attr;
    weight_attr.set_output_scales(weight_mask, weight_scales);
    auto weight_reorder_pd = reorder::primitive_desc(cpu_engine,
            user_weights_memory.get_desc(), cpu_engine,
            conv_weights_memory.get_desc(), weight_attr);
    auto weight_reorder = reorder(weight_reorder_pd);
    weight_reorder.execute(s, user_weights_memory, conv_weights_memory);
    auto conv_bias_memory = memory(conv_prim_desc.bias_desc(), cpu_engine);
    primitive_attr bias_attr;
    bias_attr.set_output_scales(bias_mask, bias_scales);
    auto bias_reorder_pd = reorder::primitive_desc(cpu_engine,
            user_bias_memory.get_desc(), cpu_engine,
            conv_bias_memory.get_desc(), bias_attr);
    auto bias_reorder = reorder(bias_reorder_pd);
    bias_reorder.execute(s, user_bias_memory, conv_bias_memory);
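
Conceptually, each quantizing reorder above multiplies every f32 element by the scale and converts the result to the 8-bit type. The following is a minimal sketch for the s8 case, assuming round-to-nearest and saturation to [-128, 127]. `quantize_s8` is a hypothetical helper; the real reorder also changes the memory layout, which is not modeled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: elementwise f32 -> s8 quantization with a single
// scale (the mask = 0 case), rounding to nearest and saturating.
inline std::vector<std::int8_t> quantize_s8(
        const std::vector<float> &src, float scale) {
    assert(std::isfinite(scale));
    std::vector<std::int8_t> dst;
    dst.reserve(src.size());
    for (float v : src) {
        const float r = std::nearbyint(v * scale);
        dst.push_back(static_cast<std::int8_t>(
                std::min(127.f, std::max(-128.f, r))));
    }
    return dst;
}
```

With a scale of 2.0, the inputs 0.5, -1.0, and 100.0 become 1, -2, and 127; the last value saturates because 200 does not fit in s8.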

Create the convolution primitive and execute it. The int8 example computes the same Convolution + ReLU layers from AlexNet as simple-net.cpp, using the int8 and PostOps approach. Although performance is not measured here, in practice the int8 version is expected to require less computation time to achieve similar results.

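    // Destination memory in the format chosen by the primitive descriptor
    auto conv_dst_memory = memory(conv_prim_desc.dst_desc(), cpu_engine);
    auto conv = convolution_forward(conv_prim_desc);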
    conv.execute(s,
            { { MKLDNN_ARG_SRC, conv_src_memory },
                    { MKLDNN_ARG_WEIGHTS, conv_weights_memory },
                    { MKLDNN_ARG_BIAS, conv_bias_memory },
                    { MKLDNN_ARG_DST, conv_dst_memory } });

Finally, dst memory may be dequantized from int8 into the original f32 format. Create a memory primitive for the user data in the original 32-bit floating point format and then apply a reorder to transform the computation output data.

    auto user_dst_memory = memory({ { conv_dst_tz }, dt::f32, tag::nchw },
            cpu_engine, user_dst.data());
    primitive_attr dst_attr;
    dst_attr.set_output_scales(dst_mask, dst_scales);
    auto dst_reorder_pd = reorder::primitive_desc(cpu_engine,
            conv_dst_memory.get_desc(), cpu_engine,
            user_dst_memory.get_desc(), dst_attr);
    auto dst_reorder = reorder(dst_reorder_pd);
    dst_reorder.execute(s, conv_dst_memory, user_dst_memory);
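
The dequantizing reorder can be pictured as an elementwise multiply: each u8 element is converted to f32 and multiplied by the scale set in the attributes. A minimal sketch under that assumption follows; `dequantize_u8` is a hypothetical helper, and the layout change performed by the real reorder is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: elementwise u8 -> f32 conversion with a single scale.
inline std::vector<float> dequantize_u8(
        const std::vector<std::uint8_t> &src, float scale) {
    assert(scale > 0.f); // this sketch expects a positive scale
    std::vector<float> dst;
    dst.reserve(src.size());
    for (std::uint8_t v : src)
        dst.push_back(scale * static_cast<float>(v));
    return dst;
}
```

With a scale of 0.5, the u8 values 0, 100, and 255 become 0.0, 50.0, and 127.5.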