Convolution int8 inference example with Graph API

This is an example to demonstrate how to build an int8 graph with Graph API and run it on CPU.

Example code: cpu_inference_int8.cpp

Some assumptions in this example:

  • Only the workflow is demonstrated; the correctness of results is not validated

  • Unsupported partitions must be handled by users themselves (a sketch follows the call to get_partitions() below)

Public headers

To start using oneDNN Graph, we must include the dnnl_graph.hpp header file in the application. All the C++ APIs reside in namespace dnnl::graph.

#include <iostream>
#include <memory>
#include <vector>
#include <unordered_map>
#include <unordered_set>

#include <assert.h>

#include "oneapi/dnnl/dnnl_graph.hpp"

#include "example_utils.hpp"
#include "graph_example_utils.hpp"

using namespace dnnl::graph;
using data_type = logical_tensor::data_type;
using layout_type = logical_tensor::layout_type;
using property_type = logical_tensor::property_type;
using dim = logical_tensor::dim;
using dims = logical_tensor::dims;

simple_pattern_int8() function

Build Graph and Get Partitions

In this section, we build a graph representing an int8 convolution with a ReLU post-op. After that, we can get all of the partitions, which are determined by the backend.

Create input/output dnnl::graph::logical_tensor and op for the first Dequantize.

logical_tensor dequant0_src_desc {0, data_type::u8};
logical_tensor conv_src_desc {1, data_type::f32};
op dequant0(2, op::kind::Dequantize, {dequant0_src_desc}, {conv_src_desc},
        "dequant0");
dequant0.set_attr<std::string>(op::attr::qtype, "per_tensor");
dequant0.set_attr<std::vector<float>>(op::attr::scales, {0.1f});
dequant0.set_attr<std::vector<int64_t>>(op::attr::zps, {10});
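
For reference, per-tensor dequantization maps a quantized value back to f32 as x_f32 = (x_u8 - zp) * scale. A minimal illustration (not part of the example code):

// Illustration only: per-tensor dequantization semantics.
float dequantize_u8(uint8_t x, float scale, int64_t zp) {
    return (static_cast<float>(x) - static_cast<float>(zp)) * scale;
}
// With the attributes above, dequantize_u8(30, 0.1f, 10) == 2.0f.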

Create input/output dnnl::graph::logical_tensor and op for the second Dequantize.

Note

It’s necessary to provide the scale and zero-point information on the Dequantize applied to the weight.

Note

Users can set the weight’s property type to constant to enable the oneDNN weight cache for better performance.

logical_tensor dequant1_src_desc {3, data_type::s8};
logical_tensor conv_weight_desc {
        4, data_type::f32, 4, layout_type::undef, property_type::constant};
op dequant1(5, op::kind::Dequantize, {dequant1_src_desc},
        {conv_weight_desc}, "dequant1");
dequant1.set_attr<std::string>(op::attr::qtype, "per_channel");
// the memory format of the weight is XIO, which indicates the channel is
// the last dimension; this convolution uses 64 output channels.
std::vector<float> wei_scales(64, 0.1f);
dims wei_zps(64, 0);
dequant1.set_attr<std::vector<float>>(op::attr::scales, wei_scales);
dequant1.set_attr<std::vector<int64_t>>(op::attr::zps, wei_zps);
dequant1.set_attr<int64_t>(op::attr::axis, 3);

Create input/output dnnl::graph::logical_tensor and the op for Convolution.

logical_tensor conv_bias_desc {
        6, data_type::f32, 1, layout_type::undef, property_type::constant};
logical_tensor conv_dst_desc {7, data_type::f32, layout_type::undef};

// create the convolution op
op conv(8, op::kind::Convolution,
        {conv_src_desc, conv_weight_desc, conv_bias_desc}, {conv_dst_desc},
        "conv");
conv.set_attr<dims>(op::attr::strides, {1, 1});
conv.set_attr<dims>(op::attr::pads_begin, {0, 0});
conv.set_attr<dims>(op::attr::pads_end, {0, 0});
conv.set_attr<dims>(op::attr::dilations, {1, 1});
conv.set_attr<std::string>(op::attr::data_format, "NXC");
conv.set_attr<std::string>(op::attr::weights_format, "XIO");
conv.set_attr<int64_t>(op::attr::groups, 1);
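
With data_format set to "NXC", the source tensor is interpreted with channels last (NHWC for this 2D convolution); with weights_format set to "XIO", the weight is interpreted with spatial dimensions first, then input channels, then output channels (HWIO here).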

Create input/output dnnl::graph::logical_tensor and the op for ReLU.

logical_tensor relu_dst_desc {9, data_type::f32, layout_type::undef};
op relu(10, op::kind::ReLU, {conv_dst_desc}, {relu_dst_desc}, "relu");

Create input/output dnnl::graph::logical_tensor and the op for Quantize.

logical_tensor quant_dst_desc {11, data_type::u8, layout_type::undef};
op quant(
        12, op::kind::Quantize, {relu_dst_desc}, {quant_dst_desc}, "quant");
quant.set_attr<std::string>(op::attr::qtype, "per_tensor");
quant.set_attr<std::vector<float>>(op::attr::scales, {0.1f});
quant.set_attr<std::vector<int64_t>>(op::attr::zps, {10});
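
Quantize applies the inverse mapping, x_u8 = saturate(round(x_f32 / scale) + zp), so with scale = 0.1f and zp = 10 the f32 value 2.0f maps back to 30.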

Finally, the created ops are added into the graph, which internally maintains a list of all added ops. Creating a graph requires a dnnl::engine::kind because the returned partitions may vary across devices. For this example, we use the CPU engine kind.

Note

The order in which ops are added doesn’t matter; the connections between them are derived from the logical tensors.

Create graph and add ops to the graph

graph g(dnnl::engine::kind::cpu);

g.add_op(dequant0);
g.add_op(dequant1);
g.add_op(conv);
g.add_op(relu);
g.add_op(quant);

After finishing the above steps, we can get the partitions by calling dnnl::graph::graph::get_partitions().

In this example, the graph will be partitioned into one partition.

auto partitions = g.get_partitions();
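
As stated in the assumptions, this example does not handle unsupported partitions. In general, users should check each partition with dnnl::graph::partition::is_supported() and fall back to their own implementation for the ops in unsupported partitions. A minimal sketch:

for (const auto &p : partitions) {
    if (!p.is_supported()) {
        // The library cannot execute this partition; the ops it contains
        // must be handled by the user (e.g., by the framework itself).
        continue;
    }
    // ... compile and execute the supported partition as shown below.
}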

Compile and Execute Partition

In a real case, users (for example, frameworks) should provide device information at this stage. In this example, we simply create a CPU engine to simulate that behavior.

Create a dnnl::engine and attach a user-defined dnnl::graph::allocator to it.

allocator alloc {};
dnnl::engine eng
        = make_engine_with_allocator(dnnl::engine::kind::cpu, 0, alloc);
dnnl::stream strm {eng};
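
Here a default-constructed allocator is used. As a sketch of what a user-defined host allocator could look like (my_malloc and my_free are hypothetical names, and the alignment handling below is an assumption; a real implementation must honor the library’s allocator contract):

#include <cstddef>
#include <cstdlib>

// Hypothetical host allocation hooks for illustration.
void *my_malloc(size_t size, size_t alignment) {
    // Assumption: treat alignment 0 as a request for default alignment.
    if (alignment == 0) alignment = alignof(std::max_align_t);
    // std::aligned_alloc requires size to be a multiple of alignment.
    size_t rounded = (size + alignment - 1) / alignment * alignment;
    return std::aligned_alloc(alignment, rounded);
}
void my_free(void *ptr) {
    std::free(ptr);
}

allocator alloc {my_malloc, my_free};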

Compile the partition into a compiled partition, using its input and output logical tensors.

compiled_partition cp = partition.compile(inputs, outputs, eng);
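
Here partition is the single partition returned by get_partitions() above, and inputs/outputs are its input and output logical tensors. A sketch of how they can be obtained (note that compilation requires logical tensors with concrete shapes, which the full example provides):

auto partition = partitions[0];
std::vector<logical_tensor> inputs = partition.get_input_ports();
std::vector<logical_tensor> outputs = partition.get_output_ports();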

Execute the compiled partition on the specified stream.

cp.execute(strm, inputs_ts, outputs_ts);
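
inputs_ts and outputs_ts are dnnl::graph::tensor objects that bind user buffers to the input and output logical tensors. A sketch, assuming the logical tensors carry concrete shapes as in the full example:

std::vector<tensor> inputs_ts, outputs_ts;
std::vector<std::vector<uint8_t>> buffers;
for (const auto &lt : inputs) {
    // Allocate a buffer of the required size and bind it to the input.
    buffers.emplace_back(lt.get_mem_size());
    inputs_ts.emplace_back(lt, eng, buffers.back().data());
}
for (const auto &lt : outputs) {
    // Query the compiled partition for the output layout chosen by the
    // backend, then allocate a buffer of the required size.
    logical_tensor out_lt = cp.query_logical_tensor(lt.get_id());
    buffers.emplace_back(out_lt.get_mem_size());
    outputs_ts.emplace_back(out_lt, eng, buffers.back().data());
}

Remember to call strm.wait() before reading the output buffers, since execution may be asynchronous.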