Convolution int8 inference example with Graph API¶
This is an example to demonstrate how to build an int8 graph with Graph API and run it on CPU.
Example code: cpu_inference_int8.cpp
Some assumptions in this example:
Only the workflow is demonstrated, without checking correctness.
Unsupported partitions should be handled by users themselves (see the sketch after the partitions are obtained below).
Public headers¶
To start using oneDNN Graph, we must include the dnnl_graph.hpp header file in the application. All the C++ APIs reside in namespace dnnl::graph.
#include <iostream>
#include <memory>
#include <vector>
#include <unordered_map>
#include <unordered_set>

#include <assert.h>

#include "oneapi/dnnl/dnnl_graph.hpp"

#include "example_utils.hpp"
#include "graph_example_utils.hpp"

using namespace dnnl::graph;
using data_type = logical_tensor::data_type;
using layout_type = logical_tensor::layout_type;
using property_type = logical_tensor::property_type;
using dim = logical_tensor::dim;
using dims = logical_tensor::dims;
simple_pattern_int8() function¶
Build Graph and Get Partitions¶
In this section, we build a graph describing an int8 convolution with a ReLU post-op. After that, we get all of the partitions, which are determined by the backend.
Create the input/output dnnl::graph::logical_tensor and the op for the first Dequantize.
logical_tensor dequant0_src_desc {0, data_type::u8};
logical_tensor conv_src_desc {1, data_type::f32};
op dequant0(2, op::kind::Dequantize, {dequant0_src_desc}, {conv_src_desc},
        "dequant0");
dequant0.set_attr<std::string>(op::attr::qtype, "per_tensor");
dequant0.set_attr<std::vector<float>>(op::attr::scales, {0.1f});
dequant0.set_attr<std::vector<int64_t>>(op::attr::zps, {10});
Create the input/output dnnl::graph::logical_tensor and the op for the second Dequantize.
Note
It’s necessary to provide scale and zero-point information on the Dequantize for the weight.
Note
Users can set the weight property type to constant to enable the dnnl weight cache for better performance.
logical_tensor dequant1_src_desc {3, data_type::s8};
logical_tensor conv_weight_desc {
        4, data_type::f32, 4, layout_type::undef, property_type::constant};
op dequant1(5, op::kind::Dequantize, {dequant1_src_desc}, {conv_weight_desc},
        "dequant1");
dequant1.set_attr<std::string>(op::attr::qtype, "per_channel");
// the memory format of weight is XIO, which indicates channel equals
// to 64 for the convolution.
std::vector<float> wei_scales(64, 0.1f);
dims wei_zps(64, 0);
dequant1.set_attr<std::vector<float>>(op::attr::scales, wei_scales);
dequant1.set_attr<std::vector<int64_t>>(op::attr::zps, wei_zps);
dequant1.set_attr<int64_t>(op::attr::axis, 1);
Create the input/output dnnl::graph::logical_tensor and the op for Convolution.
logical_tensor conv_bias_desc {
        6, data_type::f32, 1, layout_type::undef, property_type::constant};
logical_tensor conv_dst_desc {7, data_type::f32, layout_type::undef};

// create the convolution op
op conv(8, op::kind::Convolution,
        {conv_src_desc, conv_weight_desc, conv_bias_desc}, {conv_dst_desc},
        "conv");
conv.set_attr<dims>(op::attr::strides, {1, 1});
conv.set_attr<dims>(op::attr::pads_begin, {0, 0});
conv.set_attr<dims>(op::attr::pads_end, {0, 0});
conv.set_attr<dims>(op::attr::dilations, {1, 1});
conv.set_attr<std::string>(op::attr::data_format, "NXC");
conv.set_attr<std::string>(op::attr::weights_format, "XIO");
conv.set_attr<int64_t>(op::attr::groups, 1);
Create the input/output dnnl::graph::logical_tensor and the op for ReLU.
logical_tensor relu_dst_desc {9, data_type::f32, layout_type::undef};
op relu(10, op::kind::ReLU, {conv_dst_desc}, {relu_dst_desc}, "relu");
Create the input/output dnnl::graph::logical_tensor and the op for Quantize.
logical_tensor quant_dst_desc {11, data_type::u8, layout_type::undef};
op quant(
        12, op::kind::Quantize, {relu_dst_desc}, {quant_dst_desc}, "quant");
quant.set_attr<std::string>(op::attr::qtype, "per_tensor");
quant.set_attr<std::vector<float>>(op::attr::scales, {0.1f});
quant.set_attr<std::vector<int64_t>>(op::attr::zps, {10});
Finally, the created ops are added to the graph. Internally, the graph maintains a list of all these ops. To create a graph, a dnnl::engine::kind is needed because the returned partitions may vary across devices. For this example, we use the CPU engine.
Note
The order in which ops are added doesn’t matter. The connections are derived from the logical tensors.
Create the graph and add the ops to it.
graph g(dnnl::engine::kind::cpu);
g.add_op(dequant0);
g.add_op(dequant1);
g.add_op(conv);
g.add_op(relu);
g.add_op(quant);
After finishing the above steps, we can get partitions by calling dnnl::graph::graph::get_partitions().
In this example, the graph will be partitioned into one partition.
auto partitions = g.get_partitions();
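As noted in the assumptions at the top of this example, unsupported partitions have to be handled by the user. The snippet below is a minimal sketch of such a check; the fallback path is an assumption for illustration and is not part of the example.

// Sketch only: a partition that is not supported by the backend contains ops
// the library cannot execute; those ops must be handled by the user's own code.
for (const auto &p : partitions) {
    if (!p.is_supported()) {
        // Fall back to a reference implementation or another library here.
    }
}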
Compile and Execute Partition¶
In a real use case, users such as frameworks should provide device information at this stage. In this example, we simply use a self-defined device to simulate the real behavior.
Create a dnnl::engine. Also, set a user-defined dnnl::graph::allocator to this engine.
allocator alloc {};
dnnl::engine eng
        = make_engine_with_allocator(dnnl::engine::kind::cpu, 0, alloc);
dnnl::stream strm {eng};
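The compile call below uses a partition object together with inputs and outputs vectors of logical tensors that are not shown in the snippets above. The following is a minimal sketch of how they might be obtained, assuming the single partition returned for this graph; in a complete example, the input logical tensors would also carry concrete shapes and layouts before compilation.

// Sketch only: take the single partition produced for this graph and query
// the logical tensors it expects as inputs and outputs.
auto partition = partitions[0];
std::vector<logical_tensor> inputs = partition.get_input_ports();
std::vector<logical_tensor> outputs = partition.get_output_ports();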
Compile the partition with the input and output logical tensors to generate a compiled partition.
compiled_partition cp = partition.compile(inputs, outputs, eng);
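The execute call below takes dnnl::graph::tensor objects bound to real memory as inputs_ts and outputs_ts, which are also not shown above. The following is a minimal sketch of how they might be created, assuming plain host buffers sized from the compiled partition; the buffer management here is an illustration, not the example's actual helper code.

// Sketch only: allocate host buffers sized by the compiled logical tensors
// and wrap them as tensors for execution. The buffers must stay alive until
// execution has finished.
std::vector<tensor> inputs_ts, outputs_ts;
std::vector<std::vector<char>> buffers;
for (const auto &lt : inputs) {
    logical_tensor queried = cp.query_logical_tensor(lt.get_id());
    buffers.emplace_back(queried.get_mem_size());
    inputs_ts.emplace_back(queried, eng, buffers.back().data());
}
for (const auto &lt : outputs) {
    logical_tensor queried = cp.query_logical_tensor(lt.get_id());
    buffers.emplace_back(queried.get_mem_size());
    outputs_ts.emplace_back(queried, eng, buffers.back().data());
}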
Execute the compiled partition on the specified stream.
cp.execute(strm, inputs_ts, outputs_ts);
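Typically, one would also wait for the stream to finish before reading the output buffers; a minimal addition not shown in the snippet above:

// Block until the submitted computation is done.
strm.wait();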