Single op partition on GPU¶
This example demonstrates how to build a simple single-op graph and run it on a GPU.
Example code: sycl_single_op_partition.cpp
Some key takeaways included in this example:
how to build a single-op partition quickly
how to create an engine, allocator and stream
how to compile a partition
how to execute a compiled partition
Some assumptions in this example:
Only the workflow is demonstrated; correctness of results is not checked
Unsupported partitions should be handled by users themselves
Public headers¶
To start using oneDNN Graph, we must include the dnnl_graph.hpp header file in the application. All the C++ APIs reside in namespace dnnl::graph.
#include "oneapi/dnnl/dnnl_graph.hpp" #include "oneapi/dnnl/dnnl_graph_sycl.hpp" #include "oneapi/dnnl/dnnl_sycl.hpp" using namespace dnnl::graph; using namespace sycl; #include <assert.h> #include <iostream> #include <memory> #include <vector> #include <unordered_map> #include <unordered_set> #include "example_utils.hpp" #include "graph_example_utils.hpp" using namespace dnnl::graph; using data_type = logical_tensor::data_type; using layout_type = logical_tensor::layout_type; using dim = logical_tensor::dim; using dims = logical_tensor::dims;
sycl_single_op_partition_tutorial() function¶
Build Graph and Get Partitions¶
In this section, we create a partition containing the single op MatMul directly, without building a graph and getting partitions from it.
Create a MatMul op (dnnl::graph::op) and attach attributes to it, including transpose_a and transpose_b.
logical_tensor matmul_src0_desc {0, data_type::f32};
logical_tensor matmul_src1_desc {1, data_type::f32};
logical_tensor matmul_dst_desc {2, data_type::f32};
op matmul(0, op::kind::MatMul, {matmul_src0_desc, matmul_src1_desc},
        {matmul_dst_desc}, "matmul");
matmul.set_attr<bool>(op::attr::transpose_a, false);
matmul.set_attr<bool>(op::attr::transpose_b, false);
Compile and Execute Partition¶
In a real case, users such as frameworks should provide device information at this stage. In this example, we use a self-defined device to simulate the real behavior.
Create a dnnl::graph::allocator with two user-defined callback functions matching dnnl_graph_sycl_allocate_f and dnnl_graph_sycl_deallocate_f.
allocator alloc = sycl_interop::make_allocator(
        sycl_malloc_wrapper, sycl_free_wrapper);
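The sycl_malloc_wrapper and sycl_free_wrapper used above come from graph_example_utils.hpp. For reference, a minimal sketch of what such callbacks might look like, assuming SYCL USM device allocations (this is an illustration of the dnnl_graph_sycl_allocate_f and dnnl_graph_sycl_deallocate_f signatures, not the shipped implementation):

// Hypothetical allocate callback matching dnnl_graph_sycl_allocate_f.
// dev and ctx point to the sycl::device and sycl::context of the engine.
void *example_sycl_malloc(
        size_t size, size_t alignment, const void *dev, const void *ctx) {
    // Fall back to a default alignment when none is requested.
    if (alignment == 0) alignment = 64;
    return sycl::aligned_alloc_device(alignment, size,
            *static_cast<const sycl::device *>(dev),
            *static_cast<const sycl::context *>(ctx));
}

// Hypothetical deallocate callback matching dnnl_graph_sycl_deallocate_f.
// When non-null, event guards in-flight work that still uses the buffer.
void example_sycl_free(
        void *buf, const void *dev, const void *ctx, void *event) {
    if (event) static_cast<sycl::event *>(event)->wait();
    sycl::free(buf, *static_cast<const sycl::context *>(ctx));
}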
Define SYCL queue (code outside of oneDNN graph)
sycl::queue q = (ekind == engine::kind::gpu)
        ? sycl::queue(sycl::gpu_selector_v, sycl::property::queue::in_order {})
        : sycl::queue(sycl::cpu_selector_v, sycl::property::queue::in_order {});
Create a dnnl::engine based on SYCL device and context. Also, set a user-defined dnnl::graph::allocator to this engine.
dnnl::engine eng = sycl_interop::make_engine_with_allocator(
        q.get_device(), q.get_context(), alloc);
Create a dnnl::stream on a given engine
dnnl::stream strm = dnnl::sycl_interop::make_stream(eng, q);
Skip building a graph and getting partitions, and directly create the single-op partition
partition part(matmul, ekind);
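The compile() call below expects vectors of input and output logical tensors that carry concrete shapes and layouts. A minimal sketch of how such inputs and outputs vectors might be assembled; the dimension values here are illustrative assumptions, not part of the original example:

// Hypothetical shapes; a real application would use its own dimensions.
dims src0_dims {32, 1024};
dims src1_dims {1024, 1024};
dims dst_dims {32, 1024};

// Logical tensors reuse the IDs (0, 1, 2) given when the op was created,
// now with concrete shapes and a strided (row-major) layout.
logical_tensor src0_desc {0, data_type::f32, src0_dims, layout_type::strided};
logical_tensor src1_desc {1, data_type::f32, src1_dims, layout_type::strided};
logical_tensor dst_desc {2, data_type::f32, dst_dims, layout_type::strided};

std::vector<logical_tensor> inputs {src0_desc, src1_desc};
std::vector<logical_tensor> outputs {dst_desc};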
Compile the partition to generate a compiled partition with the given input and output logical tensors.
compiled_partition cp = part.compile(inputs, outputs, eng);
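The execute() call below takes dnnl::graph::tensor objects that bind logical tensors to device memory. A minimal sketch of how inputs_ts and outputs_ts might be prepared, assuming USM device allocations on the queue q (buffer initialization and cleanup are omitted):

// Query the compiled shapes and layouts back from the compiled partition
// so that buffer sizes match what the generated kernel expects.
logical_tensor src0_lt = cp.query_logical_tensor(0);
logical_tensor src1_lt = cp.query_logical_tensor(1);
logical_tensor dst_lt = cp.query_logical_tensor(2);

// Hypothetical USM device buffers; a real application fills the inputs.
void *src0_data = sycl::malloc_device(src0_lt.get_mem_size(), q);
void *src1_data = sycl::malloc_device(src1_lt.get_mem_size(), q);
void *dst_data = sycl::malloc_device(dst_lt.get_mem_size(), q);

// Bind each logical tensor, the engine, and its buffer into a tensor.
std::vector<tensor> inputs_ts {tensor(src0_lt, eng, src0_data),
        tensor(src1_lt, eng, src1_data)};
std::vector<tensor> outputs_ts {tensor(dst_lt, eng, dst_data)};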
Execute the compiled partition on the specified stream.
cp.execute(strm, inputs_ts, outputs_ts);
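Before reading the results back on the host, one would typically block until the computation finishes; a short follow-up, assuming synchronous result consumption:

// Wait on the stream so the output buffer is ready to be read.
strm.wait();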