This C++ API example demonstrates the basics of the DNNL programming model.
Example code: getting_started.cpp
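The example uses the ReLU operation and comprises the following steps: creating an engine and a stream, preparing the data, wrapping the data in a DNNL memory object, creating a ReLU primitive, executing the primitive, and validating the result.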
These steps are implemented in the getting_started_tutorial() function, which in turn is called from the main() function (which is also responsible for error handling).
To start using DNNL we must first include the dnnl.hpp header file in the program. We also include dnnl_debug.h in example_utils.hpp, which contains some debugging facilities like returning a string representation for common DNNL C types.
All DNNL primitives and memory objects are attached to a particular dnnl::engine, which is an abstraction of a computational device (see also Basic Concepts). The primitives are created and optimized for the device they are attached to and the memory objects refer to memory residing on the corresponding device. In particular, that means neither memory objects nor primitives that were created for one engine can be used on another.
To create an engine, we should specify the dnnl::engine::kind and the index of the device of the given kind.
In addition to an engine, all primitives require a dnnl::stream for the execution. The stream encapsulates an execution context and is tied to a particular engine.
The creation is pretty straightforward:
In simple cases, when a program works with one device only (e.g., only on CPU), an engine and a stream can be created once and used throughout the program. Some frameworks create singleton objects that hold a DNNL engine and stream and use them throughout the code.
Now that the preparation work is done, let's create some data to work with. We will create a 4D tensor in NHWC format, which is quite popular in many frameworks.
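The data preparation can be sketched as follows. This is a minimal, library-independent sketch: the sizes (N = 1, H = W = 13, C = 3), the fill pattern, and the helper names `nhwc_offset` and `make_image` are assumptions chosen for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative tensor sizes (assumed values, not mandated by DNNL).
constexpr int N = 1, H = 13, W = 13, C = 3;

// Map a logical (n, h, w, c) index to a physical offset in an NHWC buffer:
// channels are innermost (densest), then width, then height, then batch.
inline std::size_t nhwc_offset(int n, int h, int w, int c) {
    return ((static_cast<std::size_t>(n) * H + h) * W + w) * C + c;
}

// Allocate the image and fill it with a mix of positive and negative values,
// so that a later ReLU has something to clip.
inline std::vector<float> make_image() {
    std::vector<float> image(static_cast<std::size_t>(N) * H * W * C);
    for (int n = 0; n < N; ++n)
        for (int h = 0; h < H; ++h)
            for (int w = 0; w < W; ++w)
                for (int c = 0; c < C; ++c) {
                    const std::size_t off = nhwc_offset(n, h, w, c);
                    image[off] = -std::cos(static_cast<float>(off) / 10.f);
                }
    return image;
}
```

The buffer returned by `make_image()` is an ordinary `std::vector<float>`; nothing in it is DNNL-specific yet.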
Note that even though we work with one image only, the image tensor is still 4D. The extra dimension (here N) corresponds to the batch, and, in case of a single image, is equal to 1. It is pretty typical to have the batch dimension even when working with a single image.
In DNNL, all CNN primitives assume that tensors have the batch dimension, which is always the first logical dimension (see also Naming Conventions).
Now, having the image ready, let's wrap it in a dnnl::memory object to be able to pass the data to DNNL primitives.
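Creating dnnl::memory comprises two steps: initializing a dnnl::memory::desc (the memory descriptor), and then creating the dnnl::memory object itself from that descriptor and an engine.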
Thanks to the list initialization introduced in C++11, it is possible to combine these two steps whenever a memory descriptor is not used anywhere else but in creating a dnnl::memory object.
However, for the sake of demonstration, we will show both steps explicitly.
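To initialize the dnnl::memory::desc, we need to pass the tensor's dimensions, the data type of the tensor (dnnl::memory::data_type), and the memory format tag (dnnl::memory::format_tag).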
The code:
The first thing to notice here is that we pass dimensions as {N, C, H, W}, while it might seem more natural to pass {N, H, W, C}, which better corresponds to the user's code. This is because DNNL CNN primitives like ReLU always expect tensors in the following form:
Spatial dim | Tensor dimensions |
---|---|
0D | \(N \times C\) |
1D | \(N \times C \times W\) |
2D | \(N \times C \times H \times W\) |
3D | \(N \times C \times D \times H \times W\) |
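where N is the batch dimension (discussed above), C is the channel dimension, and D, H, and W are the spatial dimensions.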
Now that the logical order of dimensions is defined, we need to specify the memory format (the third parameter), which describes how logical indices map to offsets in memory. This is the place where the user's format NHWC comes into play. DNNL has different dnnl::memory::format_tag values that cover the most popular memory formats like NCHW, NHWC, CHWN, and some others.
The memory descriptor for the image is called src_md. The src part comes from the fact that the image will be a source for the ReLU primitive (that is, we formulate memory names from the primitive perspective; hence we will use dst to name the output memory). The md is an initialism for Memory Descriptor.
Before we continue with memory creation, let us show the alternative way to create the same memory descriptor: instead of using the dnnl::memory::format_tag, we can directly specify the strides of each tensor dimension:
Just as before, the tensor's dimensions come in the N, C, H, W order as required by CNN primitives. To define the physical memory format, the strides are passed as the third parameter. Note that the order of the strides corresponds to the order of the tensor's dimensions.
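To make the correspondence concrete, the sketch below computes the strides of a dense NHWC buffer, listed in the logical N, C, H, W order, and uses them to turn a logical index into an offset. It is plain C++ and does not call DNNL; `nhwc_strides` and `offset_from_strides` are hypothetical helper names.

```cpp
#include <array>
#include <cstdint>

// Strides (in elements) of a dense NHWC buffer, listed in the logical
// {N, C, H, W} order. The batch size does not affect any stride.
inline std::array<std::int64_t, 4> nhwc_strides(
        std::int64_t channels, std::int64_t height, std::int64_t width) {
    return {height * width * channels, // stride for N
            1,                         // stride for C (innermost in NHWC)
            width * channels,          // stride for H
            channels};                 // stride for W
}

// Offset of the logical element (n, c, h, w): the dot product of the
// index with the strides.
inline std::int64_t offset_from_strides(
        const std::array<std::int64_t, 4> &strides,
        std::int64_t n, std::int64_t c, std::int64_t h, std::int64_t w) {
    return n * strides[0] + c * strides[1] + h * strides[2] + w * strides[3];
}
```

For a 13 x 13 image with 3 channels, `nhwc_strides(3, 13, 13)` yields {507, 1, 39, 3}: the channel stride is 1 because channels are adjacent in memory, even though C is the second logical dimension.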
Having a memory descriptor and an engine prepared, let's create input and output memory objects for a ReLU primitive.
We already have a memory buffer for the source memory object. We pass it to the dnnl::memory::memory(const desc &, const engine &, void *) constructor that takes a buffer pointer as its last argument.
For educational purposes, let's use a constructor that instructs the library to allocate the memory buffer for dst_mem.
A key difference between the two constructors is that the library owns the memory buffer for dst_mem and will deallocate it when dst_mem is destroyed. That means the memory buffer can be used only while dst_mem is alive.
In the subsequent section we will show how to get the buffer (pointer) from the dst_mem memory object.
Let's now create a ReLU primitive.
The library implements the ReLU primitive as a particular algorithm of a more general Eltwise primitive, which applies a specified function to every element of the source tensor.
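As a reference for what the primitive computes, here is a plain C++ sketch of an element-wise operation with ReLU as one particular function. It does not use DNNL; `eltwise` and `relu` are hypothetical names, and the library's optimized implementation will differ.

```cpp
#include <algorithm>
#include <vector>

// ReLU: negative values become zero, non-negative values pass through.
inline float relu(float x) {
    return std::max(x, 0.f);
}

// Apply `op` to every element of `src` and return the results; the
// source is left untouched (an out-of-place operation).
template <typename Op>
std::vector<float> eltwise(const std::vector<float> &src, Op op) {
    std::vector<float> dst(src.size());
    std::transform(src.begin(), src.end(), dst.begin(), op);
    return dst;
}
```

Calling `eltwise(image, relu)` produces a tensor of the same size and layout as the input, which is why a single memory descriptor is enough to describe both the source and the destination.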
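Just as in the case of dnnl::memory, a user should always go through (at least) three creation steps (which, however, can sometimes be combined thanks to C++11): initializing an operation descriptor, creating a primitive descriptor, and creating the primitive itself.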
DNNL separates steps 2 and 3 to enable the user to inspect details of a primitive implementation prior to creating the primitive. Creating the primitive may be expensive because, for example, DNNL generates the optimized computational code on the fly.
The code:
A note about variable names. Similar to the _md suffix used for memory descriptors, we use _d for the operation descriptor names, _pd for the primitive descriptors, and no suffix for primitives themselves.
It is worth mentioning that we specified the exact tensor and its memory format when we were initializing relu_d. That means the relu primitive will perform computations with memory objects that correspond to this description. This is the one and only way of creating non-compute-intensive primitives like Eltwise, Batch Normalization, and others.
Compute-intensive primitives (like Convolution) have an ability to define the appropriate memory format on their own. This is one of the key features of the library and will be discussed in detail in the next topic: Memory format propagation.
Finally, let's execute the primitive and wait for its completion.
The input and output memory objects are passed to the execute() method using a <tag, memory> map. Each tag specifies what kind of tensor each memory object represents. All Eltwise primitives require the map to have two elements: a source memory object (input) and a destination memory object (output).
A primitive is executed in a stream (the first parameter of the execute() method). Depending on the stream kind, execution might be blocking or non-blocking. This means that we need to call dnnl::stream::wait before accessing the results.
Eltwise is one of the primitives that support in-place operations, meaning that the source and destination memory can be the same. To perform an in-place transformation, the user must pass the same memory object for both the DNNL_ARG_SRC and DNNL_ARG_DST tags.
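The reason an element-wise operation can run in place is easy to see in a plain C++ sketch (no DNNL calls; `relu_buffer` is a hypothetical name): each output element depends only on the input element at the same position.

```cpp
#include <cstddef>

// Element-wise ReLU over `count` floats. `src` and `dst` may point to the
// same buffer: each output element depends only on the input element at the
// same position, so overwriting the input as we go is safe.
inline void relu_buffer(const float *src, float *dst, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = src[i] > 0.f ? src[i] : 0.f;
}
```

Passing the same pointer for both arguments, as in `relu_buffer(data, data, count)`, mirrors passing the same memory object for the source and destination tags.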
Now that we have the computed result, let's validate that it is actually correct. The result is stored in the dst_mem memory object. So we need to obtain the C++ pointer to a buffer with data via dnnl::memory::get_data_handle() and cast it to the proper data type.
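A minimal sketch of that validation step, assuming an f32 tensor on a CPU engine so that the handle is a plain pointer. The function name `all_non_negative` is hypothetical, and its `const void *` parameter stands in for the value returned by dnnl::memory::get_data_handle().

```cpp
#include <cstddef>

// Return true if none of the `count` floats behind `handle` is negative,
// which is what a correct ReLU output must satisfy.
// `handle` stands in for a CPU data handle; it is assumed to point to f32 data.
inline bool all_non_negative(const void *handle, std::size_t count) {
    const float *data = static_cast<const float *>(handle);
    for (std::size_t i = 0; i < count; ++i)
        if (data[i] < 0.f) return false;
    return true;
}
```

Here `count` would be the total number of elements in the tensor (N * C * H * W), since a ReLU result is valid regardless of the physical layout.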
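Note that for the CPU engine the handle returned by dnnl::memory::get_data_handle() is a pointer to void, which can safely be used. However, for engines other than CPU the handle might be a runtime-specific type, such as cl_mem in the case of GPU/OpenCL.
We now just call everything we prepared earlier.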
Because we are using the DNNL C++ API, we use exceptions to handle errors (see API). The DNNL C++ API throws exceptions of type dnnl::error, which contains the error status (of type dnnl_status_t) and a human-readable error message accessible through the regular what() method.
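The general shape of such error handling can be sketched without the library. In the sketch below, `sketch_error` is a hypothetical stand-in for an exception type that carries a status and a message (it is not dnnl::error), and `run_and_report` is a hypothetical helper showing what a main() function might do with it.

```cpp
#include <exception>
#include <string>
#include <utility>

// Hypothetical stand-in for a library error: carries a numeric status
// and a human-readable message available through what().
struct sketch_error : public std::exception {
    int status;
    std::string message;
    sketch_error(int s, std::string m) : status(s), message(std::move(m)) {}
    const char *what() const noexcept override { return message.c_str(); }
};

// Run `body`; on a sketch_error, store a description in `report` and
// return a non-zero code, otherwise report success and return zero.
template <typename F>
int run_and_report(F body, std::string &report) {
    try {
        body();
    } catch (const sketch_error &e) {
        report = std::string("status: ") + std::to_string(e.status)
                + ", message: " + e.what();
        return 1;
    }
    report = "ok";
    return 0;
}
```

Keeping the try/catch in one place means the tutorial function itself can stay free of error-handling code.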
Upon compiling and running the example, the output should be just:
Users are encouraged to experiment with the code to familiarize themselves with the concepts. In particular, one change that might be of interest is to deliberately break some of the library calls to see how error handling works. For instance, if we replace
with
we should get the following output: