Example code: getting_started.cpp
This C++ API example demonstrates the basics of the DNNL programming model. The example uses the ReLU operation and consists of the following steps:

1. Creating an engine and a stream to execute the primitive.
2. Wrapping the user's data into DNNL memory objects.
3. Creating a ReLU primitive.
4. Executing the ReLU primitive.
5. Validating the results.

These steps are implemented in the getting_started_tutorial() function, which in turn is called from the main() function, which is also responsible for error handling.
To start using DNNL, we must first include the dnnl.hpp header file in the program. We also include dnnl_debug.h (via example_utils.hpp), which contains debugging facilities such as returning a string representation for common DNNL C types.
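For reference, the includes amount to something like the following sketch (the paths assume the example is built alongside the DNNL examples, where example_utils.hpp resides):

```cpp
#include "dnnl.hpp"          // the DNNL C++ API
#include "example_utils.hpp" // examples' helpers; pulls in dnnl_debug.h
```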
All DNNL primitives and memory objects are attached to a particular dnnl::engine, which is an abstraction of a computational device (see also Basic Concepts). The primitives are created and optimized for the device they are attached to, and the memory objects refer to memory residing on the corresponding device. In particular, that means neither memory objects nor primitives that were created for one engine can be used on another.
To create an engine we should specify the dnnl::engine::kind and the index of the device of the given kind.
In addition to an engine, all primitives require a dnnl::stream for the execution. The stream encapsulates an execution context and is tied to a particular engine.
The creation is pretty straightforward:
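A minimal sketch of this step is shown below. The CPU engine kind and device index 0 are assumptions made for this sketch, and the names eng and engine_stream are reused in the later sketches.

```cpp
// Create an engine: here a CPU device with index 0 (an assumption; the real
// example may take the engine kind as a command-line argument).
dnnl::engine eng(dnnl::engine::kind::cpu, 0);

// Create a stream attached to that engine; primitives are executed in it.
dnnl::stream engine_stream(eng);
```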
In simple cases, when a program works with one device only (e.g. only on CPU), an engine and a stream can be created once and used throughout the program. Some frameworks create singleton objects that hold the DNNL engine and stream and use them throughout the code.
Now that the preparation work is done, let's create some data to work with. We will create a 4D tensor in NHWC format, which is quite popular in many frameworks.
Note that even though we work with one image only, the image tensor is still 4D. The extra 4th dimension (here N) corresponds to the batch and, in the case of a single image, equals 1. It is pretty typical to have the batch dimension even when working with a single image.

In DNNL, all CNN primitives assume that tensors have the batch dimension, which is always the first logical dimension (see also Naming Conventions).
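A sketch of preparing such data might look as follows. The dimension sizes (N = 1, H = 13, W = 13, C = 3) and the fill values are illustrative assumptions rather than part of the original text; the snippet requires &lt;vector&gt; and &lt;cmath&gt;.

```cpp
// Tensor dimensions: batch, height, width, channels (sizes are assumptions).
const int N = 1, H = 13, W = 13, C = 3;

// Offset of element (n, h, w, c) in the NHWC layout of the user's buffer.
auto offset = [=](int n, int h, int w, int c) {
    return ((n * H + h) * W + w) * C + c;
};

// The user's image buffer in NHWC format, filled with some values
// (a mix of positive and negative numbers so that ReLU has an effect).
std::vector<float> image(N * H * W * C);
for (int n = 0; n < N; ++n)
    for (int h = 0; h < H; ++h)
        for (int w = 0; w < W; ++w)
            for (int c = 0; c < C; ++c)
                image[offset(n, h, w, c)]
                        = std::cos(offset(n, h, w, c) / 10.f) - 0.5f;
```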
Now, having the image ready, let's wrap it in a dnnl::memory object to be able to pass the data to DNNL primitives.
Creating a dnnl::memory object consists of two steps:

1. Initializing the dnnl::memory::desc struct (also referred to as a memory descriptor), which only describes the tensor data and does not contain the data itself.
2. Creating the dnnl::memory object itself (also referred to as a memory object), which is based on the memory descriptor initialized in step 1, an engine, and, optionally, a handle to the data.
Thanks to the list initialization introduced in C++11, it is possible to combine these two steps whenever a memory descriptor is not used anywhere else but in creating a dnnl::memory object.
However, for the sake of demonstration, we will show both steps explicitly.
To initialize the dnnl::memory::desc, we need to pass:

1. The tensor's dimensions, the semantic order of which is defined by the primitive that will use this memory (descriptor).
2. The data type for the tensor (dnnl::memory::data_type).
3. The memory format tag (dnnl::memory::format_tag), which describes how the data is going to be laid out in the device's memory.
The code:
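A sketch of the call (the dimensions reuse the N, C, H, W values assumed earlier; the name src_md is explained right below):

```cpp
auto src_md = dnnl::memory::desc(
        {N, C, H, W},                  // logical dimensions, canonical order
        dnnl::memory::data_type::f32,  // data type: 32-bit float
        dnnl::memory::format_tag::nhwc // physical layout: NHWC
);
```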
The first thing to notice here is that we pass dimensions as {N, C, H, W}, while it might seem more natural to pass {N, H, W, C}, which better corresponds to the user's code. This is because DNNL CNN primitives like ReLU always expect tensors in the following form:
| Spatial dimensionality | Tensor dimensions |
|---|---|
| 0D | \(N \times C\) |
| 1D | \(N \times C \times W\) |
| 2D | \(N \times C \times H \times W\) |
| 3D | \(N \times C \times D \times H \times W\) |
where:

- N is the batch dimension,
- C is the channels dimension, and
- D, H, and W are the spatial depth, height, and width dimensions.
Now that the logical order of dimensions is defined, we need to specify the memory format (the third parameter), which describes how logical indices map to the offset in memory. This is the place where the user's format NHWC comes into play. DNNL has different dnnl::memory::format_tag values that cover the most popular memory formats like NCHW, NHWC, CHWN, and some others.
The memory descriptor for the image is called src_md. The src part comes from the fact that the image will be a source for the ReLU primitive (i.e., we formulate memory names from the primitive perspective; hence we will use dst to name the output memory). The md is an acronym for Memory Descriptor.
Before we continue with memory creation, let us show the alternative way to create the same memory descriptor: instead of using the dnnl::memory::format_tag we can directly specify the strides of each tensor dimension:
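A sketch of the strides-based variant, under the same assumptions as above (the name alt_src_md is illustrative):

```cpp
auto alt_src_md = dnnl::memory::desc(
        {N, C, H, W},                 // logical dimensions, same order as before
        dnnl::memory::data_type::f32, // data type
        {H * W * C, 1, W * C, C}      // strides of N, C, H, W in the NHWC layout
);
```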
Just as before, the tensor's dimensions come in the N, C, H, W order as required by CNN primitives. To define the physical memory format, the strides are passed as the third parameter. Note that the order of the strides corresponds to the order of the tensor's dimensions.
Having a memory descriptor and an engine prepared, let's create input and output memory objects for the ReLU primitive.
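A sketch of this step, reusing src_md, eng, and image from the earlier sketches:

```cpp
// The source memory object wraps the user's buffer.
auto src_mem = dnnl::memory(src_md, eng, image.data());

// For the destination, let the library allocate the buffer itself.
auto dst_mem = dnnl::memory(src_md, eng);
```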
We already have a memory buffer for the source memory object. We pass it to the dnnl::memory::memory(const desc &, const engine &, void *) constructor, which takes a buffer pointer as its last argument.
For educational purposes, let's use a constructor that instructs the library to allocate a memory buffer for dst_mem.
The key difference between these two is that the library allocates and owns the memory buffer for dst_mem and deallocates it when dst_mem is destroyed. That means the memory buffer can be used only while dst_mem is alive.

In the subsequent section we will show how to get the buffer (pointer) from the dst_mem memory object.
Let's now create a ReLU primitive.
The library implements the ReLU primitive as a particular algorithm of the more general Eltwise primitive, which applies a specified function to each element of the source tensor.
Just like in the case of dnnl::memory, a user should always go through (at least) three creation steps (which, however, can sometimes be combined thanks to C++11):

1. Initialize an operation descriptor that defines the operation's parameters.
2. Create an operation primitive descriptor based on the operation descriptor and an engine; it corresponds to a specific implementation of the operation.
3. Create a primitive based on the primitive descriptor; the primitive can then be executed on memory objects.
DNNL separates steps 2 and 3 to allow the user to inspect details of a primitive implementation prior to creating the primitive, which may be expensive because, for example, DNNL generates optimized computational code on the fly.
The code:
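A sketch of the three steps using the DNNL C++ API (src_md and eng come from the earlier sketches; the alpha value 0 means a plain ReLU with no negative slope):

```cpp
// 1. Operation descriptor: forward-inference ReLU applied to src_md.
auto relu_d = dnnl::eltwise_forward::desc(
        dnnl::prop_kind::forward_inference,
        dnnl::algorithm::eltwise_relu,
        src_md,
        0.f /* alpha: negative slope, 0 for plain ReLU */);

// 2. Primitive descriptor: a specific implementation for the given engine.
auto relu_pd = dnnl::eltwise_forward::primitive_desc(relu_d, eng);

// 3. The primitive itself.
auto relu = dnnl::eltwise_forward(relu_pd);
```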
A note about variable names: similar to the _md suffix used for memory descriptors, we use _d for operation descriptor names, _pd for primitive descriptors, and no suffix for the primitives themselves.
It is worth mentioning that we specified the exact tensor and its memory format when we were initializing relu_d. That means the relu primitive would perform computations with memory objects that correspond to this description. This is the one and only way of creating non-compute-intensive primitives like Eltwise, Batch Normalization, and others.
Compute-intensive primitives (like Convolution) have the ability to define the appropriate memory format on their own. This is one of the key features of the library and will be discussed in detail in the next topic: Memory format propagation.
Finally, let's execute the primitive and wait for its completion.
The input and output memory objects are passed to the execute() method using a &lt;tag, memory&gt; map. Each tag specifies what kind of tensor each memory object represents. All Eltwise primitives require the map to have two elements: a source memory object (input) and a destination memory object (output).
A primitive is executed in a stream (the first parameter of the execute() method). Depending on the stream kind, an execution might be blocking or non-blocking. This means that we need to call dnnl::stream::wait before accessing the results.
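A sketch of the out-of-place execution with the names used in the earlier sketches:

```cpp
// Execute the ReLU primitive in the stream, mapping argument tags to memory.
relu.execute(engine_stream,
        {{DNNL_ARG_SRC, src_mem},   // source (input) memory object
         {DNNL_ARG_DST, dst_mem}}); // destination (output) memory object

// The stream might be non-blocking, so wait before touching the results.
engine_stream.wait();
```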
Eltwise is one of the primitives that support in-place operations, meaning that the source and destination memory can be the same. To perform an in-place transformation, the user needs to pass the same memory object for both the DNNL_ARG_SRC and DNNL_ARG_DST tags:
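A sketch of such an in-place call, shown for illustration only (the rest of the example keeps using the out-of-place result stored in dst_mem):

```cpp
// In-place execution: the same memory object serves as source and destination.
relu.execute(engine_stream,
        {{DNNL_ARG_SRC, src_mem},
         {DNNL_ARG_DST, src_mem}});
engine_stream.wait();
```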
Now that we have the computed result, let's validate that it is actually correct. The result is stored in the dst_mem memory object, so we need to obtain the C++ pointer to the buffer with the data via dnnl::memory::get_data_handle() and cast it to the proper data type, as shown below.
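A sketch of such a check (it only verifies the ReLU post-condition that every output value is non-negative; &lt;stdexcept&gt; is required for the exception):

```cpp
// Obtain the raw handle of the destination buffer and treat it as floats.
float *relu_image = static_cast<float *>(dst_mem.get_data_handle());

// After ReLU, no value may be negative.
for (size_t i = 0; i < image.size(); ++i)
    if (relu_image[i] < 0.f)
        throw std::logic_error("Accuracy check failed.");
```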
Note that for engines of CPU kind the data handle is simply a pointer to void, which can safely be used. However, for engines other than CPU the handle might be a runtime-specific type, such as cl_mem in the case of GPU/OpenCL.

We now just call everything we prepared earlier.
Since we are using the DNNL C++ API, we use exceptions to handle errors (see C and C++ APIs). The DNNL C++ API throws exceptions of type dnnl::error, which contains the error status (of type dnnl_status_t) and a human-readable error message accessible through the regular what() method.
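A sketch of what main() might look like. The helper parse_engine_kind() and the exact signature of getting_started_tutorial() are assumptions based on the text, not taken verbatim from the example; dnnl_status2str() comes from the dnnl_debug.h facilities mentioned earlier.

```cpp
#include <iostream>

int main(int argc, char **argv) {
    try {
        // Run the tutorial body; the engine kind is assumed to come from the
        // command line via an example_utils.hpp helper.
        getting_started_tutorial(parse_engine_kind(argc, argv));
    } catch (dnnl::error &e) {
        std::cerr << "DNNL error caught:" << std::endl
                  << "\tStatus: " << dnnl_status2str(e.status) << std::endl
                  << "\tMessage: " << e.what() << std::endl;
        return 1;
    }
    std::cout << "Example passed." << std::endl;
    return 0;
}
```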
Upon compiling and running the example, the output should be just:
Users are encouraged to experiment with the code to familiarize themselves with the concepts. In particular, one of the changes that might be of interest is to spoil some of the library calls to check how error handling happens. For instance, if we replace
with
we should get the following output: