Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN)  0.21.0
Performance library for Deep Learning
Understanding Memory Formats

Introduction

Most computations are about data: analyzing data, adjusting data, reading and storing data, generating data... The DNN domain is no exception. Images, weights/filters, sound, and text require an efficient representation in computer memory so that operations on them can be performed quickly and in the most convenient way.

This article is devoted to the data format – one form of data representation that describes how multidimensional arrays (nD) are stored in a linear (1D) memory address space – and why this is important for Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN).

Note: for the purposes of this article, the terms data format and layout are used interchangeably.

Nomenclature used

- N denotes the batch size, C the number of channels (also called feature maps), H the image height, and W the image width (D stands for depth in the 3D spatial case).
- Upper-case letters refer to the dimensions themselves, while the corresponding lower-case letters n, c, h, and w refer to indices within those dimensions.

Data formats

Let's first focus on data formats for activations (images).

Activations consist of channels (also known as feature maps) and a spatial domain, which is 1D, 2D, or 3D. The spatial domain together with the channels forms an image. During the training phase images are typically grouped together in batches. Even if there is only one image, we still assume that there is a batch with batch size equal to 1. Hence, the overall dimensionality of activations is 4D (N, C, H, and W) or 5D (N, C, D, H, and W).

In this article, for the sake of simplicity, we will consider the 2D spatial case only.

Plain data formats

It would be simpler to start with an example.

Consider 4D activations with a batch size of 2, 16 channels, and a 5 x 4 spatial domain. The logical representation is given in the picture below.

[Figure mem_fmt_img1.png: Activations]

The value at the position (n, c, h, w) is generated with the following formula:

value(n, c, h, w) = n * CHW + c * HW + h * W + w

In order to define how the data in this 4D tensor is laid out in memory, we need to define how to map it to a 1D tensor via an offset function that takes a logical index (n, c, h, w) as an input and returns an address displacement to where the value is located:

offset : (int, int, int, int) --> int

NCHW

Let's describe the order in which the tensor values are laid out in memory for one of the most popular formats, NCHW. Here the width is the inner-most dimension ([a:0]), followed by the height ([a:1]), the channels ([a:2]), and finally the batch ([a:3]). The [a:?] marks refer to the jumps shown in the picture below that shows the 1D representation of an NCHW tensor in memory.

Then the offset function is:

offset_nchw(n, c, h, w) = n * CHW + c * HW + h * W + w

We use nchw here to denote that w is the inner-most dimension, meaning that two elements adjacent in memory share the same indices of n, c, and h, and their indices of w differ by 1. This is of course true only for non-border elements. On the contrary, n is the outer-most dimension here, meaning that to access the same pixel (c, h, w) in the next image you have to jump over the whole image of size C*H*W.
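To make the mapping concrete, here is a small illustrative C helper (not part of the Intel MKL-DNN API; the dimension parameters C, H, and W are passed explicitly for the sake of the sketch):

#include <stddef.h>

/* Illustrative only: linear offset of element (n, c, h, w) in a dense
 * NCHW tensor of size N x C x H x W. */
static inline size_t offset_nchw(int n, int c, int h, int w,
        int C, int H, int W) {
    return ((size_t)n * C * H * W) + ((size_t)c * H * W)
            + ((size_t)h * W) + (size_t)w;
}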

This data format is called NCHW and is used by default in BVLC* Caffe. TensorFlow* also supports this data format.

Note: It is just a coincidence that offset_nchw() is the same as value() in this example.

One can create memory with the NCHW data layout using mkldnn_nchw of the enum type mkldnn_memory_format_t defined in mkldnn_types.h for the C API, and mkldnn::memory::nchw defined in mkldnn.hpp for the C++ API.
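For instance, a minimal sketch using the v0.x C API might look like this (the sizes are taken from the example above; error checking is omitted):

#include "mkldnn.h"

int N = 2, C = 16, H = 5, W = 4;
mkldnn_dims_t dims = {N, C, H, W};
mkldnn_memory_desc_t md;
// describe a dense f32 tensor with the NCHW layout
mkldnn_memory_desc_init(&md, 4, dims, mkldnn_f32, mkldnn_nchw);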

NHWC

Another quite popular data format is NHWC and it uses the following offset function:

offset_nhwc(n, c, h, w) = n * HWC + h * WC + w * C + c

In this case the inner-most dimension is channels ([b:0]), followed by width ([b:1]), height ([b:2]), and finally batch ([b:3]).

For a single image (N = 1), this format is very similar to how the BMP file format works: the image is kept pixel by pixel and every pixel contains all required information about colors (for instance, 3 channels for a 24-bit BMP).

NHWC data format is the default one for TensorFlow.

This layout corresponds to mkldnn_nhwc or mkldnn::memory::nhwc.

CHWN

The last example of a plain data layout here is CHWN, which is used by Neon*. This layout might be very interesting from a vectorization perspective if an appropriate batch size is used, but on the other hand users cannot always have a good batch size (e.g. in the case of real-time inference the batch is typically 1).

The order of the dimensions is (from inner-most to outer-most): batch ([c:0]), width ([c:1]), height ([c:2]), channels ([c:3]).

The offset function for CHWN format is defined as:

offset_chwn(n, c, h, w) = c * HWN + h * WN + w * N + n

This layout corresponds to mkldnn_chwn or mkldnn::memory::chwn.

[Figure mem_fmt_img2.png: Different plain layouts]
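For comparison, here are illustrative C helpers for the other two plain layouts (again, these are sketches rather than library code):

#include <stddef.h>

/* Illustrative only: the same logical element (n, c, h, w) maps to
 * different linear offsets depending on the plain layout. */
static inline size_t offset_nhwc(int n, int c, int h, int w,
        int C, int H, int W) {
    return ((size_t)n * H * W * C) + ((size_t)h * W * C)
            + ((size_t)w * C) + (size_t)c;
}

static inline size_t offset_chwn(int n, int c, int h, int w,
        int N, int H, int W) {
    return ((size_t)c * H * W * N) + ((size_t)h * W * N)
            + ((size_t)w * N) + (size_t)n;
}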

Relevant reading

TensorFlow Doc. Shapes and Layout

Generalization of the plain data layout

Strides

In the previous examples the data was kept packed, or in dense form, meaning that pixels follow one another. Sometimes it might be necessary not to keep data contiguous in memory. For instance, one might need to work with a sub-tensor within a bigger tensor. Sometimes it might be beneficial to artificially make the data disjoint, as in the case of GEMM with a non-trivial leading dimension to get better performance (see Tip 6).

The following picture shows a simplified case for a 2D matrix of size rows x columns kept in row-major format, where the rows have a non-trivial (i.e. not equal to the number of columns) stride.

[Figure strides.png: Strides]

In this case the general offset function looks like:

offset(n, c, h, w) = n * stride_n
                   + c * stride_c
                   + h * stride_h
                   + w * stride_w

Note that the NCHW, NHWC, and CHWN formats are just special cases of the format with strides. For example, for NCHW we have:

stride_n = CHW, stride_c = HW, stride_h = W, stride_w = 1
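In code, a generic strided offset is simply a dot product of the logical index with the strides. A sketch (illustrative only, with strides measured in elements):

#include <stddef.h>

/* Illustrative only: offset of (n, c, h, w) given per-dimension strides.
 * For a dense NCHW tensor the strides would be {C*H*W, H*W, W, 1}. */
static inline size_t offset_strided(int n, int c, int h, int w,
        const ptrdiff_t strides[4]) {
    return (size_t)(n * strides[0] + c * strides[1]
            + h * strides[2] + w * strides[3]);
}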

Intel MKL-DNN supports strides via the blocking structure. The pseudo code is:

memory_desc_t md; // memory descriptor object
// logical description, layout independent
md.ndims = 4; // # dimensions
md.dims = {N, C, H, W}; // dimensions themselves
// physical description
md.memory_format = mkldnn_blocked; // generic blocked format
md.layout_desc.blocking.strides[0] = {
    stride_n, stride_c, stride_h, stride_w
};

In particular, whenever a user creates memory with the mkldnn_nchw format, Intel MKL-DNN computes the strides and fills the structure on behalf of the user. That said, the strides can also be set manually.
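For example, one can verify the strides that the library computes in the NCHW case with a sketch against the v0.x C API (this mirrors the style of the code shown later in this article; error checking is omitted):

#include <assert.h>
#include "mkldnn.h"

int N = 2, C = 16, H = 5, W = 4;
mkldnn_dims_t dims = {N, C, H, W};
mkldnn_memory_desc_t md;
mkldnn_memory_desc_init(&md, 4, dims, mkldnn_f32, mkldnn_nchw);

// the library fills in the strides on behalf of the user
assert(md.layout_desc.blocking.strides[0][0] == C*H*W); // stride_n
assert(md.layout_desc.blocking.strides[0][1] == H*W);   // stride_c
assert(md.layout_desc.blocking.strides[0][2] == W);     // stride_h
assert(md.layout_desc.blocking.strides[0][3] == 1);     // stride_w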

Blocked layout

Plain layouts give great flexibility and are very convenient to use. That's why most of the frameworks and applications use either the NCHW or NHWC layout. However, depending on the operation that is performed on the data, it might turn out that those layouts are sub-optimal from a performance perspective.

In order to achieve better vectorization and cache reuse, Intel MKL-DNN introduces a blocked layout that splits one or several dimensions into blocks of fixed size. The most popular Intel MKL-DNN data format is nChw16c on AVX512+ systems and nChw8c on SSE4.2+ systems. As one might guess from the name, the only dimension that is blocked is channels, and the block size is either 16 in the former case or 8 in the latter case.

Precisely, the offset function for nChw8c is:

offset_nChw8c(n, c, h, w) = n * CHW
                          + (c / 8) * HW*8
                          + h * W*8
                          + w * 8
                          + (c % 8)

Note that blocks of 8 channels are kept contiguously in memory. The spatial domain is covered pixel by pixel, then the next slice covers the subsequent 8 channels (i.e. moving from c=0..7 to c=8..15). Once all channel blocks are covered, the next image in the batch appears.
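Expressed as an illustrative C helper (not library code; C is assumed to be the full, block-aligned number of channels):

#include <stddef.h>

/* Illustrative only: offset of (n, c, h, w) in the nChw8c layout.
 * Blocks of 8 channels form the inner-most, unit-stride dimension. */
static inline size_t offset_nChw8c(int n, int c, int h, int w,
        int C, int H, int W) {
    return ((size_t)n * C * H * W)
            + ((size_t)(c / 8) * H * W * 8)
            + ((size_t)h * W * 8)
            + ((size_t)w * 8)
            + (size_t)(c % 8);
}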

[Figure mem_fmt_blk.png: nChw8c format]

Note: we use lower- and uppercase letters in the formats to distinguish between the blocks (e.g. 8c) and the remaining co-dimension (C = channels / 8).

The reason behind the format choice can be found in this paper.

Intel MKL-DNN describes this type of memory via blocking structure as well. The pseudo code is:

memory_desc_t md;
// logical description, layout independent
md.ndims = 4; // # dimensions
md.dims = {N, C, H, W}; // dimensions themselves
// physical description
md.memory_format = mkldnn_nChw8c; // blocked layout with
// channels blocked by 8
md.layout_desc.blocking.block_dims = {
    1, // no blocking by n, hence 1
    8, // blocking by c, hence 8
    1, // no blocking by h, hence 1
    1, // no blocking by w, hence 1
};
ptrdiff_t stride_n = C*H*W;
ptrdiff_t stride_C = H*W*8;
ptrdiff_t stride_h = W*8;
ptrdiff_t stride_w = 8;
ptrdiff_t stride_8c = 1;
md.layout_desc.blocking.strides[0] = { // strides between blocks
    stride_n, stride_C, stride_h, stride_w
};
md.layout_desc.blocking.strides[1] = { // strides within blocks
    1,         // ignored since no blocking by n
    stride_8c, // blocks of channels are contiguous
    1,         // ignored since no blocking by h
    1,         // ignored since no blocking by w
};

What if the number of channels is not a multiple of 8 (or 16)?

A blocked data layout gives a significant performance improvement for convolutions, but what should be done when the number of channels is not a multiple of the block size, say 17 channels for the nChw8c format?

Well, one of the possible ways to handle that would be to use a blocked layout for as many channels as possible by rounding them down to a number that is a multiple of the block size (in this case 16 = 17 / 8 * 8) and to process the tail somehow. However, that would lead to the introduction of very special tail-processing code into many Intel MKL-DNN kernels.

So we came up with another solution based on zero-padding. The idea is to round the number of channels up to make it a multiple of the block size and to pad the resulting tail with zeros (in the example above, 24 = div_up(17, 8) * 8). Then primitives like convolutions can work with the rounded-up number of channels instead of the original one and still compute the correct result (adding zeros does not change the result).

That allows supporting an arbitrary number of channels with almost no changes to the kernels. The price is some extra computation on those zeros, but this cost is either negligible, or the performance with the overhead is still higher than the performance with a plain data layout.
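The round-up used above is ordinary integer arithmetic; a sketch of the div_up helper referenced here and in the code at the end of this article (the helper is illustrative, not a public API):

/* div_up(17, 8) == 3, so the padded number of channels is 3 * 8 == 24. */
static inline int div_up(int a, int b) { return (a + b - 1) / b; }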

The picture below depicts the idea. Note that some extra computations happen while computing d0, but that does not affect the result.

[Figure mem_fmt_padded_blk.png: Padded format]

The feature is supported starting with Intel MKL-DNN v0.15. So far the support is limited to the f32 data type and AVX512+ systems. We plan to extend the implementation to other cases as well.

Some pitfalls of the given approach:

- The amount of memory required to keep the data has to be computed from the padded (rounded-up) number of channels, not from the original one; see C_padded in the code below.

Relevant Intel MKL-DNN code:

const int C = 17;
const int C_padded = div_up(17, 8) * 8; // 24
// logical description, layout independent
const int ndims = 4; // # of dimensions
mkldnn_dims_t dims = {N, C, H, W}; // dimensions themselves
memory_desc_t md;
// initialize memory descriptor
mkldnn_memory_desc_init(&md, ndims,
        dims,
        mkldnn_f32,    // single precision data type
        mkldnn_nChw8c  // blocked layout
);
ptrdiff_t expect_stride_n = C_padded*H*W; // note C_padded here, not C
ptrdiff_t expect_stride_C = H*W*8;
ptrdiff_t expect_stride_h = W*8;
ptrdiff_t expect_stride_w = 8;
ptrdiff_t expect_stride_8c = 1;
bool expect_true = true
&& true // logical dims stay as is
&& md.dims[0] == N
&& md.dims[1] == C
&& md.dims[2] == H
&& md.dims[3] == W
&& true // padded dims are rounded accordingly
&& md.layout_desc.blocking.padding_dims[0] == N
&& md.layout_desc.blocking.padding_dims[1] == C_padded
&& md.layout_desc.blocking.padding_dims[2] == H
&& md.layout_desc.blocking.padding_dims[3] == W
&& true // strides correspond to the physical layout
&& md.layout_desc.blocking.strides[0][0] == expect_stride_n
&& md.layout_desc.blocking.strides[0][1] == expect_stride_C
&& md.layout_desc.blocking.strides[0][2] == expect_stride_h
&& md.layout_desc.blocking.strides[0][3] == expect_stride_w
&& md.layout_desc.blocking.strides[1][1] == expect_stride_8c;
assert(expect_true);
