Some primitives might require temporary space while performing their computations. For instance, operations that do not have enough independent work to utilize all cores on a system might parallelize over the reduction axis (e.g., the k-axis in matrix-matrix multiplication). In this case, the threads compute partial results in a temporary buffer, and once they finish, the library reduces the partial results into the final one. Another example is a convolution implementation that uses GEMM. Before calling GEMM, the source images need to be rearranged by the so-called im2col transformation. The rearranged data is written to an intermediate buffer that is then used as an input for GEMM.

In both of these examples, the temporary memory is not required once the computations are done. DNNL refers to such memory as a scratchpad.
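
The first example can be made concrete with a small sketch. The code below is not DNNL code: `gemm_k_split` is a hypothetical helper, and the k-axis slices are processed one after another instead of on separate threads, to keep the example short. It only illustrates how private partial results are accumulated in a temporary buffer and then reduced.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not a DNNL API. Computes C = A * B for row-major
// A (M x K) and B (K x N) by splitting the k-axis into `nthr` slices.
// The slices are processed sequentially here; in a real implementation
// each slice would be handled by its own thread.
std::vector<float> gemm_k_split(const std::vector<float> &A,
        const std::vector<float> &B, std::size_t M, std::size_t N,
        std::size_t K, std::size_t nthr) {
    assert(nthr > 0);
    assert(A.size() == M * K && B.size() == K * N);

    // Scratchpad: one private M x N partial result per slice, so that no
    // two slices ever write to the same location.
    std::vector<float> scratchpad(nthr * M * N, 0.f);

    for (std::size_t ithr = 0; ithr < nthr; ++ithr) {
        float *partial = scratchpad.data() + ithr * M * N;
        // Contiguous part of the k-axis owned by this slice.
        const std::size_t k_begin = K * ithr / nthr;
        const std::size_t k_end = K * (ithr + 1) / nthr;
        for (std::size_t m = 0; m < M; ++m)
            for (std::size_t k = k_begin; k < k_end; ++k)
                for (std::size_t n = 0; n < N; ++n)
                    partial[m * N + n] += A[m * K + k] * B[k * N + n];
    }

    // Reduction step: sum the partial results into the final output.
    std::vector<float> C(M * N, 0.f);
    for (std::size_t ithr = 0; ithr < nthr; ++ithr)
        for (std::size_t i = 0; i < M * N; ++i)
            C[i] += scratchpad[ithr * M * N + i];

    // The scratchpad is destroyed on return: it is not needed any longer.
    return C;
}
```

The temporary buffer holds `nthr * M * N` elements, which is why its size grows with the number of threads taking part in the reduction.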

Warning: Do not confuse the scratchpad with the workspace. The workspace is a buffer that is shared between forward and backward propagation of a primitive (and hence must be preserved between the calls) and is used only in training.

The amount of space required for the scratchpad depends on the primitive and the actual implementation. The GEMM-based convolutions require a scratchpad for the im2col data, while directly implemented convolutions can work with the original data.

Both types of implementation might need extra space for the reduction when there are too few independent tasks. The im2col size is proportional to the size of the source image multiplied by the spatial size of the weights. The size of a buffer for reduction is proportional to the size of the tensor to be reduced (e.g., diff_weights in the case of backward by weights) multiplied by the number of threads in the reduction groups (the upper bound is the overall number of threads).
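
To show what the im2col buffer contains and why its size scales with the spatial size of the weights, here is a minimal sketch. It is not the DNNL implementation: `im2col` and `conv_via_gemm` are hypothetical helpers that assume a single-channel image, a single filter, stride 1, and no padding.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not a DNNL API. Rearranges a single-channel H x W
// image (row-major) for a KH x KW kernel with stride 1 and no padding.
// The result is a (KH * KW) x (OH * OW) row-major matrix in which column
// `oh * OW + ow` holds the input patch that produces output pixel (oh, ow).
std::vector<float> im2col(const std::vector<float> &src, std::size_t H,
        std::size_t W, std::size_t KH, std::size_t KW) {
    assert(src.size() == H * W);
    assert(KH >= 1 && KW >= 1 && KH <= H && KW <= W);
    const std::size_t OH = H - KH + 1, OW = W - KW + 1;

    std::vector<float> col(KH * KW * OH * OW);
    for (std::size_t kh = 0; kh < KH; ++kh)
        for (std::size_t kw = 0; kw < KW; ++kw)
            for (std::size_t oh = 0; oh < OH; ++oh)
                for (std::size_t ow = 0; ow < OW; ++ow)
                    col[((kh * KW + kw) * OH + oh) * OW + ow]
                            = src[(oh + kh) * W + (ow + kw)];
    return col;
}

// Convolution with one KH x KW filter, expressed as the product of a
// 1 x (KH * KW) weights row and the (KH * KW) x (OH * OW) im2col matrix.
std::vector<float> conv_via_gemm(const std::vector<float> &src,
        const std::vector<float> &weights, std::size_t H, std::size_t W,
        std::size_t KH, std::size_t KW) {
    assert(weights.size() == KH * KW);
    // The im2col matrix plays the role of the scratchpad.
    const std::vector<float> col = im2col(src, H, W, KH, KW);
    const std::size_t out_size = col.size() / (KH * KW);

    std::vector<float> dst(out_size, 0.f);
    for (std::size_t r = 0; r < KH * KW; ++r)
        for (std::size_t o = 0; o < out_size; ++o)
            dst[o] += weights[r] * col[r * out_size + o];
    return dst;
}
```

In this sketch every output pixel gets its own copy of a `KH x KW` patch, so the buffer has `KH * KW` times as many elements as the output.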

As you can see, the scratchpad in these cases might be significant. By contrast, some other primitives might require very little extra space. For instance, one of the implementations of the dnnl::sum primitive requires temporary space only to store the pointers to the data of each input array (that is, the size of the scratchpad is n * sizeof(void *), where n is the number of summands).
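
The difference in magnitude can be checked with back-of-the-envelope arithmetic. The helpers below are hypothetical and encode only the proportionalities stated above with a constant factor of one. The sizes a real implementation requests may differ, so the authoritative number is always the one reported by the primitive descriptor.

```cpp
#include <cstddef>

// Rough estimates only; hypothetical helpers, not DNNL APIs.

// im2col buffer: source elements times the spatial size of the weights.
constexpr std::size_t im2col_bytes(std::size_t src_elems, std::size_t kh,
        std::size_t kw, std::size_t elem_size) {
    return src_elems * kh * kw * elem_size;
}

// Reduction buffer: elements of the reduced tensor times the number of
// threads in a reduction group.
constexpr std::size_t reduction_bytes(std::size_t tensor_elems,
        std::size_t nthr_in_group, std::size_t elem_size) {
    return tensor_elems * nthr_in_group * elem_size;
}

// Pointer table of a sum over n inputs.
constexpr std::size_t sum_bytes(std::size_t n) {
    return n * sizeof(void *);
}
```

For example, `im2col_bytes(3 * 224 * 224, 3, 3, sizeof(float))` is 5419008 bytes when `float` is 4 bytes wide, whereas `sum_bytes(4)` is only four pointers.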

DNNL supports two modes of dealing with scratchpads:

- dnnl::scratchpad_mode::library. The library allocates memory for each primitive during its creation. This is the default behavior, which lets the user not worry about the scratchpad at all. However, this approach has two major downsides:
  - If primitives are cached, they may reserve a significant amount of memory.
  - Primitives are not thread-safe, because simultaneous runs will make different threads use the same scratchpad buffer.
- dnnl::scratchpad_mode::user. The user provides scratchpad memory of sufficient size at primitive execution time (using the DNNL_ARG_SCRATCHPAD tag). This enables the user to reuse the memory as well as to make the primitives thread-safe. However, this requires a good memory manager (in terms of speed and locality) on the user's side and some extra boilerplate code.

Warning: Primitives are not thread-safe by default. Users should use dnnl::scratchpad_mode::user if they want to use a single primitive from different threads simultaneously.
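
The trade-off between the two modes can be modeled without the library. The class below is a toy stand-in and not the DNNL API: `toy_primitive`, its methods, and the sum-of-squares computation are all invented for illustration. It only shows why an internally owned buffer makes concurrent execution unsafe, while a caller-provided buffer does not.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model, not the DNNL API: a "primitive" that computes the sum of
// squares of n floats and needs n floats of temporary space to do so.
class toy_primitive {
public:
    enum class scratchpad_mode { library, user };

    toy_primitive(std::size_t n, scratchpad_mode mode)
        : n_(n), mode_(mode)
        // In `library` mode the buffer is allocated at creation time and
        // lives as long as the primitive does.
        , internal_(mode == scratchpad_mode::library ? n : 0) {}

    // Bytes held by the primitive itself (zero in `user` mode).
    std::size_t memory_consumption() const {
        return internal_.size() * sizeof(float);
    }

    // Bytes the caller must provide at execution (zero in `library` mode).
    std::size_t scratchpad_size() const {
        return mode_ == scratchpad_mode::user ? n_ * sizeof(float) : 0;
    }

    // `library` mode: writes to the buffer owned by the object, so two
    // simultaneous calls on the same object would race.
    float execute(const std::vector<float> &src) {
        assert(mode_ == scratchpad_mode::library);
        return run(src, internal_.data());
    }

    // `user` mode: touches no member state, so callers that pass distinct
    // scratchpads do not interfere with each other.
    float execute(const std::vector<float> &src, float *scratchpad) const {
        assert(mode_ == scratchpad_mode::user);
        return run(src, scratchpad);
    }

private:
    float run(const std::vector<float> &src, float *scratchpad) const {
        assert(src.size() == n_);
        for (std::size_t i = 0; i < n_; ++i)
            scratchpad[i] = src[i] * src[i];
        float sum = 0.f;
        for (std::size_t i = 0; i < n_; ++i)
            sum += scratchpad[i];
        return sum;
    }

    std::size_t n_;
    scratchpad_mode mode_;
    std::vector<float> internal_;
};
```

In this model, a cached `library`-mode object keeps its buffer alive for its whole lifetime, while a `user`-mode object reports zero memory consumption and leaves allocation, reuse, and per-thread separation to the caller.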

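The attributes (Primitive Attributes) are used to control who provides a scratchpad: dnnl_primitive_attr_set_scratchpad_mode in the C API and dnnl::primitive_attr::set_scratchpad_mode in the C++ API.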

It is worth mentioning that all primitives support both scratchpad modes. That is, primitive descriptor creation success or failure cannot depend on the scratchpad mode used.

Scratchpad Memory Engine

If the user provides scratchpad memory to a primitive, this memory must be created using the same engine that the primitive uses.

Examples

Library Manages Scratchpad

As mentioned above, this is the default behavior. This example only highlights how a user can query the amount of memory consumed by a primitive due to a scratchpad.

// Use default attr, hence the library allocates scratchpad
dnnl::primitive::primitive_desc op_pd(params, ...);
// Print how much memory would be held by a primitive due to scratchpad
std::cout << "primitive will use "
          << op_pd.query_s64(dnnl::query::memory_consumption_s64)
          << " bytes" << std::endl;
// In this case the scratchpad is internal, hence the user-visible scratchpad
// memory descriptor should be empty:
auto zero_md = dnnl::memory::desc();
assert(op_pd.scratchpad_desc() == zero_md);

User Manages Scratchpad

// Create empty (default) attributes
dnnl::primitive_attr attr;
// Default scratchpad mode is `library`:
assert(attr.get_scratchpad_mode() == dnnl::scratchpad_mode::library);
// Set scratchpad mode to `user`
attr.set_scratchpad_mode(dnnl::scratchpad_mode::user);
// Create a primitive descriptor with custom attributes
dnnl::primitive::primitive_desc op_pd(op_d, attr, engine);
// Query the scratchpad memory descriptor
dnnl::memory::desc scratchpad_md = op_pd.scratchpad_desc();
// Note that a primitive doesn't consume memory in this configuration:
assert(op_pd.query_s64(dnnl::query::memory_consumption_s64) == 0);
// Create a primitive
dnnl::primitive prim(op_pd);
// ...
// Create a scratchpad memory
// NOTE: if scratchpad is not required for a particular primitive the
//       scratchpad_md.get_size() will return 0. It is fine to have
//       scratchpad_ptr == nullptr in this case.
void *scratchpad_ptr = user_memory_manager::allocate(scratchpad_md.get_size());
// NOTE: engine here must match the engine of the primitive
dnnl::memory scratchpad(scratchpad_md, engine, scratchpad_ptr);
// Pass a scratchpad memory to a primitive
prim.execute(stream, {
        ...,
        {DNNL_ARG_SCRATCHPAD, scratchpad}});
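
The `user_memory_manager::allocate` call above is a placeholder for whatever allocator the application uses. One possible shape is sketched below, purely as an illustration. The implementation is an assumption, not part of DNNL: it keeps one grow-only buffer per thread, so consecutive primitives on the same thread reuse the same memory and different threads never share a scratchpad.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical implementation of the placeholder used above; not a DNNL API.
namespace user_memory_manager {

// Returns a pointer to at least `size` bytes of scratch space, or nullptr
// when `size` is 0. The buffer is owned by the calling thread and is reused
// by later calls, so its contents are not preserved between calls and a
// pointer obtained earlier may be invalidated by a later, larger request.
inline void *allocate(std::size_t size) {
    if (size == 0) return nullptr;

    // One buffer per thread. It only grows, so after a warm-up phase no
    // further allocations happen.
    // NOTE: std::vector only guarantees the default allocator alignment; a
    //       production manager might want a stricter one.
    thread_local std::vector<unsigned char> buffer;
    if (buffer.size() < size) buffer.resize(size);
    assert(buffer.size() >= size);
    return buffer.data();
}

} // namespace user_memory_manager
```

This scheme assumes that a thread uses at most one scratchpad at a time. That holds when each primitive finishes executing before the next one on the same thread starts. With asynchronous execution, the buffer must not be reused until the stream has completed the work.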