The following settings are recommended for measuring oneDNN performance using benchdnn. When measuring performance using a deep learning framework, refer to that framework's benchmarking documentation. However, the approach outlined below applies to almost any compute-intensive application.
It is common practice to affinitize each compute thread to its own CPU core when benchmarking performance. The method for doing this depends on the threading library used.
TBB intentionally does not provide a mechanism to control affinity or number of threads via environment variables. However, TBB does create threads based on the number of CPUs in the process CPU affinity mask at the time the library is initialized. This means that some of the examples below work for TBB as well. Additionally, TBB implements an observer mechanism that can be used to affinitize threads.
This document focuses on the OpenMP runtime, which has portable controls for thread affinity: the OMP_PROC_BIND and OMP_PLACES environment variables defined by the OpenMP specification. Note that the OpenMP runtime that comes with Microsoft Visual Studio does not support them, nor does it provide any other way to control thread affinity.
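To check what a runtime actually does with these controls, the OpenMP specification also defines display variables. The sketch below assumes a runtime that implements them: OMP_DISPLAY_ENV was introduced in OpenMP 4.0 and OMP_DISPLAY_AFFINITY in OpenMP 5.0, so older runtimes may not honor one or both.

```sh
# Ask the OpenMP runtime to print its effective settings at startup
# (number of threads, binding policy, places, and so on).
export OMP_DISPLAY_ENV=true

# Ask the runtime to report where each thread is bound
# (assumes OpenMP 5.0 or later).
export OMP_DISPLAY_AFFINITY=true
```

With these set, any OpenMP program started from the same shell should report its configuration before doing work, which makes it easier to confirm that the affinity settings took effect.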
The general principles below are not operating system-specific. However, of all the operating systems supported by oneDNN, only Linux has the numactl(8) utility, which makes it easy to demonstrate them. NUMA stands for non-uniform memory access, the typical architecture of modern multi-socket systems, in which each socket has its own physical memory attached. A NUMA configuration is possible even within a single socket, when the socket consists of multiple chips or tiles or when sub-NUMA clustering is enabled.
Also, many modern CPUs have multiple hardware threads per CPU core enabled. Such threads are usually exposed by the OS as additional logical processors (thus a system with 4 cores and 2 hardware threads per core has 8 logical processors). If this is the case, the recommendation is to use only one hardware thread per core.
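Whether a machine exposes more than one hardware thread per core can be checked from the shell. The following is a minimal sketch, assuming a Linux system with lscpu from util-linux; the variable names are illustrative, and the snippet falls back to the logical processor count when lscpu is unavailable.

```sh
# Logical processors currently online, as seen by the OS.
LOGICAL_CPUS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || nproc)

# Physical cores: count unique (core, socket) pairs in lscpu's parseable output.
PHYSICAL_CORES=0
if command -v lscpu >/dev/null 2>&1; then
    PHYSICAL_CORES=$(lscpu -p=CORE,SOCKET 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' ')
fi

# Fall back to the logical count if lscpu is missing or reported nothing.
if [ "${PHYSICAL_CORES:-0}" -lt 1 ]; then
    PHYSICAL_CORES=$LOGICAL_CPUS
fi

echo "logical processors:        $LOGICAL_CPUS"
echo "physical cores:            $PHYSICAL_CORES"
# Integer division; on CPUs that mix core types this is only an average.
echo "hardware threads per core: $((LOGICAL_CPUS / PHYSICAL_CORES))"
```

A result greater than 1 on the last line means the recommendation to use one hardware thread per core applies.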
There are three main setup variants when benchmarking oneDNN on CPU: the whole machine, a single NUMA domain, and several cores within a NUMA domain.
Typically, a modern server CPU is configured to have multiple NUMA domains. When running benchmarks on the whole machine, it is best to instruct the OS to interleave physical memory allocation across those domains. This way the computations have a higher chance of accessing physical memory from a local domain, and thus there is less cross-node traffic. This also lowers run-to-run variation.
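A whole-machine run might be set up as in the sketch below. It assumes the portable OpenMP controls with threads spread across the machine. NUM_CORES and the run_whole_machine helper are hypothetical names introduced for illustration, and the default of 4 exists only so the snippet runs as written.

```sh
# Hypothetical placeholder: set NUM_CORES to the number of physical cores
# in the system before sourcing this snippet.
NUM_CORES="${NUM_CORES:-4}"

export OMP_PROC_BIND=spread          # spread threads across the available places
export OMP_PLACES=threads            # one place per hardware thread
export OMP_NUM_THREADS="$NUM_CORES"  # one OpenMP thread per physical core

# Hypothetical helper: run a command with memory interleaved across all NUMA domains.
run_whole_machine() {
    numactl --interleave=all "$@"
}
```

The helper would then be invoked as run_whole_machine ./benchdnn followed by the benchdnn arguments of interest.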
When benchmarking on a single NUMA domain, we instruct numactl to affinitize the process to NUMA domain 0, both in terms of CPU and memory locality.
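Binding both CPUs and memory to NUMA domain 0 can be sketched as follows. The OpenMP settings are assumed to match the whole-machine case, with the thread count reduced to the cores of one domain. NUM_CORES_NODE0 and run_on_node0 are hypothetical names, and the default of 4 is a stand-in.

```sh
# Hypothetical placeholder: number of physical cores in NUMA domain 0.
NUM_CORES_NODE0="${NUM_CORES_NODE0:-4}"

export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS="$NUM_CORES_NODE0"

# Hypothetical helper: restrict both CPU placement and memory allocation
# to NUMA domain 0.
run_on_node0() {
    numactl --membind=0 --cpunodebind=0 "$@"
}
```

As before, the benchmark would be started as run_on_node0 ./benchdnn followed by its arguments.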
When benchmarking on several cores within a NUMA domain, we want to use the same numactl options as in the single NUMA domain scenario, but place OpenMP threads close to one another.
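The difference from the single-domain case is the binding policy: close instead of spread. The sketch below assumes the same numactl options as for a single NUMA domain. NUM_THREADS and run_close_on_node0 are hypothetical names, and the default of 2 is a stand-in for the number of cores to use.

```sh
# Hypothetical placeholder: how many cores of NUMA domain 0 to use.
NUM_THREADS="${NUM_THREADS:-2}"

export OMP_PROC_BIND=close           # pack threads onto neighboring places
export OMP_PLACES=threads
export OMP_NUM_THREADS="$NUM_THREADS"

# Hypothetical helper: same NUMA binding as the single-domain scenario.
run_close_on_node0() {
    numactl --membind=0 --cpunodebind=0 "$@"
}
```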
Unfortunately, the portable OpenMP settings do not work as intended when there are multiple hardware threads per CPU core: OpenMP runtimes then place multiple threads on each core. Moreover, there is no way to describe the desired configuration, with only one OpenMP thread per core, without either listing the corresponding logical processors explicitly on the command line via numactl --physcpubind=<list> or using non-portable environment variables supported by OpenMP runtimes based on the Intel OpenMP runtime (Clang, Intel C/C++ Compiler), such as KMP_HW_SUBSET=1T, which uses one hardware thread per core.
KMP_HW_SUBSET=1T can be used even if the machine is configured with a single hardware thread per core. It also makes it unnecessary to set OMP_NUM_THREADS in all the scenarios but the last, as the number of threads is then inferred from the total number of logical processors in the process CPU affinity mask.
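Both ways of getting one OpenMP thread per core can be sketched in shell. The first is the KMP_HW_SUBSET setting described above. The second builds an explicit list for numactl --physcpubind by keeping the first logical processor of each (core, socket) pair; it assumes lscpu from util-linux, and first_cpu_per_core and run_one_thread_per_core are hypothetical helper names. Only one of the two options is needed.

```sh
# Option 1 (non-portable): OpenMP runtimes based on the Intel OpenMP runtime.
export KMP_HW_SUBSET=1T   # use 1 hardware thread per core

# Option 2: explicit list of logical processors, one per physical core.
# Reads "CPU,CORE,SOCKET" lines on stdin and prints a comma-separated list
# containing the first CPU seen for each (core, socket) pair.
first_cpu_per_core() {
    awk -F, '!/^#/ && !seen[$2 "," $3]++ { print $1 }' | paste -sd, -
}

# Hypothetical helper: run a command on one logical processor per core.
run_one_thread_per_core() {
    numactl --physcpubind="$(lscpu -p=CPU,CORE,SOCKET | first_cpu_per_core)" "$@"
}
```

On a machine where logical processors 0 and 2 share a core, and 1 and 3 share another, first_cpu_per_core yields the list 0,1.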