A parallel loop construct incurs overhead cost for every chunk of work that it schedules. oneAPI Threading Building Blocks (oneTBB) chooses chunk sizes automatically, depending upon load balancing needs. The heuristic attempts to limit overheads while still providing ample opportunities for load balancing.
Typically a loop needs to take at least a million clock cycles to make it worth using parallel_for. For example, a loop that takes at least 500 microseconds on a 2 GHz processor might benefit from parallel_for.
The default automatic chunking is recommended for most uses. As with most heuristics, however, there are situations where controlling the chunk size more precisely might yield better performance.
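The million-cycle rule of thumb mentioned earlier is plain arithmetic: a clock rate of 1 GHz corresponds to 1000 cycles per microsecond, so 500 microseconds at 2 GHz is about one million cycles. The sketch below illustrates that conversion only. The names estimate_cycles, worth_parallelizing, and kMinCyclesForParallel are hypothetical helpers invented for this example and are not part of oneTBB; the threshold is a rough guideline, not a hard limit.

```cpp
#include <cstdint>

// Rule-of-thumb threshold from the text: about one million clock cycles.
// (Hypothetical constant for illustration; not a oneTBB symbol.)
constexpr std::uint64_t kMinCyclesForParallel = 1000000;

// Hypothetical helper: converts a measured loop duration into an
// approximate cycle count. 1 GHz equals 1000 cycles per microsecond.
constexpr std::uint64_t estimate_cycles(double microseconds, double clock_ghz) {
    return static_cast<std::uint64_t>(microseconds * clock_ghz * 1000.0);
}

// Hypothetical helper: true when the loop is long enough that parallel_for
// might pay off according to the rule of thumb.
constexpr bool worth_parallelizing(double microseconds, double clock_ghz) {
    return estimate_cycles(microseconds, clock_ghz) >= kMinCyclesForParallel;
}

// The example from the text: 500 microseconds on a 2 GHz processor.
static_assert(estimate_cycles(500.0, 2.0) == 1000000,
              "500 us at 2 GHz is one million cycles");
static_assert(worth_parallelizing(500.0, 2.0),
              "meets the rule-of-thumb threshold");
// A loop half as long falls below the threshold.
static_assert(!worth_parallelizing(250.0, 2.0),
              "250 us at 2 GHz is only 500,000 cycles");
```

In practice the duration would come from timing the serial loop on the target machine. The result is only a first filter: a loop that clears the threshold might benefit from parallel_for, but measuring the parallel version is still the deciding step.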