Controlling Chunking

Chunking is controlled by a partitioner and a grainsize. To gain the most control over chunking, you specify both.

  • Specify simple_partitioner() as the third argument to parallel_for. Doing so turns off automatic chunking.

  • Specify the grainsize when constructing the range. The three-argument form of the constructor is blocked_range&lt;T&gt;(begin,end,grainsize). The default value of grainsize is 1. It is in units of loop iterations per chunk.

If the chunks are too small, the overhead may exceed the performance advantage.

The following code is the last example from parallel_for, modified to use an explicit grainsize G.

#include "oneapi/tbb.h"

using namespace oneapi::tbb;

void ParallelApplyFoo( float a[], size_t n ) {
    // G is the explicit grainsize, in loop iterations per chunk.
    parallel_for(blocked_range<size_t>(0,n,G), ApplyFoo(a),
                 simple_partitioner());
}

The grainsize sets a minimum threshold for parallelization. The parallel_for in the example invokes ApplyFoo::operator() on chunks, possibly of different sizes. Let chunksize be the number of iterations in a chunk. Using simple_partitioner guarantees that ⌈G/2⌉ <= chunksize <= G.

There is also an intermediate level of control where you specify the grainsize for the range, but use an auto_partitioner or affinity_partitioner. An auto_partitioner is the default partitioner. Both partitioners implement the automatic grainsize heuristic described in Automatic Chunking. An affinity_partitioner implies an additional hint, as explained later in Section Bandwidth and Cache Affinity. Though these partitioners may cause chunks to have more than G iterations, they never generate chunks with fewer than ⌈G/2⌉ iterations. Specifying a range with an explicit grainsize may occasionally be useful to prevent these partitioners from generating wastefully small chunks if their heuristics fail.

Because of the impact of grainsize on parallel loops, it is worth reading the following material even if you rely on auto_partitioner and affinity_partitioner to choose the grainsize automatically.

[Figure: two chunked workloads. Case A: small grainsize. Case B: large grainsize.]

The above figure illustrates the impact of grainsize by showing the useful work as the gray area inside a brown border that represents overhead. Both Case A and Case B have the same total gray area. Case A shows how too small a grainsize leads to a relatively high proportion of overhead. Case B shows how a large grainsize reduces this proportion, at the cost of reducing potential parallelism. The overhead as a fraction of useful work depends upon the grainsize, not on the number of grains. Consider this relationship and not the total number of iterations or number of processors when setting a grainsize.

A rule of thumb is that grainsize iterations of operator() should take at least 100,000 clock cycles to execute. For example, if a single iteration takes 100 clocks, then the grainsize needs to be at least 1000 iterations. When in doubt, do the following experiment:

  1. Set the grainsize parameter higher than necessary. The grainsize is specified in units of loop iterations. If you have no idea of how many clock cycles an iteration might take, start with grainsize=100,000. The rationale is that each iteration normally requires at least one clock cycle. In most cases, step 3 will guide you to a much smaller value.

  2. Run your algorithm.

  3. Iteratively halve the grainsize parameter and see how much the algorithm slows down or speeds up as the value decreases.
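The halving experiment above can be driven by a small script. In this sketch, ./bench is a hypothetical benchmark binary that runs your parallel loop with the given grainsize and reports its wall-clock time; substitute your own program.

```shell
# Start high (step 1), then iteratively halve (step 3).
g=100000
while [ "$g" -ge 1 ]; do
    echo "trying grainsize=$g"    # replace with: ./bench "$g"
    g=$((g / 2))
done
```

Plotting the reported times against grainsize should reproduce the bathtub curve discussed below: a steep left slope, a wide flat bottom, and an upturn once chunks outnumber threads.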

A drawback of setting a grainsize too high is that it can reduce parallelism. For example, if the grainsize is 1000 and the loop has 2000 iterations, the parallel_for distributes the loop across only two processors, even if more are available. However, if you are unsure, err on the side of being a little too high instead of a little too low, because too low a value hurts serial performance, which in turn hurts parallel performance if there is other parallelism available higher up in the call tree.

Tip

You do not have to set the grainsize too precisely.

The next figure shows the typical “bathtub curve” for execution time versus grainsize, based on the floating point a[i]=b[i]*c computation over a million indices. There is little work per iteration. The times were collected on a four-socket machine with eight hardware threads.

[Figure: Wall Clock Time Versus Grainsize]

The scale is logarithmic. The downward slope on the left side indicates that with a grainsize of one, most of the overhead is parallel scheduling overhead, not useful work. An increase in grainsize brings a proportional decrease in parallel overhead. Then the curve flattens out because the parallel overhead becomes insignificant for a sufficiently large grainsize. At the end on the right, the curve turns up because the chunks are so large that there are fewer chunks than available hardware threads. Notice that a grainsize over the wide range 100-100,000 works quite well.

Tip

A general rule of thumb for parallelizing loop nests is to parallelize the outermost one possible. The reason is that each iteration of an outer loop is likely to provide a bigger grain of work than an iteration of an inner loop.