Observing Memory Level Parallelism with JMH

Quite some time ago I observed an effect where breaking a cache-inefficient shuffle algorithm into short stages could improve throughput: when cache misses were likely, an improvement could be seen in throughput as a function of stage length. The implementations benchmarked were as follows, where op is either precomputed (a closure over an array of indices to swap) or a call to ThreadLocalRandom:

  public void shuffle(Blackhole bh) {
    for (int i = data.length; i > 1; i--)
      swap(data, i - 1, op.applyAsInt(i));

  public void shuffleBuffered(Blackhole bh) {
    for (int i = data.length; i - unroll > 1; i -= unroll) {
      for (int j = 0; j < buffer.length; ++j) {
        buffer[j] = op.applyAsInt(i - j);
      for (int j = 0; j < buffer.length; ++j) {
        swap(data, i - j - 1, buffer[j]);

  private static void swap(int[] arr, int i, int j) {
    arr[i] ^= arr[j];
    arr[j] ^= arr[i];
    arr[i] ^= arr[j];

I couldn’t explain the effect observed in terms of the default performance counters made available by JMH, but offered an intuitive explanation that the cache miss could be shared between four independent chains of execution so that cache misses in a given chain would not stall the others. This intuition was gleaned from perfasm: I guessed the bottleneck on this load was due to cache misses. In the simple shuffle, there was one big hit:

 73.97%   63.58%  │      ││  0x00007f8405c0a272: mov    %eax,0xc(%r11,%rcx,4) 

Executing the staged shuffle, I saw several smaller bottlenecks and could only guess the simpler code within each stage had more parallelism; these cache misses were happening at the same time and independently.

 10.33%   11.23%   ││  0x00007fdb35c09250: mov    %r9d,0xc(%rsi,%r10,4)  
 10.40%   10.66%   ││  0x00007fdb35c09283: mov    %r9d,0x8(%rsi,%r10,4) 

Travis Downs left a great comment on the post pointing me in the direction of the counters l1d_pend_miss.pending and l1d_pend_miss.pending_cycles. What do these counters mean? Many descriptions for counters are infuriating, l1d_pend_miss.pending especially so:

“This event counts duration of L1D miss outstanding, that is each
cycle number of Fill Buffers (FB) outstanding required by
Demand Reads. FB either is held by demand loads, or it is held by
non-demand loads and gets hit at least once by demand. The
valid outstanding interval is defined until the FB deallocation by
one of the following ways: from FB allocation, if FB is allocated
by demand; from the demand Hit FB, if it is allocated by
hardware or software prefetch.
Note: In the L1D, a Demand Read contains cacheable or
noncacheable demand loads, including ones causing cache-line
splits and reads due to page walks resulted from any request
type.” (source)

In the words of the Virgin Mary, come again? There is clarity in the terseness of the description in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.

L1D_PEND_MISS.PENDING Increments the number of outstanding L1D misses every cycle.
L1D_PEND_MISS.PENDING_CYCLES Cycles with at least one outstanding L1D misses from this logical processor.

The first counter records how many loads from non L1 memory locations are in flight during a cycle (that is, how many L1 cache misses are happening right now), increasing whenever there is at least one cache miss happening, and increasing until the load is complete. The second counter records how many cycles have some kind of outstanding cache miss in flight during the cycle. If there’s pipelining taking place, the first counter can increase by more than one per cycle, and if at least some work is done without experiencing a cache miss, the second counter will be less than the total number of cycles, and if there are two or more cache misses outstanding at the same time, the counter will take a smaller value than if the cache misses had taken place sequentially. Therefore, their ratio l1d_pend_miss.pending / l1d_pend_miss.pending_cyclesindicates how much memory level parallelism exists, that is, to what extent loads take place at the same time.

Can this be measured in JMH with the perfnorm profiler? Yes, I couldn’t find any documentation for it but reverse engineered this from the LinuxPerfNormProfiler source code:

-prof perfnorm:events=l1d_pend_miss.pending,l1d_pend_miss.pending_cycles

Note that this argument will override the standard events, so CPI, cycles and so on need to be added at the command line explicitly. Now the hypothetical parallel cache misses can be quantified. The figures for shuffle (the implementation without any staging) is reassuringly flat as a function of stage length, whereas a clear positive trend can be seen in both the throughput and MLP ratio for the staged implementation.

Mode Benchmark 8 16 32 64
PRECOMPUTED shuffle 0.347 0.352 0.345 0.37
PRECOMPUTED shuffle:l1d_pend_miss.pending 17390603073 17718936860 15979073823 20057689191
PRECOMPUTED shuffle:l1d_pend_miss.pending_cycles 3657159215 3706319384 3489256633 3930306563
PRECOMPUTED shuffle:ratio 4.76 4.78 4.58 5.10
THREAD_LOCAL_RANDOM shuffle 0.217 0.233 0.231 0.214
THREAD_LOCAL_RANDOM shuffle:l1d_pend_miss.pending 18246771955 17801360193 17736302365 19638836068
THREAD_LOCAL_RANDOM shuffle:l1d_pend_miss.pending_cycles 7280468758 7093396781 7086435578 7843415714
THREAD_LOCAL_RANDOM shuffle:ratio 2.51 2.51 2.50 2.50
THREAD_LOCAL_RANDOM shuffleBuffered 0.248 0.307 0.326 0.345
THREAD_LOCAL_RANDOM shuffleBuffered:l1d_pend_miss.pending 21899069718 23064517091 23320550954 22387833224
THREAD_LOCAL_RANDOM shuffleBuffered:l1d_pend_miss.pending_cycles 6203611528 5021906699 4539979273 4132226201
THREAD_LOCAL_RANDOM shuffleBuffered:ratio 3.53 4.59 5.14 5.42

One thought on “Observing Memory Level Parallelism with JMH

  1. Yeah, the limitation of the non-buffered approach is that the RNG instructions “clog up” the reorder buffer of the CPU, so not as many parallel misses are able to be in flight at once.

    It’s an example of non-additive performance. If you measure the time to compute all the random numbers (alone) as A, and the time to access all the locations randomly (alone) as B, then you might expect that when you interleave those operations (as in your loop that calculates one random index, and then accesses it), this total time C would be something close to A + B, or perhaps faster since work can “overlap” – but in this case it is opposite. The CPU work of A interferes with the memory-bound work of B, and vice-versa, and you get A + B < C.

    The buffering solution is kind of brute force: you have to do a lot of extra work now to store all the index values to memory, and read them back out – but the effect is strong enough that it works. Another solution is to try to pack more prefetches into the "window" in the interleaved solution: you can use prefetches to do this (but probably not easily in Java!).

Leave a Reply

Your email address will not be published. Required fields are marked *