Does Inlined Mean Streamlined? Part 1: Escape Analysis

There’s a lot of folklore about the importance of inlining in the JVM. Undoubtedly, inlining can improve performance by removing the overhead of function calls, but, more importantly, various optimisations are disabled or reduced in scope when it can’t happen. However, I think the importance of inlining is often overstated, especially considering the trade-off between flexibility and the ability to inline. This post is the first in a series where I use JMH to run simple experiments to assess the impact of failure to inline on C2’s ability to optimise programs. This post is about how inlining affects escape analysis, and whether you should care.

Inlining is the process of replacing function calls with the function’s code, much like denormalisation of databases. Just as database denormalisation can eliminate the overhead of performing joins at the expense of increasing the level of data duplication and therefore database size, inlining removes the overhead of function calls, at the expense of the amount of space required to represent the program. The analogy breaks down because copying the function’s code into the call site also aids an optimising compiler like C2 by increasing the scope of what can be optimised within a method, so C2 does this aggressively. It’s well known that there are two ways to confound inlining: code size (InlineSmallCode sets the limit of what can be inlined to 2KB by default), and having lots of polymorphism. Failure to inline can also be provoked by the JMH annotation @CompilerControl(DONT_INLINE).

In the first benchmark, I will look at a contrived example of the kind of small method you may find in Java code written in a functional style. Functional programming exploits monads, which represent a generic computation as a wrapper type, a wrapping operation known as the unit function, and a way to compose functions applied to the wrapper type, known as the bind function. You can also think of them as burritos. Some monadic types common in functionally tinged Java are Either (contains an instance of one type or another), Try (produces an output or an exception) and Optional, which exists in the JDK. One drawback of monadic types in Java is that the wrapper type needs to be materialised (rather than exist only as a figment of the compiler’s imagination) and risks being allocated.
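
In Optional terms, ofNullable plays the role of the unit function and flatMap plays the role of bind; a trivial sketch:

  // Optional as a monad: ofNullable is the unit function, flatMap is bind.
  static Optional<Integer> lengthOf(String maybeNull) {
    return Optional.ofNullable(maybeNull)        // unit: wrap a possibly null value
        .flatMap(s -> Optional.of(s.length()));  // bind: compose with a function returning Optional
  }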

Here is an interface exposing a method returning an Optional, intended to safely map a potentially null value of type S to Optional<T> via a mapping between the unwrapped types S and T. To avoid measuring the cost of different implementations, it is implemented in the same way four times, which is enough to exceed the threshold at which Hotspot gives up on inlining calls to the escapee.

public interface Escapee<T> {
  <S> Optional<T> map(S value, Function<S, T> mapper);
}

public class Escapee1<T> implements Escapee<T> {
  @Override
  public <S> Optional<T> map(S value, Function<S, T> mapper) {
    return Optional.ofNullable(value).map(mapper);
  }
}

In the benchmark, we can simulate conditions where we call between one and four implementations. We should probably expect the benchmark to behave differently when the input value is null because a different branch will be taken. To isolate the difference in throughput just for taking the other branch, the same function, which allocates an Instant, is evaluated on either branch. No attempt is made to make the branch unpredictable since it’s beside the point. Instant.now() is chosen because it is volatile and impure, meaning that its evaluation shouldn’t be eliminated by some other optimisation.

  @State(Scope.Benchmark)
  public static class InstantEscapeeState {
    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;
    @Param({"true", "false"})
    boolean isPresent;
    Escapee<Instant>[] escapees;
    int size = 4;
    String input;

    @Setup(Level.Trial)
    public void init() {
      escapees = new Escapee[size];
      scenario.fill(escapees);
      input = isPresent ? "" : null;
    }
  }

  @Benchmark
  @OperationsPerInvocation(4)
  public void mapValue(InstantEscapeeState state, Blackhole bh) {
    for (Escapee<Instant> escapee : state.escapees) {
      bh.consume(escapee.map(state.input, x -> Instant.now()).orElseGet(Instant::now));
    }
  }
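
The Scenario enum referenced in the setup above is not shown; a minimal sketch of how it might fill the array, assuming Escapee2 to Escapee4 are verbatim copies of Escapee1, is below. ONE repeats a single implementation while FOUR cycles through four distinct classes, making the call site progressively more polymorphic.

  // Hypothetical sketch of the Scenario enum: Escapee2-Escapee4 are assumed to be
  // copies of Escapee1, as described above.
  public enum Scenario {
    ONE(1), TWO(2), THREE(3), FOUR(4);

    private final int implementations;

    Scenario(int implementations) {
      this.implementations = implementations;
    }

    <T> void fill(Escapee<T>[] escapees) {
      for (int i = 0; i < escapees.length; ++i) {
        // cycle through as many distinct implementations as the scenario allows
        escapees[i] = newEscapee(i % implementations);
      }
    }

    private static <T> Escapee<T> newEscapee(int index) {
      switch (index) {
        case 0: return new Escapee1<>();
        case 1: return new Escapee2<>();
        case 2: return new Escapee3<>();
        default: return new Escapee4<>();
      }
    }
  }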

Based on common knowledge about C2’s inlining capabilities, we should expect scenarios THREE and FOUR not to inline, whereas ONE should be inlined, and TWO should be inlined with a conditional. Verifying this well-known outcome by printing inlining decisions with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining is trivial. See Aleksey Shipilëv’s authoritative post for reference.

The benchmark is run with the following arguments. Tiered compilation is disabled to bypass C1. A large heap is allocated to avoid measuring garbage collection pauses, and the low overhead SerialGC is selected to minimise interference from instrumented write barriers.

taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee.csv -prof gc \
-jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapValue$

Despite there being little absolute difference in throughput (the scenarios where we expect inlining to occur have slightly higher throughputs than when we expect inlining not to take place), the results are quite interesting.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: isPresent Param: scenario
mapValue thrpt 1 15 24.013132 0.459482 ops/us true ONE
mapValue thrpt 1 15 22.448583 0.430733 ops/us true TWO
mapValue thrpt 1 15 20.291617 0.898656 ops/us true THREE
mapValue thrpt 1 15 20.651088 0.552091 ops/us true FOUR
mapValue thrpt 1 15 24.625237 0.535002 ops/us false ONE
mapValue thrpt 1 15 24.039407 0.432007 ops/us false TWO
mapValue thrpt 1 15 21.976675 0.741998 ops/us false THREE
mapValue thrpt 1 15 22.183469 0.43514 ops/us false FOUR

The megamorphic cases are slightly faster when the input value is null, which highlights how easy it would be to fail to capture the relevant effects at all. When the input value is always null, and when there is only one implementation and the input value is not null, the normalised allocation rates are all 24B/op, just over half that of the non-null-input, multi-implementation scenarios, which are all about the same at 40B/op.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: isPresent Param: scenario
mapValue:·gc.alloc.rate.norm thrpt 1 15 24.000017 0.000001 B/op true ONE
mapValue:·gc.alloc.rate.norm thrpt 1 15 40.000018 0.000001 B/op true TWO
mapValue:·gc.alloc.rate.norm thrpt 1 15 40.00002 0.000001 B/op true THREE
mapValue:·gc.alloc.rate.norm thrpt 1 15 40.00002 0.000001 B/op true FOUR
mapValue:·gc.alloc.rate.norm thrpt 1 15 24.000017 0.000001 B/op false ONE
mapValue:·gc.alloc.rate.norm thrpt 1 15 24.000017 0.000001 B/op false TWO
mapValue:·gc.alloc.rate.norm thrpt 1 15 24.000019 0.000001 B/op false THREE
mapValue:·gc.alloc.rate.norm thrpt 1 15 24.000019 0.000001 B/op false FOUR

24B/op is the size of instances of the Instant class (when a simple garbage collector like SerialGC is used), which contains an 8 byte number of seconds since 1970 and a 4 byte number of nanoseconds, plus a 12 byte object header. So the wrapper type can’t have been allocated in these cases! 40B/op includes the 16 bytes taken up by the materialised Optional (12 bytes for the header and 4 bytes for a compressed reference to the Instant). The difference is caused by the limitations of escape analysis: it gives up trying to prove allocation is unnecessary whenever the allocating method can’t be inlined, and incidentally gives up when the allocation takes place within a conditional statement. In scenario TWO, a conditional statement is introduced by inlining two possible implementations, which means each operation allocates the 16 bytes required for the optional.
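
These layouts are easy to corroborate with JOL, OpenJDK’s Java Object Layout tool, which is not used here but confirms the arithmetic above; a minimal sketch (the reported sizes depend on the JVM and its flags, e.g. compressed oops):

import org.openjdk.jol.info.ClassLayout;

import java.time.Instant;
import java.util.Optional;

// Prints field-by-field layouts: with compressed oops, Instant comes out at 24 bytes
// and Optional at 16 bytes, matching the normalised allocation rates above.
public class Layouts {
  public static void main(String[] args) {
    System.out.println(ClassLayout.parseClass(Instant.class).toPrintable());
    System.out.println(ClassLayout.parseClass(Optional.class).toPrintable());
  }
}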

The signal is fairly weak in this benchmark, and is almost entirely masked by the fact that the benchmark allocates a 24 byte Instant per invocation. To accentuate the difference, we can remove the background allocation from the benchmark and track the same metrics.


  @State(Scope.Benchmark)
  public static class StringEscapeeState {
    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;
    @Param({"true", "false"})
    boolean isPresent;
    Escapee<String>[] escapees;
    int size = 4;
    String input;
    String ifPresent;
    String ifAbsent;

    @Setup(Level.Trial)
    public void init() {
      escapees = new Escapee[size];
      scenario.fill(escapees);
      ifPresent = UUID.randomUUID().toString();
      ifAbsent = UUID.randomUUID().toString();
      input = isPresent ? "" : null;
    }
  }

  @Benchmark
  @OperationsPerInvocation(4)
  public void mapValueNoAllocation(StringEscapeeState state, Blackhole bh) {
    for (Escapee<String> escapee : state.escapees) {
      bh.consume(escapee.map(state.input, x -> state.ifPresent).orElseGet(() -> state.ifAbsent));
    }
  }

taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-string.csv -prof gc \
-jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapValueNoAllocation

While the cost of even very low intensity realistic work (allocating a timestamp) is enough to dilute the impact of the failure to inline, when the virtual call is a no-op that impact can be made to look quite severe. ONE and TWO are much faster because they at least eliminate the virtual function call in each case, no matter whether the input is null or not.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: isPresent Param: scenario
mapValueNoAllocation thrpt 1 15 206.913491 3.003555 ops/us true ONE
mapValueNoAllocation thrpt 1 15 162.014816 4.353872 ops/us true TWO
mapValueNoAllocation thrpt 1 15 77.959095 2.174789 ops/us true THREE
mapValueNoAllocation thrpt 1 15 77.845562 3.592952 ops/us true FOUR
mapValueNoAllocation thrpt 1 15 202.016045 2.830117 ops/us false ONE
mapValueNoAllocation thrpt 1 15 198.241125 2.351662 ops/us false TWO
mapValueNoAllocation thrpt 1 15 88.187145 3.908423 ops/us false THREE
mapValueNoAllocation thrpt 1 15 89.715024 2.234652 ops/us false FOUR

It’s easy to imagine that allocation has been curtailed, only to be caught out by the limitations of escape analysis in the presence of polymorphism. In scenario ONE, there is never any allocation: escape analysis must have worked. In scenario TWO, because of the inlined conditional, the 16 byte Optional is allocated once per invocation with non-null input, and when the input is always null there are no allocations at all. However, when the inlining doesn’t work in scenarios THREE and FOUR, an extra 16 bytes is allocated once per invocation, but it isn’t the Optional: the unintentional 16 bytes is the capturing lambda passed to the virtual call (a 12 byte header and a 4 byte compressed reference to the captured state). How often do you check your benchmarks to ensure you are measuring what you think you are?

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: isPresent Param: scenario
mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 0.000002 0 B/op true ONE
mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 16.000003 0 B/op true TWO
mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 32.000005 0 B/op true THREE
mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 32.000005 0 B/op true FOUR
mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 0.000002 0 B/op false ONE
mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 0.000002 0 B/op false TWO
mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 16.000005 0 B/op false THREE
mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 16.000005 0 B/op false FOUR
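
One way to remove the unintended allocation highlighted above would be to hoist the lambdas into the benchmark state, so that nothing needs to be captured per invocation; a hedged sketch (the class and field names are mine, not from the benchmarks above):

  // Hypothetical variant: the mapper and fallback are allocated once during setup,
  // so the only allocation escape analysis can fail to eliminate is the Optional itself.
  @State(Scope.Benchmark)
  public static class PreallocatedLambdaState {
    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;
    @Param({"true", "false"})
    boolean isPresent;
    Escapee<String>[] escapees;
    int size = 4;
    String input;
    Function<String, String> mapper;
    Supplier<String> fallback;

    @Setup(Level.Trial)
    public void init() {
      escapees = new Escapee[size];
      scenario.fill(escapees);
      String ifPresent = UUID.randomUUID().toString();
      String ifAbsent = UUID.randomUUID().toString();
      mapper = x -> ifPresent;    // allocated here, once, rather than once per invocation
      fallback = () -> ifAbsent;
      input = isPresent ? "" : null;
    }
  }

  @Benchmark
  @OperationsPerInvocation(4)
  public void mapValueNoCapture(PreallocatedLambdaState state, Blackhole bh) {
    for (Escapee<String> escapee : state.escapees) {
      bh.consume(escapee.map(state.input, state.mapper).orElseGet(state.fallback));
    }
  }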

It’s not the sort of thing that can be exploited in real programs, but it looks as if allocations are better eliminated when the method, be it virtual or inlined, only ever sees a null value. Actually, Optional.empty() always returns the same instance, so there were no allocations in the first place.

Having contrived a case to accentuate the effect, it’s worth noting that the impact of failure to inline is smaller than the difference in the cost of allocating an instance and storing the value with different garbage collectors, which is a cost some developers seem to be unaware of.


  @State(Scope.Benchmark)
  public static class InstantStoreEscapeeState {
    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;
    @Param({"true", "false"})
    boolean isPresent;
    int size = 4;
    String input;
    Escapee<Instant>[] escapees;
    Instant[] target;

    @Setup(Level.Trial)
    public void init() {
      escapees = new Escapee[size];
      target = new Instant[size];
      scenario.fill(escapees);
      input = isPresent ? "" : null;
    }
  }

  @Benchmark
  @OperationsPerInvocation(4)
  public void mapAndStoreValue(InstantStoreEscapeeState state, Blackhole bh) {
    for (int i = 0; i < state.escapees.length; ++i) {
      state.target[i] = state.escapees[i].map(state.input, x -> Instant.now()).orElseGet(Instant::now);
    }
    bh.consume(state.target);
  }

I run the same benchmark in two modes:

taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-store-serial.csv \
-prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapAndStoreValue$

taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-store-g1.csv \
-prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseG1GC -mx8G" EscapeeBenchmark.mapAndStoreValue$

The difference made by changing the garbage collector, and hence the write barriers triggered by the reference stores (simple in the case of the serial collector, complex in the case of G1), is about as large as the cost of missing out on inlining. Note that this is not an argument that garbage collector overhead is unacceptable!

Benchmark GC Mode Threads Samples Score Score Error (99.9%) Unit Param: isPresent Param: scenario
mapAndStoreValue SerialGC thrpt 1 15 23.739993 0.297493 ops/us true ONE
mapAndStoreValue SerialGC thrpt 1 15 22.41715 0.502928 ops/us true TWO
mapAndStoreValue SerialGC thrpt 1 15 21.096494 0.629228 ops/us true THREE
mapAndStoreValue SerialGC thrpt 1 15 20.656528 0.604725 ops/us true FOUR
mapAndStoreValue SerialGC thrpt 1 15 24.098976 0.479819 ops/us false ONE
mapAndStoreValue SerialGC thrpt 1 15 23.759017 0.460972 ops/us false TWO
mapAndStoreValue SerialGC thrpt 1 15 21.473803 0.411786 ops/us false THREE
mapAndStoreValue SerialGC thrpt 1 15 21.524173 0.393322 ops/us false FOUR
mapAndStoreValue G1GC thrpt 1 15 20.522258 0.463444 ops/us true ONE
mapAndStoreValue G1GC thrpt 1 15 18.520677 0.229133 ops/us true TWO
mapAndStoreValue G1GC thrpt 1 15 18.359042 0.276809 ops/us true THREE
mapAndStoreValue G1GC thrpt 1 15 18.446654 0.272189 ops/us true FOUR
mapAndStoreValue G1GC thrpt 1 15 20.768856 0.496087 ops/us false ONE
mapAndStoreValue G1GC thrpt 1 15 20.277051 0.411466 ops/us false TWO
mapAndStoreValue G1GC thrpt 1 15 18.875519 0.399535 ops/us false THREE
mapAndStoreValue G1GC thrpt 1 15 18.824234 0.657469 ops/us false FOUR

Inlining makes escape analysis possible, but escape analysis is only effective here when a single implementation is used. The marginal benefit decreases in the presence of even trivial allocation, but can be expected to increase with the size of the eliminated allocation. The difference can even be smaller than the runtime cost of write barriers in some garbage collectors. My benchmarks are on GitHub; they were run with OpenJDK 11+28 on Ubuntu 18.04.2 LTS.

Perhaps this analysis is facile; many optimisations more powerful than escape analysis depend on inlining. The next post in the series will be on the benefits, or lack thereof, of inlining a reduction operation such as a hash code.

Garbage Collectors Affect Microbenchmarks

When comparing garbage collectors there are two key metrics: how much time is spent collecting garbage, and the maximum pause time. There’s another dimension to the choice of garbage collector though: how it instruments JIT compiled code and the consequences of that instrumentation. The cost of this instrumentation is usually a tiny price to pay for improved pause times which only matters to some applications, but it makes writing benchmarks for code which assigns and reads references potentially error prone: sometimes the effect of changing the garbage collector is larger than the difference between two competing implementations. To illustrate this I compare a microbenchmark for a document cursor with three garbage collectors: ParallelOld (the default in OpenJDK8), G1 (the default from OpenJDK 9 onwards) and the experimental ZGC available from JDK11 onwards.

The code being benchmarked is simple. Imagine a stream of JSON-like documents which need to be translated into another format. The documents contain a special field called the cursor, for which, for some reason, the last-encountered value must always be known. There is a callback which will be invoked whenever a value of a certain type is encountered (e.g. writeLong(long value)) and a callback which will be invoked whenever a name of an attribute is encountered: writeName(String name). The interface being implemented cannot be changed to include a method writeLong(String name, long value) because it is owned by a third party, so the state between the two calls must be saved between the invocations. On each invocation of the writeName callback, we could save the name in the cursor object.
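
For reference, the CursoredScanner interface implied by the two implementations below looks something like this sketch (the real interface may declare more callbacks):

// Hedged sketch of the third-party callback interface implemented below.
public interface CursoredScanner {
  void writeName(String name);
  void writeLong(long value);
  long getCursor();
}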


public class CursoredScanner2 implements CursoredScanner {

  private final String trigger;
  private String current;
  private long cursor;

  public CursoredScanner2(String trigger) {
    this.trigger = trigger;
  }

  @Override
  public void writeName(String name) {
    this.current = name;
  }

  @Override
  public void writeLong(long value) {
    if (trigger.equals(current)) {
      this.cursor = value;
    }
  }

  @Override
  public long getCursor() {
    return cursor;
  }

}

Alternatively, we could do the same number of comparisons by storing whether the last name was the name of the cursor or not:


public class CursoredScanner1 implements CursoredScanner {

  private final String trigger;

  private boolean atCursor;
  private long cursor;

  public CursoredScanner1(String trigger) {
    this.trigger = trigger;
  }

  @Override
  public void writeName(String name) {
    this.atCursor = trigger.equals(name);
  }

  @Override
  public void writeLong(long value) {
    if (atCursor) {
      this.cursor = value;
    }
  }

  @Override
  public long getCursor() {
    return cursor;
  }
}

Each implementation performs the same number of string comparisons. Supposing performance matters, how can one of the alternatives be selected? I wrote a benchmark which captures the cursor value from documents of varying sizes. I ran this benchmark with different garbage collector settings with JDK11. With ParallelOld, I saw that CursoredScanner2 was slightly slower.
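
The benchmark itself is not reproduced here; a minimal sketch of what such a benchmark might look like, with the document modelled as parallel arrays of names and values (the names and setup below are assumptions, chosen only to match the parameters in the results), is:

  // Hypothetical reconstruction of the scan benchmark: drive the callbacks over a
  // synthetic "document" of the given size and consume the cursor at the end.
  @State(Scope.Benchmark)
  public static class ScannerState {
    @Param({"SCANNER1", "SCANNER2"})
    String scannerType;
    @Param({"10", "100"})
    int size;
    @Param({"trigger1"})
    String triggerName;

    String[] names;
    long[] values;
    CursoredScanner scanner;

    @Setup(Level.Trial)
    public void init() {
      names = new String[size];
      values = new long[size];
      for (int i = 0; i < size; ++i) {
        names[i] = "attribute" + i;
        values[i] = i;
      }
      names[size / 2] = triggerName; // make sure the trigger name actually occurs
      scanner = "SCANNER1".equals(scannerType)
          ? new CursoredScanner1(triggerName)
          : new CursoredScanner2(triggerName);
    }
  }

  @Benchmark
  public long scan(ScannerState state) {
    CursoredScanner scanner = state.scanner;
    for (int i = 0; i < state.names.length; ++i) {
      scanner.writeName(state.names[i]);
      scanner.writeLong(state.values[i]);
    }
    return scanner.getCursor();
  }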

-XX:+UseCondCardMark -XX:+UseParallelOldGC -mx1G -XX:+AlwaysPreTouch

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: scannerType Param: size Param: triggerName
scan thrpt 1 15 58.081438 1.008727 ops/us SCANNER1 10 trigger1
scan thrpt 1 15 6.586134 0.173920 ops/us SCANNER1 100 trigger1
scan thrpt 1 15 49.402537 0.943554 ops/us SCANNER2 10 trigger1
scan thrpt 1 15 5.248657 0.135281 ops/us SCANNER2 100 trigger1

The cost here can be attributed to the card marking which keeps the approximation of inter-generational references up to date when references are assigned (see here). By avoiding assigning the reference in CursoredScanner1, the garbage collector doesn’t need to instrument anything at all, because the object graph isn’t being mutated.
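
Conceptually, the barrier emitted for ParallelOld amounts to dirtying the card covering the written field, and -XX:+UseCondCardMark makes that write conditional; the sketch below is an illustration only, not HotSpot’s actual code:

  // Conceptual sketch of a post-write card-mark barrier (illustrative, not HotSpot code).
  // The card table maps each 512-byte region of heap to one byte; dirtying the card
  // lets the collector find old-to-young references without scanning the whole heap.
  final class CardMarkSketch {
    static final byte DIRTY = 0;
    static final byte[] CARD_TABLE = new byte[1 << 20]; // sized for illustration only

    static void postWriteBarrier(long fieldAddress) {
      int card = (int) (fieldAddress >>> 9);  // 512-byte cards
      if (CARD_TABLE[card] != DIRTY) {        // the check added by UseCondCardMark
        CARD_TABLE[card] = DIRTY;
      }
    }
  }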

While G1 offers significant reductions in maximum pause times by structuring the heap and tracking references differently, it also instruments reference assignments to keep its book-keeping data structures up to date. The effect of this is pronounced in this benchmark; the barrier shows up in the perfasm output, with some skid implicating the innocent adjacent instruction.

-XX:+UseG1GC -mx1G -XX:+AlwaysPreTouch

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: scannerType Param: size Param: triggerName
scan thrpt 1 15 62.633572 0.995514 ops/us SCANNER1 10 trigger1
scan thrpt 1 15 7.660122 0.231402 ops/us SCANNER1 100 trigger1
scan thrpt 1 15 23.833586 0.379903 ops/us SCANNER2 10 trigger1
scan thrpt 1 15 2.757419 0.148344 ops/us SCANNER2 100 trigger1

What about ZGC, one of the two upcoming ultra low pause garbage collectors? I can’t pretend to understand in detail how ZGC works (beyond what I can glean from profilers) but suffice it to say: it works differently, and instruments application code differently. Rather than intercepting reference assignment, it seems to intercept reads. It’s not clear why both implementations perform slightly worse than CursoredScanner1 with G1 or ParallelOld, but there’s not much to choose between the two when using ZGC.

-XX:+UnlockExperimentalVMOptions -XX:+UseZGC -mx1G -XX:+AlwaysPreTouch

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: scannerType Param: size Param: triggerName
scan thrpt 1 15 43.761915 1.160516 ops/us SCANNER1 10 trigger1
scan thrpt 1 15 6.190803 0.101114 ops/us SCANNER1 100 trigger1
scan thrpt 1 15 39.080922 0.826591 ops/us SCANNER2 10 trigger1
scan thrpt 1 15 4.763075 0.126938 ops/us SCANNER2 100 trigger1

Am I making a case for using ParallelOld and avoiding assigning references because the throughput is slightly better? Not really: while it’s possible that’s appropriate in some applications, the point is that unless benchmarks focus exclusively on primitive types, the garbage collector has to be considered, and results need to be qualified by this choice. It would be very hard to choose between these implementations without knowing in advance which garbage collector would be in use.

As an aside, this is the first time I have run ZGC, so I’m keen to track down the read barrier I have heard about. It looks like the sequence of instructions mov, test, jne occurs on each read:


0x00007fe47c765295: mov    0x30(%r10),%r9
0x00007fe47c765299: test   %r9,0x20(%r15)
0x00007fe47c76529d: jne    0x00007fe47c765b68 
0x00007fe47c7652a3: mov    0x10(%r9),%r14    
0x00007fe47c7652a7: test   %r14,0x20(%r15)
0x00007fe47c7652ab: jne    0x00007fe47c765b76  

The assembly above can be seen whenever a reference is read, and sets up a data dependency between reads: the mov instruction must happen before the test instruction, which may trigger the jne, so the second mov must depend on the jump and can’t be reordered. I was wondering what the purpose of this was, whether the data dependency was the means or the end, and what’s in r15, and found a decent article about this. Aleksey Shipilëv, who writes garbage collectors for a living and is better placed to interpret this output, gave some feedback: in Hotspot, r15 is the base address for thread local storage. Here, r15 + 0x20 is ZGC’s so-called bad mask, and failing the test means that the object needs to be marked or relocated. Neither marking nor relocation actually shows up in this profile because there isn’t enough garbage generated to trigger them, so the code at 0x00007fe47c765b68 can’t be seen. If the test passes, nothing needs to happen and the next reference is read (and intercepted itself). What jumps out at me here is the data dependency, but there’s also no obvious bottleneck in the profile.

Observing Memory Level Parallelism with JMH

Quite some time ago I observed an effect where breaking a cache-inefficient shuffle algorithm into short stages could improve throughput: when cache misses were likely, an improvement could be seen in throughput as a function of stage length. The implementations benchmarked were as follows, where op is either precomputed (a closure over an array of indices to swap) or a call to ThreadLocalRandom:


  @Benchmark
  public void shuffle(Blackhole bh) {
    for (int i = data.length; i > 1; i--)
      swap(data, i - 1, op.applyAsInt(i));
    bh.consume(data);
  }

  @Benchmark
  public void shuffleBuffered(Blackhole bh) {
    for (int i = data.length; i - unroll > 1; i -= unroll) {
      for (int j = 0; j < buffer.length; ++j) {
        buffer[j] = op.applyAsInt(i - j);
      }
      for (int j = 0; j < buffer.length; ++j) {
        swap(data, i - j - 1, buffer[j]);
      }
    }
    bh.consume(data);
  }

  private static void swap(int[] arr, int i, int j) {
    arr[i] ^= arr[j];
    arr[j] ^= arr[i];
    arr[i] ^= arr[j];
  }

I couldn’t explain the effect observed in terms of the default performance counters made available by JMH, but offered an intuitive explanation: the cache miss latency could be overlapped across four independent chains of execution, so that a cache miss in one chain would not stall the others. This intuition was gleaned from perfasm: I guessed the bottleneck on this load was due to cache misses. In the simple shuffle, there was one big hit:


 73.97%   63.58%  │      ││  0x00007f8405c0a272: mov    %eax,0xc(%r11,%rcx,4) 

Executing the staged shuffle, I saw several smaller bottlenecks and could only guess the simpler code within each stage had more parallelism; these cache misses were happening at the same time and independently.


 10.33%   11.23%   ││  0x00007fdb35c09250: mov    %r9d,0xc(%rsi,%r10,4)  
 ...  
 10.40%   10.66%   ││  0x00007fdb35c09283: mov    %r9d,0x8(%rsi,%r10,4) 

Travis Downs left a great comment on the post pointing me in the direction of the counters l1d_pend_miss.pending and l1d_pend_miss.pending_cycles. What do these counters mean? Many descriptions for counters are infuriating, l1d_pend_miss.pending especially so:

“This event counts duration of L1D miss outstanding, that is each
cycle number of Fill Buffers (FB) outstanding required by
Demand Reads. FB either is held by demand loads, or it is held by
non-demand loads and gets hit at least once by demand. The
valid outstanding interval is defined until the FB deallocation by
one of the following ways: from FB allocation, if FB is allocated
by demand; from the demand Hit FB, if it is allocated by
hardware or software prefetch.
Note: In the L1D, a Demand Read contains cacheable or
noncacheable demand loads, including ones causing cache-line
splits and reads due to page walks resulted from any request
type.” (source)

In the words of the Virgin Mary, come again? There is clarity in the terseness of the description in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.

L1D_PEND_MISS.PENDING Increments the number of outstanding L1D misses every cycle.
L1D_PEND_MISS.PENDING_CYCLES Cycles with at least one outstanding L1D misses from this logical processor.

The first counter records how many loads from beyond L1 are in flight during each cycle (that is, how many L1 cache misses are outstanding right now), so it can increase by more than one per cycle when misses overlap. The second counter records how many cycles had at least one outstanding L1D miss: if some work is done without a cache miss it will be less than the total number of cycles, and if two or more misses are outstanding at the same time it will be smaller than if the misses had happened sequentially. Their ratio, l1d_pend_miss.pending / l1d_pend_miss.pending_cycles, therefore indicates how much memory level parallelism exists, that is, to what extent loads take place at the same time. For example, a ratio of five would mean that, on average, five misses were in flight during each cycle that had any miss outstanding.

Can this be measured in JMH with the perfnorm profiler? Yes; I couldn’t find any documentation for it, but reverse engineered this from the LinuxPerfNormProfiler source code:

-prof perfnorm:events=l1d_pend_miss.pending,l1d_pend_miss.pending_cycles

Note that this argument will override the standard events, so CPI, cycles and so on need to be added at the command line explicitly. Now the hypothetical parallel cache misses can be quantified. The figures for shuffle (the implementation without any staging) are reassuringly flat as a function of stage length, whereas a clear positive trend can be seen in both the throughput and the MLP ratio for the staged implementation.

Mode Benchmark 8 16 32 64 (stage length)
PRECOMPUTED shuffle 0.347 0.352 0.345 0.37
PRECOMPUTED shuffle:l1d_pend_miss.pending 17390603073 17718936860 15979073823 20057689191
PRECOMPUTED shuffle:l1d_pend_miss.pending_cycles 3657159215 3706319384 3489256633 3930306563
PRECOMPUTED shuffle:ratio 4.76 4.78 4.58 5.10
THREAD_LOCAL_RANDOM shuffle 0.217 0.233 0.231 0.214
THREAD_LOCAL_RANDOM shuffle:l1d_pend_miss.pending 18246771955 17801360193 17736302365 19638836068
THREAD_LOCAL_RANDOM shuffle:l1d_pend_miss.pending_cycles 7280468758 7093396781 7086435578 7843415714
THREAD_LOCAL_RANDOM shuffle:ratio 2.51 2.51 2.50 2.50
THREAD_LOCAL_RANDOM shuffleBuffered 0.248 0.307 0.326 0.345
THREAD_LOCAL_RANDOM shuffleBuffered:l1d_pend_miss.pending 21899069718 23064517091 23320550954 22387833224
THREAD_LOCAL_RANDOM shuffleBuffered:l1d_pend_miss.pending_cycles 6203611528 5021906699 4539979273 4132226201
THREAD_LOCAL_RANDOM shuffleBuffered:ratio 3.53 4.59 5.14 5.42

Mixing Vector and Scalar Instructions

I saw an interesting tweet from one of the developers of Pilosa this week, reporting performance improvements from unrolling a bitwise reduction in Go. This surprised me because Go seems to enjoy a reputation for being a high performance language, and it certainly has great support for concurrency, but compilers should unroll loops as standard so you don’t have to. Having been written in Go doesn’t seem to have hampered Pilosa, because they have some great benchmark numbers, and it certainly helps that they built their technology on top of a smart data structure: the roaring bitmap. You can read about their data model for yourself, but Pilosa is basically a large bit matrix which materialises relations between rows and columns by setting the bit at their intersection, for instance, genomes on rows to k-mers (sequences of bases like “GATTACA”) on columns. To compute the Hamming similarity between the genomes of two people (i.e. how many k-mers they have in common), Pilosa just needs to intersect the bitmaps of rows representing each genome and count the number of bits in the result. The intersection doesn’t even need to be materialised, and can be calculated on the fly as a dot product. What piqued my interest though was that the Pilosa developers had experimented with combining vector and scalar instructions and had found it unprofitable. Once there is a Vector API in Java, what will happen when there’s a gap that can only be plugged with a scalar implementation?

I don’t know much about Go but I get the impression its compiler is a long way behind C2. The instruction POPCNTQ, the bottleneck in this tale, has only recently been available to Go programmers in math/bits/bits.go, with demonstrable apathy for its inclusion in the standard library. As a point of comparison, Long.bitCount has been translated to POPCNTQ by C2 for longer than I have been using Java. If you want to do bioinformatics in Java, whatever you do, don’t unroll your loops! The unrolled version below will be slightly slower than the simple loop.


  @Benchmark
  public int popcnt() {
    int cardinality = 0;
    for (int i = 0; i < size && i < left.length && i < right.length; ++i) {
      cardinality += Long.bitCount(left[i] & right[i]);
    }
    return cardinality;
  }

  @Benchmark
  public int unrolledPopcnt() {
    int cardinality1 = 0;
    int cardinality2 = 0;
    int cardinality3 = 0;
    int cardinality4 = 0;
    for (int i = 0; i < size && i < left.length && i < right.length; i += 4) {
      cardinality1 += Long.bitCount(left[i+0] & right[i+0]);
      cardinality2 += Long.bitCount(left[i+1] & right[i+1]);
      cardinality3 += Long.bitCount(left[i+2] & right[i+2]);
      cardinality4 += Long.bitCount(left[i+3] & right[i+3]);
    }
    return cardinality1 + cardinality2 + cardinality3 + cardinality4;
  }

Ignoring the unrolled version because it’s a dead end, does C2 vectorise this reduction? No, because it can’t vectorise the bit count, but notice the floating point spills at the start for better register placement.


         ││ ↗          0x00007fe418240c4c: vmovq  %xmm0,%r9
         ││ │          0x00007fe418240c51: vmovq  %xmm1,%r8
  0.00%  ││ │      ↗   0x00007fe418240c56: vmovq  %r9,%xmm0
  0.04%  ││ │      │   0x00007fe418240c5b: vmovq  %r8,%xmm1                      
  1.71%  ││↗│      │   0x00007fe418240c60: movslq %ecx,%r9
  1.42%  ││││      │   0x00007fe418240c63: mov    0x10(%rbx,%rcx,8),%r10
  8.98%  ││││      │   0x00007fe418240c68: and    0x10(%rdi,%rcx,8),%r10
  3.51%  ││││      │   0x00007fe418240c6d: popcnt %r10,%r8
  3.21%  ││││      │   0x00007fe418240c72: add    %r8d,%edx
  2.48%  ││││      │   0x00007fe418240c75: mov    0x28(%rbx,%r9,8),%r10
  8.19%  ││││      │   0x00007fe418240c7a: and    0x28(%rdi,%r9,8),%r10
  3.59%  ││││      │   0x00007fe418240c7f: popcnt %r10,%r10
  3.73%  ││││      │   0x00007fe418240c84: mov    0x20(%rbx,%r9,8),%r8
  2.16%  ││││      │   0x00007fe418240c89: and    0x20(%rdi,%r9,8),%r8
  7.53%  ││││      │   0x00007fe418240c8e: popcnt %r8,%rsi
  6.21%  ││││      │   0x00007fe418240c93: mov    0x18(%rbx,%r9,8),%r8           
  2.30%  ││││      │   0x00007fe418240c98: and    0x18(%rdi,%r9,8),%r8
  2.07%  ││││      │   0x00007fe418240c9d: popcnt %r8,%r9
 12.75%  ││││      │   0x00007fe418240ca2: add    %r9d,%edx
  6.01%  ││││      │   0x00007fe418240ca5: add    %esi,%edx
  5.70%  ││││      │   0x00007fe418240ca7: add    %r10d,%edx                     
  8.60%  ││││      │   0x00007fe418240caa: add    $0x4,%ecx                      
  3.58%  ││││      │   0x00007fe418240cad: cmp    %r11d,%ecx
         ││╰│      │   0x00007fe418240cb0: jl     0x00007fe418240c60             
  0.04%  ││ │      │   0x00007fe418240cb2: mov    0x108(%r15),%r10               
  0.05%  ││ │      │   0x00007fe418240cb9: test   %eax,(%r10)                    
  0.29%  ││ │      │   0x00007fe418240cbc: cmp    %r11d,%ecx
         ││ ╰      │   0x00007fe418240cbf: jl     0x00007fe418240c4c

It’s nice that very good scalar code gets generated for this loop from the simplest possible code, but what if you want to go faster with vectorisation? There is no vector bit count until the VPOPCNTD/VPOPCNTQ AVX512 extension, currently only available on the Knights Mill processor, which is tantamount to there being no vector bit count instruction. There is a vector bit count algorithm originally written by Wojciech Mula for SSE3, and updated for AVX2 by Wojciech Mula and Daniel Lemire, which is used in clang. I made an attempt at implementing this using the Vector API a few months ago and found what felt like a few gaps, but may look at this again soon.

I looked at a few ways of writing mixed loops, using the Vector API and Long.bitCount and found that there wasn’t much to be gained from partial vectorisation. There is a method for extracting scalar values from vectors: LongVector::get, which is very interesting because it highlights the gaps the JIT compiler needs to fill in on the wrong hardware, and why you should read the assembly code emitted from a benchmark before jumping to conclusions. Here’s the code and below it the hot part of the loop.


  @Benchmark
  public int vpandExtractPopcnt() {
    int cardinality = 0;
    for (int i = 0; i < size && i < left.length && i < right.length; i += 4) {
      var intersection = YMM_LONG.fromArray(left, i).and(YMM_LONG.fromArray(right, i));
      cardinality += Long.bitCount(intersection.get(0));
      cardinality += Long.bitCount(intersection.get(1));
      cardinality += Long.bitCount(intersection.get(2));
      cardinality += Long.bitCount(intersection.get(3));
    }
    return cardinality;
  }


  0.43%  ││        ↗   0x00007fbfe024bb55: vmovdqu 0x10(%rax,%rcx,8),%ymm2
  2.58%  ││        │   0x00007fbfe024bb5b: vpand  0x10(%r13,%rcx,8),%ymm2,%ymm8 
  0.32%  ││        │   0x00007fbfe024bb62: movslq %ecx,%r10                     
  0.43%  ││        │   0x00007fbfe024bb65: vmovdqu 0x70(%rax,%r10,8),%ymm2       
  2.65%  ││        │   0x00007fbfe024bb6c: vpand  0x70(%r13,%r10,8),%ymm2,%ymm9  
  0.84%  ││        │   0x00007fbfe024bb73: vmovdqu 0x30(%rax,%r10,8),%ymm2
  0.46%  ││        │   0x00007fbfe024bb7a: vpand  0x30(%r13,%r10,8),%ymm2,%ymm10  
  3.06%  ││        │   0x00007fbfe024bb81: vmovdqu 0x50(%rax,%r10,8),%ymm6       
  0.03%  ││        │   0x00007fbfe024bb88: vmovq  %rax,%xmm2
  0.42%  ││        │   0x00007fbfe024bb8d: vpand  0x50(%r13,%r10,8),%ymm6,%ymm11
  2.60%  ││        │   0x00007fbfe024bb94: vmovq  %xmm8,%r10
  0.01%  ││        │   0x00007fbfe024bb99: popcnt %r10,%rbp
  0.51%  ││        │   0x00007fbfe024bb9e: add    %edx,%ebp
  3.86%  ││        │   0x00007fbfe024bba0: vmovq  %xmm10,%r10
  0.15%  ││        │   0x00007fbfe024bba5: popcnt %r10,%rax
  0.43%  ││        │   0x00007fbfe024bbaa: vmovq  %xmm11,%r10
  0.41%  ││        │   0x00007fbfe024bbaf: popcnt %r10,%r14
  2.84%  ││        │   0x00007fbfe024bbb4: vmovq  %xmm9,%r10
  0.18%  ││        │   0x00007fbfe024bbb9: popcnt %r10,%rdx
  0.35%  ││        │   0x00007fbfe024bbbe: vextracti128 $0x1,%ymm9,%xmm6
  0.41%  ││        │   0x00007fbfe024bbc4: vpextrq $0x0,%xmm6,%r10
  2.62%  ││        │   0x00007fbfe024bbca: popcnt %r10,%r10
  1.45%  ││        │   0x00007fbfe024bbcf: vextracti128 $0x1,%ymm9,%xmm6
  0.42%  ││        │   0x00007fbfe024bbd5: vpextrq $0x1,%xmm6,%r11
  2.20%  ││        │   0x00007fbfe024bbdb: popcnt %r11,%r8
  1.34%  ││        │   0x00007fbfe024bbe0: vpextrq $0x1,%xmm9,%r11
  2.44%  ││        │   0x00007fbfe024bbe6: popcnt %r11,%r11
  0.21%  ││        │   0x00007fbfe024bbeb: vpextrq $0x1,%xmm10,%r9
  0.97%  ││        │   0x00007fbfe024bbf1: popcnt %r9,%r9
  2.17%  ││        │   0x00007fbfe024bbf6: vextracti128 $0x1,%ymm8,%xmm6
  0.22%  ││        │   0x00007fbfe024bbfc: vpextrq $0x1,%xmm6,%rbx
  1.10%  ││        │   0x00007fbfe024bc02: popcnt %rbx,%rbx
  2.46%  ││        │   0x00007fbfe024bc07: vextracti128 $0x1,%ymm8,%xmm6
  0.22%  ││        │   0x00007fbfe024bc0d: vpextrq $0x0,%xmm6,%rdi
  1.00%  ││        │   0x00007fbfe024bc13: popcnt %rdi,%rsi
  2.64%  ││        │   0x00007fbfe024bc18: vpextrq $0x1,%xmm8,%rdi
  0.80%  ││        │   0x00007fbfe024bc1e: popcnt %rdi,%rdi
  0.35%  ││        │   0x00007fbfe024bc23: add    %edi,%ebp
  3.42%  ││        │   0x00007fbfe024bc25: add    %esi,%ebp
  0.38%  ││        │   0x00007fbfe024bc27: add    %ebx,%ebp
  0.70%  ││        │   0x00007fbfe024bc29: add    %ebp,%eax
  0.84%  ││        │   0x00007fbfe024bc2b: add    %r9d,%eax
  2.85%  ││        │   0x00007fbfe024bc2e: vpextrq $0x1,%xmm11,%r9
  0.35%  ││        │   0x00007fbfe024bc34: popcnt %r9,%rbx
  0.21%  ││        │   0x00007fbfe024bc39: vextracti128 $0x1,%ymm10,%xmm6
  2.82%  ││        │   0x00007fbfe024bc3f: vpextrq $0x1,%xmm6,%r9
  0.34%  ││        │   0x00007fbfe024bc45: popcnt %r9,%r9
  0.38%  ││        │   0x00007fbfe024bc4a: vextracti128 $0x1,%ymm10,%xmm6
  2.58%  ││        │   0x00007fbfe024bc50: vpextrq $0x0,%xmm6,%rdi
  0.42%  ││        │   0x00007fbfe024bc56: popcnt %rdi,%rsi
  0.56%  ││        │   0x00007fbfe024bc5b: add    %esi,%eax
  4.90%  ││        │   0x00007fbfe024bc5d: add    %r9d,%eax
  0.55%  ││        │   0x00007fbfe024bc60: add    %eax,%r14d
  0.87%  ││        │   0x00007fbfe024bc63: add    %ebx,%r14d
  1.46%  ││        │   0x00007fbfe024bc66: vextracti128 $0x1,%ymm11,%xmm6
  1.91%  ││        │   0x00007fbfe024bc6c: vpextrq $0x0,%xmm6,%r9
  0.12%  ││        │   0x00007fbfe024bc72: popcnt %r9,%r9
  1.33%  ││        │   0x00007fbfe024bc77: add    %r9d,%r14d
  2.20%  ││        │   0x00007fbfe024bc7a: vextracti128 $0x1,%ymm11,%xmm6
  0.08%  ││        │   0x00007fbfe024bc80: vpextrq $0x1,%xmm6,%r9
  2.51%  ││        │   0x00007fbfe024bc86: popcnt %r9,%rbx
  3.68%  ││        │   0x00007fbfe024bc8b: add    %ebx,%r14d
  0.45%  ││        │   0x00007fbfe024bc8e: add    %r14d,%edx
  1.69%  ││        │   0x00007fbfe024bc91: add    %r11d,%edx
  1.34%  ││        │   0x00007fbfe024bc94: add    %r10d,%edx
  3.71%  ││        │   0x00007fbfe024bc97: add    %r8d,%edx
  4.53%  ││        │   0x00007fbfe024bc9a: add    $0x10,%ecx

What’s going on here is that each 256 bit vector is first extracted to a 128 bit register, so that a 64 bit word can be moved to a 64 bit general purpose register for POPCNTQ to operate on. This doesn’t benchmark very well at all on my AVX2 capable laptop, but my laptop is a poor proxy for the kind of AVX512 capable processor bioinformatics workloads would expect to run on. AVX512F has VEXTRACTI64X4, which extracts four 64 bit lanes at once (into a 256 bit register or memory), and LongVector512::get may well take advantage of it on the right hardware. I’m not the only person in the world to have run benchmarks on a laptop in my spare time, and it’s important to realise that some benchmarks might be slow just because an instruction is missing.

I found a slight improvement on the scalar loop by dumping the intersected vectors to a pre-allocated array, and manually unrolling the bit counts with three accumulators, because the latency of POPCNTQ is three times that of ADD. The unrolled version is roughly 20% faster than the scalar loop, but this isn’t the kind of gain usually expected from vectorisation.


  @Benchmark
  public int vpandStorePopcnt() {
    long[] intersections = buffer;
    int cardinality = 0;
    for (int i = 0; i < size && i < left.length && i < right.length; i += 4) {
      YMM_LONG.fromArray(left, i).and(YMM_LONG.fromArray(right, i)).intoArray(intersections, 0);
      cardinality += Long.bitCount(intersections[0]);
      cardinality += Long.bitCount(intersections[1]);
      cardinality += Long.bitCount(intersections[2]);
      cardinality += Long.bitCount(intersections[3]);
    }
    return cardinality;
  }

  @Benchmark
  public int vpandStorePopcntUnrolled() {
    long[] intersections = buffer;
    int cardinality1 = 0;
    int cardinality2 = 0;
    int cardinality3 = 0;
    for (int i = 0; i < size && i < left.length && i < right.length; i += 8) {
      YMM_LONG.fromArray(left, i).and(YMM_LONG.fromArray(right, i)).intoArray(intersections, 0);
      YMM_LONG.fromArray(left, i + 4).and(YMM_LONG.fromArray(right, i + 4)).intoArray(intersections, 4);
      cardinality1 += Long.bitCount(intersections[0]);
      cardinality2 += Long.bitCount(intersections[1]);
      cardinality3 += Long.bitCount(intersections[2]);
      cardinality1 += Long.bitCount(intersections[3]);
      cardinality2 += Long.bitCount(intersections[4]);
      cardinality3 += Long.bitCount(intersections[5]);
      cardinality1 += Long.bitCount(intersections[6]);
      cardinality2 += Long.bitCount(intersections[7]);
    }
    return cardinality1 + cardinality2 + cardinality3;
  }

Here the pairs of extracts are replaced with a store to the buffer.


  0.03%        ││ ││││  0x00007f6d50031358: vmovdqu 0x10(%rax,%r9,8),%ymm2
  0.02%        ││ ││││  0x00007f6d5003135f: vpand  0x10(%r13,%r9,8),%ymm2,%ymm2   
  0.06%        ││ ││││  0x00007f6d50031366: vmovdqu %ymm2,0x10(%r12,%rcx,8)      
  0.03%        ││ ││││  0x00007f6d5003136d: mov    %r9d,%edi
  0.00%        ││ ││││  0x00007f6d50031370: add    $0x4,%edi                      
  0.00%        ││ ││││  0x00007f6d50031373: movslq %edi,%rbx                      
  0.03%        ││ ││││  0x00007f6d50031376: vmovdqu 0x10(%rax,%rbx,8),%ymm2
  0.00%        ││ ││││  0x00007f6d5003137c: vpand  0x10(%r13,%rbx,8),%ymm2,%ymm2  
  0.10%        ││ ││││  0x00007f6d50031383: vmovdqu %ymm2,0x30(%r12,%rcx,8)      
  0.02%        ││ ││││  0x00007f6d5003138a: popcnt 0x28(%r12,%rcx,8),%rdi
  0.03%        ││ ││││  0x00007f6d50031391: popcnt 0x20(%r12,%rcx,8),%rdx
  0.04%        ││ ││││  0x00007f6d50031398: add    %r8d,%edx
  0.02%        ││ ││││  0x00007f6d5003139b: popcnt 0x38(%r12,%rcx,8),%r8
  0.22%        ││ ││││  0x00007f6d500313a2: add    %edx,%r8d
  0.08%        ││ ││││  0x00007f6d500313a5: popcnt 0x30(%r12,%rcx,8),%rdx
  0.03%        ││ ││││  0x00007f6d500313ac: popcnt 0x18(%r12,%rcx,8),%rbx
               ││ ││││  0x00007f6d500313b3: add    %r11d,%ebx
  0.04%        ││ ││││  0x00007f6d500313b6: add    %edx,%ebx
  0.15%        ││ ││││  0x00007f6d500313b8: popcnt 0x48(%r12,%rcx,8),%r11
               ││ ││││  0x00007f6d500313bf: add    %ebx,%r11d
  0.04%        ││ ││││  0x00007f6d500313c2: popcnt 0x40(%r12,%rcx,8),%rdx
  0.03%        ││ ││││  0x00007f6d500313c9: popcnt 0x10(%r12,%rcx,8),%rbx
  0.10%        ││ ││││  0x00007f6d500313d0: add    %esi,%ebx
  0.06%        ││ ││││  0x00007f6d500313d2: add    %ebx,%edi
  0.03%        ││ ││││  0x00007f6d500313d4: add    %edx,%edi

Here are all the results in summary:

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
popcnt thrpt 1 20 2098.750355 12.877810 ops/ms 1024
unrolledPopcnt thrpt 1 20 2077.227092 29.230757 ops/ms 1024
vpandExtractPopcnt thrpt 1 20 1819.027524 12.728300 ops/ms 1024
vpandStorePopcnt thrpt 1 20 2372.775743 10.315422 ops/ms 1024
vpandStorePopcntUnrolled thrpt 1 20 2626.761626 26.099143 ops/ms 1024

The difference is probably attributable to a smaller number of instructions:

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
popcnt:instructions thrpt 1 1 5241.243387 NaN #/op 1024
unrolledPopcnt:instructions thrpt 1 1 5203.655274 NaN #/op 1024
vpandExtractPopcnt:instructions thrpt 1 1 4579.851499 NaN #/op 1024
vpandStorePopcnt:instructions thrpt 1 1 3170.924853 NaN #/op 1024
vpandStorePopcntUnrolled:instructions thrpt 1 1 3528.127055 NaN #/op 1024

In total, the gains aren’t great, but the baseline is strong. There’s more to counting bits than computing the Hamming similarity between two bitmaps; various useful similarity metrics, such as Jaccard and Tanimoto, can be calculated in the same way by replacing intersection with other set relations already implemented in the Vector API.
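
For instance, a Jaccard similarity (intersection cardinality over union cardinality) could reuse the store-then-popcount pattern from vpandStorePopcnt; the sketch below is mine rather than code from the benchmarks above, and assumes the union operation is spelled or in this version of the API and that a second scratch array, unionBuffer, is pre-allocated:

  // Hedged sketch: Jaccard similarity, intersection cardinality over union cardinality,
  // using the same buffer trick as vpandStorePopcnt; unionBuffer is an assumed long[4].
  @Benchmark
  public double jaccard() {
    long[] intersections = buffer;
    long[] unions = unionBuffer;
    int intersectionCardinality = 0;
    int unionCardinality = 0;
    for (int i = 0; i < size && i < left.length && i < right.length; i += 4) {
      var l = YMM_LONG.fromArray(left, i);
      var r = YMM_LONG.fromArray(right, i);
      l.and(r).intoArray(intersections, 0);
      l.or(r).intoArray(unions, 0);
      for (int j = 0; j < 4; ++j) {
        intersectionCardinality += Long.bitCount(intersections[j]);
        unionCardinality += Long.bitCount(unions[j]);
      }
    }
    return unionCardinality == 0 ? 1.0 : (double) intersectionCardinality / unionCardinality;
  }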

Benchmarks

Vectorised Polynomial Hash Codes

To provide support for the idea of pluggable hashing strategies, Peter Lawrey demonstrates that there are better and faster hash codes than the JDK implementation of String.hashCode or Arrays.hashCode. I really like the analysis of output distribution, so I recommend reading the post. However, I’m not absolutely sure pluggable hashing strategies would be a good idea: they can induce coupling between the strategy implementation and the contents of the hashed data structure, which have different life cycles and code ownership. If performance is what matters, why not just make the existing algorithm much faster?

Peter’s hash code uses Unsafe to reinterpret each four bytes as an integer, but is otherwise just another polynomial hash code with a different coefficient. It produces a slightly different result, but with potentially better properties. Here’s that hash code.

private static final int M2 = 0x7A646E4D;

// read 4 bytes at a time from a byte[] assuming Java 9+ Compact Strings
private static int getIntFromArray(byte[] value, int i) {
    return UnsafeMemory.INSTANCE.UNSAFE.getInt(value, BYTE_ARRAY_OFFSET + i); 
}

public static int nativeHashCode(byte[] value) {
    long h = getIntFromArray(value, 0);
    for (int i = 4; i < value.length; i += 4)
        h = h * M2 + getIntFromArray(value, i);
    h *= M2;
    return (int) h ^ (int) (h >>> 25);
}

Leaving the output distribution to one side, Peter reports that this hash code performs better than Arrays.hashCode(byte[]), and this is accurate. Where does the performance come from? The reinterpretation reduces the number of multiplications by a factor of four, but you need Unsafe to achieve this. It also obviates the need to convert each byte to an integer to avoid overflow. Another problem is solved just by changing the multiplier. Arrays.hashCode is generally slow because the multiplication by 31 gets strength reduced to a left shift by five and a subtraction, which inadvertently creates a data dependency which can’t be unrolled. When the multiplier is 31, just unrolling the multiplication to disable the strength reduction can increase throughput by 2x, and the rather obscure choice of 0x7A646E4D means that no such transformation takes place: this results in independent chains of multiplications and additions in the main loop:


  0.18%    0.46%     ││  0x00007f3b21c05285: movslq 0x18(%rdx,%r8,1),%rsi
  5.93%    6.28%     ││  0x00007f3b21c0528a: movslq 0x1c(%rdx,%r8,1),%rax
  0.12%    0.42%     ││  0x00007f3b21c0528f: imul   $0x7a646e4d,%rcx,%rcx
 11.87%   37.31%     ││  0x00007f3b21c05296: movslq 0x14(%rdx,%r8,1),%rdi
  0.10%    0.18%     ││  0x00007f3b21c0529b: movslq 0x10(%rdx,%r8,1),%r8
  0.06%    0.58%     ││  0x00007f3b21c052a0: add    %rcx,%r8
  5.29%    1.30%     ││  0x00007f3b21c052a3: imul   $0x7a646e4d,%r8,%r8
 18.34%   21.94%     ││  0x00007f3b21c052aa: add    %r8,%rdi
  6.33%    1.96%     ││  0x00007f3b21c052ad: imul   $0x7a646e4d,%rdi,%r8
 17.60%   10.88%     ││  0x00007f3b21c052b4: add    %r8,%rsi
  5.39%    0.72%     ││  0x00007f3b21c052b7: imul   $0x7a646e4d,%rsi,%rcx
 17.80%   11.58%     ││  0x00007f3b21c052be: add    %rax,%rcx 
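
For reference, the 2x unrolling trick mentioned above amounts to processing several elements per iteration with precomputed powers of 31, so each iteration contains independent multiplications rather than one long serial chain of shifts and subtractions. A hedged sketch for int[] (my illustration, not code from the benchmarks above) which produces the same result as Arrays.hashCode(int[]):

  // Manual 4-way unroll of the 31-multiplier polynomial hash: 31^2, 31^3 and 31^4
  // are precomputed so the four terms in each iteration are independent.
  static int unrolledHashCode31(int[] data) {
    int h = 1;
    int i = 0;
    for (; i + 3 < data.length; i += 4) {
      h = 923521 * h          // 31^4
        + 29791 * data[i]     // 31^3
        + 961 * data[i + 1]   // 31^2
        + 31 * data[i + 2]
        + data[i + 3];
    }
    for (; i < data.length; ++i) { // post loop for the remaining elements
      h = 31 * h + data[i];
    }
    return h;
  }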

Is this as good as it can get, and is there something fundamentally wrong with the JDK algorithm? The algorithm can be vectorised, but this is beyond C2’s autovectoriser. The same algorithm is used for Arrays.hashCode(int[]), which doesn’t have the complication of type promotion from byte to int. I have noted before that this can be transformed to a loop which C2 can autovectorise by precomputing the coefficients of the polynomial (i.e. the powers of 31 until they repeat modulo 2^32, but x -> x * 31 has a very long period modulo 2^32), but this requires either an enormous array or a maximum length.


    private int[] coefficients;
    private int seed;

    void init(int size) {
        coefficients = new int[size]; 
        coefficients[size - 1] = 1;
        for (int i = size - 2; i >= 0; --i) {
            coefficients[i] = 31 * coefficients[i + 1];
        }
        seed = 31 * coefficients[0];
    }

    public int hashCodeAutoVectorised() {
        int result = seed;
        for (int i = 0; i < data.length && i < coefficients.length; ++i) {
            result += coefficients[i] * data[i];
        }
        return result;
    }

This idea isn’t practical but demonstrates that this kind of polynomial can be computed efficiently, if only the coefficients could be generated without disabling autovectorisation. Generating the coefficients on the fly is possible with the Vector API. It requires a multiplier containing eight consecutive powers of 31, and the exponent of each element needs to be increased by eight in each iteration. This can be achieved with a broadcast variable.


  public int polynomialHashCode() {
    var next = YMM_INT.broadcast(POWERS_OF_31_BACKWARDS[33 - 9]);
    var coefficients = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 8);
    var acc = YMM_INT.zero();
    for (int i = data.length; i - YMM_INT.length() >= 0; i -= YMM_INT.length()) {
      acc = acc.add(coefficients.mul(YMM_INT.fromArray(data, i - YMM_INT.length())));
      coefficients = coefficients.mul(next);
    }
    return acc.addAll() + coefficients.get(7);
  }

There’s a problem here – it sandwiches a low latency addition between two high latency multiplications, so there is a data dependency, and unrolling without breaking the dependency isn’t necessarily helpful. The dependency can be broken manually by using four accumulators and four coefficient vectors which together hold 32 consecutive powers of 31, with each coefficient’s exponent increased by 32 in each iteration. It may look dirty but the dependencies are eradicated.


  public int polynomialHashCodeUnrolled() {
    var next = YMM_INT.broadcast(POWERS_OF_31_BACKWARDS[0]);
    var coefficients1 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 8);
    var coefficients2 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 16);
    var coefficients3 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 24);
    var coefficients4 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 32);
    var acc1 = YMM_INT.zero();
    var acc2 = YMM_INT.zero();
    var acc3 = YMM_INT.zero();
    var acc4 = YMM_INT.zero();
    for (int i = data.length; i - 4 * YMM_INT.length() >= 0; i -= YMM_INT.length() * 4) {
      acc1 = acc1.add(coefficients1.mul(YMM_INT.fromArray(data, i - YMM_INT.length())));
      acc2 = acc2.add(coefficients2.mul(YMM_INT.fromArray(data, i - 2 * YMM_INT.length())));
      acc3 = acc3.add(coefficients3.mul(YMM_INT.fromArray(data, i - 3 * YMM_INT.length())));
      acc4 = acc4.add(coefficients4.mul(YMM_INT.fromArray(data, i - 4 * YMM_INT.length())));
      coefficients1 = coefficients1.mul(next);
      coefficients2 = coefficients2.mul(next);
      coefficients3 = coefficients3.mul(next);
      coefficients4 = coefficients4.mul(next);
    }
    return acc1.add(acc2).add(acc3).add(acc4).addAll() + coefficients1.get(7);
  }

The implementation above does not handle arbitrary length input, but the input could either be padded with zeroes (see this paper on padded SLP autovectorisation) or followed by a post loop. It produces the same output as Arrays.hashCode, so how much faster is it?

Benchmark (size) Mode Cnt Score Error Units
arraysHashCode 1024 thrpt 20 1095.089 3.980 ops/ms
arraysHashCode 65536 thrpt 20 16.963 0.130 ops/ms
hashCodeAutoVectorised 1024 thrpt 20 3716.853 18.025 ops/ms
hashCodeAutoVectorised 65536 thrpt 20 57.265 0.907 ops/ms
polynomialHashCode 1024 thrpt 20 2623.090 7.920 ops/ms
polynomialHashCode 65536 thrpt 20 39.344 0.238 ops/ms
polynomialHashCodeUnrolled 1024 thrpt 20 8266.119 34.480 ops/ms
polynomialHashCodeUnrolled 65536 thrpt 20 131.196 6.234 ops/ms

So there really is absolutely nothing wrong with the algorithm from a performance perspective, but the implementation can be improved vastly (~8x). It seems that the tools required for JDK engineers and users alike to make optimisations like these are in the pipeline!

What about byte arrays? One obstacle to vectorisation is that to implement the JDK algorithm strictly, the bytes must accumulate into 32 bit values, which means that the lanes need to widen, so the contents of a single vector register of bytes would need to fan out to four vector registers of integers. This would be achievable by loading vectors at eight byte offsets and permuting the first eight bytes of each vector into every fourth position, but this is quite convoluted.

Peter demonstrates that reinterpreting four bytes as an integer doesn’t necessarily degrade the hash distribution, and may even improve it, so I used the same trick: rebracketing a ByteVector to an IntVector to produce the “wrong” result, but a reasonable hash code nonetheless. A nice feature of the Vector API is allowing this kind of reinterpretation without resorting to Unsafe, via Vector.rebracket.

  
  public int hashCodeVectorAPINoDependencies() {
    var next = YMM_INT.broadcast(POWERS_OF_31[8]);
    var coefficients1 = YMM_INT.fromArray(POWERS_OF_31, 0);
    var coefficients2 = coefficients1.mul(next);
    var coefficients3 = coefficients2.mul(next);
    var coefficients4 = coefficients3.mul(next);
    next = next.mul(next);
    next = next.mul(next);
    var acc1 = YMM_INT.zero();
    var acc2 = YMM_INT.zero();
    var acc3 = YMM_INT.zero();
    var acc4 = YMM_INT.zero();
    for (int i = 0; i < data.length; i += YMM_BYTE.length() * 4) {
      acc1 = acc1.add(coefficients1.mul(YMM_BYTE.fromArray(data, i).rebracket(YMM_INT)));
      acc2 = acc2.add(coefficients2.mul(YMM_BYTE.fromArray(data, i + YMM_BYTE.length()).rebracket(YMM_INT)));
      acc3 = acc3.add(coefficients3.mul(YMM_BYTE.fromArray(data, i + 2 * YMM_BYTE.length()).rebracket(YMM_INT)));
      acc4 = acc4.add(coefficients4.mul(YMM_BYTE.fromArray(data, i + 3 * YMM_BYTE.length()).rebracket(YMM_INT)));
      coefficients1 = coefficients1.mul(next);
      coefficients2 = coefficients2.mul(next);
      coefficients3 = coefficients3.mul(next);
      coefficients4 = coefficients4.mul(next);
    }
    return acc1.add(acc2).add(acc3).add(acc4).addAll();
  }

Owing to its use of better tools yet to be released, this version is many times faster than either the JDK implementation or Peter’s.

Benchmark (size) Mode Cnt Score Error Units
arraysHashCode 128 thrpt 20 8897.392 220.582 ops/ms
arraysHashCode 256 thrpt 20 4286.621 156.794 ops/ms
arraysHashCode 512 thrpt 20 2024.858 72.030 ops/ms
arraysHashCode 1024 thrpt 20 1002.173 39.917 ops/ms
hashCodeVectorAPINoDependencies 128 thrpt 20 88395.374 3369.397 ops/ms
hashCodeVectorAPINoDependencies 256 thrpt 20 64799.697 1035.175 ops/ms
hashCodeVectorAPINoDependencies 512 thrpt 20 48248.967 864.728 ops/ms
hashCodeVectorAPINoDependencies 1024 thrpt 20 27429.025 916.850 ops/ms
hashCodeVectorAPIDependencies 128 thrpt 20 96679.811 316.470 ops/ms
hashCodeVectorAPIDependencies 256 thrpt 20 52140.582 1825.173 ops/ms
hashCodeVectorAPIDependencies 512 thrpt 20 26327.195 492.730 ops/ms
hashCodeVectorAPIDependencies 1024 thrpt 20 10929.500 351.732 ops/ms
nativeHashCode 128 thrpt 20 38799.185 188.636 ops/ms
nativeHashCode 256 thrpt 20 17438.536 257.168 ops/ms
nativeHashCode 512 thrpt 20 7041.815 209.817 ops/ms
nativeHashCode 1024 thrpt 20 3217.187 96.379 ops/ms

Benchmark source code

Paul Sandoz discussed this topic at Oracle Code One 2018 (Slides)