# Autovectorised FMA in JDK10

Fused multiply-add (FMA) allows floating point expressions of the form `a * x + b` to be evaluated in a single instruction, which is useful for numerical linear algebra. Despite the obvious appeal of FMA, JVM implementors are rather constrained when it comes to floating point arithmetic, because Java programs are expected to be reproducible across versions and target architectures. FMA does not produce precisely the same result as the equivalent multiplication and addition instructions (this is caused by the compounding effect of rounding), so its use is a change in semantics rather than a pure optimisation; the user must opt in. To the best of my knowledge, support for FMA was first proposed in 2000, along with reorderable floating point operations, which would have been activated by a `fastfp` keyword, but this proposal was withdrawn. In Java 9, the intrinsic `Math.fma` was introduced to provide access to FMA for the first time.
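
The semantic difference is easy to demonstrate (a minimal sketch, with values chosen for illustration): `0.1 * 10` is exactly `1 + 2^-54` before rounding, which rounds back down to `1.0`, so the residual vanishes when the multiplication and subtraction round separately; `Math.fma` rounds only once at the end and keeps it.

```java
double separate = 0.1 * 10.0 - 1.0;       // the product rounds to exactly 1.0, so this is 0.0
double fused = Math.fma(0.1, 10.0, -1.0); // a single rounding preserves the residual 2^-54

System.out.println(separate); // 0.0
System.out.println(fused);    // 5.551115123125783E-17
```

Neither result is "wrong"; they are simply different functions of the inputs, which is why the choice is left to the user.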

### DAXPY Benchmark

A good use case to evaluate `Math.fma` is DAXPY from the Basic Linear Algebra Subprograms (BLAS) library. The code below compiles with JDK 9 and above.

```java
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class DAXPY {

    double s;

    @Setup(Level.Invocation)
    public void init() {
        // assumption: the scalar is randomised per invocation
        s = ThreadLocalRandom.current().nextDouble();
    }

    @Benchmark
    public void daxpyFMA(DoubleData state, Blackhole bh) {
        double[] a = state.data1;
        double[] b = state.data2;
        for (int i = 0; i < a.length; ++i) {
            a[i] = Math.fma(s, b[i], a[i]);
        }
        bh.consume(a);
    }

    @Benchmark
    public void daxpy(DoubleData state, Blackhole bh) {
        double[] a = state.data1;
        double[] b = state.data2;
        for (int i = 0; i < a.length; ++i) {
            a[i] += s * b[i];
        }
        bh.consume(a);
    }
}
```

Running this benchmark with Java 9, you may wonder why you bothered: the `Math.fma` version is actually slower.

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|---------|---------|-------|---------------------|------|-------------|
| daxpy | thrpt | 1 | 10 | 25.011242 | 2.259007 | ops/ms | 100000 |
| daxpy | thrpt | 1 | 10 | 0.706180 | 0.046146 | ops/ms | 1000000 |
| daxpyFMA | thrpt | 1 | 10 | 15.334652 | 0.271946 | ops/ms | 100000 |
| daxpyFMA | thrpt | 1 | 10 | 0.623838 | 0.018041 | ops/ms | 1000000 |

This is because using `Math.fma` disables autovectorisation. Looking at the output of `-XX:+PrintAssembly`, you can see that the naive `daxpy` routine exploits AVX2, whereas `daxpyFMA` reverts to scalar instructions.

```asm
// daxpy routine, code taken from main vectorised loop
vmovdqu ymm1,ymmword ptr [r10+rdx*8+10h]
vmulpd  ymm1,ymm1,ymm2
vmovdqu ymmword ptr [r8+rdx*8+10h],ymm1

// daxpyFMA routine
vmovsd  xmm2,qword ptr [rcx+r13*8+10h]
vmovsd  qword ptr [rcx+r13*8+10h],xmm2
```

Not to worry: this seems to have been fixed in JDK 10. Since Java 10's release is around the corner, there are early access builds available for all platforms. Rerunning this benchmark, FMA no longer incurs a cost, but nor does it bring the performance boost some might expect. The benefit is reduced floating point error, because the total number of roundings is halved.

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|---------|---------|-------|---------------------|------|-------------|
| daxpy | thrpt | 1 | 10 | 2582.363228 | 116.637400 | ops/ms | 1000 |
| daxpy | thrpt | 1 | 10 | 405.904377 | 32.364782 | ops/ms | 10000 |
| daxpy | thrpt | 1 | 10 | 25.210111 | 1.671794 | ops/ms | 100000 |
| daxpy | thrpt | 1 | 10 | 0.608660 | 0.112512 | ops/ms | 1000000 |
| daxpyFMA | thrpt | 1 | 10 | 2650.264580 | 211.342407 | ops/ms | 1000 |
| daxpyFMA | thrpt | 1 | 10 | 389.274693 | 43.567450 | ops/ms | 10000 |
| daxpyFMA | thrpt | 1 | 10 | 24.941172 | 2.393358 | ops/ms | 100000 |
| daxpyFMA | thrpt | 1 | 10 | 0.671310 | 0.158470 | ops/ms | 1000000 |

```asm
// vectorised daxpyFMA routine, code taken from main loop (you can still see the old code in pre/post loops)
vmovdqu ymm0,ymmword ptr [r9+r13*8+10h]
vmovdqu ymmword ptr [r9+r13*8+10h],ymm0
```

# Spliterator Characteristics and Performance

The streams API has been around for a while now, and I’m a big fan of it. It allows for a clean declarative programming style, which permits various optimisations to occur, and keeps the pastafarians at bay. I also think the `Stream` is the perfect abstraction for data interchange across API boundaries. This is partly because a `Stream` is lazy, meaning you don’t need to pay for consumption until you actually need to, and partly because a `Stream` can only be used once, so there can be no ambiguity about ownership. If you supply a `Stream` to an API, you must expect that it has been used, and so must discard it. This almost entirely eradicates defensive copies and can mean that no intermediate data structures need ever exist. Despite my enthusiasm for this abstraction, there’s some weirdness in this API when you scratch beneath the surface.

I wanted to find a way to quickly run-length encode an `IntStream` and found it difficult to make this code as fast as I’d like it to be. The code is too slow because it’s necessary to inspect each `int`, even when there is enough context available to potentially apply optimisations such as skipping over ranges. It’s likely that I am experiencing the friction of treating `Stream` as a data interchange format, which wasn’t one of its design goals, but this led me to investigate spliterator characteristics (there is no contiguous characteristic, which could speed up RLE greatly) and their relationship with performance.
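
For concreteness, here's a strawman sketch of what I mean by run-length encoding a stream (the `RunLengthEncoder` class and its `{value, length}` output are my own illustration, not an API): every element must be inspected one at a time, and nothing in the spliterator contract lets the encoder skip ahead over a run it could prove is contiguous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

// Strawman: run-length encode an IntStream with a stateful consumer,
// emitting each run as a {value, length} pair.
class RunLengthEncoder implements IntConsumer {
    final List<int[]> runs = new ArrayList<>();
    private int current;
    private int length; // 0 means no run in progress

    @Override
    public void accept(int value) {
        if (length > 0 && value == current) {
            ++length;
        } else {
            if (length > 0) {
                runs.add(new int[] {current, length});
            }
            current = value;
            length = 1;
        }
    }

    List<int[]> finish() {
        if (length > 0) {
            runs.add(new int[] {current, length});
            length = 0;
        }
        return runs;
    }
}

RunLengthEncoder encoder = new RunLengthEncoder();
IntStream.of(7, 7, 7, 1, 3, 3).forEachOrdered(encoder);
List<int[]> runs = encoder.finish(); // [{7, 3}, {1, 1}, {3, 2}]
```

Even when the source is an `IntStream.range`, whose runs are trivially derivable from its bounds, the encoder above still has to visit every element.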

### Spliterator Characteristics

Streams have spliterators, which control iteration and splitting behaviour. If you want to process a stream in parallel, it is the spliterator which dictates how to split the stream, if possible. There’s more to a spliterator than parallel execution though, and each single threaded execution can be optimised based on the characteristics bit mask. The different characteristics are as follows:

- `ORDERED`: promises that there is an encounter order; for instance, `trySplit` is guaranteed to give a prefix of elements.
- `DISTINCT`: promises that each element in the stream is unique.
- `SORTED`: promises that the stream is already sorted.
- `SIZED`: promises that the size of the stream is known. This is not true when a call to `iterate` generates the stream.
- `NONNULL`: promises that no elements in the stream are null.
- `IMMUTABLE`: promises that the underlying data will not change.
- `CONCURRENT`: promises that the underlying data can be modified concurrently; must not also be `IMMUTABLE`.
- `SUBSIZED`: promises that the sizes of splits are known; must also be `SIZED`.
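
The advertised characteristics can be inspected directly from a spliterator; a quick sketch for `IntStream.range`, which sets seven of them:

```java
import java.util.Spliterator;
import java.util.stream.IntStream;

Spliterator.OfInt spliterator = IntStream.range(0, 100).spliterator();
int expected = Spliterator.ORDERED | Spliterator.SIZED | Spliterator.SUBSIZED
             | Spliterator.IMMUTABLE | Spliterator.NONNULL
             | Spliterator.DISTINCT | Spliterator.SORTED;
// all seven characteristics are advertised, and the size is exact
assert spliterator.hasCharacteristics(expected);
assert spliterator.estimateSize() == 100;
```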

There’s javadoc for all of these flags, which should be your point of reference, and you need to read it because you wouldn’t guess based on relative performance. For instance, `IntStream.range(inclusive, exclusive)` creates a `RangeIntSpliterator` with the characteristics `ORDERED | SIZED | SUBSIZED | IMMUTABLE | NONNULL | DISTINCT | SORTED`. This means that this stream has no duplicates, no nulls, is already sorted in natural order, the size is known, and it will be chunked deterministically. The data and the iteration order never change, and if we split it, we will always get the same first chunk. So these code snippets should have virtually the same performance:

```java
@Benchmark
public long countRange() {
    return IntStream.range(0, size).count();
}

@Benchmark
public long countRangeDistinct() {
    return IntStream.range(0, size).distinct().count();
}
```

This is completely at odds with observations. Even though the elements are already distinct, and metadata exists to support this, requesting the distinct elements decimates performance.

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|---------|---------|-------|---------------------|------|-------------|
| countRange | thrpt | 1 | 10 | 49.465729 | 1.804123 | ops/us | 262144 |
| countRangeDistinct | thrpt | 1 | 10 | 0.000395 | 0.000002 | ops/us | 262144 |

It turns out this is because `IntStream.distinct` has a one-size-fits-all implementation which completely ignores the `DISTINCT` characteristic, and goes ahead and boxes the entire range.

```java
// from java.util.stream.IntPipeline
@Override
public final IntStream distinct() {
    // While functional and quick to implement, this approach is not very efficient.
    // An efficient version requires an int-specific map/set implementation.
    return boxed().distinct().mapToInt(i -> i);
}
```

There is even more observable weirdness. If we wanted to calculate the sum of the first 1000 natural numbers, these two snippets should have the same performance, yet requesting what should be a redundant sort doubles the throughput.

```java
@Benchmark
public long headSum() {
    return IntStream.range(0, size).limit(1000).sum();
}

@Benchmark
public long sortedHeadSum() {
    return IntStream.range(0, size).sorted().limit(1000).sum();
}
```

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|---------|---------|-------|---------------------|------|-------------|
| headSum | thrpt | 1 | 10 | 0.209763 | 0.002478 | ops/us | 262144 |
| sortedHeadSum | thrpt | 1 | 10 | 0.584227 | 0.006004 | ops/us | 262144 |

In fact, you would have a hard time finding a relationship between Spliterator characteristics and performance, but you can see cases of characteristics driving optimisations if you look hard enough, such as in `IntStream.count`, where the `SIZED` characteristic is used.

```java
// see java.util.stream.ReduceOps.makeIntCounting
@Override
public <P_IN> Long evaluateSequential(PipelineHelper<Integer> helper, Spliterator<P_IN> spliterator) {
    if (StreamOpFlag.SIZED.isKnown(helper.getStreamAndOpFlags()))
        return spliterator.getExactSizeIfKnown();
    return super.evaluateSequential(helper, spliterator);
}
```

This is a measurably worthwhile optimisation, when benchmarked against the unsized spliterator created by `IntStream.iterate`:

```java
@Benchmark
public long countIterator() {
    return IntStream.iterate(0, i -> i < size, i -> i + 1).count();
}

@Benchmark
public long countRange() {
    return IntStream.range(0, size).count();
}
```

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|---------|---------|-------|---------------------|------|-------------|
| countIterator | thrpt | 1 | 10 | 0.001198 | 0.001629 | ops/us | 262144 |
| countRange | thrpt | 1 | 10 | 43.166065 | 4.628715 | ops/us | 262144 |

What about `limit`? It’s supposed to be useful for speeding up streams by limiting the amount of work done. Unfortunately not: it can actually make things much worse. In `SliceOps.flags`, we see that it disables `SIZED` optimisations:

```java
// see java.util.stream.SliceOps
private static int flags(long limit) {
    return StreamOpFlag.NOT_SIZED | ((limit != -1) ? StreamOpFlag.IS_SHORT_CIRCUIT : 0);
}
```

This has a significant effect on performance, as can be seen in the following benchmark:

```java
@Benchmark
public long countRange() {
    return IntStream.range(0, size).count();
}

@Benchmark
public long countHalfRange() {
    return IntStream.range(0, size).limit(size / 2).count();
}
```

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|---------|---------|-------|---------------------|------|-------------|
| countHalfRange | thrpt | 1 | 10 | 0.003632 | 0.003363 | ops/us | 262144 |
| countRange | thrpt | 1 | 10 | 44.859998 | 6.191411 | ops/us | 262144 |
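
When the limit is known before the pipeline is built, a sketch of a workaround is to fold it into the range bounds instead, which keeps the `SIZED` characteristic and the `count` fast path:

```java
import java.util.stream.IntStream;

int size = 262144;
long viaLimit = IntStream.range(0, size).limit(size / 2).count(); // SIZED cleared: full traversal
long viaRange = IntStream.range(0, size / 2).count();             // SIZED kept: O(1) count
assert viaLimit == viaRange;
```

This only helps when the limit is statically known; a limit computed mid-pipeline cannot be folded away like this.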

It’s almost as if there were grand plans involving characteristic based optimisation, and perhaps time ran out (`IntStream.distinct` has a very apologetic comment) or others were better on paper than in reality. In any case, it looks like they aren’t as influential as you might expect. Given that the relationship between the characteristics which exist and performance is flaky at best, it’s unlikely that a new one would get implemented, but I think the characteristic `CONTIGUOUS` is missing.

# Don’t Fear the Builder

Writing a nontrivial piece of code requires a programmer to make countless tiny decisions, and to produce good quality code quickly, many of these decisions must be made subconsciously. But what happens if your instinct is mostly superstitious bullshit? I instinctively avoid allocating objects when I don’t have to, so would not voluntarily resort to using things like `EqualsBuilder` or `HashCodeBuilder` from Apache Commons. I feel dismayed whenever I am asked about the relationship between hash code and equals in interviews for contracts. This indicates a lack of mutual respect, but what’s revealing is that enough people passing themselves off as developers don’t know this stuff that it makes the question worth asking. `EqualsBuilder` and `HashCodeBuilder` make it so you don’t actually need to know, making it safe to put any object in a `HashSet`, whoever wrote the class. But should you use these classes just because they protect the naive (and you from the naive)? Is it a runtime tax? Or is any perceived cost magically eradicated by the JVM? It’s time to benchmark my instinct.

### Baseline

There are definitely no allocations in the code IntelliJ (or similar) would generate for `equals` and `hashCode` automatically. I will use this generated code as the baseline: if I can match or beat this code with `EqualsBuilder` or `HashCodeBuilder` respectively, I can dismiss my prejudice. Here’s the baseline class:

```java
public class Baseline {

    private final String a;
    private final int b;
    private final String c;
    private final Instant d;

    public Baseline(String a, int b, String c, Instant d) {
        this.a = a;
        this.b = b;
        this.c = c;
        this.d = d;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;

        Baseline baseline = (Baseline) o;

        if (b != baseline.b) return false;
        if (!a.equals(baseline.a)) return false;
        if (!c.equals(baseline.c)) return false;
        return d.equals(baseline.d);
    }

    @Override
    public int hashCode() {
        int result = a.hashCode();
        result = 31 * result + b;
        result = 31 * result + c.hashCode();
        result = 31 * result + d.hashCode();
        return result;
    }
}
```

It’s not pretty, and it’s a real pain to keep generated code up to date when adding and removing fields (you can delete and regenerate it, but that creates merge noise), but there’s no obvious way to improve on this code: it’s correct and looks fast. To benchmark alternatives, I want to look at the effect on set membership queries against a `HashSet` of various sizes.
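
Incidentally, the JDK’s own convenience method `Objects.hash` is not a free lunch either: it allocates an `Object[]` for its varargs and boxes the `int`, while computing the same 31-based polynomial as the handwritten version (just seeded with 1, so the values differ from the baseline’s). A quick sketch:

```java
import java.util.Objects;

// Objects.hash delegates to Arrays.hashCode: result starts at 1, then
// result = 31 * result + element.hashCode() for each argument.
int viaObjects = Objects.hash("x", 42);
int byHand = 31 * (31 * 1 + "x".hashCode()) + Integer.hashCode(42);
assert viaObjects == byHand;
```

So, like the Commons builders, it trades an allocation for convenience, and the same escape analysis questions apply.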

### Builder Implementation

I will contrast this with the following code using code from Apache Commons (incidentally, also generated by IntelliJ):

```java
public class Builder {

    private final String a;
    private final int b;
    private final String c;
    private final Instant d;

    public Builder(String a, int b, String c, Instant d) {
        this.a = a;
        this.b = b;
        this.c = c;
        this.d = d;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;

        if (o == null || getClass() != o.getClass()) return false;

        Builder builder = (Builder) o;

        return new EqualsBuilder()
                .append(b, builder.b)
                .append(a, builder.a)
                .append(c, builder.c)
                .append(d, builder.d)
                .isEquals();
    }

    @Override
    public int hashCode() {
        return new HashCodeBuilder(17, 37)
                .append(a)
                .append(b)
                .append(c)
                .append(d)
                .toHashCode();
    }
}
```

This code is a bit neater and easier to maintain, but does it have an observable cost?

### HashSet Benchmark

For a parametrised range of `HashSet` sizes, I measure throughput for calls to `contains`. This requires that both `hashCode` and `equals` are called, the latter potentially several times. There is bias in this benchmark, because the implementation producing the best hash function will minimise the calls to `equals`, but I am willing to reward the better implementation rather than maniacally isolate any attributable allocation cost.

The code is simple, if a little repetitive. For each implementation, I make some random instances of my class, and make one that I can be sure isn’t in the set. I measure the cost of repeated invocations to the `hashCode` and `equals` methods by finding an object, and searching but failing to find an object separately.

```java
@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class HashSetContainsBenchmark {

    // assumption: the source of randomness used by the instance factories below
    private static final SplittableRandom random = new SplittableRandom();

    @Param({"8", "96", "1024", "8192"})
    int setSize;

    private HashSet<Baseline> baselineSet;
    private Baseline presentBaseline;
    private Baseline missingBaseline;

    private HashSet<Builder> builderSet;
    private Builder presentBuilder;
    private Builder missingBuilder;

    @Setup(Level.Trial)
    public void setup() {
        setupBaselineState();
        setupBuilderState();
    }

    private void setupBaselineState() {
        baselineSet = new HashSet<>();
        for (int i = 0; i < setSize; ++i) {
            baselineSet.add(newBaselineInstance());
        }
        presentBaseline = baselineSet.iterator().next();
        do {
            missingBaseline = newBaselineInstance();
        } while (baselineSet.contains(missingBaseline));
    }

    private void setupBuilderState() {
        builderSet = new HashSet<>();
        for (int i = 0; i < setSize; ++i) {
            builderSet.add(newBuilderInstance());
        }
        presentBuilder = builderSet.iterator().next();
        do {
            missingBuilder = newBuilderInstance();
        } while (builderSet.contains(missingBuilder));
    }

    @Benchmark
    public boolean findPresentBaselineInstance() {
        return baselineSet.contains(presentBaseline);
    }

    @Benchmark
    public boolean findMissingBaselineInstance() {
        return baselineSet.contains(missingBaseline);
    }

    @Benchmark
    public boolean findPresentBuilderInstance() {
        return builderSet.contains(presentBuilder);
    }

    @Benchmark
    public boolean findMissingBuilderInstance() {
        return builderSet.contains(missingBuilder);
    }

    private static Baseline newBaselineInstance() {
        return new Baseline(UUID.randomUUID().toString(),
                random.nextInt(),
                UUID.randomUUID().toString(),
                Instant.ofEpochMilli(random.nextLong(0, Instant.now().toEpochMilli())));
    }

    private static Builder newBuilderInstance() {
        return new Builder(UUID.randomUUID().toString(),
                random.nextInt(),
                UUID.randomUUID().toString(),
                Instant.ofEpochMilli(random.nextLong(0, Instant.now().toEpochMilli())));
    }
}
```

Running this benchmark with `org.apache.commons:commons-lang3:3.7` yields the throughput results below. There is an occasional and slight throughput degradation with the builders, and the builder implementation is sometimes even faster in this benchmark, but the difference isn’t large enough to worry about.

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: setSize |
|-----------|------|---------|---------|-------|---------------------|------|----------------|
| findMissingBaselineInstance | thrpt | 1 | 10 | 153.878772 | 7.993036 | ops/us | 8 |
| findMissingBaselineInstance | thrpt | 1 | 10 | 135.835085 | 6.031588 | ops/us | 96 |
| findMissingBaselineInstance | thrpt | 1 | 10 | 137.716073 | 25.731562 | ops/us | 1024 |
| findMissingBaselineInstance | thrpt | 1 | 10 | 123.896741 | 15.484023 | ops/us | 8192 |
| findMissingBuilderInstance | thrpt | 1 | 10 | 142.051528 | 4.177352 | ops/us | 8 |
| findMissingBuilderInstance | thrpt | 1 | 10 | 132.326909 | 1.017351 | ops/us | 96 |
| findMissingBuilderInstance | thrpt | 1 | 10 | 127.220440 | 8.680761 | ops/us | 1024 |
| findMissingBuilderInstance | thrpt | 1 | 10 | 140.029701 | 6.272960 | ops/us | 8192 |
| findPresentBaselineInstance | thrpt | 1 | 10 | 139.714236 | 1.626873 | ops/us | 8 |
| findPresentBaselineInstance | thrpt | 1 | 10 | 140.244864 | 1.777298 | ops/us | 96 |
| findPresentBaselineInstance | thrpt | 1 | 10 | 135.273760 | 7.587937 | ops/us | 1024 |
| findPresentBaselineInstance | thrpt | 1 | 10 | 133.158555 | 5.101069 | ops/us | 8192 |
| findPresentBuilderInstance | thrpt | 1 | 10 | 111.196443 | 18.804103 | ops/us | 8 |
| findPresentBuilderInstance | thrpt | 1 | 10 | 124.008441 | 5.182294 | ops/us | 96 |
| findPresentBuilderInstance | thrpt | 1 | 10 | 126.729750 | 4.286002 | ops/us | 1024 |
| findPresentBuilderInstance | thrpt | 1 | 10 | 124.159285 | 5.287920 | ops/us | 8192 |

### Where did the builder go?

It’s worth taking a look at the compiler output to find out what happened. Considering the case where the object isn’t found in the set, we can see that the code gets inlined, and the hash code must be quite good. Enabling `-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining` shows that the entire hash code computation gets inlined into the calling loop, and that `equals` is never executed (suggesting that there are no collisions to resolve here).

```
@ 8   java.util.HashSet::contains (9 bytes)   inline (hot)
@ 5   java.util.HashMap::containsKey (18 bytes)   inline (hot)
\-> TypeProfile (17571/17571 counts) = java/util/HashMap
@ 2   java.util.HashMap::hash (20 bytes)   inline (hot)
@ 9   com.openkappa.simd.builder.Builder::hashCode (43 bytes)   inline (hot)
@ 8   org.apache.commons.lang3.builder.HashCodeBuilder::<init> (60 bytes)   inline (hot)
@ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
@ 26   org.apache.commons.lang3.Validate::isTrue (18 bytes)   unloaded signature classes
@ 46   org.apache.commons.lang3.Validate::isTrue (18 bytes)   unloaded signature classes
@ 15   org.apache.commons.lang3.builder.HashCodeBuilder::append (58 bytes)   inline (hot)
@ 21   java.lang.Object::getClass (0 bytes)   (intrinsic)
@ 24   java.lang.Class::isArray (0 bytes)   (intrinsic)
@ 49   java.lang.String::hashCode (49 bytes)   inline (hot)
@ 19   java.lang.String::isLatin1 (19 bytes)   inline (hot)
@ 29   java.lang.StringLatin1::hashCode (42 bytes)   inline (hot)
@ 22   org.apache.commons.lang3.builder.HashCodeBuilder::append (17 bytes)   inline (hot)
@ 29   org.apache.commons.lang3.builder.HashCodeBuilder::append (58 bytes)   inline (hot)
@ 21   java.lang.Object::getClass (0 bytes)   (intrinsic)
@ 24   java.lang.Class::isArray (0 bytes)   (intrinsic)
@ 49   java.lang.String::hashCode (49 bytes)   inline (hot)
@ 19   java.lang.String::isLatin1 (19 bytes)   inline (hot)
@ 29   java.lang.StringLatin1::hashCode (42 bytes)   inline (hot)
@ 36   org.apache.commons.lang3.builder.HashCodeBuilder::append (58 bytes)   inline (hot)
@ 21   java.lang.Object::getClass (0 bytes)   (intrinsic)
@ 24   java.lang.Class::isArray (0 bytes)   (intrinsic)
@ 49   java.time.Instant::hashCode (22 bytes)   inline (hot)
@ 39   org.apache.commons.lang3.builder.HashCodeBuilder::toHashCode (5 bytes)   accessor
@ 6   java.util.HashMap::getNode (148 bytes)   inline (hot)
@ 59   com.openkappa.simd.builder.Builder::equals (84 bytes)   never executed
@ 126   com.openkappa.simd.builder.Builder::equals (84 bytes)   never executed
```

So the code is clearly small enough to get inlined. But what about the builder itself, isn’t it allocated? Not if it doesn’t escape. On a debug JVM, it’s possible to observe the removal of the builder’s allocation with the JVM args `-XX:+PrintEscapeAnalysis -XX:+PrintEliminateAllocations`. On a production JVM, it can be observed indirectly by measuring the difference when escape analysis is disabled with `-XX:-DoEscapeAnalysis`.

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: setSize |
|-----------|------|---------|---------|-------|---------------------|------|----------------|
| findMissingBaselineInstance | thrpt | 1 | 10 | 150.972758 | 5.095163 | ops/us | 8 |
| findMissingBaselineInstance | thrpt | 1 | 10 | 140.236057 | 2.222751 | ops/us | 96 |
| findMissingBaselineInstance | thrpt | 1 | 10 | 132.680494 | 5.225503 | ops/us | 1024 |
| findMissingBaselineInstance | thrpt | 1 | 10 | 115.220385 | 4.232488 | ops/us | 8192 |
| findMissingBuilderInstance | thrpt | 1 | 10 | 68.869355 | 4.479944 | ops/us | 8 |
| findMissingBuilderInstance | thrpt | 1 | 10 | 67.786351 | 5.980353 | ops/us | 96 |
| findMissingBuilderInstance | thrpt | 1 | 10 | 71.113062 | 3.057181 | ops/us | 1024 |
| findMissingBuilderInstance | thrpt | 1 | 10 | 75.088839 | 1.592294 | ops/us | 8192 |
| findPresentBaselineInstance | thrpt | 1 | 10 | 141.425675 | 2.630898 | ops/us | 8 |
| findPresentBaselineInstance | thrpt | 1 | 10 | 142.636359 | 2.854795 | ops/us | 96 |
| findPresentBaselineInstance | thrpt | 1 | 10 | 143.939199 | 0.475918 | ops/us | 1024 |
| findPresentBaselineInstance | thrpt | 1 | 10 | 142.006635 | 3.098352 | ops/us | 8192 |
| findPresentBuilderInstance | thrpt | 1 | 10 | 71.546971 | 6.152584 | ops/us | 8 |
| findPresentBuilderInstance | thrpt | 1 | 10 | 62.451705 | 11.896730 | ops/us | 96 |
| findPresentBuilderInstance | thrpt | 1 | 10 | 68.903442 | 3.263955 | ops/us | 1024 |
| findPresentBuilderInstance | thrpt | 1 | 10 | 69.576038 | 4.047581 | ops/us | 8192 |

Without escape analysis, the baseline code hardly slows down, whereas the change in the builder benchmarks is stark. The difference of differences isolates the cost of allocating the `HashCodeBuilder` on the heap. Note that when escape analysis succeeds, the builder is not literally "allocated on the stack": it is scalar-replaced, decomposed into its fields, which live in registers or on the stack. Escape analysis is enabled by default, and the key point is that even if code looks like it should be slower, it might be as good or better: it depends what the JIT does with it.

# Beware Collection Factory Methods

I saw an interesting tweet referencing a GitHub issue where including an (in my view) unnecessary implementation of the `List` interface changed inlining decisions, causing a 20x degradation in throughput. Guava’s `ImmutableList` is my favourite class to seek and destroy because of the way it is often used: it tends to be associated with unnecessary copying where encapsulation would be a better solution. I had assumed performance gains won from finding and deleting all the instances of `ImmutableList` had been thanks to relieving the garbage collector from medieval torture. The performance degradation observed in the benchmark is caused by using `ImmutableList`, along with all its subclasses, alongside `ArrayList`, making call sites on `List` bimorphic at best and causing the JIT compiler to generate slower code. I may have inadvertently profited from better inlining in the past simply by removing as many `ImmutableList`s as possible!

This post doesn’t go into any details about the various mechanisms of method dispatch, and if you want to understand the impact of polymorphism on inlining, bookmark Aleksey Shipilev’s authoritative post and read it when you have some time to really concentrate.

Without resorting to using `LinkedList`, is it possible to contrive cases in Java 9 where performance is severely degraded by usages of `Collections.unmodifiableList` and `List.of` factory methods? Along with `ArrayList`, these are random access data structures so this should highlight the potential performance gains inlining can give.

The methodology is very simple: I randomly vary the `List` implementation and plug it into the same algorithm. It is cruder than you would see in Aleksey Shipilev’s post because I’ve targeted only the worst case by creating equal bias between implementations. Aleksey demonstrates that inlining decisions are statistical and opportunistic (the JIT can guess and later deoptimise), and if 90% of your call sites dispatch to the same implementation, it doesn’t matter as much as when the choice is made uniformly. It will vary from application to application, but it could easily be as bad as the case I present if `List` is used polymorphically.

I created five benchmarks which produce the same number, the same way. Three of these benchmarks only ever call into a single implementation of `List` and will be inlined monomorphically; to avoid bias, the result is XOR’d with a call to `ThreadLocalRandom.current().nextInt()`, because the other benchmarks need this call. One benchmark only ever calls into `List.of` and `ArrayList` instances, and one benchmark randomly chooses between all three implementations on each invocation. The difference is stark: you can really screw up performance by making the methods on `List` megamorphic.

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit |
|-----------|------|---------|---------|-------|---------------------|------|
| sumLength_ArrayList | thrpt | 1 | 10 | 55.785270 | 3.218552 | ops/us |
| sumLength_Factory | thrpt | 1 | 10 | 58.565918 | 2.852415 | ops/us |
| sumLength_Random2 | thrpt | 1 | 10 | 35.842255 | 0.684658 | ops/us |
| sumLength_Random3 | thrpt | 1 | 10 | 11.177564 | 0.080164 | ops/us |
| sumLength_Unmodifiable | thrpt | 1 | 10 | 51.776108 | 3.751297 | ops/us |

```java
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class MegamorphicList {

    private List<String>[] strings;

    @Setup(Level.Trial)
    public void init() {
        strings = new List[]{getArrayList(6), getFactoryList6(), getUnModifiableList(6)};
    }

    @Benchmark
    public int sumLength_ArrayList() {
        List<String> list = strings[0];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole ^ ThreadLocalRandom.current().nextInt();
    }

    @Benchmark
    public int sumLength_Factory() {
        List<String> list = strings[1];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole ^ ThreadLocalRandom.current().nextInt();
    }

    @Benchmark
    public int sumLength_Unmodifiable() {
        List<String> list = strings[2];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole ^ ThreadLocalRandom.current().nextInt();
    }

    @Benchmark
    public int sumLength_Random2() {
        // bimorphic: chooses between the ArrayList and the List.of implementation
        List<String> list = strings[ThreadLocalRandom.current().nextInt(2)];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole;
    }

    @Benchmark
    public int sumLength_Random3() {
        // megamorphic: chooses between all three implementations
        List<String> list = strings[ThreadLocalRandom.current().nextInt(3)];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole;
    }

    private List<String> getUnModifiableList(int size) {
        return Collections.unmodifiableList(getArrayList(size));
    }

    private List<String> getFactoryList6() {
        return List.of(randomString(),
                randomString(),
                randomString(),
                randomString(),
                randomString(),
                randomString()
        );
    }

    private List<String> getArrayList(int size) {
        List<String> list = new ArrayList<>();
        for (int i = 0; i < size; ++i) {
            list.add(randomString());
        }
        return list;
    }

    private String randomString() {
        // assumption: any short random string will do
        return UUID.randomUUID().toString();
    }
}
```

Since writing this post, I have been challenged on whether this result is due to failure to inline or not. This can be easily verified by setting the following JVM arguments to print compilation:

```
-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
```

You will see the `ArrayList` and `ListN` get inlined quickly in isolation:

```
\-> TypeProfile (19810/19810 counts) = java/util/ArrayList
@ 27   java.util.ArrayList::get (15 bytes)   inline (hot)
...
\-> TypeProfile (363174/363174 counts) = java/util/ImmutableCollections$ListN
@ 24   java.util.ImmutableCollections$ListN::get (17 bytes)   inline (hot)
```

However, the call remains virtual (and not inlined) when three or more implementations are present:

```
@ 30   java.util.List::get (0 bytes)   virtual call
```

I didn’t even bother to try factory methods of different arity, because three is the magic number. Syntactic sugar is nice, but use it with caution.

# Incidental Similarity

I recently saw an interesting class, `BitVector`, in Apache Arrow, which represents a column of bits and provides minimal or zero-copy distribution. The implementation is similar to a bitset, but backed by a `byte[]` rather than a `long[]`. Given the coincidental similarity in implementation, it’s tempting to look at this, extend its interface, and try to use it as a general purpose, distributed bitset. Could this work? Why not just implement some extra methods? Fork it on GitHub!

This post details the caveats of trying to adapt an abstraction beyond its intended purpose, examined through the lens of performance: a scenario where generic bitset capabilities are added to `BitVector` without due consideration. This runs into the observable cost of widening `byte` operands, a constraint imposed by JLS 15.22. In the end, the only remedy is to use a `long[]`, sacrificing the original zero-copy design goal. I hope this is a fairly self-contained example of how uncontrolled adaptation can be hostile to the original design goals: having the source code isn’t a good enough reason to modify it.

### Checking bits

How fast is it to check whether the bit at index `i` is set? `BitVector` implements this functionality and was designed for it. It can be measured with JMH by generating a random `long[]`, creating a `byte[]` eight times longer with identical bits, and measuring the throughput of checking the value of the bit at random indices. It turns out that if all you want to do is access bits, `byte[]` isn’t such a bad choice, and if those bytes are coming directly from the network, it could even be a great choice. I ran the benchmark below and saw that the two operations are similar (within measurement error).

```java
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class BitSet {

    @Param({"1024", "2048", "4096", "8192"})
    int size;

    private long[] leftLongs;
    private long[] rightLongs;
    private long[] differenceLongs;
    private byte[] leftBytes;
    private byte[] rightBytes;
    private byte[] differenceBytes;

    @Setup(Level.Trial)
    public void init() {
        this.leftLongs = createLongArray(size);
        this.rightLongs = createLongArray(size);
        this.differenceLongs = new long[size];
        this.leftBytes = makeBytesFromLongs(leftLongs);
        this.rightBytes = makeBytesFromLongs(rightLongs);
        this.differenceBytes = new byte[size * 8];
    }

    @Benchmark
    public boolean CheckBit_LongArray() {
        int index = index();
        return (leftLongs[index >>> 6] & (1L << index)) != 0;
    }

    @Benchmark
    public boolean CheckBit_ByteArray() {
        int index = index();
        return ((leftBytes[index >>> 3] & 0xFF) & (1 << (index & 7))) != 0;
    }

    private int index() {
        // assumption: a uniformly random bit index within the vector
        return ThreadLocalRandom.current().nextInt(size * 64);
    }

    private static long[] createLongArray(int size) {
        // assumption: random words
        long[] array = new long[size];
        for (int i = 0; i < size; ++i) {
            array[i] = ThreadLocalRandom.current().nextLong();
        }
        return array;
    }

    private static byte[] makeBytesFromLongs(long[] array) {
        byte[] bytes = new byte[8 * array.length];
        for (int i = 0; i < array.length; ++i) {
            long word = array[i];
            bytes[8 * i + 7] = (byte) word;
            bytes[8 * i + 6] = (byte) (word >>> 8);
            bytes[8 * i + 5] = (byte) (word >>> 16);
            bytes[8 * i + 4] = (byte) (word >>> 24);
            bytes[8 * i + 3] = (byte) (word >>> 32);
            bytes[8 * i + 2] = (byte) (word >>> 40);
            bytes[8 * i + 1] = (byte) (word >>> 48);
            bytes[8 * i]     = (byte) (word >>> 56);
        }
        return bytes;
    }
}
```

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|---------|---------|-------|---------------------|------|-------------|
| CheckBit_ByteArray | thrpt | 1 | 10 | 174.421170 | 1.583275 | ops/us | 1024 |
| CheckBit_ByteArray | thrpt | 1 | 10 | 173.938408 | 1.445796 | ops/us | 2048 |
| CheckBit_ByteArray | thrpt | 1 | 10 | 172.522190 | 0.815596 | ops/us | 4096 |
| CheckBit_ByteArray | thrpt | 1 | 10 | 167.550530 | 1.677091 | ops/us | 8192 |
| CheckBit_LongArray | thrpt | 1 | 10 | 171.639695 | 0.934494 | ops/us | 1024 |
| CheckBit_LongArray | thrpt | 1 | 10 | 169.703960 | 2.427244 | ops/us | 2048 |
| CheckBit_LongArray | thrpt | 1 | 10 | 169.333360 | 1.649654 | ops/us | 4096 |
| CheckBit_LongArray | thrpt | 1 | 10 | 166.518375 | 0.815433 | ops/us | 8192 |
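
One detail worth noting: `makeBytesFromLongs` lays each word out big-endian (the low byte of word `w` lands at offset `8 * w + 7`), while `CheckBit_ByteArray` indexes the bytes as if the layout were little-endian, so for a given index the two benchmarks don’t read the same logical bit. For a throughput measurement this is immaterial, but anyone actually adapting the layout needs the consistent mapping, sketched here:

```java
import java.util.SplittableRandom;

SplittableRandom random = new SplittableRandom(0);
long[] words = new long[16];
for (int i = 0; i < words.length; ++i) {
    words[i] = random.nextLong();
}

// same big-endian-within-word layout as makeBytesFromLongs
byte[] bytes = new byte[8 * words.length];
for (int w = 0; w < words.length; ++w) {
    for (int j = 0; j < 8; ++j) {
        bytes[8 * w + 7 - j] = (byte) (words[w] >>> (8 * j));
    }
}

// bit i of the long[] lives at byte 8 * (i >>> 6) + 7 - ((i >>> 3) & 7), bit i & 7
for (int i = 0; i < 64 * words.length; ++i) {
    boolean inLongs = (words[i >>> 6] & (1L << i)) != 0;
    boolean inBytes = (bytes[8 * (i >>> 6) + 7 - ((i >>> 3) & 7)] & (1 << (i & 7))) != 0;
    assert inLongs == inBytes;
}
```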

For this functionality alone, there’s no reason to prefer either representation, and it must be very appealing to use the bytes as they are delivered from the network, avoiding copying costs. Given that, for a database column, this is the only operation needed, and that Apache Arrow has a stated aim to copy data as little as possible, this seems like quite a good decision.

### Logical Operations

But what happens if you try to add a logical operation to `BitVector`, such as an XOR? We need to handle the fact that `byte`s are signed, and that the JLS requires them to be promoted to `int`, preserving the sign bit, before any bitwise operation. Unchecked sign extension would break the bitset, so extra masking operations are required to keep the eighth bit of each byte in its right place. With the widening and its associated workarounds, the `byte[]` is suddenly a much poorer choice than a `long[]`, and it shows in benchmarks.
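
The sign extension itself is easy to demonstrate (a tiny sketch):

```java
byte b = (byte) 0x80;         // bit pattern 1000_0000
assert b == -128;             // bytes are signed
assert (int) b == 0xFFFFFF80; // widening to int sign-extends: the top 24 bits are set
assert (b & 0xFF) == 0x80;    // masking recovers the unsigned value
```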

```java
@Benchmark
public void Difference_ByteArray(Blackhole bh) {
    for (int i = 0; i < leftBytes.length && i < rightBytes.length; ++i) {
        differenceBytes[i] = (byte) ((leftBytes[i] & 0xFF) ^ (rightBytes[i] & 0xFF));
    }
    bh.consume(differenceBytes);
}

@Benchmark
public void Difference_LongArray(Blackhole bh) {
    for (int i = 0; i < leftLongs.length && i < rightLongs.length; ++i) {
        differenceLongs[i] = leftLongs[i] ^ rightLongs[i];
    }
    bh.consume(differenceLongs);
}
```

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|---------|---------|-------|---------------------|------|-------------|
| Difference_ByteArray | thrpt | 1 | 10 | 0.805872 | 0.038644 | ops/us | 1024 |
| Difference_ByteArray | thrpt | 1 | 10 | 0.391705 | 0.017453 | ops/us | 2048 |
| Difference_ByteArray | thrpt | 1 | 10 | 0.190102 | 0.008580 | ops/us | 4096 |
| Difference_ByteArray | thrpt | 1 | 10 | 0.169104 | 0.015086 | ops/us | 8192 |
| Difference_LongArray | thrpt | 1 | 10 | 2.450659 | 0.094590 | ops/us | 1024 |
| Difference_LongArray | thrpt | 1 | 10 | 1.047330 | 0.016898 | ops/us | 2048 |
| Difference_LongArray | thrpt | 1 | 10 | 0.546286 | 0.014211 | ops/us | 4096 |
| Difference_LongArray | thrpt | 1 | 10 | 0.277378 | 0.015663 | ops/us | 8192 |

This is a fairly crazy slowdown. Why? You need to look at the assembly generated in each case. For `long[]` it’s demonstrable that the logical operations do vectorise. The JLS, specifically section 15.22, doesn’t really give the `byte[]` implementation a chance: it states that for bitwise operations, sub-`dword` primitive types must be promoted or widened before the operation. This means that if one were to try to implement this operation with, say, AVX2, using 256 bit `ymmword`s each consisting of 32 `byte`s, then each `ymmword` would have to be inflated by a factor of four; it gets complicated quickly, given this constraint. Despite that complexity, I was surprised to see that C2 does use 128 bit `xmmword`s, but it’s not as fast as using the full 256 bit registers available. This can be seen by printing the emitted assembly as before.

```asm
movsxd  r10,ebx
vmovq   xmm2,mmword ptr [rsi+r10+10h]
vpxor   xmm2,xmm2,xmmword ptr [r8+r10+10h]
vmovq   mmword ptr [rax+r10+10h],xmm2
```