# L’esprit de l’escalier

L’esprit de l’escalier is the inability to respond until it’s too late to bother. Perhaps not until you’ve reached the bottom of a staircase, having had an altercation at the top. You’re most likely to have experienced this if you are at all introverted and have been to a job interview, having only managed to find solutions to simple problems once you find yourself alone on a train full of strangers. The more I ponder the psychology of this phenomenon, the more it fascinates me.

Some programmers dread brain teasers, claiming they are irrelevant to the job. I think this is wrong. In reality, so-called brain teasers are valid tests for application developers because they tend to be questions about programming translated into some other – admittedly, often fanciful – domain. They measure your ability to interpret requirements. I think if you try to assess this in person, the outcome often depends on personality types.

### Unwrapping the Riddle

Here’s an odd puzzle with a programming question hiding somewhere inside it.

There are 64 bottles of wine, and one is poisoned. In three days’ time there will be a banquet and you must make sure none of the guests are poisoned. The poison takes between two and two-and-a-half days to kill a man, but you can risk the lives of as many of 32 prisoners as you like to find the poisoned wine bottle. What’s the smallest number of prisoners you can use to find the poisoned bottle?

You may have seen this puzzle before and get out of jail free, and many websites exist to allow people to research these problems for a reason. But, phrased another way, you are actually being asked a simple programming question:

A 64-bit integer has a value equal to a power of two. Find its logarithm in as few steps as possible.

OK! Simple! Only one bit is set, it just needs to be found. The logarithm is the number of zeroes less significant than the bit. The set bit is either in the top half or the bottom half, so test this recursively. We can find the index of the bit by using a sequence of independent bit masks, incrementing a counter differently for each mask which intersects with the value. The code is very simple:

``````
private static int indexOfFirstBit(long value) {
int index = 0;
int temp;
if ((value & 0xFFFFFFFF00000000L) != 0) {
index += 32;
temp = (int)(value >>> 32);
} else {
temp = (int)value;
}
if ((temp & 0xFFFF0000) != 0) {
index += 16;
temp >>>= 16;
}
if ((temp & 0xFF00) != 0) {
index += 8;
temp >>>= 8;
}
if ((temp & 0xF0) != 0) {
index += 4;
temp >>>= 4;
}
if ((temp & 0xC) != 0) {
index += 2;
temp >>>= 2;
}
if ((temp & 0x2) != 0) {
index += 1;
}
return index;
}
``````

The idea is that you keep halving the size of the problem, and add the width of the zero part of each mask that passes (counting from the right). This can be rearranged but the code above is most obvious. You may even recognise that this is equivalent to the x86 instruction `TZCNT`, which in Java is accessible via the intrinsic `Long.numberOfTrailingZeros`.

So, going back to la-la land, you only need to risk the lives of six prisoners. All you have to do is ask each of the six prisoners to drink half of the bottles, the first one drinks from the first 32 bottles, the second from the first 16 and the third 16, and so on until the sixth prisoner who drinks from every other bottle. Before three days are up, up to six prisoners will die. If prisoner one dies, you know the poisoned bottle is in the second half. If prisoner two also dies, you know it’s in the third quarter. If the sixth prisoner doesn’t die, you know the poisoned bottle is one of the odd numbers.

### Psychology and Interactive Problem Solving

Some people have seen these problems before, others enjoy solving them collaboratively, others instantly recognise solutions. Why does this only come to mind on the train home for some people? Well, it could be stupidity, but I think it’s purely psychological. You may be the sort of person who doesn’t want to solve a problem, even a simple one like this, interactively. Perhaps it feels invasive to reveal your precise thought processes to someone you just met, and aren’t sure you like yet. You might find you take your work with you on walks, or come up with your best ideas in the shower. In short, you may just be an introvert.

There’s a darker side to this psychological cause of failure. You may find that your interviewer is actually an interrogator, is trying to insult your ego, and you’ll feel drawn to defend yourself. The more you defend yourself, the less you can commit to a logical thought process; you become emotional. Your thoughts become jumbled as your emotional defence takes over. You focus on how you are being judged, that you were being judged five seconds ago and still haven’t solved the problem, that you are being judged yet more harshly now. You aren’t actually thinking about the problem at all any more. You aren’t being tested on your ability to solve the problem, but on your ability to solve the problem and defend your self esteem at the same time. Some people fail at this.

I actually think this is fair game in many cases. For instance, if you want to be an Army Officer, you can’t get flustered and be allowed to be thrown off your game like this: people might be shooting at you! For programmers, however, I’m not sure. Most programmers are used to the feeling of being in “flow” – the feeling of being immersed in the task at hand, knowledge instantly accessible. There’s never another person there when you’re in flow. I suppose a programmer in flow is a lot like Schrödinger’s Cat – you destroy it just by interacting with it.

# Building RoaringBitmaps from Streams

RoaringBitmap is a fast compressed bitset format. In the Java implementation of Roaring, it was until recently preferential to build a bitset in one go from sorted data; there were performance penalties of varying magnitude for incremental or unordered insertions. In a recent pull request, I wanted to improve incremental monotonic insertion so I could build bitmaps from streams, but sped up unsorted batch creation significantly by accident.

### Incremental Ordered Insertion

If you want to build a bitmap, you can do so efficiently with the `RoaringBitmap.bitmapOf` factory method.

``````
int[] data = ...
RoaringBitmap bitmap = RoaringBitmap.bitmapOf(data);
``````

However, I often find that I want to stream integers into a bitmap. Given that the integers being inserted into a bitmap often represent indices into an array, such a stream is likely to be monotonic. You might implement this like so:

``````
IntStream stream = ...
RoaringBitmap bitmap = new RoaringBitmap();
``````

While this is OK, it has a few inefficiencies compared to the batch creation method.

• Indirection: the container being written to must be located on each insertion
• Eagerness: the cardinality must be kept up to date on each insertion
• Allocation pressure: the best container type can’t be known in advance. Choice of container may change as data is inserted, this requires allocations of new instances.

You could also collect the stream into an `int[]` and use the batch method, but it could be a large temporary object with obvious drawbacks.

### OrderedWriter

The solution I proposed is to create a writer object (OrderedWriter) which allocates a small buffer of 8KB, to use as a bitmap large enough to cover 16 bits. The stream to bitmap code becomes:

``````
IntStream stream = ...
RoaringBitmap bitmap = new RoaringBitmap();
OrderedWriter writer = new OrderedWriter(bitmap);
writer.flush(); // clear the buffer out
``````

This is implemented so that changes in key (where the most significant 16 bits of each integer is stored) trigger a flush of the buffer.

``````
short key = Util.highbits(value);
short low = Util.lowbits(value);
if (key != currentKey) {
if (Util.compareUnsigned(key, currentKey) < 0) {
throw new IllegalStateException("Must write in ascending key order");
}
flush();
}
int ulow = low & 0xFFFF;
bitmap[(ulow >>> 6)] |= (1L << ulow);
currentKey = key;
dirty = true;
}
``````

When a flush occurs, a container type is chosen and appended to the bitmap’s prefix index.

``````
public void flush() {
if (dirty) {
RoaringArray highLowContainer = underlying.highLowContainer;
// we check that it's safe to append since RoaringArray.append does no validation
if (highLowContainer.size > 0) {
short key = highLowContainer.getKeyAtIndex(highLowContainer.size - 1);
if (Util.compareUnsigned(currentKey, key) <= 0) {
throw new IllegalStateException("Cannot write " + currentKey + " after " + key);
}
}
highLowContainer.append(currentKey, chooseBestContainer());
clearBitmap();
dirty = false;
}
}
``````

There are significant performance advantages in this approach. There is no indirection cost, and no searches in the prefix index for containers: the writes are just buffered. The buffer is small enough to fit in cache, and containers only need to be created when the writer is flushed, which happens whenever a new key is seen, or when `flush` is called manually. During a flush, the cardinality can be computed in one go, the best container can be chosen, and run optimisation only has to happen once. Computing the cardinality is the only bottleneck – it requires 1024 calls to `Long.bitCount` which can’t be vectorised in a language like Java. It can’t be incremented on insertion without either sacrificing idempotence or incurring the cost of a membership check. After the flush, the buffer needs to be cleared, using a call to `Arrays.fill` which is vectorised. So, despite the cost of the buffer, this can be quite efficient.

This approach isn’t universally applicable. For instance, you must write data in ascending order of the most significant 16 bits. You must also remember to flush the writer when you’re finished: until you’ve called flush, the data in the last container may not be in the bitmap. For my particular use case, this is reasonable. However, there are times when this is not fit for purpose, such as if you are occasionally inserting values and expect them to be available to queries immediately. In general, if you don’t know when you’ll stop adding data to the bitmap, this isn’t a good fit because you won’t know when to call flush.

### Benchmark

The RoaringBitmap project is a big user of JMH, and most pull requests require benchmarks as evidence of performance improvement. I benchmarked the two approaches, varying bitmap sizes and randomness (likelihood of there not being a compressible run), and was amazed to find that this approach actually beats having a sorted array and using `RoaringBitmap.bitmapOf`. Less surprising was beating the existing API for incremental adds (this was the goal in the first place). Lower is better:

Benchmark (randomness) (size) Mode Cnt Score Error Units
buildRoaringBitmap 0.1 10000 avgt 5 54.263 3.393 us/op
buildRoaringBitmap 0.1 100000 avgt 5 355.188 15.234 us/op
buildRoaringBitmap 0.1 1000000 avgt 5 3567.839 135.149 us/op
buildRoaringBitmap 0.1 10000000 avgt 5 31982.046 1227.325 us/op
buildRoaringBitmap 0.5 10000 avgt 5 53.855 0.887 us/op
buildRoaringBitmap 0.5 100000 avgt 5 357.671 14.111 us/op
buildRoaringBitmap 0.5 1000000 avgt 5 3556.152 243.671 us/op
buildRoaringBitmap 0.5 10000000 avgt 5 34385.971 3864.143 us/op
buildRoaringBitmap 0.9 10000 avgt 5 59.354 10.385 us/op
buildRoaringBitmap 0.9 100000 avgt 5 374.245 54.485 us/op
buildRoaringBitmap 0.9 1000000 avgt 5 3712.684 657.964 us/op
buildRoaringBitmap 0.9 10000000 avgt 5 37223.976 4691.297 us/op
incrementalNativeAdd 0.1 10000 avgt 5 115.213 31.909 us/op
incrementalNativeAdd 0.1 100000 avgt 5 911.925 127.922 us/op
incrementalNativeAdd 0.1 1000000 avgt 5 8889.49 320.821 us/op
incrementalNativeAdd 0.1 10000000 avgt 5 102819.877 14247.868 us/op
incrementalNativeAdd 0.5 10000 avgt 5 116.878 28.232 us/op
incrementalNativeAdd 0.5 100000 avgt 5 947.076 128.255 us/op
incrementalNativeAdd 0.5 1000000 avgt 5 7190.443 202.012 us/op
incrementalNativeAdd 0.5 10000000 avgt 5 98843.303 4325.924 us/op
incrementalNativeAdd 0.9 10000 avgt 5 101.694 6.579 us/op
incrementalNativeAdd 0.9 100000 avgt 5 816.411 65.678 us/op
incrementalNativeAdd 0.9 1000000 avgt 5 9114.624 412.152 us/op
incrementalNativeAdd 0.9 10000000 avgt 5 108793.694 22562.527 us/op
incrementalUseOrderedWriter 0.1 10000 avgt 5 23.573 5.962 us/op
incrementalUseOrderedWriter 0.1 100000 avgt 5 289.588 36.814 us/op
incrementalUseOrderedWriter 0.1 1000000 avgt 5 2785.659 49.385 us/op
incrementalUseOrderedWriter 0.1 10000000 avgt 5 29489.758 2601.39 us/op
incrementalUseOrderedWriter 0.5 10000 avgt 5 23.57 1.536 us/op
incrementalUseOrderedWriter 0.5 100000 avgt 5 276.488 9.662 us/op
incrementalUseOrderedWriter 0.5 1000000 avgt 5 2799.408 198.77 us/op
incrementalUseOrderedWriter 0.5 10000000 avgt 5 28313.626 1976.042 us/op
incrementalUseOrderedWriter 0.9 10000 avgt 5 22.345 1.574 us/op
incrementalUseOrderedWriter 0.9 100000 avgt 5 280.205 36.987 us/op
incrementalUseOrderedWriter 0.9 1000000 avgt 5 2779.732 93.456 us/op
incrementalUseOrderedWriter 0.9 10000000 avgt 5 30568.591 2140.826 us/op

These benchmarks don’t go far enough to support replacing `RoaringBitmap.bitmapOf`.

### Unsorted Input Data

In the cases benchmarked, this approach seems to be worthwhile. I can’t actually think of a case where someone would want to build a bitmap from unsorted data, but it occurred to me that this approach might be fast enough to cover the cost of a sort. `OrderedWriter` is also relaxed enough that it only needs the most significant 16 bits to be monotonic, so a full sort isn’t necessary. Implementing a radix sort on the most significant 16 bits (stable in the least significant 16 bits), prior to incremental insertion via an `OrderedWriter`, leads to huge increases in performance over `RoaringBitmap.bitmapOf`. The implementation is as follows:

``````
public static RoaringBitmap bitmapOfUnordered(final int... data) {
RoaringBitmap bitmap = new RoaringBitmap();
OrderedWriter writer = new OrderedWriter(bitmap);
for (int i : data) {
}
writer.flush();
return bitmap;
}
``````

It did very well, according to benchmarks, even against various implementations of sort prior to `RoaringBitmap.bitmapOf`. Lower is better:

Benchmark (randomness) (size) Mode Cnt Score Error Units
bitmapOf 0.1 10000 avgt 5 1058.106 76.013 us/op
bitmapOf 0.1 100000 avgt 5 12323.905 976.68 us/op
bitmapOf 0.1 1000000 avgt 5 171812.526 9593.879 us/op
bitmapOf 0.1 10000000 avgt 5 3376296.157 170362.195 us/op
bitmapOf 0.5 10000 avgt 5 1096.663 477.795 us/op
bitmapOf 0.5 100000 avgt 5 12836.177 1674.54 us/op
bitmapOf 0.5 1000000 avgt 5 171998.126 4176 us/op
bitmapOf 0.5 10000000 avgt 5 3707804.439 974532.361 us/op
bitmapOf 0.9 10000 avgt 5 1124.881 65.673 us/op
bitmapOf 0.9 100000 avgt 5 14585.589 1894.788 us/op
bitmapOf 0.9 1000000 avgt 5 198506.813 8552.218 us/op
bitmapOf 0.9 10000000 avgt 5 3723942.934 423704.363 us/op
bitmapOfUnordered 0.1 10000 avgt 5 174.583 17.475 us/op
bitmapOfUnordered 0.1 100000 avgt 5 1768.613 86.543 us/op
bitmapOfUnordered 0.1 1000000 avgt 5 17889.705 135.714 us/op
bitmapOfUnordered 0.1 10000000 avgt 5 192645.352 6482.726 us/op
bitmapOfUnordered 0.5 10000 avgt 5 157.351 3.254 us/op
bitmapOfUnordered 0.5 100000 avgt 5 1674.919 90.138 us/op
bitmapOfUnordered 0.5 1000000 avgt 5 16900.458 778.999 us/op
bitmapOfUnordered 0.5 10000000 avgt 5 185399.32 4383.485 us/op
bitmapOfUnordered 0.9 10000 avgt 5 145.642 1.257 us/op
bitmapOfUnordered 0.9 100000 avgt 5 1515.845 82.914 us/op
bitmapOfUnordered 0.9 1000000 avgt 5 15807.597 811.048 us/op
bitmapOfUnordered 0.9 10000000 avgt 5 167863.49 3501.132 us/op
partialSortThenBitmapOf 0.1 10000 avgt 5 1060.152 168.802 us/op
partialSortThenBitmapOf 0.1 100000 avgt 5 10942.731 347.583 us/op
partialSortThenBitmapOf 0.1 1000000 avgt 5 100606.506 24705.341 us/op
partialSortThenBitmapOf 0.1 10000000 avgt 5 1035448.545 157383.713 us/op
partialSortThenBitmapOf 0.5 10000 avgt 5 1029.883 100.291 us/op
partialSortThenBitmapOf 0.5 100000 avgt 5 10472.509 832.719 us/op
partialSortThenBitmapOf 0.5 1000000 avgt 5 101144.032 16908.087 us/op
partialSortThenBitmapOf 0.5 10000000 avgt 5 958242.087 39650.946 us/op
partialSortThenBitmapOf 0.9 10000 avgt 5 1008.413 70.999 us/op
partialSortThenBitmapOf 0.9 100000 avgt 5 10458.34 600.416 us/op
partialSortThenBitmapOf 0.9 1000000 avgt 5 103945.644 2026.26 us/op
partialSortThenBitmapOf 0.9 10000000 avgt 5 1065638.269 102257.059 us/op
setupCost 0.1 10000 avgt 5 6.577 0.121 us/op
setupCost 0.1 100000 avgt 5 61.378 24.113 us/op
setupCost 0.1 1000000 avgt 5 1021.588 536.68 us/op
setupCost 0.1 10000000 avgt 5 13182.341 196.773 us/op
setupCost 0.5 10000 avgt 5 7.139 2.216 us/op
setupCost 0.5 100000 avgt 5 60.847 23.395 us/op
setupCost 0.5 1000000 avgt 5 800.888 14.711 us/op
setupCost 0.5 10000000 avgt 5 13431.625 553.44 us/op
setupCost 0.9 10000 avgt 5 6.599 0.09 us/op
setupCost 0.9 100000 avgt 5 60.946 22.511 us/op
setupCost 0.9 1000000 avgt 5 813.445 4.896 us/op
setupCost 0.9 10000000 avgt 5 13374.943 349.314 us/op
sortThenBitmapOf 0.1 10000 avgt 5 636.23 13.423 us/op
sortThenBitmapOf 0.1 100000 avgt 5 7411.756 174.264 us/op
sortThenBitmapOf 0.1 1000000 avgt 5 92299.305 3651.161 us/op
sortThenBitmapOf 0.1 10000000 avgt 5 1096374.443 162575.234 us/op
sortThenBitmapOf 0.5 10000 avgt 5 634.957 47.447 us/op
sortThenBitmapOf 0.5 100000 avgt 5 7939.074 409.328 us/op
sortThenBitmapOf 0.5 1000000 avgt 5 93505.427 5409.749 us/op
sortThenBitmapOf 0.5 10000000 avgt 5 1147933.592 57485.51 us/op
sortThenBitmapOf 0.9 10000 avgt 5 661.072 6.717 us/op
sortThenBitmapOf 0.9 100000 avgt 5 7915.506 356.148 us/op
sortThenBitmapOf 0.9 1000000 avgt 5 93403.343 5454.583 us/op
sortThenBitmapOf 0.9 10000000 avgt 5 1095960.734 85753.917 us/op

It looks like there are good performance gains available here, but these things tend to depend on particular data sets. I would be interested in hearing from anyone who has tried to use this class in a real application.

# Matrix Multiplication Revisited

In a recent post, I took a look at matrix multiplication in pure Java, to see if it can go faster than reported in SIMD Intrinsics on Managed Language Runtimes. I found faster implementations than the paper’s benchmarks implied was possible. Nevertheless, I found that there were some limitations in Hotspot’s autovectoriser that I didn’t expect to see, even in JDK10. Were these limitations somehow fundamental, or can other compilers do better with essentially the same input?

I took a look at the code generated by GCC’s autovectoriser to see what’s possible in C/C++ without resorting to complicated intrinsics. For a bit of fun, I went over to the dark side to squeeze out some a lot of extra performance, which gave inspiration to a simple vectorised Java implementation which can maintain intensity as matrix size increases.

### Background

The paper reports a 5x improvement in matrix multiplication throughput as a result of using LMS generated intrinsics. Using GCC as LMS’s backend, I easily reproduced very good throughput, but I found two Java implementations better than the paper’s baseline. The best performing Java implementation proposed in the paper was `blocked`. This post is not about the LMS benchmarks, but this code is this post’s inspiration.

``````
public void blocked(float[] a, float[] b, float[] c, int n) {
int BLOCK_SIZE = 8;
// GOOD: attempts to do as much work in submatrices
// GOOD: tries to avoid bringing data through cache multiple times
for (int kk = 0; kk < n; kk += BLOCK_SIZE) {
for (int jj = 0; jj < n; jj += BLOCK_SIZE) {
for (int i = 0; i < n; i++) {
for (int j = jj; j < jj + BLOCK_SIZE; ++j) {
// BAD: manual unrolling, bypasses optimisations
float sum = c[i * n + j];
for (int k = kk; k < kk + BLOCK_SIZE; ++k) {
// BAD: horizontal sums are inefficient
sum += a[i * n + k] * b[k * n + j];
}
c[i * n + j] = sum;
}
}
}
}
}
``````

I proposed the following implementation for improved cache efficiency and expected it to vectorise automatically.

``````public void fast(float[] a, float[] b, float[] c, int n) {
// GOOD: 2x faster than "blocked" - why?
int in = 0;
for (int i = 0; i < n; ++i) {
int kn = 0;
for (int k = 0; k < n; ++k) {
float aik = a[in + k];
// MIXED: passes over c[in:in+n] multiple times per k-value, "free" if n is small
// MIXED: reloads b[kn:kn+n] repeatedly for each i, bad if n is large, "free" if n is small
// BAD: doesn't vectorise but should
for (int j = 0; j < n; ++j) {
c[in + j] += aik * b[kn + j]; // sequential writes and reads, cache and vectoriser friendly
}
kn += n;
}
in += n;
}
}
``````

My code actually doesn’t vectorise, even in JDK10, which really surprised me because the inner loop vectorises if the offsets are always zero. In any case, there is a simple hack involving the use of buffers, which unfortunately thrashes the cache, but narrows the field significantly.

``````
public void fastBuffered(float[] a, float[] b, float[] c, int n) {
float[] bBuffer = new float[n];
float[] cBuffer = new float[n];
int in = 0;
for (int i = 0; i < n; ++i) {
int kn = 0;
for (int k = 0; k < n; ++k) {
float aik = a[in + k];
System.arraycopy(b, kn, bBuffer, 0, n);
saxpy(n, aik, bBuffer, cBuffer);
kn += n;
}
System.arraycopy(cBuffer, 0, c, in, n);
Arrays.fill(cBuffer, 0f);
in += n;
}
}
``````

I left the problem looking like this, with the “JDKX vectorised” lines using the algorithm above with a buffer hack:

### GCC Autovectorisation

The Java code is very easy to translate into C/C++. Before looking at performance I want to get an idea of what GCC’s autovectoriser does. I want to see the code generated at GCC optimisation level 3, with unrolled loops, FMA, and AVX2, which can be seen as follows:

`g++ -mavx2 -mfma -march=native -funroll-loops -O3 -S mmul.cpp`

The generated assembly code can be seen in full context here. Let’s look at the `mmul_saxpy` routine first:

``````static void mmul_saxpy(const int n, const float* left, const float* right, float* result) {
int in = 0;
for (int i = 0; i < n; ++i) {
int kn = 0;
for (int k = 0; k < n; ++k) {
float aik = left[in + k];
for (int j = 0; j < n; ++j) {
result[in + j] += aik * right[kn + j];
}
kn += n;
}
in += n;
}
}
``````

This routine uses SIMD instructions, which means in principle any other compiler could do this too. The inner loop has been unrolled, but this is only by virtue of the `-funroll-loops` flag. C2 does this sort of thing as standard, but only for hot loops. In general you might not want to unroll loops because of the impact on code size, and it’s great that a JIT compiler can decide only to do this when it’s profitable.

``````.L9:
vmovups  (%rdx,%rax), %ymm4
vmovaps  %ymm4, (%r11,%rax)
vmovups  32(%rdx,%rax), %ymm5
vmovaps  %ymm5, 32(%r11,%rax)
vmovups  64(%rdx,%rax), %ymm1
vmovaps  %ymm1, 64(%r11,%rax)
vmovups  96(%rdx,%rax), %ymm2
vmovaps  %ymm2, 96(%r11,%rax)
vmovups  128(%rdx,%rax), %ymm4
vmovaps  %ymm4, 128(%r11,%rax)
vmovups  160(%rdx,%rax), %ymm5
vmovaps  %ymm5, 160(%r11,%rax)
vmovups  192(%rdx,%rax), %ymm1
vmovaps  %ymm1, 192(%r11,%rax)
vmovups  224(%rdx,%rax), %ymm2
vmovaps  %ymm2, 224(%r11,%rax)
cmpl  %r10d, 24(%rsp)
ja  .L9
``````

The `mmul_blocked` routine is compiled to quite convoluted assembly. It has a huge problem with the expression `right[k * n + j]`, which requires a gather and is almost guaranteed to create 8 cache misses per block for large matrices. Moreover, this inefficiency gets much worse with problem size.

``````static void mmul_blocked(const int n, const float* left, const float* right, float* result) {
int BLOCK_SIZE = 8;
for (int kk = 0; kk < n; kk += BLOCK_SIZE) {
for (int jj = 0; jj < n; jj += BLOCK_SIZE) {
for (int i = 0; i < n; i++) {
for (int j = jj; j < jj + BLOCK_SIZE; ++j) {
float sum = result[i * n + j];
for (int k = kk; k < kk + BLOCK_SIZE; ++k) {
sum += left[i * n + k] * right[k * n + j]; // second read here requires a gather
}
result[i * n + j] = sum;
}
}
}
}
}
``````

This compiles to assembly with the unrolled vectorised loop below:

``````.L114:
cmpq  %r10, %r9
setbe  %cl
cmpq  56(%rsp), %r8
setnb  %dl
orl  %ecx, %edx
cmpq  %r14, %r9
setbe  %cl
cmpq  64(%rsp), %r8
setnb  %r15b
orl  %ecx, %r15d
andl  %edx, %r15d
cmpq  %r11, %r9
setbe  %cl
cmpq  48(%rsp), %r8
setnb  %dl
orl  %ecx, %edx
andl  %r15d, %edx
cmpq  %rbx, %r9
setbe  %cl
cmpq  40(%rsp), %r8
setnb  %r15b
orl  %ecx, %r15d
andl  %edx, %r15d
cmpq  %rsi, %r9
setbe  %cl
cmpq  32(%rsp), %r8
setnb  %dl
orl  %ecx, %edx
andl  %r15d, %edx
cmpq  %rdi, %r9
setbe  %cl
cmpq  24(%rsp), %r8
setnb  %r15b
orl  %ecx, %r15d
andl  %edx, %r15d
cmpq  %rbp, %r9
setbe  %cl
cmpq  16(%rsp), %r8
setnb  %dl
orl  %ecx, %edx
andl  %r15d, %edx
cmpq  %r12, %r9
setbe  %cl
cmpq  8(%rsp), %r8
setnb  %r15b
orl  %r15d, %ecx
testb  %cl, %dl
je  .L111
leaq  32(%rax), %rdx
cmpq  %rdx, %r8
setnb  %cl
cmpq  %rax, %r9
setbe  %r15b
orb  %r15b, %cl
je  .L111
vmovups  (%r8), %ymm2
vmovups  %ymm0, (%r8)
``````

### Benchmarks

I implemented a suite of benchmarks to compare the implementations. You can run them, but since they measure throughput and intensity averaged over hundreds of iterations per matrix size, the full run will take several hours.

`g++ -mavx2 -mfma -march=native -funroll-loops -O3 mmul.cpp -o mmul.exe && ./mmul.exe > results.csv`

The `saxpy` routine wins, with `blocked` fading fast after a middling start.

name size throughput (ops/s) flops/cycle
blocked 64 22770.2 4.5916
saxpy 64 25638.4 5.16997
blocked 128 2736.9 4.41515
saxpy 128 4108.52 6.62783
blocked 192 788.132 4.29101
saxpy 192 1262.45 6.87346
blocked 256 291.728 3.76492
saxpy 256 521.515 6.73044
blocked 320 147.979 3.72997
saxpy 320 244.528 6.16362
blocked 384 76.986 3.35322
saxpy 384 150.441 6.55264
blocked 448 50.4686 3.4907
saxpy 448 95.0752 6.57594
blocked 512 30.0085 3.09821
saxpy 512 65.1842 6.72991
blocked 576 22.8301 3.35608
saxpy 576 44.871 6.59614
blocked 640 15.5007 3.12571
saxpy 640 32.3709 6.52757
blocked 704 12.2478 3.28726
saxpy 704 25.3047 6.79166
blocked 768 8.69277 3.02899
saxpy 768 19.8011 6.8997
blocked 832 7.29356 3.23122
saxpy 832 15.3437 6.7976
blocked 896 4.95207 2.74011
saxpy 896 11.9611 6.61836
blocked 960 3.4467 2.34571
saxpy 960 9.25535 6.29888
blocked 1024 2.02289 1.67082
saxpy 1024 6.87039 5.67463

With GCC autovectorisation, `saxpy` performs well, maintaining intensity as size increases, albeit well below the theoretical capacity. It would be nice if similar code could be JIT compiled in Java.

### Intel Intrinsics

To understand the problem space a bit better, I find out how fast matrix multiplication can get without domain expertise by handcrafting an algorithm with intrinsics. My laptop’s Skylake chip (turbo boost and hyperthreading disabled) is capable of 32 SP flops per cycle per core – Java and the LMS implementation previously fell a long way short of that. It was difficult getting beyond 4f/c with Java, and LMS peaked at almost 6f/c before quickly tailing off. GCC autovectorisation achieved and maintained 7f/c.

To start, I’ll take full advantage of the facility to align the matrices on 64 byte intervals, since I have 64B cache lines, though this might just be voodoo. I take the `saxpy` routine and replace its kernel with intrinsics. Because of the `-funroll-loops` option, this will get unrolled without effort.

``````static void mmul_saxpy_avx(const int n, const float* left, const float* right, float* result) {
int in = 0;
for (int i = 0; i < n; ++i) {
int kn = 0;
for (int k = 0; k < n; ++k) {
__m256 aik = _mm256_set1_ps(left[in + k]);
int j = 0;
for (; j < n; j += 8) {
}
for (; j < n; ++j) {
result[in + j] += left[in + k] * right[kn + j];
}
kn += n;
}
in += n;
}
}
``````

This code is actually not a lot faster, if at all, than the basic `saxpy` above: a lot of aggressive optimisations have already been applied.

### Combining Blocked and SAXPY

What makes `blocked` so poor is the gather and the cache miss, not the concept of blocking itself. A limiting factor for `saxpy` performance is that the ratio of loads to floating point operations is too high. With this in mind, I tried combining the blocking idea with `saxpy`, by implementing `saxpy` multiplications for smaller sub-matrices. This results in a different algorithm with fewer loads per floating point operation, and the inner two loops are swapped. It avoids the gather and the cache miss in `blocked`. Because the matrices are in row major format, I make the width of the blocks much larger than the height. Also, different heights and widths make sense depending on the size of the matrix, so I choose them dynamically. The design constraints are to avoid gathers and horizontal reduction.

``````static void mmul_tiled_avx(const int n, const float *left, const float *right, float *result) {
const int block_width = n >= 256 ? 512 : 256;
const int block_height = n >= 512 ? 8 : n >= 256 ? 16 : 32;
for (int row_offset = 0; row_offset < n; row_offset += block_height) {
for (int column_offset = 0; column_offset < n; column_offset += block_width) {
for (int i = 0; i < n; ++i) {
for (int j = column_offset; j < column_offset + block_width && j < n; j += 8) {
__m256 sum = _mm256_load_ps(result + i * n + j);
for (int k = row_offset; k < row_offset + block_height && k < n; ++k) {
sum = _mm256_fmadd_ps(_mm256_set1_ps(left[i * n + k]), _mm256_load_ps(right + k * n + j), sum);
}
_mm256_store_ps(result + i * n + j, sum);
}
}
}
}
}
``````

You will see in the benchmark results that this routine really doesn’t do very well compared to `saxpy`. Finally, I unroll it, which is profitable despite setting `-funroll-loops` because there is slightly more to this than an unroll. This is a sequence of vertical reductions which have no data dependencies.

``````static void mmul_tiled_avx_unrolled(const int n, const float *left, const float *right, float *result) {
const int block_width = n >= 256 ? 512 : 256;
const int block_height = n >= 512 ? 8 : n >= 256 ? 16 : 32;
for (int column_offset = 0; column_offset < n; column_offset += block_width) {
for (int row_offset = 0; row_offset < n; row_offset += block_height) {
for (int i = 0; i < n; ++i) {
for (int j = column_offset; j < column_offset + block_width && j < n; j += 64) {
__m256 sum1 = _mm256_load_ps(result + i * n + j);
__m256 sum2 = _mm256_load_ps(result + i * n + j + 8);
__m256 sum3 = _mm256_load_ps(result + i * n + j + 16);
__m256 sum4 = _mm256_load_ps(result + i * n + j + 24);
__m256 sum5 = _mm256_load_ps(result + i * n + j + 32);
__m256 sum6 = _mm256_load_ps(result + i * n + j + 40);
__m256 sum7 = _mm256_load_ps(result + i * n + j + 48);
__m256 sum8 = _mm256_load_ps(result + i * n + j + 56);
for (int k = row_offset; k < row_offset + block_height && k < n; ++k) {
__m256 multiplier = _mm256_set1_ps(left[i * n + k]);
sum2 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 8), sum2);
sum3 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 16), sum3);
sum4 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 24), sum4);
sum5 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 32), sum5);
sum6 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 40), sum6);
sum7 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 48), sum7);
sum8 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 56), sum8);
}
_mm256_store_ps(result + i * n + j, sum1);
_mm256_store_ps(result + i * n + j + 8, sum2);
_mm256_store_ps(result + i * n + j + 16, sum3);
_mm256_store_ps(result + i * n + j + 24, sum4);
_mm256_store_ps(result + i * n + j + 32, sum5);
_mm256_store_ps(result + i * n + j + 40, sum6);
_mm256_store_ps(result + i * n + j + 48, sum7);
_mm256_store_ps(result + i * n + j + 56, sum8);
}
}
}
}
}
``````

This final implementation is fast, and is probably as good as I am going to manage, without reading papers. This should be a CPU bound problem because the algorithm is O(n^3) whereas the problem size is O(n^2). But the flops/cycle decreases with problem size in all of these implementations. It’s possible that this could be amelioarated by a better dynamic tiling policy. I’m unlikely to be able to fix that.

It does make a huge difference being able to go very low level – handwritten intrinsics with GCC unlock awesome throughput – but it’s quite hard to actually get to the point where you can beat a good optimising compiler. Mind you, there are harder problems to solve this, and you may well be a domain expert.

The benchmark results summarise this best:

name size throughput (ops/s) flops/cycle
saxpy_avx 64 49225.7 9.92632
tiled_avx 64 33680.5 6.79165
tiled_avx_unrolled 64 127936 25.7981
saxpy_avx 128 5871.02 9.47109
tiled_avx 128 4210.07 6.79166
tiled_avx_unrolled 128 15997.6 25.8072
saxpy_avx 192 1603.84 8.73214
tiled_avx 192 1203.33 6.55159
tiled_avx_unrolled 192 4383.09 23.8638
saxpy_avx 256 633.595 8.17689
tiled_avx 256 626.157 8.0809
tiled_avx_unrolled 256 1792.52 23.1335
saxpy_avx 320 284.161 7.1626
tiled_avx 320 323.197 8.14656
tiled_avx_unrolled 320 935.571 23.5822
saxpy_avx 384 161.517 7.03508
tiled_avx 384 188.215 8.19794
tiled_avx_unrolled 384 543.235 23.6613
saxpy_avx 448 99.1987 6.86115
tiled_avx 448 118.588 8.2022
tiled_avx_unrolled 448 314 21.718
saxpy_avx 512 70.0296 7.23017
tiled_avx 512 73.2019 7.55769
tiled_avx_unrolled 512 197.815 20.4233
saxpy_avx 576 46.1944 6.79068
tiled_avx 576 50.6315 7.44294
tiled_avx_unrolled 576 126.045 18.5289
saxpy_avx 640 33.8209 6.81996
tiled_avx 640 37.0288 7.46682
tiled_avx_unrolled 640 92.784 18.7098
saxpy_avx 704 24.9096 6.68561
tiled_avx 704 27.7543 7.44912
tiled_avx_unrolled 704 69.0399 18.53
saxpy_avx 768 19.5158 6.80027
tiled_avx 768 21.532 7.50282
tiled_avx_unrolled 768 54.1763 18.8777
saxpy_avx 832 12.8635 5.69882
tiled_avx 832 14.6666 6.49766
tiled_avx_unrolled 832 37.9592 16.8168
saxpy_avx 896 12.0526 6.66899
tiled_avx 896 13.3799 7.40346
tiled_avx_unrolled 896 34.0838 18.8595
saxpy_avx 960 8.97193 6.10599
tiled_avx 960 10.1052 6.87725
tiled_avx_unrolled 960 21.0263 14.3098
saxpy_avx 1024 6.73081 5.55935
tiled_avx 1024 7.21214 5.9569
tiled_avx_unrolled 1024 12.7768 10.5531

### Can we do better in Java?

Writing genuinely fast code gives an indication of how little of the processor Java actually utilises, but is it possible to bring this knowledge over to Java? The `saxpy` based implementations in my previous post performed well for small to medium sized matrices. Once the matrices grow, however, they become too big to be allowed to pass through cache multiple times: we need hot, small cached data to be replenished from the larger matrix. Ideally we wouldn’t need to make any copies, but it seems that the autovectoriser doesn’t like offsets: `System.arraycopy` is a reasonably fast compromise. The basic sequential read pattern is validated: even native code requiring a gather does not perform well for this problem. The best effort C++ code translates almost verbatim into this Java code, which is quite fast for large matrices.

``````
public void tiled(float[] a, float[] b, float[] c, int n) {
final int bufferSize = 512;
final int width = Math.min(n, bufferSize);
final int height = Math.min(n, n >= 512 ? 8 : n >= 256 ? 16 : 32);
float[] sum = new float[bufferSize];
float[] vector = new float[bufferSize];
for (int rowOffset = 0; rowOffset < n; rowOffset += height) {
for (int columnOffset = 0; columnOffset < n; columnOffset += width) {
for (int i = 0; i < n; ++i) {
for (int j = columnOffset; j < columnOffset + width && j < n; j += width) {
int stride = Math.min(n - columnOffset, bufferSize);
// copy to give autovectorisation a hint
System.arraycopy(c, i * n + j, sum, 0, stride);
for (int k = rowOffset; k < rowOffset + height && k < n; ++k) {
float multiplier = a[i * n + k];
System.arraycopy(b, k * n  + j, vector, 0, stride);
for (int l = 0; l < stride; ++l) {
sum[l] = Math.fma(multiplier, vector[l], sum[l]);
}
}
System.arraycopy(sum, 0, c, i * n + j, stride);
}
}
}
}
}
``````

Benchmarking it using the same harness used in the previous post, the performance is ~10% higher for large arrays than my previous best effort. Still, the reality is that this is too slow to be useful. If you need to do linear algebra, use C/C++ for the time being!

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size flops/cycle
fastBuffered thrpt 1 10 53.331195 0.270526 ops/s 448 3.688688696
fastBuffered thrpt 1 10 34.365765 0.16641 ops/s 512 3.548072999
fastBuffered thrpt 1 10 26.128264 0.239719 ops/s 576 3.840914622
fastBuffered thrpt 1 10 19.044509 0.139197 ops/s 640 3.84031059
fastBuffered thrpt 1 10 14.312154 1.045093 ops/s 704 3.841312378
fastBuffered thrpt 1 10 7.772745 0.074598 ops/s 768 2.708411991
fastBuffered thrpt 1 10 6.276182 0.067338 ops/s 832 2.780495238
fastBuffered thrpt 1 10 4.8784 0.067368 ops/s 896 2.699343067
fastBuffered thrpt 1 10 4.068907 0.038677 ops/s 960 2.769160387
fastBuffered thrpt 1 10 2.568101 0.108612 ops/s 1024 2.121136502
tiled thrpt 1 10 56.495366 0.584872 ops/s 448 3.907540754
tiled thrpt 1 10 30.884954 3.221017 ops/s 512 3.188698735
tiled thrpt 1 10 15.580581 0.412654 ops/s 576 2.290381075
tiled thrpt 1 10 9.178969 0.841178 ops/s 640 1.850932038
tiled thrpt 1 10 12.229763 0.350233 ops/s 704 3.282408783
tiled thrpt 1 10 9.371032 0.330742 ops/s 768 3.265334889
tiled thrpt 1 10 7.727068 0.277969 ops/s 832 3.423271628
tiled thrpt 1 10 6.076451 0.30305 ops/s 896 3.362255222
tiled thrpt 1 10 4.916811 0.2823 ops/s 960 3.346215151
tiled thrpt 1 10 3.722623 0.26486 ops/s 1024 3.074720008

# Sum of Squares

Streams and lambdas, especially the limited support offered for primitive types, are a fantastic addition to the Java language. They’re not supposed to be fast, but how do these features compare to a good old `for` loop? For a simple calculation amenable to instruction level parallelism, I compare modern and traditional implementations and observe the differences in instructions generated.

### Sum of Squares

The sum of squares is the building block of a linear regression analysis so is ubiquitous in statistical computing. It is associative and therefore data-parallel. I compare four implementations: a sequential stream wrapping an array, a parallel stream wrapping an array, a generative sequential stream and a traditional `for` loop. The benchmark code is on github.

``````
@Param({"1024", "8192"})
int size;

private double[] data;

@Setup(Level.Iteration)
public void init() {
this.data = createDoubleArray(size);
}

@Benchmark
public double SS_SequentialStream() {
return DoubleStream.of(data)
.map(x -> x * x)
.reduce((x, y) -> x + y)
.orElse(0D);
}

@Benchmark
public double SS_ParallelStream() {
return DoubleStream.of(data)
.parallel()
.map(x -> x * x)
.reduce((x, y) -> x + y)
.orElse(0);
}

@Benchmark
public double SS_ForLoop() {
double result = 0D;
for (int i = 0; i < data.length; ++i) {
result += data[i] * data[i];
}
return result;
}

@Benchmark
public double SS_GenerativeSequentialStream() {
return IntStream.iterate(0, i -> i < size, i -> i + 1)
.mapToDouble(i -> data[i])
.map(x -> x * x)
.reduce((x, y) -> x + y)
.orElse(0);
}
``````

I must admit I prefer the readability of the stream versions, but let’s see if there is a comedown after the syntactic sugar rush.

### Running a Benchmark

I compare the four implementations on an array of one million doubles. I am using `JDK 9.0.1, VM 9.0.1+11` on a fairly powerful laptop with 8 processors:

```\$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
```

Before running the benchmark we might expect the `for` loop and stream to have similar performance, and the parallel version to be about eight times faster (though remember that the arrays aren’t too big). The generative version is very similar to the `for` loop so a slow down might not be expected.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
SS_ForLoop thrpt 1 10 258351.774491 39797.567968 ops/s 1024
SS_ForLoop thrpt 1 10 29463.408428 4814.826388 ops/s 8192
SS_GenerativeSequentialStream thrpt 1 10 219699.607567 9095.569546 ops/s 1024
SS_GenerativeSequentialStream thrpt 1 10 28351.900454 828.513989 ops/s 8192
SS_ParallelStream thrpt 1 10 22827.821827 2826.577213 ops/s 1024
SS_ParallelStream thrpt 1 10 23230.623610 273.415352 ops/s 8192
SS_SequentialStream thrpt 1 10 225431.985145 9051.538442 ops/s 1024
SS_SequentialStream thrpt 1 10 29123.734157 1333.721437 ops/s 8192

The `for` loop and stream are similar. The parallel version is a long way behind (yes that’s right: more threads less power), but exhibits constant scaling (incidentally, a measurement like this is a good way to guess the minimum unit of work in a parallelised implementation). If the data is large it could become profitable to use it. The generative stream is surprisingly good, almost as good as the version that wraps the array, though there is a fail-safe way to slow it down: add a limit clause to the method chain (try it…).

Profiling with perfasm, it is clear that the `for` loop body is being vectorised, but only the loads and multiplications are done in parallel – the complicated string of SSE instructions is the reduction, which must be done in order.

```<-- unrolled load -->
0.01%    0x00000243d8969170: vmovdqu ymm1,ymmword ptr [r11+r8*8+0f0h]
0.07%    0x00000243d896917a: vmovdqu ymm2,ymmword ptr [r11+r8*8+0d0h]
0.75%    0x00000243d8969184: vmovdqu ymm3,ymmword ptr [r11+r8*8+0b0h]
0.01%    0x00000243d896918e: vmovdqu ymm4,ymmword ptr [r11+r8*8+90h]
0.02%    0x00000243d8969198: vmovdqu ymm5,ymmword ptr [r11+r8*8+70h]
0.03%    0x00000243d896919f: vmovdqu ymm6,ymmword ptr [r11+r8*8+50h]
0.77%    0x00000243d89691a6: vmovdqu ymm10,ymmword ptr [r11+r8*8+30h]
0.02%    0x00000243d89691ad: vmovdqu ymm7,ymmword ptr [r11+r8*8+10h]
<-- multiplication starts -->
0.01%    0x00000243d89691b4: vmulpd  ymm1,ymm1,ymm1
0.02%    0x00000243d89691b8: vmovdqu ymmword ptr [rsp+28h],ymm1
0.76%    0x00000243d89691be: vmulpd  ymm15,ymm7,ymm7
0.00%    0x00000243d89691c2: vmulpd  ymm12,ymm2,ymm2
0.01%    0x00000243d89691c6: vmulpd  ymm7,ymm3,ymm3
0.02%    0x00000243d89691ca: vmulpd  ymm8,ymm4,ymm4
0.72%    0x00000243d89691ce: vmulpd  ymm9,ymm5,ymm5
0.00%    0x00000243d89691d2: vmulpd  ymm11,ymm6,ymm6
0.01%    0x00000243d89691d6: vmulpd  ymm13,ymm10,ymm10
<-- multiplication ends here, scalar reduction starts -->
0.72%    0x00000243d89691e0: vpshufd xmm5,xmm15,0eh
2.14%    0x00000243d89691ea: vextractf128 xmm6,ymm15,1h
3.21%    0x00000243d89691f4: vpshufd xmm5,xmm6,0eh
2.82%    0x00000243d8969202: vpshufd xmm5,xmm13,0eh
2.87%    0x00000243d896920c: vextractf128 xmm6,ymm13,1h
3.03%    0x00000243d8969216: vpshufd xmm5,xmm6,0eh
2.70%    0x00000243d8969224: vpshufd xmm5,xmm11,0eh
2.98%    0x00000243d896922e: vextractf128 xmm6,ymm11,1h
3.11%    0x00000243d8969238: vpshufd xmm5,xmm6,0eh
2.61%    0x00000243d8969246: vpshufd xmm5,xmm9,0eh
2.89%    0x00000243d8969250: vextractf128 xmm6,ymm9,1h
3.13%    0x00000243d896925a: vpshufd xmm5,xmm6,0eh
2.83%    0x00000243d8969268: vpshufd xmm4,xmm8,0eh
3.00%    0x00000243d8969272: vextractf128 xmm10,ymm8,1h
3.13%    0x00000243d896927d: vpshufd xmm4,xmm10,0eh
2.95%    0x00000243d896928b: vpshufd xmm1,xmm7,0eh
3.06%    0x00000243d8969294: vextractf128 xmm2,ymm7,1h
3.07%    0x00000243d896929e: vpshufd xmm1,xmm2,0eh
2.92%    0x00000243d89692ac: vpshufd xmm3,xmm12,0eh
3.11%    0x00000243d89692b6: vextractf128 xmm1,ymm12,1h
3.02%    0x00000243d89692c0: vpshufd xmm3,xmm1,0eh
2.97%    0x00000243d89692c9: vmovdqu ymm1,ymmword ptr [rsp+28h]
3.05%    0x00000243d89692d3: vpshufd xmm2,xmm1,0eh
2.97%    0x00000243d89692dc: vextractf128 xmm14,ymm1,1h
2.99%    0x00000243d89692e7: vpshufd xmm2,xmm14,0eh
```

The sequential stream code is not as good – it is scalar – but the difference in performance is not as stark as it might be because of the inefficient scalar reduction in the `for` loop: this is JLS floating point semantics twisting C2’s arm behind its back.

```  0.00%    0x0000021a1df54c24: vmovsd  xmm0,qword ptr [rbx+r9*8+48h]
0.00%    0x0000021a1df54c2b: vmovsd  xmm2,qword ptr [rbx+r9*8+18h]
0.02%    0x0000021a1df54c32: vmovsd  xmm3,qword ptr [rbx+r9*8+40h]
2.93%    0x0000021a1df54c39: vmovsd  xmm4,qword ptr [rbx+r9*8+38h]
0.00%    0x0000021a1df54c40: vmovsd  xmm5,qword ptr [rbx+r9*8+30h]
0.01%    0x0000021a1df54c47: vmovsd  xmm6,qword ptr [rbx+r9*8+28h]
0.02%    0x0000021a1df54c4e: vmovsd  xmm7,qword ptr [rbx+r9*8+20h]
2.99%    0x0000021a1df54c55: vmulsd  xmm8,xmm0,xmm0
0.00%    0x0000021a1df54c59: vmulsd  xmm0,xmm7,xmm7
0x0000021a1df54c5d: vmulsd  xmm6,xmm6,xmm6
0.01%    0x0000021a1df54c61: vmulsd  xmm5,xmm5,xmm5
2.91%    0x0000021a1df54c65: vmulsd  xmm4,xmm4,xmm4
0.00%    0x0000021a1df54c69: vmulsd  xmm3,xmm3,xmm3
0.00%    0x0000021a1df54c6d: vmulsd  xmm2,xmm2,xmm2
```

The same code can be seen in `SS_ParallelStream`. `SS_GenerativeSequentialStream` is much more interesting because it hasn’t been unrolled – see the interleaved control statements. It is also not vectorised.

```           0x0000013c1a639c17: vmovsd  xmm0,qword ptr [rbp+r9*8+10h]
0.01%    0x0000013c1a639c1e: vmulsd  xmm2,xmm0,xmm0
0.01%    0x0000013c1a639c22: test    r8d,r8d
0x0000013c1a639c25: jne     13c1a639e09h
0x0000013c1a639c2b: mov     r10d,dword ptr [r12+rax*8+8h]
0x0000013c1a639c30: cmp     r10d,0f8022d85h
0x0000013c1a639c37: jne     13c1a639e3bh
0.01%    0x0000013c1a639c41: vmovsd  qword ptr [rdi+10h],xmm2
0.00%    0x0000013c1a639c46: movsxd  r10,r9d
0x0000013c1a639c49: vmovsd  xmm0,qword ptr [rbp+r10*8+18h]
0.01%    0x0000013c1a639c50: vmulsd  xmm0,xmm0,xmm0
0.01%    0x0000013c1a639c54: mov     r10d,dword ptr [r12+rax*8+8h]
0.00%    0x0000013c1a639c59: cmp     r10d,0f8022d85h
0x0000013c1a639c60: jne     13c1a639e30h
0.02%    0x0000013c1a639c6a: vmovsd  qword ptr [rdi+10h],xmm0
0.02%    0x0000013c1a639c6f: mov     r10d,r9d
0x0000013c1a639c76: cmp     r10d,r11d
0x0000013c1a639c79: jnl     13c1a639d96h
0.02%    0x0000013c1a639c83: vmovsd  xmm1,qword ptr [rbp+r10*8+10h]
0.00%    0x0000013c1a639c8a: movzx   r8d,byte ptr [rdi+0ch]
0.00%    0x0000013c1a639c8f: vmulsd  xmm1,xmm1,xmm1
0.01%    0x0000013c1a639c93: test    r8d,r8d
0x0000013c1a639c96: jne     13c1a639dfbh
0.01%    0x0000013c1a639ca0: vmovsd  qword ptr [rdi+10h],xmm1
0.02%    0x0000013c1a639ca5: movsxd  r8,r10d
0.00%    0x0000013c1a639ca8: vmovsd  xmm0,qword ptr [rbp+r8*8+18h]
0x0000013c1a639caf: vmulsd  xmm0,xmm0,xmm0
0.06%    0x0000013c1a639cb7: vmovsd  qword ptr [rdi+10h],xmm0
```

So it looks like streams don’t vectorise like good old `for` loops, and you won’t gain from `Stream.parallelStream` unless you have humungous arrays (which you might be avoiding for other reasons). This was actually a very nice case for the `Stream` because optimal code can’t be generated for floating point reductions. What happens with sum of squares for `int`s? Generating data in an unsurprising way:

``````
@Benchmark
public int SS_SequentialStream_Int() {
return IntStream.of(intData)
.map(x -> x * x)
.reduce((x, y) -> x + y)
.orElse(0);
}

@Benchmark
public int SS_ParallelStream_Int() {
return IntStream.of(intData)
.parallel()
.map(x -> x * x)
.reduce((x, y) -> x + y)
.orElse(0);
}

@Benchmark
public int SS_ForLoop_Int() {
int result = 0;
for (int i = 0; i < intData.length; ++i) {
result += intData[i] * intData[i];
}
return result;
}

@Benchmark
public int SS_GenerativeSequentialStream_Int() {
return IntStream.iterate(0, i -> i < size, i -> i + 1)
.map(i -> intData[i])
.map(x -> x * x)
.reduce((x, y) -> x + y)
.orElse(0);
}
``````

The landscape has completely changed, thanks to the exploitation of associative arithmetic and the `VPHADDD` instruction which simplifies the reduction in the for loop.

```<-- load -->
0.00%    0x000001f5cdd8cd30: vmovdqu ymm0,ymmword ptr [rdi+r10*4+0f0h]
1.93%    0x000001f5cdd8cd3a: vmovdqu ymm1,ymmword ptr [rdi+r10*4+0d0h]
0.10%    0x000001f5cdd8cd44: vmovdqu ymm2,ymmword ptr [rdi+r10*4+0b0h]
0.07%    0x000001f5cdd8cd4e: vmovdqu ymm3,ymmword ptr [rdi+r10*4+90h]
0.05%    0x000001f5cdd8cd58: vmovdqu ymm4,ymmword ptr [rdi+r10*4+70h]
1.75%    0x000001f5cdd8cd5f: vmovdqu ymm5,ymmword ptr [rdi+r10*4+50h]
0.08%    0x000001f5cdd8cd66: vmovdqu ymm6,ymmword ptr [rdi+r10*4+30h]
0.07%    0x000001f5cdd8cd6d: vmovdqu ymm7,ymmword ptr [rdi+r10*4+10h]
<-- multiply -->
0.01%    0x000001f5cdd8cd74: vpmulld ymm0,ymm0,ymm0
1.81%    0x000001f5cdd8cd79: vmovdqu ymmword ptr [rsp+28h],ymm0
0.02%    0x000001f5cdd8cd7f: vpmulld ymm15,ymm7,ymm7
1.79%    0x000001f5cdd8cd84: vpmulld ymm11,ymm1,ymm1
0.06%    0x000001f5cdd8cd89: vpmulld ymm8,ymm2,ymm2
1.82%    0x000001f5cdd8cd8e: vpmulld ymm9,ymm3,ymm3
0.06%    0x000001f5cdd8cd93: vpmulld ymm10,ymm4,ymm4
1.79%    0x000001f5cdd8cd98: vpmulld ymm12,ymm5,ymm5
0.08%    0x000001f5cdd8cd9d: vpmulld ymm6,ymm6,ymm6
<-- vectorised reduce -->
1.85%    0x000001f5cdd8cdac: vextracti128 xmm7,ymm4,1h
1.78%    0x000001f5cdd8cdb6: vmovd   xmm7,r8d
0.11%    0x000001f5cdd8cdbf: vmovd   r11d,xmm7
5.43%    0x000001f5cdd8cdce: vextracti128 xmm7,ymm4,1h
4.34%    0x000001f5cdd8cdd8: vmovd   xmm7,r11d
1.40%    0x000001f5cdd8cde1: vmovd   r8d,xmm7
3.25%    0x000001f5cdd8cdf0: vextracti128 xmm4,ymm6,1h
6.36%    0x000001f5cdd8cdfa: vmovd   xmm4,r8d
1.69%    0x000001f5cdd8ce03: vmovd   r8d,xmm4
0.10%    0x000001f5cdd8ce12: vextracti128 xmm7,ymm4,1h
0.72%    0x000001f5cdd8ce1c: vmovd   xmm7,r8d
4.42%    0x000001f5cdd8ce25: vmovd   r11d,xmm7
0.12%    0x000001f5cdd8ce34: vextracti128 xmm1,ymm5,1h
0.22%    0x000001f5cdd8ce3e: vmovd   xmm1,r11d
3.81%    0x000001f5cdd8ce47: vmovd   r11d,xmm1
0.22%    0x000001f5cdd8ce56: vextracti128 xmm3,ymm0,1h
0.36%    0x000001f5cdd8ce60: vmovd   xmm3,r11d
4.55%    0x000001f5cdd8ce69: vmovd   r8d,xmm3
0.09%    0x000001f5cdd8ce78: vextracti128 xmm1,ymm2,1h
1.57%    0x000001f5cdd8ce82: vmovd   xmm1,r8d
4.84%    0x000001f5cdd8ce8b: vmovd   r11d,xmm1
0.06%    0x000001f5cdd8ce90: vmovdqu ymm0,ymmword ptr [rsp+28h]
2.16%    0x000001f5cdd8cea0: vextracti128 xmm14,ymm13,1h
0.09%    0x000001f5cdd8ceab: vmovd   xmm14,r11d
```

If you’re the guy replacing all the `for` loops with streams because it’s 2018, you may be committing performance vandalism! That nice declarative API (as opposed to language feature) is at arms length and it really isn’t well optimised yet.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
SS_ForLoop_Int thrpt 1 10 1021725.018981 74264.883362 ops/s 1024
SS_ForLoop_Int thrpt 1 10 129250.855026 5764.608094 ops/s 8192
SS_GenerativeSequentialStream_Int thrpt 1 10 55069.227826 1111.903102 ops/s 1024
SS_GenerativeSequentialStream_Int thrpt 1 10 6769.176830 684.970867 ops/s 8192
SS_ParallelStream_Int thrpt 1 10 20970.387258 719.846643 ops/s 1024
SS_ParallelStream_Int thrpt 1 10 19621.397202 1514.374286 ops/s 8192
SS_SequentialStream_Int thrpt 1 10 586847.001223 22390.512706 ops/s 1024
SS_SequentialStream_Int thrpt 1 10 87620.959677 3437.083075 ops/s 8192

Parallel streams might not be the best thing to reach for.

# The Much Aligned Garbage Collector

A power of two is often a good choice for the size of an array. Sometimes you might see this being exploited to replace an integer division with a bitwise intersection. You can see why with a toy benchmark of a bloom filter, which deliberately folds in a representative cost of a hash function and array access to highlight the significance of the differential cost of the division mechanism to a method that does real work:

``````@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class BloomFilter {

private long[] bitset;

@Param({"1000", "1024"})
int size;

@Setup(Level.Trial)
public void init() {
bitset = DataUtil.createLongArray(size);
}

@Benchmark
public boolean containsAnd() {
int hash = hash();
int pos = hash & (size - 1);
return (bitset[pos >>> 6] & (1L << pos)) != 0;
}

@Benchmark
public boolean containsAbsMod() {
int hash = hash();
int pos = Math.abs(hash % size);
return (bitset[pos >>> 6] & (1L << pos)) != 0;
}

private int hash() {
return ThreadLocalRandom.current().nextInt(); // a stand in for a hash function;
}
}
``````

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
containsAbsMod thrpt 1 10 104.063744 4.068283 ops/us 1000
containsAbsMod thrpt 1 10 103.849577 4.991040 ops/us 1024
containsAnd thrpt 1 10 161.917397 3.807912 ops/us 1024

Disregarding the case which produces an incorrect result, you can do two thirds as many lookups again in the same period of time if you just use a 1024 element bloom filter. Note that the compiler clearly won’t magically transform cases like `AbsMod 1024`; you need to do this yourself. You can readily see this property exploited in any open source bit set, hash set, or bloom filter you care to look at. This is boring, at least, we often get this right by accident. What is quite interesting is a multiplicative decrease in throughput of DAXPY as a result of this same choice of lengths:

``````@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class DAXPYAlignment {

@Param({"250", "256", "1000", "1024"})
int size;

double s;
double[] a;
double[] b;

@Setup(Level.Trial)
public void init() {
a = createDoubleArray(size);
b = createDoubleArray(size);
}

@Benchmark
public void daxpy(Blackhole bh) {
for (int i = 0; i < a.length; ++i) {
a[i] += s * b[i];
}
bh.consume(a);
}
}
``````

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
daxpy thrpt 1 10 23.499857 0.891309 ops/us 250
daxpy thrpt 1 10 22.425412 0.989512 ops/us 256
daxpy thrpt 1 10 2.420674 0.098991 ops/us 1000
daxpy thrpt 1 10 6.263005 0.175048 ops/us 1024

1000 and 1024 are somehow very different, yet 250 and 256 are almost equivalent. This will either shock you, or you will guess that my page size is 4KB. My page size is indeed 4KB (which is a power of 2 precisely because of the prohibitive cost of integer division). The 250 element array only needs 2000B of contiguous space, so the next array allocated will reside in the same page, assuming the start of the array is at the start of the page. On the other hand, the 1000 element array takes up 8000B, that’s 1.95 pages. If the second array starts immediately after this array, the array will start in the 5% remaining in the current page, take up another page, and then 90% of yet another. Is this a big deal? Perhaps, when a page is accessed for the first time, it needs to be looked up in the page table, and is then cached in the TLB. Access to the cache is fast, whereas accessing the page table is slow. That extra page access costs something, but can it really cost this much, and is this all that’s going on? Since these measurements are made on an Intel Skylake processor, they will also be affected by 4K aliasing. Let’s allocate an array in between the two we want to loop over, to vary the offsets:

``````  @Param({"0", "6", "12", "18", "24"})
int offset;

double s;
double[] a;
double[] b;

@Setup(Level.Trial)
public void init() {
a = createDoubleArray(size);
b = createDoubleArray(size);
}
``````

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: offset Param: size
daxpy thrpt 1 10 2.224875 0.247778 ops/us 0 1000
daxpy thrpt 1 10 6.159791 0.441525 ops/us 0 1024
daxpy thrpt 1 10 2.350425 0.136992 ops/us 6 1000
daxpy thrpt 1 10 6.047009 0.360723 ops/us 6 1024
daxpy thrpt 1 10 3.332370 0.253739 ops/us 12 1000
daxpy thrpt 1 10 6.506141 0.155733 ops/us 12 1024
daxpy thrpt 1 10 6.621031 0.345151 ops/us 18 1000
daxpy thrpt 1 10 6.827635 0.970527 ops/us 18 1024
daxpy thrpt 1 10 7.456584 0.214229 ops/us 24 1000
daxpy thrpt 1 10 7.451441 0.104871 ops/us 24 1024

The pattern is curious (pay attention to the offset parameter) – the ratio of the throughputs for each size ranging from 3x throughput degradation through to parity:

The loop in question is vectorised, which can be disabled by setting `-XX:-UseSuperWord`. Doing so is revealing, because the trend is still present but it is dampened to the extent it could be waved away as noise:

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: offset Param: size
daxpy thrpt 1 10 1.416452 0.079905 ops/us 0 1000
daxpy thrpt 1 10 1.806841 0.200231 ops/us 0 1024
daxpy thrpt 1 10 1.408526 0.085147 ops/us 6 1000
daxpy thrpt 1 10 1.921026 0.049655 ops/us 6 1024
daxpy thrpt 1 10 1.459186 0.076427 ops/us 12 1000
daxpy thrpt 1 10 1.809220 0.199885 ops/us 12 1024
daxpy thrpt 1 10 1.824435 0.169680 ops/us 18 1000
daxpy thrpt 1 10 1.842230 0.204414 ops/us 18 1024
daxpy thrpt 1 10 1.934717 0.229822 ops/us 24 1000
daxpy thrpt 1 10 1.964316 0.039893 ops/us 24 1024

The point is, you may not have cared about alignment much before because it’s unlikely you would have noticed unless you were really looking for it. Decent autovectorisation seems to raise the stakes enormously.

### Analysis with Perfasm

It’s impossible to know for sure what the cause of this behaviour is without profiling. Since I observed this effect on my Windows development laptop, I use xperf via `WinPerfAsmProfiler`, which is part of JMH.

I did some instruction profiling. The same code is going to get generated in each case, with a preloop, main loop and post loop, but by looking at the sampled instruction frequency we can see what’s taking the most time in the vectorised main loop. From now on, superword parallelism is never disabled. The full output of this run can be seen at github. Here is the main loop for size=1024, offset=0, which is unrolled, spending most time loading and storing data (`vmovdqu`) but spending a decent amount of time in the multiplication:

```  0.18%    0x0000020dddc5af90: vmovdqu ymm0,ymmword ptr [r10+r8*8+10h]
9.27%    0x0000020dddc5af97: vmulpd  ymm0,ymm0,ymm2
0.22%    0x0000020dddc5af9b: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+10h]
7.48%    0x0000020dddc5afa2: vmovdqu ymmword ptr [r11+r8*8+10h],ymm0
10.16%    0x0000020dddc5afa9: vmovdqu ymm0,ymmword ptr [r10+r8*8+30h]
0.09%    0x0000020dddc5afb0: vmulpd  ymm0,ymm0,ymm2
3.62%    0x0000020dddc5afb4: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+30h]
10.60%    0x0000020dddc5afbb: vmovdqu ymmword ptr [r11+r8*8+30h],ymm0
0.26%    0x0000020dddc5afc2: vmovdqu ymm0,ymmword ptr [r10+r8*8+50h]
3.76%    0x0000020dddc5afc9: vmulpd  ymm0,ymm0,ymm2
0.20%    0x0000020dddc5afcd: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+50h]
13.23%    0x0000020dddc5afd4: vmovdqu ymmword ptr [r11+r8*8+50h],ymm0
9.46%    0x0000020dddc5afdb: vmovdqu ymm0,ymmword ptr [r10+r8*8+70h]
0.11%    0x0000020dddc5afe2: vmulpd  ymm0,ymm0,ymm2
4.63%    0x0000020dddc5afe6: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+70h]
9.78%    0x0000020dddc5afed: vmovdqu ymmword ptr [r11+r8*8+70h],ymm0
```

In the worst performer (size=1000, offset=0) a lot more time is spent on the stores, a much smaller fraction of observed instructions are involved with multiplication or addition. This indicates either a measurement bias (perhaps there’s some mechanism that makes a store/load easier to observe) or an increase in load/store cost.

```  0.24%    0x000002d1a946f510: vmovdqu ymm0,ymmword ptr [r10+r8*8+10h]
3.61%    0x000002d1a946f517: vmulpd  ymm0,ymm0,ymm2
4.63%    0x000002d1a946f51b: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+10h]
9.73%    0x000002d1a946f522: vmovdqu ymmword ptr [r11+r8*8+10h],ymm0
4.34%    0x000002d1a946f529: vmovdqu ymm0,ymmword ptr [r10+r8*8+30h]
2.13%    0x000002d1a946f530: vmulpd  ymm0,ymm0,ymm2
7.77%    0x000002d1a946f534: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+30h]
13.46%    0x000002d1a946f53b: vmovdqu ymmword ptr [r11+r8*8+30h],ymm0
3.37%    0x000002d1a946f542: vmovdqu ymm0,ymmword ptr [r10+r8*8+50h]
0.47%    0x000002d1a946f549: vmulpd  ymm0,ymm0,ymm2
1.47%    0x000002d1a946f54d: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+50h]
13.00%    0x000002d1a946f554: vmovdqu ymmword ptr [r11+r8*8+50h],ymm0
4.24%    0x000002d1a946f55b: vmovdqu ymm0,ymmword ptr [r10+r8*8+70h]
2.40%    0x000002d1a946f562: vmulpd  ymm0,ymm0,ymm2
8.92%    0x000002d1a946f566: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+70h]
14.10%    0x000002d1a946f56d: vmovdqu ymmword ptr [r11+r8*8+70h],ymm0
```

This trend can be seen to generally improve as 1024 is approached from below, and do bear in mind that this is a noisy measure. Interpret the numbers below as probabilities: were you to stop the execution of daxpy at random, at offset zero, you would have a 94% chance of finding yourself within the main vectorised loop. You would have a 50% chance of observing a store, and only 31% chance of observing a multiply or add. As we get further from 1024, the stores dominate the main loop, and the main loop comes to dominate the method. Again, this is approximate. When the arrays aren’t well aligned, we spend less time loading, less time multiplying and adding, and much more time storing.

classification offset = 0 offset = 6 offset = 12 offset = 18 offset = 24
add 22.79 21.46 15.41 7.77 8.03
load 12.19 11.95 15.55 21.9 21.19
multiply 8.61 7.7 9.54 13.15 8.33
store 50.29 51.3 49.16 42.34 44.56
main loop 93.88 92.41 89.66 85.16 82.11

I stop short of counting TLB misses because this is on Windows and it would be a complete pain in the arse to capture. The truth is, I still don’t have enough information to say what the cause is, but I know the stores are getting more expensive, and that I would need to be quite unlucky to have put these two arrays next to each other. I may update this post with a Linux measurement where it’s much easier to profile hardware events with perf.

After discussing the post on Twitter (see here and here), Vsevolod and Aleksey Shipilёv correctly attributed this performance bug to 4K aliasing. The effect observed here is also a contributing factor to fluctuations in throughput observed in JDK-8150730.

### Garbage Collection

Is it necessary to make sure all arrays are of a size equal to a power of two and aligned with pages? In this microbenchmark, it’s easy to arrange that, for typical developers this probably isn’t feasible (which isn’t to say there aren’t people out there who do this). Fortunately, this isn’t necessary for most use cases. True to the title, this post has something to do with garbage collection. The arrays were allocated in order, and no garbage would be produced during the benchmarks, so the second array will be split across pages. Let’s put some code into the initialisation of the benchmark bound to trigger garbage collection:

``````  String acc = "";

@Setup(Level.Trial)
public void init() {
a = createDoubleArray(size);
b = createDoubleArray(size);
// don't do this in production
for (int i = 0; i < 10000; ++i) {
acc += UUID.randomUUID().toString();
}
}
``````

A miracle occurs: the code speeds up!

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
daxpy thrpt 1 10 6.854161 0.261247 ops/us 1000
daxpy thrpt 1 10 6.328602 0.163391 ops/us 1024

Why? G1 has rearranged the heap and that second array probably isn’t straddling three pages any more, and is unlikely to remain in such an unlucky position relative to the first array. Future garbage collectors are expected to be even better at doing this than G1. This makes the cost of garbage collection difficult to quantify, because if it takes resources with one hand it gives them back with another.

The benchmark code is available at github.