# Collecting Rocks and Benchmarks

As long as I can remember, I have been interested in rocks, I have hundreds of them in storage. Rocks are interesting because they hold little clues about processes nobody has ever seen happen. For instance, one of the first rocks I ever took an interest in was a smooth granite pebble, which I collected on a walk when I was six. Let’s be honest, as a six year old, it was the pattern on the surface of the rock that caught my eye, but this truly was a fascinating rock because it shouldn’t have been smooth and it shouldn’t have been in England. It probably came from Norway, and while it’s possible that a Norwegian brought the rock to England, it’s highly unlikely.

The best possible explanation of the rock’s existence in that point in space and time was that there was once a glacier covering Norway, the North Sea, and Northern England, moving in that direction. The glacier carried the pebble, and many others like it, and dumped it as glacial outwash in the English Midlands. Nobody alive saw this happen, but if this happened, Scotland (relieved of the weight of an ice sheet) should still be rising in altitude, and there should be clay in the Midlands but not in the South. It’s likely that there was an ice age, not just because I found a granite pebble, but because Scotland is rising, and the English Midlands are covered in clay. You may lack the tools, funds, or time to do so, but you can apply this process to virtually anything to figure out how something happened or works.

If you’re an application developer, as I am, it’s highly unlikely that you wrote the platform you use, so you probably don’t really understand it. The vast majority of the platform you use was written by other people over what might as well be geological timescales. Benchmarks are a lot like rocks in that they can reveal implementation details you aren’t otherwise party to, and, with the right trade offs, you may be able to harness these in your applications. Collecting rocks doesn’t make a geologist, and I think you need to be quite inquisitive for benchmarking to work out, because you need to seek corroborating evidence: just as the presence of a pebble is scant support for an ice age, a single benchmark score doesn’t say much about anything.

If you’re interested in the JVM, I think instruction profiling is essential because it gives so much away. For instance, you may not appreciate the significance of choice of garbage collector, but you’ll see lots of strange instructions in some benchmarks if you profile them, and you may have the curiosity to ask what they do and what put them there. If you don’t do it, you won’t really know the boundaries of validity of whatever your observation is. You could find that changing your garbage collector invalidates your findings.

Partly because I want the information on this site to be basically correct, but also to illustrate how conclusions can be jumped to because a benchmark score confirms a personal bias, I’ll look again at a couple of very superficial benchmarks I wrote about last year, to do with polynomial hash codes. The measurements were probably fine, but I didn’t really get to the bottom of the issue and I could easily have found a faster implementation if I had looked a bit harder.

### Polynomial Hash Codes

The Java String.hashCode and Arrays.hashCode methods are implemented as the dot product of the data and a vector of powers of 31.

    // Arrays.java
public static int hashCode(int a[]) {
if (a == null)
return 0;

int result = 1;
for (int element : a)
result = 31 * result + element;

return result;
}


Since String caches its hash code, you’d be hard pressed to make use of the following optimisations, but I found I learned something about what C2 does and doesn’t do by attempting to optimise it. First of all, what does the built in hash code actually do? You can only really find out with instruction profiling, which you can enable in JMH with -prof perfasm on Linux or -prof xperfasm on Windows.

The bulk of the sampled instructions are in this block below:


2.37%    0x000002720f204af6: mov     r10d,edi
1.72%    0x000002720f204af9: shl     r10d,5h
2.06%    0x000002720f204afd: sub     r10d,edi
3.77%    0x000002720f204b00: add     r10d,dword ptr [rax+rbx*4+14h]
3.93%    0x000002720f204b05: mov     r8d,r10d
0.01%    0x000002720f204b08: shl     r8d,5h
3.80%    0x000002720f204b0c: sub     r8d,r10d
3.98%    0x000002720f204b0f: add     r8d,dword ptr [rax+rbx*4+18h]
4.12%    0x000002720f204b14: mov     r10d,r8d
0x000002720f204b17: shl     r10d,5h
3.78%    0x000002720f204b1b: sub     r10d,r8d
3.75%    0x000002720f204b1e: add     r10d,dword ptr [rax+rbx*4+1ch]
3.81%    0x000002720f204b23: mov     r8d,r10d
0x000002720f204b26: shl     r8d,5h
4.04%    0x000002720f204b2a: sub     r8d,r10d
4.15%    0x000002720f204b2d: add     r8d,dword ptr [rax+rbx*4+20h]
3.98%    0x000002720f204b32: mov     r10d,r8d
0x000002720f204b35: shl     r10d,5h
4.27%    0x000002720f204b39: sub     r10d,r8d
3.95%    0x000002720f204b3c: add     r10d,dword ptr [rax+rbx*4+24h]
3.77%    0x000002720f204b41: mov     r8d,r10d
0x000002720f204b44: shl     r8d,5h
4.01%    0x000002720f204b48: sub     r8d,r10d
4.02%    0x000002720f204b4b: add     r8d,dword ptr [rax+rbx*4+28h]
4.11%    0x000002720f204b50: mov     ecx,r8d
0x000002720f204b53: shl     ecx,5h
4.08%    0x000002720f204b56: sub     ecx,r8d
4.31%    0x000002720f204b59: add     ecx,dword ptr [rax+rbx*4+2ch]


The first thing to ask is where is the multiplication? There is no multiplication, it’s been replaced by a left shift and a subtraction.


public int StrengthReduction() {
int result = 1;
for (int i = 0; i < data.length; ++i) {
result = (result << 5) - result + data[i];
}
return result;
}


This is the compiler trying to be helpful because shifts and subtractions are cheaper than multiplications, and for 31, this transformation is possible. The snippet is one long chain of instructions: notice the register dependencies in the assembly snippet:


4.15%    0x000002720f204b2d: add     r8d,dword ptr [rax+rbx*4+20h]
3.98%    0x000002720f204b32: mov     r10d,r8d
0x000002720f204b35: shl     r10d,5h
4.27%    0x000002720f204b39: sub     r10d,r8d


The addition must complete before the contents of r8d are available for the move, the left shift waits for the move, and the subtraction waits for the shift. No two elements of the array are ever processed simultaneously. First suggested by Peter Levart, I came across it on Daniel Lemire’s blog, the dependency can be broken by manually unrolling the loop:


@Benchmark
public int Unrolled() {
if (data == null)
return 0;

int result = 1;
int i = 0;
for (; i + 7 < data.length; i += 8) {
result = 31 * 31 * 31 * 31 * 31 * 31 * 31 * 31 * result
+ 31 * 31 * 31 * 31 * 31 * 31 * 31 * data[i]
+ 31 * 31 * 31 * 31 * 31 * 31 * data[i + 1]
+ 31 * 31 * 31 * 31 * 31 * data[i + 2]
+ 31 * 31 * 31 * 31 * data[i + 3]
+ 31 * 31 * 31 * data[i + 4]
+ 31 * 31 * data[i + 5]
+ 31 * data[i + 6]
+ data[i + 7]
;
}
for (; i < data.length; i++) {
result = 31 * result + data[i];
}
return result;
}


Weirdly, this implementation does very well (this isn’t new: there has been a ticket for it for several years). Without even bothering with a throughput score (the money shot comes at the end), the profiling output shows that this must be much better. The assembly is quite hard to read because it’s full of tricks I didn’t know existed, but look out for the hexadecimal constants and convince yourself that several are simply powers of 31. The multiplication by 31 is strength reduced to a left shift and a subtraction again.


0.26%    0x000001d67bdd3c8e: mov     r8d,94446f01h
0.01%    0x000001d67bdd3c94: jmp     1d67bdd3cb1h
0x000001d67bdd3c96: nop     word ptr [rax+rax+0h]
0x000001d67bdd3ca0: mov     edi,r11d
0.03%    0x000001d67bdd3ca3: vmovq   r14,xmm0
0.42%    0x000001d67bdd3ca8: mov     ebp,dword ptr [rsp+70h]
7.14%    0x000001d67bdd3cac: vmovq   rbx,xmm1
0.01%    0x000001d67bdd3cb1: cmp     edi,r9d
0x000001d67bdd3cb4: jnb     1d67bdd3d6ch
0.04%    0x000001d67bdd3cba: imul    r10d,dword ptr [rcx+rdi*4+10h],67e12cdfh ; Another strength reduction trick
7.74%    0x000001d67bdd3cc3: add     r10d,r8d                                 ; 1742810335 * x + 2487512833
2.69%    0x000001d67bdd3cc6: mov     r11d,edi
0.46%    0x000001d67bdd3ccd: cmp     r11d,r9d
0x000001d67bdd3cd0: jnb     1d67bdd3dbah
6.82%    0x000001d67bdd3cd6: vmovq   xmm1,rbx
0.61%    0x000001d67bdd3cdb: mov     dword ptr [rsp+70h],ebp
0.06%    0x000001d67bdd3cdf: vmovq   xmm0,r14
0.46%    0x000001d67bdd3ce4: mov     r14,qword ptr [r15+70h]
6.60%    0x000001d67bdd3ce8: mov     r11d,edi
0.04%    0x000001d67bdd3cef: movsxd  rax,edi
6.87%    0x000001d67bdd3cf5: mov     edx,dword ptr [rcx+rax*4+28h]
0.68%    0x000001d67bdd3cf9: imul    r8d,dword ptr [rcx+rax*4+14h],34e63b41h ; multiply by 887503681
0.67%    0x000001d67bdd3d02: add     r8d,r10d   ; --------------------------
7.30%    0x000001d67bdd3d05: mov     r10d,edx   ; Multiply by 31
0.63%    0x000001d67bdd3d08: shl     r10d,5h    ;
0.08%    0x000001d67bdd3d0c: sub     r10d,edx   ; --------------------------
0.73%    0x000001d67bdd3d0f: imul    edx,dword ptr [rcx+rax*4+24h],3c1h     ; multiply by 961
7.47%    0x000001d67bdd3d17: imul    ebp,dword ptr [rcx+rax*4+20h],745fh    ; multiply by 29791
0.56%    0x000001d67bdd3d1f: imul    esi,dword ptr [rcx+rax*4+1ch],0e1781h  ; multiply by 923521
7.02%    0x000001d67bdd3d27: imul    ebx,dword ptr [rcx+rax*4+18h],1b4d89fh ; multiply by 28629151
7.04%    0x000001d67bdd3d3e: add     r8d,dword ptr [rcx+rax*4+2ch] ; add the data value at offset 7
1.73%    0x000001d67bdd3d43: test    dword ptr [r14],eax
0.06%    0x000001d67bdd3d46: cmp     edi,r9d
0x000001d67bdd3d49: jnl     1d67bdd3c15h
0.45%    0x000001d67bdd3d4f: imul    r8d,r8d,94446f01h ; multiply by 2487512833 (coprime to 31, follow r8d backwards)
11.90%    0x000001d67bdd3d56: cmp     r11d,r9d


It’s probably not worth deciphering all the tricks in the code above, but notice that there is a lot of parallelism: the chain of signed multiplications target different registers and are independent. This code is much faster.

I wrote the code below in July last year to try to parallelise this calculation. At the expense of precomputing the powers of 31 up to some fixed length, such as the maximum length of strings in your database, it’s quite fast.


private final int[] coefficients;

public FixedLengthHashCode(int maxLength) {
this.coefficients = new int[maxLength + 1];
coefficients[0] = 1;
for (int i = 1; i <= maxLength; ++i) {
coefficients[i] = 31 * coefficients[i - 1];
}
}

public int hashCode(int[] value) {
final int max = value.length;
int result = coefficients[max];
for (int i = 0; i < value.length && i < coefficients.length - 1; ++i) {
result += coefficients[max - i - 1] * value[i];
}
return result;
}


I was non-commital in the original post but I sort-of claimed this code was vectorised without even bothering to look at the disassembly. It’s scalar, but it’s much more parallel than the unrolled version, and all the clever strength reductions and dependencies are gone.


0.15%    0x0000022d8e6825e0: movsxd  rbx,ecx
0.07%    0x0000022d8e6825e3: mov     rdx,rsi
3.57%    0x0000022d8e6825e6: sub     rdx,rbx
0.08%    0x0000022d8e6825e9: mov     r10d,dword ptr [r9+rbx*4+10h]
0.18%    0x0000022d8e6825ee: imul    r10d,dword ptr [rdi+rdx*4+0ch]
4.25%    0x0000022d8e6825f7: mov     r11d,dword ptr [r9+rbx*4+14h]
0.14%    0x0000022d8e6825fc: imul    r11d,dword ptr [rdi+rdx*4+8h]
1.31%    0x0000022d8e682605: mov     r10d,dword ptr [r9+rbx*4+18h]
3.93%    0x0000022d8e68260a: imul    r10d,dword ptr [rdi+rdx*4+4h]
0.94%    0x0000022d8e682613: mov     r8d,dword ptr [r9+rbx*4+1ch]
0.05%    0x0000022d8e682618: imul    r8d,dword ptr [rdi+rdx*4]
1.02%    0x0000022d8e682620: mov     r10d,dword ptr [r9+rbx*4+20h]
0.19%    0x0000022d8e682625: imul    r10d,dword ptr [rdi+rdx*4+0fffffffffffffffch]
4.71%    0x0000022d8e68262e: mov     r11d,dword ptr [r9+rbx*4+24h]
0.08%    0x0000022d8e682633: imul    r11d,dword ptr [rdi+rdx*4+0fffffffffffffff8h]
5.57%    0x0000022d8e68263c: mov     r10d,dword ptr [r9+rbx*4+28h]
0.04%    0x0000022d8e682641: imul    r10d,dword ptr [rdi+rdx*4+0fffffffffffffff4h]
4.67%    0x0000022d8e68264a: mov     r11d,dword ptr [r9+rbx*4+2ch]
0.08%    0x0000022d8e68264f: imul    r11d,dword ptr [rdi+rdx*4+0fffffffffffffff0h]
4.50%    0x0000022d8e682658: mov     r10d,dword ptr [r9+rbx*4+30h]
0.20%    0x0000022d8e68265d: imul    r10d,dword ptr [rdi+rdx*4+0ffffffffffffffech]
3.82%    0x0000022d8e682666: mov     r8d,dword ptr [r9+rbx*4+34h]
0.05%    0x0000022d8e68266b: imul    r8d,dword ptr [rdi+rdx*4+0ffffffffffffffe8h]
4.18%    0x0000022d8e682674: mov     r10d,dword ptr [r9+rbx*4+38h]
0.02%    0x0000022d8e682679: imul    r10d,dword ptr [rdi+rdx*4+0ffffffffffffffe4h]
5.11%    0x0000022d8e682682: mov     r11d,dword ptr [r9+rbx*4+3ch]
0.03%    0x0000022d8e682687: imul    r11d,dword ptr [rdi+rdx*4+0ffffffffffffffe0h]
4.88%    0x0000022d8e682690: mov     r10d,dword ptr [r9+rbx*4+40h]
0.09%    0x0000022d8e682695: imul    r10d,dword ptr [rdi+rdx*4+0ffffffffffffffdch]
4.56%    0x0000022d8e68269e: mov     r8d,dword ptr [r9+rbx*4+44h]
0.02%    0x0000022d8e6826a3: imul    r8d,dword ptr [rdi+rdx*4+0ffffffffffffffd8h]
4.73%    0x0000022d8e6826ac: mov     r10d,dword ptr [r9+rbx*4+48h]
0.06%    0x0000022d8e6826b1: imul    r10d,dword ptr [rdi+rdx*4+0ffffffffffffffd4h]
4.12%    0x0000022d8e6826ba: mov     r8d,dword ptr [r9+rbx*4+4ch]


That blog post really was lazy. There’s a bit of a problem with the access pattern because the coefficients are accessed in reverse order, and at an offset: it’s too complicated for the optimiser. The code below is just a dot product and it should come as no surprise that it’s faster.


private int[] coefficients;
private int seed;

void init(int size) {
coefficients = new int[size];
coefficients[size - 1] = 1;
for (int i = size - 2; i >= 0; --i) {
coefficients[i] = 31 * coefficients[i + 1];
}
seed = 31 * coefficients[0];
}

@Benchmark
public int Vectorised() {
int result = seed;
for (int i = 0; i < data.length && i < coefficients.length; ++i) {
result += coefficients[i] * data[i];
}
return result;
}


The code vectorises, but the reduction isn’t as good as it could be with handwritten code.


0.22%    0x000001d9c2e0f320: vmovdqu ymm0,ymmword ptr [rdi+rsi*4+70h]
2.31%    0x000001d9c2e0f326: vpmulld ymm0,ymm0,ymmword ptr [r11+rsi*4+70h]
0.61%    0x000001d9c2e0f32d: vmovdqu ymm1,ymmword ptr [rdi+rsi*4+50h]
2.61%    0x000001d9c2e0f333: vpmulld ymm9,ymm1,ymmword ptr [r11+rsi*4+50h]
0.53%    0x000001d9c2e0f33a: vmovdqu ymm1,ymmword ptr [rdi+rsi*4+30h]
2.07%    0x000001d9c2e0f340: vpmulld ymm10,ymm1,ymmword ptr [r11+rsi*4+30h]
0.60%    0x000001d9c2e0f347: vmovdqu ymm1,ymmword ptr [rdi+rsi*4+10h]
2.33%    0x000001d9c2e0f34d: vpmulld ymm11,ymm1,ymmword ptr [r11+rsi*4+10h]
3.56%    0x000001d9c2e0f35e: vextracti128 xmm8,ymm7,1h
1.56%    0x000001d9c2e0f369: vmovd   xmm8,r8d
0.93%    0x000001d9c2e0f372: vmovd   edx,xmm8
2.32%    0x000001d9c2e0f380: vextracti128 xmm6,ymm2,1h
0.63%    0x000001d9c2e0f38a: vmovd   xmm6,edx
7.76%    0x000001d9c2e0f392: vmovd   edx,xmm6
0.34%    0x000001d9c2e0f3a0: vextracti128 xmm1,ymm5,1h
0.88%    0x000001d9c2e0f3aa: vmovd   xmm1,edx
7.85%    0x000001d9c2e0f3b2: vmovd   edx,xmm1
0.34%    0x000001d9c2e0f3c0: vextracti128 xmm3,ymm4,1h
0.80%    0x000001d9c2e0f3ca: vmovd   xmm3,edx
8.09%    0x000001d9c2e0f3d2: vmovd   r8d,xmm3


Using JDK9, this results in a 3x throughput gain over the built in Arrays.hashCode, and that includes the cost of doubling the number of bytes to process and a suboptimal reduction phase. This is going to be a prime candidate for the Vector API, where a vector of powers of 31 could be multiplied by 31^8 on each iteration, before multiplying by a vector of the next 8 data elements.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
BuiltIn thrpt 1 10 3.439980 0.024231 ops/us 256
BuiltIn thrpt 1 10 0.842664 0.009019 ops/us 1024
BuiltIn thrpt 1 10 0.103638 0.003070 ops/us 8192
FixedLength thrpt 1 10 7.421648 0.169558 ops/us 256
FixedLength thrpt 1 10 1.966116 0.044398 ops/us 1024
FixedLength thrpt 1 10 0.249666 0.009994 ops/us 8192
StrengthReduction thrpt 1 10 3.417248 0.045733 ops/us 256
StrengthReduction thrpt 1 10 0.840830 0.011091 ops/us 1024
StrengthReduction thrpt 1 10 0.104265 0.001537 ops/us 8192
Unrolled thrpt 1 10 6.353271 0.042330 ops/us 256
Unrolled thrpt 1 10 1.672845 0.035389 ops/us 1024
Unrolled thrpt 1 10 0.205575 0.009375 ops/us 8192
Vectorised thrpt 1 10 10.958270 0.109993 ops/us 256
Vectorised thrpt 1 10 2.948918 0.081830 ops/us 1024
Vectorised thrpt 1 10 0.358819 0.027316 ops/us 8192

# Explicit Intent and Even Faster Hash Codes

I wrote a post recently about how disappointed I was that the optimiser couldn’t outsmart some clever Java code for computing hash codes. Well, here’s a faster hash code along the same lines.

The hash code implemented in Arrays.hashCode is a polynomial hash, it applies to any data type with a positional interpretation. It takes the general form $\sum_{i=0}^{n}x_{i}31^{n - i}$ where $x_0 = 1$. In other words, it’s a dot product of the elements of the array and some powers of 31. Daniel Lemire’s implementation makes it explicit to the optimiser, in a way it won’t otherwise infer, that this operation is data parallel. If it’s really just a dot product it can be made even more obvious at the cost of a loss of flexibility.

Imagine you are processing fixed or limited length strings (VARCHAR(255) or an URL) or coordinates of a space of fixed dimension. Then you could pre-compute the coefficients in an array and write the hash code explicitly as a dot product. Java 9 uses AVX instructions for dot products, so it should be very fast.


public class FixedLengthHashCode {

private final int[] coefficients;

public FixedLengthHashCode(int maxLength) {
this.coefficients = new int[maxLength + 1];
coefficients[maxLength] = 1;
for (int i = maxLength - 1; i >= 0; --i) {
coefficients[i] = 31 * coefficients[i + 1];
}
}

public int hashCode(int[] value) {
int result = coefficients[0];
for (int i = 0; i < value.length && i < coefficients.length - 1; ++i) {
result += coefficients[i + 1] * value[i];
}
return result;
}
}


This is really explicit, unambiguously parallelisable, and the results are remarkable.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
HashCode.BuiltIn thrpt 1 10 10.323026 0.223614 ops/us 100
HashCode.BuiltIn thrpt 1 10 0.959246 0.038900 ops/us 1000
HashCode.BuiltIn thrpt 1 10 0.096005 0.001836 ops/us 10000
HashCode.FixedLength thrpt 1 10 20.186800 0.297590 ops/us 100
HashCode.FixedLength thrpt 1 10 2.314187 0.082867 ops/us 1000
HashCode.FixedLength thrpt 1 10 0.227090 0.005377 ops/us 10000
HashCode.Unrolled thrpt 1 10 13.250821 0.752609 ops/us 100
HashCode.Unrolled thrpt 1 10 1.503368 0.058200 ops/us 1000
HashCode.Unrolled thrpt 1 10 0.152179 0.003541 ops/us 10000

Modifying the algorithm slightly to support limited variable length arrays degrades performance slightly, but there are seemingly equivalent implementations which do much worse.


public class FixedLengthHashCode {

private final int[] coefficients;

public FixedLengthHashCode(int maxLength) {
this.coefficients = new int[maxLength + 1];
coefficients[0] = 1;
for (int i = 1; i >= maxLength; ++i) {
coefficients[i] = 31 * coefficients[i - 1];
}
}

public int hashCode(int[] value) {
final int max = value.length;
int result = coefficients[max];
for (int i = 0; i < value.length && i < coefficients.length - 1; ++i) {
result += coefficients[max - i - 1] * value[i];
}
return result;
}
}


Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
FixedLength thrpt 1 10 19.172574 0.742637 ops/us 100
FixedLength thrpt 1 10 2.233006 0.115285 ops/us 1000
FixedLength thrpt 1 10 0.227451 0.012231 ops/us 10000

The benchmark code is at github.

# Still True in Java 9: Handwritten Hash Codes are Faster

One of the things to keep in mind with Java is that the best performance advice for one version may not apply to the next. The JVM has an optimiser which works by detecting intent in byte-code; it does a better job when the programmer is skilled enough to make that intent clear. There have been times when the optimiser has done a bad job, either through bugs or feature gaps, and compensatory idioms emerge. The value of these idioms degrades over time as the optimiser improves; a stroke of ingenuity in one version can become ritualistic nonsense in the next. It’s important not to be that guy who fears adding strings together because it was costly a decade ago, but does it always get better, even with low hanging fruit?

I reproduced the results of an extremely astute optimisation presented in 2015 by Daniel Lemire. I was hoping to see that an improved optimiser in Java 9, having observed several cases of automatic vectorisation, would render this optimisation null.

### Hash Codes for Arrays

I encourage you to read the original blog post because it is informative, but for the sake of coherence I will summarise the key point here. Arrays.hashCode implements the following hash code:


public static int hashCode(int[] a) {
if (a == null)
return 0;

int result = 1;
for (int element : a)
result = 31 * result + element;

return result;
}


This results in a good hash code, but a scalar internal representation of this code has a problem: a data dependency on the multiplication, which is slower than moving data into registers. A significant and reproducible speed up can be observed when unrolling the loop, which allows prefetching 128 bits of the array into registers before doing the slower multiplications:


public static int hashCode(int[] a) {
if (a == null)
return 0;

int result = 1;
int i = 0;
for (; i + 3 < a.length; i += 4) {
result = 31 * 31 * 31 * 31 * result
+ 31 * 31 * 31 * a[i]
+ 31 * 31 * a[i + 1]
+ 31 * a[i + 2]
+ a[i + 3];
}
for (; i < a.length; i++) {
result = 31 * result + a[i];
}
return result;
}


The improvement in performance from this simple change can be confirmed with Java 8 by running a simple benchmark.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
BuiltIn thrpt 1 10 9.537736 0.382617 ops/us 100
BuiltIn thrpt 1 10 0.804620 0.103037 ops/us 1000
BuiltIn thrpt 1 10 0.092297 0.003947 ops/us 10000
Unrolled thrpt 1 10 14.974398 0.628522 ops/us 100
Unrolled thrpt 1 10 1.514986 0.035759 ops/us 1000
Unrolled thrpt 1 10 0.137408 0.010200 ops/us 10000

The performance improvement is so obvious, and the change so easy to make, that one wonders why JVM vendors didn’t make the change themselves.

### Java 9: Universally Better Automatic Optimisations?

As the comments on the blog post suggest, this is a prime candidate for vectorisation. Auto-vectorisation is a thing in Java 9. Using intrinsics or code clean enough to express intent clearly, you can really expect to see good usage of SIMD in Java 9. I was really expecting the situation to reverse in Java 9; but it doesn’t.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
BuiltIn thrpt 1 10 9.822568 0.381087 ops/us 100
BuiltIn thrpt 1 10 0.949273 0.024021 ops/us 1000
BuiltIn thrpt 1 10 0.093171 0.004502 ops/us 10000
Unrolled thrpt 1 10 13.762617 0.440135 ops/us 100
Unrolled thrpt 1 10 1.501106 0.094897 ops/us 1000
Unrolled thrpt 1 10 0.139963 0.011487 ops/us 10000

This is a still a smart optimisation two years later, but it offends my sensibilities in the same way hints do in SQL – a layer of abstraction has been punctured. I will have to try again in Java 10.

Better gains are available when precomputing the coefficients is possible.