Explicit Intent and Even Faster Hash Codes

I wrote a post recently about how disappointed I was that the optimiser couldn’t outsmart some clever Java code for computing hash codes. Well, here’s a faster hash code along the same lines.

The hash code implemented in Arrays.hashCode is a polynomial hash; it applies to any data type with a positional interpretation. It takes the general form \sum_{i=0}^{n}x_{i}31^{n - i} where x_0 = 1, so for a three element array \{x_1, x_2, x_3\} it expands to 31^3 + 31^2x_1 + 31x_2 + x_3. In other words, it’s a dot product of the elements of the array and some powers of 31. Daniel Lemire’s implementation makes it explicit to the optimiser, in a way it won’t otherwise infer, that this operation is data parallel. If it’s really just a dot product, it can be made even more obvious at the cost of a loss of flexibility.

Imagine you are processing fixed or limited length strings (a VARCHAR(255) or a URL, for instance), or coordinates in a space of fixed dimension. Then you can pre-compute the coefficients in an array and write the hash code explicitly as a dot product. Java 9 uses AVX instructions for dot products, so it should be very fast.


public class FixedLengthHashCode {

    private final int[] coefficients;

    public FixedLengthHashCode(int maxLength) {
        // precompute decreasing powers of 31: coefficients[i] = 31^(maxLength - i)
        this.coefficients = new int[maxLength + 1];
        coefficients[maxLength] = 1;
        for (int i = maxLength - 1; i >= 0; --i) {
            coefficients[i] = 31 * coefficients[i + 1];
        }
    }

    public int hashCode(int[] value) {
        // a dot product of the input with the precomputed coefficients
        int result = coefficients[0];
        for (int i = 0; i < value.length && i < coefficients.length - 1; ++i) {
            result += coefficients[i + 1] * value[i];
        }
        return result;
    }
}
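For an input of exactly maxLength elements this produces the same value as Arrays.hashCode, which is easy to sanity check. The snippet below is my own illustration rather than part of the original post:

import java.util.Arrays;
import java.util.SplittableRandom;

public class FixedLengthHashCodeCheck {

    public static void main(String[] args) {
        int length = 255;
        int[] data = new SplittableRandom(42).ints(length).toArray();

        FixedLengthHashCode hasher = new FixedLengthHashCode(length);
        // for a full-length input the precomputed dot product matches Arrays.hashCode exactly
        System.out.println(hasher.hashCode(data) == Arrays.hashCode(data)); // prints true
    }
}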

This is really explicit, unambiguously parallelisable, and the results are remarkable.

Benchmark             Mode   Threads  Samples  Score      Error (99.9%)  Unit    Size
HashCode.BuiltIn      thrpt  1        10       10.323026  0.223614       ops/us  100
HashCode.BuiltIn      thrpt  1        10       0.959246   0.038900       ops/us  1000
HashCode.BuiltIn      thrpt  1        10       0.096005   0.001836       ops/us  10000
HashCode.FixedLength  thrpt  1        10       20.186800  0.297590       ops/us  100
HashCode.FixedLength  thrpt  1        10       2.314187   0.082867       ops/us  1000
HashCode.FixedLength  thrpt  1        10       0.227090   0.005377       ops/us  10000
HashCode.Unrolled     thrpt  1        10       13.250821  0.752609       ops/us  100
HashCode.Unrolled     thrpt  1        10       1.503368   0.058200       ops/us  1000
HashCode.Unrolled     thrpt  1        10       0.152179   0.003541       ops/us  10000

Modifying the algorithm slightly to support variable length arrays, up to some maximum, costs a little performance (the result still agrees with Arrays.hashCode for each length), but there are seemingly equivalent implementations which do much worse.


public class FixedLengthHashCode {

    private final int[] coefficients;

    public FixedLengthHashCode(int maxLength) {
        // precompute increasing powers of 31: coefficients[i] = 31^i
        this.coefficients = new int[maxLength + 1];
        coefficients[0] = 1;
        for (int i = 1; i <= maxLength; ++i) {
            coefficients[i] = 31 * coefficients[i - 1];
        }
    }

    public int hashCode(int[] value) {
        // start from the power matching the actual length, so shorter inputs still work
        final int max = value.length;
        int result = coefficients[max];
        for (int i = 0; i < value.length && i < coefficients.length - 1; ++i) {
            result += coefficients[max - i - 1] * value[i];
        }
        return result;
    }
}

Benchmark    Mode   Threads  Samples  Score      Error (99.9%)  Unit    Size
FixedLength  thrpt  1        10       19.172574  0.742637       ops/us  100
FixedLength  thrpt  1        10       2.233006   0.115285       ops/us  1000
FixedLength  thrpt  1        10       0.227451   0.012231       ops/us  10000

The benchmark code is on GitHub.

Still True in Java 9: Handwritten Hash Codes are Faster

One of the things to keep in mind with Java is that the best performance advice for one version may not apply to the next. The JVM has an optimiser which works by detecting intent in bytecode; it does a better job when the programmer is skilled enough to make that intent clear. There have been times when the optimiser has done a bad job, either through bugs or feature gaps, and compensatory idioms have emerged. The value of these idioms degrades over time as the optimiser improves; a stroke of ingenuity in one version can become ritualistic nonsense in the next. It’s important not to be that guy who fears adding strings together because it was costly a decade ago, but does it always get better, even with low-hanging fruit?

I reproduced the results of an extremely astute optimisation presented in 2015 by Daniel Lemire. Having observed several cases of automatic vectorisation in Java 9, I was hoping that its improved optimiser would render the handwritten optimisation redundant.

Hash Codes for Arrays

I encourage you to read the original blog post because it is informative, but for the sake of coherence I will summarise the key point here. Arrays.hashCode implements the following hash code:


    public static int hashCode(int[] a) {
        if (a == null)
            return 0;

        int result = 1;
        for (int element : a)
            result = 31 * result + element;

        return result;
    }

This results in a good hash code, but the scalar implementation has a problem: each iteration depends on the result of the multiplication in the previous one, and multiplication is slower than moving data into registers. A significant and reproducible speed up can be observed when unrolling the loop, which allows 128 bits of the array to be loaded into registers before the slower multiplications are performed:


    public static int hashCode(int[] a) {
        if (a == null)
            return 0;

        int result = 1;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            result = 31 * 31 * 31 * 31 * result
                   + 31 * 31 * 31 * a[i]
                   + 31 * 31 * a[i + 1]
                   + 31 * a[i + 2]
                   + a[i + 3];
        }
        for (; i < a.length; i++) {
            result = 31 * result + a[i];
        }
        return result;
    }

The improvement in performance from this small change can be confirmed with Java 8 by running a benchmark.
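The measurements below come from JMH. A minimal sketch of such a harness, assuming hypothetical class and method names, might look like this:

import java.util.Arrays;
import java.util.SplittableRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class HashCodeBenchmark {

    @Param({"100", "1000", "10000"})
    public int size;

    public int[] data;

    @Setup(Level.Trial)
    public void init() {
        // fill an array of the parameterised size with random values
        data = new SplittableRandom(0).ints(size).toArray();
    }

    @Benchmark
    public int builtIn() {
        return Arrays.hashCode(data);
    }

    @Benchmark
    public int unrolled() {
        return unrolledHashCode(data);
    }

    private static int unrolledHashCode(int[] a) {
        // the unrolled implementation shown above
        int result = 1;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            result = 31 * 31 * 31 * 31 * result
                   + 31 * 31 * 31 * a[i]
                   + 31 * 31 * a[i + 1]
                   + 31 * a[i + 2]
                   + a[i + 3];
        }
        for (; i < a.length; i++) {
            result = 31 * result + a[i];
        }
        return result;
    }
}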

Benchmark  Mode   Threads  Samples  Score      Error (99.9%)  Unit    Size
BuiltIn    thrpt  1        10       9.537736   0.382617       ops/us  100
BuiltIn    thrpt  1        10       0.804620   0.103037       ops/us  1000
BuiltIn    thrpt  1        10       0.092297   0.003947       ops/us  10000
Unrolled   thrpt  1        10       14.974398  0.628522       ops/us  100
Unrolled   thrpt  1        10       1.514986   0.035759       ops/us  1000
Unrolled   thrpt  1        10       0.137408   0.010200       ops/us  10000

The performance improvement is so obvious, and the change so easy to make, that one wonders why JVM vendors didn’t make the change themselves.

Java 9: Universally Better Automatic Optimisations?

As the comments on the blog post suggest, this is a prime candidate for vectorisation, and auto-vectorisation is a reality in Java 9: using intrinsics, or code clean enough to express its intent clearly, you can expect to see good use of SIMD. I was expecting the situation to reverse in Java 9, but it doesn’t.

Benchmark  Mode   Threads  Samples  Score      Error (99.9%)  Unit    Size
BuiltIn    thrpt  1        10       9.822568   0.381087       ops/us  100
BuiltIn    thrpt  1        10       0.949273   0.024021       ops/us  1000
BuiltIn    thrpt  1        10       0.093171   0.004502       ops/us  10000
Unrolled   thrpt  1        10       13.762617  0.440135       ops/us  100
Unrolled   thrpt  1        10       1.501106   0.094897       ops/us  1000
Unrolled   thrpt  1        10       0.139963   0.011487       ops/us  10000

This is still a smart optimisation two years later, but it offends my sensibilities in the same way hints do in SQL – a layer of abstraction has been punctured. I will have to try again in Java 10.

Better gains are available when precomputing the coefficients is possible.

New Methods in Java 9: Math.fma and Arrays.mismatch

There are two noteworthy new methods in Java 9: Arrays.mismatch and Math.fma.

Arrays.mismatch

This method takes two primitive arrays and returns the index at which their values first differ, or -1 if they are identical; it effectively computes the length of the longest common prefix of the two arrays. This is really quite useful, mostly for text processing but also for bioinformatics (protein sequencing and so on, much more interesting than the sort of thing I work on). Having worked extensively with Apache HBase (where the vast majority of the API involves manipulating byte arrays) I can think of lots of less interesting use cases for this method.
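A quick illustration of the semantics, with made-up inputs of my own:

import java.util.Arrays;

public class MismatchExamples {

    public static void main(String[] args) {
        byte[] left = {1, 2, 3, 4, 5};
        byte[] right = {1, 2, 3, 9, 9};

        // index of the first differing element
        System.out.println(Arrays.mismatch(left, right)); // 3

        // identical arrays
        System.out.println(Arrays.mismatch(left, left)); // -1

        // one array is a proper prefix of the other: the length of the shorter array
        System.out.println(Arrays.mismatch(new byte[]{1, 2}, left)); // 2
    }
}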

Looking carefully, you can see that the method calls into the internal ArraysSupport utility class, which will try to perform a vectorised mismatch (an intrinsic candidate). Since this uses AVX instructions, it is very fast; much faster than a handwritten loop.

Let’s measure the boost versus a handwritten loop, testing across a range of common prefixes and byte[] array lengths.


    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void Mismatch_Intrinsic(BytePrefixData data, Blackhole bh) {
        bh.consume(Arrays.mismatch(data.data1, data.data2));
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void Mismatch_Handwritten(BytePrefixData data, Blackhole bh) {
        byte[] data1 = data.data1;
        byte[] data2 = data.data2;
        int length = Math.min(data1.length, data2.length);
        int mismatch = -1;
        for (int i = 0; i < length; ++i) {
            if (data1[i] != data2[i]) {
                mismatch = i;
                break;
            }
        }
        bh.consume(mismatch);
    }

The results speak for themselves. Irritatingly, there is some duplication in output because I haven’t figured out how to make JMH use a subset of the Cartesian product of its parameters.

Benchmark (prefix) (size) Mode Cnt Score Error Units
Mismatch_Handwritten 10 100 thrpt 10 22.360 ± 0.938 ops/us
Mismatch_Handwritten 10 1000 thrpt 10 2.459 ± 0.256 ops/us
Mismatch_Handwritten 10 10000 thrpt 10 0.255 ± 0.009 ops/us
Mismatch_Handwritten 100 100 thrpt 10 22.763 ± 0.869 ops/us
Mismatch_Handwritten 100 1000 thrpt 10 2.690 ± 0.044 ops/us
Mismatch_Handwritten 100 10000 thrpt 10 0.273 ± 0.008 ops/us
Mismatch_Handwritten 1000 100 thrpt 10 24.970 ± 0.713 ops/us
Mismatch_Handwritten 1000 1000 thrpt 10 2.791 ± 0.066 ops/us
Mismatch_Handwritten 1000 10000 thrpt 10 0.281 ± 0.007 ops/us
Mismatch_Intrinsic 10 100 thrpt 10 89.169 ± 2.759 ops/us
Mismatch_Intrinsic 10 1000 thrpt 10 26.995 ± 0.501 ops/us
Mismatch_Intrinsic 10 10000 thrpt 10 3.553 ± 0.065 ops/us
Mismatch_Intrinsic 100 100 thrpt 10 83.037 ± 5.590 ops/us
Mismatch_Intrinsic 100 1000 thrpt 10 26.249 ± 0.714 ops/us
Mismatch_Intrinsic 100 10000 thrpt 10 3.523 ± 0.122 ops/us
Mismatch_Intrinsic 1000 100 thrpt 10 87.921 ± 6.566 ops/us
Mismatch_Intrinsic 1000 1000 thrpt 10 25.812 ± 0.442 ops/us
Mismatch_Intrinsic 1000 10000 thrpt 10 4.177 ± 0.059 ops/us

Why is there such a big difference? Look at how the score decreases as a function of array length, even when the common prefix, and therefore the number of comparisons required, is small: clearly the performance of this algorithm depends on the efficiency of memory access. Arrays.mismatch optimises this, XORing 256 bits at a time.

  3.02%    1.83%        0x00007fea85ba50a0: push   %rbp
  0.14%    0.15%        0x00007fea85ba50a1: mov    %rsp,%rbp
  2.84%    4.70%        0x00007fea85ba50a4: shl    %cl,%rdx
  0.41%    0.38%        0x00007fea85ba50a7: xor    %rax,%rax
  2.72%    4.49%        0x00007fea85ba50aa: cmp    $0x8,%rdx
                        0x00007fea85ba50ae: je     Stub::vectorizedMismatch+148 0x00007fea85ba5134
  0.19%    0.17%        0x00007fea85ba50b4: jl     Stub::vectorizedMismatch+182 0x00007fea85ba5156
  0.18%    0.16%        0x00007fea85ba50ba: cmp    $0x10,%rdx
  0.00%           ╭     0x00007fea85ba50be: je     Stub::vectorizedMismatch+103 0x00007fea85ba5107
  0.12%    0.10%  │     0x00007fea85ba50c4: jl     Stub::vectorizedMismatch+148 0x00007fea85ba5134
  2.80%    1.69%  │     0x00007fea85ba50ca: cmp    $0x20,%rdx
                  │╭    0x00007fea85ba50ce: jl     Stub::vectorizedMismatch+97 0x00007fea85ba5101
  0.09%    0.08%  ││    0x00007fea85ba50d0: sub    $0x20,%rdx
  0.18%    0.18%  ││↗   0x00007fea85ba50d4: vmovdqu (%rdi,%rax,1),%ymm0
  0.15%    0.15%  │││   0x00007fea85ba50d9: vmovdqu (%rsi,%rax,1),%ymm1
  8.63%    9.44%  │││   0x00007fea85ba50de: vpxor  %ymm1,%ymm0,%ymm2
  2.63%    2.96%  │││   0x00007fea85ba50e2: vptest %ymm2,%ymm2
  3.49%    3.84%  │││   0x00007fea85ba50e7: jne    Stub::vectorizedMismatch+291 0x00007fea85ba51c3
 12.19%   14.10%  │││   0x00007fea85ba50ed: add    $0x20,%rax
  0.30%    0.32%  │││   0x00007fea85ba50f1: sub    $0x20,%rdx
                  ││╰   0x00007fea85ba50f5: jge    Stub::vectorizedMismatch+52 0x00007fea85ba50d4
  0.00%    0.00%  ││    0x00007fea85ba50f7: add    $0x20,%rdx
                  ││    0x00007fea85ba50fb: je     Stub::vectorizedMismatch+363 0x00007fea85ba520b
  0.00%    0.00%  │↘    0x00007fea85ba5101: cmp    $0x10,%rdx
  0.00%           │  ╭  0x00007fea85ba5105: jl     Stub::vectorizedMismatch+142 0x00007fea85ba512e
                  ↘  │  0x00007fea85ba5107: vmovdqu (%rdi,%rax,1),%xmm0
                     │  0x00007fea85ba510c: vmovdqu (%rsi,%rax,1),%xmm1
                     │  0x00007fea85ba5111: vpxor  %xmm1,%xmm0,%xmm2
                     │  0x00007fea85ba5115: vptest %xmm2,%xmm2
                     │  0x00007fea85ba511a: jne    Stub::vectorizedMismatch+319 0x00007fea85ba51df
                     │  0x00007fea85ba5120: add    $0x10,%rax
                     │  0x00007fea85ba5124: sub    $0x10,%rdx
                     │  0x00007fea85ba5128: je     Stub::vectorizedMismatch+363 0x00007fea85ba520b
  2.91%    2.96%     ↘  0x00007fea85ba512e: cmp    $0x8,%rdx
                        0x00007fea85ba5132: jl     Stub::vectorizedMismatch+182 0x00007fea85ba5156

The code for this benchmark is on GitHub.

Math.fma

In comparison to users of some other languages, Java programmers are lackadaisical about floating point errors. It’s a good job that, historically, Java hasn’t been considered suitable for the implementation of numerical algorithms. But all of a sudden there is a revolution of data science on the JVM, albeit mostly driven by the Scala community, with JVM implementations of structures like recurrent neural networks abounding. It matters less for machine learning than for root finding, but how accurate can these implementations be without JVM level support for minimising the propagation of floating point errors? With Math.fma this is improving, because it allows two common operations to be performed before rounding.

Math.fma fuses a multiplication and an addition into a single floating point operation to compute expressions like ab + c. This has two key benefits:

  1. There’s only one operation, and only one rounding error
  2. This is explicitly supported in AVX2 by the VFMADD* instructions
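The difference in rounding is easy to observe. The snippet below is a contrived example of mine, not from the original post: the product 0.1 * 10 is inexact in binary, and fusing the multiplication with the subtraction of 1.0 preserves a residual that the separately rounded expression loses.

public class FmaRounding {

    public static void main(String[] args) {
        double a = 0.1; // not exactly representable in binary
        double b = 10.0;

        // two roundings: the product is rounded to exactly 1.0, so the residual vanishes
        System.out.println(a * b - 1.0);          // 0.0

        // one rounding: the exact product minus 1.0 is rounded once, exposing the residual
        System.out.println(Math.fma(a, b, -1.0)); // a small non-zero value, roughly 5.55E-17
    }
}

On hardware with FMA support this maps to the VFMADD* instructions mentioned above; without it, the JVM may fall back to a much slower software implementation that still rounds correctly.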

Newton’s Method

To investigate any superior suppression of floating point errors, I use a toy implementation of Newton’s method, which iterates x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}, to compute the root of a quadratic equation which any teenager could calculate analytically (so the error is easy to quantify).

I compare these two implementations for 4x^2 - 12x + 9 (there is a repeated root at 1.5) to get an idea of the error (defined by |1.5 - x_n|) after a large number of iterations.

I implemented this using FMA:


public class NewtonsMethodFMA {

    private final double[] coefficients;

    public NewtonsMethodFMA(double[] coefficients) {
        this.coefficients = coefficients;
    }

    public double evaluateF(double x) {
        double f = 0D;
        int power = coefficients.length - 1;
        for (int i = 0; i < coefficients.length; ++i) {
            f = Math.fma(coefficients[i], Math.pow(x, power--), f);
        }
        return f;
    }

    public double evaluateDF(double x) {
        double df = 0D;
        int power = coefficients.length - 2;
        for (int i = 0; i < coefficients.length - 1; ++i) {
            df = Math.fma((power + 1) * coefficients[i],  Math.pow(x, power--), df);
        }
        return df;
    }

    public double solve(double initialEstimate, int maxIterations) {
        double result = initialEstimate;
        for (int i = 0; i < maxIterations; ++i) {
            result -= evaluateF(result)/evaluateDF(result);
        }
        return result;
    }
}

And an implementation with normal operations:



public class NewtonsMethod {

    private final double[] coefficients;

    public NewtonsMethod(double[] coefficients) {
        this.coefficients = coefficients;
    }

    public double evaluateF(double x) {
        double f = 0D;
        int power = coefficients.length - 1;
        for (int i = 0; i < coefficients.length; ++i) {
            f += coefficients[i] * Math.pow(x, power--);
        }
        return f;
    }

    public double evaluateDF(double x) {
        double df = 0D;
        int power = coefficients.length - 2;
        for (int i = 0; i < coefficients.length - 1; ++i) {
            df += (power + 1) * coefficients[i] * Math.pow(x, power--);
        }
        return df;
    }

    public double solve(double initialEstimate, int maxIterations) {
        double result = initialEstimate;
        for (int i = 0; i < maxIterations; ++i) {
            result -= evaluateF(result)/evaluateDF(result);
        }
        return result;
    }
}
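Both classes take the coefficients in descending order of degree, so comparing them for 4x^2 - 12x + 9 looks something like the snippet below; the initial estimate and iteration count here are only illustrative.

public class NewtonsMethodComparison {

    public static void main(String[] args) {
        double[] coefficients = {4, -12, 9}; // 4x^2 - 12x + 9, repeated root at 1.5
        double initialEstimate = 0D;
        int iterations = 1000;

        double fmaRoot = new NewtonsMethodFMA(coefficients).solve(initialEstimate, iterations);
        double plainRoot = new NewtonsMethod(coefficients).solve(initialEstimate, iterations);

        // error |1.5 - x_n| for each implementation
        System.out.println(Math.abs(1.5 - fmaRoot));
        System.out.println(Math.abs(1.5 - plainRoot));
    }
}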

When I run this code for 1000 iterations, the FMA version results in 1.5000000083575202, whereas the vanilla version results in 1.500000017233207. It’s completely unscientific, but seems plausible and confirms my prejudice so… In fact, it’s not that simple, and over a range of initial values, there is only a very small difference in FMA’s favour. There’s not even a performance improvement – clearly this method wasn’t added so you can start implementing numerical root finding algorithms – the key takeaway is that the results are slightly different because a different rounding strategy has been used.

Benchmark (maxIterations) Mode Cnt Score Error Units
NM_FMA 100 thrpt 10 93.805 ± 5.174 ops/ms
NM_FMA 1000 thrpt 10 9.420 ± 1.169 ops/ms
NM_FMA 10000 thrpt 10 0.962 ± 0.044 ops/ms
NM_HandWritten 100 thrpt 10 93.457 ± 5.048 ops/ms
NM_HandWritten 1000 thrpt 10 9.274 ± 0.483 ops/ms
NM_HandWritten 10000 thrpt 10 0.928 ± 0.041 ops/ms