Vectorised Polynomial Hash Codes

To provide support for the idea of pluggable hashing strategies, Peter Lawrey demonstrates that there are better and faster hash codes than the JDK implementation of String.hashCode or Arrays.hashCode. I really like the analysis of output distribution so recommend reading the post. However, I’m not absolutely sure if pluggable hashing strategies would be a good idea. They can induce coupling between the strategy implementation and the contents of the hashed data structure, which have different life cycles and code ownership. If performance is what matters, why not just make the existing algorithm much faster?

Peter’s hash code uses Unsafe to reinterpret each four bytes as an integer, but is otherwise just another polynomial hash code with a different coefficient. It produces a slightly different result, but with potentially better properties. Here’s that hash code.

private static final int M2 = 0x7A646E4D;

// read 4 bytes at a time from a byte[] assuming Java 9+ Compact Strings
private static int getIntFromArray(byte[] value, int i) {
    return UnsafeMemory.INSTANCE.UNSAFE.getInt(value, BYTE_ARRAY_OFFSET + i); 
}

public static int nativeHashCode(byte[] value) {
    long h = getIntFromArray(value, 0);
    for (int i = 4; i < value.length; i += 4)
        h = h * M2 + getIntFromArray(value, i);
    h *= M2;
    return (int) h ^ (int) (h >>> 25);
}

Leaving the output distribution to one side, Peter reports that this hash code performs better than Arrays.hashCode(byte[]) and this is accurate. Where does the performance come from? The reintepretation reduces the number of multiplications by a factor of four, but you need Unsafe to achieve this. This also obviates the need to convert each byte to an integer to avoid overflow. Another problem is solved just by changing the multiplier. Arrays.hashCode is generally slow because the multiplication by 31 gets strength reduced to a left shift by five and a subtraction, which inadvertently creates a data dependency which can’t be unrolled. When the multiplier is 31, just unrolling the multiplication (javac will deduce that 31 * 31 = 961 so there is no runtime cost) to disable the strength reduction can increase throughput by 2x, and the rather obscure choice of 0x7A646E4D means that no such transformation takes place: this results in independent chains of multiplications and additions in the main loop:


  0.18%    0.46%     ││  0x00007f3b21c05285: movslq 0x18(%rdx,%r8,1),%rsi
  5.93%    6.28%     ││  0x00007f3b21c0528a: movslq 0x1c(%rdx,%r8,1),%rax
  0.12%    0.42%     ││  0x00007f3b21c0528f: imul   $0x7a646e4d,%rcx,%rcx
 11.87%   37.31%     ││  0x00007f3b21c05296: movslq 0x14(%rdx,%r8,1),%rdi
  0.10%    0.18%     ││  0x00007f3b21c0529b: movslq 0x10(%rdx,%r8,1),%r8
  0.06%    0.58%     ││  0x00007f3b21c052a0: add    %rcx,%r8
  5.29%    1.30%     ││  0x00007f3b21c052a3: imul   $0x7a646e4d,%r8,%r8
 18.34%   21.94%     ││  0x00007f3b21c052aa: add    %r8,%rdi
  6.33%    1.96%     ││  0x00007f3b21c052ad: imul   $0x7a646e4d,%rdi,%r8
 17.60%   10.88%     ││  0x00007f3b21c052b4: add    %r8,%rsi
  5.39%    0.72%     ││  0x00007f3b21c052b7: imul   $0x7a646e4d,%rsi,%rcx
 17.80%   11.58%     ││  0x00007f3b21c052be: add    %rax,%rcx 

Is this as good as it can get and is there something fundamentally wrong with the JDK algorithm? The algorithm can be vectorised, but this is beyond C2’s autovectoriser. The same algorithm is used for Arrays.hashCode(int[]), which doesn’t have the complication of type promotion from byte to int. I have noted before that this can be transformed to a loop which C2 can autovectorise by precomputing the coefficients of the polynomial (i.e. the powers of 31 until they repeat modulo 32, but x -> x * 31 has a very long period modulo 32) but this requires either an enormous array or a maximum length.


    private int[] coefficients;
    private int seed;

    void init(int size) {
        coefficients = new int[size]; 
        coefficients[size - 1] = 1;
        for (int i = size - 2; i >= 0; --i) {
            coefficients[i] = 31 * coefficients[i + 1];
        }
        seed = 31 * coefficients[0];
    }

    public int hashCodeAutoVectorised() {
        int result = seed;
        for (int i = 0; i < data.length && i < coefficients.length; ++i) {
            result += coefficients[i] * data[i];
        }
        return result;
    }

This idea isn’t practical but demonstrates that this kind of polynomial can be computed efficiently, if only the coefficients could be generated without disabling autovectorisation. Generating the coefficients on the fly is possible with the Vector API. It requires a multiplier containing eight consecutive powers of 31, and the exponent of each element needs to be increased by eight in each iteration. This can be achieved with a broadcast variable.


  public int polynomialHashCode() {
    var next = YMM_INT.broadcast(POWERS_OF_31_BACKWARDS[33 - 9]);
    var coefficients = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 8);
    var acc = YMM_INT.zero();
    for (int i = data.length; i - YMM_INT.length() >= 0; i -= YMM_INT.length()) {
      acc = acc.add(coefficients.mul(YMM_INT.fromArray(data, i - YMM_INT.length())));
      coefficients = coefficients.mul(next);
    }
    return acc.addAll() + coefficients.get(7);
  }

There’s a problem here – it sandwiches a low latency addition between two high latency multiplications, so there is a data dependency and unrolling without breaking the dependency isn’t necessarily helpful. The dependency can be broken manually by using four accumulators, four coefficient vectors consisting of 32 consecutive powers of 31, and each coefficient must have its logarithm increased by 32 in each iteration. It may look dirty but the dependencies are eradicated.


  public int polynomialHashCodeUnrolled() {
    var next = YMM_INT.broadcast(POWERS_OF_31_BACKWARDS[0]);
    var coefficients1 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 8);
    var coefficients2 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 16);
    var coefficients3 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 24);
    var coefficients4 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 32);
    var acc1 = YMM_INT.zero();
    var acc2 = YMM_INT.zero();
    var acc3 = YMM_INT.zero();
    var acc4 = YMM_INT.zero();
    for (int i = data.length; i - 4 * YMM_INT.length() >= 0; i -= YMM_INT.length() * 4) {
      acc1 = acc1.add(coefficients1.mul(YMM_INT.fromArray(data, i - YMM_INT.length())));
      acc2 = acc2.add(coefficients2.mul(YMM_INT.fromArray(data, i - 2 * YMM_INT.length())));
      acc3 = acc3.add(coefficients3.mul(YMM_INT.fromArray(data, i - 3 * YMM_INT.length())));
      acc4 = acc4.add(coefficients4.mul(YMM_INT.fromArray(data, i - 4 * YMM_INT.length())));
      coefficients1 = coefficients1.mul(next);
      coefficients2 = coefficients2.mul(next);
      coefficients3 = coefficients3.mul(next);
      coefficients4 = coefficients4.mul(next);
    }
    return acc1.add(acc2).add(acc3).add(acc4).addAll() + coefficients1.get(7);
  }

The implementation above does not handle arbitrary length input, but the input could either be padded with zeroes (see this paper on padded SLP autovectorisation) or followed by a post loop. It produces the same output as Arrays.hashCode, so how much faster is it?

Benchmark (size) Mode Cnt Score Error Units
arraysHashCode 1024 thrpt 20 1095.089 3.980 ops/ms
arraysHashCode 65536 thrpt 20 16.963 0.130 ops/ms
hashCodeAutoVectorised 1024 thrpt 20 3716.853 18.025 ops/ms
hashCodeAutoVectorised 65536 thrpt 20 57.265 0.907 ops/ms
polynomialHashCode 1024 thrpt 20 2623.090 7.920 ops/ms
polynomialHashCode 65536 thrpt 20 39.344 0.238 ops/ms
polynomialHashCodeUnrolled 1024 thrpt 20 8266.119 34.480 ops/ms
polynomialHashCodeUnrolled 65536 thrpt 20 131.196 6.234 ops/ms

So there really is absolutely nothing wrong with the algorithm from a performance perspective, but the implementation can be improved vastly (~8x). It seems that the tools required for JDK engineers and users alike to make optimisations like these are in the pipeline!

What about byte arrays? One obstacle to vectorisation is that to implement the JDK algorithm strictly, the bytes must accumulate into 32 bit values, which means that the lanes need to widen, so the contents of a single vector register of bytes would need to fan out to four vector registers of integers. This would be achievable by loading vectors at eight byte offsets and permuting the first eight bytes of each vector into every fourth position, but this is quite convoluted.

Peter demonstrates that reinterpreting four bytes as an integer doesn’t necessarily degrade the hash distribution, and may even improve it, so I used the same trick: rebracketing a ByteVector to an IntVector to produce the “wrong” result, but a reasonable hash code nonetheless. A nice feature of the Vector API is allowing this kind of reinterpretation without resorting to Unsafe, via Vector.rebracket.

  
  public int hashCodeVectorAPINoDependencies() {
    var next = YMM_INT.broadcast(POWERS_OF_31[8]);
    var coefficients1 = YMM_INT.fromArray(POWERS_OF_31, 0);
    var coefficients2 = coefficients1.mul(next);
    var coefficients3 = coefficients2.mul(next);
    var coefficients4 = coefficients3.mul(next);
    next = next.mul(next);
    next = next.mul(next);
    var acc1 = YMM_INT.zero();
    var acc2 = YMM_INT.zero();
    var acc3 = YMM_INT.zero();
    var acc4 = YMM_INT.zero();
    for (int i = 0; i < data.length; i += YMM_BYTE.length() * 4) {
      acc1 = acc1.add(coefficients1.mul(YMM_BYTE.fromArray(data, i).rebracket(YMM_INT)));
      acc2 = acc2.add(coefficients2.mul(YMM_BYTE.fromArray(data, i + YMM_BYTE.length()).rebracket(YMM_INT)));
      acc3 = acc3.add(coefficients3.mul(YMM_BYTE.fromArray(data, i + 2 * YMM_BYTE.length()).rebracket(YMM_INT)));
      acc4 = acc4.add(coefficients4.mul(YMM_BYTE.fromArray(data, i + 3 * YMM_BYTE.length()).rebracket(YMM_INT)));
      coefficients1 = coefficients1.mul(next);
      coefficients2 = coefficients2.mul(next);
      coefficients3 = coefficients3.mul(next);
      coefficients4 = coefficients4.mul(next);
    }
    return acc1.add(acc2).add(acc3).add(acc4).addAll();
  }

Owing to its use of better tools yet to be released, this version is many times faster than either the JDK implementation or Peter’s.

Benchmark (size) Mode Cnt Score Error Units
arraysHashCode 128 thrpt 20 8897.392 220.582 ops/ms
arraysHashCode 256 thrpt 20 4286.621 156.794 ops/ms
arraysHashCode 512 thrpt 20 2024.858 72.030 ops/ms
arraysHashCode 1024 thrpt 20 1002.173 39.917 ops/ms
hashCodeVectorAPINoDependencies 128 thrpt 20 88395.374 3369.397 ops/ms
hashCodeVectorAPINoDependencies 256 thrpt 20 64799.697 1035.175 ops/ms
hashCodeVectorAPINoDependencies 512 thrpt 20 48248.967 864.728 ops/ms
hashCodeVectorAPINoDependencies 1024 thrpt 20 27429.025 916.850 ops/ms
hashCodeVectorAPIDependencies 128 thrpt 20 96679.811 316.470 ops/ms
hashCodeVectorAPIDependencies 256 thrpt 20 52140.582 1825.173 ops/ms
hashCodeVectorAPIDependencies 512 thrpt 20 26327.195 492.730 ops/ms
hashCodeVectorAPIDependencies 1024 thrpt 20 10929.500 351.732 ops/ms
nativeHashCode 128 thrpt 20 38799.185 188.636 ops/ms
nativeHashCode 256 thrpt 20 17438.536 257.168 ops/ms
nativeHashCode 512 thrpt 20 7041.815 209.817 ops/ms
nativeHashCode 1024 thrpt 20 3217.187 96.379 ops/ms

Benchmark source code

Leave a Reply

Your email address will not be published. Required fields are marked *