Collecting Rocks and Benchmarks

As long as I can remember, I have been interested in rocks, I have hundreds of them in storage. Rocks are interesting because they hold little clues about processes nobody has ever seen happen. For instance, one of the first rocks I ever took an interest in was a smooth granite pebble, which I collected on a walk when I was six. Let’s be honest, as a six year old, it was the pattern on the surface of the rock that caught my eye, but this truly was a fascinating rock because it shouldn’t have been smooth and it shouldn’t have been in England. It probably came from Norway, and while it’s possible that a Norwegian brought the rock to England, it’s highly unlikely.

The best possible explanation of the rock’s existence in that point in space and time was that there was once a glacier covering Norway, the North Sea, and Northern England, moving in that direction. The glacier carried the pebble, and many others like it, and dumped it as glacial outwash in the English Midlands. Nobody alive saw this happen, but if this happened, Scotland (relieved of the weight of an ice sheet) should still be rising in altitude, and there should be clay in the Midlands but not in the South. It’s likely that there was an ice age, not just because I found a granite pebble, but because Scotland is rising, and the English Midlands are covered in clay. You may lack the tools, funds, or time to do so, but you can apply this process to virtually anything to figure out how something happened or works.

If you’re an application developer, as I am, it’s highly unlikely that you wrote the platform you use, so you probably don’t really understand it. The vast majority of the platform you use was written by other people over what might as well be geological timescales. Benchmarks are a lot like rocks in that they can reveal implementation details you aren’t otherwise party to, and, with the right trade offs, you may be able to harness these in your applications. Collecting rocks doesn’t make a geologist, and I think you need to be quite inquisitive for benchmarking to work out, because you need to seek corroborating evidence: just as the presence of a pebble is scant support for an ice age, a single benchmark score doesn’t say much about anything.

If you’re interested in the JVM, I think instruction profiling is essential because it gives so much away. For instance, you may not appreciate the significance of choice of garbage collector, but you’ll see lots of strange instructions in some benchmarks if you profile them, and you may have the curiosity to ask what they do and what put them there. If you don’t do it, you won’t really know the boundaries of validity of whatever your observation is. You could find that changing your garbage collector invalidates your findings.

Partly because I want the information on this site to be basically correct, but also to illustrate how conclusions can be jumped to because a benchmark score confirms a personal bias, I’ll look again at a couple of very superficial benchmarks I wrote about last year, to do with polynomial hash codes. The measurements were probably fine, but I didn’t really get to the bottom of the issue and I could easily have found a faster implementation if I had looked a bit harder.

Polynomial Hash Codes

The Java String.hashCode and Arrays.hashCode methods are implemented as the dot product of the data and a vector of powers of 31.

    // Arrays.java
    public static int hashCode(int a[]) {
        if (a == null)
            return 0;

        int result = 1;
        for (int element : a)
            result = 31 * result + element;

        return result;
    }

Since String caches its hash code, you’d be hard pressed to make use of the following optimisations, but I found I learned something about what C2 does and doesn’t do by attempting to optimise it. First of all, what does the built in hash code actually do? You can only really find out with instruction profiling, which you can enable in JMH with -prof perfasm on Linux or -prof xperfasm on Windows.

The bulk of the sampled instructions are in this block below:


           0x000002720f204af2: add     edi,dword ptr [rax+rbx*4+10h]
  2.37%    0x000002720f204af6: mov     r10d,edi
  1.72%    0x000002720f204af9: shl     r10d,5h
  2.06%    0x000002720f204afd: sub     r10d,edi
  3.77%    0x000002720f204b00: add     r10d,dword ptr [rax+rbx*4+14h]
  3.93%    0x000002720f204b05: mov     r8d,r10d
  0.01%    0x000002720f204b08: shl     r8d,5h
  3.80%    0x000002720f204b0c: sub     r8d,r10d
  3.98%    0x000002720f204b0f: add     r8d,dword ptr [rax+rbx*4+18h]
  4.12%    0x000002720f204b14: mov     r10d,r8d
           0x000002720f204b17: shl     r10d,5h
  3.78%    0x000002720f204b1b: sub     r10d,r8d
  3.75%    0x000002720f204b1e: add     r10d,dword ptr [rax+rbx*4+1ch]
  3.81%    0x000002720f204b23: mov     r8d,r10d
           0x000002720f204b26: shl     r8d,5h
  4.04%    0x000002720f204b2a: sub     r8d,r10d
  4.15%    0x000002720f204b2d: add     r8d,dword ptr [rax+rbx*4+20h]
  3.98%    0x000002720f204b32: mov     r10d,r8d
           0x000002720f204b35: shl     r10d,5h
  4.27%    0x000002720f204b39: sub     r10d,r8d
  3.95%    0x000002720f204b3c: add     r10d,dword ptr [rax+rbx*4+24h]
  3.77%    0x000002720f204b41: mov     r8d,r10d
           0x000002720f204b44: shl     r8d,5h
  4.01%    0x000002720f204b48: sub     r8d,r10d
  4.02%    0x000002720f204b4b: add     r8d,dword ptr [rax+rbx*4+28h]
  4.11%    0x000002720f204b50: mov     ecx,r8d
           0x000002720f204b53: shl     ecx,5h
  4.08%    0x000002720f204b56: sub     ecx,r8d
  4.31%    0x000002720f204b59: add     ecx,dword ptr [rax+rbx*4+2ch]

The first thing to ask is where is the multiplication? There is no multiplication, it’s been replaced by a left shift and a subtraction.


    public int StrengthReduction() {
        int result = 1;
        for (int i = 0; i < data.length; ++i) {
            result = (result << 5) - result + data[i];
        }
        return result;
    }

This is the compiler trying to be helpful because shifts and subtractions are cheaper than multiplications, and for 31, this transformation is possible. The snippet is one long chain of instructions: notice the register dependencies in the assembly snippet:


  4.15%    0x000002720f204b2d: add     r8d,dword ptr [rax+rbx*4+20h]
  3.98%    0x000002720f204b32: mov     r10d,r8d
           0x000002720f204b35: shl     r10d,5h
  4.27%    0x000002720f204b39: sub     r10d,r8d

The addition must complete before the contents of r8d are available for the move, the left shift waits for the move, and the subtraction waits for the shift. No two elements of the array are ever processed simultaneously. First suggested by Peter Levart, I came across it on Daniel Lemire’s blog, the dependency can be broken by manually unrolling the loop:


    @Benchmark
    public int Unrolled() {
        if (data == null)
            return 0;

        int result = 1;
        int i = 0;
        for (; i + 7 < data.length; i += 8) {
            result = 31 * 31 * 31 * 31 * 31 * 31 * 31 * 31 * result
                   + 31 * 31 * 31 * 31 * 31 * 31 * 31 * data[i]
                   + 31 * 31 * 31 * 31 * 31 * 31 * data[i + 1]
                   + 31 * 31 * 31 * 31 * 31 * data[i + 2]
                   + 31 * 31 * 31 * 31 * data[i + 3]
                   + 31 * 31 * 31 * data[i + 4]
                   + 31 * 31 * data[i + 5]
                   + 31 * data[i + 6]
                   + data[i + 7]
                    ;
        }
        for (; i < data.length; i++) {
            result = 31 * result + data[i];
        }
        return result;
    }

Weirdly, this implementation does very well (this isn’t new: there has been a ticket for it for several years). Without even bothering with a throughput score (the money shot comes at the end), the profiling output shows that this must be much better. The assembly is quite hard to read because it’s full of tricks I didn’t know existed, but look out for the hexadecimal constants and convince yourself that several are simply powers of 31. The multiplication by 31 is strength reduced to a left shift and a subtraction again.


  0.26%    0x000001d67bdd3c8e: mov     r8d,94446f01h
  0.01%    0x000001d67bdd3c94: jmp     1d67bdd3cb1h
           0x000001d67bdd3c96: nop     word ptr [rax+rax+0h]
           0x000001d67bdd3ca0: mov     edi,r11d
  0.03%    0x000001d67bdd3ca3: vmovq   r14,xmm0
  0.42%    0x000001d67bdd3ca8: mov     ebp,dword ptr [rsp+70h]
  7.14%    0x000001d67bdd3cac: vmovq   rbx,xmm1          
  0.01%    0x000001d67bdd3cb1: cmp     edi,r9d
           0x000001d67bdd3cb4: jnb     1d67bdd3d6ch
  0.04%    0x000001d67bdd3cba: imul    r10d,dword ptr [rcx+rdi*4+10h],67e12cdfh ; Another strength reduction trick
  7.74%    0x000001d67bdd3cc3: add     r10d,r8d                                 ; 1742810335 * x + 2487512833
  2.69%    0x000001d67bdd3cc6: mov     r11d,edi                                 
  0.09%    0x000001d67bdd3cc9: add     r11d,7h                                  
  0.46%    0x000001d67bdd3ccd: cmp     r11d,r9d                             
           0x000001d67bdd3cd0: jnb     1d67bdd3dbah
  6.82%    0x000001d67bdd3cd6: vmovq   xmm1,rbx
  0.61%    0x000001d67bdd3cdb: mov     dword ptr [rsp+70h],ebp
  0.06%    0x000001d67bdd3cdf: vmovq   xmm0,r14          
  0.46%    0x000001d67bdd3ce4: mov     r14,qword ptr [r15+70h]
  6.60%    0x000001d67bdd3ce8: mov     r11d,edi
  0.67%    0x000001d67bdd3ceb: add     r11d,8h 
  0.04%    0x000001d67bdd3cef: movsxd  rax,edi
  0.41%    0x000001d67bdd3cf2: add     edi,0fh
  6.87%    0x000001d67bdd3cf5: mov     edx,dword ptr [rcx+rax*4+28h]
  0.68%    0x000001d67bdd3cf9: imul    r8d,dword ptr [rcx+rax*4+14h],34e63b41h ; multiply by 887503681
  0.67%    0x000001d67bdd3d02: add     r8d,r10d   ; --------------------------
  7.30%    0x000001d67bdd3d05: mov     r10d,edx   ; Multiply by 31
  0.63%    0x000001d67bdd3d08: shl     r10d,5h    ;
  0.08%    0x000001d67bdd3d0c: sub     r10d,edx   ; --------------------------
  0.73%    0x000001d67bdd3d0f: imul    edx,dword ptr [rcx+rax*4+24h],3c1h     ; multiply by 961
  7.47%    0x000001d67bdd3d17: imul    ebp,dword ptr [rcx+rax*4+20h],745fh    ; multiply by 29791 
  0.56%    0x000001d67bdd3d1f: imul    esi,dword ptr [rcx+rax*4+1ch],0e1781h  ; multiply by 923521 
  7.02%    0x000001d67bdd3d27: imul    ebx,dword ptr [rcx+rax*4+18h],1b4d89fh ; multiply by 28629151 
  0.57%    0x000001d67bdd3d2f: add     r8d,ebx
  6.99%    0x000001d67bdd3d32: add     r8d,esi
  0.66%    0x000001d67bdd3d35: add     r8d,ebp
  0.14%    0x000001d67bdd3d38: add     r8d,edx
  0.91%    0x000001d67bdd3d3b: add     r8d,r10d
  7.04%    0x000001d67bdd3d3e: add     r8d,dword ptr [rcx+rax*4+2ch] ; add the data value at offset 7
  1.73%    0x000001d67bdd3d43: test    dword ptr [r14],eax  
  0.06%    0x000001d67bdd3d46: cmp     edi,r9d
           0x000001d67bdd3d49: jnl     1d67bdd3c15h      
  0.45%    0x000001d67bdd3d4f: imul    r8d,r8d,94446f01h ; multiply by 2487512833 (coprime to 31, follow r8d backwards)
 11.90%    0x000001d67bdd3d56: cmp     r11d,r9d

It’s probably not worth deciphering all the tricks in the code above, but notice that there is a lot of parallelism: the chain of signed multiplications target different registers and are independent. This code is much faster.

I wrote the code below in July last year to try to parallelise this calculation. At the expense of precomputing the powers of 31 up to some fixed length, such as the maximum length of strings in your database, it’s quite fast.


    private final int[] coefficients;

    public FixedLengthHashCode(int maxLength) {
        this.coefficients = new int[maxLength + 1];
        coefficients[0] = 1;
        for (int i = 1; i <= maxLength; ++i) {
            coefficients[i] = 31 * coefficients[i - 1];
        }
    }

    public int hashCode(int[] value) {
        final int max = value.length;
        int result = coefficients[max];
        for (int i = 0; i < value.length && i < coefficients.length - 1; ++i) {
            result += coefficients[max - i - 1] * value[i];
        }
        return result;
    }

I was non-commital in the original post but I sort-of claimed this code was vectorised without even bothering to look at the disassembly. It’s scalar, but it’s much more parallel than the unrolled version, and all the clever strength reductions and dependencies are gone.


  0.15%    0x0000022d8e6825e0: movsxd  rbx,ecx
  0.07%    0x0000022d8e6825e3: mov     rdx,rsi
  3.57%    0x0000022d8e6825e6: sub     rdx,rbx
  0.08%    0x0000022d8e6825e9: mov     r10d,dword ptr [r9+rbx*4+10h]
  0.18%    0x0000022d8e6825ee: imul    r10d,dword ptr [rdi+rdx*4+0ch]
  0.15%    0x0000022d8e6825f4: add     r10d,r8d
  4.25%    0x0000022d8e6825f7: mov     r11d,dword ptr [r9+rbx*4+14h]
  0.14%    0x0000022d8e6825fc: imul    r11d,dword ptr [rdi+rdx*4+8h]
  0.19%    0x0000022d8e682602: add     r11d,r10d
  1.31%    0x0000022d8e682605: mov     r10d,dword ptr [r9+rbx*4+18h]
  3.93%    0x0000022d8e68260a: imul    r10d,dword ptr [rdi+rdx*4+4h]
  0.22%    0x0000022d8e682610: add     r10d,r11d
  0.94%    0x0000022d8e682613: mov     r8d,dword ptr [r9+rbx*4+1ch]
  0.05%    0x0000022d8e682618: imul    r8d,dword ptr [rdi+rdx*4]
  3.68%    0x0000022d8e68261d: add     r8d,r10d
  1.02%    0x0000022d8e682620: mov     r10d,dword ptr [r9+rbx*4+20h]
  0.19%    0x0000022d8e682625: imul    r10d,dword ptr [rdi+rdx*4+0fffffffffffffffch]
  0.61%    0x0000022d8e68262b: add     r10d,r8d
  4.71%    0x0000022d8e68262e: mov     r11d,dword ptr [r9+rbx*4+24h]
  0.08%    0x0000022d8e682633: imul    r11d,dword ptr [rdi+rdx*4+0fffffffffffffff8h]
  0.82%    0x0000022d8e682639: add     r11d,r10d
  5.57%    0x0000022d8e68263c: mov     r10d,dword ptr [r9+rbx*4+28h]
  0.04%    0x0000022d8e682641: imul    r10d,dword ptr [rdi+rdx*4+0fffffffffffffff4h]
  0.68%    0x0000022d8e682647: add     r10d,r11d
  4.67%    0x0000022d8e68264a: mov     r11d,dword ptr [r9+rbx*4+2ch]
  0.08%    0x0000022d8e68264f: imul    r11d,dword ptr [rdi+rdx*4+0fffffffffffffff0h]
  0.45%    0x0000022d8e682655: add     r11d,r10d
  4.50%    0x0000022d8e682658: mov     r10d,dword ptr [r9+rbx*4+30h]
  0.20%    0x0000022d8e68265d: imul    r10d,dword ptr [rdi+rdx*4+0ffffffffffffffech]
  0.37%    0x0000022d8e682663: add     r10d,r11d
  3.82%    0x0000022d8e682666: mov     r8d,dword ptr [r9+rbx*4+34h]
  0.05%    0x0000022d8e68266b: imul    r8d,dword ptr [rdi+rdx*4+0ffffffffffffffe8h]
  0.42%    0x0000022d8e682671: add     r8d,r10d
  4.18%    0x0000022d8e682674: mov     r10d,dword ptr [r9+rbx*4+38h]
  0.02%    0x0000022d8e682679: imul    r10d,dword ptr [rdi+rdx*4+0ffffffffffffffe4h]
  0.25%    0x0000022d8e68267f: add     r10d,r8d
  5.11%    0x0000022d8e682682: mov     r11d,dword ptr [r9+rbx*4+3ch]
  0.03%    0x0000022d8e682687: imul    r11d,dword ptr [rdi+rdx*4+0ffffffffffffffe0h]
  0.28%    0x0000022d8e68268d: add     r11d,r10d
  4.88%    0x0000022d8e682690: mov     r10d,dword ptr [r9+rbx*4+40h]
  0.09%    0x0000022d8e682695: imul    r10d,dword ptr [rdi+rdx*4+0ffffffffffffffdch]
  0.21%    0x0000022d8e68269b: add     r10d,r11d
  4.56%    0x0000022d8e68269e: mov     r8d,dword ptr [r9+rbx*4+44h]
  0.02%    0x0000022d8e6826a3: imul    r8d,dword ptr [rdi+rdx*4+0ffffffffffffffd8h]
  0.18%    0x0000022d8e6826a9: add     r8d,r10d
  4.73%    0x0000022d8e6826ac: mov     r10d,dword ptr [r9+rbx*4+48h]
  0.06%    0x0000022d8e6826b1: imul    r10d,dword ptr [rdi+rdx*4+0ffffffffffffffd4h]
  0.10%    0x0000022d8e6826b7: add     r10d,r8d
  4.12%    0x0000022d8e6826ba: mov     r8d,dword ptr [r9+rbx*4+4ch]

That blog post really was lazy. There’s a bit of a problem with the access pattern because the coefficients are accessed in reverse order, and at an offset: it’s too complicated for the optimiser. The code below is just a dot product and it should come as no surprise that it’s faster.


    private int[] coefficients;
    private int seed;

    void init(int size) {
        coefficients = new int[size]; 
        coefficients[size - 1] = 1;
        for (int i = size - 2; i >= 0; --i) {
            coefficients[i] = 31 * coefficients[i + 1];
        }
        seed = 31 * coefficients[0];
    }

    @Benchmark
    public int Vectorised() {
        int result = seed;
        for (int i = 0; i < data.length && i < coefficients.length; ++i) {
            result += coefficients[i] * data[i];
        }
        return result;
    }

The code vectorises, but the reduction isn’t as good as it could be with handwritten code.


  0.22%    0x000001d9c2e0f320: vmovdqu ymm0,ymmword ptr [rdi+rsi*4+70h]
  2.31%    0x000001d9c2e0f326: vpmulld ymm0,ymm0,ymmword ptr [r11+rsi*4+70h]
  0.61%    0x000001d9c2e0f32d: vmovdqu ymm1,ymmword ptr [rdi+rsi*4+50h]
  2.61%    0x000001d9c2e0f333: vpmulld ymm9,ymm1,ymmword ptr [r11+rsi*4+50h]
  0.53%    0x000001d9c2e0f33a: vmovdqu ymm1,ymmword ptr [rdi+rsi*4+30h]
  2.07%    0x000001d9c2e0f340: vpmulld ymm10,ymm1,ymmword ptr [r11+rsi*4+30h]
  0.60%    0x000001d9c2e0f347: vmovdqu ymm1,ymmword ptr [rdi+rsi*4+10h]
  2.33%    0x000001d9c2e0f34d: vpmulld ymm11,ymm1,ymmword ptr [r11+rsi*4+10h]
  0.61%    0x000001d9c2e0f354: vphaddd ymm7,ymm11,ymm11
  3.04%    0x000001d9c2e0f359: vphaddd ymm7,ymm7,ymm8
  3.56%    0x000001d9c2e0f35e: vextracti128 xmm8,ymm7,1h
  0.53%    0x000001d9c2e0f364: vpaddd  xmm7,xmm7,xmm8
  1.56%    0x000001d9c2e0f369: vmovd   xmm8,r8d
  1.77%    0x000001d9c2e0f36e: vpaddd  xmm8,xmm8,xmm7
  0.93%    0x000001d9c2e0f372: vmovd   edx,xmm8
  0.27%    0x000001d9c2e0f376: vphaddd ymm2,ymm10,ymm10
  2.75%    0x000001d9c2e0f37b: vphaddd ymm2,ymm2,ymm6
  2.32%    0x000001d9c2e0f380: vextracti128 xmm6,ymm2,1h
  1.95%    0x000001d9c2e0f386: vpaddd  xmm2,xmm2,xmm6
  0.63%    0x000001d9c2e0f38a: vmovd   xmm6,edx
  0.50%    0x000001d9c2e0f38e: vpaddd  xmm6,xmm6,xmm2
  7.76%    0x000001d9c2e0f392: vmovd   edx,xmm6
  0.22%    0x000001d9c2e0f396: vphaddd ymm5,ymm9,ymm9
  2.68%    0x000001d9c2e0f39b: vphaddd ymm5,ymm5,ymm1
  0.34%    0x000001d9c2e0f3a0: vextracti128 xmm1,ymm5,1h
  6.27%    0x000001d9c2e0f3a6: vpaddd  xmm5,xmm5,xmm1
  0.88%    0x000001d9c2e0f3aa: vmovd   xmm1,edx
  0.92%    0x000001d9c2e0f3ae: vpaddd  xmm1,xmm1,xmm5
  7.85%    0x000001d9c2e0f3b2: vmovd   edx,xmm1
  0.43%    0x000001d9c2e0f3b6: vphaddd ymm4,ymm0,ymm0
  2.59%    0x000001d9c2e0f3bb: vphaddd ymm4,ymm4,ymm3
  0.34%    0x000001d9c2e0f3c0: vextracti128 xmm3,ymm4,1h
  5.72%    0x000001d9c2e0f3c6: vpaddd  xmm4,xmm4,xmm3
  0.80%    0x000001d9c2e0f3ca: vmovd   xmm3,edx
  0.58%    0x000001d9c2e0f3ce: vpaddd  xmm3,xmm3,xmm4
  8.09%    0x000001d9c2e0f3d2: vmovd   r8d,xmm3

Using JDK9, this results in a 3x throughput gain over the built in Arrays.hashCode, and that includes the cost of doubling the number of bytes to process and a suboptimal reduction phase. This is going to be a prime candidate for the Vector API, where a vector of powers of 31 could be multiplied by 31^8 on each iteration, before multiplying by a vector of the next 8 data elements.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
BuiltIn thrpt 1 10 3.439980 0.024231 ops/us 256
BuiltIn thrpt 1 10 0.842664 0.009019 ops/us 1024
BuiltIn thrpt 1 10 0.103638 0.003070 ops/us 8192
FixedLength thrpt 1 10 7.421648 0.169558 ops/us 256
FixedLength thrpt 1 10 1.966116 0.044398 ops/us 1024
FixedLength thrpt 1 10 0.249666 0.009994 ops/us 8192
StrengthReduction thrpt 1 10 3.417248 0.045733 ops/us 256
StrengthReduction thrpt 1 10 0.840830 0.011091 ops/us 1024
StrengthReduction thrpt 1 10 0.104265 0.001537 ops/us 8192
Unrolled thrpt 1 10 6.353271 0.042330 ops/us 256
Unrolled thrpt 1 10 1.672845 0.035389 ops/us 1024
Unrolled thrpt 1 10 0.205575 0.009375 ops/us 8192
Vectorised thrpt 1 10 10.958270 0.109993 ops/us 256
Vectorised thrpt 1 10 2.948918 0.081830 ops/us 1024
Vectorised thrpt 1 10 0.358819 0.027316 ops/us 8192

12 Comments

  • Maaartinus says:

    Actually, I’m not sure, if hashCode caching in String is that useful. Quite often, old strings sit in a HashMap, which caches the hashCode as well, while newly created strings get used as lookup keys.

    Getting a factor of three can be achieved by simply unrolling the loop four times using precomputed powers up to 31**3, see https://stackoverflow.com/q/21745619/581205.

    I guess, vectorization doesn’t happen in my old code, but the latency of the multiplication (or of the strength reduced multiplication, which isn’t much better) gets out of the critical path.

    • Yes, I don’t think the precise ratios are too important: the unroll is profitable. The thing is, the vectorised version could be twice as fast because it shouldn’t have to drag all the coefficients in from memory. At some point in the near future, it will be possible to load [1, 31, …, 31^7] into a single register, broadcast 31^8 into another register, and multiply these two registers once per iteration. I expect that code would be twice as fast, so 6x the status quo.

      You might be right about the futility of precomputing hash codes in this particular case.

      • Maaartinus says:

        Sure, the unroll is profitable; even the most trivial one gains a factor of two.

        The main loop body could be something like

        a0 = 31**8 * a0 + s.charAt(i+0);
        a1 = 31**8 * a1 + s.charAt(i+1);
        a2 = 31**8 * a2 + s.charAt(i+2);

        a7 = 31**8 * a7 + s.charAt(i+7);
        i += 8;

        using a single multiplier. I guess, that’s what you mean. I’m rather unfamiliar with the vector instruction set, especially with all its irregularities.

  • Maaartinus says:

    No, the Unrolled implementation is what I posted about back in 2014. My current idea is using eight accumulators

    
    static int shinyHashCode(int[] data) {
        checkArgument((data.length & 7) == 0, "too lazy");
        int a0 = 0;
        int a1 = 0;
        int a2 = 0;
        int a3 = 0;
        int a4 = 0;
        int a5 = 0;
        int a6 = 0;
        int a7 = 0;
        for (int i=0; i+7 < data.length; i+=8) {
            a0 = P8 * a0 + data[i+0];
            a1 = P8 * a1 + data[i+1];
            a2 = P8 * a2 + data[i+2];
            a3 = P8 * a3 + data[i+3];
            a4 = P8 * a4 + data[i+4];
            a5 = P8 * a5 + data[i+5];
            a6 = P8 * a6 + data[i+6];
            a7 = P8 * a7 + data[i+7];
        }
        return P7 * a0
                + P6 * a1
                + P5 * a2
                + P4 * a3
                + P3 * a4
                + P2 * a5
                + P1 * a6
                + P0 * a7;
    }
    

    and should be nicely vectorizable.

    • If you initialise a0...a7 to powers of 31, and P8 is 31^8 that’s what I’m talking about when I refer to using the upcoming vector API. As it’s written, it’s possible a C compiler would optimise it fully, but I don’t think C2 is smart enough to do it.

  • Maaartinus says:

    Anyway, it works pretty well on my ancient i5-2400 with Java 8:

    HashCode.BuiltIn          10000  thrpt   10  0.083 ±  0.001  ops/us
    HashCode.FixedLength      10000  thrpt   10  0.140 ±  0.001  ops/us
    HashCode.Shiny4HashCode   10000  thrpt   10  0.142 ±  0.001  ops/us
    HashCode.Shiny8HashCode   10000  thrpt   10  0.184 ±  0.001  ops/us
    HashCode.Unrolled         10000  thrpt   10  0.130 ±  0.001  ops/us
    

    This is the full implementation:
    https://github.com/Maaartinus/simdbenchmarks/commit/c4713407b8b74aed7b64c23ed1c380b79dd7b898

    I couldn’t find “Vectorized” there.

    I haven’t tested it for correctness, so it may be wrong (but it’s surely fixable as I’ve tested a similar code for String#hashCode).

    • Sorry that was quite sloppy, I just pushed the code used in this post to github.

      Also, I really think testing for correctness is incredibly important. I was looking at an implementation in C of a data structure from an academic paper with amazing benchmark results, I found at least 5 ways to make the implementation segfault really quickly so I’ve lost interest in the author’s work.

  • Maaartinus says:

    I’ve rebased my repo on top of yours, added a test and fixed an error in Shiny4HashCode. The results are not really good. It’s still faster than Unrolled (which you improved in your last commit), but slower than Vectorised.

    Benchmark                   (size)   Mode  Cnt  Score   Error   Units
    HashCode.BuiltIn              8192  thrpt    5  0.101 ± 0.001  ops/us
    HashCode.FixedLength          8192  thrpt    5  0.174 ± 0.001  ops/us
    HashCode.Shiny4HashCode       8192  thrpt    5  0.224 ± 0.005  ops/us
    HashCode.Shiny8HashCode       8192  thrpt    5  0.225 ± 0.002  ops/us
    HashCode.StrengthReduction    8192  thrpt    5  0.101 ± 0.001  ops/us
    HashCode.Unrolled             8192  thrpt    5  0.218 ± 0.001  ops/us
    HashCode.Vectorised           8192  thrpt    5  0.248 ± 0.001  ops/us
    

    Could you run the benchmark on your system?

    I’ve also tried to make Vectorised work with fixed length coefficients (not pushed), but every trivial change to indexing seems to throw off the optimization.

    • The code in the last commit was the code used in this post. The benchmark was already there from a post last year, the point of this post was that benchmark was too superficial. The Vectorised implementation with, say, 8192 coefficients should work with 1024 elements in the array. You need to avoid array offsets, otherwise the optimiser gets upset because it won’t know data and coefficients don’t alias. I’ll try your benchmark soon.

      • Here’s a run with JDK10.

        Benchmark                   (size)   Mode  Cnt  Score    Error   Units
        HashCode.BuiltIn               256  thrpt   10  3.439 ±  0.004  ops/us
        HashCode.BuiltIn              1024  thrpt   10  0.846 ±  0.001  ops/us
        HashCode.BuiltIn              8192  thrpt   10  0.105 ±  0.001  ops/us
        HashCode.FixedLength           256  thrpt   10  7.314 ±  0.017  ops/us
        HashCode.FixedLength          1024  thrpt   10  2.089 ±  0.041  ops/us
        HashCode.FixedLength          8192  thrpt   10  0.271 ±  0.001  ops/us
        HashCode.Shiny4HashCode        256  thrpt   10  6.655 ±  0.111  ops/us
        HashCode.Shiny4HashCode       1024  thrpt   10  1.717 ±  0.002  ops/us
        HashCode.Shiny4HashCode       8192  thrpt   10  0.257 ±  0.001  ops/us
        HashCode.Shiny8HashCode        256  thrpt   10  6.658 ±  0.126  ops/us
        HashCode.Shiny8HashCode       1024  thrpt   10  1.711 ±  0.001  ops/us
        HashCode.Shiny8HashCode       8192  thrpt   10  0.257 ±  0.001  ops/us
        HashCode.StrengthReduction     256  thrpt   10  3.444 ±  0.001  ops/us
        HashCode.StrengthReduction    1024  thrpt   10  0.844 ±  0.001  ops/us
        HashCode.StrengthReduction    8192  thrpt   10  0.105 ±  0.001  ops/us
        HashCode.Unrolled              256  thrpt   10  6.397 ±  0.009  ops/us
        HashCode.Unrolled             1024  thrpt   10  1.629 ±  0.013  ops/us
        HashCode.Unrolled             8192  thrpt   10  0.214 ±  0.001  ops/us
        HashCode.Vectorised            256  thrpt   10  9.835 ±  0.010  ops/us
        HashCode.Vectorised           1024  thrpt   10  2.502 ±  0.003  ops/us
        HashCode.Vectorised           8192  thrpt   10  0.310 ±  0.001  ops/us
        

        Note that JDK10 is probably slightly slower because of loop strip mining.

        • Maaartinus says:

          That’s about the same as on my system: Only slightly better than Unrolled. Sad.

          I’m confused by you writing about aliasing. How can it matter when both arrays get only read?

Leave a Reply

Your email address will not be published. Required fields are marked *