Mixing Vector and Scalar Instructions

I saw an interesting tweet from one of the developers of Pilosa this week, reporting performance improvements from unrolling a bitwise reduction in Go. This surprised me because Go seems to enjoy a reputation for being a high performance language, and it certainly has great support for concurrency, but compilers should unroll loops as standard so you don’t have to. Having been written in Go doesn’t seem to have hampered Pilosa, because they have some great benchmark numbers, and it certainly helps that they built their technology on top of a smart data structure: the roaring bitmap. You can read about their data model for yourself, but Pilosa is basically a large bit matrix which materialises relations between rows and columns by setting the bit at their intersection, for instance, genomes on rows to k-mers (sequences of bases like “GATTACA”) on columns. To compute the Hamming similarity between the genomes of two people (i.e. how many k-mers they have in common), Pilosa just needs to intersect the bitmaps of rows representing each genome and count the number of bits in the result. The intersection doesn’t even need to be materialised, and can be calculated on the fly as a dot product. What piqued my interest though was that the Pilosa developers had experimented with combining vector and scalar instructions and had found it unprofitable. Once there is a Vector API in Java, what will happen when there’s a gap that can only be plugged with a scalar implementation?

I don’t know much about Go but I get the impression its compiler is a long way behind C2. The instruction POPCNTQ, the bottleneck in this tale, has only recently been available to Go programmers in math/bits/bits.go, with demonstrable apathy for its inclusion in the standard library. As a point of comparison, Long.bitCount has been translated to POPCNTQ by C2 for longer than I have been using Java. If you want to do bioinformatics in Java, whatever you do, don’t unroll your loops! The unrolled version below will be slightly slower than the simple loop.

  public int popcnt() {
    int cardinality = 0;
    for (int i = 0; i < size && i < left.length && i < right.length; ++i) {
      cardinality += Long.bitCount(left[i] & right[i]);
    return cardinality;

  public int unrolledPopcnt() {
    int cardinality1 = 0;
    int cardinality2 = 0;
    int cardinality3 = 0;
    int cardinality4 = 0;
    for (int i = 0; i < size && i < left.length && i < right.length; i += 4) {
      cardinality1 += Long.bitCount(left[i+0] & right[i+0]);
      cardinality2 += Long.bitCount(left[i+1] & right[i+1]);
      cardinality3 += Long.bitCount(left[i+2] & right[i+2]);
      cardinality4 += Long.bitCount(left[i+3] & right[i+3]);
    return cardinality1 + cardinality2 + cardinality3 + cardinality4;

Ignoring the unrolled version because it’s a dead end, does C2 vectorise this reduction? No, because it can’t vectorise the bit count, but notice the floating point spills at the start for better register placement.

         ││ ↗          0x00007fe418240c4c: vmovq  %xmm0,%r9
         ││ │          0x00007fe418240c51: vmovq  %xmm1,%r8
  0.00%  ││ │      ↗   0x00007fe418240c56: vmovq  %r9,%xmm0
  0.04%  ││ │      │   0x00007fe418240c5b: vmovq  %r8,%xmm1                      
  1.71%  ││↗│      │   0x00007fe418240c60: movslq %ecx,%r9
  1.42%  ││││      │   0x00007fe418240c63: mov    0x10(%rbx,%rcx,8),%r10
  8.98%  ││││      │   0x00007fe418240c68: and    0x10(%rdi,%rcx,8),%r10
  3.51%  ││││      │   0x00007fe418240c6d: popcnt %r10,%r8
  3.21%  ││││      │   0x00007fe418240c72: add    %r8d,%edx
  2.48%  ││││      │   0x00007fe418240c75: mov    0x28(%rbx,%r9,8),%r10
  8.19%  ││││      │   0x00007fe418240c7a: and    0x28(%rdi,%r9,8),%r10
  3.59%  ││││      │   0x00007fe418240c7f: popcnt %r10,%r10
  3.73%  ││││      │   0x00007fe418240c84: mov    0x20(%rbx,%r9,8),%r8
  2.16%  ││││      │   0x00007fe418240c89: and    0x20(%rdi,%r9,8),%r8
  7.53%  ││││      │   0x00007fe418240c8e: popcnt %r8,%rsi
  6.21%  ││││      │   0x00007fe418240c93: mov    0x18(%rbx,%r9,8),%r8           
  2.30%  ││││      │   0x00007fe418240c98: and    0x18(%rdi,%r9,8),%r8
  2.07%  ││││      │   0x00007fe418240c9d: popcnt %r8,%r9
 12.75%  ││││      │   0x00007fe418240ca2: add    %r9d,%edx
  6.01%  ││││      │   0x00007fe418240ca5: add    %esi,%edx
  5.70%  ││││      │   0x00007fe418240ca7: add    %r10d,%edx                     
  8.60%  ││││      │   0x00007fe418240caa: add    $0x4,%ecx                      
  3.58%  ││││      │   0x00007fe418240cad: cmp    %r11d,%ecx
         ││╰│      │   0x00007fe418240cb0: jl     0x00007fe418240c60             
  0.04%  ││ │      │   0x00007fe418240cb2: mov    0x108(%r15),%r10               
  0.05%  ││ │      │   0x00007fe418240cb9: test   %eax,(%r10)                    
  0.29%  ││ │      │   0x00007fe418240cbc: cmp    %r11d,%ecx
         ││ ╰      │   0x00007fe418240cbf: jl     0x00007fe418240c4c

It’s nice that very good scalar code gets generated for this loop from the simplest possible code, but what if you want to go faster with vectorisation? There is no vector bit count until the VPOPCNTD/VPOPCNTQ AVX512 extension, currently only available on the Knights Mill processor, which is tantamount to there being no vector bit count instruction. There is a vector bit count algorithm originally written by Wojciech Mula for SSE3, and updated for AVX2 by Wojciech Mula and Daniel Lemire, which is used in clang. I made an attempt at implementing this using the Vector API a few months ago and found what felt were a few gaps but may look at this again soon.

I looked at a few ways of writing mixed loops, using the Vector API and Long.bitCount and found that there wasn’t much to be gained from partial vectorisation. There is a method for extracting scalar values from vectors: LongVector::get, which is very interesting because it highlights the gaps the JIT compiler needs to fill in on the wrong hardware, and why you should read the assembly code emitted from a benchmark before jumping to conclusions. Here’s the code and below it the hot part of the loop.

  public int vpandExtractPopcnt() {
    int cardinality = 0;
    for (int i = 0; i < size && i < left.length && i < right.length; i += 4) {
      var intersection = YMM_LONG.fromArray(left, i).and(YMM_LONG.fromArray(right, i));
      cardinality += Long.bitCount(intersection.get(0));
      cardinality += Long.bitCount(intersection.get(1));
      cardinality += Long.bitCount(intersection.get(2));
      cardinality += Long.bitCount(intersection.get(3));
    return cardinality;

  0.43%  ││        ↗   0x00007fbfe024bb55: vmovdqu 0x10(%rax,%rcx,8),%ymm2
  2.58%  ││        │   0x00007fbfe024bb5b: vpand  0x10(%r13,%rcx,8),%ymm2,%ymm8 
  0.32%  ││        │   0x00007fbfe024bb62: movslq %ecx,%r10                     
  0.43%  ││        │   0x00007fbfe024bb65: vmovdqu 0x70(%rax,%r10,8),%ymm2       
  2.65%  ││        │   0x00007fbfe024bb6c: vpand  0x70(%r13,%r10,8),%ymm2,%ymm9  
  0.84%  ││        │   0x00007fbfe024bb73: vmovdqu 0x30(%rax,%r10,8),%ymm2
  0.46%  ││        │   0x00007fbfe024bb7a: vpand  0x30(%r13,%r10,8),%ymm2,%ymm10  
  3.06%  ││        │   0x00007fbfe024bb81: vmovdqu 0x50(%rax,%r10,8),%ymm6       
  0.03%  ││        │   0x00007fbfe024bb88: vmovq  %rax,%xmm2
  0.42%  ││        │   0x00007fbfe024bb8d: vpand  0x50(%r13,%r10,8),%ymm6,%ymm11
  2.60%  ││        │   0x00007fbfe024bb94: vmovq  %xmm8,%r10
  0.01%  ││        │   0x00007fbfe024bb99: popcnt %r10,%rbp
  0.51%  ││        │   0x00007fbfe024bb9e: add    %edx,%ebp
  3.86%  ││        │   0x00007fbfe024bba0: vmovq  %xmm10,%r10
  0.15%  ││        │   0x00007fbfe024bba5: popcnt %r10,%rax
  0.43%  ││        │   0x00007fbfe024bbaa: vmovq  %xmm11,%r10
  0.41%  ││        │   0x00007fbfe024bbaf: popcnt %r10,%r14
  2.84%  ││        │   0x00007fbfe024bbb4: vmovq  %xmm9,%r10
  0.18%  ││        │   0x00007fbfe024bbb9: popcnt %r10,%rdx
  0.35%  ││        │   0x00007fbfe024bbbe: vextracti128 $0x1,%ymm9,%xmm6
  0.41%  ││        │   0x00007fbfe024bbc4: vpextrq $0x0,%xmm6,%r10
  2.62%  ││        │   0x00007fbfe024bbca: popcnt %r10,%r10
  1.45%  ││        │   0x00007fbfe024bbcf: vextracti128 $0x1,%ymm9,%xmm6
  0.42%  ││        │   0x00007fbfe024bbd5: vpextrq $0x1,%xmm6,%r11
  2.20%  ││        │   0x00007fbfe024bbdb: popcnt %r11,%r8
  1.34%  ││        │   0x00007fbfe024bbe0: vpextrq $0x1,%xmm9,%r11
  2.44%  ││        │   0x00007fbfe024bbe6: popcnt %r11,%r11
  0.21%  ││        │   0x00007fbfe024bbeb: vpextrq $0x1,%xmm10,%r9
  0.97%  ││        │   0x00007fbfe024bbf1: popcnt %r9,%r9
  2.17%  ││        │   0x00007fbfe024bbf6: vextracti128 $0x1,%ymm8,%xmm6
  0.22%  ││        │   0x00007fbfe024bbfc: vpextrq $0x1,%xmm6,%rbx
  1.10%  ││        │   0x00007fbfe024bc02: popcnt %rbx,%rbx
  2.46%  ││        │   0x00007fbfe024bc07: vextracti128 $0x1,%ymm8,%xmm6
  0.22%  ││        │   0x00007fbfe024bc0d: vpextrq $0x0,%xmm6,%rdi
  1.00%  ││        │   0x00007fbfe024bc13: popcnt %rdi,%rsi
  2.64%  ││        │   0x00007fbfe024bc18: vpextrq $0x1,%xmm8,%rdi
  0.80%  ││        │   0x00007fbfe024bc1e: popcnt %rdi,%rdi
  0.35%  ││        │   0x00007fbfe024bc23: add    %edi,%ebp
  3.42%  ││        │   0x00007fbfe024bc25: add    %esi,%ebp
  0.38%  ││        │   0x00007fbfe024bc27: add    %ebx,%ebp
  0.70%  ││        │   0x00007fbfe024bc29: add    %ebp,%eax
  0.84%  ││        │   0x00007fbfe024bc2b: add    %r9d,%eax
  2.85%  ││        │   0x00007fbfe024bc2e: vpextrq $0x1,%xmm11,%r9
  0.35%  ││        │   0x00007fbfe024bc34: popcnt %r9,%rbx
  0.21%  ││        │   0x00007fbfe024bc39: vextracti128 $0x1,%ymm10,%xmm6
  2.82%  ││        │   0x00007fbfe024bc3f: vpextrq $0x1,%xmm6,%r9
  0.34%  ││        │   0x00007fbfe024bc45: popcnt %r9,%r9
  0.38%  ││        │   0x00007fbfe024bc4a: vextracti128 $0x1,%ymm10,%xmm6
  2.58%  ││        │   0x00007fbfe024bc50: vpextrq $0x0,%xmm6,%rdi
  0.42%  ││        │   0x00007fbfe024bc56: popcnt %rdi,%rsi
  0.56%  ││        │   0x00007fbfe024bc5b: add    %esi,%eax
  4.90%  ││        │   0x00007fbfe024bc5d: add    %r9d,%eax
  0.55%  ││        │   0x00007fbfe024bc60: add    %eax,%r14d
  0.87%  ││        │   0x00007fbfe024bc63: add    %ebx,%r14d
  1.46%  ││        │   0x00007fbfe024bc66: vextracti128 $0x1,%ymm11,%xmm6
  1.91%  ││        │   0x00007fbfe024bc6c: vpextrq $0x0,%xmm6,%r9
  0.12%  ││        │   0x00007fbfe024bc72: popcnt %r9,%r9
  1.33%  ││        │   0x00007fbfe024bc77: add    %r9d,%r14d
  2.20%  ││        │   0x00007fbfe024bc7a: vextracti128 $0x1,%ymm11,%xmm6
  0.08%  ││        │   0x00007fbfe024bc80: vpextrq $0x1,%xmm6,%r9
  2.51%  ││        │   0x00007fbfe024bc86: popcnt %r9,%rbx
  3.68%  ││        │   0x00007fbfe024bc8b: add    %ebx,%r14d
  0.45%  ││        │   0x00007fbfe024bc8e: add    %r14d,%edx
  1.69%  ││        │   0x00007fbfe024bc91: add    %r11d,%edx
  1.34%  ││        │   0x00007fbfe024bc94: add    %r10d,%edx
  3.71%  ││        │   0x00007fbfe024bc97: add    %r8d,%edx
  4.53%  ││        │   0x00007fbfe024bc9a: add    $0x10,%ecx

What’s going on here is that each 256 bit vector is first extracted to a 128 bit register, so a 64 bit word can be moved to a 64 bit register upon which POPCNTQ can operate. This doesn’t benchmark very well at all on my AVX2 capable laptop, but my laptop is a poor proxy for the kind of AVX512 capable processor bioinformatics workloads would expect to run on. AVX512F has VEXTRACTI64x4 which can dump a vector into four 64 bit registers in a single instruction, which LongVector512::get may well use on the right hardware. I’m not the only person in the world to have run benchmarks on a laptop in my spare time and it’s important to realise that some benchmarks might be slow just because an instruction is missing.

I found a slight improvement on the scalar loop by dumping the intersected vectors to a pre-allocated array, and manually unrolling the bit counts with three accumulators, because the latency of POPCNTQ is three times that of ADD. The unrolled version is roughly 20% faster than the scalar loop, but this isn’t the kind of gain usually expected from vectorisation.

  public int vpandStorePopcnt() {
    long[] intersections = buffer;
    int cardinality = 0;
    for (int i = 0; i < size && i < left.length && i < right.length; i += 4) {
      YMM_LONG.fromArray(left, i).and(YMM_LONG.fromArray(right, i)).intoArray(intersections, 0);
      cardinality += Long.bitCount(intersections[0]);
      cardinality += Long.bitCount(intersections[1]);
      cardinality += Long.bitCount(intersections[2]);
      cardinality += Long.bitCount(intersections[3]);
    return cardinality;

  public int vpandStorePopcntUnrolled() {
    long[] intersections = buffer;
    int cardinality1 = 0;
    int cardinality2 = 0;
    int cardinality3 = 0;
    for (int i = 0; i < size && i < left.length && i < right.length; i += 8) {
      YMM_LONG.fromArray(left, i).and(YMM_LONG.fromArray(right, i)).intoArray(intersections, 0);
      YMM_LONG.fromArray(left, i + 4).and(YMM_LONG.fromArray(right, i + 4)).intoArray(intersections, 4);
      cardinality1 += Long.bitCount(intersections[0]);
      cardinality2 += Long.bitCount(intersections[1]);
      cardinality3 += Long.bitCount(intersections[2]);
      cardinality1 += Long.bitCount(intersections[3]);
      cardinality2 += Long.bitCount(intersections[4]);
      cardinality3 += Long.bitCount(intersections[5]);
      cardinality1 += Long.bitCount(intersections[6]);
      cardinality2 += Long.bitCount(intersections[7]);
    return cardinality1 + cardinality2 + cardinality3;

Here the pairs of extracts are replaced with a store to the buffer.

  0.03%        ││ ││││  0x00007f6d50031358: vmovdqu 0x10(%rax,%r9,8),%ymm2
  0.02%        ││ ││││  0x00007f6d5003135f: vpand  0x10(%r13,%r9,8),%ymm2,%ymm2   
  0.06%        ││ ││││  0x00007f6d50031366: vmovdqu %ymm2,0x10(%r12,%rcx,8)      
  0.03%        ││ ││││  0x00007f6d5003136d: mov    %r9d,%edi
  0.00%        ││ ││││  0x00007f6d50031370: add    $0x4,%edi                      
  0.00%        ││ ││││  0x00007f6d50031373: movslq %edi,%rbx                      
  0.03%        ││ ││││  0x00007f6d50031376: vmovdqu 0x10(%rax,%rbx,8),%ymm2
  0.00%        ││ ││││  0x00007f6d5003137c: vpand  0x10(%r13,%rbx,8),%ymm2,%ymm2  
  0.10%        ││ ││││  0x00007f6d50031383: vmovdqu %ymm2,0x30(%r12,%rcx,8)      
  0.02%        ││ ││││  0x00007f6d5003138a: popcnt 0x28(%r12,%rcx,8),%rdi
  0.03%        ││ ││││  0x00007f6d50031391: popcnt 0x20(%r12,%rcx,8),%rdx
  0.04%        ││ ││││  0x00007f6d50031398: add    %r8d,%edx
  0.02%        ││ ││││  0x00007f6d5003139b: popcnt 0x38(%r12,%rcx,8),%r8
  0.22%        ││ ││││  0x00007f6d500313a2: add    %edx,%r8d
  0.08%        ││ ││││  0x00007f6d500313a5: popcnt 0x30(%r12,%rcx,8),%rdx
  0.03%        ││ ││││  0x00007f6d500313ac: popcnt 0x18(%r12,%rcx,8),%rbx
               ││ ││││  0x00007f6d500313b3: add    %r11d,%ebx
  0.04%        ││ ││││  0x00007f6d500313b6: add    %edx,%ebx
  0.15%        ││ ││││  0x00007f6d500313b8: popcnt 0x48(%r12,%rcx,8),%r11
               ││ ││││  0x00007f6d500313bf: add    %ebx,%r11d
  0.04%        ││ ││││  0x00007f6d500313c2: popcnt 0x40(%r12,%rcx,8),%rdx
  0.03%        ││ ││││  0x00007f6d500313c9: popcnt 0x10(%r12,%rcx,8),%rbx
  0.10%        ││ ││││  0x00007f6d500313d0: add    %esi,%ebx
  0.06%        ││ ││││  0x00007f6d500313d2: add    %ebx,%edi
  0.03%        ││ ││││  0x00007f6d500313d4: add    %edx,%edi

Here are all the results in summary:

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
popcnt thrpt 1 20 2098.750355 12.877810 ops/ms 1024
unrolledPopcnt thrpt 1 20 2077.227092 29.230757 ops/ms 1024
vpandExtractPopcnt thrpt 1 20 1819.027524 12.728300 ops/ms 1024
vpandStorePopcnt thrpt 1 20 2372.775743 10.315422 ops/ms 1024
vpandStorePopcntUnrolled thrpt 1 20 2626.761626 26.099143 ops/ms 1024

The difference is probably attributable to a smaller number of instructions:

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
popcnt:instructions thrpt 1 1 5241.243387 NaN #/op 1024
unrolledPopcnt:instructions thrpt 1 1 5203.655274 NaN #/op 1024
vpandExtractPopcnt:instructions thrpt 1 1 4579.851499 NaN #/op 1024
vpandStorePopcnt:instructions thrpt 1 1 3170.924853 NaN #/op 1024
vpandStorePopcntUnrolled:instructions thrpt 1 1 3528.127055 NaN #/op 1024

In total, the gains aren’t great, but the baseline is strong. There’s more to counting bits than computing the Hamming similarity between two bitmaps; various useful similarity metrics, such as Jaccard and Tanimoto, can be calculated in the same way by replacing intersection with other set relations already implemented in the Vector API.


Join the conversation


  1. If popcnt was implemented “as expected”, unrolling wouldn’t matter as much, because it’s a 1-per-cycle throughput (3 cycles latency) on recent Intel so as long as the rest of your loop is simple, the whole loop can probably execute at 1 cycle per iteration.

    However, popcnt (and also lzcnt and tzcnt) is a bit of a special snowflake, due to an implementation issue: there is a false dependency on the write-only output register. So in a not-unrolled loop that always does a popcnt into eax (for example), the 2nd popcnt instruction won’t run until 3 cycles after the first, since it is waiting for the result in eax to be ready, even though it is going to overwrite it completely! So the loop runs 3x slower than it needs to.

    See: https://stackoverflow.com/q/21390165/149138 for details on the false dependency for these 3 instructions.

    Unrolling can solve this issue since with 4 popcnt writing to different registers, the gap between register re-use (4 cycles) is large enough to hide the latency. The other trick is to explicitly zero out the register with xor eax, eax or whatever – but unless Go had been taught that special quirk it wouldn’t happen on its own.

    This issue was finally fixed for popcnt in CNL, and for lzcnt and tzcnt in Skylake.

    1. Heterogeneous deployment environments, where performance issues come and go over the generations of hardware, can be handled well by JIT compilers, yet I often hear it said that Go has an advantage over Java because it is “native” i.e. is compiled ahead of time. Actually, I became interested in hardware only very recently, when I noticed a server upgrade from old to a newer but not bleeding edge CPU, had a better than expected impact on performance without code changes, probably only possible because of JIT compilation.

      1. I think JIT can definitely be faster than “compile to native” which runs against conventional wisdom.

        The real problem is that we have no apples to apples comparisons: we compare Java to C, but Java is a very different language: garbage collected, memory safe, all member functions virtual by default, with many higher level abstractions and features, and written partly for simplicity and consistency over speed. It’s not the JIT that makes Java slow, it’s the JIT that allows Java to even compete the same league.

        A JIT has one primary disadvantage: the runtime of the JIT ends up being included in the runtime of the program, so the compiler has to trade off compilation speed and resulting binary speed. That’s the main reasons JITs don’t generate great code, but there is a lot of room to improve here, including more levels of recompilations for long-running processes beyond what we have now, and smarter targeting of more complex (slow) optimizations to the places where it would help. I think we have only scratched the surface there.

        On other hand, the JIT has some massive advantages over compile-to-native. You mentioned that it knows the hardware it is running on, so it can target it (this of course includes hardware that didn’t even exist when the code was originally written, something that is impossible for compile-to-native to target even if it jumps though all the hoops of multiple targets). It also knows exactly what parts of the code are hot, the behavior of loops, the full set of loaded classes (for devirtualization), the size of memory, the values of instantiated objects, runtime-calculated constants, etc. The list goes on. It’s like always using the best possible LTO and PGO on a run-by-run basis – and in many ways even better than that.

  2. > “AVX512F has VEXTRACTI64x4 which can dump a vector into four 64 bit registers in a single instruction,”

    I wish! I’ve often wanted something like that, but after double-checking VEXTRACTI64x4 doesn’t do that: it grabs either the upper or lower 256-bits of a zmm registers, and writes it to a ymm register or to memory. There is no option to write it to 4 GP registers.

    We are unlikely to see such a thing, at least with an efficient implementation because of how renaming works: current chips break instructions down into uops one common between uops is they write to at most a *single* destination register. Anything that updates two registers (e.g., many mul instructions) needs 2 uops. So unless the renamer was significantly changed, even if such an instruction did exist it would probably take 4 uops. That might still be better than the alternative of writing everything to memory and then loading it back 64-bits at a time (not always though since such loads may be fused).

Leave a comment

Your email address will not be published. Required fields are marked *