Stages

The conventional wisdom is that the stages of a task should be pipelined, so you don’t need to wait for the completion of one stage before the next is started. Surprisingly, it seems you can sometimes do better by performing each stage of a pipeline over a short batch of data at a time. Useful optimisation opportunities can arise from this phenomenon, with only minor code changes. I recently applied this principle to implement fast batch iterators in RoaringBitmap.

I came across a discussion about shuffling arrays on Twitter, stimulated by a blog post by Daniel Lemire. Imagine you want to randomly shuffle the contents of an array. One approach is to iterate over the array in reverse and, at each position, swap the element there with the element at a randomly chosen index no greater than the current position. Here’s some benchmark code to measure how long this takes for an assortment of swapping strategies, including one where the random swap indices are just precomputed and looked up in an array.


  @Benchmark
  public void shuffle(Blackhole bh) {
    // Fisher-Yates: swap the element at i - 1 with the element at a random index below i
    for (int i = data.length; i > 1; i--)
      swap(data, i - 1, op.applyAsInt(i));
    bh.consume(data);
  }

  // XOR swap, avoiding a temporary
  private static void swap(int[] arr, int i, int j) {
    arr[i] ^= arr[j];
    arr[j] ^= arr[i];
    arr[i] ^= arr[j];
  }

There is a large difference between the version where the random swaps are precomputed and the version where each swap index is computed on the fly with ThreadLocalRandom.nextInt.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: mode Param: size
shuffle thrpt 1 10 2198.459182 274.965189 ops/s THREAD_LOCAL_RANDOM 65536
shuffle thrpt 1 10 1015.796005 16.225480 ops/s THREAD_LOCAL_RANDOM 131072
shuffle thrpt 1 10 7300.732038 46.788234 ops/s PRECOMPUTED 65536
shuffle thrpt 1 10 3828.021096 450.874537 ops/s PRECOMPUTED 131072

The difference is large, but a lot more work is being done when the random indices are computed on the fly. A good measure of efficiency per unit of work is cycles per instruction (CPI). Running the benchmark with -prof perfnorm shows that these benchmarks are at parity for cycles per instruction: if throughput is lower when the random numbers are generated on the fly, it’s because there are more instructions to execute.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: mode Param: size
shuffle:CPI thrpt 1 1 0.427028 NaN #/op THREAD_LOCAL_RANDOM 65536
shuffle:CPI thrpt 1 1 0.447793 NaN #/op THREAD_LOCAL_RANDOM 131072
shuffle:CPI thrpt 1 1 0.477202 NaN #/op PRECOMPUTED 65536
shuffle:CPI thrpt 1 1 0.565153 NaN #/op PRECOMPUTED 131072

Nevertheless, instruction profiling with -prof perfasm shows that the generated code is qualitatively different when computing the next swap index is simple. When there is random number generation to do, most of the time during the swap is attributed to the mov instructions, or to the instructions just after them (probably because of profiling skid). For example, with the smaller array:


  0.04%    0.00%  │   ││  0x00007fa009c0a8f9: xor    0x10(%rsi,%rdx,4),%r10d  
 15.31%   13.18%  │   ││  0x00007fa009c0a8fe: mov    %r10d,0xc(%rsi,%rcx,4)  
  3.43%    3.05%  │   ││  0x00007fa009c0a903: xor    0x10(%rsi,%rdx,4),%r10d  
  5.37%    5.92%  │   ││  0x00007fa009c0a908: mov    %r10d,0x10(%rsi,%rdx,4)  
  4.15%    4.22%  │   ││  0x00007fa009c0a90d: xor    %r10d,0xc(%rsi,%rcx,4)  
 10.80%    8.80%  │   ││  0x00007fa009c0a912: cmp    $0x1,%r9d ; probably skid

The key difference in the precomputed case is that the loop is unrolled with several isomorphic chains of instructions. None of the loads seem to be quite so expensive according to the sampled frequencies.


  0.08%    0.16%  │      0x00007fda2dc0dfb2: cmp    %r10d,%r9d
                  │      0x00007fda2dc0dfb5: jae    0x00007fda2dc0e264
  0.00%    0.00%  │      0x00007fda2dc0dfbb: xor    0x10(%rdx,%r9,4),%edi
  2.90%    2.89%  │      0x00007fda2dc0dfc0: mov    %edi,0xc(%rdx,%r11,4)
  0.48%    0.33%  │      0x00007fda2dc0dfc5: xor    0x10(%rdx,%r9,4),%edi
  0.45%    0.48%  │      0x00007fda2dc0dfca: mov    %edi,0x10(%rdx,%r9,4)
  0.56%    0.46%  │      0x00007fda2dc0dfcf: xor    %edi,0xc(%rdx,%r11,4)
  4.29%    3.88%  │      0x00007fda2dc0dfd4: mov    0x8(%rdx,%r11,4),%edi
  0.03%    0.01%  │      0x00007fda2dc0dfd9: mov    0x8(%rsi,%r11,4),%r9d
  1.38%    1.46%  │      0x00007fda2dc0dfde: mov    %r11d,%ebx
  0.02%    0.01%  │      0x00007fda2dc0dfe1: add    $0xfffffffe,%ebx   

  0.63%    0.61%  │      0x00007fda2dc0dfe4: cmp    %r10d,%r9d
                  │      0x00007fda2dc0dfe7: jae    0x00007fda2dc0e26f
  0.00%    0.01%  │      0x00007fda2dc0dfed: xor    0x10(%rdx,%r9,4),%edi
  2.60%    2.38%  │      0x00007fda2dc0dff2: mov    %edi,0x8(%rdx,%r11,4)
  0.58%    0.51%  │      0x00007fda2dc0dff7: xor    0x10(%rdx,%r9,4),%edi
  0.90%    0.96%  │      0x00007fda2dc0dffc: mov    %edi,0x10(%rdx,%r9,4)
  0.68%    0.66%  │      0x00007fda2dc0e001: xor    %edi,0x8(%rdx,%r11,4)
  4.85%    4.17%  │      0x00007fda2dc0e006: mov    0x4(%rdx,%r11,4),%edi
  0.01%    0.02%  │      0x00007fda2dc0e00b: mov    0x4(%rsi,%r11,4),%r9d
  1.12%    0.95%  │      0x00007fda2dc0e010: mov    %r11d,%ecx
  0.01%    0.00%  │      0x00007fda2dc0e013: add    $0xfffffffd,%ecx  

  1.02%    1.02%  │      0x00007fda2dc0e016: cmp    %r10d,%r9d
                  │      0x00007fda2dc0e019: jae    0x00007fda2dc0e267
  0.01%    0.01%  │      0x00007fda2dc0e01f: xor    0x10(%rdx,%r9,4),%edi
  2.47%    2.10%  │      0x00007fda2dc0e024: mov    %edi,0x4(%rdx,%r11,4)
  0.69%    0.57%  │      0x00007fda2dc0e029: xor    0x10(%rdx,%r9,4),%edi
  1.37%    1.50%  │      0x00007fda2dc0e02e: mov    %edi,0x10(%rdx,%r9,4)
  0.77%    0.83%  │      0x00007fda2dc0e033: xor    %edi,0x4(%rdx,%r11,4)
  4.28%    3.85%  │      0x00007fda2dc0e038: mov    (%rdx,%r11,4),%edi
  0.03%    0.02%  │      0x00007fda2dc0e03c: mov    (%rsi,%r11,4),%r9d
  1.14%    0.97%  │      0x00007fda2dc0e040: mov    %r11d,%ebx
  0.01%    0.00%  │      0x00007fda2dc0e043: add    $0xfffffffc,%ebx  

With unrolling, some of each chain can take place concurrently, and if there is a cache miss in one chain, it won’t stall the progress of the other chains. Without this capacity for parallelism, a cache miss during the swap will stop all work from progressing. As the probability of a cache miss increases, the cost of the load bottleneck in the swap should increase: this can be stressed by increasing the size of the array. With a large (100M) array, there’s a good chance of a cache miss virtually all the time. CPI increases in both cases, markedly so with the precomputed swaps, but throughput converges: access to main memory has become the bottleneck.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: mode Param: size
shuffle:CPI thrpt 1 1 1.354325 NaN #/op THREAD_LOCAL_RANDOM 100000000
shuffle:CPI thrpt 1 1 3.854150 NaN #/op PRECOMPUTED 100000000

The perfasm output points to the first load in the compiled swap as the bottleneck: notice the large cost attributed to the mov instruction just after it (skid again).


  0.15%    0.24%  │      ││  0x00007f8405c0a264: cmp    %r9d,%edx
                  │      ││  0x00007f8405c0a267: jae    0x00007f8405c0a350
  0.10%    0.11%  │      ││  0x00007f8405c0a26d: xor    0x10(%r11,%rdx,4),%eax  
 73.97%   63.58%  │      ││  0x00007f8405c0a272: mov    %eax,0xc(%r11,%rcx,4)  
  2.46%    1.87%  │      ││  0x00007f8405c0a277: xor    0x10(%r11,%rdx,4),%eax 
  1.42%    0.67%  │      ││  0x00007f8405c0a27c: mov    %eax,0x10(%r11,%rdx,4) 
  2.19%    1.44%  │      ││  0x00007f8405c0a281: xor    %eax,0xc(%r11,%rcx,4) 
  2.16%    1.37%  │      ││  0x00007f8405c0a286: cmp    $0x1,%edi

With precomputed swaps, there is no single bottleneck, and my intuition is that there is some concurrency, despite the higher CPI. This is a long way from being proven.


 10.33%   11.23%   ││  0x00007fdb35c09250: mov    %r9d,0xc(%rsi,%r10,4)  
  0.41%    0.45%   ││  0x00007fdb35c09255: xor    0x10(%rsi,%r11,4),%r9d  
  0.36%    0.25%   ││  0x00007fdb35c0925a: mov    %r9d,0x10(%rsi,%r11,4)  
  0.42%    0.42%   ││  0x00007fdb35c0925f: xor    %r9d,0xc(%rsi,%r10,4)  
  0.51%    0.66%   ││  0x00007fdb35c09264: mov    0x8(%rsi,%r10,4),%r9d  
  0.03%    0.09%   ││  0x00007fdb35c09269: mov    0x8(%r13,%r10,4),%r11d 
  0.25%    0.20%   ││  0x00007fdb35c0926e: mov    %r10d,%r8d
  0.03%    0.15%   ││  0x00007fdb35c09271: add    $0xfffffffe,%r8d  
  0.19%    0.17%   ││  0x00007fdb35c09275: cmp    %ebx,%r11d
                   ││  0x00007fdb35c09278: jae    0x00007fdb35c09440
  0.02%    0.06%   ││  0x00007fdb35c0927e: xor    0x10(%rsi,%r11,4),%r9d  
 10.40%   10.66%   ││  0x00007fdb35c09283: mov    %r9d,0x8(%rsi,%r10,4) 
  0.41%    0.35%   ││  0x00007fdb35c09288: xor    0x10(%rsi,%r11,4),%r9d 
  0.41%    0.30%   ││  0x00007fdb35c0928d: mov    %r9d,0x10(%rsi,%r11,4) 
  0.45%    0.39%   ││  0x00007fdb35c09292: xor    %r9d,0x8(%rsi,%r10,4)  
  0.48%    0.60%   ││  0x00007fdb35c09297: mov    0x4(%rsi,%r10,4),%r9d  
  0.03%    0.06%   ││  0x00007fdb35c0929c: mov    0x4(%r13,%r10,4),%r11d 
  0.06%    0.11%   ││  0x00007fdb35c092a1: mov    %r10d,%edi
  0.02%    0.16%   ││  0x00007fdb35c092a4: add    $0xfffffffd,%edi   
  0.25%    0.20%   ││  0x00007fdb35c092a7: cmp    %ebx,%r11d

This can be exploited with a hybrid approach, so that the random numbers are still generated on the fly but there is no longer a single bottleneck. The random swap indices are generated on the fly and written into a small buffer; once the buffer is full, the swaps are performed. This should “decouple” the random number generation from the swapping code, and should allow some of the swaps to be performed independently. Concretely:


  @Benchmark
  public void shuffleBuffered(Blackhole bh) {
    // assumes buffer.length == unroll; any remaining elements at the front are left untouched
    for (int i = data.length; i - unroll > 1; i -= unroll) {
      // stage 1: generate a batch of random swap indices
      for (int j = 0; j < buffer.length; ++j) {
        buffer[j] = op.applyAsInt(i - j);
      }
      // stage 2: perform the batch of swaps
      for (int j = 0; j < buffer.length; ++j) {
        swap(data, i - j - 1, buffer[j]);
      }
    }
    bh.consume(data);
  }

There’s not much to be gained (or lost) from this until the array gets quite large, but it’s a relatively interesting outcome. CPI is on the whole improved, and throughput improves as a function of buffer size, so long as the buffer is small.

Mode Benchmark 16 32 64 128 256 (buffer size)
PRECOMPUTED shuffle 0.30639 0.296566 0.309829 0.312449 0.311183
PRECOMPUTED shuffle:CPI 3.004183 3.126903 2.989748 2.987508 3.000369
THREAD_LOCAL_RANDOM shuffle 0.271536 0.266418 0.271858 0.265593 0.264507
THREAD_LOCAL_RANDOM shuffle:CPI 1.303454 1.328127 1.300731 1.32857 1.377559
THREAD_LOCAL_RANDOM shuffleBuffered 0.296098 0.324416 0.346934 0.353246 0.35277
THREAD_LOCAL_RANDOM shuffleBuffered:CPI 0.96738 0.937101 0.893673 0.87786 0.874607

Frankly, I can’t think of anything less interesting than scrambling an array, but this observation made me think there may be something in the idea of decoupling stages of work in the right scenarios. After the otherwise pointless exercise above, I decided it might be worthwhile to spend some time porting the batch iteration feature implemented in the Go version of RoaringBitmap to Java. This turned out to be hugely profitable, speeding up iteration by between 2x and 10x. If you use RoaringBitmap, it might be worth switching to these new batch iterators.
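
To give a flavour of what batch iteration looks like to a caller, here is a minimal sketch; I am assuming the getBatchIterator/nextBatch names here, so check them against the version of the library you depend on.


import org.roaringbitmap.BatchIterator;
import org.roaringbitmap.RoaringBitmap;

public class BatchIterationExample {
  public static void main(String[] args) {
    RoaringBitmap bitmap = RoaringBitmap.bitmapOf(1, 2, 3, 1000, 1_000_000);
    int[] buffer = new int[256]; // reused for every batch
    BatchIterator iterator = bitmap.getBatchIterator();
    long sum = 0;
    while (iterator.hasNext()) {
      int batchSize = iterator.nextBatch(buffer); // stage 1: decode a batch of values into the buffer
      for (int i = 0; i < batchSize; ++i) {       // stage 2: consume the batch in a tight loop
        sum += buffer[i];
      }
    }
    System.out.println(sum);
  }
}

The shape is the same as in the shuffling example: positions are decoded into the buffer in one stage, and whatever you do with them happens in a separate, tight loop.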

This topic is explored in more depth by Daniel Lemire in Fast Random Integer Generation in an Interval.

Data Driven Logic

I really don’t like reading or writing blocks of if-else statements. They make my eyes glaze over. Rumour has it that processors don’t like executing them either, though that’s less true now than it once was. There are two problems with these blocks of statements, and neither one of them is performance:

  1. They are hard to read and tend to have subtle dependencies on line order.
  2. They can’t be treated as data, and can’t be executed remotely unless you do something weird like serialise code.

Since I started programming in Java, I have been aware of the existence of rule engines, but I have never heard of a single case of “soft coding” working out. In my own experience, every time I have been involved in the implementation of a system with a DSL to empower business analysts to control the business logic, there has been low stakeholder participation during the design of the DSL, and developers have ended up writing the business logic anyway. The most excruciating aspect of this is that it dilutes accountability for testing by blurring the boundaries between the application and user input. Perhaps your experience differs. However, rule engines can eradicate cyclomatic complexity in application code, and systems consisting of straight line code (with high test coverage) tend to do what they are supposed to. Soft coding isn’t the appeal of rule engines; getting rid of the if-else blocks is. If you squint at rule engines the right way, they look data driven and they start to get exciting. I can’t see value in anything more complicated than a decision table.

You can represent a block of if-else statements as a decision table by considering every possible branch as a line in the table. Your decision table doesn’t need to be exhaustive: there can be cases where you fall through and throw an exception or choose a default. This can be quite hard to write in imperative code, and you may need to throw the same exception in multiple places, set flags, or otherwise rely on line order to achieve this. Decision tables have a really nice property: if you want to start treating certain cases as exceptional, you just delete the line from the table.
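
As a toy illustration (the names and shape below are mine, not from any particular library), such a table can be nothing more than a list of rows, each pairing a predicate over the input with an outcome: the first matching row wins, and deleting a row makes that case fall through to whatever default the caller chooses. This is the naive linear scan discussed below, but it shows the table-as-data shape.


import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// A decision table as plain data: each row pairs a condition with an outcome.
public final class DecisionTable<T, R> {

  public static final class Row<T, R> {
    final Predicate<T> condition;
    final R outcome;

    public Row(Predicate<T> condition, R outcome) {
      this.condition = condition;
      this.outcome = outcome;
    }
  }

  private final List<Row<T, R>> rows;

  public DecisionTable(List<Row<T, R>> rows) {
    this.rows = rows;
  }

  // First matching row wins; an empty result means the caller falls through to a default.
  public Optional<R> classify(T input) {
    for (Row<T, R> row : rows) {
      if (row.condition.test(input)) {
        return Optional.of(row.outcome);
      }
    }
    return Optional.empty();
  }
}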

Decision tables are very similar in character to case classes in Scala, or to the weaker when expressions present in Kotlin, but decision tables can be allowed to grow much larger. I wouldn’t allow a match expression with 50,000 cases through a code review even if someone had the energy to write one deviously enough to come in under the maximum byte code method size.

I looked at several implementations of decision tables on GitHub and saw a lot of clean code, but not a lot of textbook computer science. Every single implementation iterated through a list of rules checking the rule against the input data. I have implemented a password strength checker like this in the past (I know! I probably shouldn’t have done this myself!), which is fine because a password strength checker might have at most a dozen rules. What if you work in adtech and want to classify the people you track (how do you sleep at night?) as members of one or many of 50,000 clusters which can be described in terms of regions of, say, a 50 dimensional feature space? Imagine your task is to guess which cluster your quarry belongs to in a few microseconds at most. You won’t get far if you iterate through thousands of rules.

I implemented a small library in the evenings over the last couple of weeks called bitrules. This was based on some ideas I had last year about using RoaringBitmap for decision tables. The idea is very simple: think of a list of rules with constrained attributes as a matrix, transpose that matrix, and loop through the attributes during classification. This is a similar approach to that taken in blocked signatures, a search technique used in BitFunnel, which translates an expensive signature scan into a random access. In the case of bitrules, for each constraint on each attribute, bits are removed from a mask of potentially matching rules. These bitsets are intersected sequentially, resulting in a bitset rapidly decreasing in cardinality. Because I used RoaringBitmap, a rapid reduction in cardinality means a rapid reduction in size, which means cache friendliness. There are a few tricks in the code, like using range encoding for range attributes so that range queries can be evaluated with a single bitset intersection. I plan to implement a hopefully efficient serialisation format so the table can be sent to another server and used for classification remotely.
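
A heavily simplified sketch of the transposed layout, written directly against RoaringBitmap rather than the bitrules API (the class and method names are illustrative only, and only equality constraints are handled):


import java.util.HashMap;
import java.util.Map;
import org.roaringbitmap.RoaringBitmap;

// Illustrative sketch only, not the bitrules API. Every rule is assumed to place an
// equality constraint on every attribute; classification intersects, attribute by
// attribute, the bitmap of rule ids whose constraint admits the input value.
public final class TransposedRuleSet {

  // attribute -> (constraint value -> ids of the rules carrying that constraint)
  private final Map<String, Map<String, RoaringBitmap>> index = new HashMap<>();

  public void addRule(int ruleId, Map<String, String> constraints) {
    constraints.forEach((attribute, value) ->
        index.computeIfAbsent(attribute, a -> new HashMap<>())
             .computeIfAbsent(value, v -> new RoaringBitmap())
             .add(ruleId));
  }

  // Returns the ids of every rule matched by the record.
  public RoaringBitmap classify(Map<String, String> record) {
    RoaringBitmap matches = null;
    for (Map.Entry<String, Map<String, RoaringBitmap>> attribute : index.entrySet()) {
      RoaringBitmap admitted = attribute.getValue()
          .getOrDefault(record.get(attribute.getKey()), new RoaringBitmap());
      // each intersection can only shrink the candidate set, and with Roaring
      // the representation shrinks with it
      matches = matches == null ? admitted.clone() : RoaringBitmap.and(matches, admitted);
      if (matches.isEmpty()) {
        break;
      }
    }
    return matches == null ? new RoaringBitmap() : matches;
  }
}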

I don’t actually know how fast this code is: performance is context sensitive, and I shy away from making “performance measurements”. It’s best suited to cases where there are a large number of rules (thousands), and I bet it’s really fast when there are ~50,000 segments in a ~50 dimensional space. I don’t even have a use case for bitrules right now: it was just fun writing the code. I have started releasing it to Maven Central; while I can’t guarantee its fitness for purpose, it may be of some use to someone else.

Parallel Bitmap Aggregation

A bitmap index represents predicates over records as sets consisting of the integer identities of each record satisfying each predicate. This representation is actually a few decades out of date: systems like Pilosa use much more sophisticated data structures, and Sybase had even more on offer back in the 90s. But the chances are, if you’ve rolled your own bitmap index, you’ve used equality encoding and have a bitmap per indexed predicate. RoaringBitmap is a great choice for the bitmaps used in this kind of data structure, because it offers a good tradeoff between bitmap compression and performance. It’s also succinct: that is, you don’t need to decompress the structure in order to operate on it. With the naive index structure described, it’s likely that you have many bitmaps to aggregate (union, intersection, difference, and so on) when you want to query your index.

RoaringBitmap provides a class FastAggregation for aggregations, and the method FastAggregation.and is incredibly fast, particularly given its apparent simplicity. This reflects a nice property of set intersection, that the size of the intersection cannot increase and tends to get smaller as the number of sets increases. Unions and differences are different: the problem size tends to increase in magnitude as the number of contributing sets increases. While FastAggregation.or and FastAggregation.xor are highly optimised, not a lot can be done about the fact each additional set makes the problem bigger. So it may be worth throwing some threads at the problem, and this gets more attractive as you add more dimensions to your index. You can, of course, completely bypass this need by reading some database research and sharing bitmaps between overlapping predicates.
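
Usage is as simple as it sounds; a minimal sketch, assuming the varargs overloads of and and or:


import org.roaringbitmap.FastAggregation;
import org.roaringbitmap.RoaringBitmap;

public class IndexQueryExample {
  public static void main(String[] args) {
    // one bitmap per indexed predicate, e.g. country=GB, device=mobile, consented=true
    RoaringBitmap countryGB = RoaringBitmap.bitmapOf(1, 5, 7, 9, 11);
    RoaringBitmap mobile = RoaringBitmap.bitmapOf(2, 5, 9, 10, 11);
    RoaringBitmap consented = RoaringBitmap.bitmapOf(5, 9, 11, 12);

    // record ids satisfying all three predicates
    RoaringBitmap all = FastAggregation.and(countryGB, mobile, consented);
    // record ids satisfying at least one predicate
    RoaringBitmap any = FastAggregation.or(countryGB, mobile, consented);

    System.out.println(all); // records 5, 9 and 11 satisfy all three
    System.out.println(any);
  }
}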

I implemented the class ParallelAggregation in RoaringBitmap, but I’m not convinced the technique used performs as well as it could do. RoaringBitmap stores the 16 bit prefix of each integer in a sorted array, with the rest of each integer in that 16 bit range stored in a container at the same index in another array. This makes the structure very easy to split. The implementation I worked on seeks to exploit this by grouping all the containers by common key as a SortedMap<Short, List<Container>> before executing each reduction (i.e. Function<List<Container>, Container>) in parallel in a ForkJoinPool. This results in a reasonable speedup of 2x-6.5x compared to FastAggregation on an 8 core machine, but it uses quite a lot of temporary memory just to set the problem up. I don’t think it should be possible to beat this approach without grouping the containers by key somehow, but I suspect there are lighter weight approaches which use less memory and give better throughput. Perhaps this would be an interesting problem to work on?
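
For concreteness, here is a schematic of that shape, heavily simplified: it assumes it lives inside the org.roaringbitmap package (like the writer code later in this post) so the container-level accessors are visible, it runs the per-key reductions on a parallel stream’s common ForkJoinPool, and the real class is more careful and more efficient than this. The internal signatures it leans on may differ between versions, so treat it as a sketch of the idea.


package org.roaringbitmap;

import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Sketch of the grouping approach described above: group containers by 16 bit key,
// reduce each group independently, then reassemble in key order.
public class ParallelOrSketch {

  public static RoaringBitmap parallelOr(RoaringBitmap... bitmaps) {
    // group containers sharing a 16 bit prefix; keys are ordered as unsigned shorts
    SortedMap<Short, List<Container>> grouped =
        new TreeMap<>((a, b) -> Integer.compare(a & 0xFFFF, b & 0xFFFF));
    for (RoaringBitmap bitmap : bitmaps) {
      RoaringArray array = bitmap.highLowContainer;
      for (int i = 0; i < array.size; ++i) {
        grouped.computeIfAbsent(array.getKeyAtIndex(i), key -> new ArrayList<>())
               .add(array.getContainerAtIndex(i));
      }
    }
    // reduce each group independently; the parallel stream uses the common ForkJoinPool
    List<Container> reduced = grouped.values()
        .parallelStream()
        .map(containers -> {
          Container union = containers.get(0).clone(); // clone so the output never aliases an input
          for (int i = 1; i < containers.size(); ++i) {
            union = union.or(containers.get(i));
          }
          return union;
        })
        .collect(Collectors.toList());
    // reassemble: keys and reduced containers are both in ascending (unsigned) key order
    RoaringBitmap result = new RoaringBitmap();
    int position = 0;
    for (Short key : grouped.keySet()) {
      result.highLowContainer.append(key, reduced.get(position++));
    }
    return result;
  }
}

Grouping first is what makes the parallelism safe: each 16 bit key is owned by exactly one task, so the output can be assembled without synchronisation.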

Iterating Over a Bitset in Java

How fast can you iterate over a bitset? Daniel Lemire published a benchmark recently in support of a strategy using the number of trailing zeroes to skip over empty bits. I have used the same technique in Java several times in my hobby project SplitMap and this is something I am keen to optimise. I think that the best strategy depends on what you want to do with the set bits, and how sparse and uniformly distributed they are. I argue that the cost of iteration is less important than the constraints your API imposes on the caller, and whether the caller is free to exploit patterns in the data.

C2 Generates Good Code

If you think C++ is much faster than Java, you either don’t know much about Java or do lots of floating point arithmetic. This isn’t about benchmarking C++ against Java, but comparing the compilation outputs for a C++ implementation and a Java implementation shows that there won’t be much difference if your Java method gets hot. Only the time to performance will differ, and this is amortised over the lifetime of an application. The trailing zeroes implementation is probably the fastest technique in Java as well as in C++, but that is to ignore the optimisations you can’t apply to the callback if you use it too literally.

Compiling this C++ function with GCC yields the snippet of assembly below, taken from the loop kernel:


template <typename CALLBACK>
static void for_each(const long* bitmap, const int size, const CALLBACK& callback) {
    for (size_t k = 0; k < size; ++k) {
        long bitset = bitmap[k];
        while (bitset != 0) {
            callback((k * 64) + __builtin_ctzl(bitset));
            bitset ^= (bitset & -bitset);
        }
    }
}

The instruction tzcntl calculates the next set bit and blsr switches it off.


.L99:
  movq  %rdi, %rcx
  blsr  %ebx, %ebx
  call  _ZNSo3putEc
  movq  %rax, %rcx
  call  _ZNSo5flushEv
  testl  %ebx, %ebx
  je  .L96
.L100:
  xorl  %edx, %edx
  movq  %r12, %rcx
  tzcntl  %ebx, %edx
  addl  %ebp, %edx
  call  _ZNSolsEi
  movq  %rax, %rdi
  movq  (%rax), %rax
  movq  -24(%rax), %rax
  movq  240(%rdi,%rax), %rsi
  testq  %rsi, %rsi
  je  .L108
  cmpb  $0, 56(%rsi)
  jne  .L109
  movq  %rsi, %rcx
  call  _ZNKSt5ctypeIcE13_M_widen_initEv
  movq  (%rsi), %rax
  movl  $10, %edx
  movq  48(%rax), %rax
  cmpq  %r14, %rax
  je  .L99
  movq  %rsi, %rcx
  call  *%rax
  movsbl  %al, %edx
  jmp  .L99
  .p2align 4,,10

In Java, almost identical code is generated.


  public void forEach(long[] bitmap, IntConsumer consumer) {
    for (int i = 0; i < bitmap.length; ++i) {
      long word = bitmap[i];
      while (word != 0) {
        consumer.accept(Long.SIZE * i + Long.numberOfTrailingZeros(word));
        word ^= Long.lowestOneBit(word);
      }
    }
  }

The key difference is that the xor and blsi haven’t been fused into a blsr, so the C++ code is probably slightly faster. A lambda accumulating the contents of an array has been inlined into this loop: the add comes from the inlined lambda, but notice, in the perfasm sample below, how little time is spent adding compared to computing the bit to switch off.


   .83%    0x000002d79d366a19: tzcnt   r9,rcx
  8.53%    0x000002d79d366a1e: add     r9d,ebx
  0.42%    0x000002d79d366a21: cmp     r9d,r8d
  0.00%    0x000002d79d366a24: jnb     2d79d366a4dh
  0.62%    0x000002d79d366a26: add     r10d,dword ptr [rdi+r9*4+10h]
 16.22%    0x000002d79d366a2b: vmovq   r11,xmm4
  6.68%    0x000002d79d366a30: mov     dword ptr [r11+10h],r10d
 27.92%    0x000002d79d366a34: blsi    r10,rcx
  0.55%    0x000002d79d366a39: xor     rcx,r10         
  0.10%    0x000002d79d366a3c: mov     r11,qword ptr [r15+70h]  

This post focuses on this Java code and its impact on which optimisations can be applied to the IntConsumer. Different principles apply in C++, particularly around inlining and vectorisation opportunities, but this blog is about Java. Depending on what your callback does, you get different benchmark results and should make different choices about how to do the iteration: you just can’t assess iteration in isolation.

Special Casing -1

Imagine you have an int[] containing data, and you are iterating over a mask or materialised predicate over that data. For each set bit, you want to add the corresponding entry in the array to a sum. In Java, that looks like this (you’ve already seen the generated assembly above):


  @Benchmark
  public int reduce() {
    int[] result = new int[1];
    forEach(bitmap, i -> result[0] += data[i]);
    return result[0];
  }

How fast can this get? It obviously depends on how full the bitset is. The worst case would be that it’s completely full, and it couldn’t get much better than if only one bit per word were set. The difference is noticeable, but scales by a factor less than the number of bits:

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: scenario
reduce thrpt 1 10 7.435909 0.017491 ops/ms FULL
reduce thrpt 1 10 260.305307 6.081961 ops/ms ONE_BIT_PER_WORD

But the important code here, the callback itself, is stuck with naive scalar code: there is no unrolling, no vectorisation, and the adds can’t be pipelined because of the data dependency through blsi and xor. We can do much better in some cases, and not much worse in others, just by treating -1 (a full word) as a special case, profiting from optimisations that can then be applied inside the callback. Passing a different callback which consumes whole words costs a branch, but it’s often worth it. Here’s the iterator now:


  interface WordConsumer {
    void acceptWord(int wordIndex, long word);
  }

  public void forEach(long[] bitmap, IntConsumer intConsumer, WordConsumer wordConsumer) {
    for (int i = 0; i < bitmap.length; ++i) {
      long word = bitmap[i];
      if (word == -1L) {
        wordConsumer.acceptWord(i, word);
      } else {
        while (word != 0) {
          intConsumer.accept(Long.SIZE * i + Long.numberOfTrailingZeros(word));
          word ^= Long.lowestOneBit(word);
        }
      }
    }
  }

  @Benchmark
  public int reduceWithWordConsumer() {
    int[] result = new int[1];
    forEach(bitmap, i -> result[0] += data[i], (index, word) -> {
      if (word != -1L) {
        throw new IllegalStateException();
      }
      int sum = 0;
      for (int i = index * Long.SIZE; i < (index + 1) * Long.SIZE; ++i) {
        sum += data[i];
      }
      result[0] += sum;
    });
    return result[0];
  }

This really pays off when the bitset is full: that case improves by about 6x. The extra branch does seem to cost something in the sparse case, even though it is never taken there.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: scenario
reduce thrpt 1 10 7.401202 0.118648 ops/ms FULL
reduce thrpt 1 10 261.682016 4.155856 ops/ms ONE_BIT_PER_WORD
reduceWithWordConsumer thrpt 1 10 43.972759 0.993264 ops/ms FULL
reduceWithWordConsumer thrpt 1 10 222.824868 4.877147 ops/ms ONE_BIT_PER_WORD

We still don’t actually know the cost of the branch when it’s taken every now and then. To estimate it, we need new scenarios which mix full and sparse words. As you might expect, the WordConsumer is great when one word in every few is full: the fast path is so much faster that it practically skips the word.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: scenario
reduce thrpt 1 10 157.358633 4.538679 ops/ms SPARSE_16_FULL_WORDS
reduceWithWordConsumer thrpt 1 10 257.041035 7.446404 ops/ms SPARSE_16_FULL_WORDS

So in this scenario, the branch has paid for itself. How? The per-bit data dependency has been replaced by a countable loop over each full word. Here’s the perfasm output. Notice two things: long runs of add instructions, and the vastly reduced percentage against blsi. The time is now spent adding numbers up, not switching off least significant bits. This feels like progress.


  0.05%    0x000001dd5b35af03: add     ebx,dword ptr [rdi+r9*4+10h]
  0.31%    0x000001dd5b35af08: add     ebx,dword ptr [rdi+r11*4+14h]
  0.32%    0x000001dd5b35af0d: add     ebx,dword ptr [rdi+r11*4+18h]
  0.33%    0x000001dd5b35af12: add     ebx,dword ptr [rdi+r11*4+1ch]
  0.37%    0x000001dd5b35af17: add     ebx,dword ptr [rdi+r11*4+20h]
  0.34%    0x000001dd5b35af1c: add     ebx,dword ptr [rdi+r11*4+24h]
  0.39%    0x000001dd5b35af21: add     ebx,dword ptr [rdi+r11*4+28h]
  0.36%    0x000001dd5b35af26: add     ebx,dword ptr [rdi+r11*4+2ch]
  0.34%    0x000001dd5b35af2b: add     ebx,dword ptr [rdi+r11*4+30h]
  0.35%    0x000001dd5b35af30: add     ebx,dword ptr [rdi+r11*4+34h]
  0.38%    0x000001dd5b35af35: add     ebx,dword ptr [rdi+r11*4+38h]
  0.36%    0x000001dd5b35af3a: add     ebx,dword ptr [rdi+r11*4+3ch]
  0.49%    0x000001dd5b35af3f: add     ebx,dword ptr [rdi+r11*4+40h]
  0.39%    0x000001dd5b35af44: add     ebx,dword ptr [rdi+r11*4+44h]
  0.42%    0x000001dd5b35af49: add     ebx,dword ptr [rdi+r11*4+48h]
  0.39%    0x000001dd5b35af4e: add     ebx,dword ptr [rdi+r11*4+4ch]
...
  2.39%    0x000001dd5b35afe9: tzcnt   r11,rbx
  2.65%    0x000001dd5b35afee: add     r11d,r10d         
  2.15%    0x000001dd5b35aff1: cmp     r11d,r9d
  0.00%    0x000001dd5b35aff4: jnb     1dd5b35b04dh
  2.29%    0x000001dd5b35aff6: add     r8d,dword ptr [rdi+r11*4+10h]
 11.03%    0x000001dd5b35affb: vmovq   r11,xmm0
  2.45%    0x000001dd5b35b000: mov     dword ptr [r11+10h],r8d  
  3.14%    0x000001dd5b35b004: mov     r11,qword ptr [r15+70h]
  2.18%    0x000001dd5b35b008: blsi    r8,rbx
  2.23%    0x000001dd5b35b00d: xor     rbx,r8

Heroically ploughing through the full words bit by bit tells a different story: blsi is up at 11%. This indicates more time is spent iterating than evaluating the callback.


  6.98%    0x0000019f106c6799: tzcnt   r9,rdi
  3.47%    0x0000019f106c679e: add     r9d,ebx           
  1.65%    0x0000019f106c67a1: cmp     r9d,r10d
           0x0000019f106c67a4: jnb     19f106c67cdh
  1.67%    0x0000019f106c67a6: add     r11d,dword ptr [r8+r9*4+10h]
 11.45%    0x0000019f106c67ab: vmovq   r9,xmm2
  3.20%    0x0000019f106c67b0: mov     dword ptr [r9+10h],r11d  
 11.31%    0x0000019f106c67b4: blsi    r11,rdi
  1.71%    0x0000019f106c67b9: xor     rdi,r11           

This shows the cost of a data dependency in a loop. The operation we want to perform is associative, so we could even vectorise this. In C++ that might happen automatically, or could be ensured with intrinsics, but C2 has various heuristics: it won’t try to vectorise a simple reduction, and 64 would probably be on the short side for most cases it would try to vectorise.
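
Since the sum is associative, we can also reorder it by hand: a word consumer with several independent accumulators breaks the dependency on a single running total, so the adds can overlap even without vectorisation. This variation is mine, not one of the benchmarks above.


  @Benchmark
  public int reduceWithUnrolledWordConsumer() {
    int[] result = new int[1];
    forEach(bitmap, i -> result[0] += data[i], (index, word) -> {
      if (word != -1L) {
        throw new IllegalStateException();
      }
      // four independent partial sums break the dependency on a single accumulator
      int base = index * Long.SIZE;
      int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
      for (int i = base; i < base + Long.SIZE; i += 4) {
        sum0 += data[i];
        sum1 += data[i + 1];
        sum2 += data[i + 2];
        sum3 += data[i + 3];
      }
      result[0] += sum0 + sum1 + sum2 + sum3;
    });
    return result[0];
  }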

Acknowledging Runs

You might be tempted to transfer even more control to the callback, by accumulating runs and then calling the callback once per run. It simplifies the code to exclude incomplete start and end words from the run.


  private interface RunConsumer {
    void acceptRun(int start, int end);
  }

  public void forEach(long[] bitmap, IntConsumer intConsumer, RunConsumer runConsumer) {
    int runStart = -1;
    for (int i = 0; i < bitmap.length; ++i) {
      long word = bitmap[i];
      if (word == -1L) {
        if (runStart == -1) {
          runStart = i;
        }
      } else {
        if (runStart != -1) {
          runConsumer.acceptRun(runStart * Long.SIZE, i * Long.SIZE);
          runStart = -1;
        }
        while (word != 0) {
          intConsumer.accept(Long.SIZE * i + Long.numberOfTrailingZeros(word));
          word ^= Long.lowestOneBit(word);
        }
      }
    }
    if (runStart != -1) {
      runConsumer.acceptRun(runStart * Long.SIZE, bitmap.length * Long.SIZE);
    }
  }

For a simple reduction, the extra complexity isn’t justified: you’re better off with the WordConsumer.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: scenario
reduce thrpt 1 10 160.502749 2.960568 ops/ms SPARSE_16_FULL_WORDS
reduce thrpt 1 10 7.294747 0.186678 ops/ms FULL
reduce thrpt 1 10 258.064511 8.902233 ops/ms ONE_BIT_PER_WORD
reduce thrpt 1 10 159.613877 3.424432 ops/ms SPARSE_1_16_WORD_RUN
reduceWithRunConsumer thrpt 1 10 251.683131 6.799639 ops/ms SPARSE_16_FULL_WORDS
reduceWithRunConsumer thrpt 1 10 37.809154 0.723198 ops/ms FULL
reduceWithRunConsumer thrpt 1 10 218.133560 13.756779 ops/ms ONE_BIT_PER_WORD
reduceWithRunConsumer thrpt 1 10 140.896826 8.495777 ops/ms SPARSE_1_16_WORD_RUN
reduceWithWordConsumer thrpt 1 10 257.961783 5.892072 ops/ms SPARSE_16_FULL_WORDS
reduceWithWordConsumer thrpt 1 10 43.909471 0.601319 ops/ms FULL
reduceWithWordConsumer thrpt 1 10 213.731758 20.398077 ops/ms ONE_BIT_PER_WORD
reduceWithWordConsumer thrpt 1 10 258.280428 11.316647 ops/ms SPARSE_1_16_WORD_RUN

It would be simplistic to measure this and conclude that it is a bad approach, though. There are several other dimensions to the problem:

  1. Vectorised callbacks
  2. Inlining failures preventing optimisations
  3. The number of runs and their lengths (i.e. your data and how you structure it)

Vectorisable Callbacks

There are real benefits to batching up callbacks if the workload in the callback can be vectorised. The code doesn’t need to get much more complicated to start benefitting from larger iteration batches. Mapping each bit to a scaled and squared value from the data array and storing it into an output array illustrates this.


  @Benchmark
  public void map(Blackhole bh) {
    forEach(bitmap, i -> output[i] = data[i] * data[i] * factor);
    bh.consume(output);
  }

  @Benchmark
  public void mapWithWordConsumer(Blackhole bh) {
    forEach(bitmap, i -> output[i] = data[i] * data[i] * factor, (WordConsumer) (index, word) -> {
      if (word != -1L) {
        throw new IllegalStateException();
      }
      for (int i = index * Long.SIZE; i < (index + 1) * Long.SIZE; ++i) {
        output[i] = data[i] * data[i] * factor;
      }
    });
    bh.consume(output);
  }

  @Benchmark
  public void mapWithRunConsumer(Blackhole bh) {
    forEach(bitmap, i -> output[i] = data[i] * data[i] * factor, (RunConsumer) (start, end) -> {
      for (int i = start; i < end; ++i) {
        output[i] = data[i] * data[i] * factor;
      }
    });
    bh.consume(output);
  }

The RunConsumer does much better in the full case, never much worse than the WordConsumer and always better than the basic strategy – even when there is only one run in the entire bitset, or when there are a few full words in an otherwise sparse bitset.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: scenario
map thrpt 1 10 127.876662 3.411741 ops/ms SPARSE_16_FULL_WORDS
map thrpt 1 10 10.598974 0.022404 ops/ms FULL
map thrpt 1 10 126.434666 18.608547 ops/ms ONE_BIT_PER_WORD
map thrpt 1 10 115.977840 20.449258 ops/ms SPARSE_1_16_WORD_RUN
mapWithRunConsumer thrpt 1 10 199.186167 8.138446 ops/ms SPARSE_16_FULL_WORDS
mapWithRunConsumer thrpt 1 10 64.230868 2.871434 ops/ms FULL
mapWithRunConsumer thrpt 1 10 219.963063 4.257561 ops/ms ONE_BIT_PER_WORD
mapWithRunConsumer thrpt 1 10 203.403804 6.907366 ops/ms SPARSE_1_16_WORD_RUN
mapWithWordConsumer thrpt 1 10 229.822235 5.276084 ops/ms SPARSE_16_FULL_WORDS
mapWithWordConsumer thrpt 1 10 48.381990 3.845642 ops/ms FULL
mapWithWordConsumer thrpt 1 10 218.907803 5.331011 ops/ms ONE_BIT_PER_WORD
mapWithWordConsumer thrpt 1 10 240.795280 10.204818 ops/ms SPARSE_1_16_WORD_RUN

This is simply because the callback was vectorised, and the style of the RunConsumer API allows this to be exploited. This can be seen with perfasm. Both the WordConsumer and RunConsumer callbacks are actually vectorised, but the thing to notice is that there are two hot regions in the WordConsumer benchmark, the iteration and the callback, and this boundary is crossed frequently. The RunConsumer implementation, on the other hand, spends most of its time in the callback.

WordConsumer


....[Hottest Region 1]..............................................................................
c2, com.openkappa.simd.iterate.generated.BitSetIterator_mapWithWordConsumer_jmhTest::mapWithWordConsumer_thrpt_jmhStub, version 172 (227 bytes) 
...
  1.55%    0x000001c2aa13c790: vmovdqu ymm1,ymmword ptr [r9+r10*4+10h]
  0.15%    0x000001c2aa13c797: vpmulld ymm1,ymm1,ymm1
  3.72%    0x000001c2aa13c79c: vpmulld ymm1,ymm1,ymm2
 16.02%    0x000001c2aa13c7a1: vmovdqu ymmword ptr [rdx+r10*4+10h],ymm1
  1.69%    0x000001c2aa13c7a8: movsxd  r8,r10d
  1.55%    0x000001c2aa13c7ab: vmovdqu ymm1,ymmword ptr [r9+r8*4+30h]
  1.46%    0x000001c2aa13c7b2: vpmulld ymm1,ymm1,ymm1
  1.71%    0x000001c2aa13c7b7: vpmulld ymm1,ymm1,ymm2
  3.20%    0x000001c2aa13c7bc: vmovdqu ymmword ptr [rdx+r8*4+30h],ymm1
  0.07%    0x000001c2aa13c7c3: add     r10d,10h          
  1.70%    0x000001c2aa13c7c7: cmp     r10d,r11d
           0x000001c2aa13c7ca: jl      1c2aa13c790h      
  0.02%    0x000001c2aa13c7cc: mov     r8,qword ptr [r15+70h]  
  1.50%    0x000001c2aa13c7d0: test    dword ptr [r8],eax  
  0.04%    0x000001c2aa13c7d3: cmp     r10d,r11d
           0x000001c2aa13c7d6: jl      1c2aa13c78ah
  0.05%    0x000001c2aa13c7d8: mov     r11d,dword ptr [rsp+5ch]
  0.02%    0x000001c2aa13c7dd: add     r11d,39h
  1.57%    0x000001c2aa13c7e1: mov     r8d,ecx
  0.02%    0x000001c2aa13c7e4: cmp     r8d,r11d
  0.06%    0x000001c2aa13c7e7: mov     ecx,80000000h
  0.02%    0x000001c2aa13c7ec: cmovl   r11d,ecx
  1.50%    0x000001c2aa13c7f0: cmp     r10d,r11d
           0x000001c2aa13c7f3: jnl     1c2aa13c819h
  0.02%    0x000001c2aa13c7f5: nop                       
  0.06%    0x000001c2aa13c7f8: vmovdqu ymm1,ymmword ptr [r9+r10*4+10h]
  0.21%    0x000001c2aa13c7ff: vpmulld ymm1,ymm1,ymm1
  2.16%    0x000001c2aa13c804: vpmulld ymm1,ymm1,ymm2
  1.80%    0x000001c2aa13c809: vmovdqu ymmword ptr [rdx+r10*4+10h],ymm1
...
 53.26%  <total for region 1>

RunConsumer


....[Hottest Region 1]..............................................................................
c2, com.openkappa.simd.iterate.BitSetIterator$$Lambda$44.1209658195::acceptRun, version 166 (816 bytes) 
...
  0.92%    0x0000016658954860: vmovdqu ymm0,ymmword ptr [rdx+r8*4+10h]
  1.31%    0x0000016658954867: vpmulld ymm0,ymm0,ymm0
  1.74%    0x000001665895486c: vpmulld ymm0,ymm0,ymm1
  4.55%    0x0000016658954871: vmovdqu ymmword ptr [rdi+r8*4+10h],ymm0
  0.69%    0x0000016658954878: movsxd  rcx,r8d
  0.01%    0x000001665895487b: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+30h]
  0.41%    0x0000016658954881: vpmulld ymm0,ymm0,ymm0
  0.78%    0x0000016658954886: vpmulld ymm0,ymm0,ymm1
  0.83%    0x000001665895488b: vmovdqu ymmword ptr [rdi+rcx*4+30h],ymm0
  0.25%    0x0000016658954891: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+50h]
  1.29%    0x0000016658954897: vpmulld ymm0,ymm0,ymm0
  1.51%    0x000001665895489c: vpmulld ymm0,ymm0,ymm1
  3.65%    0x00000166589548a1: vmovdqu ymmword ptr [rdi+rcx*4+50h],ymm0
  0.54%    0x00000166589548a7: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+70h]
  0.31%    0x00000166589548ad: vpmulld ymm0,ymm0,ymm0
  0.47%    0x00000166589548b2: vpmulld ymm0,ymm0,ymm1
  1.11%    0x00000166589548b7: vmovdqu ymmword ptr [rdi+rcx*4+70h],ymm0
  0.28%    0x00000166589548bd: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+90h]
  1.17%    0x00000166589548c6: vpmulld ymm0,ymm0,ymm0
  1.89%    0x00000166589548cb: vpmulld ymm0,ymm0,ymm1
  3.56%    0x00000166589548d0: vmovdqu ymmword ptr [rdi+rcx*4+90h],ymm0
  0.73%    0x00000166589548d9: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+0b0h]
  0.21%    0x00000166589548e2: vpmulld ymm0,ymm0,ymm0
  0.34%    0x00000166589548e7: vpmulld ymm0,ymm0,ymm1
  1.29%    0x00000166589548ec: vmovdqu ymmword ptr [rdi+rcx*4+0b0h],ymm0
  0.33%    0x00000166589548f5: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+0d0h]
  0.97%    0x00000166589548fe: vpmulld ymm0,ymm0,ymm0
  1.90%    0x0000016658954903: vpmulld ymm0,ymm0,ymm1
  3.59%    0x0000016658954908: vmovdqu ymmword ptr [rdi+rcx*4+0d0h],ymm0
  0.82%    0x0000016658954911: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+0f0h]
  0.18%    0x000001665895491a: vpmulld ymm0,ymm0,ymm0
  0.29%    0x000001665895491f: vpmulld ymm0,ymm0,ymm1
  1.25%    0x0000016658954924: vmovdqu ymmword ptr [rdi+rcx*4+0f0h],ymm0
  0.33%    0x000001665895492d: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+110h]
  1.10%    0x0000016658954936: vpmulld ymm0,ymm0,ymm0
  2.11%    0x000001665895493b: vpmulld ymm0,ymm0,ymm1
  3.67%    0x0000016658954940: vmovdqu ymmword ptr [rdi+rcx*4+110h],ymm0
  0.93%    0x0000016658954949: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+130h]
  0.13%    0x0000016658954952: vpmulld ymm0,ymm0,ymm0
  0.25%    0x0000016658954957: vpmulld ymm0,ymm0,ymm1
  1.35%    0x000001665895495c: vmovdqu ymmword ptr [rdi+rcx*4+130h],ymm0
  0.32%    0x0000016658954965: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+150h]
  0.93%    0x000001665895496e: vpmulld ymm0,ymm0,ymm0
  2.16%    0x0000016658954973: vpmulld ymm0,ymm0,ymm1
  3.73%    0x0000016658954978: vmovdqu ymmword ptr [rdi+rcx*4+150h],ymm0
  0.95%    0x0000016658954981: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+170h]
  0.14%    0x000001665895498a: vpmulld ymm0,ymm0,ymm0
  0.21%    0x000001665895498f: vpmulld ymm0,ymm0,ymm1
  1.39%    0x0000016658954994: vmovdqu ymmword ptr [rdi+rcx*4+170h],ymm0
  0.29%    0x000001665895499d: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+190h]
  1.42%    0x00000166589549a6: vpmulld ymm0,ymm0,ymm0
  2.61%    0x00000166589549ab: vpmulld ymm0,ymm0,ymm1
  4.42%    0x00000166589549b0: vmovdqu ymmword ptr [rdi+rcx*4+190h],ymm0
  1.01%    0x00000166589549b9: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+1b0h]
  0.10%    0x00000166589549c2: vpmulld ymm0,ymm0,ymm0
  0.17%    0x00000166589549c7: vpmulld ymm0,ymm0,ymm1
  1.46%    0x00000166589549cc: vmovdqu ymmword ptr [rdi+rcx*4+1b0h],ymm0
  0.27%    0x00000166589549d5: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+1d0h]
 13.60%    0x00000166589549de: vpmulld ymm0,ymm0,ymm0
  3.51%    0x00000166589549e3: vpmulld ymm0,ymm0,ymm1
  4.69%    0x00000166589549e8: vmovdqu ymmword ptr [rdi+rcx*4+1d0h],ymm0
  1.00%    0x00000166589549f1: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+1f0h]
  0.11%    0x00000166589549fa: vpmulld ymm0,ymm0,ymm0
  0.15%    0x00000166589549ff: vpmulld ymm0,ymm0,ymm1
  1.46%    0x0000016658954a04: vmovdqu ymmword ptr [rdi+rcx*4+1f0h],ymm0
                                                         
  0.26%    0x0000016658954a0d: add     r8d,80h           
  0.01%    0x0000016658954a14: cmp     r8d,r10d
           0x0000016658954a17: jl      16658954860h      
  0.00%    0x0000016658954a1d: mov     r14,qword ptr [r15+70h]  
  0.06%    0x0000016658954a21: test    dword ptr [r14],eax  
  0.17%    0x0000016658954a24: cmp     r8d,r10d
           0x0000016658954a27: jl      16658954860h
           0x0000016658954a2d: mov     r10d,r9d
           0x0000016658954a30: add     r10d,0fffffff9h
           0x0000016658954a34: cmp     r9d,r10d
  0.00%    0x0000016658954a37: cmovl   r10d,ebx
           0x0000016658954a3b: cmp     r8d,r10d
           0x0000016658954a3e: jnl     16658954a61h      
           0x0000016658954a40: vmovdqu ymm0,ymmword ptr [rdx+r8*4+10h]
  0.14%    0x0000016658954a47: vpmulld ymm0,ymm0,ymm0
  0.05%    0x0000016658954a4c: vpmulld ymm0,ymm0,ymm1
  0.03%    0x0000016658954a51: vmovdqu ymmword ptr [rdi+r8*4+10h],ymm0
...
 96.10%  <total for region 1>

Inlining

So far, everything has been inlined. Java optimistically assumes you only have one implementation and aggressively inlines at first, deoptimising to add a branch when it sees a second implementation, and deoptimising again to a virtual call if it sees a third. This usually doesn’t matter much, but when it does, the cost of the virtual call not only dwarfs any savings from the iteration strategy; it also prevents the optimisations which can only be applied once the code is inlined. Once again, passing a batch of work into the callback largely mitigates this, because even if the call is virtual, the callback itself can still get hot and be aggressively optimised in its own right. I haven’t benchmarked this because I think the point is self-evident to anyone who would read this far.

Number of runs

It’s clear to see from the benchmark results that the best choice of iteration strategy is sensitive to what you want to do with the data, but also to how it is arranged. It is well documented in the database literature that real data sets tend to contain runs. If you are building a bitmap index on some attribute of your data, and you sort your data by that attribute, you will have as many bitmaps as you have attribute values, and each attribute value bitmap will contain a single run. This is almost true of any index on attributes correlated with the attribute chosen for the sort order, and is completely untrue for uncorrelated attributes. There is a range of iteration strategies to choose from, and the best strategy for one index may not be the best for another.

My benchmarks are available on GitHub.

Building RoaringBitmaps from Streams

RoaringBitmap is a fast compressed bitset format. In the Java implementation of Roaring, it was until recently preferable to build a bitset in one go from sorted data; there were performance penalties of varying magnitude for incremental or unordered insertions. In a recent pull request, I set out to improve incremental monotonic insertion so I could build bitmaps from streams, and ended up speeding up unsorted batch creation significantly by accident.

Incremental Ordered Insertion

If you want to build a bitmap, you can do so efficiently with the RoaringBitmap.bitmapOf factory method.


int[] data = ...
RoaringBitmap bitmap = RoaringBitmap.bitmapOf(data);

However, I often find that I want to stream integers into a bitmap. Given that the integers being inserted into a bitmap often represent indices into an array, such a stream is likely to be monotonic. You might implement this like so:


IntStream stream = ...
RoaringBitmap bitmap = new RoaringBitmap();
stream.forEach(bitmap::add);

While this is OK, it has a few inefficiencies compared to the batch creation method.

  • Indirection: the container being written to must be located on each insertion
  • Eagerness: the cardinality must be kept up to date on each insertion
  • Allocation pressure: the best container type can’t be known in advance. The choice of container may change as data is inserted, which requires allocation of new instances.

You could also collect the stream into an int[] and use the batch method, but that array could be a large temporary object, with obvious drawbacks.
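
For comparison, that alternative is a one-liner:


IntStream stream = ...
RoaringBitmap bitmap = RoaringBitmap.bitmapOf(stream.toArray()); // materialises the whole stream first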

OrderedWriter

The solution I proposed is to create a writer object (OrderedWriter) which allocates a small 8KB buffer, used as a dense bitmap covering the 65,536 values that share a 16-bit prefix. The stream-to-bitmap code becomes:


IntStream stream = ...
RoaringBitmap bitmap = new RoaringBitmap();
OrderedWriter writer = new OrderedWriter(bitmap);
stream.forEach(writer::add);
writer.flush(); // clear the buffer out

This is implemented so that a change of key (the most significant 16 bits of each integer) triggers a flush of the buffer.


  public void add(int value) {
    short key = Util.highbits(value);
    short low = Util.lowbits(value);
    if (key != currentKey) {
      if (Util.compareUnsigned(key, currentKey) < 0) {
        throw new IllegalStateException("Must write in ascending key order");
      }
      flush();
    }
    int ulow = low & 0xFFFF;
    bitmap[(ulow >>> 6)] |= (1L << ulow);
    currentKey = key;
    dirty = true;
  }

When a flush occurs, a container type is chosen and appended to the bitmap’s prefix index.


  public void flush() {
    if (dirty) {
      RoaringArray highLowContainer = underlying.highLowContainer;
      // we check that it's safe to append since RoaringArray.append does no validation
      if (highLowContainer.size > 0) {
        short key = highLowContainer.getKeyAtIndex(highLowContainer.size - 1);
        if (Util.compareUnsigned(currentKey, key) <= 0) {
          throw new IllegalStateException("Cannot write " + currentKey + " after " + key);
        }
      }
      highLowContainer.append(currentKey, chooseBestContainer());
      clearBitmap();
      dirty = false;
    }
  }

There are significant performance advantages in this approach. There is no indirection cost, and no searches in the prefix index for containers: the writes are just buffered. The buffer is small enough to fit in cache, and containers only need to be created when the writer is flushed, which happens whenever a new key is seen, or when flush is called manually. During a flush, the cardinality can be computed in one go, the best container can be chosen, and run optimisation only has to happen once. Computing the cardinality is the only bottleneck – it requires 1024 calls to Long.bitCount which can’t be vectorised in a language like Java. It can’t be incremented on insertion without either sacrificing idempotence or incurring the cost of a membership check. After the flush, the buffer needs to be cleared, using a call to Arrays.fill which is vectorised. So, despite the cost of the buffer, this can be quite efficient.
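
chooseBestContainer isn’t shown above. As a rough guess at its shape (the BitmapContainer constructor and toArrayContainer conversion used here are assumptions about RoaringBitmap internals and may not match the real code), it boils down to the bit count and a threshold: below 4096 elements, an array of 16-bit values is smaller than the 8KB bitmap.


  // Hypothetical sketch, not the real implementation: the only O(n) work is the
  // cardinality count, 1024 calls to Long.bitCount over the 8KB buffer.
  private Container chooseBestContainer() {
    int cardinality = 0;
    for (long word : bitmap) {
      cardinality += Long.bitCount(word);
    }
    // copy the buffer: it is cleared and reused after the flush
    Container container = new BitmapContainer(bitmap.clone(), cardinality);
    if (cardinality <= 4096) {
      container = ((BitmapContainer) container).toArrayContainer();
    }
    return container.runOptimize(); // run optimisation happens once, here
  }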

This approach isn’t universally applicable. For instance, you must write data in ascending order of the most significant 16 bits. You must also remember to flush the writer when you’re finished: until you’ve called flush, the data in the last container may not be in the bitmap. For my particular use case, this is reasonable. However, there are times when this is not fit for purpose, such as if you are occasionally inserting values and expect them to be available to queries immediately. In general, if you don’t know when you’ll stop adding data to the bitmap, this isn’t a good fit because you won’t know when to call flush.

Benchmark

The RoaringBitmap project is a big user of JMH, and most pull requests require benchmarks as evidence of performance improvement. I benchmarked the two approaches, varying bitmap sizes and randomness (likelihood of there not being a compressible run), and was amazed to find that this approach actually beats having a sorted array and using RoaringBitmap.bitmapOf. Less surprising was beating the existing API for incremental adds (this was the goal in the first place). Lower is better:

Benchmark (randomness) (size) Mode Cnt Score Error Units
buildRoaringBitmap 0.1 10000 avgt 5 54.263 3.393 us/op
buildRoaringBitmap 0.1 100000 avgt 5 355.188 15.234 us/op
buildRoaringBitmap 0.1 1000000 avgt 5 3567.839 135.149 us/op
buildRoaringBitmap 0.1 10000000 avgt 5 31982.046 1227.325 us/op
buildRoaringBitmap 0.5 10000 avgt 5 53.855 0.887 us/op
buildRoaringBitmap 0.5 100000 avgt 5 357.671 14.111 us/op
buildRoaringBitmap 0.5 1000000 avgt 5 3556.152 243.671 us/op
buildRoaringBitmap 0.5 10000000 avgt 5 34385.971 3864.143 us/op
buildRoaringBitmap 0.9 10000 avgt 5 59.354 10.385 us/op
buildRoaringBitmap 0.9 100000 avgt 5 374.245 54.485 us/op
buildRoaringBitmap 0.9 1000000 avgt 5 3712.684 657.964 us/op
buildRoaringBitmap 0.9 10000000 avgt 5 37223.976 4691.297 us/op
incrementalNativeAdd 0.1 10000 avgt 5 115.213 31.909 us/op
incrementalNativeAdd 0.1 100000 avgt 5 911.925 127.922 us/op
incrementalNativeAdd 0.1 1000000 avgt 5 8889.49 320.821 us/op
incrementalNativeAdd 0.1 10000000 avgt 5 102819.877 14247.868 us/op
incrementalNativeAdd 0.5 10000 avgt 5 116.878 28.232 us/op
incrementalNativeAdd 0.5 100000 avgt 5 947.076 128.255 us/op
incrementalNativeAdd 0.5 1000000 avgt 5 7190.443 202.012 us/op
incrementalNativeAdd 0.5 10000000 avgt 5 98843.303 4325.924 us/op
incrementalNativeAdd 0.9 10000 avgt 5 101.694 6.579 us/op
incrementalNativeAdd 0.9 100000 avgt 5 816.411 65.678 us/op
incrementalNativeAdd 0.9 1000000 avgt 5 9114.624 412.152 us/op
incrementalNativeAdd 0.9 10000000 avgt 5 108793.694 22562.527 us/op
incrementalUseOrderedWriter 0.1 10000 avgt 5 23.573 5.962 us/op
incrementalUseOrderedWriter 0.1 100000 avgt 5 289.588 36.814 us/op
incrementalUseOrderedWriter 0.1 1000000 avgt 5 2785.659 49.385 us/op
incrementalUseOrderedWriter 0.1 10000000 avgt 5 29489.758 2601.39 us/op
incrementalUseOrderedWriter 0.5 10000 avgt 5 23.57 1.536 us/op
incrementalUseOrderedWriter 0.5 100000 avgt 5 276.488 9.662 us/op
incrementalUseOrderedWriter 0.5 1000000 avgt 5 2799.408 198.77 us/op
incrementalUseOrderedWriter 0.5 10000000 avgt 5 28313.626 1976.042 us/op
incrementalUseOrderedWriter 0.9 10000 avgt 5 22.345 1.574 us/op
incrementalUseOrderedWriter 0.9 100000 avgt 5 280.205 36.987 us/op
incrementalUseOrderedWriter 0.9 1000000 avgt 5 2779.732 93.456 us/op
incrementalUseOrderedWriter 0.9 10000000 avgt 5 30568.591 2140.826 us/op

These benchmarks don’t go far enough to support replacing RoaringBitmap.bitmapOf.

Unsorted Input Data

In the cases benchmarked, this approach seems to be worthwhile. I can’t actually think of a case where someone would want to build a bitmap from unsorted data, but it occurred to me that this approach might be fast enough to cover the cost of a sort. OrderedWriter is also relaxed enough that it only needs the most significant 16 bits to be monotonic, so a full sort isn’t necessary. Implementing a radix sort on the most significant 16 bits (stable in the least significant 16 bits), prior to incremental insertion via an OrderedWriter, leads to huge increases in performance over RoaringBitmap.bitmapOf. The implementation is as follows:


  public static RoaringBitmap bitmapOfUnordered(final int... data) {
    partialRadixSort(data);
    RoaringBitmap bitmap = new RoaringBitmap();
    OrderedWriter writer = new OrderedWriter(bitmap);
    for (int i : data) {
      writer.add(i);
    }
    writer.flush();
    return bitmap;
  }
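
partialRadixSort isn’t shown above; a stable counting sort keyed on the upper 16 bits does the job, something like the sketch below (the library’s version may be organised differently, but this is the contract it needs to satisfy: ascending unsigned order of the upper 16 bits, order preserved within each key).


  // Sketch only: a stable counting sort on the upper 16 bits, which is all OrderedWriter requires.
  static void partialRadixSort(int[] data) {
    int[] offsets = new int[1 << 16];
    for (int value : data) {
      offsets[value >>> 16]++;
    }
    int cumulative = 0;                        // exclusive prefix sum: start offset per 16 bit key
    for (int i = 0; i < offsets.length; ++i) {
      int count = offsets[i];
      offsets[i] = cumulative;
      cumulative += count;
    }
    int[] sorted = new int[data.length];
    for (int value : data) {
      sorted[offsets[value >>> 16]++] = value; // stable: equal keys keep their relative order
    }
    System.arraycopy(sorted, 0, data, 0, data.length);
  }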

It did very well according to the benchmarks, even against various approaches that sort the input before calling RoaringBitmap.bitmapOf. Lower is better:

Benchmark (randomness) (size) Mode Cnt Score Error Units
bitmapOf 0.1 10000 avgt 5 1058.106 76.013 us/op
bitmapOf 0.1 100000 avgt 5 12323.905 976.68 us/op
bitmapOf 0.1 1000000 avgt 5 171812.526 9593.879 us/op
bitmapOf 0.1 10000000 avgt 5 3376296.157 170362.195 us/op
bitmapOf 0.5 10000 avgt 5 1096.663 477.795 us/op
bitmapOf 0.5 100000 avgt 5 12836.177 1674.54 us/op
bitmapOf 0.5 1000000 avgt 5 171998.126 4176 us/op
bitmapOf 0.5 10000000 avgt 5 3707804.439 974532.361 us/op
bitmapOf 0.9 10000 avgt 5 1124.881 65.673 us/op
bitmapOf 0.9 100000 avgt 5 14585.589 1894.788 us/op
bitmapOf 0.9 1000000 avgt 5 198506.813 8552.218 us/op
bitmapOf 0.9 10000000 avgt 5 3723942.934 423704.363 us/op
bitmapOfUnordered 0.1 10000 avgt 5 174.583 17.475 us/op
bitmapOfUnordered 0.1 100000 avgt 5 1768.613 86.543 us/op
bitmapOfUnordered 0.1 1000000 avgt 5 17889.705 135.714 us/op
bitmapOfUnordered 0.1 10000000 avgt 5 192645.352 6482.726 us/op
bitmapOfUnordered 0.5 10000 avgt 5 157.351 3.254 us/op
bitmapOfUnordered 0.5 100000 avgt 5 1674.919 90.138 us/op
bitmapOfUnordered 0.5 1000000 avgt 5 16900.458 778.999 us/op
bitmapOfUnordered 0.5 10000000 avgt 5 185399.32 4383.485 us/op
bitmapOfUnordered 0.9 10000 avgt 5 145.642 1.257 us/op
bitmapOfUnordered 0.9 100000 avgt 5 1515.845 82.914 us/op
bitmapOfUnordered 0.9 1000000 avgt 5 15807.597 811.048 us/op
bitmapOfUnordered 0.9 10000000 avgt 5 167863.49 3501.132 us/op
partialSortThenBitmapOf 0.1 10000 avgt 5 1060.152 168.802 us/op
partialSortThenBitmapOf 0.1 100000 avgt 5 10942.731 347.583 us/op
partialSortThenBitmapOf 0.1 1000000 avgt 5 100606.506 24705.341 us/op
partialSortThenBitmapOf 0.1 10000000 avgt 5 1035448.545 157383.713 us/op
partialSortThenBitmapOf 0.5 10000 avgt 5 1029.883 100.291 us/op
partialSortThenBitmapOf 0.5 100000 avgt 5 10472.509 832.719 us/op
partialSortThenBitmapOf 0.5 1000000 avgt 5 101144.032 16908.087 us/op
partialSortThenBitmapOf 0.5 10000000 avgt 5 958242.087 39650.946 us/op
partialSortThenBitmapOf 0.9 10000 avgt 5 1008.413 70.999 us/op
partialSortThenBitmapOf 0.9 100000 avgt 5 10458.34 600.416 us/op
partialSortThenBitmapOf 0.9 1000000 avgt 5 103945.644 2026.26 us/op
partialSortThenBitmapOf 0.9 10000000 avgt 5 1065638.269 102257.059 us/op
setupCost 0.1 10000 avgt 5 6.577 0.121 us/op
setupCost 0.1 100000 avgt 5 61.378 24.113 us/op
setupCost 0.1 1000000 avgt 5 1021.588 536.68 us/op
setupCost 0.1 10000000 avgt 5 13182.341 196.773 us/op
setupCost 0.5 10000 avgt 5 7.139 2.216 us/op
setupCost 0.5 100000 avgt 5 60.847 23.395 us/op
setupCost 0.5 1000000 avgt 5 800.888 14.711 us/op
setupCost 0.5 10000000 avgt 5 13431.625 553.44 us/op
setupCost 0.9 10000 avgt 5 6.599 0.09 us/op
setupCost 0.9 100000 avgt 5 60.946 22.511 us/op
setupCost 0.9 1000000 avgt 5 813.445 4.896 us/op
setupCost 0.9 10000000 avgt 5 13374.943 349.314 us/op
sortThenBitmapOf 0.1 10000 avgt 5 636.23 13.423 us/op
sortThenBitmapOf 0.1 100000 avgt 5 7411.756 174.264 us/op
sortThenBitmapOf 0.1 1000000 avgt 5 92299.305 3651.161 us/op
sortThenBitmapOf 0.1 10000000 avgt 5 1096374.443 162575.234 us/op
sortThenBitmapOf 0.5 10000 avgt 5 634.957 47.447 us/op
sortThenBitmapOf 0.5 100000 avgt 5 7939.074 409.328 us/op
sortThenBitmapOf 0.5 1000000 avgt 5 93505.427 5409.749 us/op
sortThenBitmapOf 0.5 10000000 avgt 5 1147933.592 57485.51 us/op
sortThenBitmapOf 0.9 10000 avgt 5 661.072 6.717 us/op
sortThenBitmapOf 0.9 100000 avgt 5 7915.506 356.148 us/op
sortThenBitmapOf 0.9 1000000 avgt 5 93403.343 5454.583 us/op
sortThenBitmapOf 0.9 10000000 avgt 5 1095960.734 85753.917 us/op

It looks like there are good performance gains available here, but these things tend to depend on particular data sets. I would be interested in hearing from anyone who has tried to use this class in a real application.