Data Driven Logic

I really don’t like reading or writing blocks of if-else statements. They make my eyes glaze over. Rumour has it that processors don’t like executing them either, though that’s less true now than it once was. There are two problems with these blocks of statements, and neither one of them is performance:

  1. They are hard to read and tend to have subtle dependencies on line order.
  2. They can’t be treated as data, and can’t be executed remotely unless you do something weird like serialise code.

Since I started programming in Java, I have been aware of the existence of rule engines, but I have never heard of a single case of “soft coding” working out. In my own experience, every time I have been involved in the implementation of a system with a DSL to empower business analysts to control the business logic, there has been low stakeholder participation during the design of the DSL and developers have ended up writing the business logic anyway. The most excruciating aspect of this is that it dilutes accountability for testing by blurring the boundaries between the application and user input. Perhaps your experience differs. However, rule engines can eradicate cyclomatic complexity in application code, and systems consisting of straight line code (with high test coverage) tend to do what they are supposed to. Soft coding isn’t the appeal of rule engines, getting rid of the if-else blocks is. If you squint at rule engines the right way, they look data driven and they start to get exciting. I can’t see value in anything more complicated than a decision table.

You can represent a block of if-else statements as a decision table by considering every possible branch as a line in the table. Your decision table doesn’t need to be exhaustive: there can be cases where you fall through and throw an exception or choose a default. This can be quite hard to write in imperative code, and you may need to throw the same exception in multiple places, set flags, or otherwise rely on line order to achieve this. Decision tables have a really nice property: if you want to start treating certain cases as exceptional, you just delete the line from the table.

Decision tables are very similar in character to case classes in Scala, or to the weaker when expressions present in Kotlin, but decision tables can be allowed to grow much larger. I wouldn’t allow a match expression with 50,000 cases through a code review even if someone had the energy to write one deviously enough to come in under the maximum byte code method size.

I looked at several implementations of decision tables on GitHub and saw a lot of clean code, but not a lot of textbook computer science. Every single implementation iterated through a list of rules checking the rule against the input data. I have implemented a password strength checker like this in the past (I know! I probably shouldn’t have done this myself!) which is fine because a strong password checker might have at most a dozen rules. What if you work in adtech and want to classify the people you track (how do you sleep at night?) as members of one or many of 50,000 clusters which can be described in terms of regions of, say, a 50 dimensional feature space? Imagine your task is to guess which cluster your quarry belongs to in a few microseconds at most. You won’t get far if you iterate through thousands of rules.

I implemented a small library in the evenings over the last couple of weeks called bitrules. This was based on some ideas I had about using RoaringBitmap for decision tables last year. The idea is very simple: think of a list of rules with constrained attributes as a matrix, and transpose that matrix and loop through the attributes during classification. This is a similar approach to that taken in blocked signatures, a search technique used in BitFunnel, which translates an expensive signature scan to a random access. In the case of bitrules, for each constraint on each attribute, bits are removed from a mask of potentially matching rules. These bitsets are intersected sequentially, resulting in a bitset rapidly decreasing in cardinality. Because I used RoaringBitmap, rapid reduction in cardinality means a rapid reduction in size, which means cache friendliness. There are a few tricks in the code like using range encoding for range attributes, so that range queries can be evaluated with a single bitset intersection. I plan to implement a hopefully efficient serialisation format so the table can be sent to another server and used for classification remotely.

I don’t actually know how fast this code is: performance is context sensitive, and I shy away from making “performance measurements”. It’s best suited to cases where there are a large number of rules (thousands) and I bet it’s really fast when there are ~50,000 segments in a ~50 dimensional space. I don’t even have a use case right now for bitrules: it was just fun writing the code. I have started releasing it to maven central, while I can’t guarantee its fitness for purpose, perhaps it may be of some use to someone else.

Parallel Bitmap Aggregation

A bitmap index represents predicates over records as sets consisting of the integer identities of each record satisfying each predicate. This representation is actually a few decades out of date, and systems like Pilosa use much more sophisticated data structures, and Sybase had even more on offer back in the 90s. But the chances are, if you’ve rolled your own bitmap index, you’ve used equality encoding and have a bitmap per indexed predicate. RoaringBitmap is a great choice for the bitmaps used in this kind of data structure, because it offers a good tradeoff between bitmap compression and performance. It’s also succinct, that is, you don’t need to decompress the structure in order to operate on it. With the naive index structure described, it’s likely that you have many bitmaps to aggregate (union, intersection, difference, and so on) when you want to query your index.

RoaringBitmap provides a class FastAggregation for aggregations, and the method FastAggregation.and is incredibly fast, particularly given its apparent simplicity. This reflects a nice property of set intersection, that the size of the intersection cannot increase and tends to get smaller as the number of sets increases. Unions and differences are different: the problem size tends to increase in magnitude as the number of contributing sets increases. While FastAggregation.or and FastAggregation.xor are highly optimised, not a lot can be done about the fact each additional set makes the problem bigger. So it may be worth throwing some threads at the problem, and this gets more attractive as you add more dimensions to your index. You can, of course, completely bypass this need by reading some database research and sharing bitmaps between overlapping predicates.

I implemented the class ParallelAggregation in RoaringBitmap, but I’m not convinced the technique used performs as well as it could do. RoaringBitmap stores the 16 bit prefix of each integer in a sorted array, with the rest of each integer in that 16 bit range stored in a container at the same index in another array. This makes the structure very easy to split. The implementation I worked on seeks to exploit this by grouping all the containers by common key as a SortedMap<Short, List<Container>> before executing each reduction (i.e. Function<List<Container>, Container>) in parallel in a ForkJoinPool. This results in a reasonable speedup of 2x-6.5x compared to FastAggregation on an 8 core machine, but it uses quite a lot of temporary memory just to set the problem up. I don’t think it should be possible to beat this approach without grouping the containers by key somehow, but I suspect there are lighter weight approaches which use less memory and give better throughput. Perhaps this would be an interesting problem to work on?

Iterating Over a Bitset in Java

How fast can you iterate over a bitset? Daniel Lemire published a benchmark recently in support of a strategy using the number of trailing zeroes to skip over empty bits. I have used the same technique in Java several times in my hobby project SplitMap and this is something I am keen to optimise. I think that the best strategy depends on what you want to do with the set bits, and how sparse and uniformly distributed they are. I argue that the cost of iteration is less important than the constraints your API imposes on the caller, and whether the caller is free to exploit patterns in the data.

C2 Generates Good Code

If you think C++ is much faster than Java, you either don’t know much about Java or do lots of floating point arithmetic. This isn’t about benchmarking C++ against Java, but comparing the compilation outputs for a C++ implementation and a Java implementation shows that there won’t be much difference if your Java method gets hot. Only the time to performance will differ, and this is amortised over the lifetime of an application. The trailing zeroes implementation is probably the fastest technique in Java as well as in C++, but that is to ignore the optimisations you can’t apply to the callback if you use it too literally.

Compiling this C++ function with GCC yields the snippet of assembly taken from the loop kernel:


template <typename CALLBACK>
static void for_each(const long* bitmap, const int size, const CALLBACK& callback) {
    for (size_t k = 0; k < size; ++k) {
        long bitset = bitmap[k];
        while (bitset != 0) {
            callback((k * 64) + __builtin_ctzl(bitset));
            bitset ^= (bitset & -bitset);
        }
    }
}

The instruction tzcntl calculates the next set bit and blsr switches it off.


.L99:
  movq  %rdi, %rcx
  blsr  %ebx, %ebx
  call  _ZNSo3putEc
  movq  %rax, %rcx
  call  _ZNSo5flushEv
  testl  %ebx, %ebx
  je  .L96
.L100:
  xorl  %edx, %edx
  movq  %r12, %rcx
  tzcntl  %ebx, %edx
  addl  %ebp, %edx
  call  _ZNSolsEi
  movq  %rax, %rdi
  movq  (%rax), %rax
  movq  -24(%rax), %rax
  movq  240(%rdi,%rax), %rsi
  testq  %rsi, %rsi
  je  .L108
  cmpb  $0, 56(%rsi)
  jne  .L109
  movq  %rsi, %rcx
  call  _ZNKSt5ctypeIcE13_M_widen_initEv
  movq  (%rsi), %rax
  movl  $10, %edx
  movq  48(%rax), %rax
  cmpq  %r14, %rax
  je  .L99
  movq  %rsi, %rcx
  call  *%rax
  movsbl  %al, %edx
  jmp  .L99
  .p2align 4,,10

In Java, almost identical code is generated.


public void forEach(long[] bitmap, IntConsumer consumer) {
    for (int i = 0; i < bitmap.length; ++i) {
      long word = bitmap[i];
      while (word != 0) {
        consumer.accept(Long.SIZE * i + Long.numberOfTrailingZeros(word));
        word ^= Long.lowestOneBit(word);
      }
    }
  }

The key difference is that xor and blsi haven’t been fused into blsr, so the C++ code is probably slightly faster. A lambda function accumulating the contents of an array is inlined into this loop (the add comes from an inlined lambda, but notice how little time is spent adding compared to computing the bit to switch off in this sample produced by perfasm).


   .83%    0x000002d79d366a19: tzcnt   r9,rcx
  8.53%    0x000002d79d366a1e: add     r9d,ebx
  0.42%    0x000002d79d366a21: cmp     r9d,r8d
  0.00%    0x000002d79d366a24: jnb     2d79d366a4dh
  0.62%    0x000002d79d366a26: add     r10d,dword ptr [rdi+r9*4+10h]
 16.22%    0x000002d79d366a2b: vmovq   r11,xmm4
  6.68%    0x000002d79d366a30: mov     dword ptr [r11+10h],r10d
 27.92%    0x000002d79d366a34: blsi    r10,rcx
  0.55%    0x000002d79d366a39: xor     rcx,r10         
  0.10%    0x000002d79d366a3c: mov     r11,qword ptr [r15+70h]  

It’s this Java code, and its impact on which optimisations can be applied to the IntConsumer that this post focuses on. There are different principles, particularly related to inlining and vectorisation opportunities in C++, but this blog is about Java. Depending on what your callback does, you get different benchmark results and you should make different choices about how to do the iteration: you just can’t assess this in isolation.

Special Casing -1

Imagine you have an int[] containing data, and you are iterating over a mask or materialised predicate over that data. For each set bit, you want to add the corresponding entry in the array to a sum. In Java, that looks like this (you’ve already seen the generated assembly above):


  @Benchmark
  public int reduce() {
    int[] result = new int[1];
    forEach(bitmap, i -> result[0] += data[i]);
    return result[0];
  }

How fast can this get? It obviously depends on how full the bitset is. The worst case would be that it’s completely full, and it couldn’t get much better than if only one bit per word were set. The difference is noticeable, but scales by a factor less than the number of bits:

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: scenario
reduce thrpt 1 10 7.435909 0.017491 ops/ms FULL
reduce thrpt 1 10 260.305307 6.081961 ops/ms ONE_BIT_PER_WORD

But the important code here, the callback itself, is stuck at entry level compilation. There is no unrolling, no vectorisation, the adds can’t be pipelined because there is a data dependency on blsi and xor. We can do much better in some cases, and not much worse in others, just by treating -1 as a special case, profiting from optimisations that can now be applied inside the callback. Passing a different callback which consumes whole words costs a branch, but it’s often worth it. Here’s the iterator now:


  interface WordConsumer {
    void acceptWord(int wordIndex, long word);
  }

  public void forEach(long[] bitmap, IntConsumer intConsumer, WordConsumer wordConsumer) {
    for (int i = 0; i < bitmap.length; ++i) {
      long word = bitmap[i];
      if (word == -1L) {
        wordConsumer.acceptWord(i, word);
      } else {
        while (word != 0) {
          intConsumer.accept(Long.SIZE * i + Long.numberOfTrailingZeros(word));
          word ^= Long.lowestOneBit(word);
        }
      }
    }
  }

  @Benchmark
  public int reduceWithWordConsumer() {
    int[] result = new int[1];
    forEach(bitmap, i -> result[0] += data[i], (index, word) -> {
      if (word != -1L) {
        throw new IllegalStateException();
      }
      int sum = 0;
      for (int i = index * Long.SIZE; i < (index + 1) * Long.SIZE; ++i) {
        sum += data[i];
      }
      result[0] += sum;
    });
    return result[0];
  }

This really pays off when the bitset is full, but having that extra branch does seem to cost something even though it is never taken, whereas the full case improves 6x.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: scenario
reduce thrpt 1 10 7.401202 0.118648 ops/ms FULL
reduce thrpt 1 10 261.682016 4.155856 ops/ms ONE_BIT_PER_WORD
reduceWithWordConsumer thrpt 1 10 43.972759 0.993264 ops/ms FULL
reduceWithWordConsumer thrpt 1 10 222.824868 4.877147 ops/ms ONE_BIT_PER_WORD

We still don’t actually know the cost of the branch when it’s taken every now and then. To estimate it, we need a new scenario (or new scenarios) which mix full and sparse words. As you might expect, having the WordConsumer is great when one word in every few is full: the fast path is so much faster, it practically skips the word.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: scenario
reduce thrpt 1 10 157.358633 4.538679 ops/ms SPARSE_16_FULL_WORDS
reduceWithWordConsumer thrpt 1 10 257.041035 7.446404 ops/ms SPARSE_16_FULL_WORDS

So in this scenario, the branch has paid for itself. How? The data dependency has been removed with a countable loop. Here’s the perfasm output. Notice two things: long runs of add instructions, and the vastly reduced percentage against blsi. The time is now spent adding numbers up, not switching off least significant bits. This feels like progress.


  0.05%    0x000001dd5b35af03: add     ebx,dword ptr [rdi+r9*4+10h]
  0.31%    0x000001dd5b35af08: add     ebx,dword ptr [rdi+r11*4+14h]
  0.32%    0x000001dd5b35af0d: add     ebx,dword ptr [rdi+r11*4+18h]
  0.33%    0x000001dd5b35af12: add     ebx,dword ptr [rdi+r11*4+1ch]
  0.37%    0x000001dd5b35af17: add     ebx,dword ptr [rdi+r11*4+20h]
  0.34%    0x000001dd5b35af1c: add     ebx,dword ptr [rdi+r11*4+24h]
  0.39%    0x000001dd5b35af21: add     ebx,dword ptr [rdi+r11*4+28h]
  0.36%    0x000001dd5b35af26: add     ebx,dword ptr [rdi+r11*4+2ch]
  0.34%    0x000001dd5b35af2b: add     ebx,dword ptr [rdi+r11*4+30h]
  0.35%    0x000001dd5b35af30: add     ebx,dword ptr [rdi+r11*4+34h]
  0.38%    0x000001dd5b35af35: add     ebx,dword ptr [rdi+r11*4+38h]
  0.36%    0x000001dd5b35af3a: add     ebx,dword ptr [rdi+r11*4+3ch]
  0.49%    0x000001dd5b35af3f: add     ebx,dword ptr [rdi+r11*4+40h]
  0.39%    0x000001dd5b35af44: add     ebx,dword ptr [rdi+r11*4+44h]
  0.42%    0x000001dd5b35af49: add     ebx,dword ptr [rdi+r11*4+48h]
  0.39%    0x000001dd5b35af4e: add     ebx,dword ptr [rdi+r11*4+4ch]
...
  2.39%    0x000001dd5b35afe9: tzcnt   r11,rbx
  2.65%    0x000001dd5b35afee: add     r11d,r10d         
  2.15%    0x000001dd5b35aff1: cmp     r11d,r9d
  0.00%    0x000001dd5b35aff4: jnb     1dd5b35b04dh
  2.29%    0x000001dd5b35aff6: add     r8d,dword ptr [rdi+r11*4+10h]
 11.03%    0x000001dd5b35affb: vmovq   r11,xmm0
  2.45%    0x000001dd5b35b000: mov     dword ptr [r11+10h],r8d  
  3.14%    0x000001dd5b35b004: mov     r11,qword ptr [r15+70h]
  2.18%    0x000001dd5b35b008: blsi    r8,rbx
  2.23%    0x000001dd5b35b00d: xor     rbx,r8

Heroically ploughing through the full words tells a different story: blsi is up at 11%. This indicates more time is spent iterating rather than evaluating the callback.


  6.98%    0x0000019f106c6799: tzcnt   r9,rdi
  3.47%    0x0000019f106c679e: add     r9d,ebx           
  1.65%    0x0000019f106c67a1: cmp     r9d,r10d
           0x0000019f106c67a4: jnb     19f106c67cdh
  1.67%    0x0000019f106c67a6: add     r11d,dword ptr [r8+r9*4+10h]
 11.45%    0x0000019f106c67ab: vmovq   r9,xmm2
  3.20%    0x0000019f106c67b0: mov     dword ptr [r9+10h],r11d  
 11.31%    0x0000019f106c67b4: blsi    r11,rdi
  1.71%    0x0000019f106c67b9: xor     rdi,r11           

This shows the cost of a data dependency in a loop. The operation we want to perform is associative, so we could even vectorise this. In C++ that might happen automatically, or could be ensured with intrinsics, but C2 has various heuristics: it won’t try to vectorise a simple reduction, and 64 would probably be on the short side for most cases it would try to vectorise.

Acknowledging Runs

You might be tempted to transfer even more control to the callback, by accumulating runs and then calling the callback once per run. It simplifies the code to exclude incomplete start and end words from the run.


private interface RunConsumer {
    void acceptRun(int start, int end);
  }

  public void forEach(long[] bitmap, IntConsumer intConsumer, RunConsumer runConsumer) {
    int runStart = -1;
    for (int i = 0; i < bitmap.length; ++i) {
      long word = bitmap[i];
      if (word == -1L) {
        if (runStart == -1) {
          runStart = i;
        }
      } else {
        if (runStart != -1) {
          runConsumer.acceptRun(runStart * Long.SIZE, i * Long.SIZE);
          runStart = -1;
        }
        while (word != 0) {
          intConsumer.accept(Long.SIZE * i + Long.numberOfTrailingZeros(word));
          word ^= Long.lowestOneBit(word);
        }
      }
    }
    if (runStart != -1) {
      runConsumer.acceptRun(runStart * Long.SIZE, bitmap.length * Long.SIZE);
    }
  }

For a simple reduction, the extra complexity isn’t justified: you’re better off with the WordIterator.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: scenario
reduce thrpt 1 10 160.502749 2.960568 ops/ms SPARSE_16_FULL_WORDS
reduce thrpt 1 10 7.294747 0.186678 ops/ms FULL
reduce thrpt 1 10 258.064511 8.902233 ops/ms ONE_BIT_PER_WORD
reduce thrpt 1 10 159.613877 3.424432 ops/ms SPARSE_1_16_WORD_RUN
reduceWithRunConsumer thrpt 1 10 251.683131 6.799639 ops/ms SPARSE_16_FULL_WORDS
reduceWithRunConsumer thrpt 1 10 37.809154 0.723198 ops/ms FULL
reduceWithRunConsumer thrpt 1 10 218.133560 13.756779 ops/ms ONE_BIT_PER_WORD
reduceWithRunConsumer thrpt 1 10 140.896826 8.495777 ops/ms SPARSE_1_16_WORD_RUN
reduceWithWordConsumer thrpt 1 10 257.961783 5.892072 ops/ms SPARSE_16_FULL_WORDS
reduceWithWordConsumer thrpt 1 10 43.909471 0.601319 ops/ms FULL
reduceWithWordConsumer thrpt 1 10 213.731758 20.398077 ops/ms ONE_BIT_PER_WORD
reduceWithWordConsumer thrpt 1 10 258.280428 11.316647 ops/ms SPARSE_1_16_WORD_RUN

It’s simplistic to measure this and conclude that this is a bad approach though. There are several other dimensions to this problem:

  1. Vectorised callbacks
  2. Inlining failures preventing optimisations
  3. The number of runs and their lengths (i.e. your data and how you structure it)

Vectorisable Callbacks

There are real benefits to batching up callbacks if the workload in the callback can be vectorised. The code doesn’t need to get much more complicated to start benefitting from larger iteration batches. Mapping each bit to a scaled and squared value from the data array and storing it into an output array illustrates this.


  @Benchmark
  public void map(Blackhole bh) {
    forEach(bitmap, i -> output[i] = data[i] * data[i] * factor);
    bh.consume(output);
  }

  @Benchmark
  public void mapWithWordConsumer(Blackhole bh) {
    forEach(bitmap, i -> output[0] = data[i] * factor, (WordConsumer) (index, word) -> {
      if (word != -1L) {
        throw new IllegalStateException();
      }
      for (int i = index * Long.SIZE; i < (index + 1) * Long.SIZE; ++i) {
        output[i] = data[i] * data[i] * factor;
      }
    });
    bh.consume(output);
  }

  @Benchmark
  public void mapWithRunConsumer(Blackhole bh) {
    forEach(bitmap, i -> output[0] = data[i] * factor, (RunConsumer) (start, end) -> {
      for (int i = start; i < end; ++i) {
        output[i] = data[i] * data[i] * factor;
      }
    });
    bh.consume(output);
  }

The RunConsumer does much better in the full case, never much worse than the WordConsumer and always better than the basic strategy – even when there is only one run in the entire bitset, or when there are a few full words in an otherwise sparse bitset.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: scenario
map thrpt 1 10 127.876662 3.411741 ops/ms SPARSE_16_FULL_WORDS
map thrpt 1 10 10.598974 0.022404 ops/ms FULL
map thrpt 1 10 126.434666 18.608547 ops/ms ONE_BIT_PER_WORD
map thrpt 1 10 115.977840 20.449258 ops/ms SPARSE_1_16_WORD_RUN
mapWithRunConsumer thrpt 1 10 199.186167 8.138446 ops/ms SPARSE_16_FULL_WORDS
mapWithRunConsumer thrpt 1 10 64.230868 2.871434 ops/ms FULL
mapWithRunConsumer thrpt 1 10 219.963063 4.257561 ops/ms ONE_BIT_PER_WORD
mapWithRunConsumer thrpt 1 10 203.403804 6.907366 ops/ms SPARSE_1_16_WORD_RUN
mapWithWordConsumer thrpt 1 10 229.822235 5.276084 ops/ms SPARSE_16_FULL_WORDS
mapWithWordConsumer thrpt 1 10 48.381990 3.845642 ops/ms FULL
mapWithWordConsumer thrpt 1 10 218.907803 5.331011 ops/ms ONE_BIT_PER_WORD
mapWithWordConsumer thrpt 1 10 240.795280 10.204818 ops/ms SPARSE_1_16_WORD_RUN

This is simply because the callback was vectorised, and the style of the RunConsumer API allows this to be exploited. This can be seen with perfasm. Both the WordConsumer and RunConsumer are actually vectorised, but the thing to notice is that there are two hot regions in the WordConsumer benchmark: the iteration and the callback, this boundary is often crossed. On the other hand, the RunConsumer implementation spends most of its time in the callback.

WordConsumer


....[Hottest Region 1]..............................................................................
c2, com.openkappa.simd.iterate.generated.BitSetIterator_mapWithWordConsumer_jmhTest::mapWithWordConsumer_thrpt_jmhStub, version 172 (227 bytes) 
...
  1.55%    0x000001c2aa13c790: vmovdqu ymm1,ymmword ptr [r9+r10*4+10h]
  0.15%    0x000001c2aa13c797: vpmulld ymm1,ymm1,ymm1
  3.72%    0x000001c2aa13c79c: vpmulld ymm1,ymm1,ymm2
 16.02%    0x000001c2aa13c7a1: vmovdqu ymmword ptr [rdx+r10*4+10h],ymm1
  1.69%    0x000001c2aa13c7a8: movsxd  r8,r10d
  1.55%    0x000001c2aa13c7ab: vmovdqu ymm1,ymmword ptr [r9+r8*4+30h]
  1.46%    0x000001c2aa13c7b2: vpmulld ymm1,ymm1,ymm1
  1.71%    0x000001c2aa13c7b7: vpmulld ymm1,ymm1,ymm2
  3.20%    0x000001c2aa13c7bc: vmovdqu ymmword ptr [rdx+r8*4+30h],ymm1
  0.07%    0x000001c2aa13c7c3: add     r10d,10h          
  1.70%    0x000001c2aa13c7c7: cmp     r10d,r11d
           0x000001c2aa13c7ca: jl      1c2aa13c790h      
  0.02%    0x000001c2aa13c7cc: mov     r8,qword ptr [r15+70h]  
  1.50%    0x000001c2aa13c7d0: test    dword ptr [r8],eax  
  0.04%    0x000001c2aa13c7d3: cmp     r10d,r11d
           0x000001c2aa13c7d6: jl      1c2aa13c78ah
  0.05%    0x000001c2aa13c7d8: mov     r11d,dword ptr [rsp+5ch]
  0.02%    0x000001c2aa13c7dd: add     r11d,39h
  1.57%    0x000001c2aa13c7e1: mov     r8d,ecx
  0.02%    0x000001c2aa13c7e4: cmp     r8d,r11d
  0.06%    0x000001c2aa13c7e7: mov     ecx,80000000h
  0.02%    0x000001c2aa13c7ec: cmovl   r11d,ecx
  1.50%    0x000001c2aa13c7f0: cmp     r10d,r11d
           0x000001c2aa13c7f3: jnl     1c2aa13c819h
  0.02%    0x000001c2aa13c7f5: nop                       
  0.06%    0x000001c2aa13c7f8: vmovdqu ymm1,ymmword ptr [r9+r10*4+10h]
  0.21%    0x000001c2aa13c7ff: vpmulld ymm1,ymm1,ymm1
  2.16%    0x000001c2aa13c804: vpmulld ymm1,ymm1,ymm2
  1.80%    0x000001c2aa13c809: vmovdqu ymmword ptr [rdx+r10*4+10h],ymm1
...
 53.26%  <total for region 1>

RunConsumer


....[Hottest Region 1]..............................................................................
c2, com.openkappa.simd.iterate.BitSetIterator$$Lambda$44.1209658195::acceptRun, version 166 (816 bytes) 
...
  0.92%    0x0000016658954860: vmovdqu ymm0,ymmword ptr [rdx+r8*4+10h]
  1.31%    0x0000016658954867: vpmulld ymm0,ymm0,ymm0
  1.74%    0x000001665895486c: vpmulld ymm0,ymm0,ymm1
  4.55%    0x0000016658954871: vmovdqu ymmword ptr [rdi+r8*4+10h],ymm0
  0.69%    0x0000016658954878: movsxd  rcx,r8d
  0.01%    0x000001665895487b: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+30h]
  0.41%    0x0000016658954881: vpmulld ymm0,ymm0,ymm0
  0.78%    0x0000016658954886: vpmulld ymm0,ymm0,ymm1
  0.83%    0x000001665895488b: vmovdqu ymmword ptr [rdi+rcx*4+30h],ymm0
  0.25%    0x0000016658954891: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+50h]
  1.29%    0x0000016658954897: vpmulld ymm0,ymm0,ymm0
  1.51%    0x000001665895489c: vpmulld ymm0,ymm0,ymm1
  3.65%    0x00000166589548a1: vmovdqu ymmword ptr [rdi+rcx*4+50h],ymm0
  0.54%    0x00000166589548a7: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+70h]
  0.31%    0x00000166589548ad: vpmulld ymm0,ymm0,ymm0
  0.47%    0x00000166589548b2: vpmulld ymm0,ymm0,ymm1
  1.11%    0x00000166589548b7: vmovdqu ymmword ptr [rdi+rcx*4+70h],ymm0
  0.28%    0x00000166589548bd: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+90h]
  1.17%    0x00000166589548c6: vpmulld ymm0,ymm0,ymm0
  1.89%    0x00000166589548cb: vpmulld ymm0,ymm0,ymm1
  3.56%    0x00000166589548d0: vmovdqu ymmword ptr [rdi+rcx*4+90h],ymm0
  0.73%    0x00000166589548d9: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+0b0h]
  0.21%    0x00000166589548e2: vpmulld ymm0,ymm0,ymm0
  0.34%    0x00000166589548e7: vpmulld ymm0,ymm0,ymm1
  1.29%    0x00000166589548ec: vmovdqu ymmword ptr [rdi+rcx*4+0b0h],ymm0
  0.33%    0x00000166589548f5: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+0d0h]
  0.97%    0x00000166589548fe: vpmulld ymm0,ymm0,ymm0
  1.90%    0x0000016658954903: vpmulld ymm0,ymm0,ymm1
  3.59%    0x0000016658954908: vmovdqu ymmword ptr [rdi+rcx*4+0d0h],ymm0
  0.82%    0x0000016658954911: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+0f0h]
  0.18%    0x000001665895491a: vpmulld ymm0,ymm0,ymm0
  0.29%    0x000001665895491f: vpmulld ymm0,ymm0,ymm1
  1.25%    0x0000016658954924: vmovdqu ymmword ptr [rdi+rcx*4+0f0h],ymm0
  0.33%    0x000001665895492d: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+110h]
  1.10%    0x0000016658954936: vpmulld ymm0,ymm0,ymm0
  2.11%    0x000001665895493b: vpmulld ymm0,ymm0,ymm1
  3.67%    0x0000016658954940: vmovdqu ymmword ptr [rdi+rcx*4+110h],ymm0
  0.93%    0x0000016658954949: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+130h]
  0.13%    0x0000016658954952: vpmulld ymm0,ymm0,ymm0
  0.25%    0x0000016658954957: vpmulld ymm0,ymm0,ymm1
  1.35%    0x000001665895495c: vmovdqu ymmword ptr [rdi+rcx*4+130h],ymm0
  0.32%    0x0000016658954965: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+150h]
  0.93%    0x000001665895496e: vpmulld ymm0,ymm0,ymm0
  2.16%    0x0000016658954973: vpmulld ymm0,ymm0,ymm1
  3.73%    0x0000016658954978: vmovdqu ymmword ptr [rdi+rcx*4+150h],ymm0
  0.95%    0x0000016658954981: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+170h]
  0.14%    0x000001665895498a: vpmulld ymm0,ymm0,ymm0
  0.21%    0x000001665895498f: vpmulld ymm0,ymm0,ymm1
  1.39%    0x0000016658954994: vmovdqu ymmword ptr [rdi+rcx*4+170h],ymm0
  0.29%    0x000001665895499d: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+190h]
  1.42%    0x00000166589549a6: vpmulld ymm0,ymm0,ymm0
  2.61%    0x00000166589549ab: vpmulld ymm0,ymm0,ymm1
  4.42%    0x00000166589549b0: vmovdqu ymmword ptr [rdi+rcx*4+190h],ymm0
  1.01%    0x00000166589549b9: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+1b0h]
  0.10%    0x00000166589549c2: vpmulld ymm0,ymm0,ymm0
  0.17%    0x00000166589549c7: vpmulld ymm0,ymm0,ymm1
  1.46%    0x00000166589549cc: vmovdqu ymmword ptr [rdi+rcx*4+1b0h],ymm0
  0.27%    0x00000166589549d5: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+1d0h]
 13.60%    0x00000166589549de: vpmulld ymm0,ymm0,ymm0
  3.51%    0x00000166589549e3: vpmulld ymm0,ymm0,ymm1
  4.69%    0x00000166589549e8: vmovdqu ymmword ptr [rdi+rcx*4+1d0h],ymm0
  1.00%    0x00000166589549f1: vmovdqu ymm0,ymmword ptr [rdx+rcx*4+1f0h]
  0.11%    0x00000166589549fa: vpmulld ymm0,ymm0,ymm0
  0.15%    0x00000166589549ff: vpmulld ymm0,ymm0,ymm1
  1.46%    0x0000016658954a04: vmovdqu ymmword ptr [rdi+rcx*4+1f0h],ymm0
                                                         
  0.26%    0x0000016658954a0d: add     r8d,80h           
  0.01%    0x0000016658954a14: cmp     r8d,r10d
           0x0000016658954a17: jl      16658954860h      
  0.00%    0x0000016658954a1d: mov     r14,qword ptr [r15+70h]  
  0.06%    0x0000016658954a21: test    dword ptr [r14],eax  
  0.17%    0x0000016658954a24: cmp     r8d,r10d
           0x0000016658954a27: jl      16658954860h
           0x0000016658954a2d: mov     r10d,r9d
           0x0000016658954a30: add     r10d,0fffffff9h
           0x0000016658954a34: cmp     r9d,r10d
  0.00%    0x0000016658954a37: cmovl   r10d,ebx
           0x0000016658954a3b: cmp     r8d,r10d
           0x0000016658954a3e: jnl     16658954a61h      
           0x0000016658954a40: vmovdqu ymm0,ymmword ptr [rdx+r8*4+10h]
  0.14%    0x0000016658954a47: vpmulld ymm0,ymm0,ymm0
  0.05%    0x0000016658954a4c: vpmulld ymm0,ymm0,ymm1
  0.03%    0x0000016658954a51: vmovdqu ymmword ptr [rdi+r8*4+10h],ymm0
...
 96.10%  <total for region 1>

Inlining

So far, everything has been inlined. Java optimistically assumes you only have one implementation and aggressively inlines at first, deoptimising to add a branch when it sees a second implementation, deoptimising again and replacing with a virtual call if it sees a third implementation. This doesn’t matter much usually, but the cost of this not only dwarfs any savings in an iteration strategy; it also prevents various optimisations which can be applied if the code is inlined. Once again, passing a batch of work into the callback completely ameliorates this, because even if the call is virtual, the callback itself might be hot and aggressively optimised. I haven’t benchmarked this because I think the point is self-evident to anyone who would read this far.

Number of runs

It’s clear to see from the benchmark results that the best choice of iteration strategy is sensitive to what you want to do with the data, but also how it is arranged. It is well documented in database literature that real data sets tend to contain runs. If you are building a bitmap index on some attribute of your data, and you sort your data by that attribute, you will have as many bitmaps as you have attribute values, and each attribute value bitmap will contain a single run. This is almost true for any index on attributes correlated with the attribute chosen for the sort order, and is completely untrue for uncorellated attributes. There are a range of iteration strategies to choose from, and the best iteration strategy for one index may not be the best for another.

My benchmarks are available at github.

Building RoaringBitmaps from Streams

RoaringBitmap is a fast compressed bitset format. In the Java implementation of Roaring, it was until recently preferential to build a bitset in one go from sorted data; there were performance penalties of varying magnitude for incremental or unordered insertions. In a recent pull request, I wanted to improve incremental monotonic insertion so I could build bitmaps from streams, but sped up unsorted batch creation significantly by accident.

Incremental Ordered Insertion

If you want to build a bitmap, you can do so efficiently with the RoaringBitmap.bitmapOf factory method.


int[] data = ...
RoaringBitmap bitmap = RoaringBitmap.bitmapOf(data);

However, I often find that I want to stream integers into a bitmap. Given that the integers being inserted into a bitmap often represent indices into an array, such a stream is likely to be monotonic. You might implement this like so:


IntStream stream = ...
RoaringBitmap bitmap = new RoaringBitmap();
stream.forEach(bitmap::add);

While this is OK, it has a few inefficiencies compared to the batch creation method.

  • Indirection: the container being written to must be located on each insertion
  • Eagerness: the cardinality must be kept up to date on each insertion
  • Allocation pressure: the best container type can’t be known in advance. Choice of container may change as data is inserted, this requires allocations of new instances.

You could also collect the stream into an int[] and use the batch method, but it could be a large temporary object with obvious drawbacks.

OrderedWriter

The solution I proposed is to create a writer object (OrderedWriter) which allocates a small buffer of 8KB, to use as a bitmap large enough to cover 16 bits. The stream to bitmap code becomes:


IntStream stream = ...
RoaringBitmap bitmap = new RoaringBitmap();
OrderedWriter writer = new OrderedWriter(bitmap);
stream.forEach(writer::add);
writer.flush(); // clear the buffer out

This is implemented so that changes in key (where the most significant 16 bits of each integer is stored) trigger a flush of the buffer.


  public void add(int value) {
    short key = Util.highbits(value);
    short low = Util.lowbits(value);
    if (key != currentKey) {
      if (Util.compareUnsigned(key, currentKey) < 0) {
        throw new IllegalStateException("Must write in ascending key order");
      }
      flush();
    }
    int ulow = low & 0xFFFF;
    bitmap[(ulow >>> 6)] |= (1L << ulow);
    currentKey = key;
    dirty = true;
  }

When a flush occurs, a container type is chosen and appended to the bitmap’s prefix index.


  public void flush() {
    if (dirty) {
      RoaringArray highLowContainer = underlying.highLowContainer;
      // we check that it's safe to append since RoaringArray.append does no validation
      if (highLowContainer.size > 0) {
        short key = highLowContainer.getKeyAtIndex(highLowContainer.size - 1);
        if (Util.compareUnsigned(currentKey, key) <= 0) {
          throw new IllegalStateException("Cannot write " + currentKey + " after " + key);
        }
      }
      highLowContainer.append(currentKey, chooseBestContainer());
      clearBitmap();
      dirty = false;
    }
  }

There are significant performance advantages in this approach. There is no indirection cost, and no searches in the prefix index for containers: the writes are just buffered. The buffer is small enough to fit in cache, and containers only need to be created when the writer is flushed, which happens whenever a new key is seen, or when flush is called manually. During a flush, the cardinality can be computed in one go, the best container can be chosen, and run optimisation only has to happen once. Computing the cardinality is the only bottleneck – it requires 1024 calls to Long.bitCount which can’t be vectorised in a language like Java. It can’t be incremented on insertion without either sacrificing idempotence or incurring the cost of a membership check. After the flush, the buffer needs to be cleared, using a call to Arrays.fill which is vectorised. So, despite the cost of the buffer, this can be quite efficient.

This approach isn’t universally applicable. For instance, you must write data in ascending order of the most significant 16 bits. You must also remember to flush the writer when you’re finished: until you’ve called flush, the data in the last container may not be in the bitmap. For my particular use case, this is reasonable. However, there are times when this is not fit for purpose, such as if you are occasionally inserting values and expect them to be available to queries immediately. In general, if you don’t know when you’ll stop adding data to the bitmap, this isn’t a good fit because you won’t know when to call flush.

Benchmark

The RoaringBitmap project is a big user of JMH, and most pull requests require benchmarks as evidence of performance improvement. I benchmarked the two approaches, varying bitmap sizes and randomness (likelihood of there not being a compressible run), and was amazed to find that this approach actually beats having a sorted array and using RoaringBitmap.bitmapOf. Less surprising was beating the existing API for incremental adds (this was the goal in the first place). Lower is better:

Benchmark (randomness) (size) Mode Cnt Score Error Units
buildRoaringBitmap 0.1 10000 avgt 5 54.263 3.393 us/op
buildRoaringBitmap 0.1 100000 avgt 5 355.188 15.234 us/op
buildRoaringBitmap 0.1 1000000 avgt 5 3567.839 135.149 us/op
buildRoaringBitmap 0.1 10000000 avgt 5 31982.046 1227.325 us/op
buildRoaringBitmap 0.5 10000 avgt 5 53.855 0.887 us/op
buildRoaringBitmap 0.5 100000 avgt 5 357.671 14.111 us/op
buildRoaringBitmap 0.5 1000000 avgt 5 3556.152 243.671 us/op
buildRoaringBitmap 0.5 10000000 avgt 5 34385.971 3864.143 us/op
buildRoaringBitmap 0.9 10000 avgt 5 59.354 10.385 us/op
buildRoaringBitmap 0.9 100000 avgt 5 374.245 54.485 us/op
buildRoaringBitmap 0.9 1000000 avgt 5 3712.684 657.964 us/op
buildRoaringBitmap 0.9 10000000 avgt 5 37223.976 4691.297 us/op
incrementalNativeAdd 0.1 10000 avgt 5 115.213 31.909 us/op
incrementalNativeAdd 0.1 100000 avgt 5 911.925 127.922 us/op
incrementalNativeAdd 0.1 1000000 avgt 5 8889.49 320.821 us/op
incrementalNativeAdd 0.1 10000000 avgt 5 102819.877 14247.868 us/op
incrementalNativeAdd 0.5 10000 avgt 5 116.878 28.232 us/op
incrementalNativeAdd 0.5 100000 avgt 5 947.076 128.255 us/op
incrementalNativeAdd 0.5 1000000 avgt 5 7190.443 202.012 us/op
incrementalNativeAdd 0.5 10000000 avgt 5 98843.303 4325.924 us/op
incrementalNativeAdd 0.9 10000 avgt 5 101.694 6.579 us/op
incrementalNativeAdd 0.9 100000 avgt 5 816.411 65.678 us/op
incrementalNativeAdd 0.9 1000000 avgt 5 9114.624 412.152 us/op
incrementalNativeAdd 0.9 10000000 avgt 5 108793.694 22562.527 us/op
incrementalUseOrderedWriter 0.1 10000 avgt 5 23.573 5.962 us/op
incrementalUseOrderedWriter 0.1 100000 avgt 5 289.588 36.814 us/op
incrementalUseOrderedWriter 0.1 1000000 avgt 5 2785.659 49.385 us/op
incrementalUseOrderedWriter 0.1 10000000 avgt 5 29489.758 2601.39 us/op
incrementalUseOrderedWriter 0.5 10000 avgt 5 23.57 1.536 us/op
incrementalUseOrderedWriter 0.5 100000 avgt 5 276.488 9.662 us/op
incrementalUseOrderedWriter 0.5 1000000 avgt 5 2799.408 198.77 us/op
incrementalUseOrderedWriter 0.5 10000000 avgt 5 28313.626 1976.042 us/op
incrementalUseOrderedWriter 0.9 10000 avgt 5 22.345 1.574 us/op
incrementalUseOrderedWriter 0.9 100000 avgt 5 280.205 36.987 us/op
incrementalUseOrderedWriter 0.9 1000000 avgt 5 2779.732 93.456 us/op
incrementalUseOrderedWriter 0.9 10000000 avgt 5 30568.591 2140.826 us/op

These benchmarks don’t go far enough to support replacing RoaringBitmap.bitmapOf.

Unsorted Input Data

In the cases benchmarked, this approach seems to be worthwhile. I can’t actually think of a case where someone would want to build a bitmap from unsorted data, but it occurred to me that this approach might be fast enough to cover the cost of a sort. OrderedWriter is also relaxed enough that it only needs the most significant 16 bits to be monotonic, so a full sort isn’t necessary. Implementing a radix sort on the most significant 16 bits (stable in the least significant 16 bits), prior to incremental insertion via an OrderedWriter, leads to huge increases in performance over RoaringBitmap.bitmapOf. The implementation is as follows:


  public static RoaringBitmap bitmapOfUnordered(final int... data) {
    partialRadixSort(data);
    RoaringBitmap bitmap = new RoaringBitmap();
    OrderedWriter writer = new OrderedWriter(bitmap);
    for (int i : data) {
      writer.add(i);
    }
    writer.flush();
    return bitmap;
  }

It did very well, according to benchmarks, even against various implementations of sort prior to RoaringBitmap.bitmapOf. Lower is better:

Benchmark (randomness) (size) Mode Cnt Score Error Units
bitmapOf 0.1 10000 avgt 5 1058.106 76.013 us/op
bitmapOf 0.1 100000 avgt 5 12323.905 976.68 us/op
bitmapOf 0.1 1000000 avgt 5 171812.526 9593.879 us/op
bitmapOf 0.1 10000000 avgt 5 3376296.157 170362.195 us/op
bitmapOf 0.5 10000 avgt 5 1096.663 477.795 us/op
bitmapOf 0.5 100000 avgt 5 12836.177 1674.54 us/op
bitmapOf 0.5 1000000 avgt 5 171998.126 4176 us/op
bitmapOf 0.5 10000000 avgt 5 3707804.439 974532.361 us/op
bitmapOf 0.9 10000 avgt 5 1124.881 65.673 us/op
bitmapOf 0.9 100000 avgt 5 14585.589 1894.788 us/op
bitmapOf 0.9 1000000 avgt 5 198506.813 8552.218 us/op
bitmapOf 0.9 10000000 avgt 5 3723942.934 423704.363 us/op
bitmapOfUnordered 0.1 10000 avgt 5 174.583 17.475 us/op
bitmapOfUnordered 0.1 100000 avgt 5 1768.613 86.543 us/op
bitmapOfUnordered 0.1 1000000 avgt 5 17889.705 135.714 us/op
bitmapOfUnordered 0.1 10000000 avgt 5 192645.352 6482.726 us/op
bitmapOfUnordered 0.5 10000 avgt 5 157.351 3.254 us/op
bitmapOfUnordered 0.5 100000 avgt 5 1674.919 90.138 us/op
bitmapOfUnordered 0.5 1000000 avgt 5 16900.458 778.999 us/op
bitmapOfUnordered 0.5 10000000 avgt 5 185399.32 4383.485 us/op
bitmapOfUnordered 0.9 10000 avgt 5 145.642 1.257 us/op
bitmapOfUnordered 0.9 100000 avgt 5 1515.845 82.914 us/op
bitmapOfUnordered 0.9 1000000 avgt 5 15807.597 811.048 us/op
bitmapOfUnordered 0.9 10000000 avgt 5 167863.49 3501.132 us/op
partialSortThenBitmapOf 0.1 10000 avgt 5 1060.152 168.802 us/op
partialSortThenBitmapOf 0.1 100000 avgt 5 10942.731 347.583 us/op
partialSortThenBitmapOf 0.1 1000000 avgt 5 100606.506 24705.341 us/op
partialSortThenBitmapOf 0.1 10000000 avgt 5 1035448.545 157383.713 us/op
partialSortThenBitmapOf 0.5 10000 avgt 5 1029.883 100.291 us/op
partialSortThenBitmapOf 0.5 100000 avgt 5 10472.509 832.719 us/op
partialSortThenBitmapOf 0.5 1000000 avgt 5 101144.032 16908.087 us/op
partialSortThenBitmapOf 0.5 10000000 avgt 5 958242.087 39650.946 us/op
partialSortThenBitmapOf 0.9 10000 avgt 5 1008.413 70.999 us/op
partialSortThenBitmapOf 0.9 100000 avgt 5 10458.34 600.416 us/op
partialSortThenBitmapOf 0.9 1000000 avgt 5 103945.644 2026.26 us/op
partialSortThenBitmapOf 0.9 10000000 avgt 5 1065638.269 102257.059 us/op
setupCost 0.1 10000 avgt 5 6.577 0.121 us/op
setupCost 0.1 100000 avgt 5 61.378 24.113 us/op
setupCost 0.1 1000000 avgt 5 1021.588 536.68 us/op
setupCost 0.1 10000000 avgt 5 13182.341 196.773 us/op
setupCost 0.5 10000 avgt 5 7.139 2.216 us/op
setupCost 0.5 100000 avgt 5 60.847 23.395 us/op
setupCost 0.5 1000000 avgt 5 800.888 14.711 us/op
setupCost 0.5 10000000 avgt 5 13431.625 553.44 us/op
setupCost 0.9 10000 avgt 5 6.599 0.09 us/op
setupCost 0.9 100000 avgt 5 60.946 22.511 us/op
setupCost 0.9 1000000 avgt 5 813.445 4.896 us/op
setupCost 0.9 10000000 avgt 5 13374.943 349.314 us/op
sortThenBitmapOf 0.1 10000 avgt 5 636.23 13.423 us/op
sortThenBitmapOf 0.1 100000 avgt 5 7411.756 174.264 us/op
sortThenBitmapOf 0.1 1000000 avgt 5 92299.305 3651.161 us/op
sortThenBitmapOf 0.1 10000000 avgt 5 1096374.443 162575.234 us/op
sortThenBitmapOf 0.5 10000 avgt 5 634.957 47.447 us/op
sortThenBitmapOf 0.5 100000 avgt 5 7939.074 409.328 us/op
sortThenBitmapOf 0.5 1000000 avgt 5 93505.427 5409.749 us/op
sortThenBitmapOf 0.5 10000000 avgt 5 1147933.592 57485.51 us/op
sortThenBitmapOf 0.9 10000 avgt 5 661.072 6.717 us/op
sortThenBitmapOf 0.9 100000 avgt 5 7915.506 356.148 us/op
sortThenBitmapOf 0.9 1000000 avgt 5 93403.343 5454.583 us/op
sortThenBitmapOf 0.9 10000000 avgt 5 1095960.734 85753.917 us/op

It looks like there are good performance gains available here, but these things tend to depend on particular data sets. I would be interested in hearing from anyone who has tried to use this class in a real application.

Blocked Signatures

The interesting thing about BitFunnel, the search architecture used by Bing, is its unorthodoxy; it revisits many ideas ignored by contemporary search technologies. The paper’s references go back to the 80s, when one million records was considered an enormous database. Something about this appeals to me, that there might be long forgotten ideas waiting to be reinterpreted and recombined in the context of modern micro-architectures.

A bit-sliced signature arrangement reduces the number of words which must be processed to evaluate a query, but it’s not enough for BitFunnel’s purposes. The last piece of background in the BitFunnel paper is blocked signatures, which are discussed in the 1990 paper A signature file scheme based on multiple organizations for indexing very large text databases by Kent, Sacks-Davis, Ramamohanarao (KS-DR). Blocked signatures further reduce the amount of processing per query, at what can be an acceptable false positive rate. In this post I aim to piece their data structure together in modern Java.

Formulation

The goal is to map documents into blocks consisting of a fixed number of documents (referred to as the blocking factor in the BitFunnel paper) so only bit sliced block signatures need be stored, where a block signature is a bloom filter of the terms in a block of documents. There are a variety of ways of doing this but they all start with assigning an integer label to each document prior to block assignment. This topic is covered at length in KS-DR.

The most obvious technique is to assign contiguous ranges of document IDs to blocks of size N, that is the function i -> Math.floorDiv(i, N). This is only useful if blocks of document signatures are also stored, acting as a top level index into those document signatures. Physically, there is a block index, which is the bit sliced signature of the terms in each block, and separate blocks of document signatures, again bit sliced by term. Queries can be evaluated by constructing a bloom filter from the query’s terms, specifying a set of bit slices in the block index to inspect and intersect. The result of the intersection gives the IDs of the document signature blocks to query. This is like a two level tree, and is better than a signature scan over all the document signatures, but why not just bit slice the document signatures? For rare terms, once a block is located, it does cut down the number of words in each slice to be intersected. However, the real problem with this technique, besides storage, is the cost of a false match at the block level: it results in a document level query, touching N bits, but yields nothing. The BitFunnel blocked signatures generalise this two level hierarchical arrangement for multiple levels.

This post goes on a tangent from BitFunnel here, focusing on the ideas put forward in KS-DR. An alternative is to choose a number M > N coprime to C, an estimate of the capacity of the index, and use the function i -> Math.floorDiv(M * i % C, N) to permute records prior to blocking, then make a copy of the block index for each of several values of M. If you choose, say, two values of M, when evaluating queries, you can map the query terms and get the matching blocks from each representation as before. There is no need for a document level query or storage though. If you have a bitmap of the document IDs (not the signatures) for each block, you can intersect the document bitmaps to get the document IDs matching the query (with false positives, the number of which reduces with the number of copies). In the KS-DR paper, this bitmap I assume the existence of is actually computed on the fly via an expensive reverse mapping with the help of a lookup table.

Java Implementation

The code is very similar to the bit sliced signature code, because a significant part of querying is a bit sliced lookup of block IDs, which requires storage of a bit matrix. The major difference is the requirement for block assignment and ultimately block intersection. I encapsulate this in a BlockSet which contains Blocks and is responsible for block assignment and intersection.

Details of block creation (blocking factor, bit matrix dimensions, hashing policy) can be hidden behind a supplier interface.


public class BlockFactory<D extends Supplier<Set<T>> & IntSupplier, T, Q extends Set<T>> implements Supplier<Block<D, T, Q>> {

    private final List<ToIntFunction<T>> hashes;
    private final int blockingFactor;
    private final int blockCapacity;

    public BlockFactory(List<ToIntFunction<T>> hashes, int blockingFactor, int blockCapacity) {
        this.hashes = hashes;
        this.blockingFactor = blockingFactor;
        this.blockCapacity = blockCapacity;
    }

    @Override
    public Block<D, T, Q> get() {
        return new Block<>(blockingFactor, blockCapacity, hashes);
    }
}

This gives us blocks, which is really just a wrapper around a bit matrix of terms and a bit set of document IDs. It can do three things

  1. Index a document, this requires that it knows the blocking factor (the number of blocks it can index), the hash functions and the bloom filter size.
  2. Check if the block might contain at least one document matching all the terms
  3. Share its document IDs

The code looks quite similar to my previous bit sliced signature implementation.


public class Block<D extends Supplier<Set<T>> & IntSupplier, T, Q extends Set<T>> {

    private final BitSet documentIds;
    private final long[][] bitMatrix;
    private final int capacity;
    private final List<ToIntFunction<T>> hashFunctions;
    private int docIndex = 0;

    public Block(int blockingFactor, int capacity, List<ToIntFunction<T>> hashFunctions) {
        assert Integer.bitCount(capacity) == 1;
        this.documentIds = new BitSet();
        this.bitMatrix = new long[capacity >> 6][blockingFactor];
        this.capacity = capacity;
        this.hashFunctions = hashFunctions;
    }

    public void add(D doc) {
        int docIndex = this.docIndex++;
        int docWordIndex = docIndex >>> 6;
        long docWord = 1L << docIndex;
        mapTerms(doc.get()).forEach(r -> bitMatrix[r][docWordIndex] |= docWord);
        documentIds.set(doc.getAsInt());
    }

    public void contribute(BitSet result) {
        result.or(documentIds);
    }

    public boolean matches(Q query) {
        int[] rows = mapTerms(query).distinct().toArray();
        return IntStream.range(0, capacity >> 6)
                        .filter(i -> hasMatch(rows, i))
                        .findFirst()
                        .isPresent();
    }

    private boolean hasMatch(int[] rows, int offset) {
        long word = 0L;
        for (int i = 0; i < rows.length && word == 0; ++i) {
            word |= bitMatrix[rows[i]][offset];
        }
        return word != 0;
    }

    private IntStream mapTerms(Set<T> terms) {
        return terms.stream().flatMapToInt(t -> hashFunctions.stream().mapToInt(f -> mapHash(f.applyAsInt(t))));
    }

    private int mapHash(int hash) {
        return hash & -hash & (capacity - 1);
    }
}

Now a level up. A BlockIndex has a BlockSet for each relatively prime factor. When evaluating a query, it passes the query to each of its BlockSets, retrieving all blocks which probably match the query.


public class BlockSet<D extends Supplier<Set<T>> & IntSupplier, T, Q extends Set<T>> {

    private final Block[] blocks;
    private final Supplier<Block<D, T, Q>> newBlock;
    private final int blockingFactor;
    private final int estimatedCapacity;
    private final int prime;

    public BlockSet(Supplier<Block<D, T, Q>> newBlock, int blockingFactor, int estimatedCapacity, int prime) {
        assert Integer.bitCount(blockingFactor) == 1 && Integer.bitCount(estimatedCapacity) == 1;
        this.newBlock = newBlock;
        this.blocks = new Block[estimatedCapacity/blockingFactor];
        for (int i = 0; i < blocks.length; ++i) {
            blocks[i] = newBlock.get();
        }
        this.blockingFactor = blockingFactor;
        this.estimatedCapacity = estimatedCapacity;
        this.prime = prime;
    }

    public void add(D doc) {
        int blockIndex = blockIndex(doc.getAsInt());
        Block<D, T, Q> block = (Block<D, T, Q>)blocks[blockIndex];
        block.add(doc);
    }

    public BitSet query(Q query) {
        BitSet result = new BitSet();
        Arrays.stream(blocks)
              .filter(b -> b.matches(query))
              .forEach(b -> b.contribute(result));
        return result;
    }

    private int blockIndex(int value) {
        return ((value * prime) & (estimatedCapacity - 1)) / blockingFactor;
    }
}

With a BlockIndex as the tip of an iceberg – it just needs to intersect the bit sets of document IDs.


public class BlockIndex<D extends Supplier<Set<T>> & IntSupplier, T, Q extends Set<T>> {

    private final List<BlockSet<D, T, Q>> blockSets;

    public BlockIndex(List<BlockSet<D, T, Q>> blockSets) {
        this.blockSets = blockSets;
    }

    public IntStream query(Q query) {
        BitSet result = null;
        for (BlockSet<D, T, Q> blockSet : blockSets) {
            BitSet docIds = blockSet.query(query);
            if (null == result) {
                result = docIds;
            } else {
                result.and(docIds);
            }
        }
        return null == result ? IntStream.of() : result.stream();
    }
}

This code is obviously experimental, but a problem with it as it stands is memory consumption with the temporary bit sets. A better, but less Java 8+ compliant bit set is RoaringBitmap.

Blocked Signatures in BitFunnel

Blocked Signatures are a very old idea, naturally it is reported that there are a few innovations in the BitFunnel data structure. BitFunnel uses multiple levels with blocking factors, each of which must be a proper power of 2, rather than multiple factors coprime to estimated capacity at the same level. Each level has rank = log(blockingFactor). The effect in BitFunnel is having several levels of blocking density. Blocks from different levels can be intersected efficiently by transforming dense blocks to rank zero (the least dense representation) prior to intersection.