Data Driven Logic

I really don’t like reading or writing blocks of if-else statements. They make my eyes glaze over. Rumour has it that processors don’t like executing them either, though that’s less true now than it once was. There are two problems with these blocks of statements, and neither one of them is performance:

  1. They are hard to read and tend to have subtle dependencies on line order.
  2. They can’t be treated as data, and can’t be executed remotely unless you do something weird like serialise code.

Since I started programming in Java, I have been aware of the existence of rule engines, but I have never heard of a single case of “soft coding” working out. In my own experience, every time I have been involved in the implementation of a system with a DSL to empower business analysts to control the business logic, there has been low stakeholder participation during the design of the DSL and developers have ended up writing the business logic anyway. The most excruciating aspect of this is that it dilutes accountability for testing by blurring the boundaries between the application and user input. Perhaps your experience differs. However, rule engines can eradicate cyclomatic complexity in application code, and systems consisting of straight line code (with high test coverage) tend to do what they are supposed to. Soft coding isn’t the appeal of rule engines, getting rid of the if-else blocks is. If you squint at rule engines the right way, they look data driven and they start to get exciting. I can’t see value in anything more complicated than a decision table.

You can represent a block of if-else statements as a decision table by considering every possible branch as a line in the table. Your decision table doesn’t need to be exhaustive: there can be cases where you fall through and throw an exception or choose a default. This can be quite hard to write in imperative code, and you may need to throw the same exception in multiple places, set flags, or otherwise rely on line order to achieve this. Decision tables have a really nice property: if you want to start treating certain cases as exceptional, you just delete the line from the table.

Decision tables are very similar in character to case classes in Scala, or to the weaker when expressions present in Kotlin, but decision tables can be allowed to grow much larger. I wouldn’t allow a match expression with 50,000 cases through a code review even if someone had the energy to write one deviously enough to come in under the maximum byte code method size.

I looked at several implementations of decision tables on GitHub and saw a lot of clean code, but not a lot of textbook computer science. Every single implementation iterated through a list of rules checking the rule against the input data. I have implemented a password strength checker like this in the past (I know! I probably shouldn’t have done this myself!) which is fine because a strong password checker might have at most a dozen rules. What if you work in adtech and want to classify the people you track (how do you sleep at night?) as members of one or many of 50,000 clusters which can be described in terms of regions of, say, a 50 dimensional feature space? Imagine your task is to guess which cluster your quarry belongs to in a few microseconds at most. You won’t get far if you iterate through thousands of rules.

I implemented a small library in the evenings over the last couple of weeks called bitrules. This was based on some ideas I had about using RoaringBitmap for decision tables last year. The idea is very simple: think of a list of rules with constrained attributes as a matrix, and transpose that matrix and loop through the attributes during classification. This is a similar approach to that taken in blocked signatures, a search technique used in BitFunnel, which translates an expensive signature scan to a random access. In the case of bitrules, for each constraint on each attribute, bits are removed from a mask of potentially matching rules. These bitsets are intersected sequentially, resulting in a bitset rapidly decreasing in cardinality. Because I used RoaringBitmap, rapid reduction in cardinality means a rapid reduction in size, which means cache friendliness. There are a few tricks in the code like using range encoding for range attributes, so that range queries can be evaluated with a single bitset intersection. I plan to implement a hopefully efficient serialisation format so the table can be sent to another server and used for classification remotely.

I don’t actually know how fast this code is: performance is context sensitive, and I shy away from making “performance measurements”. It’s best suited to cases where there are a large number of rules (thousands) and I bet it’s really fast when there are ~50,000 segments in a ~50 dimensional space. I don’t even have a use case right now for bitrules: it was just fun writing the code. I have started releasing it to maven central, while I can’t guarantee its fitness for purpose, perhaps it may be of some use to someone else.

Garbage Collector Code Artifacts: Card Marking

In the JVM, lots of evidence of garbage collection mechanics can be seen from JIT compiler output. This may be obvious if you think of garbage collection as a task of book-keeping: the various auxiliary data structures used to track inter-region or inter-generational references, relied on for faster marking, need to be kept up to date somehow. These data structures need maintenance, and this isn’t something you control in application code: the maintenance aspect must must be instrumented somehow. If you profile your application’s disassembly, you can find artifacts of the various garbage collectors, and these snippets of code can help you understand the throughput tradeoffs of each collector. You can also just read the documentation.

A simple benchmark to illustrate this would compare the store of a primitive int and a boxed Integer. It may not be surprising that the classes below can be JIT compiled in very different ways, and that the real difference depends on the selected garbage collector.

public class IntAcceptor {
  private int value;

  public void setValue(int value) {
    this.value = value;

public class IntegerAcceptor {
  private Integer value;

  public void setValue(Integer value) {
    this.value = value;

For instance, the simplest garbage collector, used mostly by specialist applications betting against garbage collection actually happening, is enabled by -XX:+UseSerialGC. If you benchmark throughput for storing these values, you will observe that storing ints is cheaper than storing Integers.

It’s difficult to measure this accurately because there are numerous pitfalls. If you allocate a new Integer for each store, you conflate your measurement with allocation and introduce bias towards the primitive store. If you pre-allocate an Integer[] you can make the measured workload more cache-friendly, from a GC book-keeping point of view, which helps reduce the reference store cost. In a multithreaded context, this same property can create bias in the opposite direction. JMH can’t prevent any of these biases. Be skeptical about the accuracy or generality of the numbers here (because there are a large number of unexplored dimensions to control, as I have alluded to) but you would hardly notice any difference in a single threaded benchmark storing the same boxed integer repeatedly.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit
SerialGCStoreBenchmark.storeInt thrpt 1 20 395.370723 10.092432 ops/us
SerialGCStoreBenchmark.storeInteger thrpt 1 20 277.329797 18.036629 ops/us

You may see a large difference in a multithreaded benchmark, with an Integer instance per thread.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit
SerialGCStoreBenchmark.storeInt thrpt 4 20 1467.401084 5.917960 ops/us
SerialGCStoreBenchmark.storeInteger thrpt 4 20 793.880064 459.304449 ops/us

The throughput of storeInteger seems to have a large error term, here are the iteration figures:

Iteration   1: 1176.474 ops/us
Iteration   2: 85.966 ops/us
Iteration   3: 1180.612 ops/us
Iteration   4: 90.930 ops/us
Iteration   5: 1180.955 ops/us
Iteration   6: 1181.966 ops/us
Iteration   7: 88.801 ops/us
Iteration   8: 1180.723 ops/us
Iteration   9: 1177.895 ops/us
Iteration  10: 1138.446 ops/us
Iteration  11: 1177.302 ops/us
Iteration  12: 91.551 ops/us
Iteration  13: 1144.591 ops/us
Iteration  14: 102.143 ops/us
Iteration  15: 1179.683 ops/us
Iteration  16: 1184.222 ops/us
Iteration  17: 85.365 ops/us
Iteration  18: 1183.874 ops/us
Iteration  19: 95.979 ops/us
Iteration  20: 1150.123 ops/us

This is bimodal, varying from iteration to iteration between almost as good to an order of magnitude slower, with nothing in between. If you compare the disassembly for loops setting distinct values, such as in my simplistic benchmark, you will see the assembly is virtually identical, but you’ll notice these instructions for the reference stores:

  0.98%    1.12%  │  0x00007f54a96462ee: shr    $0x9,%r10
  2.22%    2.17%  │  0x00007f54a96462f2: movabs $0x7f54c1bc5000,%r11
  2.30%    2.69%  │  0x00007f54a96462fc: mov    %r12b,(%r11,%r10,1) 

This code does card marking, which tracks bucketed references between different sections of the heap. The byte array is the card table, which has logical pages of 512 bytes. The right shift divides the reference of the stored object by 512 to get the card it resides in. The byte at this index offset by the base address of the page tracking references out of the storing object’s card is written to. In other words, a directed link is established between the storing object’s page and stored object’s page. This is what you would see if you squinted at the heap: the card table is a coarse approximation of the object graph which allows false positives (referenced pages may contain dead references) but no false negatives.

The writes to the card table are volatile, and the card table is shared between threads, which can induce false sharing when objects in adjacent pages are stored in objects residing in the same page, and the stores happen on different threads. You can use conditional marking to avoid this because the stored object’s page is often already marked. The bimodal behaviour is caused by unlucky combinations of addresses resulting in false sharing of the card table. It doesn’t even happen all the time. Setting the -XX:+UseCondCardMark the difference gets much smaller, the noise disappears, and conditional marking logic can be seen in the disassembly.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit
SerialGCStoreBenchmark.storeInt thrpt 4 20 1467.464828 12.866720 ops/us
SerialGCStoreBenchmark.storeInteger thrpt 4 20 1114.612419 6.960193 ops/us

                  ╭││  0x00007f003164b9e4: je     0x00007f003164ba04 
  0.01%    0.00%  │││  0x00007f003164b9e6: mov    %r10,%r8
  4.92%    3.54%  │││  0x00007f003164b9e9: shr    $0x9,%r8
  0.01%    0.00%  │││  0x00007f003164b9ed: movabs $0x7f0048492000,%r9
  3.48%    2.12%  │││  0x00007f003164b9f7: add    %r8,%r9
  0.02%    0.01%  │││  0x00007f003164b9fa: movsbl (%r9),%ecx
  6.51%    6.53%  │││  0x00007f003164b9fe: test   %ecx,%ecx
  1.71%    1.85%  │╰│  0x00007f003164ba00: jne    0x00007f003164b994
                  │ │                                               
                  │ │                                               
                  │ │                                               
  4.76%    5.29%  │ ╰  0x00007f003164ba02: jmp    0x00007f003164b9a0
                  ↘    0x00007f003164ba04: mov    $0xfffffff6,%esi

I intended to provoke this behaviour, but what if I had been measuring something else and hadn’t ensured conditional marking was enabled?

Card marking is common in older garbage collectors because it has low overhead, particularly with conditional marking, but different collectors intercept stores differently, and you can reverse engineer them all without reading the source code. In fact, Nitsan Wakart has written a great post all about store barriers.

The point of this post is that you can detect garbage collector mechanisms with benchmarks, you just need to write them to provoke the actions you think a garbage collector should make, and look for crop circles in the disassembly. However, this assumes you have some kind of mental model of a garbage collector to start with! The new ones are getting very creative, and you might not be able to guess what they do. In principle, garbage collector implementations could modify any application code so these artifacts could be anywhere.

Parallel Bitmap Aggregation

A bitmap index represents predicates over records as sets consisting of the integer identities of each record satisfying each predicate. This representation is actually a few decades out of date, and systems like Pilosa use much more sophisticated data structures, and Sybase had even more on offer back in the 90s. But the chances are, if you’ve rolled your own bitmap index, you’ve used equality encoding and have a bitmap per indexed predicate. RoaringBitmap is a great choice for the bitmaps used in this kind of data structure, because it offers a good tradeoff between bitmap compression and performance. It’s also succinct, that is, you don’t need to decompress the structure in order to operate on it. With the naive index structure described, it’s likely that you have many bitmaps to aggregate (union, intersection, difference, and so on) when you want to query your index.

RoaringBitmap provides a class FastAggregation for aggregations, and the method FastAggregation.and is incredibly fast, particularly given its apparent simplicity. This reflects a nice property of set intersection, that the size of the intersection cannot increase and tends to get smaller as the number of sets increases. Unions and differences are different: the problem size tends to increase in magnitude as the number of contributing sets increases. While FastAggregation.or and FastAggregation.xor are highly optimised, not a lot can be done about the fact each additional set makes the problem bigger. So it may be worth throwing some threads at the problem, and this gets more attractive as you add more dimensions to your index. You can, of course, completely bypass this need by reading some database research and sharing bitmaps between overlapping predicates.

I implemented the class ParallelAggregation in RoaringBitmap, but I’m not convinced the technique used performs as well as it could do. RoaringBitmap stores the 16 bit prefix of each integer in a sorted array, with the rest of each integer in that 16 bit range stored in a container at the same index in another array. This makes the structure very easy to split. The implementation I worked on seeks to exploit this by grouping all the containers by common key as a SortedMap<Short, List<Container>> before executing each reduction (i.e. Function<List<Container>, Container>) in parallel in a ForkJoinPool. This results in a reasonable speedup of 2x-6.5x compared to FastAggregation on an 8 core machine, but it uses quite a lot of temporary memory just to set the problem up. I don’t think it should be possible to beat this approach without grouping the containers by key somehow, but I suspect there are lighter weight approaches which use less memory and give better throughput. Perhaps this would be an interesting problem to work on?