Spliterator Characteristics and Performance

The streams API has been around for a while now, and I’m a big fan of it. It allows for a clean declarative programming style, which permits various optimisations to occur, and keeps the pastafarians at bay. I also think Stream is the perfect abstraction for data interchange across API boundaries. This is partly because a Stream is lazy, meaning you don’t pay for consumption until you actually need to, and partly because a Stream can only be used once, so there can be no ambiguity about ownership. If you supply a Stream to an API, you must expect that it has been used, and so must discard it. This almost entirely eradicates defensive copies and can mean that no intermediate data structures need ever exist. Despite my enthusiasm for this abstraction, there’s some weirdness in this API when you scratch beneath the surface.

I wanted to find a way to quickly run-length encode an IntStream and found it difficult to make the code as fast as I’d like it to be. The code is too slow because it’s necessary to inspect each int, even when there is enough context available to potentially apply optimisations such as skipping over ranges. It’s likely that I am experiencing the friction of treating Stream as a data interchange format, which wasn’t one of its design goals, but this led me to investigate spliterator characteristics (there is no contiguous characteristic, which could speed up RLE greatly) and their relationship with performance.
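To make this concrete, here is a minimal sketch of the kind of run length encoding I have in mind (representing runs as {value, length} pairs is my own choice for illustration); note that it has no option but to inspect every single int:


    static List<int[]> runLengthEncode(IntStream stream) {
        List<int[]> runs = new ArrayList<>(); // each run is a {value, length} pair
        stream.forEachOrdered(value -> {
            int[] currentRun = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (currentRun != null && currentRun[0] == value) {
                ++currentRun[1]; // extend the current run
            } else {
                runs.add(new int[] {value, 1}); // start a new run
            }
        });
        return runs;
    }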

Spliterator Characteristics

Streams have spliterators, which control iteration and splitting behaviour. If you want to process a stream in parallel, it is the spliterator which dictates how to split the stream, if possible. There’s more to a spliterator than parallel execution though, and each single threaded execution can be optimised based on the characteristics bit mask. The different characteristics are as follows:

  • ORDERED promises that there is an encounter order; for instance, trySplit is guaranteed to give a prefix of the elements.
  • DISTINCT promises that each element in the stream is unique.
  • SORTED promises that the stream is already sorted.
  • SIZED promises that the size of the stream is known, which is not true when the stream is generated by a call to iterate.
  • NONNULL promises that no elements in the stream are null.
  • IMMUTABLE promises that the underlying data will not change.
  • CONCURRENT promises that the underlying data can be modified concurrently; it must not also be IMMUTABLE.
  • SUBSIZED promises that the sizes of splits are known; it must also be SIZED.

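You can check the bit mask directly from any stream; for instance, the following (which relies only on the Spliterator API) prints true for a range:


    Spliterator.OfInt spliterator = IntStream.range(0, 262144).spliterator();
    // true: the characteristics bit mask contains both of these flags
    System.out.println(spliterator.hasCharacteristics(Spliterator.DISTINCT | Spliterator.SORTED));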
There’s javadoc for all of these flags, which should be your point of reference, and you really do need to read it, because you couldn’t guess the semantics from observed performance. For instance, IntStream.range(inclusive, exclusive) creates a RangeIntSpliterator with the characteristics ORDERED | SIZED | SUBSIZED | IMMUTABLE | NONNULL | DISTINCT | SORTED. This means that the stream has no duplicates, no nulls, is already sorted in natural order, has a known size, and will be chunked deterministically: the data and the iteration order never change, and if we split it, we will always get the same first chunk. So these code snippets should have virtually the same performance:


    @Benchmark
    public long countRange() {
        return IntStream.range(0, size).count();
    }

    @Benchmark
    public long countRangeDistinct() {
        return IntStream.range(0, size).distinct().count();
    }

This expectation is completely at odds with observation. Even though the elements are already distinct, and the metadata to prove it exists, requesting the distinct elements decimates performance.

Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size
countRange | thrpt | 1 | 10 | 49.465729 | 1.804123 | ops/us | 262144
countRangeDistinct | thrpt | 1 | 10 | 0.000395 | 0.000002 | ops/us | 262144

It turns out this is because IntStream.distinct has a one-size-fits-all implementation which completely ignores the DISTINCT characteristic, and goes ahead and boxes the entire range.

    // from java.util.stream.IntPipeline
    @Override
    public final IntStream distinct() {
        // While functional and quick to implement, this approach is not very efficient.
        // An efficient version requires an int-specific map/set implementation.
        return boxed().distinct().mapToInt(i -> i);
    }
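Nothing stops you from exploiting the metadata yourself though. For instance, when you know the stream is sorted as well as distinct, adjacent duplicates can be collapsed without any boxing; a sketch below, noting that the stateful predicate makes this safe for sequential streams only:


    static IntStream distinctSorted(IntStream sorted) {
        long[] previous = {Long.MIN_VALUE}; // sentinel outside the range of any int
        return sorted.filter(i -> {
            if (previous[0] == i) {
                return false; // adjacent duplicate in a sorted stream
            }
            previous[0] = i;
            return true;
        });
    }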

There is even more observable weirdness. If we wanted to calculate the sum of the first 1000 natural numbers, these two snippets should have the same performance, yet requesting what should be a redundant sort nearly triples the throughput.


    @Benchmark 
    public long headSum() {
        return IntStream.range(0, size).limit(1000).sum();
    }

    @Benchmark
    public long sortedHeadSum() {
        return IntStream.range(0, size).sorted().limit(1000).sum();
    }

Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size
headSum | thrpt | 1 | 10 | 0.209763 | 0.002478 | ops/us | 262144
sortedHeadSum | thrpt | 1 | 10 | 0.584227 | 0.006004 | ops/us | 262144

In fact, you would have a hard time finding a relationship between Spliterator characteristics and performance, but you can see cases of characteristics driving optimisations if you look hard enough, such as in IntStream.count, where the SIZED characteristic is used.

    // see java.util.stream.ReduceOps.makeIntCounting
    @Override
    public <P_IN> Long evaluateSequential(PipelineHelper<Integer> helper, Spliterator<P_IN> spliterator) {
        if (StreamOpFlag.SIZED.isKnown(helper.getStreamAndOpFlags()))
            return spliterator.getExactSizeIfKnown();
        return super.evaluateSequential(helper, spliterator);
    }

This is a measurably worthwhile optimisation, when benchmarked against the unsized spliterator created by IntStream.iterate:


    @Benchmark
    public long countIterator() {
        return IntStream.iterate(0, i -> i < size, i -> i + 1).count();
    }

    @Benchmark
    public long countRange() {
        return IntStream.range(0, size).count();
    }

Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size
countIterator | thrpt | 1 | 10 | 0.001198 | 0.001629 | ops/us | 262144
countRange | thrpt | 1 | 10 | 43.166065 | 4.628715 | ops/us | 262144

What about limit? It’s supposed to be useful for speeding up streams by limiting the amount of work done. Unfortunately not: it can actually make things much worse. In SliceOps.flags, we can see that it clears the SIZED flag:

    //see java.util.stream.SliceOps
    private static int flags(long limit) {
        return StreamOpFlag.NOT_SIZED | ((limit != -1) ? StreamOpFlag.IS_SHORT_CIRCUIT : 0);
    }

This has a significant effect on performance, as can be seen in the following benchmark:


    @Benchmark
    public long countRange() {
        return IntStream.range(0, size).count();
    }

    @Benchmark
    public long countHalfRange() {
        return IntStream.range(0, size).limit(size / 2).count();
    }

Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size
countHalfRange | thrpt | 1 | 10 | 0.003632 | 0.003363 | ops/us | 262144
countRange | thrpt | 1 | 10 | 44.859998 | 6.191411 | ops/us | 262144

It’s almost as if there were grand plans for characteristics based optimisation, and perhaps time ran out (IntStream.distinct has a very apologetic comment), or some ideas looked better on paper than in reality. In any case, the characteristics aren’t as influential as you might expect. Given that the relationship between the existing characteristics and performance is flaky at best, it’s unlikely that a new one would get implemented, but I think a CONTIGUOUS characteristic is missing.

Don’t Fear the Builder

Writing a nontrivial piece of code requires a programmer to make countless tiny decisions, and to produce good quality code quickly, many of these decisions must be made subconsciously. But what happens if your instinct is mostly superstitious bullshit? I instinctively avoid allocating objects when I don’t have to, so would not voluntarily resort to using things like EqualsBuilder or HashCodeBuilder from Apache Commons. I feel dismayed whenever I am asked about the relationship between hashCode and equals in interviews for contracts: the question indicates a lack of mutual respect, but what’s revealing is that enough people passing themselves off as developers don’t know this stuff to make it worth asking. EqualsBuilder and HashCodeBuilder make it so you don’t actually need to know, making it safe to put any object in a HashSet, whoever wrote its class. But should you use these classes just because they protect the naive (and protect you from the naive)? Is there a runtime tax? Or is any perceived cost magically eradicated by the JVM? It’s time to benchmark my instinct.

Baseline

There are definitely no allocations in the code IntelliJ (or similar) would generate for equals and hashCode automatically. I will use this generated code as the baseline: if I can match or beat this code with EqualsBuilder or HashCodeBuilder respectively, I can dismiss my prejudice. Here’s the baseline class:


public class Baseline {
    
    private final String a;
    private final int b;
    private final String c;
    private final Instant d;

    public Baseline(String a, int b, String c, Instant d) {
        this.a = a;
        this.b = b;
        this.c = c;
        this.d = d;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;

        Baseline baseline = (Baseline) o;

        if (b != baseline.b) return false;
        if (!a.equals(baseline.a)) return false;
        if (!c.equals(baseline.c)) return false;
        return d.equals(baseline.d);
    }

    @Override
    public int hashCode() {
        int result = a.hashCode();
        result = 31 * result + b;
        result = 31 * result + c.hashCode();
        result = 31 * result + d.hashCode();
        return result;
    }
}

It’s not pretty, and it’s a real pain to keep generated code up to date when adding and removing fields (you can delete and regenerate it, but that creates merge noise), but there’s no obvious way to improve on this code: it’s correct and it looks fast. To benchmark the alternatives, I will look at the effect on set membership queries against HashSets of various sizes.

Builder Implementation

I will contrast this with the following implementation, which uses EqualsBuilder and HashCodeBuilder from Apache Commons (incidentally, also generated by IntelliJ):


public class Builder {

    private final String a;
    private final int b;
    private final String c;
    private final Instant d;

    public Builder(String a, int b, String c, Instant d) {
        this.a = a;
        this.b = b;
        this.c = c;
        this.d = d;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;

        if (o == null || getClass() != o.getClass()) return false;

        Builder builder = (Builder) o;

        return new EqualsBuilder()
                .append(b, builder.b)
                .append(a, builder.a)
                .append(c, builder.c)
                .append(d, builder.d)
                .isEquals();
    }

    @Override
    public int hashCode() {
        return new HashCodeBuilder(17, 37)
                .append(a)
                .append(b)
                .append(c)
                .append(d)
                .toHashCode();
    }
}

This code is a bit neater and easier to maintain, but does it have an observable cost?

HashSet Benchmark

For a parametrised range of HashSet sizes, I measure throughput for calls to contains. This requires that both hashCode and equals are called, the latter potentially several times. There is bias in this benchmark, because the implementation producing the best hash function will minimise the calls to equals, but I am willing to reward the better implementation rather than maniacally isolate any attributable allocation cost.

The code is simple, if a little repetitive. For each implementation, I make some random instances of my class, and make one that I can be sure isn’t in the set. I measure the cost of repeated invocations of the hashCode and equals methods by searching for an object which is present, and, separately, for one which is missing.


@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class HashSetContainsBenchmark {

    @Param({"8", "96", "1024", "8192"})
    int setSize;

    private HashSet<Baseline> baselineSet;
    private Baseline presentBaseline;
    private Baseline missingBaseline;

    private HashSet<Builder> builderSet;
    private Builder presentBuilder;
    private Builder missingBuilder;

    @Setup(Level.Trial)
    public void setup() {
        setupBaselineState();
        setupBuilderState();
    }

    private void setupBaselineState() {
        baselineSet = new HashSet<>();
        for (int i = 0; i < setSize; ++i) {
            while(!baselineSet.add(newBaselineInstance()));
        }
        presentBaseline = baselineSet.iterator().next();
        do {
            missingBaseline = newBaselineInstance();
        } while (baselineSet.contains(missingBaseline));
    }

    private void setupBuilderState() {
        builderSet = new HashSet<>();
        for (int i = 0; i < setSize; ++i) {
            while(!builderSet.add(newBuilderInstance()));
        }
        presentBuilder = builderSet.iterator().next();
        do {
            missingBuilder = newBuilderInstance();
        } while (builderSet.contains(missingBuilder));
    }

    @Benchmark
    public boolean findPresentBaselineInstance() {
        return baselineSet.contains(presentBaseline);
    }

    @Benchmark
    public boolean findMissingBaselineInstance() {
        return baselineSet.contains(missingBaseline);
    }

    @Benchmark
    public boolean findPresentBuilderInstance() {
        return builderSet.contains(presentBuilder);
    }

    @Benchmark
    public boolean findMissingBuilderInstance() {
        return builderSet.contains(missingBuilder);
    }

    private static Baseline newBaselineInstance() {
        ThreadLocalRandom random = ThreadLocalRandom.current();
        return new Baseline(UUID.randomUUID().toString(),
                random.nextInt(),
                UUID.randomUUID().toString(),
                Instant.ofEpochMilli(random.nextLong(0, Instant.now().toEpochMilli())));
    }

    private static Builder newBuilderInstance() {
        ThreadLocalRandom random = ThreadLocalRandom.current();
        return new Builder(UUID.randomUUID().toString(),
                random.nextInt(),
                UUID.randomUUID().toString(),
                Instant.ofEpochMilli(random.nextLong(0, Instant.now().toEpochMilli())));
    }
}

Running this benchmark with org.apache.commons:commons-lang3:3.7 yields the throughput results below. There is an occasional, slight throughput degradation with the builders, and the builder implementation is even sometimes faster in this benchmark; the difference isn’t large enough to worry about.

Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: setSize
findMissingBaselineInstance | thrpt | 1 | 10 | 153.878772 | 7.993036 | ops/us | 8
findMissingBaselineInstance | thrpt | 1 | 10 | 135.835085 | 6.031588 | ops/us | 96
findMissingBaselineInstance | thrpt | 1 | 10 | 137.716073 | 25.731562 | ops/us | 1024
findMissingBaselineInstance | thrpt | 1 | 10 | 123.896741 | 15.484023 | ops/us | 8192
findMissingBuilderInstance | thrpt | 1 | 10 | 142.051528 | 4.177352 | ops/us | 8
findMissingBuilderInstance | thrpt | 1 | 10 | 132.326909 | 1.017351 | ops/us | 96
findMissingBuilderInstance | thrpt | 1 | 10 | 127.220440 | 8.680761 | ops/us | 1024
findMissingBuilderInstance | thrpt | 1 | 10 | 140.029701 | 6.272960 | ops/us | 8192
findPresentBaselineInstance | thrpt | 1 | 10 | 139.714236 | 1.626873 | ops/us | 8
findPresentBaselineInstance | thrpt | 1 | 10 | 140.244864 | 1.777298 | ops/us | 96
findPresentBaselineInstance | thrpt | 1 | 10 | 135.273760 | 7.587937 | ops/us | 1024
findPresentBaselineInstance | thrpt | 1 | 10 | 133.158555 | 5.101069 | ops/us | 8192
findPresentBuilderInstance | thrpt | 1 | 10 | 111.196443 | 18.804103 | ops/us | 8
findPresentBuilderInstance | thrpt | 1 | 10 | 124.008441 | 5.182294 | ops/us | 96
findPresentBuilderInstance | thrpt | 1 | 10 | 126.729750 | 4.286002 | ops/us | 1024
findPresentBuilderInstance | thrpt | 1 | 10 | 124.159285 | 5.287920 | ops/us | 8192

Where did the builder go?

It’s worth taking a look at the compiler output to find out what happened. Considering the case where the object isn’t found in the set, we can see that the code gets inlined, and that the hash code must be quite good. Enabling the flags -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining makes it clear that the entire hash code computation gets inlined into the calling loop, and that equals is never executed (suggesting that there are no collisions to resolve here).

                              @ 8   java.util.HashSet::contains (9 bytes)   inline (hot)
                                @ 5   java.util.HashMap::containsKey (18 bytes)   inline (hot)
                                 \-> TypeProfile (17571/17571 counts) = java/util/HashMap
                                  @ 2   java.util.HashMap::hash (20 bytes)   inline (hot)
                                    @ 9   com.openkappa.simd.builder.Builder::hashCode (43 bytes)   inline (hot)
                                      @ 8   org.apache.commons.lang3.builder.HashCodeBuilder::<init> (60 bytes)   inline (hot)
                                        @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
                                        @ 26   org.apache.commons.lang3.Validate::isTrue (18 bytes)   unloaded signature classes
                                        @ 46   org.apache.commons.lang3.Validate::isTrue (18 bytes)   unloaded signature classes
                                      @ 15   org.apache.commons.lang3.builder.HashCodeBuilder::append (58 bytes)   inline (hot)
                                        @ 21   java.lang.Object::getClass (0 bytes)   (intrinsic)
                                        @ 24   java.lang.Class::isArray (0 bytes)   (intrinsic)
                                        @ 49   java.lang.String::hashCode (49 bytes)   inline (hot)
                                          @ 19   java.lang.String::isLatin1 (19 bytes)   inline (hot)
                                          @ 29   java.lang.StringLatin1::hashCode (42 bytes)   inline (hot)
                                      @ 22   org.apache.commons.lang3.builder.HashCodeBuilder::append (17 bytes)   inline (hot)
                                      @ 29   org.apache.commons.lang3.builder.HashCodeBuilder::append (58 bytes)   inline (hot)
                                        @ 21   java.lang.Object::getClass (0 bytes)   (intrinsic)
                                        @ 24   java.lang.Class::isArray (0 bytes)   (intrinsic)
                                        @ 49   java.lang.String::hashCode (49 bytes)   inline (hot)
                                          @ 19   java.lang.String::isLatin1 (19 bytes)   inline (hot)
                                          @ 29   java.lang.StringLatin1::hashCode (42 bytes)   inline (hot)
                                      @ 36   org.apache.commons.lang3.builder.HashCodeBuilder::append (58 bytes)   inline (hot)
                                        @ 21   java.lang.Object::getClass (0 bytes)   (intrinsic)
                                        @ 24   java.lang.Class::isArray (0 bytes)   (intrinsic)
                                        @ 49   java.time.Instant::hashCode (22 bytes)   inline (hot)
                                      @ 39   org.apache.commons.lang3.builder.HashCodeBuilder::toHashCode (5 bytes)   accessor
                                  @ 6   java.util.HashMap::getNode (148 bytes)   inline (hot)
                                    @ 59   com.openkappa.simd.builder.Builder::equals (84 bytes)   never executed
                                    @ 126   com.openkappa.simd.builder.Builder::equals (84 bytes)   never executed

So the code is clearly small enough to get inlined. But what about the builder itself, isn’t it allocated? Not if it doesn’t escape. On a debug JVM, it’s possible to observe the removal of the builder’s allocation with the JVM args -XX:+PrintEscapeAnalysis -XX:+PrintEliminateAllocations. On a production JVM, it can be observed indirectly, by measuring the difference when escape analysis is disabled with -XX:-DoEscapeAnalysis.
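For reference, a sketch of running the benchmark this way with the JMH runner API:


    Options options = new OptionsBuilder()
            .include(HashSetContainsBenchmark.class.getSimpleName())
            .jvmArgsAppend("-XX:-DoEscapeAnalysis") // disable escape analysis for comparison
            .build();
    new Runner(options).run();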

Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: setSize
findMissingBaselineInstance | thrpt | 1 | 10 | 150.972758 | 5.095163 | ops/us | 8
findMissingBaselineInstance | thrpt | 1 | 10 | 140.236057 | 2.222751 | ops/us | 96
findMissingBaselineInstance | thrpt | 1 | 10 | 132.680494 | 5.225503 | ops/us | 1024
findMissingBaselineInstance | thrpt | 1 | 10 | 115.220385 | 4.232488 | ops/us | 8192
findMissingBuilderInstance | thrpt | 1 | 10 | 68.869355 | 4.479944 | ops/us | 8
findMissingBuilderInstance | thrpt | 1 | 10 | 67.786351 | 5.980353 | ops/us | 96
findMissingBuilderInstance | thrpt | 1 | 10 | 71.113062 | 3.057181 | ops/us | 1024
findMissingBuilderInstance | thrpt | 1 | 10 | 75.088839 | 1.592294 | ops/us | 8192
findPresentBaselineInstance | thrpt | 1 | 10 | 141.425675 | 2.630898 | ops/us | 8
findPresentBaselineInstance | thrpt | 1 | 10 | 142.636359 | 2.854795 | ops/us | 96
findPresentBaselineInstance | thrpt | 1 | 10 | 143.939199 | 0.475918 | ops/us | 1024
findPresentBaselineInstance | thrpt | 1 | 10 | 142.006635 | 3.098352 | ops/us | 8192
findPresentBuilderInstance | thrpt | 1 | 10 | 71.546971 | 6.152584 | ops/us | 8
findPresentBuilderInstance | thrpt | 1 | 10 | 62.451705 | 11.896730 | ops/us | 96
findPresentBuilderInstance | thrpt | 1 | 10 | 68.903442 | 3.263955 | ops/us | 1024
findPresentBuilderInstance | thrpt | 1 | 10 | 69.576038 | 4.047581 | ops/us | 8192

Without escape analysis, the baseline code hardly slows down, whereas the change in the builder benchmarks is stark. The difference of differences here isolates the cost of allocating the HashCodeBuilder on the heap, and it is the kind of difference the value types expected from Project Valhalla should eliminate. Escape analysis is enabled by default, and the key point is that even if code looks like it should be slower, it may be as good or better: it depends on what the JIT does with it.

Beware Collection Factory Methods

I saw an interesting tweet referencing a Github issue where the inclusion of an (in my view) unnecessary implementation of the List interface changed inlining decisions, causing a 20x degradation in throughput. Guava’s ImmutableList is my favourite class to seek and destroy because of the way it is often used – it tends to be associated with unnecessary copying where encapsulation would be a better solution. I had assumed that the performance gains won from finding and deleting all the instances of ImmutableList were thanks to relieving the garbage collector from medieval torture. In fact, the degradation observed in the benchmark is caused by using ImmutableList, along with all its subclasses, alongside ArrayList, making call sites on List bimorphic at best, and causing the JIT compiler to generate slower code. I may have inadvertently profited from better inlining in the past simply by removing as many ImmutableLists as possible!

This post doesn’t go into any details about the various mechanisms of method dispatch, and if you want to understand the impact of polymorphism on inlining, bookmark Aleksey Shipilev’s authoritative post and read it when you have some time to really concentrate.

Without resorting to LinkedList, is it possible to contrive cases in Java 9 where performance is severely degraded by use of Collections.unmodifiableList and the List.of factory methods? Along with ArrayList, these produce random access data structures, so any differences should highlight the potential performance gains that inlining can give.

This section and the benchmarks were amended to remove a small bias caused by calls to ThreadLocalRandom.current().nextInt() after a suggestion from Lukas Eder.

The methodology is very simple: I randomly vary the List implementation and plug it into the same algorithm. It is cruder than what you would see in Aleksey Shipilev’s post, because I target only the worst case by creating equal bias between implementations. Aleksey demonstrates that inlining decisions are statistical and opportunistic (the JIT can guess and later deoptimise), and if 90% of your call sites dispatch to the same implementation, it doesn’t matter as much as when the choice is made uniformly. It will vary from application to application, but it could easily be as bad as the case I present if List is used polymorphically.

I created five benchmarks which produce the same number, the same way. Three of these benchmarks only ever call into a single implementation of List and will be inlined monomorphically; to avoid bias, their results are XOR’d with a call to ThreadLocalRandom.current().nextInt(), because the other benchmarks need this call. One benchmark only ever calls into the List.of and ArrayList implementations, making the dispatch bimorphic, and the last randomly chooses one of the three lists for each invocation, making it megamorphic. The difference is stark: you can really screw up performance by making the methods on List megamorphic.

Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit
sumLength_ArrayList | thrpt | 1 | 10 | 55.785270 | 3.218552 | ops/us
sumLength_Factory | thrpt | 1 | 10 | 58.565918 | 2.852415 | ops/us
sumLength_Random2 | thrpt | 1 | 10 | 35.842255 | 0.684658 | ops/us
sumLength_Random3 | thrpt | 1 | 10 | 11.177564 | 0.080164 | ops/us
sumLength_Unmodifiable | thrpt | 1 | 10 | 51.776108 | 3.751297 | ops/us


@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class MegamorphicList {

    private List<String>[] strings;

    @Setup(Level.Trial)
    public void init() {
        strings = new List[]{getArrayList(6), getFactoryList6(), getUnModifiableList(6)};
    }

    @Benchmark
    public int sumLength_ArrayList() {
        List<String> list = strings[0];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole ^ ThreadLocalRandom.current().nextInt(3);
    }

    @Benchmark
    public int sumLength_Factory() {
        List<String> list = strings[1];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole ^ ThreadLocalRandom.current().nextInt(3);
    }

    @Benchmark
    public int sumLength_Unmodifiable() {
        List<String> list = strings[2];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole ^ ThreadLocalRandom.current().nextInt(3);
    }

    @Benchmark
    public int sumLength_Random2() {
        List<String> list = strings[ThreadLocalRandom.current().nextInt(2)];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole;
    }

    @Benchmark
    public int sumLength_Random3() {
        List<String> list = strings[ThreadLocalRandom.current().nextInt(3)];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole;
    }

    private List<String> getUnModifiableList(int size) {
        return Collections.unmodifiableList(getArrayList(size));
    }

    private List<String> getFactoryList6() {
        return List.of(randomString(),
                       randomString(),
                       randomString(),
                       randomString(),
                       randomString(),
                       randomString()
                );
    }

    private List<String> getArrayList(int size) {
        List<String> list = new ArrayList<>();
        for (int i = 0; i < size; ++i) {
            list.add(randomString());
        }
        return list;
    }

    private String randomString() {
        return new String(DataUtil.createByteArray(ThreadLocalRandom.current().nextInt(10, 20)));
    }

}

Since writing this post, I have been challenged on whether this result is due to failure to inline or not. This can be easily verified by setting the following JVM arguments to print compilation:

-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining

You will see the ArrayList and ListN get inlined quickly in isolation:

\-> TypeProfile (19810/19810 counts) = java/util/ArrayList
@ 27   java.util.ArrayList::get (15 bytes)   inline (hot)
...
\-> TypeProfile (363174/363174 counts) = java/util/ImmutableCollections$ListN
@ 24   java.util.ImmutableCollections$ListN::get (17 bytes)   inline (hot)

However, the call remains virtual (and not inlined) when three or more implementations are present:

@ 30   java.util.List::get (0 bytes)   virtual call

I didn’t even bother to try factory methods of different arity, because three is the magic number. Syntactic sugar is nice, but use it with caution.

Bit Twiddling Blog Posts

This is a list of blog posts to read if you like bit twiddling and some basic maths. I would like to improve this list over time – if you have a post similar to these please provide a link in the comments.

  1. Table of basic reversible integer operations – Marc B. Reynolds looks at several bijective functions on the integers modulo 2^32.
  2. Modular multiplicative inverses – multiplication of integers modulo 2^32 is invertible for odd integers. Daniel Lemire derives this property and constructs an efficient algorithm to compute the modular inverse of an odd integer (a sketch follows this list).
  3. Carryless multiplicative inverse – A guy called Harold demonstrates an algorithm for computing the carryless inverse of 32 bit integers.
  4. Visualizing Addition, Subtraction and Complement – A spatial interpretation of the addition, subtraction and complement operations on binary integers. Another Harold special.
  5. XOR rotates and their inverses – Marc B. Reynolds looks at XOR rotates, their inverses and provides a few visualisations.
  6. Is XOR distributive over addition – a simple proof by contradiction that XOR does not distribute over addition.
  7. Parallel prefix/suffix operations – more insight into compression operations from Harold.
  8. Morton Codes and bit interleaving – Jeroen Baert discusses implementations of Z-curves through bit interleaving.
  9. Tesseral Arithmetic – cracking open Morton encoded coordinates to perform arithmetic adds overhead and reduces throughput. Harold discusses and implements arithmetic without leaving the Morton encoding domain.
  10. Hyperplanes and Wildcards – David Eppstein interprets bit strings as the vertices of hypercubes, and develops a wildcard concept to represent unspecified dimensions. This is very similar in concept to a rule engine I wrote recently.
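Item 2 is easy to sketch in Java: Newton’s iteration doubles the number of correct bits at each step, so four steps suffice for 32 bits. A minimal version:


    // if y is the inverse of x mod 2^k, then y * (2 - x * y) is the inverse mod 2^(2k)
    static int modularInverse(int x) {
        assert (x & 1) == 1 : "only odd integers are invertible mod 2^32";
        int y = x;          // for odd x, x * x = 1 mod 8, so y is correct to 3 bits
        y *= 2 - x * y;     // 6 bits
        y *= 2 - x * y;     // 12 bits
        y *= 2 - x * y;     // 24 bits
        y *= 2 - x * y;     // 48 bits, more than enough
        return y;
    }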

Is XOR Distributive over Addition?

Google search console has become a thing of mild interest to me since I moved my website and Google forgot my content existed. Impressions – search terms that match your site but don’t lead to a click – can provide fascinating insights into the Internet habits of others, particularly when the content is completely irrelevant to the query. Impressions are the best place to look for these noisy mismatches. I looked at my top ten impressions and was pleased to see my stalkers (people who type my name into Google) are outnumbered by CBOR enthusiasts. The world must pay more attention to CBOR. One search term among them jumped out at me:

is xor distributive over addition

Since multiplication obviously does distribute over addition, even modulo 2^32, it’s perhaps a reasonable question to ask. To disprove the proposition, it is enough to find a single counterexample (not hard, and much quicker than a google search), but it’s more interesting to find a constructive class of counterexamples. A few strategies, other than picking random numbers and checking, seemed worth writing down.
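For example, the very first pair of numbers you might try is already a counterexample:


    int x = 1, y = 3;
    System.out.println(y ^ (x + x));        // 3 ^ 2 = 1
    System.out.println((y ^ x) + (y ^ x));  // 2 + 2 = 4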

Tangentially, on the topic of Google relevance, this search term had nothing to do with this blog until this post, but when I search for topics I think my posts are related to, I can’t find them. I expect not to be seeing “is xor distributive over addition” in the search console in future.

Complement Argument

Would XOR distribute over the addition of a number and its logical complement? If it did, we would have y ^ (x + ~x) = (y ^ x) + (y ^ ~x) for any y. Since x + ~x = -1, the left hand side is y ^ -1 = ~y, which has none of the bits of y set. On the right hand side, y ^ ~x = ~(y ^ x), and z + ~z = -1 for any z, so all of the bits of (y ^ x) + (y ^ ~x) are set. Any nonzero y is therefore a counterexample, and we have a contradiction.

Left Shift Argument

Addition carries left when the bits in the same position are both set, so x + x is equivalent to x << 1 (ignoring overflow). If XOR distributed over addition, we would have, for any integer y, y ^ (x + x) = (y ^ x) + (y ^ x) = (y ^ x) << 1. Consider the position p of the rightmost set bit of x, and its left neighbour at position p + 1. On the LHS, bit p + 1 of y ^ (x << 1) is set if and only if bit p + 1 of y is unset, because bit p + 1 of x << 1 is set. On the RHS, bit p + 1 of (y ^ x) << 1 equals bit p of y ^ x, which is set if and only if bit p of y is unset. So the two sides disagree whenever bits p and p + 1 of y differ, which constructs counterexamples at will: the proposition is not generally true.
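Concretely, choosing y so that the bits either side of x’s rightmost set bit differ:


    int x = 0b100;   // rightmost set bit at p = 2
    int y = 0b1000;  // bit 3 is set, bit 2 is not: bits p and p + 1 of y differ
    System.out.println(y ^ (x + x));        // 8 ^ 8 = 0
    System.out.println((y ^ x) + (y ^ x));  // 12 + 12 = 24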

Incidental Similarity

I recently saw an interesting class, BitVector, in Apache Arrow, which represents a column of bits, providing minimal or zero copy distribution. The implementation is similar to a bitset but backed by a byte[] rather than a long[]. Given the coincidental similarity in implementation, it’s tempting to look at this, extend its interface and try to use it as a general purpose, distributed bitset. Could this work? Why not just implement some extra methods? Fork it on Github!

This post details the caveats of adapting an abstraction beyond its intended purpose: a scenario where generic bitset capabilities are added to BitVector without due consideration, examined through the lens of performance. It runs into the observable effect of word widening on throughput, given the constraints imposed by JLS 15.22, and in the end the only remedy is to use a long[], sacrificing the original zero copy design goal. I hope this is a fairly self-contained example of how uncontrolled adaptation can be hostile to the original design goals: having the source code isn’t reason enough to modify it.

Checking bits

How fast is it to check whether the bit at index i is set? BitVector implements this functionality, and was designed for it. It can be measured with JMH by generating a random long[], creating a byte[] 8x longer with identical bits, and measuring the throughput of checking the value of the bit at random indices. It turns out that if all you want to do is access bits, a byte[] isn’t such a bad choice, and if those bytes are coming directly from the network, it could even be a great choice. I ran the benchmark below and saw that the two operations perform similarly (within measurement error).


@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class BitSet {

    @Param({"1024", "2048", "4096", "8192"})
    int size;

    private long[] leftLongs;
    private long[] rightLongs;
    private long[] differenceLongs;
    private byte[] leftBytes;
    private byte[] rightBytes;
    private byte[] differenceBytes;

    @Setup(Level.Trial)
    public void init() {
        this.leftLongs = createLongArray(size);
        this.rightLongs = createLongArray(size);
        this.differenceLongs = new long[size];
        this.leftBytes = makeBytesFromLongs(leftLongs);
        this.rightBytes = makeBytesFromLongs(rightLongs);
        this.differenceBytes = new byte[size * 8];
    }

    @Benchmark
    public boolean CheckBit_LongArray() {
        int index = index();
        return (leftLongs[index >>> 6] & (1L << index)) != 0;
    }

    @Benchmark
    public boolean CheckBit_ByteArray() {
        int index = index();
        return ((leftBytes[index >>> 3] & 0xFF) & (1 << (index & 7))) != 0;
    }

    private int index() {
        return ThreadLocalRandom.current().nextInt(size * 64);
    }

    private static byte[] makeBytesFromLongs(long[] array) {
        byte[] bytes = new byte[8 * array.length];
        for (int i = 0; i < array.length; ++i) {
            long word = array[i];
            // little-endian packing, so that bit j of the byte[] is bit j of the long[]
            bytes[8 * i]     = (byte) word;
            bytes[8 * i + 1] = (byte) (word >>> 8);
            bytes[8 * i + 2] = (byte) (word >>> 16);
            bytes[8 * i + 3] = (byte) (word >>> 24);
            bytes[8 * i + 4] = (byte) (word >>> 32);
            bytes[8 * i + 5] = (byte) (word >>> 40);
            bytes[8 * i + 6] = (byte) (word >>> 48);
            bytes[8 * i + 7] = (byte) (word >>> 56);
        }
        return bytes;
    }
}

Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size
CheckBit_ByteArray | thrpt | 1 | 10 | 174.421170 | 1.583275 | ops/us | 1024
CheckBit_ByteArray | thrpt | 1 | 10 | 173.938408 | 1.445796 | ops/us | 2048
CheckBit_ByteArray | thrpt | 1 | 10 | 172.522190 | 0.815596 | ops/us | 4096
CheckBit_ByteArray | thrpt | 1 | 10 | 167.550530 | 1.677091 | ops/us | 8192
CheckBit_LongArray | thrpt | 1 | 10 | 171.639695 | 0.934494 | ops/us | 1024
CheckBit_LongArray | thrpt | 1 | 10 | 169.703960 | 2.427244 | ops/us | 2048
CheckBit_LongArray | thrpt | 1 | 10 | 169.333360 | 1.649654 | ops/us | 4096
CheckBit_LongArray | thrpt | 1 | 10 | 166.518375 | 0.815433 | ops/us | 8192

To support this functionality alone, there’s no reason to prefer either representation, and it must be very appealing to use the bytes as they are delivered from the network, avoiding copying costs. Given that, for a database column, this is the only operation needed, and that Apache Arrow has a stated aim to copy data as little as possible, this seems like quite a good decision.

Logical Conjugations

But what happens if you try to add a logical operation to BitVector, such as an XOR? We need to handle the fact that bytes are signed, and that, according to the JLS, they must be widened to int, with sign extension, before any logical operation. Sign extension would break the bitset, so extra masking operations are required to keep the eighth bit of each byte in its right place. With the widening and its associated workarounds, the byte[] is suddenly a much poorer choice than a long[], and it shows in benchmarks.


    @Benchmark
    public void Difference_ByteArray(Blackhole bh) {
        for (int i = 0; i < leftBytes.length && i < rightBytes.length; ++i) {
            differenceBytes[i] = (byte)((leftBytes[i] & 0xFF) ^ (rightBytes[i] & 0xFF));
        }
        bh.consume(differenceBytes);
    }

    @Benchmark
    public void Difference_LongArray(Blackhole bh) {
        for (int i = 0; i < leftLongs.length && i < rightLongs.length; ++i) {
            differenceLongs[i] = leftLongs[i] ^ rightLongs[i];
        }
        bh.consume(differenceLongs);
    }

Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size
Difference_ByteArray | thrpt | 1 | 10 | 0.805872 | 0.038644 | ops/us | 1024
Difference_ByteArray | thrpt | 1 | 10 | 0.391705 | 0.017453 | ops/us | 2048
Difference_ByteArray | thrpt | 1 | 10 | 0.190102 | 0.008580 | ops/us | 4096
Difference_ByteArray | thrpt | 1 | 10 | 0.169104 | 0.015086 | ops/us | 8192
Difference_LongArray | thrpt | 1 | 10 | 2.450659 | 0.094590 | ops/us | 1024
Difference_LongArray | thrpt | 1 | 10 | 1.047330 | 0.016898 | ops/us | 2048
Difference_LongArray | thrpt | 1 | 10 | 0.546286 | 0.014211 | ops/us | 4096
Difference_LongArray | thrpt | 1 | 10 | 0.277378 | 0.015663 | ops/us | 8192

This is a fairly crazy slowdown. Why? You need to look at the assembly generated in each case. For long[] it’s demonstrable that the logical operations do vectorise. The JLS, specifically section 15.22, doesn’t really give the byte[] implementation a chance: it states that for logical operations, sub-dword primitive types must be promoted or widened before the operation. This means that if one were to try to implement this operation with, say, AVX2, using 256 bit ymmwords each consisting of 32 bytes, then each ymmword would have to be inflated by a factor of four to hold the widened ints: it gets complicated quickly, given this constraint. Despite that complexity, I was surprised to see that C2 does use 128 bit xmmwords, but it’s not as fast as using the full 256 bit registers available. This can be seen by printing out the emitted assembly as usual.

movsxd  r10,ebx     

vmovq   xmm2,mmword ptr [rsi+r10+10h]

vpxor   xmm2,xmm2,xmmword ptr [r8+r10+10h]

vmovq   mmword ptr [rax+r10+10h],xmm2

Vectorised Logical Operations in Java 9

This is a short post for my own reference, since I feel I have already done the topic of “does Java 9 use AVX for this?” to death. Cutting to the chase, Java 9 autovectorises loops which compute logical ANDs, XORs, ORs and ANDNOTs between arrays, making use of the instructions VPXOR, VPOR and VPAND. You can replicate this by running the code at github.

XOR


    @Benchmark
    public long[] xor(LongData state) {
        long[] result = new long[state.data1.length];
        long[] data1 = state.data1;
        long[] data2 = state.data2;
        for (int i = 0; i < data1.length && i < data2.length; ++i) {
            result[i] = data1[i] ^ data2[i];
        }
        return result;
    }

vmovdqu ymm0,ymmword ptr [r10+r13*8+10h]

vpxor   ymm0,ymm0,ymmword ptr [rbx+r13*8+10h]

vmovdqu ymmword ptr [rax+r13*8+10h],ymm0

OR


    @Benchmark
    public long[] or(LongData state) {
        long[] result = new long[state.data1.length];
        long[] data1 = state.data1;
        long[] data2 = state.data2;
        for (int i = 0; i < data1.length && i < data2.length; ++i) {
            result[i] = data1[i] | data2[i];
        }
        return result;
    }

vmovdqu ymm0,ymmword ptr [r10+rsi*8+30h]
 
vpor    ymm0,ymm0,ymmword ptr [rbx+rsi*8+30h]

vmovdqu ymmword ptr [rax+rsi*8+30h],ymm0

AND


    @Benchmark
    public long[] and(LongData state) {
        long[] result = new long[state.data1.length];
        long[] data1 = state.data1;
        long[] data2 = state.data2;
        for (int i = 0; i < data1.length && i < data2.length; ++i) {
            result[i] = data1[i] & data2[i];
        }
        return result;
    }

vmovdqu ymm0,ymmword ptr [r10+r13*8+10h]

vpand   ymm0,ymm0,ymmword ptr [rbx+r13*8+10h]

vmovdqu ymmword ptr [rax+r13*8+10h],ymm0

ANDNOT


    @Benchmark
    public long[] andNot(LongData state) {
        long[] result = new long[state.data1.length];
        long[] data1 = state.data1;
        long[] data2 = state.data2;
        for (int i = 0; i < data1.length && i < data2.length; ++i) {
            result[i] = data1[i] & ~data2[i];
        }
        return result;
    }

vpunpcklqdq xmm0,xmm0,xmm0

vinserti128 ymm0,ymm0,xmm0,1h

vmovdqu ymm1,ymmword ptr [rbx+r13*8+10h]

vpxor   ymm1,ymm1,ymm0

vpand   ymm1,ymm1,ymmword ptr [r10+r13*8+10h]

vmovdqu ymmword ptr [rax+r13*8+10h],ymm1

Blocked Signatures

The interesting thing about BitFunnel, the search architecture used by Bing, is its unorthodoxy; it revisits many ideas ignored by contemporary search technologies. The paper’s references go back to the 80s, when one million records was considered an enormous database. Something about this appeals to me, that there might be long forgotten ideas waiting to be reinterpreted and recombined in the context of modern micro-architectures.

A bit-sliced signature arrangement reduces the number of words which must be processed to evaluate a query, but it’s not enough for BitFunnel’s purposes. The last piece of background in the BitFunnel paper is blocked signatures, which are discussed in the 1990 paper A signature file scheme based on multiple organizations for indexing very large text databases by Kent, Sacks-Davis, Ramamohanarao (KS-DR). Blocked signatures further reduce the amount of processing per query, at what can be an acceptable false positive rate. In this post I aim to piece their data structure together in modern Java.

Formulation

The goal is to map documents into blocks consisting of a fixed number of documents (referred to as the blocking factor in the BitFunnel paper) so only bit sliced block signatures need be stored, where a block signature is a bloom filter of the terms in a block of documents. There are a variety of ways of doing this but they all start with assigning an integer label to each document prior to block assignment. This topic is covered at length in KS-DR.

The most obvious technique is to assign contiguous ranges of document IDs to blocks of size N, that is the function i -> Math.floorDiv(i, N). This is only useful if blocks of document signatures are also stored, acting as a top level index into those document signatures. Physically, there is a block index, which is the bit sliced signature of the terms in each block, and separate blocks of document signatures, again bit sliced by term. Queries can be evaluated by constructing a bloom filter from the query’s terms, specifying a set of bit slices in the block index to inspect and intersect. The result of the intersection gives the IDs of the document signature blocks to query. This is like a two level tree, and is better than a signature scan over all the document signatures, but why not just bit slice the document signatures? For rare terms, once a block is located, it does cut down the number of words in each slice to be intersected. However, the real problem with this technique, besides storage, is the cost of a false match at the block level: it results in a document level query, touching N bits, but yields nothing. The BitFunnel blocked signatures generalise this two level hierarchical arrangement for multiple levels.

This post goes on a tangent from BitFunnel here, focusing on the ideas put forward in KS-DR. An alternative is to choose a number M > N coprime to C, an estimate of the capacity of the index, and use the function i -> Math.floorDiv(M * i % C, N) to permute records prior to blocking, then make a copy of the block index for each of several values of M. If you choose, say, two values of M, when evaluating queries you can map the query terms and get the matching blocks from each representation as before. There is no need for document level queries or storage though: if you have a bitmap of the document IDs (not the signatures) for each block, you can intersect the document bitmaps to get the document IDs matching the query (with false positives, the number of which reduces with the number of copies). In the KS-DR paper, the bitmap whose existence I assume here is actually computed on the fly, via an expensive reverse mapping, with the help of a lookup table.
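A sketch of the two labelling schemes side by side, with N, M and C as above (the values here are illustrative, and the long arithmetic just avoids overflow in the permutation):


    int N = 64;              // blocking factor: documents per block
    int C = 1 << 20;         // estimate of the capacity of the index
    int M = 1_000_003;       // greater than N and coprime to C
    IntUnaryOperator contiguous = i -> Math.floorDiv(i, N);
    IntUnaryOperator permuted = i -> Math.floorDiv((int) (((long) M * i) % C), N);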

Java Implementation

The code is very similar to the bit sliced signature code, because a significant part of querying is a bit sliced lookup of block IDs, which requires storage of a bit matrix. The major difference is the requirement for block assignment and ultimately block intersection. I encapsulate this in a BlockSet which contains Blocks and is responsible for block assignment and intersection.

Details of block creation (blocking factor, bit matrix dimensions, hashing policy) can be hidden behind a supplier interface.


public class BlockFactory<D extends Supplier<Set<T>> & IntSupplier, T, Q extends Set<T>> implements Supplier<Block<D, T, Q>> {

    private final List<ToIntFunction<T>> hashes;
    private final int blockingFactor;
    private final int blockCapacity;

    public BlockFactory(List<ToIntFunction<T>> hashes, int blockingFactor, int blockCapacity) {
        this.hashes = hashes;
        this.blockingFactor = blockingFactor;
        this.blockCapacity = blockCapacity;
    }

    @Override
    public Block<D, T, Q> get() {
        return new Block<>(blockingFactor, blockCapacity, hashes);
    }
}

This gives us blocks, where a Block is really just a wrapper around a bit matrix of terms and a bit set of document IDs. It can do three things:

  1. Index a document; this requires that it knows the blocking factor (the number of documents it can index), the hash functions and the bloom filter size.
  2. Check whether the block might contain at least one document matching all of the terms.
  3. Share its document IDs.

The code looks quite similar to my previous bit sliced signature implementation.


public class Block<D extends Supplier<Set<T>> & IntSupplier, T, Q extends Set<T>> {

    private final BitSet documentIds;
    private final long[][] bitMatrix;
    private final int capacity;
    private final List<ToIntFunction<T>> hashFunctions;
    private int docIndex = 0;

    public Block(int blockingFactor, int capacity, List<ToIntFunction<T>> hashFunctions) {
        assert Integer.bitCount(capacity) == 1;
        this.documentIds = new BitSet();
        this.bitMatrix = new long[capacity][Math.max(1, blockingFactor >>> 6)]; // one row per bloom filter bit, one word per 64 documents
        this.capacity = capacity;
        this.hashFunctions = hashFunctions;
    }

    public void add(D doc) {
        int docIndex = this.docIndex++;
        int docWordIndex = docIndex >>> 6;
        long docWord = 1L << docIndex;
        mapTerms(doc.get()).forEach(r -> bitMatrix[r][docWordIndex] |= docWord);
        documentIds.set(doc.getAsInt());
    }

    public void contribute(BitSet result) {
        result.or(documentIds);
    }

    public boolean matches(Q query) {
        int[] rows = mapTerms(query).distinct().toArray();
        return IntStream.range(0, bitMatrix[0].length)
                        .anyMatch(offset -> hasMatch(rows, offset));
    }

    private boolean hasMatch(int[] rows, int offset) {
        long word = -1L; // intersect the query's rows: every term must be present
        for (int i = 0; i < rows.length && word != 0; ++i) {
            word &= bitMatrix[rows[i]][offset];
        }
        return word != 0;
    }

    private IntStream mapTerms(Set<T> terms) {
        return terms.stream().flatMapToInt(t -> hashFunctions.stream().mapToInt(f -> mapHash(f.applyAsInt(t))));
    }

    private int mapHash(int hash) {
        return hash & (capacity - 1); // capacity is a power of 2
    }
}

Now a level up. A BlockSet contains the blocks for a single permutation: it is responsible for assigning documents to blocks and, when evaluating a query, for retrieving all of the blocks which probably match it. A BlockIndex will then have a BlockSet for each relatively prime factor.


public class BlockSet<D extends Supplier<Set<T>> & IntSupplier, T, Q extends Set<T>> {

    private final Block[] blocks;
    private final Supplier<Block<D, T, Q>> newBlock;
    private final int blockingFactor;
    private final int estimatedCapacity;
    private final int prime;

    public BlockSet(Supplier<Block<D, T, Q>> newBlock, int blockingFactor, int estimatedCapacity, int prime) {
        assert Integer.bitCount(blockingFactor) == 1 && Integer.bitCount(estimatedCapacity) == 1;
        this.newBlock = newBlock;
        this.blocks = new Block[estimatedCapacity/blockingFactor];
        for (int i = 0; i < blocks.length; ++i) {
            blocks[i] = newBlock.get();
        }
        this.blockingFactor = blockingFactor;
        this.estimatedCapacity = estimatedCapacity;
        this.prime = prime;
    }

    public void add(D doc) {
        int blockIndex = blockIndex(doc.getAsInt());
        Block<D, T, Q> block = (Block<D, T, Q>)blocks[blockIndex];
        block.add(doc);
    }

    public BitSet query(Q query) {
        BitSet result = new BitSet();
        Arrays.stream(blocks)
              .filter(b -> b.matches(query))
              .forEach(b -> b.contribute(result));
        return result;
    }

    private int blockIndex(int value) {
        return ((value * prime) & (estimatedCapacity - 1)) / blockingFactor;
    }
}

The BlockIndex is the tip of the iceberg: it just needs to intersect the bit sets of document IDs contributed by its BlockSets.


public class BlockIndex<D extends Supplier<Set<T>> & IntSupplier, T, Q extends Set<T>> {

    private final List<BlockSet<D, T, Q>> blockSets;

    public BlockIndex(List<BlockSet<D, T, Q>> blockSets) {
        this.blockSets = blockSets;
    }

    public IntStream query(Q query) {
        BitSet result = null;
        for (BlockSet<D, T, Q> blockSet : blockSets) {
            BitSet docIds = blockSet.query(query);
            if (null == result) {
                result = docIds;
            } else {
                result.and(docIds);
            }
        }
        return null == result ? IntStream.of() : result.stream();
    }
}

This code is obviously experimental, and a problem with it as it stands is the memory consumption of the temporary bit sets. A better choice of bit set, albeit from outside the JDK, is RoaringBitmap.

Blocked Signatures in BitFunnel

Blocked signatures are a very old idea, but a few innovations are reported in the BitFunnel data structure. BitFunnel uses multiple levels with different blocking factors, each of which must be a proper power of 2, rather than multiple factors coprime to the estimated capacity at a single level. Each level has rank = log(blockingFactor), so the effect in BitFunnel is to have several levels of blocking density. Blocks from different levels can be intersected efficiently by transforming dense blocks to rank zero (the least dense representation) prior to intersection.

Bit-Sliced Signatures and Bloom Filters

While the inverted index is a familiar indexing structure for many with a casual interest in information retrieval, the concept of a document signature may not be. The idea seems to have a long history, having been popular until it lost favour in the late 90s, owing to file sizes and performance empirically inferior to inverted indices. However, there has been new research into the use of signatures for search problems in the last decade, for instance, TopSig, and they are an important building block of the BitFunnel data structure.

Information Retrieval with Signatures

An m bit signature of a document consists of a bit array of fixed length m containing the (not guaranteed to be disjoint) union of the signatures of the terms found in the document. The signature of a term t is the m bit array with each jth bit set where j = hash(k, t) mod m, for each of k hash functions. This is obviously a Bloom filter by another name. In a recent post I noted that a Bloom filter’s behaviour can be implemented efficiently in less than ten lines of Java code.

In an information retrieval setting, a document signature would be computed for each document in a corpus by tokenising each document and building a Bloom filter from the extracted terms. To retrieve the documents matching a given query, the query signature (which is the union of the signatures of the query’s terms) is computed. In what I will call the signature scan phase of the retrieval, the document signatures are iterated over; each document being included if its signature contains all the bits in the query signature. The result set will contain false positives but should also be quite small, so a cheap filter for correctness is executed after the signature scan is complete.
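The inclusion check at the heart of the signature scan is simple bitwise containment; a minimal sketch over word-aligned signatures:


    // true if every bit set in the query signature is also set in the document signature
    static boolean contains(long[] documentSignature, long[] querySignature) {
        for (int i = 0; i < querySignature.length; ++i) {
            if ((documentSignature[i] & querySignature[i]) != querySignature[i]) {
                return false;
            }
        }
        return true;
    }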

While the hash function and ordering of the documents could be controlled creatively to cut down the number of signatures to be inspected for a given query, this approach is obviously inefficient, despite the appeal of low level bit-wise instructions. What happens when there are several billion documents in a corpus?

Parallelism and Bit-Slicing

Threads and SIMD instructions could be thrown at the signature scan to speed it up, but one approach to parallelising the retrieval is to increase the number of documents represented by each processed word. If there are n documents with m bit signatures, the signature list is an n * m bit matrix. Its transpose, an m * n bit matrix, is referred to as a bit sliced signature. When querying the bit sliced signature, only the rows specified in the query need be accessed and intersected, and each word represents the presence of a term in up to 64 documents. This is a very old technique – the earliest formulation of a variant of this data structure I could find, where it was referred to as superimposed coding, was published in 1984, but references implementations from the 60s. An accessible evaluation was published in 1994.

Java Implementation

An implementation of such a structure for an immutable corpus is trivial but informative. Typically terms will be strings, but needn’t be, whereas documents can be reduced to a set of terms. Using ToIntFunction to abstract hash functions again, we just need to map all term-document tuples into the bits of a long[][]. When querying the structure, we need to map the query’s terms into a distinct sequence of integers, determining which rows of the bit matrix to access.

On the way in, the hash functions are used to calculate row indices, rather than the bit to set, which is constant for each document. For each term, for each hash function, the appropriate row index is calculated and the document’s bit is set in the corresponding array. Clean Java would do this outside of the constructor, of course.


public class BitSlicedSignature<D extends Supplier<Set<T>>, T, Q extends Set<T>> {

    private final long[][] bitMatrix;
    private final int width;
    private final int height;
    private final List<ToIntFunction<T>> hashFunctions;

    public BitSlicedSignature(List<D> documents,
                              List<ToIntFunction<T>> hashFunctions,
                              int height) {
        this.hashFunctions = hashFunctions;
        this.width = (documents.size() + 63) / 64;
        this.height = height;
        this.bitMatrix = new long[height][width];
        int docIndex = 0;
        for (D doc : documents) {
            int docWordIndex = docIndex >> 6; // which word of each row holds this document's bit
            long docWord = 1L << docIndex;    // << uses only the low six bits of docIndex
            for (T term : doc.get()) {
                for (ToIntFunction<T> hash : hashFunctions) {
                    int row = mapHash(hash.applyAsInt(term));
                    bitMatrix[row][docWordIndex] |= docWord;
                }
            }
            ++docIndex;
        }
    }

    private int mapHash(int hash) {
        return Math.abs(hash % height);
    }
}

To query the structure, the query is mapped into row indices and the corresponding rows are intersected word by word; matching document IDs are emitted lazily as an IntStream. The appeal of doing this lazily is that we should expect there to be a lot of documents, and this way the bit-wise intersections can be done in chunks, as and when the caller wants more documents. This can be achieved with the help of two utility methods:


    public IntStream query(Q query) {
        int[] rows = query.stream()
                          .flatMapToInt(t -> hashFunctions.stream().mapToInt(h -> mapHash(h.applyAsInt(t))))
                          .distinct()
                          .toArray();
        return IntStream.range(0, width).flatMap(i -> bitsOf(intersection(rows, i), i));
    }

    private long intersection(int[] rows, int offset) {
        long word = -1L;
        // intersect the query's rows word by word, bailing out as soon as no document can match
        for (int i = 0; i < rows.length && word != 0; ++i) {
            word &= bitMatrix[rows[i]][offset];
        }
        return word;
    }

    private static IntStream bitsOf(long word, int offset) {
        // translate the set bits of this word back into document IDs
        return IntStream.range(0, Long.SIZE)
                        .filter(i -> (1L << i & word) != 0)
                        .map(i -> Long.SIZE * offset + i);
    }

As you can probably see, you can leave vast swathes of the long[][] untouched, assuming the query is for a small number of terms. A more sophisticated implementation might partition the documents into several bit matrices.
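For illustration, constructing and querying the structure might look like the snippet below; the corpus, hash functions and height are all hypothetical:


List<Supplier<Set<String>>> documents = List.of(
        () -> Set.of("bit", "sliced", "signature"),
        () -> Set.of("bloom", "filter"),
        () -> Set.of("bit", "funnel"));
List<ToIntFunction<String>> hashFunctions = List.of(
        String::hashCode,
        s -> 31 * s.hashCode() + 7);
BitSlicedSignature<Supplier<Set<String>>, String, Set<String>> index =
        new BitSlicedSignature<>(documents, hashFunctions, 64);
// documents 0 and 2 must match; document 1 may appear as a false positive
int[] candidates = index.query(Set.of("bit")).toArray();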

Shortcomings

There are some obvious problems with this data structure. Firstly, a long[][] uses the same amount of memory whether its bits are set or not. What happens when you have some small documents and lots of terms? You get columns in the bit matrix where most of the bits are zero – it’s likely that a compressed bit set would be preferable. Similarly, very common terms produce long horizontal runs of set bits.

Even worse, what happens when a term is very rare? If you are providing a search service, it’s likely you only ever need to provide a page of results at a time. If the term is rare enough, you may need to scan the entire row to fill a page, which could take a long time. To get around that, BitFunnel uses bit-sliced block signatures, which I will write about in the next post.

Building a Bloom Filter from Scratch

The Bloom filter is an interesting data structure for modelling approximate sets of objects. It can tell you with certainty that an object is not in a collection, but may give false positives. If you write Java for a living, I would not suggest you implement your own, because there is a perfectly good implementation in Guava. It can be illuminating to write one from scratch, however.
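For comparison, using Guava’s com.google.common.hash.BloomFilter looks roughly like this; the expected insertion count and false positive rate below are arbitrary:


BloomFilter<String> filter = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000, 0.01);
filter.put("spliterator");
filter.mightContain("spliterator"); // always true
filter.mightContain("collector");   // true with probability around 0.01


Guava derives the bit array size and the number of hash functions from those two parameters, which is usually what you want.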

Interface

You can do two things with a Bloom filter: put things in it, and check if the filter probably contains items. This leads to the following interface:


public interface BloomFilter<T> {
    void add(T value);

    boolean mightContain(T value);
}

Java Bloom Filter Implementation

A Bloom filter represents a set of objects quite succinctly as a bit array of finite length. Unlike a bitset, the bit array encodes hashes of the objects rather than object identities. In practice, you need multiple hash functions, which, modulo the capacity of the bit array, collide only with low probability. The more hashes you have, the slower insertions and lookups will be. There is also a space trade-off for improved accuracy: the larger the array, the less likely collisions of hashes modulo the capacity will be.

Since you probably want to control the hash functions you use, the ToIntFunction interface is a natural abstraction. You can set this up simply with a builder.


    public static <T> Builder<T> builder() {
        return new Builder<>();
    }

    public static class Builder<T> {
        private int size;
        private List<ToIntFunction<T>> hashFunctions;

        public Builder<T> withSize(int size) {
            assert Integer.bitCount(size) == 1 : "size must be a power of two";
            this.size = size;
            return this;
        }

        public Builder<T> withHashFunctions(List<ToIntFunction<T>> hashFunctions) {
            this.hashFunctions = hashFunctions;
            return this;
        }

        public BloomFilter<T> build() {
            // size bits need only size / 64 words (at least one), not size words
            return new BloomFilter<>(new long[Math.max(1, size >>> 6)], size, hashFunctions);
        }
    }

    private final long[] array;
    private final int size;
    private final List<ToIntFunction<T>> hashFunctions;

    public BloomFilter(long[] array, int logicalSize, List<ToIntFunction<T>> hashFunctions) {
        this.array = array;
        this.size = logicalSize;
        this.hashFunctions = hashFunctions;
    }

    private int mapHash(int hash) {
        return hash & (size - 1); // size is a power of two, so this is hash mod size
    }

Adding Values

To add a value, you need to take an object, and for each hash function hash it modulo the capacity of the bit array. Using a long[] as the substrate of the bit set, you must locate the word each hash belongs to, and set the bit corresponding to the remainder of the hash modulo 64. You can do this quickly as follows:


    public void add(T value) {
        for (ToIntFunction<T> function : hashFunctions) {
            int hash = mapHash(function.applyAsInt(value));
            array[hash >>> 6] |= 1L << hash; // << uses only the low six bits of hash
        }
    }

Note that each bit may already be set.

Querying the bit set

To check if an object is contained in the bitset, you need to evaluate the hashes again, and find the corresponding words. You can return false if the intersection of the appropriate word and the bit corresponding to the remainder modulo 64 is zero. That looks like this:


    public boolean mightContain(T value) {
        for (ToIntFunction<T> function : hashFunctions) {
            int hash = mapHash(function.applyAsInt(value));
            if ((array[hash >>> 6] & (1L << hash)) == 0) {
                return false; // this bit was never set, so the value was never added
            }
        }
        return true;
    }

Note that a true result absolutely does not mean the object is contained within the set, but you can get more precise results if you are willing to perform more hash evaluations, and if you are willing to use a larger bit array. Modelling the precise false positive rate is not clear cut.
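To tie the pieces together, here is a hypothetical end-to-end use of the builder above, with the textbook estimate of the false positive rate left in the comments:


BloomFilter<String> filter = BloomFilter.<String>builder()
        .withSize(1 << 16)
        .withHashFunctions(List.of(String::hashCode, s -> 31 * s.hashCode() + 17))
        .build();
filter.add("spliterator");
filter.mightContain("spliterator"); // always true
filter.mightContain("iterator");    // false with high probability

// the textbook approximation, which assumes independent, uniformly distributed hashes:
// p ≈ (1 - e^(-kn/m))^k for k hash functions, n insertions and m bits of array;
// real hash functions are correlated, so measured rates can be worse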

BitFunnel

Just as is the case for bitmap indices, bit slicing is a useful enhancement for Bloom filters, forming the basis of BitFunnel. In a follow-up post I will share a simplified implementation of a bit sliced Bloom filter.