Beware Collection Factory Methods

I saw an interesting tweet referencing a Github issue where the impact of including an (in my view) unnecessary implementation of the List interface impacted inlining decisions, causing 20x degradation in throughput. Guava’s ImmutableList is my favourite class to seek and destroy because of the way it is often used – it tends to be associated with unnecessary copying where encapsulation would be a better solution. I had assumed performance gains won from finding and deleting all the instances of ImmutableList had been thanks to relieving the garbage collector from medieval torture. The performance degradation observed in the benchmark is caused by use of ImmutableList, along with all its subclasses, alongside ArrayList, making calls to List bimorphic at best, causing the JIT compiler to generate slower code. I may have inadvertently profited from better inlining in the past simply by removing as many ImmutableLists as possible!

This post doesn’t go into any details about the various mechanisms of method dispatch, and if you want to understand the impact of polymorphism on inlining, bookmark Aleksey Shipilev’s authoritative post and read it when you have some time to really concentrate.

Without resorting to using LinkedList, is it possible to contrive cases in Java 9 where performance is severely degraded by usages of Collections.unmodifiableList and List.of factory methods? Along with ArrayList, these are random access data structures so this should highlight the potential performance gains inlining can give.

The methodology is very simple: I randomly vary the List implementation and plug it into the same algorithm. It is cruder than you would see in Aleksey Shipilev’s post because I’ve targeted only the worst case by creating equal bias between implementations. Aleksey demonstrates that inlining decisions are statistical and opportunistic (the JIT can guess and later deoptimise), and if 90% of your call sites dispatch to the same implementation, it doesn’t matter as much as when the choice is made uniformly. It will vary from application to application, but it could easily be as bad as the case I present if List is used polymorphically.

I created five benchmarks which produce the same number, the same way. Three of these benchmarks only ever call into a single implementation of List and will be inlined monomorphically. One benchmark only ever calls into List.of and ArrayList, then one benchmark randomly chooses a list for each invocation. The difference is stark. You can really screw up performance by making the methods on List megamorphic.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit
sumLength_ArrayList thrpt 1 10 70.612067 1.642255 ops/us
sumLength_Factory thrpt 1 10 68.739411 9.440799 ops/us
sumLength_Random2 thrpt 1 10 35.575897 0.749423 ops/us
sumLength_Random3 thrpt 1 10 10.858404 0.435359 ops/us
sumLength_Unmodifiable thrpt 1 10 61.855270 1.701163 ops/us


@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class MegamorphicList {

    private List<String>[] strings;

    @Setup(Level.Trial)
    public void init() {
        strings = new List[]{getArrayList(6), getFactoryList6(), getUnModifiableList(6)};
    }

    @Benchmark
    public int sumLength_ArrayList() {
        List<String> list = strings[0];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole;
    }

    @Benchmark
    public int sumLength_Factory() {
        List<String> list = strings[1];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole;
    }

    @Benchmark
    public int sumLength_Unmodifiable() {
        List<String> list = strings[2];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole;
    }

    @Benchmark
    public int sumLength_Random2() {
        List<String> list = strings[ThreadLocalRandom.current().nextInt(2)];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole;
    }

    @Benchmark
    public int sumLength_Random3() {
        List<String> list = strings[ThreadLocalRandom.current().nextInt(3)];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole;
    }

    private List<String> getUnModifiableList(int size) {
        return Collections.unmodifiableList(getArrayList(size));
    }

    private List<String> getFactoryList6() {
        return List.of(randomString(),
                       randomString(),
                       randomString(),
                       randomString(),
                       randomString(),
                       randomString()
                );
    }

    private List<String> getArrayList(int size) {
        List<String> list = new ArrayList<>();
        for (int i = 0; i < size; ++i) {
            list.add(randomString());
        }
        return list;
    }

    private String randomString() {
        return new String(DataUtil.createByteArray(ThreadLocalRandom.current().nextInt(10, 20)));
    }

}

I didn’t even bother to use factory methods with different arity, because three is the magic number. Syntactic sugar is nice, but use them with caution.

Bit Twiddling Blog Posts

This is a list of blog posts to read if you like bit twiddling and some basic maths. I would like to improve this list over time – if you have a post similar to these please provide a link in the comments.

  1. Table of basic reversible integer operations – Marc B. Reynolds looks at several bijective functions on the integers modulo 32.
  2. Modular multiplicative inverses – multiplication of integers modulo 32 is invertible for odd integers. Daniel Lemire derives this property and constructs an efficient algorithm to compute the modular inverse of an odd integer.
  3. Carryless multiplicative inverse – A guy called Harold demonstrates an algorithm for computing the carryless inverse of integers modulo 32.
  4. Visualizing Addition, Subtraction and Complement – A spatial interpretation of the addition, subtraction and complement operations on bitwise integers. Another Harold special.
  5. XOR rotates and their inverses – Marc B. Reynolds looks at XOR rotates, their inverses and provides a few visualisations.
  6. Is XOR distributive over addition – a simple proof by contradiction that XOR does not distribute over addition.
  7. Parallel prefix/suffix operations – more insight into compression operations from Harold.

Is XOR Distributive over Addition?

Google search console has become a thing of mild interest to me since I moved my website and Google forgot my content existed. Impressions – search terms that match your site but don’t lead to a click – can provide fascinating insights into the Internet habits of others, particularly when the content is completely irrelevant to the query. Impressions are the best place to look for these noisy mismatches. I looked at my top ten impressions and was pleased to see my stalkers (people who type my name into Google) are outnumbered by CBOR enthusiasts. The world needs to must pay more attention to CBOR. These search terms are:

is xor associative over addition?

The highlighted term “is xor distributive over addition” jumped out at me.

Since multiplication obviously does distribute over addition (ignoring overflow), it’s perhaps a reasonable question to ask. To disprove this proposition, it is enough to find a single counterexample (not hard, and much quicker than a google search) but it’s more interesting to find a constructive class of counterexamples. I thought of a few strategies to disprove this, other than picking random numbers and checking, that I thought were worth writing down.

Tangentially, on the topic of Google relevance, this search term had nothing to do with this blog until this post, but when I search for topics I think my posts are related to, I can’t find them. I expect not to be seeing “is xor distributive over addition” in the search console in future.

Complement Argument

Would XOR distribute over the addition of a number and its logical complement? We would have that y ^ (x + ~x) = y ^ x + y ^ ~x for any y. Then we have y ^ -1 = ~y = y ^ x + y ^ ~x. So based on the assumption of distributivity, y ^ x + y ^ ~x must have none of the bits from y. We have a contradiction because it is clear that all of the bits in y ^ x + y ^ ~x are set.

Left Shift Argument

Addition causes digits to carry left when the bits in the same position are both set, so x + x is equivalent to x << 1 (ignoring overflow). If, for any integer y, we had that y ^ (x + x) = y ^ x + y ^ x we can find constraints on this identity by considering either the rightmost unset bit or the leftmost set bit of x in isolation. Considering the rightmost set bit at position p: this bit is set on the LHS of this identity if and only if it is unset in y. On the RHS, it is set iff its leftmost neighbour at position p-1 is unset in y, because we shift y ^ x to the left. So we can construct counterexamples whenever p and p-1 differ, and the proposition is not generally true.

Incidental Similarity

I recently saw an interesting class, BitVector, in Apache Arrow, which represents a column of bits, providing minimal or zero copy distribution. The implementation is similar to a bitset but backed by a byte[] rather than a long[]. Given the coincidental similarity in implementation, it’s tempting to look at this, extend its interface and try to use it as a general purpose, distributed bitset. Could this work? Why not just implement some extra methods? Fork it on Github!

This post details the caveats of trying to adapt an abstraction beyond its intended purpose; in a scenario where generic bitset capabilities are added to BitVector without due consideration, examined through the lens of performance. This runs into the observable effect of word widening on throughput, given the constraints imposed by JLS 15.22. In the end, the only remedy is to use a long[], sacrificing the original zero copy design goal. I hope this is a fairly self-contained example of how uncontrolled adaptation can be hostile to the original design goals: having the source code isn’t enough reason to modify it.

Checking bits

How fast is it to check if the bit at index i is set or not? BitVector implements this functionality, and was designed for it. This can be measured by JMH by generating a random long[] and creating a byte[] 8x longer with identical bits. The throughput of checking the value of the bit at random indices can be measured. It turns out that if all you want to do is access bits, byte[] isn’t such a bad choice, and if those bytes are coming directly from the network, it could even be a great choice. I ran the benchmark below and saw that the two operations are similar (within measurement error).


@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class BitSet {

    @Param({"1024", "2048", "4096", "8192"})
    int size;

    private long[] leftLongs;
    private long[] rightLongs;
    private long[] differenceLongs;
    private byte[] leftBytes;
    private byte[] rightBytes;
    private byte[] differenceBytes;

    @Setup(Level.Trial)
    public void init() {
        this.leftLongs = createLongArray(size);
        this.rightLongs = createLongArray(size);
        this.differenceLongs = new long[size];
        this.leftBytes = makeBytesFromLongs(leftLongs);
        this.rightBytes = makeBytesFromLongs(rightLongs);
        this.differenceBytes = new byte[size * 8];
    }

    @Benchmark
    public boolean CheckBit_LongArray() {
        int index = index();
        return (leftLongs[index >>> 6] & (1L << index)) != 0;
    }

    @Benchmark
    public boolean CheckBit_ByteArray() {
        int index = index();
        return ((leftBytes[index >>> 3] & 0xFF) & (1 << (index & 7))) != 0;
    }

    private int index() {
        return ThreadLocalRandom.current().nextInt(size * 64);
    }

    private static byte[] makeBytesFromLongs(long[] array) {
        byte[] bytes = new byte[8 * array.length];
        for (int i = 0; i < array.length; ++i) {
            long word = array[i];
            bytes[8 * i + 7] = (byte) word;
            bytes[8 * i + 6] = (byte) (word >>> 8);
            bytes[8 * i + 5] = (byte) (word >>> 16);
            bytes[8 * i + 4] = (byte) (word >>> 24);
            bytes[8 * i + 3] = (byte) (word >>> 32);
            bytes[8 * i + 2] = (byte) (word >>> 40);
            bytes[8 * i + 1] = (byte) (word >>> 48);
            bytes[8 * i]     = (byte) (word >>> 56);
        }
        return bytes;
    }
}

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
CheckBit_ByteArray thrpt 1 10 174.421170 1.583275 ops/us 1024
CheckBit_ByteArray thrpt 1 10 173.938408 1.445796 ops/us 2048
CheckBit_ByteArray thrpt 1 10 172.522190 0.815596 ops/us 4096
CheckBit_ByteArray thrpt 1 10 167.550530 1.677091 ops/us 8192
CheckBit_LongArray thrpt 1 10 171.639695 0.934494 ops/us 1024
CheckBit_LongArray thrpt 1 10 169.703960 2.427244 ops/us 2048
CheckBit_LongArray thrpt 1 10 169.333360 1.649654 ops/us 4096
CheckBit_LongArray thrpt 1 10 166.518375 0.815433 ops/us 8192

To support this functionality, there’s no reason to choose either way, and it must be very appealing to use bytes as they are delivered from the network, avoiding copying costs. Given that for a database column, this is the only operation needed, and Apache Arrow has a stated aim to copy data as little as possible, this seems like quite a good decision.

Logical Conjugations

But what happens if you try to add a logical operation to BitVector, such as an XOR? We need to handle the fact that bytes are signed and their sign bit must be preserved in promotion, according to the JLS. This would break the bitset, so extra operations are required to keep the 8th bit in its right place. With the widening and its associated workarounds, suddenly the byte[] is a much poorer choice than a long[], and it shows in benchmarks.


    @Benchmark
    public void Difference_ByteArray(Blackhole bh) {
        for (int i = 0; i < leftBytes.length && i < rightBytes.length; ++i) {
            differenceBytes[i] = (byte)((leftBytes[i] & 0xFF) ^ (rightBytes[i] & 0xFF));
        }
        bh.consume(differenceBytes);
    }

    @Benchmark
    public void Difference_LongArray(Blackhole bh) {
        for (int i = 0; i < leftLongs.length && i < rightLongs.length; ++i) {
            differenceLongs[i] = leftLongs[i] ^ rightLongs[i];
        }
        bh.consume(differenceLongs);
    }

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
Difference_ByteArray thrpt 1 10 0.805872 0.038644 ops/us 1024
Difference_ByteArray thrpt 1 10 0.391705 0.017453 ops/us 2048
Difference_ByteArray thrpt 1 10 0.190102 0.008580 ops/us 4096
Difference_ByteArray thrpt 1 10 0.169104 0.015086 ops/us 8192
Difference_LongArray thrpt 1 10 2.450659 0.094590 ops/us 1024
Difference_LongArray thrpt 1 10 1.047330 0.016898 ops/us 2048
Difference_LongArray thrpt 1 10 0.546286 0.014211 ops/us 4096
Difference_LongArray thrpt 1 10 0.277378 0.015663 ops/us 8192

This is a fairly crazy slow down. Why? You need to look at the assembly generated in each case. For long[] it’s demonstrable that logical operations do vectorise. The JLS, specifically section 15.22, doesn’t really give the byte[] implementation a chance. It states that for logical operations, sub dword primitive types must be promoted or widened before the operation. This means that if one were to try to implement this operation with, say AVX2, using 256 bit ymmwords each consisting of 16 bytes, then each ymmword would have to be inflated by a factor of four: it gets complicated quickly, given this constraint. Despite that complexity, I was surprised to see that C2 does use 128 bit xmmwords, but it’s not as fast as using the full 256 bit registers available. This can be seen by printing out the emitted assembly like normal.

movsxd  r10,ebx     

vmovq   xmm2,mmword ptr [rsi+r10+10h]

vpxor   xmm2,xmm2,xmmword ptr [r8+r10+10h]

vmovq   mmword ptr [rax+r10+10h],xmm2

Vectorised Logical Operations in Java 9

This is a short post for my own reference, since I feel I have already done the topic of does Java 9 use AVX for this? to death. Cutting to the chase, Java 9 autovectorises loops to compute logical ANDs, XORs, ORs and ANDNOTs between arrays, making use of the instructions VPXOR, VPOR and VPAND. You can replicate this by running the code at github.

XOR


    @Benchmark
    public long[] xor(LongData state) {
        long[] result = new long[state.data1.length];
        long[] data1 = state.data1;
        long[] data2 = state.data2;
        for (int i = 0; i < data1.length && i < data2.length; ++i) {
            result[i] = data1[i] ^ data2[i];
        }
        return result;
    }

vmovdqu ymm0,ymmword ptr [r10+r13*8+10h]

vpxor   ymm0,ymm0,ymmword ptr [rbx+r13*8+10h]

vmovdqu ymmword ptr [rax+r13*8+10h],ymm0

OR


    @Benchmark
    public long[] or(LongData state) {
        long[] result = new long[state.data1.length];
        long[] data1 = state.data1;
        long[] data2 = state.data2;
        for (int i = 0; i < data1.length && i < data2.length; ++i) {
            result[i] = data1[i] | data2[i];
        }
        return result;
    }

vmovdqu ymm0,ymmword ptr [r10+rsi*8+30h]
 
vpor    ymm0,ymm0,ymmword ptr [rbx+rsi*8+30h]

vmovdqu ymmword ptr [rax+rsi*8+30h],ymm0

AND


    @Benchmark
    public long[] and(LongData state) {
        long[] result = new long[state.data1.length];
        long[] data1 = state.data1;
        long[] data2 = state.data2;
        for (int i = 0; i < data1.length && i < data2.length; ++i) {
            result[i] = data1[i] & data2[i];
        }
        return result;
    }

vmovdqu ymm0,ymmword ptr [r10+r13*8+10h]

vpand   ymm0,ymm0,ymmword ptr [rbx+r13*8+10h]

vmovdqu ymmword ptr [rax+r13*8+10h],ymm0

ANDNOT


    @Benchmark
    public long[] andNot(LongData state) {
        long[] result = new long[state.data1.length];
        long[] data1 = state.data1;
        long[] data2 = state.data2;
        for (int i = 0; i < data1.length && i < data2.length; ++i) {
            result[i] = data1[i] & ~data2[i];
        }
        return result;
    }

vpunpcklqdq xmm0,xmm0,xmm0

vinserti128 ymm0,ymm0,xmm0,1h

vmovdqu ymm1,ymmword ptr [rbx+r13*8+10h]

vpxor   ymm1,ymm1,ymm0

vpand   ymm1,ymm1,ymmword ptr [r10+r13*8+10h]

vmovdqu ymmword ptr [rax+r13*8+10h],ymm1

Blocked Signatures

The interesting thing about BitFunnel, the search architecture used by Bing, is its unorthodoxy; it revisits many ideas ignored by contemporary search technologies. The paper’s references go back to the 80s, when one million records was considered an enormous database. Something about this appeals to me, that there might be long forgotten ideas waiting to be reinterpreted and recombined in the context of modern micro-architectures.

A bit-sliced signature arrangement reduces the number of words which must be processed to evaluate a query, but it’s not enough for BitFunnel’s purposes. The last piece of background in the BitFunnel paper is blocked signatures, which are discussed in the 1990 paper A signature file scheme based on multiple organizations for indexing very large text databases by Kent, Sacks-Davis, Ramamohanarao (KS-DR). Blocked signatures further reduce the amount of processing per query, at what can be an acceptable false positive rate. In this post I aim to piece their data structure together in modern Java.

Formulation

The goal is to map documents into blocks consisting of a fixed number of documents (referred to as the blocking factor in the BitFunnel paper) so only bit sliced block signatures need be stored, where a block signature is a bloom filter of the terms in a block of documents. There are a variety of ways of doing this but they all start with assigning an integer label to each document prior to block assignment. This topic is covered at length in KS-DR.

The most obvious technique is to assign contiguous ranges of document IDs to blocks of size N, that is the function i -> Math.floorDiv(i, N). This is only useful if blocks of document signatures are also stored, acting as a top level index into those document signatures. Physically, there is a block index, which is the bit sliced signature of the terms in each block, and separate blocks of document signatures, again bit sliced by term. Queries can be evaluated by constructing a bloom filter from the query’s terms, specifying a set of bit slices in the block index to inspect and intersect. The result of the intersection gives the IDs of the document signature blocks to query. This is like a two level tree, and is better than a signature scan over all the document signatures, but why not just bit slice the document signatures? For rare terms, once a block is located, it does cut down the number of words in each slice to be intersected. However, the real problem with this technique, besides storage, is the cost of a false match at the block level: it results in a document level query, touching N bits, but yields nothing. The BitFunnel blocked signatures generalise this two level hierarchical arrangement for multiple levels.

This post goes on a tangent from BitFunnel here, focusing on the ideas put forward in KS-DR. An alternative is to choose a number M > N coprime to C, an estimate of the capacity of the index, and use the function i -> Math.floorDiv(M * i % C, N) to permute records prior to blocking, then make a copy of the block index for each of several values of M. If you choose, say, two values of M, when evaluating queries, you can map the query terms and get the matching blocks from each representation as before. There is no need for a document level query or storage though. If you have a bitmap of the document IDs (not the signatures) for each block, you can intersect the document bitmaps to get the document IDs matching the query (with false positives, the number of which reduces with the number of copies). In the KS-DR paper, this bitmap I assume the existence of is actually computed on the fly via an expensive reverse mapping with the help of a lookup table.

Java Implementation

The code is very similar to the bit sliced signature code, because a significant part of querying is a bit sliced lookup of block IDs, which requires storage of a bit matrix. The major difference is the requirement for block assignment and ultimately block intersection. I encapsulate this in a BlockSet which contains Blocks and is responsible for block assignment and intersection.

Details of block creation (blocking factor, bit matrix dimensions, hashing policy) can be hidden behind a supplier interface.


public class BlockFactory<D extends Supplier<Set<T>> & IntSupplier, T, Q extends Set<T>> implements Supplier<Block<D, T, Q>> {

    private final List<ToIntFunction<T>> hashes;
    private final int blockingFactor;
    private final int blockCapacity;

    public BlockFactory(List<ToIntFunction<T>> hashes, int blockingFactor, int blockCapacity) {
        this.hashes = hashes;
        this.blockingFactor = blockingFactor;
        this.blockCapacity = blockCapacity;
    }

    @Override
    public Block<D, T, Q> get() {
        return new Block<>(blockingFactor, blockCapacity, hashes);
    }
}

This gives us blocks, which is really just a wrapper around a bit matrix of terms and a bit set of document IDs. It can do three things

  1. Index a document, this requires that it knows the blocking factor (the number of blocks it can index), the hash functions and the bloom filter size.
  2. Check if the block might contain at least one document matching all the terms
  3. Share its document IDs

The code looks quite similar to my previous bit sliced signature implementation.


public class Block<D extends Supplier<Set<T>> & IntSupplier, T, Q extends Set<T>> {

    private final BitSet documentIds;
    private final long[][] bitMatrix;
    private final int capacity;
    private final List<ToIntFunction<T>> hashFunctions;
    private int docIndex = 0;

    public Block(int blockingFactor, int capacity, List<ToIntFunction<T>> hashFunctions) {
        assert Integer.bitCount(capacity) == 1;
        this.documentIds = new BitSet();
        this.bitMatrix = new long[capacity >> 6][blockingFactor];
        this.capacity = capacity;
        this.hashFunctions = hashFunctions;
    }

    public void add(D doc) {
        int docIndex = this.docIndex++;
        int docWordIndex = docIndex >>> 6;
        long docWord = 1L << docIndex;
        mapTerms(doc.get()).forEach(r -> bitMatrix[r][docWordIndex] |= docWord);
        documentIds.set(doc.getAsInt());
    }

    public void contribute(BitSet result) {
        result.or(documentIds);
    }

    public boolean matches(Q query) {
        int[] rows = mapTerms(query).distinct().toArray();
        return IntStream.range(0, capacity >> 6)
                        .filter(i -> hasMatch(rows, i))
                        .findFirst()
                        .isPresent();
    }

    private boolean hasMatch(int[] rows, int offset) {
        long word = 0L;
        for (int i = 0; i < rows.length && word == 0; ++i) {
            word |= bitMatrix[rows[i]][offset];
        }
        return word != 0;
    }

    private IntStream mapTerms(Set<T> terms) {
        return terms.stream().flatMapToInt(t -> hashFunctions.stream().mapToInt(f -> mapHash(f.applyAsInt(t))));
    }

    private int mapHash(int hash) {
        return hash & -hash & (capacity - 1);
    }
}

Now a level up. A BlockIndex has a BlockSet for each relatively prime factor. When evaluating a query, it passes the query to each of its BlockSets, retrieving all blocks which probably match the query.


public class BlockSet<D extends Supplier<Set<T>> & IntSupplier, T, Q extends Set<T>> {

    private final Block[] blocks;
    private final Supplier<Block<D, T, Q>> newBlock;
    private final int blockingFactor;
    private final int estimatedCapacity;
    private final int prime;

    public BlockSet(Supplier<Block<D, T, Q>> newBlock, int blockingFactor, int estimatedCapacity, int prime) {
        assert Integer.bitCount(blockingFactor) == 1 && Integer.bitCount(estimatedCapacity) == 1;
        this.newBlock = newBlock;
        this.blocks = new Block[estimatedCapacity/blockingFactor];
        for (int i = 0; i < blocks.length; ++i) {
            blocks[i] = newBlock.get();
        }
        this.blockingFactor = blockingFactor;
        this.estimatedCapacity = estimatedCapacity;
        this.prime = prime;
    }

    public void add(D doc) {
        int blockIndex = blockIndex(doc.getAsInt());
        Block<D, T, Q> block = (Block<D, T, Q>)blocks[blockIndex];
        block.add(doc);
    }

    public BitSet query(Q query) {
        BitSet result = new BitSet();
        Arrays.stream(blocks)
              .filter(b -> b.matches(query))
              .forEach(b -> b.contribute(result));
        return result;
    }

    private int blockIndex(int value) {
        return ((value * prime) & (estimatedCapacity - 1)) / blockingFactor;
    }
}

With a BlockIndex as the tip of an iceberg – it just needs to intersect the bit sets of document IDs.


public class BlockIndex<D extends Supplier<Set<T>> & IntSupplier, T, Q extends Set<T>> {

    private final List<BlockSet<D, T, Q>> blockSets;

    public BlockIndex(List<BlockSet<D, T, Q>> blockSets) {
        this.blockSets = blockSets;
    }

    public IntStream query(Q query) {
        BitSet result = null;
        for (BlockSet<D, T, Q> blockSet : blockSets) {
            BitSet docIds = blockSet.query(query);
            if (null == result) {
                result = docIds;
            } else {
                result.and(docIds);
            }
        }
        return null == result ? IntStream.of() : result.stream();
    }
}

This code is obviously experimental, but a problem with it as it stands is memory consumption with the temporary bit sets. A better, but less Java 8+ compliant bit set is RoaringBitmap.

Blocked Signatures in BitFunnel

Blocked Signatures are a very old idea, naturally it is reported that there are a few innovations in the BitFunnel data structure. BitFunnel uses multiple levels with blocking factors, each of which must be a proper power of 2, rather than multiple factors coprime to estimated capacity at the same level. Each level has rank = log(blockingFactor). The effect in BitFunnel is having several levels of blocking density. Blocks from different levels can be intersected efficiently by transforming dense blocks to rank zero (the least dense representation) prior to intersection.

Bit-Sliced Signatures and Bloom Filters

While the inverted index is a familiar indexing structure for many with a casual interest in information retrieval, the concept of a document signature may not be. The idea seems to have a long history, having been popular until it lost favour in the late 90s, owing to file sizes and performance empirically inferior to inverted indices. However, there has been new research into the use of signatures for search problems in the last decade, for instance, TopSig, and they are an important building block of the BitFunnel data structure.

Information Retrieval with Signatures

An m bit signature of a document consists of a bit array of fixed length m containing the (not guaranteed to be disjoint) union of the signatures of the terms found in the document. The signature of a term t is the m bit array with each jth bit set where j = hash(k, t) mod m, for each of k hash functions. This is obviously a Bloom filter by another name. In a recent post I noted that a Bloom filter’s behaviour can be implemented efficiently in less than ten lines of Java code.

In an information retrieval setting, a document signature would be computed for each document in a corpus by tokenising each document and building a Bloom filter from the extracted terms. To retrieve the documents matching a given query, the query signature (which is the union of the signatures of the query’s terms) is computed. In what I will call the signature scan phase of the retrieval, the document signatures are iterated over; each document being included if its signature contains all the bits in the query signature. The result set will contain false positives but should also be quite small, so a cheap filter for correctness is executed after the signature scan is complete.

While the hash function and ordering of the documents could be controlled creatively to cut down the number of signatures to be inspected for a given query, this approach is obviously inefficient, despite the appeal of low level bit-wise instructions. What happens when there are several billion documents in a corpus?

Parallelism and Bit-Slicing

Threads and SIMD instructions could be thrown at the signature scan to speed it up, but one approach to parallelising the retrieval is to increase the number of documents represented by each processed word. If there are n documents with m bit signatures, the signature list is an n * m bit matrix. Its transpose, an m * n bit matrix, is referred to as a bit sliced signature. When querying the bit sliced signature, only the rows specified in the query need be accessed and intersected, and each word represents the presence of a term in up to 64 documents. This is a very old technique – the the earliest formulation of a variant of this data structure I could find, where it was referred to as superimposed coding, was published in 1984, but references implementations from the 60s. An accessible evaluation was published in 1994.

Java Implementation

An implementation of such a structure for an immutable corpus is trivial but informative. Typically terms will be strings but needn’t be, whereas documents can be reduced to a set of terms. Using ToIntFunction to abstract hash functions again, we just need to map all term-document tuples into the bits of a long[][]. When querying the structure, we need to map the query’s terms into a sorted sequence of integers, determining which rows of the bit matrix to access.

On the way in, rather than using the hash functions to compute the bit to set (this is constant for each document), the row index is calculated. For each term, for each hash function, the appropriate row index is calculated and the document’s bit is set in the corresponding array. Clean Java would do this outside of the constructor, of course.


public class BitSlicedSignature<D extends Supplier<Set<T>>, T, Q extends Set<T>> {

    private final long[][] bitMatrix;
    private final int width;
    private final int height;
    private final List<ToIntFunction<T>> hashFunctions;

    public BitSlicedSignature(List<D> documents,
                              List<ToIntFunction<T>> hashFunctions,
                              int height) {
        this.hashFunctions = hashFunctions;
        this.width = (documents.size() + 63) / 64;
        this.height = height;
        this.bitMatrix = new long[height][width];
        int docIndex = 0;
        for (D doc : documents) {
            int docWordIndex = docIndex >> 6;
            long docWord = 1L << docIndex;
            for (T term : doc.get()) {
                for (ToIntFunction<T> hash : hashFunctions) {
                    int row = mapHash(hash.applyAsInt(term));
                    bitMatrix[row][docWordIndex] |= docWord;
                }
            }
            ++docIndex;
        }
    }

    private int mapHash(int hash) {
        return Math.abs(hash % height);
    }
}

To query the structure, the query is mapped into row indices and the corresponding rows are intersected word by word, matching document IDs are emitted lazily as an IntStream. The appeal of doing this lazily is that we should expect there to be a lot of documents, this way the bit-wise intersections can be done in chunks as and when the caller wants more documents. This can be achieved with the help of two utility methods:


    public IntStream query(Q query) {
        int[] rows = query.stream()
                          .flatMapToInt(t -> hashFunctions.stream().mapToInt(h -> mapHash(h.applyAsInt(t))))
                          .distinct()
                          .toArray();
        return IntStream.range(0, width).flatMap(i -> bitsOf(intersection(rows, i), i));
    }

    private long intersection(int[] rows, int offset) {
        long word = -1L;
        for (int i = 0; i < rows.length && word != 0; ++i) {
            word &= bitMatrix[rows[i]][offset];
        }
        return word;
    }

    private static IntStream bitsOf(long word, int offset) {
        return IntStream.range(0, Long.SIZE)
                        .filter(i -> (1L << i & word) != 0)
                        .map(i -> Long.SIZE * offset + i);
    }

As you can probably see, you can leave vast swathes of the long[][] untouched, assuming the query is for a small number of terms. A more sophisticated implementation might partition the documents into several bit matrices.

Shortcomings

There are some obvious problems with this data structure. Firstly, a long[][] uses the same amount of memory whether its bits are set or not. What happens when you have some small documents and lots of terms? You have a column in the bit matrix where most of the bits are zero – it’s likely that a compressed bit set would be preferable. Similarly with very common terms you will have long horizontal runs of set bits.

Even worse, what happens when a term is very rare? If you are providing a search service, it’s likely you only ever need to provide a page of results at a time. If the term is rare enough, you may need to scan the entire row to fill a page, which could take a long time. To get around that, BitFunnel uses bit-sliced block signatures, which I will write about in the next post.

Building a Bloom Filter from Scratch

The Bloom filter is an interesting data structure for modelling approximate sets of objects. It can tell you with certainty that an object is not in a collection, but may give false positives. If you write Java for a living, I would not suggest you implement your own, because there is a perfectly good implementation in Guava. It can be illuminating to write one from scratch, however.

Interface

You can do two things with a Bloom filter: put things in it, and check if the filter probably contains items. This leads to the following interface:


public class BloomFilter<T> {
    void add(T value);

    boolean mightContain(T value);
}

Java Bloom Filter Implementation

A Bloom filter represents a set of objects quite succinctly as a bit array of finite length. Unlike a bitset, the bit array stores hashes of the objects, rather than object identities. In practice, you need multiple hash functions, which, modulo the capacity of the bit array, will collide only with low probability. The more hashes you have, the slower insertions and look ups will be. There is also a space trade off for improved accuracy: the larger the array, the less likely collisions of hashes modulo the capacity will be.

Since you probably want to be able to control the hash functions you use, the interface ToIntFunction fits in nicely as the perfect abstraction. You can set this up simply with a builder.


    public static <T> Builder<T> builder() {
        return new Builder<>();
    }

    public static class Builder<T> {
        private int size;
        private List<ToIntFunction<T>> hashFunctions;

        public Builder<T> withSize(int size) {
            this.size = size;
            return this;
        }

        public Builder<T> withHashFunctions(List<ToIntFunction<T>> hashFunctions) {
            this.hashFunctions = hashFunctions;
            return this;
        }

        public BloomFilter<T> build() {
            return new BloomFilter<>(new long[(size + 63) / 64], size, hashFunctions);
        }
    }

    private final long[] array;
    private final int size;
    private final List<ToIntFunction<T>> hashFunctions;

    public BloomFilter(long[] array, int logicalSize, List<ToIntFunction<T>> hashFunctions) {
        this.array = array;
        this.size = logicalSize;
        this.hashFunctions = hashFunctions;
    }

    private int mapHash(int hash) {
        return Math.abs(hash % size);
    }

Adding Values

To add a value, you need to take an object, and for each hash function hash it modulo the capacity of the bit array. Using a long[] as the substrate of the bit set you must locate the word each hash belongs to, and set the bit corresponding to the remainder of the hash modulo 64. You can do this quickly as follows:


   public void add(T value) {
        for (ToIntFunction<T> function : hashFunctions) {
            int hash = mapHash(function.applyAsInt(value));
            array[hash >> 6] |= 1L << hash;
        }
    }

Note that each bit may already be set.

Querying the bit set

To check if an object is contained in the bitset, you need to evaluate the hashes again, and find the corresponding words. You can return false if the intersection of the appropriate word and the bit corresponding to the remainder modulo 64 is zero. That looks like this:


    public boolean mightContain(T value) {
        for (ToIntFunction<T> function : hashFunctions) {
            int hash = mapHash(function.applyAsInt(value));
            if ((array[hash >> 6] & (1L << hash)) == 0) {
                return false;
            }
        }
        return true;
    }

Note that this absolutely does not mean the object is contained within the set, but you can get more precise results if you are willing to perform more hash evaluations, and if you are willing to use a larger bit array. Modeling the precise false positive rate is not clear cut.

BitFunnel

Just as is the case for bitmap indices, bit slicing is a useful enhancement for Bloom filters, forming the basis of BitFunnel. In a follow up post I will share a simplified implementation of a bit sliced Bloom filter.

Eliminating Bounds Checks with Intrinsics in Java 9

Open source projects are starting to capitalise on fast new intrinsics available in Java 9. For instance, Lucene is already experimenting with Arrays.mismatch, which I wrote about a few months ago, and Objects.checkIndex, which I didn’t know existed. A quick google search returned nothing on this method except the javadoc and a Baeldung blog post, which reiterates the javadoc. The intent behind these methods is obvious, and their signatures are simple enough to use blind, but why bother? What do you get, and when do you get it? This post analyses the range and index check methods to illustrate why it would be worthwhile for a project like Lucene, known for its performance, potentially supporting two versions of the same code. Edit: Robert Muir from Lucene points out in the comments here that his primary motivation for using these methods is actually consistency and correctness, rather than performance.

Objects.checkIndex, Arrays, and RangeCheckElimination

Suppose you have a double[] and want to perform a summation on the prefix of the array, expressed by a method parameter. You need to check that the parameter actually specifies a prefix of the array, but there also need to be bounds checks accessing the array. You might write code like this:


    @Benchmark
    public double RangePrefixSum_ManualCheck(BoundsCheckState state) {
        double[] data = state.data;
        int limit = state.index;
        if (limit < 0 || limit >= data.length) {
            throw new ArrayIndexOutOfBoundsException(limit + " >= " + data.length);
        }
        double result = 0D;
        for (int i = 0; i <= limit; ++i) {
            result += data[i];
        }
        return result;
    }

Or you could use Objects.checkIndex which takes an index and a length and will throw an ArrayIndexOutOfBoundsException if the index is not less than the length. The idea behind using it is that the JIT compiler can fuse this bounds check with any checks after the call, resulting in their elimination. You can see that this intrinsic functions as a “hint” rather than a handle to an assembly snippet in the source code. Using the new API you could write:


    @Benchmark
    public double PrefixRangeSum_CheckIndex(BoundsCheckState state) {
        double[] data = state.data;
        int limit = Objects.checkIndex(state.index, data.length);
        double result = 0D;
        for (int i = 0; i <= limit; ++i) {
            result += data[i];
        }
        return result;
    }

But does it make any difference? Or will both loops profit from RangeCheckElimination anyway, rendering any difference moot? As a point of comparison, I benchmark against the same code without any kind of precondition check so we can isolate the cost of performing the actual check.


    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public double RangePrefixSum_NoCheck(BoundsCheckState state) {
        double[] data = state.data;
        int limit = state.index;
        double result = 0D;
        for (int i = 0; i <= limit; ++i) {
            result += data[i];
        }
        return result;
    }

Running the benchmark, it turns out there is virtually no difference for shorter arrays, but there may be a small benefit when the arrays get longer. The assembly emitted is identical for each loop in any case. In short, bounds checks are probably being optimised away anyway; it is RangeCheckElimination, not Objects.checkIndex, taking effect here.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
PrefixRangeSum_CheckIndex thrpt 1 10 8040.641837 371.264782 ops/ms 100
PrefixRangeSum_CheckIndex thrpt 1 10 716.093965 35.701464 ops/ms 1000
PrefixRangeSum_CheckIndex thrpt 1 10 73.321275 2.693158 ops/ms 10000
RangePrefixSum_ManualCheck thrpt 1 10 10433.584914 429.149725 ops/ms 100
RangePrefixSum_ManualCheck thrpt 1 10 676.266935 43.792141 ops/ms 1000
RangePrefixSum_ManualCheck thrpt 1 10 69.937342 5.215848 ops/ms 10000
RangePrefixSum_NoCheck thrpt 1 10 8754.536246 548.945401 ops/ms 100
RangePrefixSum_NoCheck thrpt 1 10 687.851103 37.487361 ops/ms 1000
RangePrefixSum_NoCheck thrpt 1 10 69.145729 2.783780 ops/ms 10000

For reference, here is the assembly used by the “naive” code:

com/openkappa/simd/boundscheck/CheckIndex.RangePrefixSum_ManualCheck(Lcom/openkappa/simd/boundscheck/BoundsCheckState;)D  [0x00000293bb8abe20, 0x00000293bb8abf38]  280 bytes
Argument 0 is unknown.RIP: 0x293bb8abe20 Code size: 0x00000118
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x00000293d0b32228} 'RangePrefixSum_ManualCheck' '(Lcom/openkappa/simd/boundscheck/BoundsCheckState;)D' in 'com/openkappa/simd/boundscheck/CheckIndex'
  0x00000293bb8abe20: int3                      ;...cc

  0x00000293bb8abe21: nop     word ptr [rax+rax+0h]  ;...66
                                                ;...66
                                                ;...66
                                                ;...0f
                                                ;...1f
                                                ;...84
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;...00

  0x00000293bb8abe2c: nop                       ;...66
                                                ;...66
                                                ;...66
                                                ;...90

  0x00000293bb8abe30: mov     dword ptr [rsp+0ffffffffffff9000h],eax
                                                ;...89
                                                ;...84
                                                ;...24
                                                ;...00
                                                ;...90
                                                ;...ff
                                                ;...ff

  0x00000293bb8abe37: push    rbp               ;...55

  0x00000293bb8abe38: sub     rsp,50h           ;...48
                                                ;...83
                                                ;...ec
                                                ;...50

  0x00000293bb8abe3c: mov     r14,qword ptr [rdx+20h]  ;...4c
                                                ;...8b
                                                ;...72
                                                ;...20

  0x00000293bb8abe40: mov     ebx,dword ptr [rdx+18h]  ;...8b
                                                ;...5a
                                                ;...18

  0x00000293bb8abe43: mov     r13d,dword ptr [rdx]  ;...44
                                                ;...8b
                                                ;...2a

  0x00000293bb8abe46: vmovsd  xmm0,qword ptr [rdx+8h]  ;...c5
                                                ;...fb
                                                ;...10
                                                ;...42
                                                ;...08

  0x00000293bb8abe4b: vmovq   rbp,xmm0          ;...c4
                                                ;...e1
                                                ;...f9
                                                ;...7e
                                                ;...c5

  0x00000293bb8abe50: mov     rcx,rdx           ;...48
                                                ;...8b
                                                ;...ca

  0x00000293bb8abe53: mov     r10,5b517c90h     ;...49
                                                ;...ba
                                                ;...90
                                                ;...7c
                                                ;...51
                                                ;...5b
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;...00

  0x00000293bb8abe5d: call indirect r10         ;...41
                                                ;...ff
                                                ;...d2

  0x00000293bb8abe60: test    r14,r14           ;...4d
                                                ;...85
                                                ;...f6

  0x00000293bb8abe63: je      293bb8abecdh      ;...74
                                                ;...68

  0x00000293bb8abe65: mov     r11d,dword ptr [r14+8h]  ;...45
                                                ;...8b
                                                ;...5e
                                                ;...08

  0x00000293bb8abe69: cmp     r11d,0f80000b9h   ;...41
                                                ;...81
                                                ;...fb
                                                ;...b9
                                                ;...00
                                                ;...00
                                                ;...f8
                                                ;   {metadata({type array double})}
  0x00000293bb8abe70: jne     293bb8abef1h      ;...0f
                                                ;...85
                                                ;...7b
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;*iload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::RangePrefixSum_ManualCheck@42 (line 22)

  0x00000293bb8abe76: cmp     r13d,ebx          ;...44
                                                ;...3b
                                                ;...eb

  0x00000293bb8abe79: jnle    293bb8abec6h      ;...7f
                                                ;...4b
                                                ;*if_icmpgt {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::RangePrefixSum_ManualCheck@45 (line 22)

  0x00000293bb8abe7b: mov     r11d,dword ptr [r14+0ch]
                                                ;...45
                                                ;...8b
                                                ;...5e
                                                ;...0c
                                                ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::RangePrefixSum_ManualCheck@53 (line 23)
                                                ; implicit exception: dispatches to 0x00000293bb8abf0d
  0x00000293bb8abe7f: cmp     r13d,r11d         ;...45
                                                ;...3b
                                                ;...eb

  0x00000293bb8abe82: jnb     293bb8abed7h      ;...73
                                                ;...53

  0x00000293bb8abe84: vmovq   xmm0,rbp          ;...c4
                                                ;...e1
                                                ;...f9
                                                ;...6e
                                                ;...c5

  0x00000293bb8abe89: vaddsd  xmm0,xmm0,mmword ptr [r14+r13*8+10h]
                                                ;...c4
                                                ;...81
                                                ;...7b
                                                ;...58
                                                ;...44
                                                ;...ee
                                                ;...10
                                                ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::RangePrefixSum_ManualCheck@54 (line 23)

  0x00000293bb8abe90: inc     r13d              ;...41
                                                ;...ff
                                                ;...c5
                                                ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::RangePrefixSum_ManualCheck@57 (line 22)

  0x00000293bb8abe93: cmp     r13d,ebx          ;...44
                                                ;...3b
                                                ;...eb

  0x00000293bb8abe96: jnle    293bb8abebah      ;...7f
                                                ;...22
                                                ;*if_icmpgt {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::RangePrefixSum_ManualCheck@45 (line 22)

  0x00000293bb8abe98: nop     dword ptr [rax+rax+0h]  ;...0f
                                                ;...1f
                                                ;...84
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::RangePrefixSum_ManualCheck@53 (line 23)

  0x00000293bb8abea0: cmp     r13d,r11d         ;...45
                                                ;...3b
                                                ;...eb

  0x00000293bb8abea3: jnb     293bb8abed2h      ;...73
                                                ;...2d

  0x00000293bb8abea5: vaddsd  xmm0,xmm0,mmword ptr [r14+r13*8+10h]
                                                ;...c4
                                                ;...81
                                                ;...7b
                                                ;...58
                                                ;...44
                                                ;...ee
                                                ;...10
                                                ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::RangePrefixSum_ManualCheck@54 (line 23)

  0x00000293bb8abeac: inc     r13d              ;...41
                                                ;...ff
                                                ;...c5
                                                ; ImmutableOopMap{r14=Oop }
                                                ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::RangePrefixSum_ManualCheck@60 (line 22)

  0x00000293bb8abeaf: test    dword ptr [293a7e50000h],eax
                                                ;...85
                                                ;...05
                                                ;...4b
                                                ;...41
                                                ;...5a
                                                ;...ec
                                                ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::RangePrefixSum_ManualCheck@60 (line 22)
                                                ;   {poll}
  0x00000293bb8abeb5: cmp     r13d,ebx          ;...44
                                                ;...3b
                                                ;...eb

  0x00000293bb8abeb8: jle     293bb8abea0h      ;...7e
                                                ;...e6
                                                ;*iload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::RangePrefixSum_ManualCheck@42 (line 22)

  0x00000293bb8abeba: add     rsp,50h           ;...48
                                                ;...83
                                                ;...c4
                                                ;...50

  0x00000293bb8abebe: pop     rbp               ;...5d

  0x00000293bb8abebf: test    dword ptr [293a7e50000h],eax
                                                ;...85
                                                ;...05
                                                ;...3b
                                                ;...41
                                                ;...5a
                                                ;...ec
                                                ;   {poll_return}
  0x00000293bb8abec5: ret                       ;...c3

And using Objects.checkIndex we see essentially the same thing.

com/openkappa/simd/boundscheck/CheckIndex.PrefixRangeSum_CheckIndex(Lcom/openkappa/simd/boundscheck/BoundsCheckState;)D  [0x0000028c1a51d4a0, 0x0000028c1a51d5f8]  344 bytes
Argument 0 is unknown.RIP: 0x28c1a51d4a0 Code size: 0x00000158
[Entry Point]
[Constants]
  # {method} {0x0000028c2f7a1dd0} 'PrefixRangeSum_CheckIndex' '(Lcom/openkappa/simd/boundscheck/BoundsCheckState;)D' in 'com/openkappa/simd/boundscheck/CheckIndex'
  # this:     rdx:rdx   = 'com/openkappa/simd/boundscheck/CheckIndex'
  # parm0:    r8:r8     = 'com/openkappa/simd/boundscheck/BoundsCheckState'
  #           [sp+0x30]  (sp of caller)
  0x0000028c1a51d4a0: mov     r10d,dword ptr [rdx+8h]  ;...44
                                                ;...8b
                                                ;...52
                                                ;...08

  0x0000028c1a51d4a4: shl     r10,3h            ;...49
                                                ;...c1
                                                ;...e2
                                                ;...03

  0x0000028c1a51d4a8: cmp     rax,r10           ;...49
                                                ;...3b
                                                ;...c2

  0x0000028c1a51d4ab: jne     28c12a7c200h      ;...0f
                                                ;...85
                                                ;...4f
                                                ;...ed
                                                ;...55
                                                ;...f8
                                                ;   {runtime_call ic_miss_stub}
  0x0000028c1a51d4b1: nop                       ;...66
                                                ;...66
                                                ;...90

  0x0000028c1a51d4b4: nop     dword ptr [rax+rax+0h]  ;...0f
                                                ;...1f
                                                ;...84
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;...00

  0x0000028c1a51d4bc: nop                       ;...66
                                                ;...66
                                                ;...66
                                                ;...90

[Verified Entry Point]
  0x0000028c1a51d4c0: mov     dword ptr [rsp+0ffffffffffff9000h],eax
                                                ;...89
                                                ;...84
                                                ;...24
                                                ;...00
                                                ;...90
                                                ;...ff
                                                ;...ff

  0x0000028c1a51d4c7: push    rbp               ;...55

  0x0000028c1a51d4c8: sub     rsp,20h           ;...48
                                                ;...83
                                                ;...ec
                                                ;...20
                                                ;*synchronization entry
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@-1 (line 43)

  0x0000028c1a51d4cc: mov     r10d,dword ptr [r8+14h]  ;...45
                                                ;...8b
                                                ;...50
                                                ;...14
                                                ;*getfield data {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@1 (line 43)
                                                ; implicit exception: dispatches to 0x0000028c1a51d5bd
  0x0000028c1a51d4d0: mov     r11d,dword ptr [r12+r10*8+0ch]
                                                ;...47
                                                ;...8b
                                                ;...5c
                                                ;...d4
                                                ;...0c
                                                ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@10 (line 44)
                                                ; implicit exception: dispatches to 0x0000028c1a51d5c9
  0x0000028c1a51d4d5: mov     ebp,dword ptr [r8+10h]  ;...41
                                                ;...8b
                                                ;...68
                                                ;...10
                                                ;*getfield index {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@6 (line 44)

  0x0000028c1a51d4d9: cmp     ebp,r11d          ;...41
                                                ;...3b
                                                ;...eb

  0x0000028c1a51d4dc: jnb     28c1a51d587h      ;...0f
                                                ;...83
                                                ;...a5
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;*invokestatic checkIndex {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.util.Objects::checkIndex@3 (line 372)
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@11 (line 44)

  0x0000028c1a51d4e2: test    r11d,r11d         ;...45
                                                ;...85
                                                ;...db

  0x0000028c1a51d4e5: jbe     28c1a51d59dh      ;...0f
                                                ;...86
                                                ;...b2
                                                ;...00
                                                ;...00
                                                ;...00

  0x0000028c1a51d4eb: movsxd  r11,r11d          ;...4d
                                                ;...63
                                                ;...db

  0x0000028c1a51d4ee: mov     r8d,ebp           ;...44
                                                ;...8b
                                                ;...c5

  0x0000028c1a51d4f1: inc     r8d               ;...41
                                                ;...ff
                                                ;...c0

  0x0000028c1a51d4f4: movsxd  r9,r8d            ;...4d
                                                ;...63
                                                ;...c8

  0x0000028c1a51d4f7: dec     r9                ;...49
                                                ;...ff
                                                ;...c9

  0x0000028c1a51d4fa: cmp     r9,r11            ;...4d
                                                ;...3b
                                                ;...cb

  0x0000028c1a51d4fd: jnb     28c1a51d59dh      ;...0f
                                                ;...83
                                                ;...9a
                                                ;...00
                                                ;...00
                                                ;...00

  0x0000028c1a51d503: cmp     ebp,7ffffffeh     ;...81
                                                ;...fd
                                                ;...fe
                                                ;...ff
                                                ;...ff
                                                ;...7f

  0x0000028c1a51d509: jnle    28c1a51d5adh      ;...0f
                                                ;...8f
                                                ;...9e
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;*dload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@27 (line 47)

  0x0000028c1a51d50f: add     ebp,0fffffffeh    ;...83
                                                ;...c5
                                                ;...fe

  0x0000028c1a51d512: lea     r11,[r12+r10*8]   ;...4f
                                                ;...8d
                                                ;...1c
                                                ;...d4

  0x0000028c1a51d516: vxorpd  xmm0,xmm0,xmm0    ;...c5
                                                ;...f9
                                                ;...57
                                                ;...c0

  0x0000028c1a51d51a: vaddsd  xmm0,xmm0,mmword ptr [r12+r10*8+10h]
                                                ;...c4
                                                ;...81
                                                ;...7b
                                                ;...58
                                                ;...44
                                                ;...d4
                                                ;...10
                                                ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@33 (line 47)

  0x0000028c1a51d521: mov     r9d,80000000h     ;...41
                                                ;...b9
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;...80

  0x0000028c1a51d527: cmp     r8d,ebp           ;...44
                                                ;...3b
                                                ;...c5

  0x0000028c1a51d52a: cmovl   ebp,r9d           ;...41
                                                ;...0f
                                                ;...4c
                                                ;...e9

  0x0000028c1a51d52e: mov     r9d,1h            ;...41
                                                ;...b9
                                                ;...01
                                                ;...00
                                                ;...00
                                                ;...00

  0x0000028c1a51d534: cmp     ebp,1h            ;...83
                                                ;...fd
                                                ;...01

  0x0000028c1a51d537: jle     28c1a51d565h      ;...7e
                                                ;...2c

  0x0000028c1a51d539: nop     dword ptr [rax+0h]  ;...0f
                                                ;...1f
                                                ;...80
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;...00
                                                ;*dload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@27 (line 47)

  0x0000028c1a51d540: vaddsd  xmm0,xmm0,mmword ptr [r11+r9*8+10h]
                                                ;...c4
                                                ;...81
                                                ;...7b
                                                ;...58
                                                ;...44
                                                ;...cb
                                                ;...10

  0x0000028c1a51d547: vaddsd  xmm0,xmm0,mmword ptr [r11+r9*8+18h]
                                                ;...c4
                                                ;...81
                                                ;...7b
                                                ;...58
                                                ;...44
                                                ;...cb
                                                ;...18

  0x0000028c1a51d54e: vaddsd  xmm0,xmm0,mmword ptr [r11+r9*8+20h]
                                                ;...c4
                                                ;...81
                                                ;...7b
                                                ;...58
                                                ;...44
                                                ;...cb
                                                ;...20

  0x0000028c1a51d555: vaddsd  xmm0,xmm0,mmword ptr [r11+r9*8+28h]
                                                ;...c4
                                                ;...81
                                                ;...7b
                                                ;...58
                                                ;...44
                                                ;...cb
                                                ;...28
                                                ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@33 (line 47)

  0x0000028c1a51d55c: add     r9d,4h            ;...41
                                                ;...83
                                                ;...c1
                                                ;...04
                                                ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@36 (line 46)

  0x0000028c1a51d560: cmp     r9d,ebp           ;...44
                                                ;...3b
                                                ;...cd

  0x0000028c1a51d563: jl      28c1a51d540h      ;...7c
                                                ;...db
                                                ;*if_icmpgt {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@24 (line 46)

  0x0000028c1a51d565: cmp     r9d,r8d           ;...45
                                                ;...3b
                                                ;...c8

  0x0000028c1a51d568: jnl     28c1a51d57bh      ;...7d
                                                ;...11

  0x0000028c1a51d56a: nop                       ;...66
                                                ;...90
                                                ;*dload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@27 (line 47)

  0x0000028c1a51d56c: vaddsd  xmm0,xmm0,mmword ptr [r11+r9*8+10h]
                                                ;...c4
                                                ;...81
                                                ;...7b
                                                ;...58
                                                ;...44
                                                ;...cb
                                                ;...10
                                                ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@33 (line 47)

  0x0000028c1a51d573: inc     r9d               ;...41
                                                ;...ff
                                                ;...c1
                                                ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@36 (line 46)

  0x0000028c1a51d576: cmp     r9d,r8d           ;...45
                                                ;...3b
                                                ;...c8

  0x0000028c1a51d579: jl      28c1a51d56ch      ;...7c
                                                ;...f1
                                                ;*if_icmpgt {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@24 (line 46)

  0x0000028c1a51d57b: add     rsp,20h           ;...48
                                                ;...83
                                                ;...c4
                                                ;...20

  0x0000028c1a51d57f: pop     rbp               ;...5d

  0x0000028c1a51d580: test    dword ptr [28c08310000h],eax
                                                ;...85
                                                ;...05
                                                ;...7a
                                                ;...2a
                                                ;...df
                                                ;...ed
                                                ;   {poll_return}
  0x0000028c1a51d586: ret                       ;...c3

  0x0000028c1a51d587: mov     edx,0ffffffe4h    ;...ba
                                                ;...e4
                                                ;...ff
                                                ;...ff
                                                ;...ff

  0x0000028c1a51d58c: mov     dword ptr [rsp],r10d  ;...44
                                                ;...89
                                                ;...14
                                                ;...24

  0x0000028c1a51d590: mov     dword ptr [rsp+4h],r11d  ;...44
                                                ;...89
                                                ;...5c
                                                ;...24
                                                ;...04

  0x0000028c1a51d595: nop                       ;...66
                                                ;...90

  0x0000028c1a51d597: call    28c12a7de80h      ;...e8
                                                ;...e4
                                                ;...08
                                                ;...56
                                                ;...f8
                                                ; ImmutableOopMap{[0]=NarrowOop }
                                                ;*invokestatic checkIndex {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.util.Objects::checkIndex@3 (line 372)
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@11 (line 44)
                                                ;   {runtime_call UncommonTrapBlob}
  0x0000028c1a51d59c: int3                      ;...cc
                                                ;*invokestatic checkIndex {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.util.Objects::checkIndex@3 (line 372)
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@11 (line 44)

  0x0000028c1a51d59d: mov     edx,0ffffff86h    ;...ba
                                                ;...86
                                                ;...ff
                                                ;...ff
                                                ;...ff

  0x0000028c1a51d5a2: mov     dword ptr [rsp],r10d  ;...44
                                                ;...89
                                                ;...14
                                                ;...24

  0x0000028c1a51d5a6: nop                       ;...90

  0x0000028c1a51d5a7: call    28c12a7de80h      ;...e8
                                                ;...d4
                                                ;...08
                                                ;...56
                                                ;...f8
                                                ; ImmutableOopMap{[0]=NarrowOop }
                                                ;*dload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@27 (line 47)
                                                ;   {runtime_call UncommonTrapBlob}
  0x0000028c1a51d5ac: int3                      ;...cc

  0x0000028c1a51d5ad: mov     edx,0ffffff7eh    ;...ba
                                                ;...7e
                                                ;...ff
                                                ;...ff
                                                ;...ff

  0x0000028c1a51d5b2: mov     dword ptr [rsp],r10d  ;...44
                                                ;...89
                                                ;...14
                                                ;...24

  0x0000028c1a51d5b6: nop                       ;...90

  0x0000028c1a51d5b7: call    28c12a7de80h      ;...e8
                                                ;...c4
                                                ;...08
                                                ;...56
                                                ;...f8
                                                ; ImmutableOopMap{[0]=NarrowOop }
                                                ;*dload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@27 (line 47)
                                                ;   {runtime_call UncommonTrapBlob}
  0x0000028c1a51d5bc: int3                      ;...cc
                                                ;*dload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@27 (line 47)

  0x0000028c1a51d5bd: mov     edx,0fffffff6h    ;...ba
                                                ;...f6
                                                ;...ff
                                                ;...ff
                                                ;...ff

  0x0000028c1a51d5c2: nop                       ;...90

  0x0000028c1a51d5c3: call    28c12a7de80h      ;...e8
                                                ;...b8
                                                ;...08
                                                ;...56
                                                ;...f8
                                                ; ImmutableOopMap{}
                                                ;*getfield data {reexecute=1 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@1 (line 43)
                                                ;   {runtime_call UncommonTrapBlob}
  0x0000028c1a51d5c8: int3                      ;...cc
                                                ;*getfield data {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@1 (line 43)

  0x0000028c1a51d5c9: mov     edx,0fffffff6h    ;...ba
                                                ;...f6
                                                ;...ff
                                                ;...ff
                                                ;...ff

  0x0000028c1a51d5ce: nop                       ;...90

  0x0000028c1a51d5cf: call    28c12a7de80h      ;...e8
                                                ;...ac
                                                ;...08
                                                ;...56
                                                ;...f8
                                                ; ImmutableOopMap{}
                                                ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@10 (line 44)
                                                ;   {runtime_call UncommonTrapBlob}
  0x0000028c1a51d5d4: int3                      ;...cc
                                                ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.simd.boundscheck.CheckIndex::PrefixRangeSum_CheckIndex@10 (line 44)

  0x0000028c1a51d5d5: hlt                       ;...f4

  0x0000028c1a51d5d6: hlt                       ;...f4

  0x0000028c1a51d5d7: hlt                       ;...f4

  0x0000028c1a51d5d8: hlt                       ;...f4

  0x0000028c1a51d5d9: hlt                       ;...f4

  0x0000028c1a51d5da: hlt                       ;...f4

  0x0000028c1a51d5db: hlt                       ;...f4

  0x0000028c1a51d5dc: hlt                       ;...f4

  0x0000028c1a51d5dd: hlt                       ;...f4

  0x0000028c1a51d5de: hlt                       ;...f4

  0x0000028c1a51d5df: hlt                       ;...f4

Point Access Range Check Elimination

If there’s no gain rolling up an array, because the optimisations applied to loops over arrays are so good, what if you need to provide random access to some kind of collection? You need to perform checks. We can put it to the test by comparing these two pieces of code:


    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public double PointAccess_ManualCheck(BoundsCheckState state) {
        double[] data = state.data;
        int index = state.index;
        if (index < 0 || index >= data.length) {
            throw new ArrayIndexOutOfBoundsException(index + " >= " + data.length);
        }
        return data[index];
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public double PointAccess_CheckIndex(BoundsCheckState state) {
        double[] data = state.data;
        int index = Objects.checkIndex(state.index, data.length);
        return data[index];
    }

Again, disappointingly, there’s virtually no difference, and this time it’s in the naive approach’s favour.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
PointAccess_CheckIndex thrpt 1 10 79259.682293 10737.451258 ops/ms 100
PointAccess_CheckIndex thrpt 1 10 59719.861827 15039.268226 ops/ms 1000
PointAccess_CheckIndex thrpt 1 10 25541.582169 18540.720843 ops/ms 10000
PointAccess_ManualCheck thrpt 1 10 81280.935102 7639.676934 ops/ms 100
PointAccess_ManualCheck thrpt 1 10 61888.048023 19194.480590 ops/ms 1000
PointAccess_ManualCheck thrpt 1 10 27379.437571 17218.454466 ops/ms 10000

The nice thing about not using the intrinsic is you can control the error message, and while I’m hesitant to make negative proclamations (perhaps my benchmark code is flawed), I am in no rush to use this method.

Tricking Java into Adding Up Arrays Faster

There is currently what I like to think of as a “crop circle” in Java 9: you can make hash codes faster by manually unrolling some multiplications, which is tracked by an OpenJDK ticket. This one is even weirder. Imagine you have an int[] and want to compute the sum of its elements. You could do exactly that, or, supposing your values are small enough not to overflow, you can make your loop much faster by multiplying each element by 2 inside the loop, and dividing the result by 2 at the end. This is because autovectorisation, with strength reductions to shifts and additions to boot, kicks in for loops that look like a dot product, whereas summations of arrays don’t seem to be optimised at all – see this post for an analysis. Don’t believe me? Run the code at github.


    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int NaiveSum(IntData state) {
        int value = 0;
        int[] data = state.data1;
        for (int i = 0; i < data.length; ++i) {
            value += data[i];
        }
        return value;
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int CropCircle(IntData state) {
        int value = 0;
        int[] data = state.data1;
        for (int i = 0; i < data.length; ++i) {
            value += 2 * data[i];
        }
        return value / 2;
    }

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
CropCircle thrpt 1 10 29.922687 0.383028 ops/ms 100000
CropCircle thrpt 1 10 2.065812 0.120089 ops/ms 1000000
NaiveSum thrpt 1 10 26.241689 0.660850 ops/ms 100000
NaiveSum thrpt 1 10 1.868644 0.244081 ops/ms 1000000