Beware Collection Factory Methods

I saw an interesting tweet referencing a GitHub issue where including an (in my view) unnecessary implementation of the List interface changed inlining decisions, causing a 20x degradation in throughput. Guava’s ImmutableList is my favourite class to seek and destroy because of the way it is often used – it tends to be associated with unnecessary copying where encapsulation would be a better solution. I had assumed the performance gains won by finding and deleting all the instances of ImmutableList were thanks to relieving the garbage collector from medieval torture. In fact, the degradation observed in the benchmark was caused by using ImmutableList, along with all its subclasses, alongside ArrayList: this makes call sites on List bimorphic at best, and the JIT compiler generates slower code as a result. I may have inadvertently profited from better inlining in the past simply by removing as many ImmutableLists as possible!

This post doesn’t go into any details about the various mechanisms of method dispatch, and if you want to understand the impact of polymorphism on inlining, bookmark Aleksey Shipilev’s authoritative post and read it when you have some time to really concentrate.

Without resorting to LinkedList, is it possible to contrive cases in Java 9 where performance is severely degraded by use of the Collections.unmodifiableList and List.of factory methods? Like ArrayList, these produce random access data structures, so any difference should highlight the potential performance gains inlining can deliver.

This section and the benchmarks were amended to remove a small bias caused by calls to ThreadLocalRandom.current().nextInt() after a suggestion from Lukas Eder.

The methodology is very simple: I randomly vary the List implementation and plug it into the same algorithm. It is cruder than what you would see in Aleksey Shipilev’s post because I target only the worst case, choosing between implementations uniformly at random. Aleksey demonstrates that inlining decisions are statistical and opportunistic (the JIT can guess and later deoptimise), so if 90% of the calls at a site dispatch to the same implementation, it doesn’t matter as much as when the choice is uniform. It will vary from application to application, but it could easily be as bad as the case I present if List is used polymorphically.

I created five benchmarks which produce the same number, the same way. Three of these benchmarks only ever call into a single implementation of List, so their call sites are monomorphic; to avoid bias, their results are XOR’d with a call to ThreadLocalRandom.current().nextInt(3), because the randomised benchmarks must pay for a call to nextInt on every invocation. One benchmark chooses randomly between the ArrayList and the List.of list on each invocation, making the call site bimorphic, and one chooses randomly between all three implementations, making it megamorphic. The difference is stark. You can really screw up performance by making the methods on List megamorphic.

Benchmark                 Mode   Threads  Samples  Score      Score Error (99.9%)  Unit
sumLength_ArrayList       thrpt  1        10       55.785270  3.218552             ops/us
sumLength_Factory         thrpt  1        10       58.565918  2.852415             ops/us
sumLength_Random2         thrpt  1        10       35.842255  0.684658             ops/us
sumLength_Random3         thrpt  1        10       11.177564  0.080164             ops/us
sumLength_Unmodifiable    thrpt  1        10       51.776108  3.751297             ops/us


import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class MegamorphicList {

    private List<String>[] strings;

    @Setup(Level.Trial)
    public void init() {
        // three List implementations with equivalent contents:
        // ArrayList, the Java 9 List.of list, and an unmodifiable wrapper
        strings = new List[]{getArrayList(6), getFactoryList6(), getUnModifiableList(6)};
    }

    @Benchmark
    public int sumLength_ArrayList() {
        // monomorphic: this call site only ever sees ArrayList
        List<String> list = strings[0];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole ^ ThreadLocalRandom.current().nextInt(3);
    }

    @Benchmark
    public int sumLength_Factory() {
        // monomorphic: this call site only ever sees the List.of implementation
        List<String> list = strings[1];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole ^ ThreadLocalRandom.current().nextInt(3);
    }

    @Benchmark
    public int sumLength_Unmodifiable() {
        // monomorphic: this call site only ever sees the unmodifiable wrapper
        List<String> list = strings[2];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole ^ ThreadLocalRandom.current().nextInt(3);
    }

    @Benchmark
    public int sumLength_Random2() {
        // bimorphic: chooses at random between the ArrayList and the factory list
        List<String> list = strings[ThreadLocalRandom.current().nextInt(2)];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole;
    }

    @Benchmark
    public int sumLength_Random3() {
        // megamorphic: chooses at random between all three implementations
        List<String> list = strings[ThreadLocalRandom.current().nextInt(3)];
        int blackhole = 0;
        for (int i = 0; i < list.size(); ++i) {
            blackhole += list.get(i).length();
        }
        return blackhole;
    }

    private List<String> getUnModifiableList(int size) {
        return Collections.unmodifiableList(getArrayList(size));
    }

    private List<String> getFactoryList6() {
        return List.of(randomString(),
                       randomString(),
                       randomString(),
                       randomString(),
                       randomString(),
                       randomString()
                );
    }

    private List<String> getArrayList(int size) {
        List<String> list = new ArrayList<>();
        for (int i = 0; i < size; ++i) {
            list.add(randomString());
        }
        return list;
    }

    private String randomString() {
        // DataUtil.createByteArray is a helper from the benchmark project
        // which produces a byte[] of the given (randomly chosen) length
        return new String(DataUtil.createByteArray(ThreadLocalRandom.current().nextInt(10, 20)));
    }

}

Since writing this post, I have been challenged on whether this result is really due to a failure to inline. This is easy to verify by setting the following JVM arguments to print compilation and inlining decisions:

-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
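If you are running under JMH, a convenient way to pass these is jvmArgsAppend on the @Fork annotation, so the flags reach the forked benchmark JVM. A minimal sketch – the class and benchmark names here are illustrative, not part of the original harness:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.OutputTimeUnit;

@Fork(value = 1, jvmArgsAppend = {
        "-XX:+PrintCompilation",
        "-XX:+UnlockDiagnosticVMOptions",
        "-XX:+PrintInlining"})
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class InliningDiagnostics {

    // a trivial stand-in benchmark: the point is the @Fork annotation,
    // which routes the diagnostic flags to the forked benchmark JVM
    @Benchmark
    public long baseline() {
        return System.nanoTime();
    }
}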

You will see the ArrayList and ListN get inlined quickly in isolation:

\-> TypeProfile (19810/19810 counts) = java/util/ArrayList
@ 27   java.util.ArrayList::get (15 bytes)   inline (hot)
...
\-> TypeProfile (363174/363174 counts) = java/util/ImmutableCollections$ListN
@ 24   java.util.ImmutableCollections$ListN::get (17 bytes)   inline (hot)

However, the call remains virtual (and not inlined) when three or more implementations are present:

@ 30   java.util.List::get (0 bytes)   virtual call

I didn’t even bother to mix in factory lists of different arity, even though the small factories return distinct specialised classes, because three is the magic number: HotSpot inlines through monomorphic and bimorphic call sites, but gives up once a call site has seen three or more receiver types. Syntactic sugar is nice, but use these factory methods with caution.

Confusing Sets and Lists

Optimisation is rarely free: often point improvements incur costs elsewhere. Sometimes we try to optimise away the constraints we observe, without questioning our underlying assumptions. I have often seen the roles of lists and sets confused. An application can be brought to its knees – that is, cease to deliver commercial value – if List.contains is called frequently enough on big enough lists. And then there is the workaround…

When I moved over to Java from C++ several years ago, it seemed utterly crazy that there was even a Collection interface – writing code against such an abstraction is exactly what Scott Meyers’ Effective STL said not to do. I still think it’s crazy. Sets and lists cannot be substituted for one another, and when you add up the marginal costs, along with the costs of any compensatory workarounds, confusing them is responsible for a lot of performance bugs. As an application developer, it is part of your job to choose. Here are a few simple examples of when to use each collection type.

Contains

Is an element in the collection?

Never ever do this with a List. This operation is O(n), which means the worst case touches every element in the list. You have a choice between a HashSet and a TreeSet, and both have costs and benefits.

If you choose a HashSet, your best case is O(1): you evaluate a hash code, reduce it modulo the size of an array, and look up a bucket containing only one element. The worst case occurs with a degenerate hash code which maps every element to the same bucket: probing that bucket is O(n) again, since you test each element for equality in turn (since Java 8, HashMap converts large buckets into trees, which bounds this pathological case at O(log n) for comparable keys). On average you get something between these two cases, depending on the uniformity of your hash code implementation.

If you choose a TreeSet, you get a worst case of O(log n): effectively a binary search through the nodes of a red-black tree. Performance is limited by the cost of the comparator, and suffers systematically from cache misses for large sets (as with any kind of pointer chasing, branch prediction and prefetching are difficult, if not impossible).
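To make the trade-off concrete, here is a minimal sketch of the membership query done both ways; the sizes and names are illustrative:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Membership {

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1_000_000; ++i) {
            list.add(i);
        }

        // O(n) per query: scans the backing array element by element
        boolean inList = list.contains(999_999);

        // one O(n) copy up front, then O(1) expected per query
        Set<Integer> set = new HashSet<>(list);
        boolean inSet = set.contains(999_999);

        System.out.println(inList + " " + inSet);
    }
}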

If you’re working with numbers, and small to medium sized collections, a sorted primitive array may be the best approach, so long as it fits in cache. If you’re working with bounded non-negative integers, you can get worst-case constant time by using a BitSet, as in the sketch below.
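A sketch of constant-time membership for bounded non-negative integers with java.util.BitSet; the bound is illustrative:

import java.util.BitSet;

public class IntMembership {

    public static void main(String[] args) {
        // one bit per possible value: a million values costs ~125KB
        BitSet members = new BitSet(1_000_000);
        members.set(42);
        members.set(999_999);

        // constant time in the worst case: an index into a long[] plus a mask
        System.out.println(members.get(42));  // true
        System.out.println(members.get(43));  // false
    }
}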

Select

What is the value of the element at a given index with respect to a sort order?

This is an obvious use case for a List: it’s O(1) – this is just a lookup at an array offset.

You couldn’t even write the code with a HashSet without copying the data into an intermediate ordered structure, at which point you would probably think again. You see this sort of thing done in code written by inexpensive programmers at large outsourcing consultancies, who were perhaps just under unreasonable pressure to deliver to arbitrary deadlines.

SortedSet, and anything conceptually similar, is the wrong data structure for this operation. The only way to compute this is O(n): you iterate through the set incrementing a counter until you reach the index, and then return the element you’ve iterated to. If you reach the end of the set, you throw. If you do this a lot, you’ll notice.
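A minimal sketch of both versions of select, with illustrative names, shows why the SortedSet version hurts:

import java.util.List;
import java.util.NoSuchElementException;
import java.util.SortedSet;

public class Select {

    // O(1) on a random access list: just an offset into the backing array
    static <T> T select(List<T> list, int index) {
        return list.get(index);
    }

    // O(n) on a SortedSet: nothing better than walking the iterator,
    // counting until the index is reached
    static <T> T select(SortedSet<T> set, int index) {
        int i = 0;
        for (T element : set) {
            if (i++ == index) {
                return element;
            }
        }
        throw new NoSuchElementException("no element at index " + index);
    }
}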

Rank

How many predecessors does an element have with respect to a sort order?

Another classic operation for List, so long as you keep it sorted. Use Collections.binarySearch to find the index of the element in the collection with complexity O(log n). This is its rank. If you can get away with it, primitive arrays will be much better here, especially if they are small enough to fit in cache.
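A sketch of rank over a sorted list, relying on the documented fact that Collections.binarySearch encodes the insertion point of an absent key as -(insertionPoint) - 1:

import java.util.Collections;
import java.util.List;

public class Rank {

    // number of elements strictly less than the key, in O(log n);
    // assumes the list is sorted and free of duplicates
    static <T extends Comparable<? super T>> int rank(List<T> sorted, T key) {
        int index = Collections.binarySearch(sorted, key);
        // a negative result encodes the insertion point, which is
        // exactly the number of smaller elements
        return index >= 0 ? index : -(index + 1);
    }
}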

Once again, there are creativity points available for the solution involving a HashSet, and they do exist in the wild, but a clean-looking solution is at least possible with a SortedSet, as sketched below. However, it still involves an iteration with a check against an incrementing counter. It’s O(n), and if you do it a lot you’ll blow your performance profile, so use a sorted list instead.
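The clean-looking SortedSet version leans on headSet, but the counting just moves inside the library: for a TreeSet, size() on the view is computed by walking it. A sketch:

import java.util.SortedSet;

public class SortedSetRank {

    // reads as O(1) but is not: for a TreeSet, size() on the headSet
    // view iterates the view's elements, so this is O(n)
    static <T> int rank(SortedSet<T> set, T key) {
        return set.headSet(key).size();
    }
}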

What if you had the source code?

Is this fundamental, or just a limitation of the Collections framework? Maybe if you had the source code you could make these data structures optimal for every use case, without having to choose the right one? Not without creating a Frankenstein’s monster, and not without using a lot of memory. Optimisation isn’t free.