Premature Optimisation: Avoiding Boxed Integers

All too often I see trove4j used in favour of the standard collections library, presumably out of fear of the cost of integer boxing. Trove4j is 1.7MB – the full text of Moby Dick fits into a mere 1.3MB. For anyone keeping an eye on the size of deployed artifacts, it is frustrating to find enormous compile-scope dependencies pulled in without good reason. This is not a criticism of trove4j itself, and I am not arguing that it never has a use case; only one of trove4j’s many features is benchmarked here, and it doesn’t do badly. However, when a dependency is this large, the performance gain needs to justify the cost. For an in-depth benchmark analysis of various collection frameworks, read this excellent article by Mikhail Vorontsov. Eschewing the collections framework is a big decision and requires serious justification.

Data Processing Use Case

The use case is for ingesting records from a JDBC result or CSV file. The library needs to be agnostic to the schema of the records it is processing, and is billed as being high performance. Therefore the library neither imposes a schema on the records nor provides any kind of object mapping (think Jackson, Hibernate) functionality. The ideal interface for representing a record, given these requirements, would be a Map – leaving the types of the values up to the caller. However, since the records in a given stream come from a JDBC result or CSV file, it is known that the keys of the map will always be the same. The vast majority of the Map interface is not required, only reading and writing values by key. Therefore the schema (the mapping from field name to column index) can be elevated to the stream level, and the records will be represented as a wrapper around an Object[] with a reference to a single schema Map. This permits reads and writes by field name, and does indeed save a lot of space.
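The flyweight idea, stripped to its essentials, might look like the sketch below. The class and method names here are mine, not from any library: one field-to-index map is shared by every record in the stream, and each record is nothing more than an Object[].

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the flyweight-schema idea: a single shared
// field -> column-index map, with each record backed by a plain Object[].
public class FlyweightRecord {
    private final Map<String, Integer> schema; // shared across all records in a stream
    private final Object[] values;

    FlyweightRecord(Map<String, Integer> schema, Object[] values) {
        this.schema = schema;
        this.values = values;
    }

    Object get(String field) {
        return values[schema.get(field)];
    }

    void set(String field, Object value) {
        values[schema.get(field)] = value;
    }

    public static void main(String[] args) {
        // The schema is built once, at the stream level.
        Map<String, Integer> schema = new HashMap<>();
        schema.put("id", 0);
        schema.put("name", 1);

        FlyweightRecord record = new FlyweightRecord(schema, new Object[] {42, "Ishmael"});
        System.out.println(record.get("name"));
        record.set("name", "Queequeg");
        System.out.println(record.get("name"));
    }
}
```

Every record shares the same schema map instance, so the per-record cost is one object header, one reference, and the array itself.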

Boxed Integers are So Slow!!!

Boxed Integers are really slow, the developer asserts, and reaches for TObjectIntHashMap from trove4j for use as the schema. Surely nobody would mind paying 1.7MB for code this fast. The sole reason this decision is made is to avoid boxing integers when using java.util.HashMap. Well, let’s see if there’s any justifying evidence at all.

The benchmark compares three implementations of Record: a HashMap per record, an Object[] with a shared HashMap schema, and an Object[] with a shared TObjectIntHashMap schema. The code is shown below.


    interface Record {
        <T> T getValue(String field);
        <T> void setValue(String field, T value);
    }

    public static class RecordFactory {

        enum Strategy {
            RAW_HASHMAP,
            FLYWEIGHT_HASHMAP,
            TOBJECTINT_FLYWEIGHT_HASHMAP
        }

        private final String[] schema;
        private final Map<String, Integer> flyweightSchema;
        private final TObjectIntHashMap<String> primitiveFlyWeightSchema;

        public RecordFactory(String[] schema) {
            this.schema = schema;
            this.flyweightSchema = new HashMap<>((schema.length * 3)/2, 0.67f);
            this.primitiveFlyWeightSchema = new TObjectIntHashMap<>((schema.length * 3)/2, 0.67f);
            for (int i = 0; i < schema.length; ++i) {
                flyweightSchema.put(schema[i], i);
                primitiveFlyWeightSchema.put(schema[i], i);
            }
        }

        Record newRecord(Strategy strategy, Object[] nakedRecord) {
            switch (strategy) {
                case RAW_HASHMAP:
                    Map<String, Object> map = new HashMap<>();
                    for (int i = 0; i < schema.length; ++i) {
                        map.put(schema[i], nakedRecord[i]);
                    }
                    return new Record() {
                        @Override
                        public <T> T getValue(String field) {
                            return (T)map.get(field);
                        }

                        @Override
                        public <T> void setValue(String field, T value) {
                            map.put(field, value);
                        }
                    };
                case FLYWEIGHT_HASHMAP:
                    return new Record() {
                        @Override
                        public <T> T getValue(String field) {
                            return (T) nakedRecord[flyweightSchema.get(field)];
                        }

                        @Override
                        public <T> void setValue(String field, T value) {
                            nakedRecord[flyweightSchema.get(field)] = value;
                        }
                    };
                case TOBJECTINT_FLYWEIGHT_HASHMAP:
                    return new Record() {
                        @Override
                        public <T> T getValue(String field) {
                            return (T) nakedRecord[primitiveFlyWeightSchema.get(field)];
                        }

                        @Override
                        public <T> void setValue(String field, T value) {
                            nakedRecord[primitiveFlyWeightSchema.get(field)] = value;
                        }
                    };
                default:
                    throw new RuntimeException("Unknown strategy " + strategy);

            }
        }
    }

A strategy and a naked record can be passed into the factory and an implementation will be returned. The code above is slightly unfair to the naive raw HashMap implementation because it incurs a transformation cost at construction time. This does not matter, because construction time is not measured.

The time to access values of records of various sizes is measured. Tiny records consist of ten fields (that’s your typical well normalised database dimension table), small records consist of 100 fields (perhaps the fact table of a star schema) and huge records have 300 fields. I have never actually seen a table this wide in my line of work except when using HBase.

Construction costs are not measured during the benchmark; only access times over a small batch of records are. The code below is called by the various JMH benchmarks with the appropriate parameters. The Blackhole stops the JIT from eliminating the loop entirely.


    private static void benchmark(String[] fieldsToGet, int limit, List<Record> records, Blackhole bh) {
        for (Record record : records) {
            for (int i = 0; i < limit; ++i) {
                bh.consume((Object)record.getValue(fieldsToGet[i]));
            }
        }
    }
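For readers who want a feel for what the harness does without setting up JMH, here is a crude, stdlib-only sketch. It is deliberately not a substitute for JMH (no forks, no warmup control, and a long accumulator standing in for the Blackhole); all names and sizes here are illustrative, not taken from the benchmark repository.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Crude stand-in for the JMH harness: build a batch of flyweight records,
// then time field access by name. JMH exists precisely because naive timing
// like this is unreliable -- treat the numbers as indicative at best.
public class CrudeHarness {
    public static void main(String[] args) {
        String[] schema = new String[100]; // a "small" record in the article's terms
        for (int i = 0; i < schema.length; ++i) schema[i] = "field" + i;

        Map<String, Integer> flyweightSchema = new HashMap<>();
        for (int i = 0; i < schema.length; ++i) flyweightSchema.put(schema[i], i);

        List<Object[]> records = new ArrayList<>();
        for (int r = 0; r < 1000; ++r) {
            Object[] record = new Object[schema.length];
            for (int i = 0; i < record.length; ++i) record[i] = r * i;
            records.add(record);
        }

        long sink = 0; // poor man's Blackhole: fold every value into a sum
        long start = System.nanoTime();
        for (Object[] record : records) {
            for (String field : schema) {
                sink += (Integer) record[flyweightSchema.get(field)];
            }
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("accesses=" + (records.size() * schema.length));
        System.err.println("sink=" + sink + " elapsedNanos=" + elapsed);
    }
}
```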

The results are as follows; bear in mind there is a 1.7MB storage cost to using trove4j, so we’re looking for a radical speed-up.

Benchmark               Mode   Count  Score          Units
FlyweightHugeRecord     thrpt  10     8.817 ± 0.817  ops/ms
FlyweightSmallRecord    thrpt  10     8.877 ± 0.653  ops/ms
FlyweightTinyRecord     thrpt  10     8.120 ± 0.160  ops/ms
RawHashMapHugeRecord    thrpt  10     3.336 ± 0.116  ops/ms
RawHashMapSmallRecord   thrpt  10     3.774 ± 0.426  ops/ms
RawHashMapTinyRecord    thrpt  10     7.973 ± 0.069  ops/ms
TObjectIntHugeRecord    thrpt  10     7.355 ± 1.184  ops/ms
TObjectIntSmallRecord   thrpt  10     8.094 ± 0.763  ops/ms
TObjectIntTinyRecord    thrpt  10     7.715 ± 0.094  ops/ms

The approach of factoring the schema out of the records is completely validated by this benchmark (this is more or less what schematic serialisation formats like Avro exploit). However, I definitely want my 1.7MB back: the standard library HashMap implementation has the best throughput. Looking at the three nines (the 99.9th percentile of sample times), the HashMap implementation still wins:

Benchmark               Mode    Count  Score  Units
FlyweightHugeRecord     sample  10     0.564  ms/op
FlyweightSmallRecord    sample  10     0.973  ms/op
FlyweightTinyRecord     sample  10     0.512  ms/op
RawHashMapHugeRecord    sample  10     3.218  ms/op
RawHashMapSmallRecord   sample  10     2.163  ms/op
RawHashMapTinyRecord    sample  10     0.923  ms/op
TObjectIntHugeRecord    sample  10     0.765  ms/op
TObjectIntSmallRecord   sample  10     0.664  ms/op
TObjectIntTinyRecord    sample  10     0.837  ms/op

It gets better – we only need positive integers, and only about as many as a database table can have columns. Setting -XX:AutoBoxCacheMax=1000 seems to speed up the HashMap implementation but has no effect at all on TObjectIntHashMap. Think carefully about doing this in an actual application though; it may have unintended consequences.
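The effect of the flag is easy to observe directly. By default the JVM caches boxed Integers in the range -128 to 127, so column indices above 127 are freshly allocated on every boxing; -XX:AutoBoxCacheMax extends the upper bound of that cache. A small demonstration (run it with and without the flag to see the second line flip):

```java
// Demonstrates the Integer cache that -XX:AutoBoxCacheMax extends.
// By default only values in -128..127 are cached, so boxing a column
// index above 127 allocates a fresh Integer on every call.
public class BoxCacheDemo {
    public static void main(String[] args) {
        Integer a = Integer.valueOf(100), b = Integer.valueOf(100);
        Integer c = Integer.valueOf(300), d = Integer.valueOf(300);
        System.out.println(a == b); // true: both references come from the cache
        System.out.println(c == d); // false by default; true with -XX:AutoBoxCacheMax=1000
    }
}
```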

Benchmark               Mode    Count  Score           Units
FlyweightHugeRecord     thrpt   10     8.566 ± 2.463   ops/ms
FlyweightSmallRecord    thrpt   10     10.002 ± 0.413  ops/ms
FlyweightTinyRecord     thrpt   10     8.554 ± 0.990   ops/ms
FlyweightHugeRecord     sample  10     0.398           ms/op
FlyweightSmallRecord    sample  10     0.425           ms/op
FlyweightTinyRecord     sample  10     0.424           ms/op

Conclusion

Measure your code, and don’t take on enormous dependencies unless there is a demonstrable benefit, especially when they carry a cost. Trove4j has to be deployed as a huge jar because of its template-based code generation, so it should be used only when it brings a clear benefit. While trove4j may have several sweet spots (though I’ve only ever seen it abused), this isn’t one of them. The benchmarks are on GitHub as usual.
