UUIDs and Compressibility

Universally unique identifiers, or UUIDs, are often used for database primary keys in scenarios where coordination of persistence is either impossible or impractical. UUIDs offer very good probabilistic guarantees of collision avoidance, at the cost of 128 bits per key. 128 bits for a key is quite problematic in key scans and joins: with appropriately structured data, these algorithms can benefit significantly from vector processing, but at 128 bits per element, vectorisation is probably a lost cause. The 128 bit cost is also a fixed cost, even if your database has fewer than one quintilliard rows. By virtue of being random enough to avoid collisions when generated in a distributed system, there is no way to compress UUID keys to the number of bits required to identify each record in the dataset. Data engineering is about tradeoffs, and none of this is to say UUIDs should never be used, but it is prudent to be aware of the cost of convenience. All of this applies in the best case: assuming the keys are stored in their binary format! How bad can it get if UUIDs are stored as text? Can compression undo the damage?

If you work with a relational database like Postgres, you can use an implementation specific uuid type to ensure UUIDs take up as little space as possible. However, having worked on several projects using NoSQL databases, I have seen people store UUIDs as text on at least two occasions (though this is not the fault of the databases!). How harmful this is depends on the character encoding used, but UTF-8 is quite common (for the characters found in a UUID, this is equivalent to ISO-8859-1). A UUID represented by a string like “9289f568-c33f-4667-8a3d-092aa4e21262” can take up the following sizes, depending on the encoding used.

Format Size (bytes) Ratio
binary 16 1
ISO_8859_1 36 2.25
UTF_8 36 2.25
UTF_16 74 4.625

The real issue here is not so much the extra storage burden, because keys are usually much smaller than the values, but that the keys are used to process queries. A representation requiring 2-5x more space requires 2-5x more data to pass through the processor when evaluating queries. Many NoSQL databases offer succinct compression options for keys, which allow the keys to be processed without decompression at a small computational cost, such as the prefix and delta encoding available in HBase. This approach can work wonders with well structured keys, but this will do absolutely nothing if your keys are random.

Even heavyweight compression techniques requiring global or block level decompression before evaluation can’t recover the bloat in a textual representation of a UUID because the text is almost as random as the bytes. I compressed collections of 1M UUIDs using compression algorithms typically reserved for “cold data”: gzip, snappy and LZ4, using the code on github.

Compression Encoding Count Size (MB)
Uncompressed binary 1000000 15.26
Uncompressed ISO-8859-1 1000000 34.33
Uncompressed UTF-16 1000000 70.57
GZIP binary 1000000 15.26
GZIP ISO-8859-1 1000000 19.50
GZIP UTF-16 1000000 23.73
LZ4 binary 1000000 15.32
LZ4 ISO-8859-1 1000000 32.56
LZ4 UTF-16 1000000 50.16
Snappy binary 1000000 15.26
Snappy ISO-8859-1 1000000 33.99
Snappy UTF-16 1000000 37.97

Assuming you are OK with treating your keys as cold data, none of these algorithms will undo the inflation. What’s interesting, assuming you’ve never thought about it before, is that none of these algorithms can compress the binary representation of the UUIDs. This is because the UUIDs are random, and random enough to be taken as unique in any given trillion year epoch. Even though there are only one million values, which could be represented by 20 bits per value, none of the compression algorithms improves on 128 bits per value. This reminds me of a passage from Theories of Everything by John D. Barrow:

The goal of science is to make sense of the diversity of nature. [Science] employs observation to gather information about the world and to test predictions about how the world will react to new circumstances, but in between these two procedures lies the heart of the scientific process. This is nothing more than the transformation of lists of observational data into abbreviated form by the recognition of patterns. The recognition of such a pattern allows the information content of the observed sequence of events to be replaced by a shorthand formula which possesses the same, or almost the same, information content…

We can extend this image of science in a manner that sharpens its focus. Suppose we are presented with any string of symbols. They do not have to be numbers but let us assume for the sake of illustration that they are. We say that the string is ‘random’ if there is no other representation of the string which is shorter than itself. But we say it is ‘non-random’ if there does exist an abbreviated representation.

Random data can’t be compressed, and inflated textual representations of random bits are almost as random as the bits themselves, as far as any existing compression algorithm is concerned. Efficient representation of data requires the identification and exploitation of patterns, and using UUIDs instead of naturally occurring composite keys introduces randomness where there could often be order.

Limiting Factors in a Dot Product Calculation

The dot product is ubiquitous in computing. The calculation is used heavily in nearual networks for perceptron activation and for gradient descent algorithms for error backpropagation. It is one of the building blocks of linear regression. Cosine similarity in document search is yet another dot product.

These use cases are more often implemented in C/C++ than in JVM languages, for reasons of efficiency, but what are the constraints on its computational performance? The combination of the computational simplicity and its streaming nature means the limiting factor in efficient code should be memory bandwidth. This is a good opportunity to look at the raw performance that will be made available with the vector API when it’s released.

Written in Java code since Java 9, the code to calculate a dot product is easy, using the Math.fma intrinsic.

  public float vanilla() {
    float sum = 0f;
    for (int i = 0; i < size; ++i) {
      sum = Math.fma(left[i], right[i], sum);
    }
    return sum;
  }

Despite its simplicity, this code is incredibly inefficient precisely because it’s written in Java. Java is a language which prizes portability, and this sometimes comes at the cost of performance. The only way to make this routine produce the same number given the same input, no matter what operating system or instruction sets are available, is to do the operations in the same order, which means no unrolling or vectorisation. For a web application, this a good trade off, but for data analytics it is not.

An estimate of intensity, assuming a constant processor frequency, and two floating point operations (flops) per FMA, shows that the intensity is constant but very low at 0.67 flops/cycle. There being constant intensity as a function of array size is interesting because it indicates that the performance is insensitive to cache hierarchy, that the the limit is the CPU. Daniel Lemire made this observation with a benchmark written in C, disabling fastmath compiler optimisations, recently.

The JLS’s view on floating point arithmetic is the true limiting factor here. Assuming you really care about dot product performance, the best you can do to opt out is to unroll the loop and get slightly higher throughput.

  public float unrolled() {
    float s0 = 0f;
    float s1 = 0f;
    float s2 = 0f;
    float s3 = 0f;
    float s4 = 0f;
    float s5 = 0f;
    float s6 = 0f;
    float s7 = 0f;
    for (int i = 0; i < size; i += 8) {
      s0 = Math.fma(left[i + 0],  right[i + 0], s0);
      s1 = Math.fma(left[i + 1],  right[i + 1], s1);
      s2 = Math.fma(left[i + 2],  right[i + 2], s2);
      s3 = Math.fma(left[i + 3],  right[i + 3], s3);
      s4 = Math.fma(left[i + 4],  right[i + 4], s4);
      s5 = Math.fma(left[i + 5],  right[i + 5], s5);
      s6 = Math.fma(left[i + 6],  right[i + 6], s6);
      s7 = Math.fma(left[i + 7],  right[i + 7], s7);
    }
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
  }

The intensity is about 4x better, but still constant. My Intel Skylake processor is capable of 32 flops/cycle, so this code is clearly still not very efficient, but it’s actually the best you can do with any released version of OpenJDK at the time of writing.

The Vector API

I have been keeping an eye on the Vector API incubating in Project Panama for some time, and have only recently got round to kicking the tires. I wrote some benchmarks earlier in the year but ran into, as one should expect of a project in active development, bugs in FMA and vector box elimination. This limited the value I would get from writing about the benchmarks. These bugs have been fixed for a long time now, and you can start to see the how good this API is going to be.

Here’s a simple implementation which wouldn’t be legal for C2 (or Graal for that matter) to generate from the dot product loop. It relies on an accumulator vector, into which a vector dot product of the next eight elements is FMA’d for each step of the loop.

  public float vector() {
    var sum = YMM_FLOAT.zero();
    for (int i = 0; i < size; i += YMM_FLOAT.length()) {
      var l = YMM_FLOAT.fromArray(left, i);
      var r = YMM_FLOAT.fromArray(right, i);
      sum = l.fma(r, sum);
    }
    return sum.addAll();
  }

This loop can be unrolled, but it seems that this must be done manually for the sake of stability. The unroll below uses four accumulators and results in a huge boost in throughput.

  private float vectorUnrolled() {
    var sum1 = YMM_FLOAT.zero();
    var sum2 = YMM_FLOAT.zero();
    var sum3 = YMM_FLOAT.zero();
    var sum4 = YMM_FLOAT.zero();
    int width = YMM_FLOAT.length();
    for (int i = 0; i < size; i += width * 4) {
      sum1 = YMM_FLOAT.fromArray(left, i).fma(YMM_FLOAT.fromArray(right, i), sum1);
      sum2 = YMM_FLOAT.fromArray(left, i + width).fma(YMM_FLOAT.fromArray(right, i + width), sum2);
      sum3 = YMM_FLOAT.fromArray(left, i + width * 2).fma(YMM_FLOAT.fromArray(right, i + width * 2), sum3);
      sum4 = YMM_FLOAT.fromArray(left, i + width * 3).fma(YMM_FLOAT.fromArray(right, i + width * 3), sum4);
    }
    return sum1.addAll() + sum2.addAll() + sum3.addAll() + sum4.addAll();
  }

This plot doesn’t quite do justice to how large the difference is and you could be forgiven for thinking the performance converges. In fact, presenting the data like this is a great way to mislead people! The absolute difference narrows, but the relative performance is more or less constant. Looking at intensity gives a much better picture and is size invariant (until memory bandwidth is saturated).

The first thing to notice is that the intensity gets nowhere near 32 flops/cycle, and that’s because my chip can’t load data fast enough to keep the two FMA ports busy. Skylake chips can do two loads per cycle, which is enough for one FMA between two vectors and the accumulator. Since the arrays are effectively streamed, there is no chance to reuse any loads, so the absolute maximum intensity is 50% capacity, or just 16 flops/cycle.

In the unrolled vector code, the intensity hits 12 flops/cycle just before 4096 elements. 4096 is a special number because 2 * 4096 * 4 = 32kB is the capacity of L1 cache. This peak and rapid decrease suggests that the code is fast enough to be hitting memory bandwidth: if L1 were larger or L2 were faster, the intensity could be sustained. This is great, and the performance counters available with -prof perfnorm corroborate.

In the vanilla loop and unrolled loop, the cycles per instruction (CPI) reaches a maximum long before the arrays breach L1 cache. The latency of an instruction depends on where its operands come from, increasing the further away from L1 cache the data comes from. If CPI for arrays either side of the magical 4092 element threshold is the same, then memory cannot be the limiting factor. The unrolled vector loop show a very sharp increase, suggesting a strong dependency on load speed. Similarly, L1-dcache-load-misses can be seen to increase sharply once the arrays are no longer L1 resident (predictably) correlated with a drop in intensity only in the vector unrolled implementation. It’s short lived, but the unrolled vector code, albeit with bounds checks disabled, is efficient enough for the CPU not to be the bottleneck.

Take a look at my benchmarks and raw data.. The JDK used was built from the vectorIntrinsics branch of the Project Panama OpenJDK fork, run with JMH 1.20 on Ubuntu 16.04 LTS, on an 8 core i7-6700HQ processor.

Vectorised Algorithms in Java

There has been a Cambrian explosion of JVM data technologies in recent years. It’s all very exciting, but is the JVM really competitive with C in this area? I would argue that there is a reason Apache Arrow is polyglot, and it’s not just interoperability with Python. To pick on one project impressive enough to be thriving after seven years, if you’ve actually used Apache Spark you will be aware that it looks fastest next to its predecessor, MapReduce. Big data is a lot like teenage sex: everybody talks about it, nobody really knows how to do it, and everyone keeps their embarrassing stories to themselves. In games of incomplete information, it’s possible to overestimate the competence of others: nobody opens up about how slow their Spark jobs really are because there’s a risk of looking stupid.

If it can be accepted that Spark is inefficient, the question becomes is Spark fundamentally inefficient? Flare provides a drop-in replacement for Spark’s backend, but replaces JIT compiled code with highly efficient native code, yielding order of magnitude improvements in job throughput. Some of Flare’s gains come from generating specialised code, but the rest comes from just generating better native code than C2 does. If Flare validates Spark’s execution model, perhaps it raises questions about the suitability of the JVM for high throughput data processing.

I think this will change radically in the coming years. I think the most important reason is the advent of explicit support for SIMD provided by the vector API, which is currently incubating in Project Panama. Once the vector API is complete, I conjecture that projects like Spark will be able to profit enormously from it. This post takes a look at the API in its current state and ignores performance.

Why Vectorisation?

Assuming a flat processor frequency, throughput is improved by a combination of executing many instructions per cycle (pipelining) and processing multiple data items per instruction (SIMD). SIMD instruction sets are provided by Intel as the various generations of SSE and AVX. If throughput is the only goal, maximising SIMD may even be worth reducing the frequency, which can happen on Intel chips when using AVX. Vectorisation allows throughput to be increased by the use of SIMD instructions.

Analytical workloads are particularly suitable for vectorisation, especially over columnar data, because they typically involve operations consuming the entire range of a few numerical attributes of a data set. Vectorised analytical processing with filters is explicitly supported by vector masks, and vectorisation is also profitable for operations on indices typically performed for filtering prior to calculations. I don’t actually need to make a strong case for the impact of vectorisation on analytical workloads: just read the work of top researchers like Daniel Abadi and Daniel Lemire.

Vectorisation in the JVM

C2 provides quite a lot of autovectorisation, which works very well sometimes, but the support is limited and brittle. I have written about this several times. Because AVX can reduce the processor frequency, it’s not always profitable to vectorise, so compilers employ cost models to decide when they should do so. Such cost models require platform specific calibration, and sometimes C2 can get it wrong. Sometimes, specifically in the case of floating point operations, using SIMD conflicts with the JLS, and the code C2 generates can be quite inefficient. In general, data parallel code can be better optimised by C compilers, such as GCC, than C2 because there are fewer constraints, and there is a larger budget for analysis at compile time. This all makes having intrinsics very appealing, and as a user I would like to be able to:

  1. Bypass JLS floating point constraints.
  2. Bypass cost model based decisions.
  3. Avoid JNI at all costs.
  4. Use a modern “object-functional” style. SIMD intrinsics in C are painful.

There is another attempt to provide SIMD intrinsics to JVM users via LMS, a framework for writing programs which write programs, designed by Tiark Rompf (who is also behind Flare). This work is very promising (I have written about it before), but it uses JNI. It’s only at the prototype stage, but currently the intrinsics are auto-generated from XML definitions, which leads to a one-to-one mapping to the intrinsics in immintrin.h, yielding a similar programming experience. This could likely be improved a lot, but the reliance on JNI is fundamental, albeit with minimal boundary crossing.

I am quite excited by the vector API in Project Panama because it looks like it will meet all of these requirements, at least to some extent. It remains to be seen quite how far the implementors will go in the direction of associative floating point arithmetic, but it has to opt out of JLS floating point semantics to some extent, which I think is progressive.

The Vector API

Disclaimer: Everything below is based on my experience with a recent build of the experimental code in the Project Panama fork of OpenJDK. I am not affiliated with the design or implementation of this API, may not be using it properly, and it may change according to its designers’ will before it is released!

To understand the vector API you need to know that there are different register widths and different SIMD instruction sets. Because of my area of work, and 99% of the server market is Intel, I am only interested in AVX, but ARM have their own implementations with different maximum register sizes, which presumably need to be handled by a JVM vector API. On Intel CPUs, SSE instruction sets use up to 128 bit registers (xmm, four ints), AVX and AVX2 use up to 256 bit registers (ymm, eight ints), and AVX512 use up to 512 bit registers (zmm, sixteen ints).

The instruction sets are typed, and instructions designed to operate on packed doubles can’t operate on packed ints without explicit casting. This is modeled by the interface Vector<Shape>, parametrised by the Shape interface which models the register width.

The types of the vector elements is modeled by abstract element type specific classes such as IntVector. At the leaves of the hierarchy are the concrete classes specialised both to element type and register width, such as IntVector256 which extends IntVector<Shapes.S256Bit>.

Since EJB, the word factory has been a dirty word, which might be why the word species is used in this API. To create a IntVector<Shapes.S256Bit>, you can create the factory/species as follows:

public static final IntVector.IntSpecies<Shapes.S256Bit> YMM_INT = 
          (IntVector.IntSpecies<Shapes.S256Bit>) Vector.species(int.class, Shapes.S_256_BIT);

There are now various ways to create a vector from the species, which all have their use cases. First, you can load vectors from arrays: imagine you want to calculate the bitwise intersection of two int[]s. This can be written quite cleanly, without any shape/register information.


public static int[] intersect(int[] left, int[] right) {
    assert left.length == right.length;
    int[] result = new int[left.length];
    for (int i = 0; i < left.length; i += YMM_INT.length()) {
      YMM_INT.fromArray(left, i)
             .and(YMM_INT.fromArray(right, i))
             .intoArray(result, i);
    }
}

A common pattern in vectorised code is to broadcast a variable into a vector, for instance, to facilitate the multiplication of a vector by a scalar.

IntVector<Shapes.S256Bit> multiplier = YMM_INT.broadcast(x);

Or to create a vector from some scalars, for instance in a lookup table.

IntVector<Shapes.S256Bit> vector = YMM_INT.scalars(0, 1, 2, 3, 4, 5, 6, 7);

A zero vector can be created from a species:

IntVector<Shapes.S256Bit> zero = YMM_INT.zero();

The big split in the class hierarchy is between integral and floating point types. Integral types have meaningful bitwise operations (I am looking forward to trying to write a vectorised population count algorithm), which are absent from FloatVector and DoubleVector, and there is no concept of fused-multiply-add for integral types, so there is obviously no IntVector.fma. The common subset of operations is arithmetic, casting and loading/storing operations.

I generally like the API a lot: it feels familiar to programming with streams, but on the other hand, it isn’t too far removed from traditional intrinsics. Below is an implementation of a fast matrix multiplication written in C, and below it is the same code written with the vector API:


static void mmul_tiled_avx_unrolled(const int n, const float *left, const float *right, float *result) {
    const int block_width = n >= 256 ? 512 : 256;
    const int block_height = n >= 512 ? 8 : n >= 256 ? 16 : 32;
    for (int column_offset = 0; column_offset < n; column_offset += block_width) {
        for (int row_offset = 0; row_offset < n; row_offset += block_height) {
            for (int i = 0; i < n; ++i) {
                for (int j = column_offset; j < column_offset + block_width && j < n; j += 64) {
                    __m256 sum1 = _mm256_load_ps(result + i * n + j);
                    __m256 sum2 = _mm256_load_ps(result + i * n + j + 8);
                    __m256 sum3 = _mm256_load_ps(result + i * n + j + 16);
                    __m256 sum4 = _mm256_load_ps(result + i * n + j + 24);
                    __m256 sum5 = _mm256_load_ps(result + i * n + j + 32);
                    __m256 sum6 = _mm256_load_ps(result + i * n + j + 40);
                    __m256 sum7 = _mm256_load_ps(result + i * n + j + 48);
                    __m256 sum8 = _mm256_load_ps(result + i * n + j + 56);
                    for (int k = row_offset; k < row_offset + block_height && k < n; ++k) {
                        __m256 multiplier = _mm256_set1_ps(left[i * n + k]);
                        sum1 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j), sum1);
                        sum2 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 8), sum2);
                        sum3 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 16), sum3);
                        sum4 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 24), sum4);
                        sum5 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 32), sum5);
                        sum6 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 40), sum6);
                        sum7 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 48), sum7);
                        sum8 = _mm256_fmadd_ps(multiplier, _mm256_load_ps(right + k * n + j + 56), sum8);
                    }
                    _mm256_store_ps(result + i * n + j, sum1);
                    _mm256_store_ps(result + i * n + j + 8, sum2);
                    _mm256_store_ps(result + i * n + j + 16, sum3);
                    _mm256_store_ps(result + i * n + j + 24, sum4);
                    _mm256_store_ps(result + i * n + j + 32, sum5);
                    _mm256_store_ps(result + i * n + j + 40, sum6);
                    _mm256_store_ps(result + i * n + j + 48, sum7);
                    _mm256_store_ps(result + i * n + j + 56, sum8);
                }
            }
        }
    }
}


  private static void mmul(int n, float[] left, float[] right, float[] result) {
    int blockWidth = n >= 256 ? 512 : 256;
    int blockHeight = n >= 512 ? 8 : n >= 256 ? 16 : 32;
    for (int columnOffset = 0; columnOffset < n; columnOffset += blockWidth) {
      for (int rowOffset = 0; rowOffset < n; rowOffset += blockHeight) {
        for (int i = 0; i < n; ++i) {
          for (int j = columnOffset; j < columnOffset + blockWidth && j < n; j += 64) {
            var sum1 = YMM_FLOAT.fromArray(result, i * n + j);
            var sum2 = YMM_FLOAT.fromArray(result, i * n + j + 8);
            var sum3 = YMM_FLOAT.fromArray(result, i * n + j + 16);
            var sum4 = YMM_FLOAT.fromArray(result, i * n + j + 24);
            var sum5 = YMM_FLOAT.fromArray(result, i * n + j + 32);
            var sum6 = YMM_FLOAT.fromArray(result, i * n + j + 40);
            var sum7 = YMM_FLOAT.fromArray(result, i * n + j + 48);
            var sum8 = YMM_FLOAT.fromArray(result, i * n + j + 56);
            for (int k = rowOffset; k < rowOffset + blockHeight && k < n; ++k) {
              var multiplier = YMM_FLOAT.broadcast(left[i * n + k]);
              sum1 = multiplier.fma(YMM_FLOAT.fromArray(right, k * n + j), sum1);
              sum2 = multiplier.fma(YMM_FLOAT.fromArray(right, k * n + j + 8), sum2);
              sum3 = multiplier.fma(YMM_FLOAT.fromArray(right, k * n + j + 16), sum3);
              sum4 = multiplier.fma(YMM_FLOAT.fromArray(right, k * n + j + 24), sum4);
              sum5 = multiplier.fma(YMM_FLOAT.fromArray(right, k * n + j + 32), sum5);
              sum6 = multiplier.fma(YMM_FLOAT.fromArray(right, k * n + j + 40), sum6);
              sum7 = multiplier.fma(YMM_FLOAT.fromArray(right, k * n + j + 48), sum7);
              sum8 = multiplier.fma(YMM_FLOAT.fromArray(right, k * n + j + 56), sum8);
            }
            sum1.intoArray(result, i * n + j);
            sum2.intoArray(result, i * n + j + 8);
            sum3.intoArray(result, i * n + j + 16);
            sum4.intoArray(result, i * n + j + 24);
            sum5.intoArray(result, i * n + j + 32);
            sum6.intoArray(result, i * n + j + 40);
            sum7.intoArray(result, i * n + j + 48);
            sum8.intoArray(result, i * n + j + 56);
          }
        }
      }
    }
  }

They just aren’t that different, and it’s easy to translate between the two. I wouldn’t expect it to be fast yet though. I have no idea what the scope of work involved in implementing all of the C2 intrinsics to make this possible is, but I assume it’s vast. The class jdk.incubator.vector.VectorIntrinsics seems to contain all of the intrinsics implemented so far, and it doesn’t contain every operation used in my array multiplication code. There is also the question of value types and vector box elimination. I will probably look at this again in the future when more of the JIT compiler work has been done, but I’m starting to get very excited about the possibility of much faster JVM based data processing.

I have written various benchmarks for useful analytical subroutines using the Vector API at github.

Stages

The conventional wisdom is that the stages of a task should be pipelined, so you don’t need to wait for the completion of one stage before the next is started. It surprises me that it seems you can sometimes do better when performing each stage of a pipeline in a short batch. Useful optimisation opportunities can arise from this phenomenon, with only minor code changes. I recently applied this principle to implement fast batch iterators in RoaringBitmap.

I came across a discussion about shuffling arrays on Twitter, stimulated by a blog post by Daniel Lemire. Imagine you want to randomly shuffle the contents of an array. One approach to take would be to iterate over the array in reverse, at each index i, generate a random index j smaller than i, and swap the elements at i and j. Here’s some benchmark code to measure how long this takes for an assortment of swapping strategies, including one where the swaps are just precomputed and looked up in an array.


  @Benchmark
  public void shuffle(Blackhole bh) {
    for (int i = data.length; i > 1; i--)
      swap(data, i - 1, op.applyAsInt(i));
    bh.consume(data);
  }

  private static void swap(int[] arr, int i, int j) {
    arr[i] ^= arr[j];
    arr[j] ^= arr[i];
    arr[i] ^= arr[j];
  }

There is a large difference between the version where the random swap is precomputed and when the swap is computed on the fly with ThreadLocalRandom.nextInt.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: mode Param: size
shuffle thrpt 1 10 2198.459182 274.965189 ops/s THREAD_LOCAL_RANDOM 65536
shuffle thrpt 1 10 1015.796005 16.225480 ops/s THREAD_LOCAL_RANDOM 131072
shuffle thrpt 1 10 7300.732038 46.788234 ops/s PRECOMPUTED 65536
shuffle thrpt 1 10 3828.021096 450.874537 ops/s PRECOMPUTED 131072

The difference is large, but a lot more work is being done when the random indices are computed on the fly. A good measure of efficiency per unit of work is cycles per instruction (CPI). Running the benchmark with -prof perfnorm shows that these benchmarks are at parity for cycles per instruction: if throughput is lower when the random numbers are generated on the fly, it’s because there are more instructions to execute.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: mode Param: size
shuffle:CPI thrpt 1 1 0.427028 NaN #/op THREAD_LOCAL_RANDOM 65536
shuffle:CPI thrpt 1 1 0.447793 NaN #/op THREAD_LOCAL_RANDOM 131072
shuffle:CPI thrpt 1 1 0.477202 NaN #/op PRECOMPUTED 65536
shuffle:CPI thrpt 1 1 0.565153 NaN #/op PRECOMPUTED 131072

Nevertheless, instruction profiling with -prof perfasm shows that the code is qualitatively different when computing the next swapped index is simple. When there is random number generation to do, most of the time is attributed either to mov or just after mov instructions (probably because of profiling skid) during the swap. For example, with the smaller array:


  0.04%    0.00%  │   ││  0x00007fa009c0a8f9: xor    0x10(%rsi,%rdx,4),%r10d  
 15.31%   13.18%  │   ││  0x00007fa009c0a8fe: mov    %r10d,0xc(%rsi,%rcx,4)  
  3.43%    3.05%  │   ││  0x00007fa009c0a903: xor    0x10(%rsi,%rdx,4),%r10d  
  5.37%    5.92%  │   ││  0x00007fa009c0a908: mov    %r10d,0x10(%rsi,%rdx,4)  
  4.15%    4.22%  │   ││  0x00007fa009c0a90d: xor    %r10d,0xc(%rsi,%rcx,4)  
 10.80%    8.80%  │   ││  0x00007fa009c0a912: cmp    $0x1,%r9d ; probably skid

The key difference in the precomputed case is that the loop is unrolled with several isomorphic chains of instructions. None of the loads seem to be quite so expensive according to the sampled frequencies.


  0.08%    0.16%  │      0x00007fda2dc0dfb2: cmp    %r10d,%r9d
                  │      0x00007fda2dc0dfb5: jae    0x00007fda2dc0e264
  0.00%    0.00%  │      0x00007fda2dc0dfbb: xor    0x10(%rdx,%r9,4),%edi
  2.90%    2.89%  │      0x00007fda2dc0dfc0: mov    %edi,0xc(%rdx,%r11,4)
  0.48%    0.33%  │      0x00007fda2dc0dfc5: xor    0x10(%rdx,%r9,4),%edi
  0.45%    0.48%  │      0x00007fda2dc0dfca: mov    %edi,0x10(%rdx,%r9,4)
  0.56%    0.46%  │      0x00007fda2dc0dfcf: xor    %edi,0xc(%rdx,%r11,4)
  4.29%    3.88%  │      0x00007fda2dc0dfd4: mov    0x8(%rdx,%r11,4),%edi
  0.03%    0.01%  │      0x00007fda2dc0dfd9: mov    0x8(%rsi,%r11,4),%r9d
  1.38%    1.46%  │      0x00007fda2dc0dfde: mov    %r11d,%ebx
  0.02%    0.01%  │      0x00007fda2dc0dfe1: add    $0xfffffffe,%ebx   

  0.63%    0.61%  │      0x00007fda2dc0dfe4: cmp    %r10d,%r9d
                  │      0x00007fda2dc0dfe7: jae    0x00007fda2dc0e26f
  0.00%    0.01%  │      0x00007fda2dc0dfed: xor    0x10(%rdx,%r9,4),%edi
  2.60%    2.38%  │      0x00007fda2dc0dff2: mov    %edi,0x8(%rdx,%r11,4)
  0.58%    0.51%  │      0x00007fda2dc0dff7: xor    0x10(%rdx,%r9,4),%edi
  0.90%    0.96%  │      0x00007fda2dc0dffc: mov    %edi,0x10(%rdx,%r9,4)
  0.68%    0.66%  │      0x00007fda2dc0e001: xor    %edi,0x8(%rdx,%r11,4)
  4.85%    4.17%  │      0x00007fda2dc0e006: mov    0x4(%rdx,%r11,4),%edi
  0.01%    0.02%  │      0x00007fda2dc0e00b: mov    0x4(%rsi,%r11,4),%r9d
  1.12%    0.95%  │      0x00007fda2dc0e010: mov    %r11d,%ecx
  0.01%    0.00%  │      0x00007fda2dc0e013: add    $0xfffffffd,%ecx  

  1.02%    1.02%  │      0x00007fda2dc0e016: cmp    %r10d,%r9d
                  │      0x00007fda2dc0e019: jae    0x00007fda2dc0e267
  0.01%    0.01%  │      0x00007fda2dc0e01f: xor    0x10(%rdx,%r9,4),%edi
  2.47%    2.10%  │      0x00007fda2dc0e024: mov    %edi,0x4(%rdx,%r11,4)
  0.69%    0.57%  │      0x00007fda2dc0e029: xor    0x10(%rdx,%r9,4),%edi
  1.37%    1.50%  │      0x00007fda2dc0e02e: mov    %edi,0x10(%rdx,%r9,4)
  0.77%    0.83%  │      0x00007fda2dc0e033: xor    %edi,0x4(%rdx,%r11,4)
  4.28%    3.85%  │      0x00007fda2dc0e038: mov    (%rdx,%r11,4),%edi
  0.03%    0.02%  │      0x00007fda2dc0e03c: mov    (%rsi,%r11,4),%r9d
  1.14%    0.97%  │      0x00007fda2dc0e040: mov    %r11d,%ebx
  0.01%    0.00%  │      0x00007fda2dc0e043: add    $0xfffffffc,%ebx  

With unrolling, some of each chain can take place concurrently, and if there is a cache miss in one chain, it won’t stall the progress of the other chains. Without this capacity for parallelism, a cache miss during the swap will stop all work from progressing. As the probability of a cache miss increases, the cost of the load bottleneck in the swap should increase: this can be stressed by increasing the size of the array. With a large (100M) array, there’s a good chance of a cache miss virtually all the time. CPI increases in both cases, markedly so with the precomputed swaps, but throughput converges: access to main memory has become the bottleneck.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: mode Param: size
shuffle:CPI thrpt 1 1 1.354325 NaN #/op THREAD_LOCAL_RANDOM 100000000
shuffle:CPI thrpt 1 1 3.854150 NaN #/op PRECOMPUTED 100000000

The perfasm output points to the first load in the generated swap as the bottleneck: notice the large cost on the mov instruction.


  0.15%    0.24%  │      ││  0x00007f8405c0a264: cmp    %r9d,%edx
                  │      ││  0x00007f8405c0a267: jae    0x00007f8405c0a350
  0.10%    0.11%  │      ││  0x00007f8405c0a26d: xor    0x10(%r11,%rdx,4),%eax  
 73.97%   63.58%  │      ││  0x00007f8405c0a272: mov    %eax,0xc(%r11,%rcx,4)  
  2.46%    1.87%  │      ││  0x00007f8405c0a277: xor    0x10(%r11,%rdx,4),%eax 
  1.42%    0.67%  │      ││  0x00007f8405c0a27c: mov    %eax,0x10(%r11,%rdx,4) 
  2.19%    1.44%  │      ││  0x00007f8405c0a281: xor    %eax,0xc(%r11,%rcx,4) 
  2.16%    1.37%  │      ││  0x00007f8405c0a286: cmp    $0x1,%edi

With precomputed swaps, there is no single bottleneck, and my intuition is that there is some concurrency, despite the higher CPI. This is a long way from being proven.


 10.33%   11.23%   ││  0x00007fdb35c09250: mov    %r9d,0xc(%rsi,%r10,4)  
  0.41%    0.45%   ││  0x00007fdb35c09255: xor    0x10(%rsi,%r11,4),%r9d  
  0.36%    0.25%   ││  0x00007fdb35c0925a: mov    %r9d,0x10(%rsi,%r11,4)  
  0.42%    0.42%   ││  0x00007fdb35c0925f: xor    %r9d,0xc(%rsi,%r10,4)  
  0.51%    0.66%   ││  0x00007fdb35c09264: mov    0x8(%rsi,%r10,4),%r9d  
  0.03%    0.09%   ││  0x00007fdb35c09269: mov    0x8(%r13,%r10,4),%r11d 
  0.25%    0.20%   ││  0x00007fdb35c0926e: mov    %r10d,%r8d
  0.03%    0.15%   ││  0x00007fdb35c09271: add    $0xfffffffe,%r8d  
  0.19%    0.17%   ││  0x00007fdb35c09275: cmp    %ebx,%r11d
                   ││  0x00007fdb35c09278: jae    0x00007fdb35c09440
  0.02%    0.06%   ││  0x00007fdb35c0927e: xor    0x10(%rsi,%r11,4),%r9d  
 10.40%   10.66%   ││  0x00007fdb35c09283: mov    %r9d,0x8(%rsi,%r10,4) 
  0.41%    0.35%   ││  0x00007fdb35c09288: xor    0x10(%rsi,%r11,4),%r9d 
  0.41%    0.30%   ││  0x00007fdb35c0928d: mov    %r9d,0x10(%rsi,%r11,4) 
  0.45%    0.39%   ││  0x00007fdb35c09292: xor    %r9d,0x8(%rsi,%r10,4)  
  0.48%    0.60%   ││  0x00007fdb35c09297: mov    0x4(%rsi,%r10,4),%r9d  
  0.03%    0.06%   ││  0x00007fdb35c0929c: mov    0x4(%r13,%r10,4),%r11d 
  0.06%    0.11%   ││  0x00007fdb35c092a1: mov    %r10d,%edi
  0.02%    0.16%   ││  0x00007fdb35c092a4: add    $0xfffffffd,%edi   
  0.25%    0.20%   ││  0x00007fdb35c092a7: cmp    %ebx,%r11d

This can be exploited so the random numbers can be generated on the fly without a single bottleneck by using a hybrid approach. The random swaps can be generated on the fly and written into a small buffer. Once the buffer is full, the swaps are done. This should “decouple” the random number generation from the swapping code, and should allow some of the swaps to be performed independently. Concretely:


  @Benchmark
  public void shuffleBuffered(Blackhole bh) {
    for (int i = data.length; i - unroll > 1; i -= unroll) {
      for (int j = 0; j < buffer.length; ++j) {
        buffer[j] = op.applyAsInt(i - j);
      }
      for (int j = 0; j < buffer.length; ++j) {
        swap(data, i - j - 1, buffer[j]);
      }
    }
    bh.consume(data);
  }

There’s not much to be gained (or lost) from this until the array gets quite large, but it’s a relatively interesting outcome. CPI is on the whole improved, and throughput improves as a function of buffer size, so long as the buffer is small.

Mode Benchmark 16 32 64 128 256
PRECOMPUTED shuffle 0.30639 0.296566 0.309829 0.312449 0.311183
PRECOMPUTED shuffle:CPI 3.004183 3.126903 2.989748 2.987508 3.000369
THREAD_LOCAL_RANDOM shuffle 0.271536 0.266418 0.271858 0.265593 0.264507
THREAD_LOCAL_RANDOM shuffle:CPI 1.303454 1.328127 1.300731 1.32857 1.377559
THREAD_LOCAL_RANDOM shuffleBuffered 0.296098 0.324416 0.346934 0.353246 0.35277
THREAD_LOCAL_RANDOM shuffleBuffered:CPI 0.96738 0.937101 0.893673 0.87786 0.874607

Frankly, I can’t think of anything less interesting than scrambling an array, but this observation made me think there may be something in the idea of decoupling stages of work in choice scenarios. After going through the otherwise pointless exercise above, I decided it might be worthwhile spending some time porting a batch iteration feature implemented in the Go version of RoaringBitmap to Java. Confident in the potential advantages of batching, I decided to try porting the feature to Java. This idea turned out to be hugely profitable, speeding up iteration between 2x and 10x. If you use RoaringBitmap, it might be worth switching to these new batch iterators.

This topic is explored in more depth by Daniel Lemire in Fast Random Integer Generation in an Interval.

Data Driven Logic

I really don’t like reading or writing blocks of if-else statements. They make my eyes glaze over. Rumour has it that processors don’t like executing them either, though that’s less true now than it once was. There are two problems with these blocks of statements, and neither one of them is performance:

  1. They are hard to read and tend to have subtle dependencies on line order.
  2. They can’t be treated as data, and can’t be executed remotely unless you do something weird like serialise code.

Since I started programming in Java, I have been aware of the existence of rule engines, but I have never heard of a single case of “soft coding” working out. In my own experience, every time I have been involved in the implementation of a system with a DSL to empower business analysts to control the business logic, there has been low stakeholder participation during the design of the DSL and developers have ended up writing the business logic anyway. The most excruciating aspect of this is that it dilutes accountability for testing by blurring the boundaries between the application and user input. Perhaps your experience differs. However, rule engines can eradicate cyclomatic complexity in application code, and systems consisting of straight line code (with high test coverage) tend to do what they are supposed to. Soft coding isn’t the appeal of rule engines, getting rid of the if-else blocks is. If you squint at rule engines the right way, they look data driven and they start to get exciting. I can’t see value in anything more complicated than a decision table.

You can represent a block of if-else statements as a decision table by considering every possible branch as a line in the table. Your decision table doesn’t need to be exhaustive: there can be cases where you fall through and throw an exception or choose a default. This can be quite hard to write in imperative code, and you may need to throw the same exception in multiple places, set flags, or otherwise rely on line order to achieve this. Decision tables have a really nice property: if you want to start treating certain cases as exceptional, you just delete the line from the table.

Decision tables are very similar in character to case classes in Scala, or to the weaker when expressions present in Kotlin, but decision tables can be allowed to grow much larger. I wouldn’t allow a match expression with 50,000 cases through a code review even if someone had the energy to write one deviously enough to come in under the maximum byte code method size.

I looked at several implementations of decision tables on GitHub and saw a lot of clean code, but not a lot of textbook computer science. Every single implementation iterated through a list of rules checking the rule against the input data. I have implemented a password strength checker like this in the past (I know! I probably shouldn’t have done this myself!) which is fine because a strong password checker might have at most a dozen rules. What if you work in adtech and want to classify the people you track (how do you sleep at night?) as members of one or many of 50,000 clusters which can be described in terms of regions of, say, a 50 dimensional feature space? Imagine your task is to guess which cluster your quarry belongs to in a few microseconds at most. You won’t get far if you iterate through thousands of rules.

I implemented a small library in the evenings over the last couple of weeks called bitrules. This was based on some ideas I had about using RoaringBitmap for decision tables last year. The idea is very simple: think of a list of rules with constrained attributes as a matrix, and transpose that matrix and loop through the attributes during classification. This is a similar approach to that taken in blocked signatures, a search technique used in BitFunnel, which translates an expensive signature scan to a random access. In the case of bitrules, for each constraint on each attribute, bits are removed from a mask of potentially matching rules. These bitsets are intersected sequentially, resulting in a bitset rapidly decreasing in cardinality. Because I used RoaringBitmap, rapid reduction in cardinality means a rapid reduction in size, which means cache friendliness. There are a few tricks in the code like using range encoding for range attributes, so that range queries can be evaluated with a single bitset intersection. I plan to implement a hopefully efficient serialisation format so the table can be sent to another server and used for classification remotely.

I don’t actually know how fast this code is: performance is context sensitive, and I shy away from making “performance measurements”. It’s best suited to cases where there are a large number of rules (thousands) and I bet it’s really fast when there are ~50,000 segments in a ~50 dimensional space. I don’t even have a use case right now for bitrules: it was just fun writing the code. I have started releasing it to maven central, while I can’t guarantee its fitness for purpose, perhaps it may be of some use to someone else.