# Limiting Factors in a Dot Product Calculation

The dot product is ubiquitous in computing. The simple calculation is used heavily in neural networks for perceptron activation and again in gradient descent algorithms for error backpropagation. It is one of the building blocks of linear regression. Cosine similarity in document search is yet another dot product.
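To make the connection concrete, here is a minimal sketch of cosine similarity expressed as dot products (the class and method names are mine, for illustration only):

```java
public class CosineSimilarity {
  // Dot product of two equal-length vectors.
  public static float dot(float[] a, float[] b) {
    float s = 0f;
    for (int i = 0; i < a.length; ++i) {
      s = Math.fma(a[i], b[i], s);
    }
    return s;
  }

  // Cosine similarity is the dot product normalised by the vector norms,
  // and each norm is itself the square root of a dot product.
  public static float cosineSimilarity(float[] a, float[] b) {
    return (float) (dot(a, b) / Math.sqrt((double) dot(a, a) * dot(b, b)));
  }
}
```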

These use cases are more often implemented in C/C++ than in JVM languages, for reasons of efficiency, but what are the constraints on its computational performance? The combination of the computational simplicity and its streaming nature means the limiting factor in efficient code should be memory bandwidth. This is a good opportunity to look at the raw performance that will be made available with the vector API when it’s released.

Since Java 9, the code to calculate a dot product is easy to write in Java, using the `Math.fma` intrinsic.

```java
  public float vanilla() {
    float sum = 0f;
    for (int i = 0; i < size; ++i) {
      sum = Math.fma(left[i], right[i], sum);
    }
    return sum;
  }
```
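The snippets in this post refer to instance fields `size`, `left`, and `right`. A minimal self-contained harness might look like this (the field names come from the loops in this post; the class name and constructor are assumptions):

```java
public class DotProduct {
  final int size;       // number of elements in each vector
  final float[] left;
  final float[] right;

  DotProduct(float[] left, float[] right) {
    this.size = left.length;
    this.left = left;
    this.right = right;
  }

  public float vanilla() {
    float sum = 0f;
    for (int i = 0; i < size; ++i) {
      sum = Math.fma(left[i], right[i], sum);
    }
    return sum;
  }
}
```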

Despite its simplicity, this code is incredibly inefficient precisely because it’s written in Java. Java is a language which prizes portability, and this sometimes comes at the cost of performance. The only way to make this routine produce the same number given the same input, no matter what operating system or instruction sets are available, is to do the operations in the same order, which means no unrolling or vectorisation. For a web application, this is a good trade-off, but for data analytics it is not.
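The reason the order matters is that floating point addition is not associative, so any reordering can change the result. A tiny demonstration:

```java
public class Associativity {
  // Accumulating left to right, the way the scalar loop does.
  public static float leftToRight(float a, float b, float c) {
    return (a + b) + c;
  }

  // The same three values added in a different order.
  public static float rightToLeft(float a, float b, float c) {
    return a + (b + c);
  }
}
```

With `a = 1e8f`, `b = -1e8f`, `c = 1e-3f`, the first ordering yields `1e-3f` while the second yields `0f`, because `1e-3f` is absorbed when added to `-1e8f` first.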

An estimate of intensity, assuming a constant processor frequency and two floating point operations (flops) per FMA, shows that the intensity is constant but very low, at 0.67 flops/cycle. Constant intensity as a function of array size is interesting: it indicates that performance is insensitive to the cache hierarchy, and that the limit is the CPU itself. Daniel Lemire recently made the same observation with a benchmark written in C, with fastmath compiler optimisations disabled.
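The arithmetic behind that figure can be sketched as follows. This is back-of-envelope reasoning, not a measurement: the scalar loop is serialised by the dependency on `sum`, so the observed intensity implies roughly one FMA every three cycles.

```java
public class Intensity {
  // Each FMA is 2 flops; if the loop sustains a given flops/cycle, the
  // implied gap between dependent FMAs is flopsPerFma / flopsPerCycle.
  public static double impliedCyclesPerFma(double flopsPerFma, double flopsPerCycle) {
    return flopsPerFma / flopsPerCycle;
  }
}
```

Plugging in the observed 0.67 flops/cycle gives roughly 3 cycles per FMA, consistent with a loop-carried dependency chain.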

The JLS’s view on floating point arithmetic is the true limiting factor here. Assuming you really care about dot product performance, the best you can do to opt out is to unroll the loop and get slightly higher throughput.

```java
  public float unrolled() {
    float s0 = 0f;
    float s1 = 0f;
    float s2 = 0f;
    float s3 = 0f;
    float s4 = 0f;
    float s5 = 0f;
    float s6 = 0f;
    float s7 = 0f;
    for (int i = 0; i < size; i += 8) {
      s0 = Math.fma(left[i + 0], right[i + 0], s0);
      s1 = Math.fma(left[i + 1], right[i + 1], s1);
      s2 = Math.fma(left[i + 2], right[i + 2], s2);
      s3 = Math.fma(left[i + 3], right[i + 3], s3);
      s4 = Math.fma(left[i + 4], right[i + 4], s4);
      s5 = Math.fma(left[i + 5], right[i + 5], s5);
      s6 = Math.fma(left[i + 6], right[i + 6], s6);
      s7 = Math.fma(left[i + 7], right[i + 7], s7);
    }
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
  }
```
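Note that the unrolled loop assumes `size` is a multiple of 8, as a benchmark can arrange. For general input, a scalar tail would finish the remainder; the combination below is a sketch of that, not part of the original benchmark:

```java
public class UnrolledWithTail {
  public static float dot(float[] left, float[] right) {
    int size = left.length;
    int limit = size & ~7; // largest multiple of 8 not exceeding size
    float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
    float s4 = 0f, s5 = 0f, s6 = 0f, s7 = 0f;
    for (int i = 0; i < limit; i += 8) {
      s0 = Math.fma(left[i],     right[i],     s0);
      s1 = Math.fma(left[i + 1], right[i + 1], s1);
      s2 = Math.fma(left[i + 2], right[i + 2], s2);
      s3 = Math.fma(left[i + 3], right[i + 3], s3);
      s4 = Math.fma(left[i + 4], right[i + 4], s4);
      s5 = Math.fma(left[i + 5], right[i + 5], s5);
      s6 = Math.fma(left[i + 6], right[i + 6], s6);
      s7 = Math.fma(left[i + 7], right[i + 7], s7);
    }
    float sum = s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
    // Scalar tail for the last size % 8 elements.
    for (int i = limit; i < size; ++i) {
      sum = Math.fma(left[i], right[i], sum);
    }
    return sum;
  }
}
```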

The intensity is about 4x better, but still constant. My Intel Skylake processor is capable of 32 flops/cycle, so this code is clearly still not very efficient, but it’s actually the best you can do with any released version of OpenJDK at the time of writing.
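Where does the 32 flops/cycle figure come from? As an estimate for a Skylake core (stated as an assumption about the microarchitecture): two 256-bit FMA execution ports, eight float lanes per 256-bit register, and two flops per FMA.

```java
public class PeakFlops {
  // Peak single-precision throughput per cycle for a core that can issue
  // fmaPorts FMAs per cycle, each over lanes floats, at 2 flops per FMA.
  public static int peakFlopsPerCycle(int fmaPorts, int lanes, int flopsPerFma) {
    return fmaPorts * lanes * flopsPerFma;
  }
}
```

For Skylake with AVX2: `peakFlopsPerCycle(2, 8, 2)` gives 32.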

### The Vector API

I have been keeping an eye on the Vector API incubating in Project Panama for some time, and have only recently got round to kicking the tires. I wrote some benchmarks earlier in the year but ran into, as one should expect of a project in active development, bugs in FMA and vector box elimination, which limited the value of writing about the benchmarks. These bugs have been fixed for a long time now, and you can start to see how good this API is going to be.

Here’s a simple implementation which wouldn’t be legal for C2 (or Graal for that matter) to generate from the dot product loop. It relies on an accumulator vector, into which a vector dot product of the next eight elements is FMA’d for each step of the loop.

```java
  public float vector() {
    var sum = YMM_FLOAT.zero();
    for (int i = 0; i < size; i += YMM_FLOAT.length()) {
      var l = YMM_FLOAT.fromArray(left, i);
      var r = YMM_FLOAT.fromArray(right, i);
      sum = l.fma(r, sum);
    }
    return sum.addAll();
  }
```

This loop can be unrolled, but it seems that this must be done manually for the sake of stability. The unroll below uses four accumulators and results in a huge boost in throughput.

```java
  private float vectorUnrolled() {
    var sum1 = YMM_FLOAT.zero();
    var sum2 = YMM_FLOAT.zero();
    var sum3 = YMM_FLOAT.zero();
    var sum4 = YMM_FLOAT.zero();
    int width = YMM_FLOAT.length();
    for (int i = 0; i < size; i += width * 4) {
      sum1 = YMM_FLOAT.fromArray(left, i).fma(YMM_FLOAT.fromArray(right, i), sum1);
      sum2 = YMM_FLOAT.fromArray(left, i + width).fma(YMM_FLOAT.fromArray(right, i + width), sum2);
      sum3 = YMM_FLOAT.fromArray(left, i + width * 2).fma(YMM_FLOAT.fromArray(right, i + width * 2), sum3);
      sum4 = YMM_FLOAT.fromArray(left, i + width * 3).fma(YMM_FLOAT.fromArray(right, i + width * 3), sum4);
    }
    return sum1.addAll() + sum2.addAll() + sum3.addAll() + sum4.addAll();
  }
```

In the unrolled vector code, the intensity hits 12 flops/cycle just before 4096 elements. 4096 is a special number because `2 * 4096 * 4 = 32kB` is the capacity of L1 cache. This peak and rapid decrease suggests that the code is fast enough to be hitting memory bandwidth: if L1 were larger or L2 were faster, the intensity could be sustained. This is great, and the performance counters available with `-prof perfnorm` corroborate it.