# Project Panama and Population Count

Project Panama introduces a new interface Vector, where the specialisation for long looks like a promising substrate for an explicitly vectorised bit set. Bit sets are useful for representing composable predicates over data sets. One obvious omission on this interface, required for an adequate implementation of a bit set, is a bit count, otherwise known as population count. Perhaps this is because the vector API aims to generalise across primitive types, whereas population count is only meaningful for integral types. Even so, if Vector can be interpreted as a wider integer, then it would be consistent to add this to the interface. If the method existed, what possible implementation could it have?

In x86, the population count of a 64 bit register is computed by the POPCNT instruction, which is exposed in Java as an intrinsic in Long.bitCount. There is no SIMD equivalent in any extension set until VPOPCNTD/VPOPCNTQ in AVX-512. Very few processors (at the time of writing) support AVX-512, and only the Knights Mill processor supports this extension; there are not even Intel intrinsics exposing these instructions yet.
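For reference, the scalar baseline that a vectorised population count would compete with can be sketched as below. This is only an illustration: `ScalarBitCount` and `cardinality` are hypothetical names, and the layout of one `long` per 64 bits of the set is an assumption about how the bit set is stored.

```java
// Minimal sketch of a scalar bit set cardinality.
// ScalarBitCount and cardinality are hypothetical names, not part of any Vector API.
final class ScalarBitCount {

    // Adds up Long.bitCount over every 64-bit word of the set.
    static int cardinality(long[] words) {
        int count = 0;
        for (long word : words) {
            count += Long.bitCount(word);
        }
        return count;
    }

    public static void main(String[] args) {
        long[] words = {0b1011L, -1L, 0L};
        // 3 bits + 64 bits + 0 bits
        System.out.println(cardinality(words)); // prints 67
    }
}
```

Each call to `Long.bitCount` handles 64 bits at a time, which is the width a SIMD implementation would need to beat.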

The algorithm for vectorised population count adopted by the clang compiler is outlined in this paper, which develops on an algorithm designed for 128 bit registers and SSE instructions, presented by Wojciech Muła on his blog in 2008. This approach is shown in the paper to outperform scalar code using POPCNT and 64 bit registers, almost doubling throughput when 256 bit ymm registers are available. The core algorithm (taken from figure 10 in the paper) returns a vector of four 64 bit counts, which can then be added together in a variety of ways to form a population count. It proceeds as follows:


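// The Muła Function
#include <immintrin.h>
__m256i count(__m256i v) {
    // population count of each possible nibble, repeated for both 128 bit lanes
    __m256i lookup = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3,
        1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3,
        1, 2, 2, 3, 2, 3, 3, 4);
    __m256i low_mask = _mm256_set1_epi8(0x0f);
    // split every byte into its low and high nibble
    __m256i lo = _mm256_and_si256(v, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi32(v, 4), low_mask);
    __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo);
    __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi);
    // per-byte counts, then horizontal sums into four 64 bit counts
    __m256i total = _mm256_add_epi8(popcnt1, popcnt2);
    return _mm256_sad_epu8(total, _mm256_setzero_si256());
}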


If you are struggling to read the code above, you are not alone. I haven’t programmed in C++ for several years – it’s amazing how nice the names in civilised languages like Java and Python (and even bash) are compared to the black magic above. There is some logic to the naming though: read page 5 of the manual. You can also read an accessible description of some of the functions used in this blog post.

The basic idea starts from storing the population counts for each possible byte value in a lookup table, which can be looked up using bit level parallelism and ultimately added up. For efficiency’s sake, instead of bytes, 4 bit nibbles are used, which is why you only see numbers 0-4 in the lookup table. Various, occasionally obscure, optimisations are applied, resulting in the magic numbers at the top of the function. A large chunk of the paper is devoted to their derivation: if you are interested, go and read the paper – I could not understand the intent of the code at all until reading the paper twice, especially section 2.
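The nibble lookup can be illustrated without any SIMD at all. The following is a scalar Java sketch of the same idea, not a translation of the paper's code: it handles one byte at a time where the vector version handles 32 bytes per instruction, and `NibbleLookupBitCount` is a hypothetical name.

```java
// Scalar illustration of the nibble lookup idea (hypothetical class, for exposition only).
final class NibbleLookupBitCount {

    // Population counts of the 16 possible nibble values.
    private static final byte[] LOOKUP = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

    // Counts the set bits of a 64-bit word by looking up both nibbles of each byte.
    static int bitCount(long word) {
        int total = 0;
        for (int shift = 0; shift < 64; shift += 8) {
            int b = (int) (word >>> shift) & 0xFF;
            int lo = b & 0x0F;          // low nibble of the byte
            int hi = (b >>> 4) & 0x0F;  // high nibble of the byte
            total += LOOKUP[lo] + LOOKUP[hi];
        }
        return total;
    }

    public static void main(String[] args) {
        long word = 0x0123456789ABCDEFL;
        // Both approaches should agree for any input.
        System.out.println(bitCount(word) == Long.bitCount(word)); // prints true
    }
}
```

In the vectorised version the two table lookups become the two byte shuffles, and the final accumulation is done per 64 bit lane.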

The points I find interesting are:

• This algorithm exists
• It uses instructions all modern commodity processors have
• It is fast
• It is in use

Could this be implemented in the JVM as an intrinsic and exposed on Vector?

# Explicit Intent and Even Faster Hash Codes

I wrote a post recently about how disappointed I was that the optimiser couldn’t outsmart some clever Java code for computing hash codes. Well, here’s a faster hash code along the same lines.

The hash code implemented in Arrays.hashCode is a polynomial hash; it applies to any data type with a positional interpretation. It takes the general form $\sum_{i=0}^{n}x_{i}31^{n - i}$ where $x_0 = 1$. In other words, it’s a dot product of the elements of the array and some powers of 31. Daniel Lemire’s implementation makes it explicit to the optimiser, in a way it won’t otherwise infer, that this operation is data parallel. If it’s really just a dot product it can be made even more obvious at the cost of a loss of flexibility.
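To see where the general form comes from, it may help to unroll the usual recurrence for a small case. The symbol $h_k$ is introduced here purely for illustration: assume $h_0 = x_0 = 1$ and $h_k = 31 h_{k-1} + x_k$, with all arithmetic wrapping modulo $2^{32}$ as Java's int does. For $n = 3$:

$$
h_3 = \big((x_0 \cdot 31 + x_1) \cdot 31 + x_2\big) \cdot 31 + x_3 = x_0 31^{3} + x_1 31^{2} + x_2 31^{1} + x_3 31^{0} = \sum_{i=0}^{3} x_i 31^{3 - i}
$$

Each element ends up multiplied by a power of 31 that depends only on its position, which is why the powers can be computed independently of the data.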

Imagine you are processing fixed or limited length strings (VARCHAR(255) or a URL) or coordinates of a space of fixed dimension. Then you could pre-compute the coefficients in an array and write the hash code explicitly as a dot product. Java 9 uses AVX instructions for dot products, so it should be very fast.


public class FixedLengthHashCode {

    // coefficients[i] holds 31^(maxLength - i)
    private final int[] coefficients;

    public FixedLengthHashCode(int maxLength) {
        this.coefficients = new int[maxLength + 1];
        coefficients[maxLength] = 1;
        for (int i = maxLength - 1; i >= 0; --i) {
            coefficients[i] = 31 * coefficients[i + 1];
        }
    }

    public int hashCode(int[] value) {
        int result = coefficients[0];
        for (int i = 0; i < value.length && i < coefficients.length - 1; ++i) {
            result += coefficients[i + 1] * value[i];
        }
        return result;
    }
}


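This is really explicit, unambiguously parallelisable, and the results are remarkable: the fixed length implementation roughly doubles the throughput of the built-in hash code at every size measured, and is around 50% faster than the unrolled version.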

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|---|---|---|---|---|---|---|---|
| HashCode.BuiltIn | thrpt | 1 | 10 | 10.323026 | 0.223614 | ops/us | 100 |
| HashCode.BuiltIn | thrpt | 1 | 10 | 0.959246 | 0.038900 | ops/us | 1000 |
| HashCode.BuiltIn | thrpt | 1 | 10 | 0.096005 | 0.001836 | ops/us | 10000 |
| HashCode.FixedLength | thrpt | 1 | 10 | 20.186800 | 0.297590 | ops/us | 100 |
| HashCode.FixedLength | thrpt | 1 | 10 | 2.314187 | 0.082867 | ops/us | 1000 |
| HashCode.FixedLength | thrpt | 1 | 10 | 0.227090 | 0.005377 | ops/us | 10000 |
| HashCode.Unrolled | thrpt | 1 | 10 | 13.250821 | 0.752609 | ops/us | 100 |
| HashCode.Unrolled | thrpt | 1 | 10 | 1.503368 | 0.058200 | ops/us | 1000 |
| HashCode.Unrolled | thrpt | 1 | 10 | 0.152179 | 0.003541 | ops/us | 10000 |

Modifying the algorithm to support variable length arrays, up to a fixed maximum length, degrades performance slightly, but there are seemingly equivalent implementations which do much worse.


public class FixedLengthHashCode {

    // coefficients[i] holds 31^i
    private final int[] coefficients;

    public FixedLengthHashCode(int maxLength) {
        this.coefficients = new int[maxLength + 1];
        coefficients[0] = 1;
        for (int i = 1; i <= maxLength; ++i) {
            coefficients[i] = 31 * coefficients[i - 1];
        }
    }

    // value.length must not exceed maxLength
    public int hashCode(int[] value) {
        final int max = value.length;
        int result = coefficients[max];
        for (int i = 0; i < value.length && i < coefficients.length - 1; ++i) {
            result += coefficients[max - i - 1] * value[i];
        }
        return result;
    }
}


| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|---|---|---|---|---|---|---|---|
| FixedLength | thrpt | 1 | 10 | 19.172574 | 0.742637 | ops/us | 100 |
| FixedLength | thrpt | 1 | 10 | 2.233006 | 0.115285 | ops/us | 1000 |
| FixedLength | thrpt | 1 | 10 | 0.227451 | 0.012231 | ops/us | 10000 |
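Since the point of the variable length variant is to produce the same values as Arrays.hashCode, a quick sanity check seems worthwhile. The sketch below re-implements the idea in a self-contained class so it can be run on its own; `VariableLengthHashCheck` is a hypothetical name, and the sketch assumes inputs are never longer than the maximum length passed to the constructor.

```java
import java.util.Arrays;

// Self-contained sanity check (hypothetical class): the precomputed dot product
// should agree with Arrays.hashCode for any array no longer than maxLength.
final class VariableLengthHashCheck {

    // coefficients[k] holds 31^k, wrapping on int overflow
    private final int[] coefficients;

    VariableLengthHashCheck(int maxLength) {
        coefficients = new int[maxLength + 1];
        coefficients[0] = 1;
        for (int i = 1; i <= maxLength; ++i) {
            coefficients[i] = 31 * coefficients[i - 1];
        }
    }

    // Assumes value.length <= maxLength; longer inputs would index out of bounds.
    int hashCode(int[] value) {
        int max = value.length;
        int result = coefficients[max]; // the leading 31^n term
        for (int i = 0; i < max; ++i) {
            result += coefficients[max - i - 1] * value[i];
        }
        return result;
    }

    public static void main(String[] args) {
        VariableLengthHashCheck hasher = new VariableLengthHashCheck(8);
        int[][] inputs = {{}, {42}, {1, 2, 3}, {7, -7, 7, -7, 7, -7, 7, -7}};
        for (int[] input : inputs) {
            // prints true for each input
            System.out.println(hasher.hashCode(input) == Arrays.hashCode(input));
        }
    }
}
```

Because int multiplication and addition wrap consistently, the equality holds even when the powers of 31 overflow.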

The benchmark code is on GitHub.