# Zeroing Negative Values in Arrays Efficiently

Replacing negative values with zeroes in large arrays is a primitive operation underlying several complex financial risk measures, including potential future exposure (PFE) and the liquidity coverage ratio (LCR). While this is not an interesting operation by any stretch of the imagination, it is useful, and there is significant benefit in doing it quickly. It can be computed very efficiently using the instruction VMAXPD. On Intel Xeon processors, this instruction has a reciprocal throughput of half a cycle and a latency (how long before another instruction can use its result) of four cycles. There is currently no way to coax Java into using this instruction for this simple operation, though there is a placeholder implementation on the current DoubleVector prototype in Project Panama which may eventually do so.
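For reference, the scalar form of the operation is trivial; everything that follows is about getting vector code generated for it. A minimal Java sketch (the class and method names are mine), using the branchless Math.max formulation, which maps most directly onto VMAXPD:

```java
public class ZeroNegatives {

    // Scalar baseline: replace each negative element with zero.
    // Note Math.max(-0.0, 0.0) returns +0.0, whereas a branchy
    // (value < 0 ? 0 : value) implementation would keep -0.0.
    public static double[] zeroNegatives(double[] source, double[] target) {
        for (int i = 0; i < source.length; ++i) {
            target[i] = Math.max(source[i], 0D);
        }
        return target;
    }

    public static void main(String[] args) {
        double[] in = {-1.5, 2.0, -0.0, 3.25};
        double[] out = zeroNegatives(in, new double[in.length]);
        System.out.println(java.util.Arrays.toString(out)); // [0.0, 2.0, 0.0, 3.25]
    }
}
```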

### C++ Intel Intrinsics

It’s possible to target instructions from different processor vendors, in my case Intel, by using intrinsic functions, which expose instructions as high-level functions. The code looks incredibly ugly, but it works. Here is a C++ function for 256-bit ymm registers:

```cpp
void zero_negatives(const double* source, double* target, const size_t length) {
  for (size_t i = 0; i + 3 < length; i += 4) {
    __m256d vector = _mm256_load_pd(source + i);
    __m256d zeroed = _mm256_max_pd(vector, _mm256_setzero_pd());
    _mm256_storeu_pd(target + i, zeroed);
  }
}
```

The function loads doubles into 256-bit vectors, replaces the negative values in each vector with zero, and writes the results back into an array. It generates the following assembly code (which, incidentally, is less of a shit show to access than in Java):

```
void zero_negatives(const double* source, double* target, const size_t length) {
00007FF746EE5110  mov         qword ptr [rsp+18h],r8
00007FF746EE5115  mov         qword ptr [rsp+10h],rdx
00007FF746EE511A  mov         qword ptr [rsp+8],rcx
00007FF746EE511F  push        r13
00007FF746EE5121  push        rbp
00007FF746EE5122  push        rdi
00007FF746EE5123  sub         rsp,250h
00007FF746EE512A  mov         r13,rsp
00007FF746EE512D  lea         rbp,[rsp+20h]
00007FF746EE5132  and         rbp,0FFFFFFFFFFFFFFE0h
00007FF746EE5136  mov         rdi,rsp
00007FF746EE5139  mov         ecx,94h
00007FF746EE513E  mov         eax,0CCCCCCCCh
00007FF746EE5143  rep stos    dword ptr [rdi]
00007FF746EE5145  mov         rcx,qword ptr [rsp+278h]
for (size_t i = 0; i + 3 < length; i += 4) {
00007FF746EE514D  mov         qword ptr [rbp+8],0
00007FF746EE5155  jmp         zero_negatives+53h (07FF746EE5163h)
00007FF746EE5157  mov         rax,qword ptr [rbp+8]
00007FF746EE515F  mov         qword ptr [rbp+8],rax
00007FF746EE5163  mov         rax,qword ptr [rbp+8]
00007FF746EE516B  cmp         rax,qword ptr [length]
00007FF746EE5172  jae         zero_negatives+0DDh (07FF746EE51EDh)
__m256d vector = _mm256_load_pd(source + i);
00007FF746EE5174  mov         rax,qword ptr [source]
00007FF746EE517B  mov         rcx,qword ptr [rbp+8]
00007FF746EE517F  lea         rax,[rax+rcx*8]
00007FF746EE5183  vmovupd     ymm0,ymmword ptr [rax]
00007FF746EE5187  vmovupd     ymmword ptr [rbp+180h],ymm0
00007FF746EE518F  vmovupd     ymm0,ymmword ptr [rbp+180h]
00007FF746EE5197  vmovupd     ymmword ptr [rbp+40h],ymm0
__m256d zeroed = _mm256_max_pd(vector, _mm256_setzero_pd());
00007FF746EE519C  vxorpd      xmm0,xmm0,xmm0
00007FF746EE51A0  vmovupd     ymmword ptr [rbp+200h],ymm0
00007FF746EE51A8  vmovupd     ymm0,ymmword ptr [rbp+40h]
00007FF746EE51B5  vmovupd     ymmword ptr [rbp+1C0h],ymm0
00007FF746EE51BD  vmovupd     ymm0,ymmword ptr [rbp+1C0h]
00007FF746EE51C5  vmovupd     ymmword ptr [rbp+80h],ymm0
_mm256_storeu_pd(target + i, zeroed);
00007FF746EE51CD  mov         rax,qword ptr [target]
00007FF746EE51D4  mov         rcx,qword ptr [rbp+8]
00007FF746EE51D8  lea         rax,[rax+rcx*8]
00007FF746EE51DC  vmovupd     ymm0,ymmword ptr [rbp+80h]
00007FF746EE51E4  vmovupd     ymmword ptr [rax],ymm0
}
00007FF746EE51E8  jmp         zero_negatives+47h (07FF746EE5157h)
}
00007FF746EE51ED  lea         rsp,[r13+250h]
00007FF746EE51F4  pop         rdi
00007FF746EE51F5  pop         rbp
00007FF746EE51F6  pop         r13
00007FF746EE51F8  ret
```

This code is noticeably fast. I measured the throughput averaged over 1000 iterations, with an array of 100 million doubles (800MB) uniformly distributed between +/- 1E7, to quantify the throughput in GB/s and iterations/s. This code does between 4.5 and 5 iterations per second, which translates to processing approximately 4GB/s. This seems high, and since I am unaware of best practices in C++, if the measurement is flawed I would gratefully be educated in the comments.

```cpp
#include <chrono>
#include <iostream>

void benchmark() {
  const size_t length = 1E8;
  double* values = new double[length];
  fill_array(values, length); // populates the input; definition elided
  double* zeroed = new double[length];
  auto start = std::chrono::high_resolution_clock::now();
  int iterations = 1000;
  for (int i = 0; i < iterations; ++i) {
    zero_negatives(values, zeroed, length);
  }
  auto end = std::chrono::high_resolution_clock::now();
  auto nanos = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
  double thrpt_s = (iterations * 1E9) / nanos;
  double thrpt_gbps = (thrpt_s * sizeof(double) * length) / 1E9;
  std::cout << thrpt_s << "/s" << std::endl;
  std::cout << thrpt_gbps << "GB/s" << std::endl;
  delete[] values;
  delete[] zeroed;
}
```

While I am sure there are various ways an expert could tweak this for performance, this code can’t get much faster unless 512-bit zmm registers are available, in which case not using them would be wasteful. While the code looks virtually the same for AVX-512 (just replace “256” with “512”), portability and efficiency are at odds. Letting the JVM handle the mess of detecting the best instruction set for the deployed architecture is the main reason for using Java in performance-sensitive (but not performance-critical) applications. But this is not the code the JVM generates.

### Java Auto-Vectorisation (Play Your Cards Right)

There is currently no abstraction modelling vectorisation in Java. Vectorised code is only available where the compiler engineers have implemented an intrinsic, or via auto-vectorisation, which will try, and sometimes succeed admirably, to translate your code into a good vector implementation. There is a prototype for explicit vectorisation in Project Panama. There are a few ways to skin this cat, and it’s worth looking at the code each approach generates and the throughput it achieves.

There is a choice between copying the array and zeroing out the negatives in place, and allocating a new array and writing only the non-negative values. There is another choice between an if statement and branchless code using Math.max. This gives the following four implementations, which I measure on data comparable to the C++ benchmark (100 million doubles, normally distributed with mean zero). To be fair to the Java code, as in the C++ benchmark, the cost of allocation is isolated by writing into an array pre-allocated once per benchmark. This penalises the approaches where the array is copied first and then zeroed wherever a value is negative. The code is online at github.

```java
@Benchmark
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public double[] BranchyCopyAndMask(ArrayWithNegatives state) {
    double[] data = state.data;
    double[] result = state.target;
    System.arraycopy(data, 0, result, 0, data.length);
    for (int i = 0; i < result.length; ++i) {
        if (result[i] < 0D) {
            result[i] = 0D;
        }
    }
    return result;
}

@Benchmark
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public double[] BranchyNewArray(ArrayWithNegatives state) {
    double[] data = state.data;
    double[] result = state.target;
    for (int i = 0; i < result.length; ++i) {
        result[i] = data[i] < 0D ? 0D : data[i];
    }
    return result;
}

@Benchmark
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public double[] NewArray(ArrayWithNegatives state) {
    double[] data = state.data;
    double[] result = state.target;
    for (int i = 0; i < result.length; ++i) {
        result[i] = Math.max(data[i], 0D);
    }
    return result;
}

@Benchmark
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public double[] CopyAndMask(ArrayWithNegatives state) {
    double[] data = state.data;
    double[] result = state.target;
    System.arraycopy(data, 0, result, 0, data.length);
    for (int i = 0; i < result.length; ++i) {
        result[i] = Math.max(result[i], 0D);
    }
    return result;
}
```

None of these implementations comes close to the native code above. The best implementation performs 1.8 iterations per second which equates to processing approximately 1.4GB/s, vastly inferior to the 4GB/s achieved with Intel intrinsics. The results are below:

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit |
|---|---|---|---|---|---|---|
| BranchyCopyAndMask | thrpt | 1 | 10 | 1.314845 | 0.061662 | ops/s |
| BranchyNewArray | thrpt | 1 | 10 | 1.802673 | 0.061835 | ops/s |
| CopyAndMask | thrpt | 1 | 10 | 1.146630 | 0.018903 | ops/s |
| NewArray | thrpt | 1 | 10 | 1.357020 | 0.116481 | ops/s |
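The conversion between the two units is just multiplication by the size of the working set; assuming the same 1E8-element arrays as the C++ benchmark (800MB per iteration), the best score works out as follows:

```java
public class ThroughputConversion {
    public static void main(String[] args) {
        double gbPerIteration = 1E8 * 8 / 1E9;  // 1E8 doubles at 8 bytes each = 0.8GB
        double opsPerSecond = 1.802673;         // BranchyNewArray score from the results
        double gbps = opsPerSecond * gbPerIteration;
        System.out.printf("%.2fGB/s%n", gbps);  // prints 1.44GB/s
    }
}
```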

As an aside, there is a very interesting observation to make, worthy of its own post: if the array consists only of positive values, the “branchy” implementations run very well, at speeds comparable to zero_negatives (which was benchmarked with 50% negative values). The ratio of branch hits to misses is an orthogonal explanatory variable, and the input data, though I often don’t think about it enough, matters a great deal.
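A rough way to observe this data dependence, without a full JMH harness, is to time the branchy loop on all-positive versus roughly 50% negative input. This is only an illustrative sketch (the harness and names are mine, not part of the benchmark suite), and whether a gap actually shows up depends on whether the JIT compiles the ternary to a branch or a conditional move:

```java
import java.util.concurrent.ThreadLocalRandom;

public class BranchSensitivity {

    // Same branchy kernel as BranchyNewArray: -0.0 < 0 is false, so -0.0 is kept.
    public static double[] branchy(double[] data, double[] result) {
        for (int i = 0; i < result.length; ++i) {
            result[i] = data[i] < 0D ? 0D : data[i];
        }
        return result;
    }

    static long time(double[] data, double[] result, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; ++i) {
            branchy(data, result);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 1 << 20;
        double[] positive = new double[n];
        double[] mixed = new double[n];
        double[] target = new double[n];
        for (int i = 0; i < n; ++i) {
            positive[i] = ThreadLocalRandom.current().nextDouble();    // all positive
            mixed[i] = ThreadLocalRandom.current().nextDouble() - 0.5; // ~50% negative
        }
        time(positive, target, 50); // warm up the JIT before measuring
        time(mixed, target, 50);
        System.out.println("all positive: " + time(positive, target, 50) + "ns");
        System.out.println("50% negative: " + time(mixed, target, 50) + "ns");
    }
}
```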

I only looked at the assembly emitted for the fastest version (BranchyNewArray), and it doesn’t look anything like zero_negatives. Though it uses some vector instructions, as pointed out by Daniel Lemire in the comments, this code has probably not been vectorised at all and is probably just using SSE2 for safe floating point arithmetic (indeed, only quadwords are loaded into the 128-bit registers):

com/openkappa/simd/positive/PositiveValues.BranchyNewArray(Lcom/openkappa/simd/positive/ArrayWithNegatives;)[D  [0x000002ae309c3ce0, 0x000002ae309c3ff8]  792 bytes
Argument 0 is unknown.
RIP: 0x2ae309c3ce0 Code size: 0x00000318
[Entry Point]
[Constants]
# {method} {0x000002ae4d13ec70} 'BranchyNewArray' '(Lcom/openkappa/simd/positive/ArrayWithNegatives;)[D' in 'com/openkappa/simd/positive/PositiveValues'
0x000002ae309c3ce0: mov     r10d,dword ptr [rdx+8h]  ;...44
;...8b
;...52
;...08

0x000002ae309c3ce4: shl     r10,3h            ;...49
;...c1
;...e2
;...03

0x000002ae309c3ce8: cmp     r10,rax           ;...4c
;...3b
;...d0

0x000002ae309c3ceb: jne     2ae3042c200h      ;...0f
;...85
;...0f
;...85
;...a6
;...ff
;   {runtime_call ic_miss_stub}
0x000002ae309c3cf1: nop     word ptr [rax+rax+0h]  ;...66
;...66
;...66
;...0f
;...1f
;...84
;...00
;...00
;...00
;...00
;...00

0x000002ae309c3cfc: nop                       ;...66
;...66
;...66
;...90

[Verified Entry Point]
0x000002ae309c3d00: mov     dword ptr [rsp+0ffffffffffff9000h],eax
;...89
;...84
;...24
;...00
;...90
;...ff
;...ff

0x000002ae309c3d07: push    rbp               ;...55

0x000002ae309c3d08: sub     rsp,60h           ;...48
;...83
;...ec
;...60

0x000002ae309c3d0c: mov     rcx,2ae4d163880h  ;...48
;...b9
;...80
;...38
;...16
;...4d
;...ae
;...02
;...00
;...00
;   {metadata(method data for {method} {0x000002ae4d13ec70} 'BranchyNewArray' '(Lcom/openkappa/simd/positive/ArrayWithNegatives;)[D' in 'com/openkappa/simd/positive/PositiveValues')}
0x000002ae309c3d16: mov     esi,dword ptr [rcx+0fch]
;...8b
;...b1
;...fc
;...00
;...00
;...00

;...c6
;...08

0x000002ae309c3d1f: mov     dword ptr [rcx+0fch],esi
;...89
;...b1
;...fc
;...00
;...00
;...00

0x000002ae309c3d25: and     esi,1ff8h         ;...81
;...e6
;...f8
;...1f
;...00
;...00

0x000002ae309c3d2b: cmp     esi,0h            ;...83
;...fe
;...00

0x000002ae309c3d2e: je      2ae309c3ec1h      ;...0f
;...84
;...8d
;...01
;...00
;...00
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@0 (line 29)

0x000002ae309c3d34: mov     edx,dword ptr [r8+0ch]  ;...41
;...8b
;...50
;...0c
; implicit exception: dispatches to 0x000002ae309c3ee2
0x000002ae309c3d38: shl     rdx,3h            ;...48
;...c1
;...e2
;...03
;*getfield data {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@1 (line 29)

0x000002ae309c3d3c: mov     ecx,dword ptr [r8+10h]  ;...41
;...8b
;...48
;...10

0x000002ae309c3d40: shl     rcx,3h            ;...48
;...c1
;...e1
;...03
;*getfield target {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@6 (line 30)

0x000002ae309c3d44: mov     esi,0h            ;...be
;...00
;...00
;...00
;...00

0x000002ae309c3d49: jmp     2ae309c3e27h      ;...e9
;...d9
;...00
;...00
;...00
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@13 (line 31)

0x000002ae309c3d4e: nop                       ;...66
;...90

0x000002ae309c3d50: movsxd  rax,esi           ;...48
;...63
;...c6

0x000002ae309c3d53: cmp     esi,dword ptr [rdx+0ch]  ;...3b
;...72
;...0c
; implicit exception: dispatches to 0x000002ae309c3ee7
0x000002ae309c3d56: jnb     2ae309c3ef1h      ;...0f
;...83
;...95
;...01
;...00
;...00

0x000002ae309c3d5c: vmovsd  xmm0,qword ptr [rdx+rax*8+10h]
;...c5
;...fb
;...10
;...44
;...c2
;...10
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@26 (line 32)

0x000002ae309c3d62: vxorpd  xmm1,xmm1,xmm1    ;...c5
;...f1
;...57
;...c9

0x000002ae309c3d66: vucomisd xmm0,xmm1        ;...c5
;...f9
;...2e
;...c1

0x000002ae309c3d6a: mov     eax,1h            ;...b8
;...01
;...00
;...00
;...00

0x000002ae309c3d6f: jp      2ae309c3d88h      ;...0f
;...8a
;...13
;...00
;...00
;...00

0x000002ae309c3d75: jnbe    2ae309c3d88h      ;...0f
;...87
;...0d
;...00
;...00
;...00

0x000002ae309c3d7b: mov     eax,0h            ;...b8
;...00
;...00
;...00
;...00

0x000002ae309c3d80: je      2ae309c3d88h      ;...0f
;...84
;...02
;...00
;...00
;...00

0x000002ae309c3d86: dec     eax               ;...ff
;...c8
;*dcmpg {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@28 (line 32)

0x000002ae309c3d88: cmp     eax,0h            ;...83
;...f8
;...00

0x000002ae309c3d8b: mov     rax,2ae4d163880h  ;...48
;...b8
;...80
;...38
;...16
;...4d
;...ae
;...02
;...00
;...00
;   {metadata(method data for {method} {0x000002ae4d13ec70} 'BranchyNewArray' '(Lcom/openkappa/simd/positive/ArrayWithNegatives;)[D' in 'com/openkappa/simd/positive/PositiveValues')}
0x000002ae309c3d95: mov     rdi,158h          ;...48
;...bf
;...58
;...01
;...00
;...00
;...00
;...00
;...00
;...00

0x000002ae309c3d9f: jnl     2ae309c3dafh      ;...0f
;...8d
;...0a
;...00
;...00
;...00

0x000002ae309c3da5: mov     rdi,168h          ;...48
;...bf
;...68
;...01
;...00
;...00
;...00
;...00
;...00
;...00

0x000002ae309c3daf: mov     rbx,qword ptr [rax+rdi]  ;...48
;...8b
;...1c
;...38

0x000002ae309c3db3: lea     rbx,[rbx+1h]      ;...48
;...8d
;...5b
;...01

0x000002ae309c3db7: mov     qword ptr [rax+rdi],rbx  ;...48
;...89
;...1c
;...38

0x000002ae309c3dbb: jnl     2ae309c3dd5h      ;...0f
;...8d
;...14
;...00
;...00
;...00
;*ifge {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@29 (line 32)

0x000002ae309c3dc1: mov     rax,2ae4d163880h  ;...48
;...b8
;...80
;...38
;...16
;...4d
;...ae
;...02
;...00
;...00
;   {metadata(method data for {method} {0x000002ae4d13ec70} 'BranchyNewArray' '(Lcom/openkappa/simd/positive/ArrayWithNegatives;)[D' in 'com/openkappa/simd/positive/PositiveValues')}
0x000002ae309c3dcb: inc     dword ptr [rax+178h]  ;...ff
;...80
;...78
;...01
;...00
;...00

0x000002ae309c3dd1: vxorpd  xmm0,xmm0,xmm0    ;...c5
;...f9
;...57
;...c0
;*goto {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@33 (line 32)

0x000002ae309c3dd5: movsxd  rax,esi           ;...48
;...63
;...c6

0x000002ae309c3dd8: cmp     esi,dword ptr [rcx+0ch]  ;...3b
;...71
;...0c

0x000002ae309c3ddb: jnb     2ae309c3efah      ;...0f
;...83
;...19
;...01
;...00
;...00

0x000002ae309c3de1: vmovsd  qword ptr [rcx+rax*8+10h],xmm0
;...c5
;...fb
;...11
;...44
;...c1
;...10
;*dastore {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@40 (line 32)

0x000002ae309c3de7: inc     esi               ;...ff
;...c6

0x000002ae309c3de9: mov     rax,2ae4d163880h  ;...48
;...b8
;...80
;...38
;...16
;...4d
;...ae
;...02
;...00
;...00
;   {metadata(method data for {method} {0x000002ae4d13ec70} 'BranchyNewArray' '(Lcom/openkappa/simd/positive/ArrayWithNegatives;)[D' in 'com/openkappa/simd/positive/PositiveValues')}
0x000002ae309c3df3: mov     edi,dword ptr [rax+100h]
;...8b
;...b8
;...00
;...01
;...00
;...00

;...c7
;...08

0x000002ae309c3dfc: mov     dword ptr [rax+100h],edi
;...89
;...b8
;...00
;...01
;...00
;...00

0x000002ae309c3e02: and     edi,0fff8h        ;...81
;...e7
;...f8
;...ff
;...00
;...00

0x000002ae309c3e08: cmp     edi,0h            ;...83
;...ff
;...00

0x000002ae309c3e0b: je      2ae309c3f03h      ;...0f
;...84
;...f2
;...00
;...00
;...00
; ImmutableOopMap{rcx=Oop rdx=Oop }
;*goto {reexecute=1 rethrow=0 return_oop=0}
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@44 (line 31)

0x000002ae309c3e11: test    dword ptr [2ae24520000h],eax
;...85
;...05
;...e9
;...c1
;...b5
;...f3
;   {poll}
0x000002ae309c3e17: mov     rax,2ae4d163880h  ;...48
;...b8
;...80
;...38
;...16
;...4d
;...ae
;...02
;...00
;...00
;   {metadata(method data for {method} {0x000002ae4d13ec70} 'BranchyNewArray' '(Lcom/openkappa/simd/positive/ArrayWithNegatives;)[D' in 'com/openkappa/simd/positive/PositiveValues')}
0x000002ae309c3e21: inc     dword ptr [rax+190h]  ;...ff
;...80
;...90
;...01
;...00
;...00
;*goto {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@44 (line 31)

0x000002ae309c3e27: mov     eax,dword ptr [rcx+0ch]  ;...8b
;...41
;...0c
;*arraylength {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@16 (line 31)
; implicit exception: dispatches to 0x000002ae309c3f24
0x000002ae309c3e2a: cmp     esi,eax           ;...3b
;...f0

0x000002ae309c3e2c: mov     rax,2ae4d163880h  ;...48
;...b8
;...80
;...38
;...16
;...4d
;...ae
;...02
;...00
;...00
;   {metadata(method data for {method} {0x000002ae4d13ec70} 'BranchyNewArray' '(Lcom/openkappa/simd/positive/ArrayWithNegatives;)[D' in 'com/openkappa/simd/positive/PositiveValues')}
0x000002ae309c3e36: mov     rdi,148h          ;...48
;...bf
;...48
;...01
;...00
;...00
;...00
;...00
;...00
;...00

0x000002ae309c3e40: jl      2ae309c3e50h      ;...0f
;...8c
;...0a
;...00
;...00
;...00

0x000002ae309c3e46: mov     rdi,138h          ;...48
;...bf
;...38
;...01
;...00
;...00
;...00
;...00
;...00
;...00

0x000002ae309c3e50: mov     rbx,qword ptr [rax+rdi]  ;...48
;...8b
;...1c
;...38

0x000002ae309c3e54: lea     rbx,[rbx+1h]      ;...48
;...8d
;...5b
;...01

0x000002ae309c3e58: mov     qword ptr [rax+rdi],rbx  ;...48
;...89
;...1c
;...38

0x000002ae309c3e5c: jl      2ae309c3d50h      ;...0f
;...8c
;...ee
;...fe
;...ff
;...ff
;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@17 (line 31)

0x000002ae309c3e62: mov     rax,rcx           ;...48
;...8b
;...c1

;...83
;...c4
;...60

0x000002ae309c3e69: pop     rbp               ;...5d

0x000002ae309c3e6a: test    dword ptr [2ae24520000h],eax
;...85
;...05
;...90
;...c1
;...b5
;...f3
;   {poll_return}
0x000002ae309c3e70: ret                       ;...c3
;*areturn {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.positive.PositiveValues::BranchyNewArray@48 (line 34)

I don’t really understand, and haven’t thought about, the intent of the emitted code, but it makes extensive use of the instruction VUCOMISD for comparisons with zero, which has a lower latency but also a lower throughput than VMAXPD. It would certainly be interesting to see how Project Panama handles this. Perhaps this operation should just be made available as a fail-safe intrinsic, like Arrays.mismatch?

## 4 comments on “Zeroing Negative Values in Arrays Efficiently”

• lemire says:

Please consider benchmarking against a straight memcpy to see what the difference is.

> incidentally, is less of a shit show to access than in Java

Yeah. C, C++, Swift, and even Go, make it reasonably easy to see the compiled results. This is very unfortunate. Though I would add that it should be possible to make better tools in Java. It is just that nobody cares enough.

Note that I don’t understand your assembly output. It should be far simpler. What’s your compiler?

> 4GB/s. This seems high, and since I am unaware of best practices in C++, if the measurement is flawed, I would gratefully be educated in the comments.

I think that if you convert it to CPU cycles per value, it is really not that much for a CPU-bound context. A 64-bit value is 8 bytes. So you are running at 0.5 giga values per second. You have roughly (depending on your hardware) 2 giga cycles per second (2 GHz?). So it is roughly one value every four CPU cycles. Note that your processor can do more than one instruction per CPU cycle… not to mention SIMD instructions that can process several values per instruction.

Looking at your function, I think you should almost be able to process 4 doubles in two CPU cycles, so it is about right.

You are loading up 2 GB of data or so, but you are essentially doing sequential access, which helps. So you might get no overhead whatsoever from the RAM access. Well, you can check by using less data.

• Thanks for your insights on this. I will look at this again in the future and factor in the cost of JNR. I would also like to see what’s possible with Project Panama if it stabilises in the near future.

I am running this with Visual Studio on Windows. I plan to take a look at this with g++ on Linux soon. My processor is 2.6GHz:

```
$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
```

So at 2.6GHz maybe 4GB/s isn’t so great. It might get better with application level pagination.

• lemire says:

As for the use of vector instructions, I suspect that it might not be auto-vectorisation, but maybe just the straight application of SSE2 to do accurate and safe floating point arithmetic.

You should do the same trick with 32-bit integers and see what happens.

• That’s a very good point. Looking closely there are only ever qwords loaded into the xmm registers. I’ve updated the post to reflect this.