Sum of Squares

Streams and lambdas, especially the limited support offered for primitive types, are a fantastic addition to the Java language. They’re not supposed to be fast, but how do these features compare to a good old for loop? For a simple calculation amenable to instruction level parallelism, I compare modern and traditional implementations and observe the differences in instructions generated.

Sum of Squares

The sum of squares is the building block of a linear regression analysis so is ubiquitous in statistical computing. It is associative and therefore data-parallel. I compare four implementations: a sequential stream wrapping an array, a parallel stream wrapping an array, a generative sequential stream and a traditional for loop. The benchmark code is on github.

  @Param({"1024", "8192"})
  int size;

  private double[] data;

  public void init() { = createDoubleArray(size);

  public double SS_SequentialStream() {
    return DoubleStream.of(data)
            .map(x -> x * x)
            .reduce((x, y) -> x + y)

  public double SS_ParallelStream() {
    return DoubleStream.of(data)
            .map(x -> x * x)
            .reduce((x, y) -> x + y)

  public double SS_ForLoop() {
    double result = 0D;
    for (int i = 0; i < data.length; ++i) {
      result += data[i] * data[i];
    return result;

  public double SS_GenerativeSequentialStream() {
    return IntStream.iterate(0, i -> i < size, i -> i + 1)
            .mapToDouble(i -> data[i])
            .map(x -> x * x)
            .reduce((x, y) -> x + y)

I must admit I prefer the readability of the stream versions, but let’s see if there is a comedown after the syntactic sugar rush.

Running a Benchmark

I compare the four implementations on an array of one million doubles. I am using JDK 9.0.1, VM 9.0.1+11 on a fairly powerful laptop with 8 processors:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

Before running the benchmark we might expect the for loop and stream to have similar performance, and the parallel version to be about eight times faster (though remember that the arrays aren’t too big). The generative version is very similar to the for loop so a slow down might not be expected.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
SS_ForLoop thrpt 1 10 258351.774491 39797.567968 ops/s 1024
SS_ForLoop thrpt 1 10 29463.408428 4814.826388 ops/s 8192
SS_GenerativeSequentialStream thrpt 1 10 219699.607567 9095.569546 ops/s 1024
SS_GenerativeSequentialStream thrpt 1 10 28351.900454 828.513989 ops/s 8192
SS_ParallelStream thrpt 1 10 22827.821827 2826.577213 ops/s 1024
SS_ParallelStream thrpt 1 10 23230.623610 273.415352 ops/s 8192
SS_SequentialStream thrpt 1 10 225431.985145 9051.538442 ops/s 1024
SS_SequentialStream thrpt 1 10 29123.734157 1333.721437 ops/s 8192

The for loop and stream are similar. The parallel version is a long way behind (yes that’s right: more threads less power), but exhibits constant scaling (incidentally, a measurement like this is a good way to guess the minimum unit of work in a parallelised implementation). If the data is large it could become profitable to use it. The generative stream is surprisingly good, almost as good as the version that wraps the array, though there is a fail-safe way to slow it down: add a limit clause to the method chain (try it…).

Profiling with perfasm, it is clear that the for loop body is being vectorised, but only the loads and multiplications are done in parallel – the complicated string of SSE instructions is the reduction, which must be done in order.

<-- unrolled load -->
  0.01%    0x00000243d8969170: vmovdqu ymm1,ymmword ptr [r11+r8*8+0f0h]
  0.07%    0x00000243d896917a: vmovdqu ymm2,ymmword ptr [r11+r8*8+0d0h]
  0.75%    0x00000243d8969184: vmovdqu ymm3,ymmword ptr [r11+r8*8+0b0h]
  0.01%    0x00000243d896918e: vmovdqu ymm4,ymmword ptr [r11+r8*8+90h]
  0.02%    0x00000243d8969198: vmovdqu ymm5,ymmword ptr [r11+r8*8+70h]
  0.03%    0x00000243d896919f: vmovdqu ymm6,ymmword ptr [r11+r8*8+50h]
  0.77%    0x00000243d89691a6: vmovdqu ymm10,ymmword ptr [r11+r8*8+30h]
  0.02%    0x00000243d89691ad: vmovdqu ymm7,ymmword ptr [r11+r8*8+10h]
<-- multiplication starts -->
  0.01%    0x00000243d89691b4: vmulpd  ymm1,ymm1,ymm1
  0.02%    0x00000243d89691b8: vmovdqu ymmword ptr [rsp+28h],ymm1
  0.76%    0x00000243d89691be: vmulpd  ymm15,ymm7,ymm7
  0.00%    0x00000243d89691c2: vmulpd  ymm12,ymm2,ymm2
  0.01%    0x00000243d89691c6: vmulpd  ymm7,ymm3,ymm3
  0.02%    0x00000243d89691ca: vmulpd  ymm8,ymm4,ymm4
  0.72%    0x00000243d89691ce: vmulpd  ymm9,ymm5,ymm5
  0.00%    0x00000243d89691d2: vmulpd  ymm11,ymm6,ymm6
  0.01%    0x00000243d89691d6: vmulpd  ymm13,ymm10,ymm10
<-- multiplication ends here, scalar reduction starts -->
  0.03%    0x00000243d89691db: vaddsd  xmm0,xmm0,xmm15
  0.72%    0x00000243d89691e0: vpshufd xmm5,xmm15,0eh
  0.01%    0x00000243d89691e6: vaddsd  xmm0,xmm0,xmm5
  2.14%    0x00000243d89691ea: vextractf128 xmm6,ymm15,1h
  0.03%    0x00000243d89691f0: vaddsd  xmm0,xmm0,xmm6
  3.21%    0x00000243d89691f4: vpshufd xmm5,xmm6,0eh
  0.02%    0x00000243d89691f9: vaddsd  xmm0,xmm0,xmm5
  2.81%    0x00000243d89691fd: vaddsd  xmm0,xmm0,xmm13
  2.82%    0x00000243d8969202: vpshufd xmm5,xmm13,0eh
  0.03%    0x00000243d8969208: vaddsd  xmm0,xmm0,xmm5
  2.87%    0x00000243d896920c: vextractf128 xmm6,ymm13,1h
  0.01%    0x00000243d8969212: vaddsd  xmm0,xmm0,xmm6
  3.03%    0x00000243d8969216: vpshufd xmm5,xmm6,0eh
  0.03%    0x00000243d896921b: vaddsd  xmm0,xmm0,xmm5
  2.94%    0x00000243d896921f: vaddsd  xmm0,xmm0,xmm11
  2.70%    0x00000243d8969224: vpshufd xmm5,xmm11,0eh
  0.03%    0x00000243d896922a: vaddsd  xmm0,xmm0,xmm5
  2.98%    0x00000243d896922e: vextractf128 xmm6,ymm11,1h
  0.01%    0x00000243d8969234: vaddsd  xmm0,xmm0,xmm6
  3.11%    0x00000243d8969238: vpshufd xmm5,xmm6,0eh
  0.03%    0x00000243d896923d: vaddsd  xmm0,xmm0,xmm5
  2.95%    0x00000243d8969241: vaddsd  xmm0,xmm0,xmm9
  2.61%    0x00000243d8969246: vpshufd xmm5,xmm9,0eh
  0.02%    0x00000243d896924c: vaddsd  xmm0,xmm0,xmm5
  2.89%    0x00000243d8969250: vextractf128 xmm6,ymm9,1h
  0.04%    0x00000243d8969256: vaddsd  xmm0,xmm0,xmm6
  3.13%    0x00000243d896925a: vpshufd xmm5,xmm6,0eh
  0.01%    0x00000243d896925f: vaddsd  xmm0,xmm0,xmm5
  2.96%    0x00000243d8969263: vaddsd  xmm0,xmm0,xmm8
  2.83%    0x00000243d8969268: vpshufd xmm4,xmm8,0eh
  0.01%    0x00000243d896926e: vaddsd  xmm0,xmm0,xmm4
  3.00%    0x00000243d8969272: vextractf128 xmm10,ymm8,1h
  0.02%    0x00000243d8969278: vaddsd  xmm0,xmm0,xmm10
  3.13%    0x00000243d896927d: vpshufd xmm4,xmm10,0eh
  0.01%    0x00000243d8969283: vaddsd  xmm0,xmm0,xmm4
  3.01%    0x00000243d8969287: vaddsd  xmm0,xmm0,xmm7
  2.95%    0x00000243d896928b: vpshufd xmm1,xmm7,0eh
  0.02%    0x00000243d8969290: vaddsd  xmm0,xmm0,xmm1
  3.06%    0x00000243d8969294: vextractf128 xmm2,ymm7,1h
  0.01%    0x00000243d896929a: vaddsd  xmm0,xmm0,xmm2
  3.07%    0x00000243d896929e: vpshufd xmm1,xmm2,0eh
  0.02%    0x00000243d89692a3: vaddsd  xmm0,xmm0,xmm1
  3.07%    0x00000243d89692a7: vaddsd  xmm0,xmm0,xmm12
  2.92%    0x00000243d89692ac: vpshufd xmm3,xmm12,0eh
  0.02%    0x00000243d89692b2: vaddsd  xmm0,xmm0,xmm3
  3.11%    0x00000243d89692b6: vextractf128 xmm1,ymm12,1h
  0.01%    0x00000243d89692bc: vaddsd  xmm0,xmm0,xmm1
  3.02%    0x00000243d89692c0: vpshufd xmm3,xmm1,0eh
  0.02%    0x00000243d89692c5: vaddsd  xmm0,xmm0,xmm3
  2.97%    0x00000243d89692c9: vmovdqu ymm1,ymmword ptr [rsp+28h]
  0.02%    0x00000243d89692cf: vaddsd  xmm0,xmm0,xmm1
  3.05%    0x00000243d89692d3: vpshufd xmm2,xmm1,0eh
  0.03%    0x00000243d89692d8: vaddsd  xmm0,xmm0,xmm2
  2.97%    0x00000243d89692dc: vextractf128 xmm14,ymm1,1h
  0.01%    0x00000243d89692e2: vaddsd  xmm0,xmm0,xmm14
  2.99%    0x00000243d89692e7: vpshufd xmm2,xmm14,0eh
  0.02%    0x00000243d89692ed: vaddsd  xmm0,xmm0,xmm2 

The sequential stream code is not as good – it is scalar – but the difference in performance is not as stark as it might be because of the inefficient scalar reduction in the for loop: this is JLS floating point semantics twisting C2’s arm behind its back.

  0.00%    0x0000021a1df54c24: vmovsd  xmm0,qword ptr [rbx+r9*8+48h]
  0.00%    0x0000021a1df54c2b: vmovsd  xmm2,qword ptr [rbx+r9*8+18h]
  0.02%    0x0000021a1df54c32: vmovsd  xmm3,qword ptr [rbx+r9*8+40h]
  2.93%    0x0000021a1df54c39: vmovsd  xmm4,qword ptr [rbx+r9*8+38h]
  0.00%    0x0000021a1df54c40: vmovsd  xmm5,qword ptr [rbx+r9*8+30h]
  0.01%    0x0000021a1df54c47: vmovsd  xmm6,qword ptr [rbx+r9*8+28h]
  0.02%    0x0000021a1df54c4e: vmovsd  xmm7,qword ptr [rbx+r9*8+20h]
  2.99%    0x0000021a1df54c55: vmulsd  xmm8,xmm0,xmm0
  0.00%    0x0000021a1df54c59: vmulsd  xmm0,xmm7,xmm7
           0x0000021a1df54c5d: vmulsd  xmm6,xmm6,xmm6
  0.01%    0x0000021a1df54c61: vmulsd  xmm5,xmm5,xmm5
  2.91%    0x0000021a1df54c65: vmulsd  xmm4,xmm4,xmm4
  0.00%    0x0000021a1df54c69: vmulsd  xmm3,xmm3,xmm3
  0.00%    0x0000021a1df54c6d: vmulsd  xmm2,xmm2,xmm2
  0.02%    0x0000021a1df54c71: vaddsd  xmm1,xmm2,xmm1
  6.10%    0x0000021a1df54c75: vaddsd  xmm0,xmm0,xmm1
  5.97%    0x0000021a1df54c79: vaddsd  xmm0,xmm6,xmm0
 16.22%    0x0000021a1df54c7d: vaddsd  xmm0,xmm5,xmm0
  7.86%    0x0000021a1df54c81: vaddsd  xmm0,xmm4,xmm0
 11.16%    0x0000021a1df54c85: vaddsd  xmm1,xmm3,xmm0
 11.90%    0x0000021a1df54c89: vaddsd  xmm0,xmm8,xmm1

The same code can be seen in SS_ParallelStream. SS_GenerativeSequentialStream is much more interesting because it hasn’t been unrolled – see the interleaved control statements. It is also not vectorised.

           0x0000013c1a639c17: vmovsd  xmm0,qword ptr [rbp+r9*8+10h]
  0.01%    0x0000013c1a639c1e: vmulsd  xmm2,xmm0,xmm0    
  0.01%    0x0000013c1a639c22: test    r8d,r8d
           0x0000013c1a639c25: jne     13c1a639e09h   
           0x0000013c1a639c2b: mov     r10d,dword ptr [r12+rax*8+8h]
           0x0000013c1a639c30: cmp     r10d,0f8022d85h 
           0x0000013c1a639c37: jne     13c1a639e3bh     
  0.01%    0x0000013c1a639c3d: vaddsd  xmm2,xmm1,xmm2
  0.01%    0x0000013c1a639c41: vmovsd  qword ptr [rdi+10h],xmm2
  0.00%    0x0000013c1a639c46: movsxd  r10,r9d
           0x0000013c1a639c49: vmovsd  xmm0,qword ptr [rbp+r10*8+18h]
  0.01%    0x0000013c1a639c50: vmulsd  xmm0,xmm0,xmm0
  0.01%    0x0000013c1a639c54: mov     r10d,dword ptr [r12+rax*8+8h]
  0.00%    0x0000013c1a639c59: cmp     r10d,0f8022d85h
           0x0000013c1a639c60: jne     13c1a639e30h
           0x0000013c1a639c66: vaddsd  xmm0,xmm0,xmm2
  0.02%    0x0000013c1a639c6a: vmovsd  qword ptr [rdi+10h],xmm0
  0.02%    0x0000013c1a639c6f: mov     r10d,r9d
           0x0000013c1a639c72: add     r10d,2h 
           0x0000013c1a639c76: cmp     r10d,r11d
           0x0000013c1a639c79: jnl     13c1a639d96h 
  0.01%    0x0000013c1a639c7f: add     r9d,4h 
  0.02%    0x0000013c1a639c83: vmovsd  xmm1,qword ptr [rbp+r10*8+10h]
  0.00%    0x0000013c1a639c8a: movzx   r8d,byte ptr [rdi+0ch]
  0.00%    0x0000013c1a639c8f: vmulsd  xmm1,xmm1,xmm1
  0.01%    0x0000013c1a639c93: test    r8d,r8d
           0x0000013c1a639c96: jne     13c1a639dfbh
  0.01%    0x0000013c1a639c9c: vaddsd  xmm1,xmm0,xmm1
  0.01%    0x0000013c1a639ca0: vmovsd  qword ptr [rdi+10h],xmm1
  0.02%    0x0000013c1a639ca5: movsxd  r8,r10d
  0.00%    0x0000013c1a639ca8: vmovsd  xmm0,qword ptr [rbp+r8*8+18h]
           0x0000013c1a639caf: vmulsd  xmm0,xmm0,xmm0
           0x0000013c1a639cb3: vaddsd  xmm0,xmm0,xmm1
  0.06%    0x0000013c1a639cb7: vmovsd  qword ptr [rdi+10h],xmm0

So it looks like streams don’t vectorise like good old for loops, and you won’t gain from Stream.parallelStream unless you have humungous arrays (which you might be avoiding for other reasons). This was actually a very nice case for the Stream because optimal code can’t be generated for floating point reductions. What happens with sum of squares for ints? Generating data in an unsurprising way:

  public int SS_SequentialStream_Int() {
    return IntStream.of(intData)
            .map(x -> x * x)
            .reduce((x, y) -> x + y)

  public int SS_ParallelStream_Int() {
    return IntStream.of(intData)
            .map(x -> x * x)
            .reduce((x, y) -> x + y)

  public int SS_ForLoop_Int() {
    int result = 0;
    for (int i = 0; i < intData.length; ++i) {
      result += intData[i] * intData[i];
    return result;

  public int SS_GenerativeSequentialStream_Int() {
    return IntStream.iterate(0, i -> i < size, i -> i + 1)
            .map(i -> intData[i])
            .map(x -> x * x)
            .reduce((x, y) -> x + y)

The landscape has completely changed, thanks to the exploitation of associative arithmetic and the VPHADDD instruction which simplifies the reduction in the for loop.

<-- load -->
  0.00%    0x000001f5cdd8cd30: vmovdqu ymm0,ymmword ptr [rdi+r10*4+0f0h]
  1.93%    0x000001f5cdd8cd3a: vmovdqu ymm1,ymmword ptr [rdi+r10*4+0d0h]
  0.10%    0x000001f5cdd8cd44: vmovdqu ymm2,ymmword ptr [rdi+r10*4+0b0h]
  0.07%    0x000001f5cdd8cd4e: vmovdqu ymm3,ymmword ptr [rdi+r10*4+90h]
  0.05%    0x000001f5cdd8cd58: vmovdqu ymm4,ymmword ptr [rdi+r10*4+70h]
  1.75%    0x000001f5cdd8cd5f: vmovdqu ymm5,ymmword ptr [rdi+r10*4+50h]
  0.08%    0x000001f5cdd8cd66: vmovdqu ymm6,ymmword ptr [rdi+r10*4+30h]
  0.07%    0x000001f5cdd8cd6d: vmovdqu ymm7,ymmword ptr [rdi+r10*4+10h]
<-- multiply -->
  0.01%    0x000001f5cdd8cd74: vpmulld ymm0,ymm0,ymm0
  1.81%    0x000001f5cdd8cd79: vmovdqu ymmword ptr [rsp+28h],ymm0
  0.02%    0x000001f5cdd8cd7f: vpmulld ymm15,ymm7,ymm7
  1.79%    0x000001f5cdd8cd84: vpmulld ymm11,ymm1,ymm1
  0.06%    0x000001f5cdd8cd89: vpmulld ymm8,ymm2,ymm2
  1.82%    0x000001f5cdd8cd8e: vpmulld ymm9,ymm3,ymm3
  0.06%    0x000001f5cdd8cd93: vpmulld ymm10,ymm4,ymm4
  1.79%    0x000001f5cdd8cd98: vpmulld ymm12,ymm5,ymm5
  0.08%    0x000001f5cdd8cd9d: vpmulld ymm6,ymm6,ymm6
<-- vectorised reduce -->
  1.83%    0x000001f5cdd8cda2: vphaddd ymm4,ymm15,ymm15
  0.04%    0x000001f5cdd8cda7: vphaddd ymm4,ymm4,ymm7
  1.85%    0x000001f5cdd8cdac: vextracti128 xmm7,ymm4,1h
  0.07%    0x000001f5cdd8cdb2: vpaddd  xmm4,xmm4,xmm7
  1.78%    0x000001f5cdd8cdb6: vmovd   xmm7,r8d
  0.01%    0x000001f5cdd8cdbb: vpaddd  xmm7,xmm7,xmm4
  0.11%    0x000001f5cdd8cdbf: vmovd   r11d,xmm7
  0.05%    0x000001f5cdd8cdc4: vphaddd ymm4,ymm6,ymm6
  1.84%    0x000001f5cdd8cdc9: vphaddd ymm4,ymm4,ymm7
  5.43%    0x000001f5cdd8cdce: vextracti128 xmm7,ymm4,1h
  0.13%    0x000001f5cdd8cdd4: vpaddd  xmm4,xmm4,xmm7
  4.34%    0x000001f5cdd8cdd8: vmovd   xmm7,r11d
  0.36%    0x000001f5cdd8cddd: vpaddd  xmm7,xmm7,xmm4
  1.40%    0x000001f5cdd8cde1: vmovd   r8d,xmm7
  0.01%    0x000001f5cdd8cde6: vphaddd ymm6,ymm12,ymm12
  2.89%    0x000001f5cdd8cdeb: vphaddd ymm6,ymm6,ymm4
  3.25%    0x000001f5cdd8cdf0: vextracti128 xmm4,ymm6,1h
  0.87%    0x000001f5cdd8cdf6: vpaddd  xmm6,xmm6,xmm4
  6.36%    0x000001f5cdd8cdfa: vmovd   xmm4,r8d
  0.01%    0x000001f5cdd8cdff: vpaddd  xmm4,xmm4,xmm6
  1.69%    0x000001f5cdd8ce03: vmovd   r8d,xmm4
  0.03%    0x000001f5cdd8ce08: vphaddd ymm4,ymm10,ymm10
  1.83%    0x000001f5cdd8ce0d: vphaddd ymm4,ymm4,ymm7
  0.10%    0x000001f5cdd8ce12: vextracti128 xmm7,ymm4,1h
  3.29%    0x000001f5cdd8ce18: vpaddd  xmm4,xmm4,xmm7
  0.72%    0x000001f5cdd8ce1c: vmovd   xmm7,r8d
  0.23%    0x000001f5cdd8ce21: vpaddd  xmm7,xmm7,xmm4
  4.42%    0x000001f5cdd8ce25: vmovd   r11d,xmm7
  0.12%    0x000001f5cdd8ce2a: vphaddd ymm5,ymm9,ymm9
  1.69%    0x000001f5cdd8ce2f: vphaddd ymm5,ymm5,ymm1
  0.12%    0x000001f5cdd8ce34: vextracti128 xmm1,ymm5,1h
  3.28%    0x000001f5cdd8ce3a: vpaddd  xmm5,xmm5,xmm1
  0.22%    0x000001f5cdd8ce3e: vmovd   xmm1,r11d
  0.14%    0x000001f5cdd8ce43: vpaddd  xmm1,xmm1,xmm5
  3.81%    0x000001f5cdd8ce47: vmovd   r11d,xmm1
  0.22%    0x000001f5cdd8ce4c: vphaddd ymm0,ymm8,ymm8
  1.58%    0x000001f5cdd8ce51: vphaddd ymm0,ymm0,ymm3
  0.22%    0x000001f5cdd8ce56: vextracti128 xmm3,ymm0,1h
  2.82%    0x000001f5cdd8ce5c: vpaddd  xmm0,xmm0,xmm3
  0.36%    0x000001f5cdd8ce60: vmovd   xmm3,r11d
  0.20%    0x000001f5cdd8ce65: vpaddd  xmm3,xmm3,xmm0
  4.55%    0x000001f5cdd8ce69: vmovd   r8d,xmm3
  0.10%    0x000001f5cdd8ce6e: vphaddd ymm2,ymm11,ymm11
  1.71%    0x000001f5cdd8ce73: vphaddd ymm2,ymm2,ymm1
  0.09%    0x000001f5cdd8ce78: vextracti128 xmm1,ymm2,1h
  2.91%    0x000001f5cdd8ce7e: vpaddd  xmm2,xmm2,xmm1
  1.57%    0x000001f5cdd8ce82: vmovd   xmm1,r8d
  0.05%    0x000001f5cdd8ce87: vpaddd  xmm1,xmm1,xmm2
  4.84%    0x000001f5cdd8ce8b: vmovd   r11d,xmm1
  0.06%    0x000001f5cdd8ce90: vmovdqu ymm0,ymmword ptr [rsp+28h]
  0.03%    0x000001f5cdd8ce96: vphaddd ymm13,ymm0,ymm0
  1.83%    0x000001f5cdd8ce9b: vphaddd ymm13,ymm13,ymm14
  2.16%    0x000001f5cdd8cea0: vextracti128 xmm14,ymm13,1h
  0.14%    0x000001f5cdd8cea6: vpaddd  xmm13,xmm13,xmm14
  0.09%    0x000001f5cdd8ceab: vmovd   xmm14,r11d
  0.51%    0x000001f5cdd8ceb0: vpaddd  xmm14,xmm14,xmm13

If you’re the guy replacing all the for loops with streams because it’s 2018, you may be committing performance vandalism! That nice declarative API (as opposed to language feature) is at arms length and it really isn’t well optimised yet.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
SS_ForLoop_Int thrpt 1 10 1021725.018981 74264.883362 ops/s 1024
SS_ForLoop_Int thrpt 1 10 129250.855026 5764.608094 ops/s 8192
SS_GenerativeSequentialStream_Int thrpt 1 10 55069.227826 1111.903102 ops/s 1024
SS_GenerativeSequentialStream_Int thrpt 1 10 6769.176830 684.970867 ops/s 8192
SS_ParallelStream_Int thrpt 1 10 20970.387258 719.846643 ops/s 1024
SS_ParallelStream_Int thrpt 1 10 19621.397202 1514.374286 ops/s 8192
SS_SequentialStream_Int thrpt 1 10 586847.001223 22390.512706 ops/s 1024
SS_SequentialStream_Int thrpt 1 10 87620.959677 3437.083075 ops/s 8192

Parallel streams might not be the best thing to reach for.

Posted on

Spliterator Characteristics and Performance

The streams API has been around for a while now, and I’m a big fan of it. It allows for a clean declarative programming style, which permits various optimisations to occur, and keeps the pastafarians at bay. I also think the Stream is the perfect abstraction for data interchange across API boundaries. This is partly because a Stream is lazy, meaning you don’t need to pay for consumption until you actually need to, and partly because a Stream can only be used once and there can be no ambiguity about ownership. If you supply a Stream to an API, you must expect that it has been used and so must discard it. This almost entirely eradicates defensive copies and can mean that no intermediate data structures need ever exist. Despite my enthusiasm for this abstraction, there’s some weirdness in this API when you scratch beneath surface.

I wanted to find a way to quickly run length encode an IntStream and found it difficult to make this code as fast as I’d like it to be. The code is too slow because it’s necessary to inspect each int, even when there is enough context available to potentially apply optimisations such as skipping over ranges. It’s likely that I am experiencing the friction of treating Stream as a data interchange format, which wasn’t one of its design goals, but this led me to investigate spliterator characteristics (there is no contiguous characteristic, which could speed up RLE greatly) and their relationship with performance.

Spliterator Characteristics

Streams have spliterators, which control iteration and splitting behaviour. If you want to process a stream in parallel, it is the spliterator which dictates how to split the stream, if possible. There’s more to a spliterator than parallel execution though, and each single threaded execution can be optimised based on the characteristics bit mask. The different characteristics are as follows:

  • ORDERED promises that there is an order. For instance, trySplit is guaranteed to give a prefix of elements.
  • DISTINCT a promise that each element in the stream is unique.
  • SORTED a promise that the stream is already sorted.
  • SIZED promises the size of the stream is known. This is not true when a call to iterate generates the stream.
  • NONNULL promises that no elements in the stream are null.
  • IMMUTABLE promises the underlying data will not change.
  • CONCURRENT promises that the underlying data can be modified concurrently. Must not also be IMMUTABLE.
  • SUBSIZED promises that the sizes of splits are known, must also be SIZED.

There’s javadoc for all of these flags, which should be your point of reference, and you need to read it because you wouldn’t guess based on relative performance. For instance, IntStream.range(inclusive, exclusive) creates an RangeIntSpliterator with the characteristics ORDERED | SIZED | SUBSIZED | IMMUTABLE | NONNULL | DISTINCT | SORTED. This means that this stream has no duplicates, no nulls, is already sorted in natural order, the size is known, and it will be chunked deterministically. The data and the iteration order never change, and if we split it, we will always get the same first chunk. So these code snippets should have virtually the same performance:

    public long countRange() {
        return IntStream.range(0, size).count();

    public long countRangeDistinct() {
        return IntStream.range(0, size).distinct().count();

This is completely at odds with observations. Even though the elements are already distinct, and metadata exists to support this, requesting the distinct elements decimates performance.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
countRange thrpt 1 10 49.465729 1.804123 ops/us 262144
countRangeDistinct thrpt 1 10 0.000395 0.000002 ops/us 262144

It turns out this is because IntStream.distinct has a one-size-fits-all implementation which completely ignores the DISTINCT characteristic, and goes ahead and boxes the entire range.

    // from
    public final IntStream distinct() {
        // While functional and quick to implement, this approach is not very efficient.
        // An efficient version requires an int-specific map/set implementation.
        return boxed().distinct().mapToInt(i -> i);

There is even more observable weirdness. If we wanted to calculate the sum of the first 1000 natural numbers, these two snippets should have the same performance. Requesting what should be a redundant sort doubles the throughput.

    public long headSum() {
        return IntStream.range(0, size).limit(1000).sum();

    public long sortedHeadSum() {
        return IntStream.range(0, size).sorted().limit(1000).sum();

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
headSum thrpt 1 10 0.209763 0.002478 ops/us 262144
sortedHeadSum thrpt 1 10 0.584227 0.006004 ops/us 262144

In fact, you would have a hard time finding a relationship between Spliterator characteristics and performance, but you can see cases of characteristics driving optimisations if you look hard enough, such as in IntStream.count, where the SIZED characteristic is used.

    // see
    public <P_IN> Long evaluateSequential(PipelineHelper<Integer> helper, Spliterator<P_IN> spliterator) {
        if (StreamOpFlag.SIZED.isKnown(helper.getStreamAndOpFlags()))
            return spliterator.getExactSizeIfKnown();
        return super.evaluateSequential(helper, spliterator);

This is a measurably worthwhile optimisation, when benchmarked against the unsized spliterator created by IntStream.iterate:

    public long countIterator() {
        return IntStream.iterate(0, i -> i < size, i -> i + 1).count();

    public long countRange() {
        return IntStream.range(0, size).count();

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
countIterator thrpt 1 10 0.001198 0.001629 ops/us 262144
countRange thrpt 1 10 43.166065 4.628715 ops/us 262144

What about limit, that’s supposed to be useful for speeding up streams by limiting the amount of work done? Unfortunately not. It actually makes things potentially much worse. In SliceOps.flags, we see it will actually disable SIZED operations:

    private static int flags(long limit) {
        return StreamOpFlag.NOT_SIZED | ((limit != -1) ? StreamOpFlag.IS_SHORT_CIRCUIT : 0);

This has a significant effect on performance, as can be seen in the following benchmark:

    public long countRange() {
        return IntStream.range(0, size).count();

    public long countHalfRange() {
        return IntStream.range(0, size).limit(size / 2).count();

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
countHalfRange thrpt 1 10 0.003632 0.003363 ops/us 262144
countRange thrpt 1 10 44.859998 6.191411 ops/us 262144

It’s almost as if there were grand plans involving characteristic based optimisation, and perhaps time ran out (IntStream.distinct has a very apologetic comment) or others were better on paper than in reality. In any case, it looks like they aren’t as influential as you might expect. Given that the relationship between the characteristics which exist and performance is flaky at best, it’s unlikely that a new one would get implemented, but I think the characteristic CONTIGUOUS is missing.

Posted on