# Sum of Squares

Streams and lambdas, even with the limited support offered for primitive types, are a fantastic addition to the Java language. They were never supposed to be fast, but how do these features compare to a good old `for` loop? For a simple calculation amenable to instruction level parallelism, I compare modern and traditional implementations and observe the differences in the instructions generated.

### Sum of Squares

The sum of squares is the building block of a linear regression analysis, so it is ubiquitous in statistical computing. It is associative and therefore data-parallel. I compare four implementations: a sequential stream wrapping an array, a parallel stream wrapping an array, a generative sequential stream and a traditional `for` loop. The benchmark code is on GitHub.

```java
@Param({"1024", "8192"})
int size;

private double[] data;

@Setup(Level.Iteration)
public void init() {
    this.data = createDoubleArray(size);
}

@Benchmark
public double SS_SequentialStream() {
    return DoubleStream.of(data)
                       .map(x -> x * x)
                       .reduce((x, y) -> x + y)
                       .orElse(0D);
}

@Benchmark
public double SS_ParallelStream() {
    return DoubleStream.of(data)
                       .parallel()
                       .map(x -> x * x)
                       .reduce((x, y) -> x + y)
                       .orElse(0D);
}

@Benchmark
public double SS_ForLoop() {
    double result = 0D;
    for (int i = 0; i < data.length; ++i) {
        result += data[i] * data[i];
    }
    return result;
}

@Benchmark
public double SS_GenerativeSequentialStream() {
    return IntStream.iterate(0, i -> i < size, i -> i + 1)
                    .mapToDouble(i -> data[i])
                    .map(x -> x * x)
                    .reduce((x, y) -> x + y)
                    .orElse(0D);
}
```

I must admit I prefer the readability of the stream versions, but let’s see if there is a comedown after the syntactic sugar rush.

### Running a Benchmark

I compare the four implementations on arrays of 1024 and 8192 doubles. I am using `JDK 9.0.1, VM 9.0.1+11` on a fairly powerful laptop with four cores and eight hardware threads:

```
$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
```

Before running the benchmark we might expect the `for` loop and the sequential stream to have similar performance, and the parallel version to be up to eight times faster (though remember that the arrays aren’t very big). The generative version is so similar to the `for` loop that a slow down might not be expected.

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|--------:|--------:|------:|--------------------:|------|------------:|
| SS_ForLoop | thrpt | 1 | 10 | 258351.774491 | 39797.567968 | ops/s | 1024 |
| SS_ForLoop | thrpt | 1 | 10 | 29463.408428 | 4814.826388 | ops/s | 8192 |
| SS_GenerativeSequentialStream | thrpt | 1 | 10 | 219699.607567 | 9095.569546 | ops/s | 1024 |
| SS_GenerativeSequentialStream | thrpt | 1 | 10 | 28351.900454 | 828.513989 | ops/s | 8192 |
| SS_ParallelStream | thrpt | 1 | 10 | 22827.821827 | 2826.577213 | ops/s | 1024 |
| SS_ParallelStream | thrpt | 1 | 10 | 23230.623610 | 273.415352 | ops/s | 8192 |
| SS_SequentialStream | thrpt | 1 | 10 | 225431.985145 | 9051.538442 | ops/s | 1024 |
| SS_SequentialStream | thrpt | 1 | 10 | 29123.734157 | 1333.721437 | ops/s | 8192 |

The `for` loop and the sequential stream are similar. The parallel version is a long way behind (yes, that’s right: more threads, less power), but it exhibits constant scaling; incidentally, a measurement like this is a good way to guess the minimum unit of work in a parallelised implementation. If the data were large enough, the parallel version could become profitable. The generative stream is surprisingly good, almost as good as the version that wraps the array, though there is a fail-safe way to slow it down: add a limit clause to the method chain (try it…).
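To make the limit suggestion concrete, here is a hedged sketch (not part of the benchmark suite; the method name is mine) of the generative pipeline with a numerically redundant limit clause bolted on. The limit introduces short-circuiting and discards size metadata, so the pipeline computes the same answer more slowly.

```java
import java.util.stream.IntStream;

class LimitedGenerative {
    // same pipeline as SS_GenerativeSequentialStream, with a numerically
    // redundant limit clause: limit makes the pipeline short-circuiting and
    // clears the SIZED flag, inhibiting optimisation without changing the result
    static double sumOfSquares(double[] data) {
        return IntStream.iterate(0, i -> i < data.length, i -> i + 1)
                        .limit(data.length)
                        .mapToDouble(i -> data[i])
                        .map(x -> x * x)
                        .reduce((x, y) -> x + y)
                        .orElse(0D);
    }
}
```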

Profiling with perfasm, it is clear that the `for` loop body is being vectorised, but only the loads and multiplications are done in parallel – the complicated string of shuffles and extracts that follows is the reduction, which must be done in order.

```
<-- unrolled load -->
0.01%    0x00000243d8969170: vmovdqu ymm1,ymmword ptr [r11+r8*8+0f0h]
0.07%    0x00000243d896917a: vmovdqu ymm2,ymmword ptr [r11+r8*8+0d0h]
0.75%    0x00000243d8969184: vmovdqu ymm3,ymmword ptr [r11+r8*8+0b0h]
0.01%    0x00000243d896918e: vmovdqu ymm4,ymmword ptr [r11+r8*8+90h]
0.02%    0x00000243d8969198: vmovdqu ymm5,ymmword ptr [r11+r8*8+70h]
0.03%    0x00000243d896919f: vmovdqu ymm6,ymmword ptr [r11+r8*8+50h]
0.77%    0x00000243d89691a6: vmovdqu ymm10,ymmword ptr [r11+r8*8+30h]
0.02%    0x00000243d89691ad: vmovdqu ymm7,ymmword ptr [r11+r8*8+10h]
<-- multiplication starts -->
0.01%    0x00000243d89691b4: vmulpd  ymm1,ymm1,ymm1
0.02%    0x00000243d89691b8: vmovdqu ymmword ptr [rsp+28h],ymm1
0.76%    0x00000243d89691be: vmulpd  ymm15,ymm7,ymm7
0.00%    0x00000243d89691c2: vmulpd  ymm12,ymm2,ymm2
0.01%    0x00000243d89691c6: vmulpd  ymm7,ymm3,ymm3
0.02%    0x00000243d89691ca: vmulpd  ymm8,ymm4,ymm4
0.72%    0x00000243d89691ce: vmulpd  ymm9,ymm5,ymm5
0.00%    0x00000243d89691d2: vmulpd  ymm11,ymm6,ymm6
0.01%    0x00000243d89691d6: vmulpd  ymm13,ymm10,ymm10
<-- multiplication ends here, scalar reduction starts -->
0.72%    0x00000243d89691e0: vpshufd xmm5,xmm15,0eh
2.14%    0x00000243d89691ea: vextractf128 xmm6,ymm15,1h
3.21%    0x00000243d89691f4: vpshufd xmm5,xmm6,0eh
2.82%    0x00000243d8969202: vpshufd xmm5,xmm13,0eh
2.87%    0x00000243d896920c: vextractf128 xmm6,ymm13,1h
3.03%    0x00000243d8969216: vpshufd xmm5,xmm6,0eh
2.70%    0x00000243d8969224: vpshufd xmm5,xmm11,0eh
2.98%    0x00000243d896922e: vextractf128 xmm6,ymm11,1h
3.11%    0x00000243d8969238: vpshufd xmm5,xmm6,0eh
2.61%    0x00000243d8969246: vpshufd xmm5,xmm9,0eh
2.89%    0x00000243d8969250: vextractf128 xmm6,ymm9,1h
3.13%    0x00000243d896925a: vpshufd xmm5,xmm6,0eh
2.83%    0x00000243d8969268: vpshufd xmm4,xmm8,0eh
3.00%    0x00000243d8969272: vextractf128 xmm10,ymm8,1h
3.13%    0x00000243d896927d: vpshufd xmm4,xmm10,0eh
2.95%    0x00000243d896928b: vpshufd xmm1,xmm7,0eh
3.06%    0x00000243d8969294: vextractf128 xmm2,ymm7,1h
3.07%    0x00000243d896929e: vpshufd xmm1,xmm2,0eh
2.92%    0x00000243d89692ac: vpshufd xmm3,xmm12,0eh
3.11%    0x00000243d89692b6: vextractf128 xmm1,ymm12,1h
3.02%    0x00000243d89692c0: vpshufd xmm3,xmm1,0eh
2.97%    0x00000243d89692c9: vmovdqu ymm1,ymmword ptr [rsp+28h]
3.05%    0x00000243d89692d3: vpshufd xmm2,xmm1,0eh
2.97%    0x00000243d89692dc: vextractf128 xmm14,ymm1,1h
2.99%    0x00000243d89692e7: vpshufd xmm2,xmm14,0eh
```
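The reduction has to stay in order because floating point addition is not associative: regrouping the additions can change the answer, and the JLS requires left to right evaluation, so C2 may not reassociate the accumulator into lanes. A minimal demonstration of the non-associativity:

```java
class Associativity {
    // floating point addition is not associative, so a JIT compiler may not
    // reorder a sequential double reduction without changing the result
    static double leftToRight(double a, double b, double c) {
        return (a + b) + c;
    }

    static double reassociated(double a, double b, double c) {
        return a + (b + c);
    }
}
```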

The sequential stream code is not as good – it is scalar – but the difference in performance is not as stark as it might be because of the inefficient scalar reduction in the `for` loop: this is JLS floating point semantics twisting C2’s arm behind its back.

```
  0.00%    0x0000021a1df54c24: vmovsd  xmm0,qword ptr [rbx+r9*8+48h]
0.00%    0x0000021a1df54c2b: vmovsd  xmm2,qword ptr [rbx+r9*8+18h]
0.02%    0x0000021a1df54c32: vmovsd  xmm3,qword ptr [rbx+r9*8+40h]
2.93%    0x0000021a1df54c39: vmovsd  xmm4,qword ptr [rbx+r9*8+38h]
0.00%    0x0000021a1df54c40: vmovsd  xmm5,qword ptr [rbx+r9*8+30h]
0.01%    0x0000021a1df54c47: vmovsd  xmm6,qword ptr [rbx+r9*8+28h]
0.02%    0x0000021a1df54c4e: vmovsd  xmm7,qword ptr [rbx+r9*8+20h]
2.99%    0x0000021a1df54c55: vmulsd  xmm8,xmm0,xmm0
0.00%    0x0000021a1df54c59: vmulsd  xmm0,xmm7,xmm7
0x0000021a1df54c5d: vmulsd  xmm6,xmm6,xmm6
0.01%    0x0000021a1df54c61: vmulsd  xmm5,xmm5,xmm5
2.91%    0x0000021a1df54c65: vmulsd  xmm4,xmm4,xmm4
0.00%    0x0000021a1df54c69: vmulsd  xmm3,xmm3,xmm3
0.00%    0x0000021a1df54c6d: vmulsd  xmm2,xmm2,xmm2
```

The same code can be seen in `SS_ParallelStream`. `SS_GenerativeSequentialStream` is much more interesting because it hasn’t been unrolled – see the interleaved control statements. It is also not vectorised.

```
           0x0000013c1a639c17: vmovsd  xmm0,qword ptr [rbp+r9*8+10h]
0.01%    0x0000013c1a639c1e: vmulsd  xmm2,xmm0,xmm0
0.01%    0x0000013c1a639c22: test    r8d,r8d
0x0000013c1a639c25: jne     13c1a639e09h
0x0000013c1a639c2b: mov     r10d,dword ptr [r12+rax*8+8h]
0x0000013c1a639c30: cmp     r10d,0f8022d85h
0x0000013c1a639c37: jne     13c1a639e3bh
0.01%    0x0000013c1a639c41: vmovsd  qword ptr [rdi+10h],xmm2
0.00%    0x0000013c1a639c46: movsxd  r10,r9d
0x0000013c1a639c49: vmovsd  xmm0,qword ptr [rbp+r10*8+18h]
0.01%    0x0000013c1a639c50: vmulsd  xmm0,xmm0,xmm0
0.01%    0x0000013c1a639c54: mov     r10d,dword ptr [r12+rax*8+8h]
0.00%    0x0000013c1a639c59: cmp     r10d,0f8022d85h
0x0000013c1a639c60: jne     13c1a639e30h
0.02%    0x0000013c1a639c6a: vmovsd  qword ptr [rdi+10h],xmm0
0.02%    0x0000013c1a639c6f: mov     r10d,r9d
0x0000013c1a639c76: cmp     r10d,r11d
0x0000013c1a639c79: jnl     13c1a639d96h
0.02%    0x0000013c1a639c83: vmovsd  xmm1,qword ptr [rbp+r10*8+10h]
0.00%    0x0000013c1a639c8a: movzx   r8d,byte ptr [rdi+0ch]
0.00%    0x0000013c1a639c8f: vmulsd  xmm1,xmm1,xmm1
0.01%    0x0000013c1a639c93: test    r8d,r8d
0x0000013c1a639c96: jne     13c1a639dfbh
0.01%    0x0000013c1a639ca0: vmovsd  qword ptr [rdi+10h],xmm1
0.02%    0x0000013c1a639ca5: movsxd  r8,r10d
0.00%    0x0000013c1a639ca8: vmovsd  xmm0,qword ptr [rbp+r8*8+18h]
0x0000013c1a639caf: vmulsd  xmm0,xmm0,xmm0
0.06%    0x0000013c1a639cb7: vmovsd  qword ptr [rdi+10h],xmm0
```

So it looks like streams don’t vectorise like good old `for` loops, and you won’t gain from `Stream.parallelStream` unless you have humongous arrays (which you might be avoiding for other reasons). This was actually a favourable case for the `Stream`, because optimal code can’t be generated for floating point reductions anyway. What happens with sum of squares for `int`s? Generating data in an unsurprising way:

```java
@Benchmark
public int SS_SequentialStream_Int() {
    return IntStream.of(intData)
                    .map(x -> x * x)
                    .reduce((x, y) -> x + y)
                    .orElse(0);
}

@Benchmark
public int SS_ParallelStream_Int() {
    return IntStream.of(intData)
                    .parallel()
                    .map(x -> x * x)
                    .reduce((x, y) -> x + y)
                    .orElse(0);
}

@Benchmark
public int SS_ForLoop_Int() {
    int result = 0;
    for (int i = 0; i < intData.length; ++i) {
        result += intData[i] * intData[i];
    }
    return result;
}

@Benchmark
public int SS_GenerativeSequentialStream_Int() {
    return IntStream.iterate(0, i -> i < size, i -> i + 1)
                    .map(i -> intData[i])
                    .map(x -> x * x)
                    .reduce((x, y) -> x + y)
                    .orElse(0);
}
```

The landscape has completely changed, thanks to the exploitation of associative arithmetic and the `VPHADDD` instruction, which simplifies the reduction in the `for` loop.

```
<-- load -->
0.00%    0x000001f5cdd8cd30: vmovdqu ymm0,ymmword ptr [rdi+r10*4+0f0h]
1.93%    0x000001f5cdd8cd3a: vmovdqu ymm1,ymmword ptr [rdi+r10*4+0d0h]
0.10%    0x000001f5cdd8cd44: vmovdqu ymm2,ymmword ptr [rdi+r10*4+0b0h]
0.07%    0x000001f5cdd8cd4e: vmovdqu ymm3,ymmword ptr [rdi+r10*4+90h]
0.05%    0x000001f5cdd8cd58: vmovdqu ymm4,ymmword ptr [rdi+r10*4+70h]
1.75%    0x000001f5cdd8cd5f: vmovdqu ymm5,ymmword ptr [rdi+r10*4+50h]
0.08%    0x000001f5cdd8cd66: vmovdqu ymm6,ymmword ptr [rdi+r10*4+30h]
0.07%    0x000001f5cdd8cd6d: vmovdqu ymm7,ymmword ptr [rdi+r10*4+10h]
<-- multiply -->
0.01%    0x000001f5cdd8cd74: vpmulld ymm0,ymm0,ymm0
1.81%    0x000001f5cdd8cd79: vmovdqu ymmword ptr [rsp+28h],ymm0
0.02%    0x000001f5cdd8cd7f: vpmulld ymm15,ymm7,ymm7
1.79%    0x000001f5cdd8cd84: vpmulld ymm11,ymm1,ymm1
0.06%    0x000001f5cdd8cd89: vpmulld ymm8,ymm2,ymm2
1.82%    0x000001f5cdd8cd8e: vpmulld ymm9,ymm3,ymm3
0.06%    0x000001f5cdd8cd93: vpmulld ymm10,ymm4,ymm4
1.79%    0x000001f5cdd8cd98: vpmulld ymm12,ymm5,ymm5
0.08%    0x000001f5cdd8cd9d: vpmulld ymm6,ymm6,ymm6
<-- vectorised reduce -->
1.85%    0x000001f5cdd8cdac: vextracti128 xmm7,ymm4,1h
1.78%    0x000001f5cdd8cdb6: vmovd   xmm7,r8d
0.11%    0x000001f5cdd8cdbf: vmovd   r11d,xmm7
5.43%    0x000001f5cdd8cdce: vextracti128 xmm7,ymm4,1h
4.34%    0x000001f5cdd8cdd8: vmovd   xmm7,r11d
1.40%    0x000001f5cdd8cde1: vmovd   r8d,xmm7
3.25%    0x000001f5cdd8cdf0: vextracti128 xmm4,ymm6,1h
6.36%    0x000001f5cdd8cdfa: vmovd   xmm4,r8d
1.69%    0x000001f5cdd8ce03: vmovd   r8d,xmm4
0.10%    0x000001f5cdd8ce12: vextracti128 xmm7,ymm4,1h
0.72%    0x000001f5cdd8ce1c: vmovd   xmm7,r8d
4.42%    0x000001f5cdd8ce25: vmovd   r11d,xmm7
0.12%    0x000001f5cdd8ce34: vextracti128 xmm1,ymm5,1h
0.22%    0x000001f5cdd8ce3e: vmovd   xmm1,r11d
3.81%    0x000001f5cdd8ce47: vmovd   r11d,xmm1
0.22%    0x000001f5cdd8ce56: vextracti128 xmm3,ymm0,1h
0.36%    0x000001f5cdd8ce60: vmovd   xmm3,r11d
4.55%    0x000001f5cdd8ce69: vmovd   r8d,xmm3
0.09%    0x000001f5cdd8ce78: vextracti128 xmm1,ymm2,1h
1.57%    0x000001f5cdd8ce82: vmovd   xmm1,r8d
4.84%    0x000001f5cdd8ce8b: vmovd   r11d,xmm1
0.06%    0x000001f5cdd8ce90: vmovdqu ymm0,ymmword ptr [rsp+28h]
2.16%    0x000001f5cdd8cea0: vextracti128 xmm14,ymm13,1h
0.09%    0x000001f5cdd8ceab: vmovd   xmm14,r11d
```

If you’re the guy replacing all the `for` loops with streams because it’s 2018, you may be committing performance vandalism! That nice declarative API (an API, as opposed to a language feature) is at arm’s length from the compiler, and it really isn’t well optimised yet.

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|--------:|--------:|------:|--------------------:|------|------------:|
| SS_ForLoop_Int | thrpt | 1 | 10 | 1021725.018981 | 74264.883362 | ops/s | 1024 |
| SS_ForLoop_Int | thrpt | 1 | 10 | 129250.855026 | 5764.608094 | ops/s | 8192 |
| SS_GenerativeSequentialStream_Int | thrpt | 1 | 10 | 55069.227826 | 1111.903102 | ops/s | 1024 |
| SS_GenerativeSequentialStream_Int | thrpt | 1 | 10 | 6769.176830 | 684.970867 | ops/s | 8192 |
| SS_ParallelStream_Int | thrpt | 1 | 10 | 20970.387258 | 719.846643 | ops/s | 1024 |
| SS_ParallelStream_Int | thrpt | 1 | 10 | 19621.397202 | 1514.374286 | ops/s | 8192 |
| SS_SequentialStream_Int | thrpt | 1 | 10 | 586847.001223 | 22390.512706 | ops/s | 1024 |
| SS_SequentialStream_Int | thrpt | 1 | 10 | 87620.959677 | 3437.083075 | ops/s | 8192 |
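The asymmetry with the `double` benchmarks comes down to associativity: `int` addition wraps modulo 2^32, so any grouping of the additions produces the same bits, and C2 is free to reassociate and vectorise the reduction. A small demonstration:

```java
class IntAssociativity {
    // int addition is associative even when it overflows, because overflow
    // wraps modulo 2^32; the JIT is therefore free to reorder the reduction
    static int leftToRight(int a, int b, int c) {
        return (a + b) + c;
    }

    static int reassociated(int a, int b, int c) {
        return a + (b + c);
    }
}
```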

Parallel streams might not be the best thing to reach for.

# Spliterator Characteristics and Performance

The streams API has been around for a while now, and I’m a big fan of it. It allows for a clean declarative programming style, which permits various optimisations to occur, and keeps the pastafarians at bay. I also think the `Stream` is the perfect abstraction for data interchange across API boundaries. This is partly because a `Stream` is lazy, meaning you don’t need to pay for consumption until you actually need to, and partly because a `Stream` can only be used once and there can be no ambiguity about ownership. If you supply a `Stream` to an API, you must expect that it has been used and so must discard it. This almost entirely eradicates defensive copies and can mean that no intermediate data structures need ever exist. Despite my enthusiasm for this abstraction, there’s some weirdness in this API when you scratch beneath the surface.

I wanted to find a way to quickly run length encode an `IntStream` and found it difficult to make this code as fast as I’d like it to be. The code is too slow because it’s necessary to inspect each `int`, even when there is enough context available to potentially apply optimisations such as skipping over ranges. It’s likely that I am experiencing the friction of treating `Stream` as a data interchange format, which wasn’t one of its design goals, but this led me to investigate spliterator characteristics (there is no contiguous characteristic, which could speed up RLE greatly) and their relationship with performance.
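For concreteness, here is a minimal sketch of the kind of encoder I mean (the `encode` method and its `int[]{value, length}` output format are my own invention, not the code from the original attempt): every element must be inspected in turn, because nothing in the spliterator tells the consumer that a run of equal values is coming.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

class RunLengthEncoding {
    // encodes the stream as {value, length} pairs; with no contiguity
    // metadata, the cost is linear in the number of elements, not runs
    static List<int[]> encode(IntStream stream) {
        List<int[]> runs = new ArrayList<>();
        PrimitiveIterator.OfInt it = stream.iterator();
        if (!it.hasNext()) {
            return runs;
        }
        int value = it.nextInt();
        int length = 1;
        while (it.hasNext()) {
            int next = it.nextInt();
            if (next == value) {
                ++length;
            } else {
                runs.add(new int[] {value, length});
                value = next;
                length = 1;
            }
        }
        runs.add(new int[] {value, length});
        return runs;
    }
}
```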

### Spliterator Characteristics

Streams have spliterators, which control iteration and splitting behaviour. If you want to process a stream in parallel, it is the spliterator which dictates how to split the stream, if possible. There’s more to a spliterator than parallel execution though, and each single threaded execution can be optimised based on the characteristics bit mask. The different characteristics are as follows:

• `ORDERED` promises that there is an encounter order; for instance, `trySplit` is guaranteed to give a prefix of the elements.
• `DISTINCT` promises that each element in the stream is unique.
• `SORTED` promises that the stream is already sorted.
• `SIZED` promises that the size of the stream is known. This is not true when the stream is generated by a call to `iterate`.
• `NONNULL` promises that no elements in the stream are null.
• `IMMUTABLE` promises that the underlying data will not change.
• `CONCURRENT` promises that the underlying data can be modified concurrently; must not also be `IMMUTABLE`.
• `SUBSIZED` promises that the sizes of splits are known; must also be `SIZED`.
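The bit mask is exposed on the spliterator itself, so a consumer can check what a pipeline promises before deciding how to process it. For example, a range spliterator reports `SIZED` while a generative one does not:

```java
import java.util.Spliterator;
import java.util.stream.IntStream;

class Characteristics {
    // queries the characteristics bit mask carried by the spliterator
    static boolean isSized(Spliterator.OfInt spliterator) {
        return spliterator.hasCharacteristics(Spliterator.SIZED);
    }
}
```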

There’s javadoc for all of these flags, which should be your point of reference, and you do need to read it, because you wouldn’t guess the behaviour from relative performance alone. For instance, `IntStream.range(inclusive, exclusive)` creates a `RangeIntSpliterator` with the characteristics `ORDERED | SIZED | SUBSIZED | IMMUTABLE | NONNULL | DISTINCT | SORTED`. This means that this stream has no duplicates, no nulls, is already sorted in natural order, the size is known, and it will be chunked deterministically. The data and the iteration order never change, and if we split it, we will always get the same first chunk. So these code snippets should have virtually the same performance:

```java
@Benchmark
public long countRange() {
    return IntStream.range(0, size).count();
}

@Benchmark
public long countRangeDistinct() {
    return IntStream.range(0, size).distinct().count();
}
```

This is completely at odds with observations. Even though the elements are already distinct, and metadata exists to support this, requesting the distinct elements decimates performance.

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|--------:|--------:|------:|--------------------:|------|------------:|
| countRange | thrpt | 1 | 10 | 49.465729 | 1.804123 | ops/us | 262144 |
| countRangeDistinct | thrpt | 1 | 10 | 0.000395 | 0.000002 | ops/us | 262144 |

It turns out this is because `IntStream.distinct` has a one-size-fits-all implementation which completely ignores the `DISTINCT` characteristic, and goes ahead and boxes the entire range.

```java
// from java.util.stream.IntPipeline
@Override
public final IntStream distinct() {
    // While functional and quick to implement, this approach is not very efficient.
    // An efficient version requires an int-specific map/set implementation.
    return boxed().distinct().mapToInt(i -> i);
}
```
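Nothing stops a caller from honouring the characteristic at the call site instead. A hedged sketch of a workaround (the helper name is mine): rebuild the stream from its spliterator and only pay for `distinct()` when the `DISTINCT` promise is absent.

```java
import java.util.Spliterator;
import java.util.stream.IntStream;
import java.util.stream.StreamSupport;

class DistinctIfNecessary {
    // consult the DISTINCT characteristic before paying for boxing: if the
    // spliterator already promises distinctness, skip distinct() entirely
    static IntStream distinctIfNecessary(IntStream stream) {
        Spliterator.OfInt spliterator = stream.spliterator();
        IntStream rebuilt = StreamSupport.intStream(spliterator, false);
        return spliterator.hasCharacteristics(Spliterator.DISTINCT)
                ? rebuilt
                : rebuilt.distinct();
    }
}
```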

There is even more observable weirdness. If we wanted to calculate the sum of the first 1000 natural numbers, these two snippets should have the same performance. Requesting what should be a redundant sort doubles the throughput.

```java
@Benchmark
public int headSum() {
    return IntStream.range(0, size).limit(1000).sum();
}

@Benchmark
public int sortedHeadSum() {
    return IntStream.range(0, size).sorted().limit(1000).sum();
}
```

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|--------:|--------:|------:|--------------------:|------|------------:|
| headSum | thrpt | 1 | 10 | 0.209763 | 0.002478 | ops/us | 262144 |
| sortedHeadSum | thrpt | 1 | 10 | 0.584227 | 0.006004 | ops/us | 262144 |

In fact, you would have a hard time finding a relationship between Spliterator characteristics and performance, but you can see cases of characteristics driving optimisations if you look hard enough, such as in `IntStream.count`, where the `SIZED` characteristic is used.

```java
// see java.util.stream.ReduceOps.makeIntCounting
@Override
public <P_IN> Long evaluateSequential(PipelineHelper<Integer> helper,
                                      Spliterator<P_IN> spliterator) {
    if (StreamOpFlag.SIZED.isKnown(helper.getStreamAndOpFlags()))
        return spliterator.getExactSizeIfKnown();
    return super.evaluateSequential(helper, spliterator);
}
```
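`getExactSizeIfKnown` is the piece that makes this constant time: it returns the exact element count when the spliterator is `SIZED` and -1 otherwise, as a small check shows:

```java
import java.util.stream.IntStream;

class ExactSize {
    // returns the element count in constant time if the spliterator is
    // SIZED, and -1 when the size is unknown
    static long exactSizeOf(IntStream stream) {
        return stream.spliterator().getExactSizeIfKnown();
    }
}
```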

This is a measurably worthwhile optimisation, when benchmarked against the unsized spliterator created by `IntStream.iterate`:

```java
@Benchmark
public long countIterator() {
    return IntStream.iterate(0, i -> i < size, i -> i + 1).count();
}

@Benchmark
public long countRange() {
    return IntStream.range(0, size).count();
}
```

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|--------:|--------:|------:|--------------------:|------|------------:|
| countIterator | thrpt | 1 | 10 | 0.001198 | 0.001629 | ops/us | 262144 |
| countRange | thrpt | 1 | 10 | 43.166065 | 4.628715 | ops/us | 262144 |

What about `limit`? That’s supposed to be useful for speeding up streams by limiting the amount of work done. Unfortunately not: it can actually make things much worse. In `SliceOps.flags`, we see that it disables `SIZED` optimisations:

```java
// see java.util.stream.SliceOps
private static int flags(long limit) {
    return StreamOpFlag.NOT_SIZED | ((limit != -1) ? StreamOpFlag.IS_SHORT_CIRCUIT : 0);
}
```

This has a significant effect on performance, as can be seen in the following benchmark:

```java
@Benchmark
public long countRange() {
    return IntStream.range(0, size).count();
}

@Benchmark
public long countHalfRange() {
    return IntStream.range(0, size).limit(size / 2).count();
}
```

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|-----------|------|--------:|--------:|------:|--------------------:|------|------------:|
| countHalfRange | thrpt | 1 | 10 | 0.003632 | 0.003363 | ops/us | 262144 |
| countRange | thrpt | 1 | 10 | 44.859998 | 6.191411 | ops/us | 262144 |
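When the bound is known up front, a simple rewrite avoids the penalty: push the bound into the range rather than into the pipeline, so the size metadata survives. Both methods below (names are mine) count the same elements, but only the second pipeline stays `SIZED`:

```java
import java.util.stream.IntStream;

class RangeInsteadOfLimit {
    // limit clears the SIZED flag, so count must iterate the elements
    static long countViaLimit(int size) {
        return IntStream.range(0, size).limit(size / 2).count();
    }

    // an equivalent range keeps the SIZED flag, so count is O(1)
    static long countViaRange(int size) {
        return IntStream.range(0, size / 2).count();
    }
}
```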

It’s almost as if there were grand plans involving characteristic based optimisation, and perhaps time ran out (`IntStream.distinct` has a very apologetic comment), or some optimisations looked better on paper than in reality. In any case, the characteristics aren’t as influential as you might expect. Given that the relationship between the existing characteristics and performance is flaky at best, it’s unlikely that a new one would be implemented, but I think a `CONTIGUOUS` characteristic is missing.