# How much Algebra does C2 Know? Part 1: Associativity

Making loops execute faster is firmly rooted in algebra, but how much does C2 know or care about? When building a highly optimised query engine, a critical concern is the quality of assembly code generated for loops. There is a lot more to JIT compilation than loop optimisation; inlining, class hierarchy analysis, escape analysis to name but a few. Moreover, everything it does has to be fast since it shares resources with the application itself; it can’t spend time unless it brings a net benefit. Being such a generalist, does C2, the JIT compiler used in server applications, know high school algebra?

Specific knowledge of maths is not always worthwhile to program execution, even when it leads to high performance gains. As a motivating example, there is no way to refer directly to the natural numbers in any programming language I have ever used. For instance, the sum of the first `n` natural numbers is `((n+1) * n)/2`, and most high school students know it. This expression is intuitively much faster to evaluate than the equivalent algorithm:

``````
int sum(int n) {
int total = 0;
for (int i = 0; i <= n; ++i) {
total += i;
}
}
``````

But would this loop rewrite be a worthwhile optimisation? The expression takes about 3.5ns to compute the sum of the first million natural numbers, whereas the loop takes 350µs, so we can conclude that C2 does not know this formula and prefers brute force. I would be aghast if time had been spent on optimisations like this: unless your application spends a lot of time adding up contiguous ranges of natural numbers, the marginal benefit is negligible. If this is what your application does most, you should do it yourself. The possibility of an optimisation doesn’t imply its viability: there needs to be a benefit when considering engineering effort, speed improvement, reliability and ubiquity. While this optimisation fails miserably on the grounds of ubiquity, there’s useful schoolboy maths that C2 does seem to know.

### Associativity and Dependencies

Each x86 instruction has a throughput – the number of cycles it takes to complete – and a latency – the number of cycles it takes before the result is available to the next instruction in a chain. These numbers are produced by processor vendors, but there are independent numbers like these from Agner Fog, which also includes more detailed definitions of terms like latency. At first, the latency number feels a bit like a scam: what use is an advertised throughput if we can’t use the result immediately? This is where pipelining comes in: independent instructions can be interleaved. If a loop operation is associative and there are no dependencies between iterations, then it can be unrolled, which enables pipelining. If a loop operation is also commutative, then out of order execution is permitted. Evidence of an unrolled loop suggests that the compiler has realised that an operation is at least associative.

To see this in action it’s necessary to find an associative loop reduction that the compiler can’t vectorise. I took an example from the RoaringBitmap library – computing the cardinality of a bitmap container – which is a perfect example to capture this behaviour, because bit counts cannot be vectorised in Java.

``````
/**
* Recomputes the cardinality of the bitmap.
*/
protected void computeCardinality() {
this.cardinality = 0;
for (int k = 0; k < this.bitmap.length; k++) {
this.cardinality += Long.bitCount(this.bitmap[k]);
}
}
``````

we can see evidence of loop unrolling and out of order execution when looking at the assembly code emitted. The `popcnt` instructions are executed on the array out of order, and do not wait for the addition to the accumulator.

```popcnt  r9,qword ptr [rbx+r13*8+10h]

movsxd  r8,r13d

popcnt  r10,qword ptr [rbx+r8*8+28h]

popcnt  r11,qword ptr [rbx+r8*8+18h]

popcnt  rdx,qword ptr [rbx+r8*8+20h]

movsxd  r8,r9d

movsxd  r9,edx
```

To generate this assembly code you can run the project at github with the arguments

```--include .*popcnt.*
--print-assembly
```

The compiler does a very good job in this case: you can try unrolling the loop yourself, but you can only match performance if you guess the loop stride correctly. It’s impossible to prove a negative proposition, but it’s likely you’ll only make it worse if you try. C2 graduates with flying colours here: it definitely understands associativity and dependence.

The catch with pipelining is that an instruction must always wait for its operands. While the operation is associative, there is no way to reorder the code below.

``````
private int[] prefixSum(int[] data) {
int[] result = new int[data.length];
for (int i = 1; i < result.length; ++i) {
result[i] = result[i - 1] + data[i];
}
return result;
}
``````

What happens with a prefix sum? There’s no unrolling: you can see the loop control statements have not been removed (look for commands like cmp ebx, inc ebx). The loop is also executed in order because it is sequentially dependent.

```com/openkappa/simd/prefix/PrefixSum.impl([I)[I  [0x000001c21215bb20, 0x000001c21215bd78]  600 bytes
Argument 0 is unknown.RIP: 0x1c21215bb20 Code size: 0x00000258
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x000001c2244379d0} 'impl' '([I)[I' in 'com/openkappa/simd/prefix/PrefixSum'
0x000001c21215bb20: int3                      ;...cc

0x000001c21215bb21: nop     word ptr [rax+rax+0h]  ;...66
;...66
;...66
;...0f
;...1f
;...84
;...00
;...00
;...00
;...00
;...00

0x000001c21215bb2c: nop                       ;...66
;...66
;...66
;...90

0x000001c21215bb30: mov     dword ptr [rsp+0ffffffffffff9000h],eax
;...89
;...84
;...24
;...00
;...90
;...ff
;...ff

0x000001c21215bb37: push    rbp               ;...55

0x000001c21215bb38: sub     rsp,60h           ;...48
;...83
;...ec
;...60

0x000001c21215bb3c: mov     rbp,qword ptr [rdx+10h]  ;...48
;...8b
;...6a
;...10

0x000001c21215bb40: mov     r13,qword ptr [rdx+8h]  ;...4c
;...8b
;...6a
;...08

0x000001c21215bb44: mov     ebx,dword ptr [rdx]  ;...8b
;...1a

0x000001c21215bb46: mov     rcx,rdx           ;...48
;...8b
;...ca

0x000001c21215bb49: mov     r10,51da8d20h     ;...49
;...ba
;...20
;...8d
;...da
;...51
;...00
;...00
;...00
;...00

0x000001c21215bb53: call indirect r10         ;...41
;...ff
;...d2

0x000001c21215bb56: test    rbp,rbp           ;...48
;...85
;...ed

0x000001c21215bb59: je      1c21215bcb9h      ;...0f
;...84
;...5a
;...01
;...00
;...00

0x000001c21215bb5f: mov     r10d,dword ptr [rbp+8h]  ;...44
;...8b
;...55
;...08

0x000001c21215bb63: cmp     r10d,0f800016dh   ;...41
;...81
;...fa
;...6d
;...01
;...00
;...f8
0x000001c21215bb6a: jne     1c21215bd1dh      ;...0f
;...85
;...01
;...00
;...00

0x000001c21215bb70: mov     r8,rbp            ;...4c
;...8b
;...c5

0x000001c21215bb73: mov     r10d,dword ptr [r13+8h]  ;...45
;...8b
;...55
;...08
; implicit exception: dispatches to 0x000001c21215bd41
0x000001c21215bb77: cmp     r10d,0f800016dh   ;...41
;...81
;...fa
;...6d
;...01
;...00
;...f8
0x000001c21215bb7e: jne     1c21215bd1dh      ;...0f
;...85
;...99
;...01
;...00
;...00
; - com.openkappa.simd.prefix.PrefixSum::impl@7 (line 21)

0x000001c21215bb84: mov     edi,dword ptr [r13+0ch]  ;...41
;...8b
;...7d
;...0c
;*arraylength {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@9 (line 21)

0x000001c21215bb88: cmp     ebx,edi           ;...3b
;...df

0x000001c21215bb8a: jnl     1c21215bcaah      ;...0f
;...8d
;...1a
;...01
;...00
;...00
;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@10 (line 21)

0x000001c21215bb90: mov     r11d,ebx          ;...44
;...8b
;...db

0x000001c21215bb93: inc     r11d              ;...41
;...ff
;...c3

0x000001c21215bb96: xor     r9d,r9d           ;...45
;...33
;...c9

0x000001c21215bb99: mov     ecx,1h            ;...b9
;...01
;...00
;...00
;...00

0x000001c21215bb9e: cmp     r11d,ecx          ;...44
;...3b
;...d9

0x000001c21215bba1: cmovl   r11d,ecx          ;...44
;...0f
;...4c
;...d9

0x000001c21215bba5: cmp     r11d,edi          ;...44
;...3b
;...df

0x000001c21215bba8: cmovnle r11d,edi          ;...44
;...0f
;...4f
;...df

0x000001c21215bbac: cmp     r11d,r9d          ;...45
;...3b
;...d9

0x000001c21215bbaf: cmovl   r11d,r9d          ;...45
;...0f
;...4c
;...d9

0x000001c21215bbb3: cmp     r11d,edi          ;...44
;...3b
;...df

0x000001c21215bbb6: cmovnle r11d,edi          ;...44
;...0f
;...4f
;...df

0x000001c21215bbba: nop                       ;...66
;...90
;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@24 (line 22)

0x000001c21215bbbc: mov     ebp,ebx           ;...8b
;...eb

0x000001c21215bbbe: dec     ebp               ;...ff
;...cd
;*isub {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@18 (line 22)

0x000001c21215bbc0: cmp     ebp,edi           ;...3b
;...ef

0x000001c21215bbc2: jnb     1c21215bcc1h      ;...0f
;...83
;...f9
;...00
;...00
;...00

0x000001c21215bbc8: mov     r9d,dword ptr [r8+0ch]  ;...45
;...8b
;...48
;...0c
; - com.openkappa.simd.prefix.PrefixSum::impl@22 (line 22)
; implicit exception: dispatches to 0x000001c21215bd31
0x000001c21215bbcc: mov     ebp,dword ptr [r13+rbx*4+0ch]
;...41
;...8b
;...6c
;...9d
;...0c
; - com.openkappa.simd.prefix.PrefixSum::impl@19 (line 22)

0x000001c21215bbd1: cmp     ebx,r9d           ;...41
;...3b
;...d9

0x000001c21215bbd4: jnb     1c21215bce1h      ;...0f
;...83
;...07
;...01
;...00
;...00

;...41
;...03
;...6c
;...98
;...10
; - com.openkappa.simd.prefix.PrefixSum::impl@23 (line 22)

0x000001c21215bbdf: cmp     ebx,edi           ;...3b
;...df

0x000001c21215bbe1: jnb     1c21215bd01h      ;...0f
;...83
;...1a
;...01
;...00
;...00

0x000001c21215bbe7: mov     dword ptr [r13+rbx*4+10h],ebp
;...41
;...89
;...6c
;...9d
;...10
;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@24 (line 22)

0x000001c21215bbec: inc     ebx               ;...ff
;...c3
;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@25 (line 21)

0x000001c21215bbee: cmp     ebx,r11d          ;...41
;...3b
;...db

0x000001c21215bbf1: jl      1c21215bbbch      ;...7c
;...c9
;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@10 (line 21)

0x000001c21215bbf3: cmp     edi,r9d           ;...41
;...3b
;...f9

0x000001c21215bbf6: mov     r11d,edi          ;...44
;...8b
;...df

0x000001c21215bbf9: cmovnle r11d,r9d          ;...45
;...0f
;...4f
;...d9

0x000001c21215bbfd: mov     r10d,r11d         ;...45
;...8b
;...d3

;...83
;...c2
;...f9

0x000001c21215bc04: mov     ecx,80000000h     ;...b9
;...00
;...00
;...00
;...80

0x000001c21215bc09: cmp     r11d,r10d         ;...45
;...3b
;...da

0x000001c21215bc0c: cmovl   r10d,ecx          ;...44
;...0f
;...4c
;...d1

0x000001c21215bc10: cmp     ebx,r10d          ;...41
;...3b
;...da

0x000001c21215bc13: jnl     1c21215bc80h      ;...7d
;...6b

0x000001c21215bc15: nop     word ptr [rax+rax+0h]  ;...66
;...66
;...66
;...0f
;...1f
;...84
;...00
;...00
;...00
;...00
;...00
;*isub {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@18 (line 22)

0x000001c21215bc20: mov     ecx,dword ptr [r8+rbx*4+10h]
;...41
;...8b
;...4c
;...98
;...10

;...41
;...03
;...4c
;...9d
;...0c
; - com.openkappa.simd.prefix.PrefixSum::impl@23 (line 22)

0x000001c21215bc2a: mov     dword ptr [r13+rbx*4+10h],ecx
;...41
;...89
;...4c
;...9d
;...10
;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@24 (line 22)

0x000001c21215bc2f: movsxd  r11,ebx           ;...4c
;...63
;...db

;...43
;...03
;...4c
;...98
;...14
; - com.openkappa.simd.prefix.PrefixSum::impl@23 (line 22)

0x000001c21215bc37: mov     dword ptr [r13+r11*4+14h],ecx
;...43
;...89
;...4c
;...9d
;...14
;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@24 (line 22)

;...43
;...03
;...4c
;...98
;...18
; - com.openkappa.simd.prefix.PrefixSum::impl@23 (line 22)

0x000001c21215bc41: mov     dword ptr [r13+r11*4+18h],ecx
;...43
;...89
;...4c
;...9d
;...18
;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@24 (line 22)

;...43
;...03
;...4c
;...98
;...1c
; - com.openkappa.simd.prefix.PrefixSum::impl@23 (line 22)

0x000001c21215bc4b: mov     dword ptr [r13+r11*4+1ch],ecx
;...43
;...89
;...4c
;...9d
;...1c
;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@24 (line 22)

;...43
;...03
;...4c
;...98
;...20
; - com.openkappa.simd.prefix.PrefixSum::impl@23 (line 22)

0x000001c21215bc55: mov     dword ptr [r13+r11*4+20h],ecx
;...43
;...89
;...4c
;...9d
;...20
;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@24 (line 22)

;...43
;...03
;...4c
;...98
;...24
; - com.openkappa.simd.prefix.PrefixSum::impl@23 (line 22)

0x000001c21215bc5f: mov     dword ptr [r13+r11*4+24h],ecx
;...43
;...89
;...4c
;...9d
;...24
;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@24 (line 22)

;...43
;...03
;...4c
;...98
;...28
; - com.openkappa.simd.prefix.PrefixSum::impl@23 (line 22)

0x000001c21215bc69: mov     dword ptr [r13+r11*4+28h],ecx
;...43
;...89
;...4c
;...9d
;...28

;...43
;...03
;...4c
;...98
;...2c

0x000001c21215bc73: mov     dword ptr [r13+r11*4+2ch],ecx
;...43
;...89
;...4c
;...9d
;...2c
;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@24 (line 22)

;...c3
;...08
;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@25 (line 21)

0x000001c21215bc7b: cmp     ebx,r10d          ;...41
;...3b
;...da

0x000001c21215bc7e: jl      1c21215bc20h      ;...7c
;...a0
;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@10 (line 21)

0x000001c21215bc80: cmp     ebx,edi           ;...3b
;...df

0x000001c21215bc82: jnl     1c21215bcaah      ;...7d
;...26
;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@24 (line 22)

0x000001c21215bc84: mov     ebp,ebx           ;...8b
;...eb

0x000001c21215bc86: dec     ebp               ;...ff
;...cd
;*isub {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@18 (line 22)

0x000001c21215bc88: cmp     ebp,edi           ;...3b
;...ef

0x000001c21215bc8a: jnb     1c21215bcc1h      ;...73
;...35

0x000001c21215bc8c: mov     ebp,dword ptr [r13+rbx*4+0ch]
;...41
;...8b
;...6c
;...9d
;...0c
; - com.openkappa.simd.prefix.PrefixSum::impl@19 (line 22)

0x000001c21215bc91: cmp     ebx,r9d           ;...41
;...3b
;...d9

0x000001c21215bc94: jnb     1c21215bce1h      ;...73
;...4b

;...41
;...03
;...6c
;...98
;...10
; - com.openkappa.simd.prefix.PrefixSum::impl@23 (line 22)

0x000001c21215bc9b: cmp     ebx,edi           ;...3b
;...df

0x000001c21215bc9d: jnb     1c21215bd01h      ;...73
;...62

0x000001c21215bc9f: mov     dword ptr [r13+rbx*4+10h],ebp
;...41
;...89
;...6c
;...9d
;...10
;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@24 (line 22)

0x000001c21215bca4: inc     ebx               ;...ff
;...c3
;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@25 (line 21)

0x000001c21215bca6: cmp     ebx,edi           ;...3b
;...df

0x000001c21215bca8: jl      1c21215bc84h      ;...7c
;...da
;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.simd.prefix.PrefixSum::impl@10 (line 21)

0x000001c21215bcaa: mov     rax,r13           ;...49
;...8b
;...c5

;...83
;...c4
;...60

0x000001c21215bcb1: pop     rbp               ;...5d

0x000001c21215bcb2: test    dword ptr [1c27b720000h],eax
;...85
;...05
;...48
;...43
;...5c
;...69
;   {poll_return}
0x000001c21215bcb8: ret                       ;...c3
```

Does this harm performance? `add` takes 0.33 cycles, whereas `popcnt` takes 1 cycle per instruction. Shouldn’t a prefix sum be faster to calculate than a population count, on the same length of array and same width of integer? They can be compared head to head (implementing prefix sum for `long[]` to keep word width constant)

```--include .*prefix.PrefixSum.PrefixSumLong|.*popcnt.PopCount.PopCount\$
```

Far from having 3x throughput, the prefix sum is much worse. This is entirely because there is no loop unrolling and no pipelining. When possible, C2 applies aggressive unrolling optimisations unavailable to the programmer. For vectorisable operations (requiring linear independence and countability), loop unrolling further marks the loop as a candidate for auto-vectorisation.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
PopCount thrpt 1 10 9.174499 0.394487 ops/ms 100000
PopCount thrpt 1 10 1.217521 0.513734 ops/ms 1000000
PrefixSumLong thrpt 1 10 6.807279 0.925282 ops/ms 100000
PrefixSumLong thrpt 1 10 0.443974 0.053544 ops/ms 1000000

If the dependencies need to fetch data from RAM the latency can be much higher than loading from registers or from prefetched cache. Even when fetching from RAM, the worst case scenario, during this delay independent instructions can complete, unless they have a false dependency.