Tricking Java into Adding Up Arrays Faster

There is currently what I like to think of as a “crop circle” in Java 9: you can make hash codes faster by manually unrolling some multiplications, which is tracked by an OpenJDK ticket. This one is even weirder. Imagine you have an int[] and want to compute the sum of its elements. You could do exactly that, or, supposing your values are small enough not to overflow, you can make your loop much faster by multiplying each element by 2 inside the loop, and dividing the result by 2 at the end. This is because autovectorisation, with strength reductions to shifts and additions to boot, kicks in for loops that look like a dot product, whereas summations of arrays don’t seem to be optimised at all – see this post for an analysis. Don’t believe me? Run the code at github.


    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int NaiveSum(IntData state) {
        int value = 0;
        int[] data = state.data1;
        for (int i = 0; i < data.length; ++i) {
            value += data[i];
        }
        return value;
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int CropCircle(IntData state) {
        int value = 0;
        int[] data = state.data1;
        for (int i = 0; i < data.length; ++i) {
            value += 2 * data[i];
        }
        return value / 2;
    }

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
CropCircle thrpt 1 10 29.922687 0.383028 ops/ms 100000
CropCircle thrpt 1 10 2.065812 0.120089 ops/ms 1000000
NaiveSum thrpt 1 10 26.241689 0.660850 ops/ms 100000
NaiveSum thrpt 1 10 1.868644 0.244081 ops/ms 1000000

4 comments on “Tricking Java into Adding Up Arrays Faster

    • Hi Vladimir,

      Here’s the cpuinfo:

      $ cat /proc/cpuinfo
      processor : 0
      vendor_id : GenuineIntel
      cpu family : 6
      model : 94
      model name : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
      stepping : 3
      cpu MHz : 2592.000
      cache size : 256 KB
      physical id : 0
      siblings : 8
      core id : 0
      cpu cores : 4
      apicid : 0
      initial apicid : 0
      fpu : yes
      fpu_exception : yes
      cpuid level : 22
      wp : yes
      flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
      clflush size : 64
      cache_alignment : 64
      address sizes : 39 bits physical, 48 bits virtual
      power management:

      This was run on the latest (windows) ea jdk 9 as of the published date. I can get the version number when I next have access to a computer.

      Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>