Garbage Collector Code Artifacts: Card Marking

In the JVM, a lot of evidence of garbage collection mechanics can be seen in JIT compiler output. This is less surprising if you think of garbage collection as a book-keeping task: the various auxiliary data structures used to track inter-region or inter-generational references, relied upon for faster marking, need to be kept up to date somehow. This maintenance isn't something you control in application code: it must be instrumented into the compiled code. If you profile your application's disassembly, you can find artifacts of the various garbage collectors, and these snippets of code can help you understand the throughput tradeoffs of each collector. You can also just read the documentation.

A simple benchmark to illustrate this compares storing a primitive int with storing a boxed Integer. It may not be surprising that the two classes below can be JIT compiled very differently, and that the difference depends on the selected garbage collector.

public class IntAcceptor {
  private int value;

  public void setValue(int value) {
    this.value = value;
  }
}

public class IntegerAcceptor {
  private Integer value;

  public void setValue(Integer value) {
    this.value = value;
  }
}

For instance, the simplest garbage collector, used mostly by specialist applications betting against garbage collection actually happening, is enabled by -XX:+UseSerialGC. If you benchmark throughput for storing these values, you will observe that storing ints is cheaper than storing Integers.

It's difficult to measure this accurately because there are numerous pitfalls. If you allocate a new Integer for each store, you conflate your measurement with allocation and bias it towards the primitive store. If you pre-allocate an Integer[] you can make the measured workload more cache-friendly from a GC book-keeping point of view, which reduces the reference store cost; in a multithreaded context, this same property can create bias in the opposite direction. JMH can't prevent any of these biases. Be skeptical about the accuracy or generality of the numbers here (there are a large number of unexplored dimensions to control, as I have alluded to), but you would hardly notice any difference in a single threaded benchmark storing the same boxed integer repeatedly.

Benchmark                            Mode   Threads  Samples  Score       Score Error (99.9%)  Unit
SerialGCStoreBenchmark.storeInt      thrpt  1        20       395.370723  10.092432            ops/us
SerialGCStoreBenchmark.storeInteger  thrpt  1        20       277.329797  18.036629            ops/us
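Stripped of its JMH scaffolding, the shape of the measured workload looks roughly like this (the class and method names here are mine, not the actual benchmark code): the boxed value is allocated once, outside the loop, to avoid the allocation pitfall described above.

```java
// A sketch of the store workload: same value stored repeatedly, with the
// Integer pre-allocated so the reference store isn't conflated with allocation.
public class StoreSketch {
    static class IntAcceptor { private int value; void setValue(int v) { value = v; } int get() { return value; } }
    static class IntegerAcceptor { private Integer value; void setValue(Integer v) { value = v; } Integer get() { return value; } }

    public static void main(String[] args) {
        IntAcceptor prim = new IntAcceptor();
        IntegerAcceptor boxed = new IntegerAcceptor();
        Integer preallocated = 42; // allocated once, before the measured loop
        for (int i = 0; i < 1_000_000; i++) {
            prim.setValue(42);            // primitive store: no write barrier
            boxed.setValue(preallocated); // reference store: the JIT emits a GC write barrier here
        }
        System.out.println(prim.get() + " " + boxed.get());
    }
}
```

In a real measurement you would use JMH, as the benchmarks here do; this sketch only shows the shape of the two stores being compared.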

You may see a large difference in a multithreaded benchmark, with an Integer instance per thread.

Benchmark                            Mode   Threads  Samples  Score        Score Error (99.9%)  Unit
SerialGCStoreBenchmark.storeInt      thrpt  4        20       1467.401084  5.917960             ops/us
SerialGCStoreBenchmark.storeInteger  thrpt  4        20       793.880064   459.304449           ops/us

The throughput of storeInteger has a large error term; here are the per-iteration figures:

Iteration   1: 1176.474 ops/us
Iteration   2: 85.966 ops/us
Iteration   3: 1180.612 ops/us
Iteration   4: 90.930 ops/us
Iteration   5: 1180.955 ops/us
Iteration   6: 1181.966 ops/us
Iteration   7: 88.801 ops/us
Iteration   8: 1180.723 ops/us
Iteration   9: 1177.895 ops/us
Iteration  10: 1138.446 ops/us
Iteration  11: 1177.302 ops/us
Iteration  12: 91.551 ops/us
Iteration  13: 1144.591 ops/us
Iteration  14: 102.143 ops/us
Iteration  15: 1179.683 ops/us
Iteration  16: 1184.222 ops/us
Iteration  17: 85.365 ops/us
Iteration  18: 1183.874 ops/us
Iteration  19: 95.979 ops/us
Iteration  20: 1150.123 ops/us

This is bimodal, varying from iteration to iteration between almost as fast as the primitive store and an order of magnitude slower, with nothing in between. If you compare the disassembly for loops setting distinct values, as in my simplistic benchmark, you will see the assembly is virtually identical, but you'll notice these instructions for the reference stores:


  0.98%    1.12%  │  0x00007f54a96462ee: shr    $0x9,%r10
  2.22%    2.17%  │  0x00007f54a96462f2: movabs $0x7f54c1bc5000,%r11
  2.30%    2.69%  │  0x00007f54a96462fc: mov    %r12b,(%r11,%r10,1) 

This code performs card marking, which tracks references between different sections of the heap at a coarse granularity. The byte array is the card table, and each of its bytes covers a 512 byte card of the heap. The right shift divides the address of the object being written into by 512 to find the card it resides in, and the byte at that index, offset from the card table's base address, is dirtied. In other words, the card containing the mutated object is flagged as potentially holding an outgoing reference. This is what you would see if you squinted at the heap: the card table is a coarse approximation of the object graph which allows false positives (a dirty card may no longer contain a live outgoing reference) but no false negatives.

The card table is a single array shared between all threads, so these unconditional writes can induce false sharing when objects residing in nearby cards are written to from different threads: the dirtying stores hit the same cache line of the card table. Conditional marking avoids this, because the target card is often already dirty and checking first skips the redundant store. The bimodal behaviour is caused by unlucky combinations of addresses resulting in false sharing of the card table, and it doesn't even happen all the time. With -XX:+UseCondCardMark set, the difference gets much smaller, the noise disappears, and conditional marking logic can be seen in the disassembly.

Benchmark                            Mode   Threads  Samples  Score        Score Error (99.9%)  Unit
SerialGCStoreBenchmark.storeInt      thrpt  4        20       1467.464828  12.866720            ops/us
SerialGCStoreBenchmark.storeInteger  thrpt  4        20       1114.612419  6.960193             ops/us


                  ╭││  0x00007f003164b9e4: je     0x00007f003164ba04 
                  │││                                                
                  │││                                               
  0.01%    0.00%  │││  0x00007f003164b9e6: mov    %r10,%r8
  4.92%    3.54%  │││  0x00007f003164b9e9: shr    $0x9,%r8
  0.01%    0.00%  │││  0x00007f003164b9ed: movabs $0x7f0048492000,%r9
  3.48%    2.12%  │││  0x00007f003164b9f7: add    %r8,%r9
  0.02%    0.01%  │││  0x00007f003164b9fa: movsbl (%r9),%ecx
  6.51%    6.53%  │││  0x00007f003164b9fe: test   %ecx,%ecx
  1.71%    1.85%  │╰│  0x00007f003164ba00: jne    0x00007f003164b994
                  │ │                                               
                  │ │                                               
                  │ │                                               
  4.76%    5.29%  │ ╰  0x00007f003164ba02: jmp    0x00007f003164b9a0
                  ↘    0x00007f003164ba04: mov    $0xfffffff6,%esi
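To make the mechanism concrete, here is a plain Java sketch of the two barrier flavours. The 512 byte card size and the zero dirty value mirror HotSpot's card table as far as I know; the class is mine and the "addresses" are just longs, not real heap addresses.

```java
import java.util.Arrays;

// A sketch of the card marking barriers seen in the disassembly, in plain Java.
public class CardTableSketch {
    static final int CARD_SHIFT = 9;     // log2(512): each byte covers one 512 byte card
    static final byte DIRTY = 0;         // HotSpot dirties a card by writing a zero byte
    static final byte CLEAN = (byte) -1;

    final byte[] cardTable;

    CardTableSketch(long heapBytes) {
        cardTable = new byte[(int) (heapBytes >>> CARD_SHIFT)];
        Arrays.fill(cardTable, CLEAN);
    }

    // unconditional barrier: shift by 9, then an unconditional byte store
    void markCard(long holderAddress) {
        cardTable[(int) (holderAddress >>> CARD_SHIFT)] = DIRTY;
    }

    // conditional barrier (-XX:+UseCondCardMark): load, test, and only store
    // if the card isn't already dirty, avoiding false sharing on the table
    void condMarkCard(long holderAddress) {
        int card = (int) (holderAddress >>> CARD_SHIFT);
        if (cardTable[card] != DIRTY) {
            cardTable[card] = DIRTY;
        }
    }

    public static void main(String[] args) {
        CardTableSketch ct = new CardTableSketch(1 << 20);
        ct.markCard(4096 + 100);                      // a store into an object at this address
        System.out.println(ct.cardTable[8] == DIRTY); // (4096 + 100) >>> 9 == 8
        ct.condMarkCard(4096 + 200);                  // same card: the store is skipped
        System.out.println(ct.cardTable[9] == DIRTY); // adjacent card remains clean
    }
}
```

The load-test-branch in condMarkCard corresponds to the movsbl/test/jne sequence in the disassembly above.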

I intended to provoke this behaviour, but what if I had been measuring something else and hadn’t ensured conditional marking was enabled?

Card marking is common in older garbage collectors because it has low overhead, particularly with conditional marking, but different collectors intercept stores differently, and you can reverse engineer them all without reading the source code. In fact, Nitsan Wakart has written a great post all about store barriers.

The point of this post is that you can detect garbage collector mechanisms with benchmarks; you just need to write them to provoke the actions you think a garbage collector should take, and look for crop circles in the disassembly. However, this assumes you have some kind of mental model of a garbage collector to start with! The new ones are getting very creative, and you might not be able to guess what they do. In principle, garbage collector implementations could modify any application code, so these artifacts could be anywhere.

The Much Aligned Garbage Collector

A power of two is often a good choice for the size of an array. Sometimes you might see this exploited to replace an integer division with a bitwise AND. You can see why with a toy benchmark of a bloom filter, which deliberately folds in a representative cost for the hash function and array access, to highlight how significant the cost of the division mechanism can be even in a method that does real work:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class BloomFilter {

  private long[] bitset;

  @Param({"1000", "1024"})
  int size;

  @Setup(Level.Trial)
  public void init() {
    bitset = DataUtil.createLongArray(size);
  }

  @Benchmark
  public boolean containsAnd() {
    int hash = hash();
    int pos = hash & (size - 1);
    return (bitset[pos >>> 6] & (1L << pos)) != 0;
  }

  @Benchmark
  public boolean containsAbsMod() {
    int hash = hash();
    int pos = Math.abs(hash % size);
    return (bitset[pos >>> 6] & (1L << pos)) != 0;
  }

  private int hash() {
    return ThreadLocalRandom.current().nextInt(); // a stand-in for a hash function
  }
}

Benchmark       Mode   Threads  Samples  Score       Score Error (99.9%)  Unit    Param: size
containsAbsMod  thrpt  1        10       104.063744  4.068283             ops/us  1000
containsAbsMod  thrpt  1        10       103.849577  4.991040             ops/us  1024
containsAnd     thrpt  1        10       161.917397  3.807912             ops/us  1024
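Why the mask trick demands a power-of-two size can be checked directly; this quick snippet (class name mine) confirms that masking agrees with a proper modulus for every hash only when the size is a power of two:

```java
import java.util.Random;

public class MaskVsMod {
    public static void main(String[] args) {
        Random r = new Random(42);
        boolean agrees = true;
        for (int i = 0; i < 1_000_000; i++) {
            int hash = r.nextInt(); // negative values included
            // for a power-of-two size, masking is exactly a non-negative modulus
            agrees &= (hash & 1023) == Math.floorMod(hash, 1024);
        }
        System.out.println(agrees);      // true
        // for any other size the mask computes something else entirely
        System.out.println(1000 & 999);  // 992, whereas 1000 % 1000 == 0
    }
}
```

The benchmark's Math.abs(hash % size) maps negative hashes to different positions than the mask does, but both stay in bounds, which is all that matters for measuring throughput.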

Disregarding the case which produces an incorrect result, you can do over 50% more lookups in the same period of time just by using a 1024 element bloom filter. Note that the compiler won't magically transform the AbsMod 1024 case for you; you need to do this yourself, and you can see this property exploited in any open source bit set, hash set, or bloom filter you care to look at. This is boring; at least we often get this right by accident. What is quite interesting is a multiplicative decrease in the throughput of DAXPY as a result of the same choice of lengths:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

// createDoubleArray is statically imported from the author's DataUtil helper

@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class DAXPYAlignment {

  @Param({"250", "256", "1000", "1024"})
  int size;

  double s;
  double[] a;
  double[] b;

  @Setup(Level.Trial)
  public void init() {
    s = ThreadLocalRandom.current().nextDouble();
    a = createDoubleArray(size);
    b = createDoubleArray(size);
  }

  @Benchmark
  public void daxpy(Blackhole bh) {
    for (int i = 0; i < a.length; ++i) {
      a[i] += s * b[i];
    }
    bh.consume(a);
  }
}

Benchmark  Mode   Threads  Samples  Score      Score Error (99.9%)  Unit    Param: size
daxpy      thrpt  1        10       23.499857  0.891309             ops/us  250
daxpy      thrpt  1        10       22.425412  0.989512             ops/us  256
daxpy      thrpt  1        10       2.420674   0.098991             ops/us  1000
daxpy      thrpt  1        10       6.263005   0.175048             ops/us  1024

1000 and 1024 are somehow very different, yet 250 and 256 are almost equivalent. This will either shock you, or you will guess that my page size is 4KB. My page size is indeed 4KB (a power of two, precisely because of the prohibitive cost of integer division). The 250 element array needs only 2000B of contiguous space, so, assuming the first array starts at the start of a page, the next array allocated will reside in the same page. The 1000 element array, on the other hand, takes up 8000B, or about 1.95 pages. If the second array starts immediately after the first, it will start in the 5% remaining of the current page, take up all of another page, and then 90% of yet another.

Is this a big deal? Perhaps. When a page is accessed for the first time, it needs to be looked up in the page table, and the translation is then cached in the TLB. Access to the TLB is fast, whereas walking the page table is slow. That extra page access costs something, but can it really cost this much, and is this all that's going on? Since these measurements are made on an Intel Skylake processor, they will also be affected by 4K aliasing. Let's allocate an array in between the two we want to loop over, to vary the offsets:

  @Param({"0", "6", "12", "18", "24"})
  int offset;

  double s;
  double[] a;
  double[] b;
  double[] padding;

  @Setup(Level.Trial)
  public void init() {
    s = ThreadLocalRandom.current().nextDouble();
    a = createDoubleArray(size);
    padding = new double[offset];
    b = createDoubleArray(size);
  }
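Before looking at the results, the page arithmetic above can be made concrete with a couple of lines (the 16 byte array header is ignored for simplicity; the class and helper names are mine):

```java
public class PageSpan {
    static final int PAGE = 4096; // 4KB pages, as on the machine in the text

    // fewest pages `bytes` contiguous bytes can touch (page-aligned start)
    static int minPages(int bytes) {
        return (bytes + PAGE - 1) / PAGE;
    }

    // most pages `bytes` contiguous bytes can touch (worst-case start offset)
    static int maxPages(int bytes) {
        return (bytes - 1) / PAGE + 2;
    }

    public static void main(String[] args) {
        // a double[] of n elements occupies 8n bytes of data (header ignored)
        System.out.println(minPages(250 * 8) + " " + maxPages(250 * 8));   // 1 2
        System.out.println(minPages(1000 * 8) + " " + maxPages(1000 * 8)); // 2 3
    }
}
```

So the 250 element array can always sit in a single page if it starts on a page boundary, while the 1000 element array can straddle three pages depending on where the allocator places it.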

Benchmark  Mode   Threads  Samples  Score     Score Error (99.9%)  Unit    Param: offset  Param: size
daxpy      thrpt  1        10       2.224875  0.247778             ops/us  0              1000
daxpy      thrpt  1        10       6.159791  0.441525             ops/us  0              1024
daxpy      thrpt  1        10       2.350425  0.136992             ops/us  6              1000
daxpy      thrpt  1        10       6.047009  0.360723             ops/us  6              1024
daxpy      thrpt  1        10       3.332370  0.253739             ops/us  12             1000
daxpy      thrpt  1        10       6.506141  0.155733             ops/us  12             1024
daxpy      thrpt  1        10       6.621031  0.345151             ops/us  18             1000
daxpy      thrpt  1        10       6.827635  0.970527             ops/us  18             1024
daxpy      thrpt  1        10       7.456584  0.214229             ops/us  24             1000
daxpy      thrpt  1        10       7.451441  0.104871             ops/us  24             1024

The pattern is curious (pay attention to the offset parameter): the ratio of the two sizes' throughputs ranges from a 3x degradation at offset 0 through to parity at offset 24.

The loop in question is vectorised, and vectorisation can be disabled by setting -XX:-UseSuperWord. Doing so is revealing, because the trend is still present, but dampened to the extent that it could be waved away as noise:

Benchmark  Mode   Threads  Samples  Score     Score Error (99.9%)  Unit    Param: offset  Param: size
daxpy      thrpt  1        10       1.416452  0.079905             ops/us  0              1000
daxpy      thrpt  1        10       1.806841  0.200231             ops/us  0              1024
daxpy      thrpt  1        10       1.408526  0.085147             ops/us  6              1000
daxpy      thrpt  1        10       1.921026  0.049655             ops/us  6              1024
daxpy      thrpt  1        10       1.459186  0.076427             ops/us  12             1000
daxpy      thrpt  1        10       1.809220  0.199885             ops/us  12             1024
daxpy      thrpt  1        10       1.824435  0.169680             ops/us  18             1000
daxpy      thrpt  1        10       1.842230  0.204414             ops/us  18             1024
daxpy      thrpt  1        10       1.934717  0.229822             ops/us  24             1000
daxpy      thrpt  1        10       1.964316  0.039893             ops/us  24             1024

The point is, you may not have cared much about alignment before, because you would be unlikely to notice its effects unless you were really looking for them. Decent autovectorisation seems to raise the stakes enormously.

Analysis with Perfasm

It's impossible to know for sure what the cause of this behaviour is without profiling. Since I observed this effect on my Windows development laptop, I used xperf via WinPerfAsmProfiler, which is part of JMH.

I did some instruction profiling. The same code is going to get generated in each case, with a preloop, main loop and post loop, but by looking at the sampled instruction frequency we can see what’s taking the most time in the vectorised main loop. From now on, superword parallelism is never disabled. The full output of this run can be seen at github. Here is the main loop for size=1024, offset=0, which is unrolled, spending most time loading and storing data (vmovdqu) but spending a decent amount of time in the multiplication:

  0.18%    0x0000020dddc5af90: vmovdqu ymm0,ymmword ptr [r10+r8*8+10h]
  9.27%    0x0000020dddc5af97: vmulpd  ymm0,ymm0,ymm2
  0.22%    0x0000020dddc5af9b: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+10h]
  7.48%    0x0000020dddc5afa2: vmovdqu ymmword ptr [r11+r8*8+10h],ymm0
 10.16%    0x0000020dddc5afa9: vmovdqu ymm0,ymmword ptr [r10+r8*8+30h]
  0.09%    0x0000020dddc5afb0: vmulpd  ymm0,ymm0,ymm2
  3.62%    0x0000020dddc5afb4: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+30h]
 10.60%    0x0000020dddc5afbb: vmovdqu ymmword ptr [r11+r8*8+30h],ymm0
  0.26%    0x0000020dddc5afc2: vmovdqu ymm0,ymmword ptr [r10+r8*8+50h]
  3.76%    0x0000020dddc5afc9: vmulpd  ymm0,ymm0,ymm2
  0.20%    0x0000020dddc5afcd: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+50h]
 13.23%    0x0000020dddc5afd4: vmovdqu ymmword ptr [r11+r8*8+50h],ymm0
  9.46%    0x0000020dddc5afdb: vmovdqu ymm0,ymmword ptr [r10+r8*8+70h]
  0.11%    0x0000020dddc5afe2: vmulpd  ymm0,ymm0,ymm2
  4.63%    0x0000020dddc5afe6: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+70h]
  9.78%    0x0000020dddc5afed: vmovdqu ymmword ptr [r11+r8*8+70h],ymm0

In the worst performer (size=1000, offset=0) a lot more time is spent on the stores, a much smaller fraction of observed instructions are involved with multiplication or addition. This indicates either a measurement bias (perhaps there’s some mechanism that makes a store/load easier to observe) or an increase in load/store cost.

  0.24%    0x000002d1a946f510: vmovdqu ymm0,ymmword ptr [r10+r8*8+10h]
  3.61%    0x000002d1a946f517: vmulpd  ymm0,ymm0,ymm2
  4.63%    0x000002d1a946f51b: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+10h]
  9.73%    0x000002d1a946f522: vmovdqu ymmword ptr [r11+r8*8+10h],ymm0
  4.34%    0x000002d1a946f529: vmovdqu ymm0,ymmword ptr [r10+r8*8+30h]
  2.13%    0x000002d1a946f530: vmulpd  ymm0,ymm0,ymm2
  7.77%    0x000002d1a946f534: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+30h]
 13.46%    0x000002d1a946f53b: vmovdqu ymmword ptr [r11+r8*8+30h],ymm0
  3.37%    0x000002d1a946f542: vmovdqu ymm0,ymmword ptr [r10+r8*8+50h]
  0.47%    0x000002d1a946f549: vmulpd  ymm0,ymm0,ymm2
  1.47%    0x000002d1a946f54d: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+50h]
 13.00%    0x000002d1a946f554: vmovdqu ymmword ptr [r11+r8*8+50h],ymm0
  4.24%    0x000002d1a946f55b: vmovdqu ymm0,ymmword ptr [r10+r8*8+70h]
  2.40%    0x000002d1a946f562: vmulpd  ymm0,ymm0,ymm2
  8.92%    0x000002d1a946f566: vaddpd  ymm0,ymm0,ymmword ptr [r11+r8*8+70h]
 14.10%    0x000002d1a946f56d: vmovdqu ymmword ptr [r11+r8*8+70h],ymm0

This trend generally improves as 1024 is approached from below, but do bear in mind that this is a noisy measure. Interpret the numbers below as probabilities: were you to stop the execution of daxpy at random with offset zero, you would have a 94% chance of finding yourself within the main vectorised loop, a 50% chance of observing a store, and only a 31% chance of observing a multiply or add. As size plus offset gets further from 1024, the stores come to dominate the main loop, and the main loop comes to dominate the method. Again, this is approximate: when the arrays aren't well aligned, we spend less time loading, less time multiplying and adding, and much more time storing.

classification  offset = 0  offset = 6  offset = 12  offset = 18  offset = 24
add             22.79       21.46       15.41        7.77         8.03
load            12.19       11.95       15.55        21.9         21.19
multiply        8.61        7.7         9.54         13.15        8.33
store           50.29       51.3        49.16        42.34        44.56
main loop       93.88       92.41       89.66        85.16        82.11

I stop short of counting TLB misses because this is on Windows and it would be a complete pain in the arse to capture. The truth is, I still don’t have enough information to say what the cause is, but I know the stores are getting more expensive, and that I would need to be quite unlucky to have put these two arrays next to each other. I may update this post with a Linux measurement where it’s much easier to profile hardware events with perf.

After discussing the post on Twitter (see here and here), Vsevolod and Aleksey Shipilёv correctly attributed this performance bug to 4K aliasing. The effect observed here is also a contributing factor to fluctuations in throughput observed in JDK-8150730.

Garbage Collection

Is it necessary to make sure all arrays are power-of-two sized and aligned with pages? In this microbenchmark it's easy to arrange that; for typical applications it probably isn't feasible (which isn't to say there aren't people out there who do this). Fortunately, it isn't necessary for most use cases, and, true to the title, this post has something to do with garbage collection. The arrays were allocated in order, and no garbage was produced during the benchmarks, so the second array was split across pages. Let's put some code bound to trigger garbage collection into the initialisation of the benchmark:

  String acc = "";

  @Setup(Level.Trial)
  public void init() {
    s = ThreadLocalRandom.current().nextDouble();
    a = createDoubleArray(size);
    b = createDoubleArray(size);
    // don't do this in production
    for (int i = 0; i < 10000; ++i) {
      acc += UUID.randomUUID().toString();
    }
  }

A miracle occurs: the code speeds up!

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
daxpy thrpt 1 10 6.854161 0.261247 ops/us 1000
daxpy thrpt 1 10 6.328602 0.163391 ops/us 1024

Why? G1 has rearranged the heap, and that second array probably isn't straddling three pages any more, nor is it likely to remain in such an unlucky position relative to the first array. Future garbage collectors are expected to be even better at this than G1. This makes the cost of garbage collection difficult to quantify: if it takes resources with one hand, it gives them back with the other.

The benchmark code is available at github.