I have recently become interested in the way the effects of compression algorithms and text encoding compose. I started looking at this in my last post. Base 64 encoding extracts and maps each 6-bit subword of a byte stream to one of 64 possible bytes, which is guaranteed to waste 2 bits per byte, but can encode any binary data as UTF-8 text. On a block of incompressible binary data encoded as base 64, neither LZ4 nor Snappy can compress the text to the size of the original binary data, whereas GZIP can (undo a 33% inflation). With monotonically increasing integers, LZ4 and Snappy achieve size parity with uncompressed binary data, whereas GZIP compression can be less effective on base 64 encoded text than on equivalent binary data.

I was interested to see if using LZ4 or Snappy as an intermediate step between base 64 and GZIP would make a difference. Compressing monotonic integers again, my expectation was that LZ4/Snappy could “undo” the base 64 bloat, to get to parity in composition with GZIP on raw binary data, but that’s not what happened:

Compression | Transformation | Count | Compressed Size (MB) |
---|---|---|---|

Uncompressed | binary | 1000000 | 976.56 |

Uncompressed | base64 | 1000000 | 1304.63 |

Uncompressed | base64/snappy | 1000000 | 969.20 |

Uncompressed | base64/lz4 | 1000000 | 993.98 |

GZIP | binary | 1000000 | 337.67 |

GZIP | base64 | 1000000 | 523.80 |

GZIP | base64/snappy | 1000000 | 61.60 |

GZIP | base64/lz4 | 1000000 | 31.53 |

I had already noted that all three algorithms had poor compression ratios for this case, which is better suited to delta encoding. Applying base 64 encoding prior to a fast compression algorithm seems to prime the input for GZIP. I’m so confused by this result that I spent some time on sanity checks: verifying that the results are truly decompressible.

This isn’t very exciting: the result is not competitive with delta encoding, and I ran the same test with a sinusoidal byte stream and saw the opposite effect.

Compression | Transformation | Count | Compressed Size (MB) |
---|---|---|---|

Uncompressed | binary | 1000000 | 976.56 |

Uncompressed | base64 | 1000000 | 1304.63 |

Uncompressed | base64/snappy | 1000000 | 1087.73 |

Uncompressed | base64/lz4 | 1000000 | 1089.67 |

GZIP | binary | 1000000 | 27.74 |

GZIP | base64 | 1000000 | 69.12 |

GZIP | base64/snappy | 1000000 | 282.52 |

GZIP | base64/lz4 | 1000000 | 283.81 |

Still, it’s hard to predict how multiple layers of data representation will interact.

Two things to keep in mind: any transformation that “breaks the byte boundaries” is tough on byte-oriented compressors since they do everything based on byte boundaries. That is, they work by finding matches earlier in the text: if I write “banana” here and then later on I write “banana” again, all of these algorithms record a “match” for the second banana that says “just go back N bytes and copy 6 characters”, so copies the earlier banana to create the second one.

However, the formats only supports byte offsets: N is measured in whole bytes. So if I add a whole byte in somewhere between the two bananas, nothing changes, the compressor will still find the match, now at N+1 (except for the edge case that the first banana is outside the “window” at the point of the second banana). If, however, I add a single *bit* in between the two bananas (or, in general, any number of bits not a multiple of 8), the compressor won’t find a match because the bytes of the second banana are shifted by 1 bit and don’t look like ‘b’, ‘a’, ‘n’, but rather something like ‘b’ << 1, 'a' << 1, etc.

So base-64 encoding which takes 6-bits and maps them to 8-bite bytes ends up breaking the byte boundaries in the same way. There is a good chance that after the transformation the bytes making up the first banana don't match at all the bytes in the second banana, because they have a different offset relative to a byte boundary. Compare that to a totally dumb transformation which simply repeats each byte, doubling the size. This preserves the byte boundaries in many cases a match-based compressor will be able to compress it almost as well as the original text (not exactly, since the size increase pushes some matches out of the window and there might be other effects like longer matches taking more bits to encode).

That explains why gzip isn't able to "undo" the damage of base64 very well.

The most interesting result is how applying snappy or lz4 first produce a much smaller result when you gzip. That's totally contrary to the popular wisdom that compressing a file a second time never helps.

I can think of two reasons:

All the compressors here are match based (aka LZ77): they find earlier matches for the current text and encode back-references to it, as described above. However, this "lookback" has as limit, often called the "window" – matches further back than this window won't be found. For gzip the max window size is 32K (actual depends on what compression level you choose) and for LZ4 it is 64K, snappy I'm not sure.

This means that these compressors can exhibit very non-linear effect on certain types of data. For example, consider a file composed of a random string of length L, repeated N times. A match-based compressor will generally compress this perfectly as long as L is less than the window size: the total file size will be close to L regardless of N. As soon as L gets longer than the window, however, no matches can be found and compression drops to zero (or a bit worse due to overhead) and the file size is at least L * N.

A sequence of integers could exhibit this kind of behavior, for example there aren't many repeats in from 01, 02, 03, …, 99 if you want at least two bytes to match – but as soon as you get to to 100, you can match the last two digits of 100, 101, 102, with 00, 01, 02 and so on. Normally you'd want longer matches but with bigger numbers that's what you get. So maybe lz4's (and snappy?) longer window gets a match that gzip doesn't?

The other possible effect is that the first compressor somewhat compresses the data, so that the second compressor effectively has a larger window because the input data is smaller, so the second pass is able to find more matches. That doesn't really align with the fact that the UNCOMPRESSED tests show minimal compression for lz4/snappy alone, however.

Another possibility is that unlike snappy and lz4, gzip also does entropy encoding after the matching part: it takes all the various "symbols" (including matches and non match stuff) and choose a bit-code for them in proportion to how frequent they are. Maybe the way lz4 and snappy end up processing the file makes it much more ammenable to entropy coding. That is, they are finding lots of small matches, which doesn't end up being too awesome overall because maybe a match of 4 bytes ends up taking 3 bytes to encode, but the matches are all very similar (consider the 101, 102, example: the match is always found at exactly the same distance back) in terms of the byte values they contain, so entropy coding really rocks!