Collaborative Compression

I have recently become interested in the way the effects of compression algorithms and text encoding compose. I started looking at this in my last post. Base 64 encoding extracts and maps each 6-bit subword of a byte stream to one of 64 possible bytes, which is guaranteed to waste 2 bits per byte, but can encode any binary data as UTF-8 text. On a block of incompressible binary data encoded as base 64, neither LZ4 nor Snappy can compress the text to the size of the original binary data, whereas GZIP can (undo a 33% inflation). With monotonically increasing integers, LZ4 and Snappy achieve size parity with uncompressed binary data, whereas GZIP compression can be less effective on base 64 encoded text than on equivalent binary data.

I was interested to see if using LZ4 or Snappy as an intermediate step between base 64 and GZIP would make a difference. Compressing monotonic integers again, my expectation was that LZ4/Snappy could “undo” the base 64 bloat, to get to parity in composition with GZIP on raw binary data, but that’s not what happened:

Compression Transformation Count Compressed Size (MB)
Uncompressed binary 1000000 976.56
Uncompressed base64 1000000 1304.63
Uncompressed base64/snappy 1000000 969.20
Uncompressed base64/lz4 1000000 993.98
GZIP binary 1000000 337.67
GZIP base64 1000000 523.80
GZIP base64/snappy 1000000 61.60
GZIP base64/lz4 1000000 31.53

I had already noted that all three algorithms had poor compression ratios for this case, which is better suited to delta encoding. Applying base 64 encoding prior to a fast compression algorithm seems to prime the input for GZIP. I’m so confused by this result that I spent some time on sanity checks: verifying that the results are truly decompressible.

This isn’t very exciting: the result is not competitive with delta encoding, and I ran the same test with a sinusoidal byte stream and saw the opposite effect.

Compression Transformation Count Compressed Size (MB)
Uncompressed binary 1000000 976.56
Uncompressed base64 1000000 1304.63
Uncompressed base64/snappy 1000000 1087.73
Uncompressed base64/lz4 1000000 1089.67
GZIP binary 1000000 27.74
GZIP base64 1000000 69.12
GZIP base64/snappy 1000000 282.52
GZIP base64/lz4 1000000 283.81

Still, it’s hard to predict how multiple layers of data representation will interact.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>