UUIDs and Compressibility

Universally unique identifiers, or UUIDs, are often used for database primary keys in scenarios where coordination of persistence is either impossible or impractical. UUIDs offer very good probabilistic guarantees of collision avoidance, at the cost of 128 bits per key. 128 bits for a key is quite problematic in key scans and joins: with appropriately structured data, these algorithms can benefit significantly from vector processing, but at 128 bits per element, vectorisation is probably a lost cause. The 128 bit cost is also a fixed cost, even if your database has fewer than one quintilliard rows. Because the keys must be random enough to avoid collisions when generated in a distributed system, there is no way to compress UUID keys down to the number of bits actually required to identify each record in the dataset. Data engineering is about tradeoffs, and none of this is to say UUIDs should never be used, but it is prudent to be aware of the costs. All of this applies in the best case: assuming the keys are stored in their binary format! How bad can it get if UUIDs are stored as text? Can compression undo the damage?

If you work with a relational database like Postgres, you can use an implementation-specific uuid type to ensure UUIDs take up as little space as possible. However, having worked on several projects using NoSQL databases, I have seen people store UUIDs as text on at least two occasions (though this is not the fault of the databases!). How harmful this is depends on the character encoding used, but UTF-8 is quite common (for the characters found in a UUID, this is equivalent to ISO-8859-1). A UUID represented by a string like “9289f568-c33f-4667-8a3d-092aa4e21262” can take up the following sizes, depending on the encoding used.

Format       Size (bytes)   Ratio
binary       16             1
ISO-8859-1   36             2.25
UTF-8        36             2.25
UTF-16       74             4.625
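
These sizes are straightforward to reproduce with nothing but the JDK; a minimal sketch (the class name is my own) might look like this:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.UUID;

    public class UuidSizes {
        public static void main(String[] args) {
            UUID uuid = UUID.randomUUID();
            // the binary form is just two longs: 16 bytes
            byte[] binary = ByteBuffer.allocate(16)
                                      .putLong(uuid.getMostSignificantBits())
                                      .putLong(uuid.getLeastSignificantBits())
                                      .array();
            String text = uuid.toString(); // canonical 36 character form
            System.out.println("binary:     " + binary.length);
            System.out.println("ISO-8859-1: " + text.getBytes(StandardCharsets.ISO_8859_1).length);
            System.out.println("UTF-8:      " + text.getBytes(StandardCharsets.UTF_8).length);
            System.out.println("UTF-16:     " + text.getBytes(StandardCharsets.UTF_16).length); // includes a 2 byte BOM
        }
    }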

The real issue here is not so much the extra storage burden, because keys are usually much smaller than the values, but that the keys are used to process queries. A representation requiring 2-5x more space requires 2-5x more data to pass through the processor when evaluating queries. Many NoSQL databases offer succinct compression options for keys, which allow the keys to be processed without decompression at a small computational cost, such as the prefix and delta encoding available in HBase. This approach can work wonders with well-structured keys, but it will do absolutely nothing if your keys are random.
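
To see why, here is a simplified sketch of prefix sharing (an illustration of the principle rather than HBase’s actual block format, with made-up key values): well-structured, sorted keys share long prefixes that never need to be repeated, while consecutive UUIDs share essentially nothing.

    import java.nio.charset.StandardCharsets;

    public class PrefixSharing {

        // length of the longest common prefix of two byte arrays
        static int commonPrefixLength(byte[] a, byte[] b) {
            int limit = Math.min(a.length, b.length);
            int i = 0;
            while (i < limit && a[i] == b[i]) {
                i++;
            }
            return i;
        }

        static void report(String label, String[] sortedKeys) {
            for (int i = 1; i < sortedKeys.length; i++) {
                byte[] previous = sortedKeys[i - 1].getBytes(StandardCharsets.UTF_8);
                byte[] current = sortedKeys[i].getBytes(StandardCharsets.UTF_8);
                System.out.println(label + ": key " + i + " shares "
                        + commonPrefixLength(previous, current) + "/" + current.length
                        + " bytes with its predecessor");
            }
        }

        public static void main(String[] args) {
            // composite keys with natural structure: most of each key can be elided
            report("structured", new String[] {
                    "sensor:0001:2021-01-01", "sensor:0001:2021-01-02", "sensor:0002:2021-01-01"});
            // random UUIDs: the shared prefix is almost always empty
            report("uuid", new String[] {
                    "9289f568-c33f-4667-8a3d-092aa4e21262", "e0c56438-7c21-4b8a-93d1-6f2a4e7c9b10"});
        }
    }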

Even heavyweight compression techniques requiring global or block-level decompression before evaluation can’t recover the bloat in a textual representation of a UUID, because the text is almost as random as the bytes. I compressed collections of 1M UUIDs using compression algorithms typically reserved for “cold data”: GZIP, Snappy and LZ4, using the code on github.

Compression    Encoding     Count      Size (MB)
Uncompressed   binary       1000000    15.26
Uncompressed   ISO-8859-1   1000000    34.33
Uncompressed   UTF-16       1000000    70.57
GZIP           binary       1000000    15.26
GZIP           ISO-8859-1   1000000    19.50
GZIP           UTF-16       1000000    23.73
LZ4            binary       1000000    15.32
LZ4            ISO-8859-1   1000000    32.56
LZ4            UTF-16       1000000    50.16
Snappy         binary       1000000    15.26
Snappy         ISO-8859-1   1000000    33.99
Snappy         UTF-16       1000000    37.97
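
For reference, the GZIP rows of this table can be approximated with the JDK alone; Snappy and LZ4 need third-party libraries, so this sketch (which is not the benchmark code linked above) covers only GZIP:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.UUID;
    import java.util.zip.GZIPOutputStream;

    public class UuidGzipBenchmark {

        public static void main(String[] args) throws IOException {
            int count = 1_000_000;
            ByteBuffer binary = ByteBuffer.allocate(count * 16);
            StringBuilder text = new StringBuilder(count * 36);
            for (int i = 0; i < count; i++) {
                UUID uuid = UUID.randomUUID();
                binary.putLong(uuid.getMostSignificantBits());
                binary.putLong(uuid.getLeastSignificantBits());
                text.append(uuid); // canonical 36 character form, no separators between values
            }
            report("binary", binary.array());
            report("ISO-8859-1", text.toString().getBytes(StandardCharsets.ISO_8859_1));
            report("UTF-16", text.toString().getBytes(StandardCharsets.UTF_16));
        }

        private static void report(String encoding, byte[] input) throws IOException {
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
                gzip.write(input);
            }
            System.out.printf("%-12s uncompressed: %6.2fMB, gzipped: %6.2fMB%n",
                    encoding, input.length / (1024.0 * 1024), compressed.size() / (1024.0 * 1024));
        }
    }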

Assuming you are OK with treating your keys as cold data, none of these algorithms will undo the inflation. What’s interesting, assuming you’ve never thought about it before, is that none of these algorithms can compress the binary representation of the UUIDs. This is because the UUIDs are random, and random enough to be taken as unique in any given trillion-year epoch. Even though there are only one million values, which could be represented by just 20 bits per value (2^20 is a little over a million), none of the compression algorithms improves on 128 bits per value. This reminds me of a passage from Theories of Everything by John D. Barrow:

The goal of science is to make sense of the diversity of nature. [Science] employs observation to gather information about the world and to test predictions about how the world will react to new circumstances, but in between these two procedures lies the heart of the scientific process. This is nothing more than the transformation of lists of observational data into abbreviated form by the recognition of patterns. The recognition of such a pattern allows the information content of the observed sequence of events to be replaced by a shorthand formula which possesses the same, or almost the same, information content…

We can extend this image of science in a manner that sharpens its focus. Suppose we are presented with any string of symbols. They do not have to be numbers but let us assume for the sake of illustration that they are. We say that the string is ‘random’ if there is no other representation of the string which is shorter than itself. But we say it is ‘non-random’ if there does exist an abbreviated representation.

Random data can’t be compressed, and inflated textual representations of random bits are almost as random as the bits themselves, as far as any existing compression algorithm is concerned. Efficient representation of data requires the identification and exploitation of patterns, and using UUIDs instead of naturally occurring composite keys introduces randomness where there could often be order.


5 thoughts on “UUIDs and Compressibility”

  1. What is the mapping between the binary and text forms for UUID? Is it just a base16 encoding with dashes added in fixed places (i.e. which don’t need to be stored)?

      1. No, but gzip gets you pretty close. It would have been nice to have a column showing how much of the expansion is “clawed back” by each compression method. gzip seems to claw back about 79% of the expansion, at least for the UTF-8/ISO-8859-1 case, which isn’t terrible.

        The UTF-16 is much harder for the compressors because it basically makes every other byte zero, making the entropy coder (in the case of gzip) waste a bunch of space on a short, efficient way to encode zeros. None of the encoders are able to recognize the “every other byte is a zero” pattern, which would bring the overhead of UTF-16 down to about the same as the other encodings.
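
        For example, a quick Java check makes the alternating zero bytes visible:

        import java.nio.charset.StandardCharsets;
        import java.util.UUID;

        public class Utf16Zeros {
            public static void main(String[] args) {
                byte[] utf16 = UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_16);
                // prints something like: fe ff 00 39 00 32 00 38 00 39 ...
                // (a 2 byte BOM, then a zero byte before every ASCII character)
                for (int i = 0; i < 12; i++) {
                    System.out.printf("%02x ", utf16[i] & 0xFF);
                }
                System.out.println();
            }
        }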

  2. In principle, compressors that use byte-wise entropy coding should be able to essentially completely erase the waste incurred by the base 16 encoding (modulo some issues such as imperfect entropy coding, the introduction of the dashes and introduction of zeros by UTF-16).

    If you start out with some random bytes, and break each random byte up into two bytes whose values are selected (in the same way) from each nibble of the byte (this is base 16 encoding), you go from one byte with 256 uniformly distributed values, which has an entropy of 8 bits (256 * 1/256 * lg(256)), to two bytes each with only 16 uniformly distributed values, each of which has an entropy of 4 bits: so the total entropy is the same (8 bits) and should be more or less picked up by a byte-wise encoder.

    Indeed, turning to truly random values, we can check this for gzip:

    # create a million random bytes
    $ head /dev/urandom --bytes=1MB > rand1M
    # check that base16 encoding it doubles the size
    $ cat rand1M | od -An -vtx1 | tr -d ' \n' | wc -c
    2000000
    # check the size when gzipped
    $ cat rand1M | od -An -vtx1 | tr -d ' \n' | gzip -c | wc -c
    1139154
    # check the size when xz'd (using LZMA)
    $ cat rand1M | od -An -vtx1 | tr -d ' \n' | xz -c | wc -c
    1041200

    So from the 100% expansion you got originally, gzip is able to claw back 86% of that (leaving 14%), and xz about 96%. So they both “mostly” undo the expansion. Based on your numbers above, gzip left 22% of the expansion, probably because the UUID has dashes and perhaps some other separators between the UUID values (that wasn’t clear).

    It’s actually curious that gzip doesn’t do better, since I’d expect 16 roughly equally occurring values to be about optimal for the entropy coder. gzip actually encodes literals and various stuff relating to matches together in one alphabet. Now matches are totally useless here (the values are random after all, so any matches are going to be “spurious” and cost more to encode than they’re worth), but gzip doesn’t know that and it does matches before entropy coding. So perhaps it ends up with more than 16 symbols because of some spurious matches, and that reduces the efficiency. Also, gzip uses Huffman coding, which means that each code needs to be an integral number of bits: even if the ideal is, say, 4.1 bits, it has to choose 4 or 5. So except in cases where everything works out exactly, it loses some efficiency there.
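
    The nibble arithmetic is easy to confirm empirically; a small Java sketch that measures the byte-value entropy of random bytes before and after hex encoding prints roughly 8.00 and 4.00 bits per byte, i.e. the same total entropy spread over twice as many bytes:

    import java.nio.charset.StandardCharsets;
    import java.security.SecureRandom;

    public class HexEntropy {

        // Shannon entropy of the byte-value histogram, in bits per byte
        static double bitsPerByte(byte[] data) {
            long[] counts = new long[256];
            for (byte b : data) {
                counts[b & 0xFF]++;
            }
            double entropy = 0;
            for (long count : counts) {
                if (count > 0) {
                    double p = (double) count / data.length;
                    entropy -= p * Math.log(p) / Math.log(2);
                }
            }
            return entropy;
        }

        public static void main(String[] args) {
            byte[] raw = new byte[1_000_000];
            new SecureRandom().nextBytes(raw);
            // base 16 encode: each byte becomes two hex characters
            StringBuilder hex = new StringBuilder(raw.length * 2);
            for (byte b : raw) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            System.out.printf("raw bytes: %.2f bits per byte%n", bitsPerByte(raw));
            System.out.printf("hex bytes: %.2f bits per byte%n",
                    bitsPerByte(hex.toString().getBytes(StandardCharsets.ISO_8859_1)));
        }
    }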

    1. “probably because the UUID has dashes and perhaps some other separators between the UUID values (that wasn’t clear).”

      I’m referring to the canonical format as described on Wikipedia which is used in Java. If this is not a universal standard, that’s even more reason not to use the textual encoding! There are no separators in the stream prior to GZIP compression.

      For the avoidance of doubt, I’m not claiming that the encoding somehow (invertibly!) maintains the entropy while doubling the size, and clearly, an entropy based compression algorithm should be able to get back to parity. But GZIP doesn’t, and GZIP probably isn’t even the first thing you would reach for. LZ4 and Snappy are often used in preference to GZIP in NoSQL datastores because they are much faster, but as far as these algorithms are concerned UUID strings are every bit as random as the binary input.
