Interpreting Compression Ratios as Signals

I often think about the book Theories of Everything by John D. Barrow. There is a beautiful passage on algorithmic compressibility:

The goal of science is to make sense of the diversity of nature. [Science] employs observation to gather information about the world and to test predictions about how the world will react to new circumstances, but in between these two procedures lies the heart of the scientific process. This is nothing more than the transformation of lists of observational data into abbreviated form by the recognition of patterns. The recognition of such a pattern allows the information content of the observed sequence of events to be replaced by a shorthand formula which possesses the same, or almost the same, information content…

We can extend this image of science in a manner that sharpens its focus. Suppose we are presented with any string of symbols. They do not have to be numbers but let us assume for the sake of illustration that they are. We say that the string is ‘random’ if there is no other representation of the string which is shorter than itself. But we say it is ‘non-random’ if there does exist an abbreviated representation.

When processing a data set, patterns can be found and exploited to reduce its size to close to that of its information content, in other words the data set can be abbreviated. This idea can be turned on its head to use a data set’s compression ratio as a measure of its information content. Using compression algorithms which seek to exploit correlations between attribute values, hitherto unseen relations, or mistakes, decrease compression ratio. Repeated measurement of this proxy information content in windows over a stream of, say, credit card transactions could be interpreted as a measure of the likelihood of fraud. In a more generic setting, changes in the compression ratio can indicate changes in data quality.

Compression Techniques

There are various approaches to compression used in database systems. These techniques can be classified in terms of what is actually compressed, agnosticism to the domain of the data, and invariance of properties of relations in compressed form.

At one extreme there are techniques, typically used for cold data, which compress entire chunks of data byte-wise, with no relational invariance. A good example of such a cold data compression approach is SNAPPY block level compression in HBase.  This approach has a key benefit; it can compress anything since it is agnostic to the format of the data. In order to exploit a property it must be recognised and acknowledged first, and ubiquity comes at the cost of being unable to perform relational operations in compressed form. There are options in various database system to perform this kind of compression at a row level. This can lead to lower compression ratios since at a row level the frequency histograms are unlikely to converge to the frequency histogram of the entire data set (convergence could not happen unless each row were equivalent up to a byte-wise permutation).

At the other extreme there are techniques like columnar dictionary encoding, which maps the domain of a column to integral indices into some array-like storage of the distinct attribute values. The distinct values are typically represented as a sorted array or as a hash set. Several properties are preserved in dictionarised data sets. Equality is always preserved, meaning that several operations like join and various filters can be performed without decompression (instead, the expressions themselves can be mapped into the dictionary domain). Depending on the storage technique used, ordering and associated relations may also be preserved after compression. The frequency histogram of the attribute values of the column is sometimes avoided because this would not result in a fixed-width encoding, but there are dictionary encodings to map the most frequent values to the smallest integers to decrease the number of bits required in encoded form. While the level of compression feasible is good, it is inferior to an inverse-frequency symbol based approach. Columnar dictionary encoding also cannot exploit correlations between attribute values. By way of example, in a financial data set, there is a high correlation between trade dates, trade types and settlement dates: with columnar dictionary encoding, the compressed form would take up space proportional to the distinct values of the three attributes combined, even though trades in FX spot settle T+2, except USDCAD which settles T+1 (the settlement date needn’t be stored, it could feasibly take up no space and just be computed instead).

Somewhere in between the two extremes there are algorithms which encode tuples of attribute values. One such technique starts by computing the Huffman encoding of the entire domain of each column. Lists of tuples are encoded by first replacing each attribute value of each tuple with the corresponding Huffman code, concatenating the Huffman codes into fixed width integers before a sort and delta encoding. This produces a compression ratio close to the entropy of the data set, and specifically targets correlations between columns. While several compression techniques have their use, it is at this point in the spectrum where I see there being useful information in the compression ratio over a window of data, because unusual things will take up more space. If your compression ratio ever changes, it is because something weird has happened and you should probably investigate it. If you are processing trades and the compression ratio decreases, is it because you recently started trading more USDCAD Spot?

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>