We present here a new construction which has no real immediate usefulness, but is a good illustration of a fundamental concept of cryptography, namely that there is a great difference between knowing that some mathematical object exists, and being able to build it in practice. Thus, this construction can be thought of as having some educational virtue, in the spirit of the mathematical recreations that have been a favourite pastime of mathematicians since at least the early 17th century.
We call paradoxical compression a lossless compression algorithm such that:
- On any input x (a sequence of bits), it produces an output C(x) (another sequence of bits) such that the original x can be rebuilt from C(x) (the compression is lossless).
- There is at least one input x0 such that C(x0) is strictly shorter than x0 (the algorithm really deserves to be called a compression algorithm).
- The input size is never increased: for all x, C(x) is never strictly longer than x.
This is mathematically impossible. There is no compression algorithm that can achieve these three properties simultaneously. Despite this impossibility, we can achieve it in practice. A paper describing the algorithm, and a demo implementation, are available at:
https://github.com/pornin/paradox-compress
The paper is also available on eprint.
A Few Important Warnings
- This is not “infinite compression”. We are not talking about a compression algorithm that could reduce the size of every possible input. That would be very useful indeed, but it is impossible, and, so to speak, a lot more impossible than paradoxical compression. We merely claim that the output is not larger than the input, not that it is always shorter.
- Both paradoxical and infinite compression have been “achieved” by various people when working on files by moving some information into the file metadata (file name, date of creation or last modification…). We are not dealing here with such tricks; our inputs and outputs are generic sequences of bits, that do not have any metadata.
On the Impossibility of Paradoxical Compression
Let x0 be an input whose size is reduced by the compression algorithm. For instance, suppose that x0 has length 1000 bits, but C(x0) has length 900 bits. For the compression algorithm to deserve that name, such an input must exist.
There are exactly 2^901 - 1 possible bit sequences of up to 900 bits. Every one of them must be compressed into a bit sequence of up to 900 bits, since the output is never longer than the input. Thus, all 2^901 - 1 bit sequences are mapped to bit sequences in the same set. But the mapping must be injective: if two bit sequences are mapped to the same output, then that output cannot be reliably decompressed, since there would be two corresponding inputs. The hypothesis of losslessness forbids that situation. Therefore, all 2^901 - 1 bit sequences of length up to 900 bits are mapped by the compression algorithm to 2^901 - 1 distinct bit sequences of length up to 900 bits. This necessarily exhausts all of them, in particular C(x0). This implies that the compression of x0 produces an output which is also the result of compressing another input x1 distinct from x0 (since x1 has length at most 900 bits, while x0 has length 1000 bits). This is a contradiction: the compression algorithm cannot be lossless.
This simple demonstration is an application of what is now known as the pigeonhole principle, a well-known remark that can be expressed in set theory as follows: there cannot be an injective mapping from a finite set S1 to a finite set S2 if the cardinality of S2 is strictly lower than the cardinality of S1. In the early 17th century, the French Jesuit Jean Leurechon used it to explain that there must be at least two men in the world with the same number of hairs on their head, since there are more men in the world than hairs on the head of any single man. The principle was restated several times over the centuries, with various metaphors, one of the most recent ones involving pigeons in a dovecote.
In the context of paradoxical compression, the pigeonhole principle basically means that there must exist a “problematic input” x that breaks one of the alleged properties: either C(x) is larger than x, or decompression of C(x) does not yield x back, or the compression or decompression algorithm fails to terminate.
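To make this concrete, here is a small illustrative Python script (my own illustration, not part of the paper or the GitHub code; the function names are hypothetical) that takes any candidate compress/decompress pair and exhaustively searches all short bit strings for a problematic input. By the argument above, as soon as the candidate strictly shortens at least one string in the searched range while never expanding any, the search must find a string that fails the round-trip.

```python
from itertools import product

def find_problematic_input(compress, decompress, max_len=12):
    """Search all bit strings of length <= max_len for an input that breaks
    one of the claimed properties: output longer than input, or a failed
    round-trip. The pigeonhole principle guarantees one exists as soon as
    some string in this range is strictly shortened."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            x = "".join(bits)
            y = compress(x)
            if len(y) > len(x):
                return x, "compressed output is longer than the input"
            if decompress(y) != x:
                return x, "decompression does not restore the input"
    return None

# A toy "compressor" that shortens runs of zeros; it is not injective,
# so the search necessarily finds an input that does not survive a round-trip.
def toy_compress(x):
    return x.replace("0000", "00")

def toy_decompress(y):
    return y.replace("00", "0000")

print(find_problematic_input(toy_compress, toy_decompress))
# ('00', 'decompression does not restore the input')
```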
Achieving Paradoxical Compression
Since we know that paradoxical compression is impossible, achieving it nonetheless is going to require some creativity. In other words, we will “cheat”. The conceptual trick here is the following: there exist some “problematic inputs”, but there are not necessarily many of them. Maybe we can arrange for them to be hard to find? In essence, we are moving the target: we won’t make an algorithm that can be proven to always work, but we might make one such that nobody can find a problematic input (even though everybody knows that these problematic inputs necessarily exist).
This is where the gist of the trick lies: there is an existence proof for problematic inputs, but it is not a constructive proof, since it only demonstrates that such inputs must exist; it does not reveal them. In the world of mathematics, existence proofs are enough to conclude; but in the world of computers, which operate under finite resources, constructive proofs do not automatically follow. Most of cryptography thrives in the gap between existence and constructive proofs. For instance, given a strong, cryptographically secure hash function such as SHA-256, we know perfectly well that collisions must exist (the SHA-256 input space is vastly larger than the SHA-256 output space, so the SHA-256 function cannot be injective), but nobody has ever managed to find one. We can even describe algorithms that would produce collisions if executed, but at a computational cost that far exceeds our available resources, so we cannot do that in practice.
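The same gap can be felt with a small experiment (my own illustration, not from the paper): a birthday-style search finds a collision on SHA-256 truncated to 32 bits almost instantly, because it only needs about 2^16 evaluations; the very same loop aimed at the full 256-bit output would need about 2^128 evaluations, far beyond any available computing power.

```python
import hashlib
from itertools import count

def find_truncated_collision(nbytes=4):
    """Birthday search: find two distinct messages whose SHA-256 digests agree
    on their first `nbytes` bytes. Expected cost is about 2**(4 * nbytes)
    hash evaluations, so this is only feasible for small truncations."""
    seen = {}
    for i in count():
        msg = i.to_bytes(8, "big")
        tag = hashlib.sha256(msg).digest()[:nbytes]
        if tag in seen:
            return seen[tag], msg
        seen[tag] = msg

m1, m2 = find_truncated_collision()
print(m1.hex(), m2.hex())   # two distinct inputs, same 32-bit truncated digest
```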
In the case of paradoxical compression, the construction can be described as successive improvements. Let’s start with a “normal” compression algorithm such as DEFLATE (the core algorithm in the GZip and Zlib formats, also used in traditional Zip archives). This algorithm can reduce the size of some sequences of bits, in particular structured data commonly encountered in practical applications. However, for most possible inputs (e.g. output of a random generator), it will slightly increase the size. We want to fix that. A simple method is the following:
- Compression: if DEFLATE(x) is shorter than x, then return DEFLATE(x). Otherwise, return x.
- Decompression: if y can be inflated into x, then return x. Otherwise, return y.
This method assumes that a “compressed output” can be unambiguously distinguished from any other sequence of bits. This is, of course, not true, and it is easy to make this algorithm fail by applying the compressor on its own output: for a given x, C(x) is usually not itself compressible (after all, DEFLATE does a good job at removing the redundancies that DEFLATE itself leverages), so that C(C(x)) will return C(x) itself. Then, decompression of C(C(x)) will return x instead of C(x). C(x) is then a “problematic input” since it does not survive a compression-decompression cycle.
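As an illustration, here is a minimal Python sketch of this naive scheme, using the standard-library zlib module as a stand-in for DEFLATE (the framing and the round-trip demonstration are mine, not the format used in the actual implementation), together with the failure just described:

```python
import zlib

def compress(x: bytes) -> bytes:
    y = zlib.compress(x, 9)
    return y if len(y) < len(x) else x        # never return something longer

def decompress(y: bytes) -> bytes:
    try:
        return zlib.decompress(y)             # y "looks like" a compressed output
    except zlib.error:
        return y                              # assume y was stored as-is

x = b"a" * 1000
c1 = compress(x)                 # genuinely shorter than x
assert decompress(c1) == x
c2 = compress(c1)                # c1 is not compressible, so c2 == c1...
print(decompress(c2) == c1)      # ...and decompressing c2 yields x, not c1: False
```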
Now, we can handle that case by adjoining a counter in some sort of header. Compressed outputs will start with a counter of 0. If the input to the compression algorithm is already a compressed output, we’ll just increase the counter; the counter is really the number of nested invocations of the compression algorithm. This leads to the following (a short code sketch follows the description):
- Compression of x:
  - If DEFLATE(x) is shorter than x with some extra room for our counter (say, a 32-bit counter), then return DEFLATE(x) || 0.
  - Otherwise, if the input is itself x = DEFLATE(x’) || c for some input x’ and counter c, then return DEFLATE(x’) || c+1.
  - Otherwise, return x.
- Decompression of y:
  - If y = DEFLATE(x) || 0 for some x, then return x.
  - Otherwise, if y = DEFLATE(x) || c for some input x and counter c > 0, then return DEFLATE(x) || c-1.
  - Otherwise, return y.
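Here is a sketch of this counter-based variant, again with Python’s zlib as a stand-in for DEFLATE and with a 32-bit big-endian counter appended after the compressed body (the exact framing and helper names are mine):

```python
import zlib

CTR_LEN = 4   # 32-bit counter appended after the compressed body

def parse(y: bytes):
    """If y has the shape DEFLATE(x) || counter, return (body, counter), else None."""
    if len(y) <= CTR_LEN:
        return None
    body, ctr = y[:-CTR_LEN], int.from_bytes(y[-CTR_LEN:], "big")
    try:
        zlib.decompress(body)                 # body must be a valid compressed stream
    except zlib.error:
        return None
    return body, ctr

def compress(x: bytes) -> bytes:
    d = zlib.compress(x, 9)
    if len(d) + CTR_LEN < len(x):             # genuinely shorter: counter starts at 0
        return d + (0).to_bytes(CTR_LEN, "big")
    p = parse(x)
    if p is not None:                         # input is itself a compressed output
        body, ctr = p
        # ctr + 1 wraps around when ctr == 0xFFFFFFFF: these are the problematic inputs.
        return body + ((ctr + 1) & 0xFFFFFFFF).to_bytes(CTR_LEN, "big")
    return x                                  # stored as-is

def decompress(y: bytes) -> bytes:
    p = parse(y)
    if p is None:
        return y
    body, ctr = p
    if ctr == 0:
        return zlib.decompress(body)
    return body + (ctr - 1).to_bytes(CTR_LEN, "big")
```

With this version, compressing a compressed output merely increments the counter and decompression decrements it, so nested compression/decompression cycles now round-trip correctly, right up until the counter overflows.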
Does this work? Mostly. If you implement it and try it on some examples, including by trying to compress a compression output, then this seems to work. However, the problematic inputs are not unreachable. It can be shown that the above necessarily works except in a single place, which is the computation of c+1 in the second compression case: since the counter slot has a given fixed length, it may overflow. For instance, if the counter is a 32-bit integer, and c happens to be equal to 2^32 - 1, then c+1 does not fit. The compression algorithm would typically truncate the value to its low 32 bits, yielding 0, and decompression would not return the proper bit sequence. In other words, the “problematic inputs” are exactly the values DEFLATE(x) || 0xFFFFFFFF. You might take solace in the idea that such an input is unlikely to occur in practical, non-adversarial situations, but such inputs are still easy enough to build on purpose. Thus, this is not good enough for us. We want a construction that cannot be broken even by someone actively trying to find a problematic input.
Note, though, that building a problematic input by compressing the output repeatedly is very inefficient; with a 32-bit counter, it takes more than 4 billion nested invocations. Of course, if you want to make a problematic input on purpose, you don’t do that; you directly set the counter to the all-ones value. But what if you can’t?
Let’s now suppose that the compression and decompression algorithms are really compression and decompression systems that can be invoked, but which are opaque to attackers (this is called the “blackbox model”). In that case, we may imagine that the compressor somehow signs its outputs, so that attackers who want to make a problematic input cannot directly set the counter to the all-ones value; if the attacker must use the compression system to increase the counter, then reaching a counter overflow would require an inordinately large number of invocations. If the counter is large enough (e.g. 64 bits or more), this cannot be done in any practical time. This leads us to the following construction (a code sketch follows the description):
- The compression and decompression systems are assumed to share a secret key K. That key is used in a message authentication code (MAC). The MAC is supposed to be stateless, deterministic and unforgeable; in practice, consider it to be HMAC/SHA-256, possibly with an output truncated to 128 bits.
- Compression of x:
  - If DEFLATE(x) is shorter than x by at least 64+128 = 192 bits, then return DEFLATE(x) || 0 || MAC(DEFLATE(x) || 0) (i.e. the concatenation of DEFLATE(x), the counter value 0 over 64 bits, and a 128-bit MAC computed over these two elements).
  - Otherwise, if the input is x = DEFLATE(x’) || c || MAC(DEFLATE(x’) || c) for some input x’ and counter c, then return DEFLATE(x’) || c+1 || MAC(DEFLATE(x’) || c+1).
  - Otherwise, return x.
- Decompression of y:
  - If y = DEFLATE(x) || 0 || MAC(DEFLATE(x) || 0) for some input x, then return x.
  - Otherwise, if y = DEFLATE(x) || c || MAC(DEFLATE(x) || c) for some input x and counter c > 0, then return DEFLATE(x) || c-1 || MAC(DEFLATE(x) || c-1).
  - Otherwise, return y.
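A possible Python sketch of this keyed construction, using the standard-library hmac module for HMAC/SHA-256 truncated to 128 bits and zlib as a stand-in for DEFLATE (the key, framing and helper names are illustrative, not the actual format):

```python
import hmac, hashlib, zlib

CTR_LEN, MAC_LEN = 8, 16    # 64-bit counter, 128-bit truncated HMAC/SHA-256
K = b"secret key shared by the compression and decompression systems"

def mac(data: bytes) -> bytes:
    return hmac.new(K, data, hashlib.sha256).digest()[:MAC_LEN]

def wrap(body: bytes, ctr: int) -> bytes:
    """Build DEFLATE(x) || counter || MAC(DEFLATE(x) || counter)."""
    body_ctr = body + ctr.to_bytes(CTR_LEN, "big")
    return body_ctr + mac(body_ctr)

def parse(y: bytes):
    """Return (body, counter) if y is a well-formed, correctly MACed output."""
    if len(y) <= CTR_LEN + MAC_LEN:
        return None
    body_ctr, tag = y[:-MAC_LEN], y[-MAC_LEN:]
    if not hmac.compare_digest(tag, mac(body_ctr)):
        return None                            # forged or mangled: treat as plain data
    body, ctr = body_ctr[:-CTR_LEN], int.from_bytes(body_ctr[-CTR_LEN:], "big")
    try:
        zlib.decompress(body)
    except zlib.error:
        return None
    return body, ctr

def compress(x: bytes) -> bytes:
    d = zlib.compress(x, 9)
    if len(d) + CTR_LEN + MAC_LEN < len(x):    # shorter even with 192 bits of overhead
        return wrap(d, 0)
    p = parse(x)
    if p is not None:                          # input came from the compressor itself
        body, ctr = p
        return wrap(body, ctr + 1)             # reaching ctr = 2^64 - 1 is not practical
    return x

def decompress(y: bytes) -> bytes:
    p = parse(y)
    if p is None:
        return y
    body, ctr = p
    return zlib.decompress(body) if ctr == 0 else wrap(body, ctr - 1)
```

Since only a holder of K can produce a valid MAC, the only way to obtain an output with a high counter value is to call the compression system that many times, which is exactly the point of the security argument below.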
This construction achieves paradoxical compression because the problematic inputs (with counter c close to overflow) can be obtained only from the compression system itself (since we assume the MAC to be unforgeable), and the compression system produces new counter values only in minimal increments; getting a problematic input then requires too many (2^64 in this example) invocations of the compression system to be done in practice. If problematic inputs cannot be built, we win: nobody can make our compression/decompression system fail.
Removing the Shared Key Requirement
The construction above uses a MAC that relies on a shared secret value (the MAC key) between the compression and decompression systems. This is a restrictive model and we would like to remove this requirement. Can we build the same system, but in a keyless fashion?
We can, with a verifiable delay function (VDF). Such functions have been recently defined. In a nutshell, a VDF is a sort of hash function that takes a configurable time to compute (with no known shortcut) but also comes with a proof of correct computation, and the proof can be verified at a low computational cost. This is an extension of earlier time-lock puzzles. One can think of a VDF as a time-lock puzzle that can be verified to have been properly set.
In our paradoxical compression system described above, we used a MAC with a secret key in order to prevent outsiders from building problematic inputs. We can replace the MAC with a VDF, using the counter itself as the work factor: an input with a very high counter value cannot be built in practice, because that would mean a very high work factor. The secrecy of the key is replaced with a “computational wall”: the VDF becomes too expensive to compute for high counter values.
I implemented this method, using a VDF designed by Wesolowski. It is based on computing squarings modulo a big composite integer (i.e. an RSA modulus): raising an input x to the power 2^c modulo n is postulated to require c successive squarings modulo n, with no known “shortcut” if the factorization of n is not known. Wesolowski’s VDF comes with a compact proof of correct computation, which can be efficiently verified. The implementation is in C# (mostly because C# has a DEFLATE implementation in its standard library, and I already had some C# implementations of the SHA3/SHAKE hash functions, and of big integers with primality tests). The 2048-bit modulus was generated as a random RSA key (I discarded the prime factors; now nobody knows them). The code can be obtained on GitHub.
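For illustration only, here is a minimal Python sketch of a Wesolowski-style evaluate/verify pair (this is not the C# code from the repository; it uses the third-party sympy.isprime for the Fiat-Shamir hash-to-prime step, and it computes the quotient exponent directly, which is only workable for small counter values):

```python
import hashlib
from sympy import isprime   # third-party dependency, used only for the challenge prime

def hash_to_prime(x: int, y: int, c: int, n: int) -> int:
    """Fiat-Shamir challenge: derive a ~128-bit prime from the transcript."""
    seed = b"|".join(str(v).encode() for v in (x, y, c, n))
    ctr = 0
    while True:
        h = hashlib.sha256(seed + ctr.to_bytes(8, "big")).digest()
        cand = int.from_bytes(h[:16], "big") | 1
        if isprime(cand):
            return cand
        ctr += 1

def vdf_eval(x: int, c: int, n: int):
    """Sequential part: y = x^(2^c) mod n by c successive squarings,
    plus a Wesolowski proof pi = x^(floor(2^c / l)) mod n."""
    y = x % n
    for _ in range(c):
        y = (y * y) % n
    l = hash_to_prime(x, y, c, n)
    pi = pow(x, (1 << c) // l, n)    # fine for small c; real provers avoid building 2^c
    return y, pi

def vdf_verify(x: int, y: int, pi: int, c: int, n: int) -> bool:
    """Fast check: pi^l * x^(2^c mod l) == y (mod n), with l recomputed from the transcript."""
    l = hash_to_prime(x, y, c, n)
    r = pow(2, c, l)
    return (pow(pi, l, n) * pow(x, r, n)) % n == y % n

# Toy modulus with known (Mersenne prime) factors, for illustration only;
# a real deployment uses a 2048-bit modulus whose factorization nobody knows.
n = (2**61 - 1) * (2**89 - 1)
y, pi = vdf_eval(5, 1000, n)
assert vdf_verify(5, y, pi, 1000, n)
```

In the paradoxical-compression setting, the counter plays the role of c: verifying an output is always cheap, but producing an output with a counter close to overflow would require an astronomical number of sequential squarings.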
The C# code is open-source and you may reuse it as you will (under the MIT license), but of course achieving paradoxical compression is not, in fact, a very useful property: it does not solve any real, practical problem. However, it is an enjoyable party trick.