Lossless compression algorithms such as DEFLATE strive to reliably process arbitrary inputs, while achieving compressed sizes as low as possible for commonly encountered data inputs. It is well known that it is mathematically impossible for a compression algorithm to simultaneously achieve non-trivial compression on some inputs (i.e. compress these inputs into strictly shorter outputs) and never expand any other input (i.e. guarantee that every input is compressed into an output no longer than the input); this is a direct application of the “pigeonhole principle”. Despite this mathematical impossibility, we show in this paper how to build such paradoxical compression and decompression algorithms, with the aid of some tools from cryptography, notably verifiable delay functions, and, of course, by slightly cheating.

1 Paradoxical Compression

The pigeonhole principle is the colloquial name for the general remark that, given two finite sets S1 and S2, there cannot exist an injective map from S1 to S2 if the cardinality of S1 is strictly greater than that of S2. The principle appears to have been used by mathematicians since at least the early 17th century [8] and was first formalized by Dirichlet two centuries later [5]. Dirichlet’s metaphor involved drawers, which, through some later translation mishaps, led to the name “pigeonhole” and the description of the principle involving birds in a dovecote [10]. In rough terms, if you have more pigeons than holes, you cannot put each pigeon alone in a hole; there must be at least one hole where you will cram two pigeons together, or a pigeon with no hole to sit in.

This principle applies to lossless compression algorithms. We consider the set B of ordered finite sequences of bits; each sequence x ∈ B has a length denoted len(x), which is the number of bits in the sequence. For any integer n ≥ 0, there are precisely 2^n possible bit sequences of length n, and 2^(n+1) − 1 possible bit sequences of length at most n.
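These counting facts can be checked directly for small n; the following quick sketch (an illustration, not part of the paper's formal argument) enumerates all bit sequences and compares the counts against the closed-form expressions:

```python
# Count bit sequences by brute-force enumeration, for small n:
# there are 2^n sequences of length exactly n, and 2^(n+1) - 1
# sequences of length at most n (the empty sequence included).
from itertools import product

for n in range(6):
    exact = sum(1 for _ in product("01", repeat=n))               # length == n
    at_most = sum(1 for k in range(n + 1)
                  for _ in product("01", repeat=k))               # length <= n
    assert exact == 2 ** n
    assert at_most == 2 ** (n + 1) - 1
```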
A lossless compression algorithm is defined as a pair of computable functions C and D, each taking bit sequences as input and output, with the following characteristics:

– C takes as input any bit sequence x ∈ B, and outputs a corresponding bit sequence C(x). As a computable function, C may be randomized, i.e. there is no requirement that for a given x, the exact same output C(x) is always obtained.
– D takes as input a bit sequence y ∈ B, and outputs a corresponding bit sequence D(y); for some inputs y, D may instead return ⊥, a symbolic value distinct from all bit sequences which represents decompression failure. For a given input y, D must always return the same output (i.e. the implementation of D may be randomized, but, as a function, it is deterministic).
– For any x ∈ B, D(C(x)) = x. This is what “lossless” means: no information about x is lost in the compression process, and x can be recovered through decompression.
– For inputs x that are expected to occur in a given usage context, the average compressed length len(C(x)) is lower than the average uncompressed length len(x).

Compression is useful under the assumption that “normal data” is not uniformly distributed; i.e., for a given bit length n, a small fraction of bit sequences of that length are much more likely to appear as inputs to C than all others, and these inputs may thus be encoded into a shorter format. Commonly used general-purpose compression algorithms such as DEFLATE [4] (used in the well-known GZip and Zlib formats, and in the PNG image format) exploit repeated sequences of input bits (or bytes), as well as the non-uniform distribution of input code units typical of text-based data formats, to achieve non-negligible compression ratios at moderate computational cost. If compression can reduce the length of some inputs, it must necessarily increase the length of some other inputs.
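As an illustration of this definition (and not of the construction in this paper), the standard zlib module gives a concrete (C, D) pair built on DEFLATE. It satisfies losslessness and shortens redundant inputs, and, exactly as the pigeonhole principle dictates, it expands other inputs:

```python
# A concrete (C, D) pair built on DEFLATE via Python's zlib module.
import os
import zlib
from typing import Optional

def C(x: bytes) -> bytes:
    return zlib.compress(x)

def D(y: bytes) -> Optional[bytes]:
    try:
        return zlib.decompress(y)
    except zlib.error:
        return None  # plays the role of the symbolic failure value ⊥

text = b"abab" * 100      # highly redundant: compresses well
rand = os.urandom(100)    # uniformly random: expands (format overhead)

assert D(C(text)) == text and D(C(rand)) == rand   # lossless
assert len(C(text)) < len(text)                    # some inputs shrink...
assert len(C(rand)) > len(rand)                    # ...so others must grow
```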
We will call paradoxical compression a lossless compression algorithm (C, D) such that:

– For any x ∈ B, len(C(x)) ≤ len(x).
– There exists at least one input x0 ∈ B such that len(C(x0)) < len(x0).

The second condition implies that C is not simply a length-preserving permutation (e.g. the identity): some data can really be “compressed”. Notwithstanding, there need not be many inputs that can be thus compressed.

The pigeonhole principle expresses the fact that paradoxical compression is impossible (hence the name). Indeed, if n = len(C(x0)), then consider the set Bn of all bit sequences of length at most n: for any x ∈ Bn, paradoxical compression ensures that len(C(x)) ≤ len(x) ≤ n, hence C(x) ∈ Bn. The input x0 is not part of Bn (by construction, len(x0) > len(C(x0)) = n), and thus the cardinality of S1 = Bn ∪ {x0} is strictly greater than the cardinality of S2 = Bn (namely, #S1 = 2^(n+1) = #S2 + 1); therefore, there cannot be an injective map from S1 to S2. This implies that there must be two distinct inputs x and x′ in S1 that are compressed into the same output, i.e. C(x) = C(x′). The decompressor, applied to that shared output value, cannot return both x and x′, which means that compression is not lossless.

Now that we have established that paradoxical compression is mathematically impossible, we will show in this paper how to achieve it. Of course there is a cheat: we will subtly change the rules of the game.

Consider the unfortunately not too rare situation of a scammer, peddler of some “infinite compression” scheme, selling an algorithm (usually as a software or hardware black box) that claims to be able to reduce the size of any input, and still decompress the result back into the original data with no loss. This is a claim much stronger than paradoxical compression (in which it is merely claimed that no input is made strictly longer, but some inputs might not be made strictly shorter).
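The collision predicted by the counting argument can be exhibited concretely. The toy compressor below (a deliberately broken sketch, not from the paper) drops one trailing '0' bit: it never expands its input and strictly shortens some inputs, so it meets both paradoxical conditions, and an exhaustive search over short sequences immediately finds two inputs with the same output:

```python
# Replay the pigeonhole argument on a toy "paradoxical" compressor.
# Bit sequences are represented as strings of '0'/'1' characters.
from itertools import product

def toy_C(x: str) -> str:
    # Never expands; strictly shortens any input ending in '0'.
    return x[:-1] if x.endswith("0") else x

n = 3
seen = {}          # maps output -> first input that produced it
collision = None
for length in range(n + 1):
    for bits in product("01", repeat=length):
        x = "".join(bits)
        y = toy_C(x)
        assert len(y) <= len(x)        # the "never expands" condition
        if y in seen:                  # non-injectivity, as predicted
            collision = (seen[y], x)
            break
        seen[y] = x
    if collision:
        break

# The very first collision: the empty sequence and '0' share an output.
print("collision:", collision)         # → collision: ('', '0')
```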
In an abstract model, the scammer is defeated by simply trying uniformly random inputs:

– Choose an input length n.
– Choose a uniformly random sequence x of n bits.
– Obtain y = C(x) and then z = C(y).
– Verify that len(z) < len(y) < n and that D(D(z)) = x.
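For illustration, here is that test run with zlib's DEFLATE standing in for the scammer's black box (at byte rather than bit granularity; zlib of course makes no infinite-compression claim). Uniformly random data defeats the claimed size reduction, even though decompression remains lossless:

```python
# Run the scammer-defeating test against zlib as a stand-in black box.
import os
import zlib

n = 1000
x = os.urandom(n)            # uniformly random n-byte input
y = zlib.compress(x)         # y = C(x)
z = zlib.compress(y)         # z = C(y)

claim_holds = len(z) < len(y) < n
assert not claim_holds                           # random data expands
assert zlib.decompress(zlib.decompress(z)) == x  # losslessness does hold
print("len(x), len(y), len(z):", n, len(y), len(z))
```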
[1] Krzysztof Pietrzak et al. Simple Verifiable Delay Functions. IACR Cryptol. ePrint Arch., 2018.
[2] Benjamin Wesolowski et al. Efficient Verifiable Delay Functions. Journal of Cryptology, 2019.
[3] Dan Boneh et al. Verifiable Delay Functions. IACR Cryptol. ePrint Arch., 2018.
[4] Peter Deutsch et al. DEFLATE Compressed Data Format Specification version 1.3. RFC, 1996.
[5] Dan Boneh et al. A Survey of Two Verifiable Delay Functions. IACR Cryptol. ePrint Arch., 2018.
[6] Albrecht Heeffer et al. The Pigeonhole Principle, Two Centuries Before Dirichlet. 2014.