Sublinear Algorithms for Approximating String Compressibility and the Distribution Support Size

We raise the question of approximating compressibility of a string with respect to a xed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and Lempel-Ziv (LZ), and present algorithms and lower bounds for approximating compressibility with respect to both schemes. While we obtain much stronger results for RLE in terms of the eciency of the algorithms, our investigation of LZ yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate compressibility of a string with respect to Lempel-Ziv to the number of distinct short substrings contained in it. We also show that compressibility with respect to LZ is related to approximating the support size of a distribution. This problem has been considered under dieren t guises in the literature. We prove a strong lower bound for it, at the heart of which is a construction of two positive integer random variables, X1 and X2, with very dieren t expectations and the following condition on the moments up to k: E[X1]= E[X2] = E[X 2