Lossless Data Compression

This thesis makes several contributions to the field of data compression. Lossless data compression algorithms shorten the description of input objects, such as sequences of text, in a way that allows perfect recovery of the original object. Such algorithms exploit the fact that input objects are not uniformly distributed: by allocating shorter descriptions to more probable objects and longer descriptions to less probable objects, the expected length of the compressed output can be made shorter than the object’s original description. Compression algorithms can be designed to match almost any given probability distribution over input objects. This thesis employs probabilistic modelling, Bayesian inference, and arithmetic coding to derive compression algorithms for a variety of applications, making the underlying probability distributions explicit throughout. A general compression toolbox is described, consisting of practical algorithms for compressing data distributed according to various fundamental probability distributions, and mechanisms for combining these algorithms in a principled way. Building on the compression toolbox, new mathematical theory is introduced for compressing objects with an underlying combinatorial structure, such as permutations, combinations, and multisets. An example application is given that compresses unordered collections of strings, even if the strings in the collection are individually incompressible. For text compression, a novel unifying construction is developed for a family of context-sensitive compression algorithms. Special cases of this family include the PPM algorithm and the Sequence Memoizer, an unbounded-depth hierarchical Pitman–Yor process model. It is shown how these algorithms are related, what their probabilistic models are, and how they produce fundamentally similar results. The work concludes with experimental results, example applications, and a brief discussion of cost-sensitive compression and adversarial sequences.
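The principle of giving shorter descriptions to more probable objects can be illustrated with a minimal sketch. It is not taken from the thesis: the four-symbol distribution and the function names below are assumptions chosen for illustration, and the code uses Shannon code lengths (the ceiling of -log2 p) rather than arithmetic coding.

```python
import math

def shannon_code_lengths(probs):
    """Code length in bits for each symbol: ceil(-log2 p).

    These lengths always satisfy the Kraft inequality, so a prefix
    code with exactly these lengths exists.
    """
    return {s: math.ceil(-math.log2(p)) for s, p in probs.items() if p > 0}

def expected_length(probs, lengths):
    """Average number of bits per symbol under the distribution."""
    return sum(p * lengths[s] for s, p in probs.items() if p > 0)

def entropy(probs):
    """Shannon entropy in bits: the lower bound on expected length."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# Hypothetical non-uniform distribution over four symbols.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

lengths = shannon_code_lengths(probs)
avg_bits = expected_length(probs, lengths)

# A fixed-length description ignores the probabilities entirely.
fixed_bits = math.ceil(math.log2(len(probs)))

# Kraft sum: must not exceed 1 for a prefix code to exist.
kraft_sum = sum(2.0 ** -l for l in lengths.values())

print(lengths, avg_bits, entropy(probs), fixed_bits, kraft_sum)
```

For this distribution the probable symbol "a" receives a 1-bit description and the rarer symbols receive 3 bits, so the expected length falls below the 2 bits per symbol of a fixed-length description. The probabilities here are exact powers of two, which is why the expected length coincides with the entropy; for general distributions a technique such as arithmetic coding is needed to approach that bound closely.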
