On-line stochastic processes in data compression

In presenting this dissertation in partial fulfillment of the requirements for the Doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with "fair use" as prescribed in the U.S. […] to whom the author has granted "the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform."

The ability to predict the future based upon the past in finite-alphabet sequences has many applications, including communications, data security, pattern recognition, and natural language processing. By Shannon's theory and the breakthrough development of arithmetic coding, any sequence $a_1 a_2 \cdots a_n$ can be encoded in a number of bits that is essentially equal to the minimal information-lossless codelength, $\sum_i -\log_2 p(a_i \mid a_1 \cdots a_{i-1})$. The goal of universal on-line modeling, and therefore of universal data compression, is to deduce the model of the input sequence $a_1 a_2 \cdots a_n$ that can estimate each $p(a_i \mid a_1 \cdots a_{i-1})$ knowing only $a_1 a_2 \cdots a_{i-1}$, so that the expected value of $-\log p(a_i \mid a_1 \cdots a_{i-1})$ is minimized. Thus, data compression has become both a routine application of on-line modeling techniques and a means for accurately measuring their empirical performance. The on-line modeling algorithm, Prediction By Partial Matching (PPM), has set the performance standard in data compression research since its introduction in 1984. PPM's success stems from its ad hoc probability estimator, which dynamically blends distinct frequency distributions contained in a single model into a probability estimate for each input symbol.
Meanwhile, the most conclusive asymptotic results use an information-theoretic metric to dynamically select a model from a set of competing models, and then use that selected model to estimate the currently scanned symbol's probability. Our hypothesis is that these apparently unrelated approaches can be combined to produce a semantically coherent technique that is arguably universal and which consistently outperforms existing techniques on actual data. To prove our hypothesis, we first give a semantics that unifies both forms of on-line modeling. Then we show how related but linguistically distinct model families fit the semantics, and give a new frequency update mechanism that is …
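
The codelength referred to in the abstract, the sum of $-\log_2 p(a_i \mid a_1 \cdots a_{i-1})$ over the sequence, can be illustrated with a very small on-line model. The sketch below is not PPM and is not the estimator developed in the dissertation. It assumes a context-free (order-0) model with add-one (Laplace) smoothing, and the function name `adaptive_code_length` is hypothetical. It only shows the on-line discipline: each symbol is predicted from the symbols already seen, and the counts are updated afterwards.

```python
import math
from collections import Counter


def adaptive_code_length(sequence, alphabet):
    """Ideal code length, in bits, of `sequence` under an adaptive order-0 model.

    Assumption for illustration only: add-one (Laplace) smoothing over a
    known finite alphabet. Symbol a_i is predicted from the counts of
    a_1 .. a_{i-1}, and only then is its own count updated.
    """
    counts = Counter()      # frequency of each symbol seen so far
    seen = 0                # number of symbols processed so far
    k = len(alphabet)       # alphabet size, needed by the smoothing term
    total_bits = 0.0
    for symbol in sequence:
        # Estimate p(a_i | a_1 .. a_{i-1}) from past symbols only.
        p = (counts[symbol] + 1) / (seen + k)
        # An arithmetic coder spends about -log2(p) bits on this symbol.
        total_bits += -math.log2(p)
        # Update the model after coding, so a decoder can stay in step.
        counts[symbol] += 1
        seen += 1
    return total_bits


# The predictions for "aaaa" over {a, b} are 1/2, 2/3, 3/4, 4/5,
# whose product is 1/5, so the total is log2(5) bits.
print(round(adaptive_code_length("aaaa", "ab"), 3))  # 2.322
```

Because the model is updated only after each symbol is coded, a decoder that starts from the same empty counts can reproduce every probability estimate, which is what makes the total a usable measure of the model's empirical performance.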
