Unbounded length contexts for PPM

The prediction by partial matching (PPM) data compression scheme has set the performance standard in lossless compression of text throughout the past decade. The original algorithm was first published in 1984 by Cleary and Witten, and a series of improvements was described by Moffat (1990), culminating in a careful implementation, called PPMC, which has become the benchmark version. PPMC still achieves results superior to virtually all other compression methods, despite many attempts to better it. PPM is a finite-context statistical modeling technique that can be viewed as blending together several fixed-order context models to predict the next character in the input sequence. Prediction probabilities for each context in the model are calculated from frequency counts that are updated adaptively, and the symbol that actually occurs is encoded relative to its predicted distribution using arithmetic coding. This paper describes a new algorithm, PPM*, which exploits contexts of unbounded length. It reliably achieves compression superior to PPMC, although our current implementation uses considerably greater computational resources (both time and space). The paper first describes the basic PPM compression scheme, then shows how contexts of unbounded length are used and how the new method can be implemented using a tree data structure. Results are given that demonstrate an improvement of about 6% over the old method.
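The blending of fixed-order context models described above can be illustrated with a minimal sketch of a bounded-order PPM-style predictor. This is not the PPM* algorithm and not the authors' implementation. The class name `PPMSketch`, the helper `train`, and the parameters `max_order` and `alphabet_size` are hypothetical. The escape estimate (escape count equal to the number of distinct symbols seen in the context, as in the rule usually called method C) is an assumption, since the abstract does not specify one. Exclusions are omitted for brevity, so the probabilities do not sum exactly to one, and the arithmetic coder is left out entirely.

```python
from collections import defaultdict
from fractions import Fraction


class PPMSketch:
    """Toy PPM-style predictor that blends context models of order max_order..0."""

    def __init__(self, max_order=2, alphabet_size=256):
        self.max_order = max_order
        self.alphabet_size = alphabet_size
        # counts[context][symbol] = number of times symbol followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def probability(self, history, symbol):
        """Probability assigned to `symbol` after `history` (no exclusions)."""
        prob = Fraction(1)
        longest = min(self.max_order, len(history))
        for order in range(longest, -1, -1):
            context = history[len(history) - order:]
            seen = self.counts.get(context)
            if not seen:
                continue  # context never seen: fall back at no cost
            total = sum(seen.values())
            distinct = len(seen)
            if symbol in seen:
                # Assumed escape rule: symbol gets count / (total + distinct).
                return prob * Fraction(seen[symbol], total + distinct)
            # Symbol is novel in this context: pay for an escape and shorten.
            prob *= Fraction(distinct, total + distinct)
        # Fallback below order 0: uniform over the whole alphabet.
        return prob * Fraction(1, self.alphabet_size)

    def update(self, history, symbol):
        """Adaptively increment the counts in every context order."""
        longest = min(self.max_order, len(history))
        for order in range(longest + 1):
            context = history[len(history) - order:]
            self.counts[context][symbol] += 1


def train(model, text):
    """Feed `text` to the model one character at a time."""
    for i, ch in enumerate(text):
        model.update(text[:i], ch)


model = PPMSketch(max_order=1, alphabet_size=4)
train(model, "abab")
print(model.probability("abab", "a"))  # predicted directly in the order-1 context "b"
print(model.probability("abab", "b"))  # needs one escape down to order 0
```

In a real compressor the predicted distribution would be passed to an arithmetic coder, and `update` would be called after each symbol is coded so that encoder and decoder keep identical counts.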

[1]  Renato De Mori,et al.  A cache based natural language model for speech recognition , 1992 .

[2]  Ian H. Witten,et al.  The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression , 1991, IEEE Trans. Inf. Theory.

[3]  Paul G. Howard,et al.  The design and analysis of efficient lossless data compression systems , 1993 .

[4]  Alistair Moffat,et al.  Implementing the PPM data compression scheme , 1990, IEEE Trans. Commun..

[5]  W. Teahan,et al.  Experiments on the zero frequency problem , 1995, Proceedings DCC '95 Data Compression Conference.

[6]  Claude E. Shannon,et al.  Prediction and Entropy of Printed English , 1951 .

[7]  Ian H. Witten,et al.  Data Compression Using Adaptive Coding and Partial String Matching , 1984, IEEE Trans. Commun..

[8]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[9]  William J. Wilson.  Chinks in the armor of public key cryptosystems , 1994 .

[10]  Ben J. M. Smeets,et al.  Towards understanding and improving escape probabilities in PPM , 1997, Proceedings DCC '97. Data Compression Conference.

[11]  C. Q. Lee,et al.  The Computer Journal , 1958, Nature.

[12]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[13]  John G. Cleary,et al.  Unbounded Length Contexts for PPM , 1997 .

[14]  Timothy C. Bell,et al.  A Note on the DMC Data Compression Scheme , 1989, Computer/law journal.

[15]  Abraham Lempel,et al.  Compression of individual sequences via variable-rate coding , 1978, IEEE Trans. Inf. Theory.

[16]  John G. Cleary,et al.  The entropy of English using PPM-based models , 1996, Proceedings of Data Compression Conference - DCC '96.

[17]  Suzanne Bunton,et al.  Semantically Motivated Improvements for PPM Variants , 1997, Comput. J..

[18]  Edward R. Fiala,et al.  Data compression with finite windows , 1989, CACM.

[19]  Renato De Mori,et al.  A Cache-Based Natural Language Model for Speech Recognition , 1990, IEEE Trans. Pattern Anal. Mach. Intell..

[20]  N. Jesper Larsson.  Extended application of suffix trees to data compression , 1996, Proceedings of Data Compression Conference - DCC '96.

[21]  Ian H. Witten,et al.  Arithmetic coding for data compression , 1987, CACM.