Predictive encoding in text compression

Abstract: In predictive text compression, characters are encoded one at a time on the basis of a few preceding characters. Using this contextual knowledge makes compression more effective than coding each character independently of its neighbors. In the simplest case we merely try to guess the next character and encode the success or failure of the guess. More generally, the preceding substring determines the probability distribution of its successor, which provides a basis for encoding. This article presents three compression methods of increasing power, paying special attention to the trade-off between compression gain and processing time. As for speed, hashing turns out to be an ideal technique for maintaining the prediction information. The best gain is achieved by applying optimal arithmetic coding to the successor information extracted from the dependencies between characters.
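The simplest case described above (guess the next character from a hashed context, then encode success or failure) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the multiplicative hash, the table size, the context length `order`, and the "flag plus literal" output format are all assumptions chosen for illustration.

```python
def _slot(context: str, table_size: int) -> int:
    """Map a context string to a table slot (illustrative multiplicative hash)."""
    h = 0
    for ch in context:
        h = (h * 31 + ord(ch)) % table_size
    return h


def encode(text: str, order: int = 2, table_size: int = 4096):
    """Return a list of (hit, literal) pairs.

    hit=True  : the table guessed the character; only a flag is needed.
    hit=False : the guess failed; the literal character follows the flag.
    """
    table = [None] * table_size  # one predicted successor per hashed context
    symbols = []
    context = ""
    for ch in text:
        slot = _slot(context, table_size)
        if table[slot] == ch:
            symbols.append((True, None))
        else:
            symbols.append((False, ch))
            table[slot] = ch  # remember the latest successor of this context
        context = (context + ch)[-order:] if order > 0 else ""
    return symbols


def decode(symbols, order: int = 2, table_size: int = 4096) -> str:
    """Rebuild the text by replaying the same table updates as the encoder."""
    table = [None] * table_size
    chars = []
    context = ""
    for hit, literal in symbols:
        slot = _slot(context, table_size)
        if hit:
            ch = table[slot]
        else:
            ch = literal
            table[slot] = ch
        chars.append(ch)
        context = (context + ch)[-order:] if order > 0 else ""
    return "".join(chars)


def encoded_bits(symbols, literal_bits: int = 8) -> int:
    """Size estimate: 1 flag bit per character, plus a literal on each miss."""
    return sum(1 if hit else 1 + literal_bits for hit, _ in symbols)


if __name__ == "__main__":
    sample = "abcabcabcabc"
    coded = encode(sample)
    assert decode(coded) == sample
    print(encoded_bits(coded), "bits versus", 8 * len(sample), "bits raw")
```

Because the decoder updates its table exactly as the encoder does, no prediction information has to be transmitted. Hash collisions only cost extra misses; they do not break correctness. The more powerful variants mentioned in the abstract would replace the single guess per context with a probability distribution fed to an arithmetic coder, which this sketch does not attempt.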
