Text compression via alphabet re-representation

We consider re-representing the alphabet so that each character's representation reflects its properties as a predictor of future text. This lets an estimator from a restricted class map contexts to predictions of upcoming characters. We describe an algorithm that combines this idea with neural networks, and we compare the performance of its implementation with other compression methods, including UNIX compress, gzip, PPMC, and an alternative neural network approach.
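
As a rough illustration of the underlying idea only, and not the algorithm described in the paper (which uses neural networks), the sketch below represents each character by its smoothed next-character distribution, so that characters which predict similar continuations receive similar vectors. It then computes the ideal code length that an entropy coder could approach when driven by these order-1 predictions. The function names `char_representations` and `ideal_code_length_bits` and the smoothing parameter `alpha` are assumptions made for this example. The statistics are also fitted on the text being coded, so the cost of transmitting the model is ignored.

```python
import math
from collections import Counter, defaultdict


def char_representations(text, alpha=1.0):
    """Map each character to a vector: its smoothed next-character distribution.

    Hypothetical helper for illustration; alpha is an additive smoothing constant.
    """
    alphabet = sorted(set(text))
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    reps = {}
    for c in alphabet:
        total = sum(counts[c].values()) + alpha * len(alphabet)
        reps[c] = [(counts[c][n] + alpha) / total for n in alphabet]
    return alphabet, reps


def ideal_code_length_bits(text, alphabet, reps):
    """Sum of -log2 p(next | previous) over the text, in bits.

    The first character is charged log2(|alphabet|) bits (uniform code).
    """
    index = {c: i for i, c in enumerate(alphabet)}
    bits = math.log2(len(alphabet))
    for prev, nxt in zip(text, text[1:]):
        bits += -math.log2(reps[prev][index[nxt]])
    return bits


if __name__ == "__main__":
    sample = "the cat sat on the mat and the rat sat on the hat"
    alphabet, reps = char_representations(sample)
    bits = ideal_code_length_bits(sample, alphabet, reps)
    print(f"{bits / len(sample):.2f} bits per character")
```

In this sketch the representation of a character doubles as the prediction itself. A restricted estimator in the sense of the abstract would instead take the representations of several context characters as input and learn a mapping from them to a prediction.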

[5]  Philip M. Long, et al. Text compression via alphabet re-representation, 1999, Neural Networks.