Text compression via alphabet re-representation

This article introduces the concept of alphabet re-representation in the context of text compression. We consider re-representing the alphabet so that a representation of a character reflects its properties as a predictor of future text. This enables us to use an estimator from a restricted class to map contexts to predictions of upcoming characters. We describe an algorithm that uses this idea in conjunction with neural networks. The performance of our implementation is compared to other compression methods, such as UNIX compress, gzip, PPMC, and an alternative neural network approach.
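The core idea can be illustrated with a toy sketch. It is not the algorithm from the article, which pairs re-representation with neural networks. As stand-in assumptions, each character is represented here by smoothed frequency vectors of the characters that follow it, and the restricted estimator is a fixed convex mixture of the representations of the last two context characters. All function names and the weight `w` are hypothetical, and the model is fitted and scored on the same text, so the reported rate is optimistic.

```python
import math
from collections import Counter


def successor_table(text, alphabet, lag, alpha=1.0):
    """Represent each character c by a smoothed distribution over the
    character that appears `lag` positions after c (assumed representation)."""
    counts = {c: Counter() for c in alphabet}
    for i in range(len(text) - lag):
        counts[text[i]][text[i + lag]] += 1
    table = {}
    for c in alphabet:
        total = sum(counts[c].values()) + alpha * len(alphabet)
        table[c] = [(counts[c][d] + alpha) / total for d in alphabet]
    return table


def predict(context, lag1, lag2, w=0.7):
    """Restricted estimator: a convex mixture of the representations of the
    two most recent context characters. Returns a distribution over the alphabet."""
    r1 = lag1[context[-1]]  # what the previous character predicts
    r2 = lag2[context[-2]]  # what the character two back predicts
    return [w * a + (1 - w) * b for a, b in zip(r1, r2)]


def bits_per_char(text, alphabet, lag1, lag2):
    """Ideal code length (-log2 p per symbol) that an arithmetic coder
    driven by these predictions would approach."""
    index = {c: i for i, c in enumerate(alphabet)}
    total = 0.0
    for i in range(2, len(text)):
        p = predict(text[i - 2:i], lag1, lag2)
        total += -math.log2(p[index[text[i]]])
    return total / (len(text) - 2)


text = "abracadabra " * 50
alphabet = sorted(set(text))
lag1 = successor_table(text, alphabet, 1)
lag2 = successor_table(text, alphabet, 2)
rate = bits_per_char(text, alphabet, lag1, lag2)
print(f"{rate:.3f} bits/char vs. {math.log2(len(alphabet)):.3f} for a uniform code")
```

The point of the sketch is only that the predictor never sees raw character identities: it works on vectors that already encode how each character predicts what follows, which is what allows a very simple estimator class to be used.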

Philip M. Long et al. Text compression via alphabet re-representation. Proceedings DCC '97, Data Compression Conference, 1997.
