On the Performance of Data Compression Algorithms Based Upon String Matching

Lossless and lossy data compression algorithms based on string matching are considered. In the lossless case, a result of Wyner and Ziv (1989) is extended. In the lossy case, a data compression algorithm based on approximate string matching is analyzed in two frameworks: (1) the database and the source together form a Markov chain of finite order; (2) the database and the source are independent, with the database drawn from a Markov model and the source from a general stationary, ergodic model. In either framework, the resulting compression rate is shown to converge with probability one to a quantity computable as the infimum of an information-theoretic functional over a set of auxiliary random variables; this quantity is strictly greater than the rate-distortion function of the source except in some symmetric cases. In particular, this result implies that the lossy algorithm proposed by Steinberg and Gutman (1993) is not optimal, even for memoryless or Markov sources.
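The abstract concerns compression schemes that encode the source by pointing to its longest match in a database string. As a rough illustration of that idea (not the paper's exact scheme; the function names and the greedy literal fallback are this sketch's own choices), a fixed-database parser can emit (position, length) pointers into the database:

```python
# Sketch of fixed-database string-matching compression: greedily replace
# prefixes of the source with pointers into a fixed database string.
# Hypothetical helper names; illustrative only.

def longest_match(database: str, source: str) -> tuple:
    """Return (position, length) of the longest prefix of `source`
    occurring verbatim in `database`, or (0, 0) if none occurs."""
    for length in range(len(source), 0, -1):
        pos = database.find(source[:length])
        if pos != -1:
            return pos, length
    return 0, 0

def parse(database: str, source: str) -> list:
    """Greedy parse of `source` into ("match", pos, length) pointers
    into `database`, with single-symbol literals as a fallback."""
    out, i = [], 0
    while i < len(source):
        pos, length = longest_match(database, source[i:])
        if length > 1:
            out.append(("match", pos, length))
            i += length
        else:
            out.append(("literal", source[i]))
            i += 1
    return out
```

The lossy algorithms analyzed in the paper relax the exact-match condition to matching within a distortion level; the paper's result is that the rate achieved by such schemes generally exceeds the rate-distortion function.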

[1] Abraham Lempel, et al. Compression of individual sequences via variable-rate coding, 1978, IEEE Trans. Inf. Theory.

[2] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[3] Abraham Lempel, et al. On the Complexity of Finite Sequences, 1976, IEEE Trans. Inf. Theory.

[4] Peter Elias, et al. Universal codeword sets and representations of the integers, 1975, IEEE Trans. Inf. Theory.

[5] Wojciech Szpankowski, et al. Asymptotic properties of data compression and suffix trees, 1993, IEEE Trans. Inf. Theory.

[6] Robert M. Gray, et al. Ergodicity of Markov channels, 1987, IEEE Trans. Inf. Theory.

[7] S. Arimoto, et al. Asymptotic properties of algorithms of data compression with fidelity criterion based on string matching, 1994, Proceedings of 1994 IEEE International Symposium on Information Theory.

[8] Aaron D. Wyner, et al. Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression, 1989, IEEE Trans. Inf. Theory.

[9] Benjamin Weiss, et al. Entropy and data compression schemes, 1993, IEEE Trans. Inf. Theory.

[10] J. Kieffer, et al. Markov Channels are Asymptotically Mean Stationary, 1981.

[11] A. D. Wyner, et al. The sliding-window Lempel-Ziv algorithm is asymptotically optimal, 1994, Proc. IEEE.

[12] Yossef Steinberg, et al. An algorithm for source coding subject to a fidelity criterion, based on string matching, 1993, IEEE Trans. Inf. Theory.

[13] P. Billingsley, et al. Convergence of Probability Measures, 1969.

[14] D. Halverson, et al. Discrete-time detection in ε-mixing noise, 1980, IEEE Trans. Inf. Theory.

[15] Paul C. Shields, et al. Waiting times: Positive and negative results on the Wyner-Ziv problem, 1993.

[16] Wojciech Szpankowski, et al. A suboptimal lossy data compression based on approximate pattern matching, 1997, IEEE Trans. Inf. Theory.

[17] En-Hui Yang, et al. On the redundancy of the fixed-database Lempel-Ziv algorithm for φ-mixing sources, 1997, IEEE Trans. Inf. Theory.

[18] Abraham Lempel, et al. A universal algorithm for sequential data compression, 1977, IEEE Trans. Inf. Theory.

[19] Aaron D. Wyner, et al. Improved redundancy of a version of the Lempel-Ziv algorithm, 1995, IEEE Trans. Inf. Theory.

[20] A. Nobel, et al. A recurrence theorem for dependent processes with applications to data compression, 1992, IEEE Trans. Inf. Theory.

[21] Zhen Zhang, et al. An on-line universal lossy data compression algorithm via continuous codebook refinement - Part II. Optimality for φ-mixing source models, 1996, IEEE Trans. Inf. Theory.

[22] John C. Kieffer, et al. Sample converses in source coding theory, 1991, IEEE Trans. Inf. Theory.