Nonparametric Entropy Estimation for Stationary Processes and Random Fields, with Applications to English Text

We discuss a family of estimators for the entropy rate of a stationary ergodic process and prove their pointwise and mean consistency under a Doeblin-type mixing condition. The estimators are Cesàro averages of longest match-lengths, and their consistency follows from a generalized ergodic theorem due to Maker (1940). We provide examples of their performance on English text, and we generalize our results to countable-alphabet processes and to random fields.
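The match-length idea in the abstract can be illustrated with a short sketch. For each position i, one measures the length of the longest block starting at i that has already appeared in the past, and averages these lengths normalized by log i; the reciprocal of that Cesàro average estimates the entropy rate. The function names (`match_length`, `entropy_estimate`) and the exact normalization below are illustrative assumptions, not the paper's precise definitions:

```python
import math

def match_length(s, i):
    # Length of the longest substring starting at position i that also
    # appears as a contiguous block somewhere in the past s[:i], plus one.
    l = 0
    while i + l <= len(s) - 1 and s[i:i + l + 1] in s[:i]:
        l += 1
    return l + 1

def entropy_estimate(s):
    # Sliding-window match-length estimate of the entropy rate, in bits
    # per symbol: the reciprocal of the Cesaro average of the normalized
    # match lengths L_i / log2(i + 1).  A rough sketch only; the paper's
    # estimators and normalization may differ in detail.
    n = len(s)
    total = sum(match_length(s, i) / math.log2(i + 1) for i in range(1, n))
    return (n - 1) / total
```

On a highly repetitive string the match lengths grow quickly and the estimate is near zero, while on a high-complexity string it approaches the alphabet's maximal entropy; this qualitative behavior is what the consistency results make precise.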

[1] Amiel Feinstein, et al. Information and information stability of random variables and processes, 1964.

[2] Ioannis Kontoyiannis, et al. Prefixes and the entropy rate for long-range sources, 1994, Proceedings of 1994 IEEE International Symposium on Information Theory.

[3] P. Shields. Entropy and Prefixes, 1992.

[4] Benoist, et al. On the Entropy of DNA: Algorithms and Measurements based on Memory and Rapid Convergence, 1994.

[5] Anthony Quas. An entropy estimator for a class of infinite alphabet processes, 1999.

[6] Paul C. Shields, et al. Universal redundancy rates do not exist, 1993, IEEE Trans. Inf. Theory.

[7] Philip T. Maker. The ergodic theorem for a sequence of functions, 1940.

[8] Benjamin Weiss, et al. Entropy and data compression schemes, 1993, IEEE Trans. Inf. Theory.

[9] B. Pittel. Asymptotical Growth of a Class of Random Trees, 1985.

[10] Peter Grassberger, et al. Estimating the information content of symbol sequences and efficient codes, 1989, IEEE Trans. Inf. Theory.

[11] Abraham J. Wyner. The redundancy and distribution of the phrase lengths of the fixed-database Lempel-Ziv algorithm, 1997, IEEE Trans. Inf. Theory.

[12] Wojciech Szpankowski, et al. Asymptotic properties of data compression and suffix trees, 1993, IEEE Trans. Inf. Theory.

[13] Paul C. Shields, et al. Universal redundancy rates for the class of B-processes do not exist, 1995, IEEE Trans. Inf. Theory.

[14] A. Barron. The strong ergodic theorem for densities: generalized Shannon-McMillan-Breiman theorem, 1985.

[15] John H. Reif, et al. Using difficulty of prediction to decrease computation: fast sort, priority queue and convex hull on entropy bounded inputs, 1993, Proceedings of 1993 IEEE 34th Annual Foundations of Computer Science.

[16] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[17] Ioannis Kontoyiannis, et al. Stationary entropy estimation via string matching, 1996.

[18] John H. Reif, et al. Fast pattern matching for entropy bounded text, 1995, Proceedings DCC '95 Data Compression Conference.

[19] Abraham Lempel, et al. Compression of individual sequences via variable-rate coding, 1978, IEEE Trans. Inf. Theory.

[20] John G. Cleary, et al. The entropy of English using PPM-based models, 1996, Proceedings of Data Compression Conference - DCC '96.

[21] John G. Cleary, et al. Models of English text, 1997, Proceedings DCC '97 Data Compression Conference.

[22] Paul H. Algoet, et al. The strong law of large numbers for sequential decisions under uncertainty, 1994, IEEE Trans. Inf. Theory.

[23] Abraham Lempel, et al. A universal algorithm for sequential data compression, 1977, IEEE Trans. Inf. Theory.

[24] L. Breiman. The Individual Ergodic Theorem of Information Theory, 1957.

[25] Aaron D. Wyner, et al. Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression, 1989, IEEE Trans. Inf. Theory.

[26] W. Doeblin. Sur les propriétés asymptotiques de mouvements régis par certains types de chaînes simples [On the asymptotic properties of motions governed by certain types of simple chains], 1938.