Maximal Repetition and Zero Entropy Rate

Maximal repetition of a string is the maximal length of a repeated substring. This paper investigates the maximal repetition of strings drawn from stochastic processes. Strengthening previous results, two new bounds for the almost sure growth rate of maximal repetition are identified: an upper bound in terms of the conditional Rényi entropy of order $\gamma > 1$ given a sufficiently long past, and a lower bound in terms of the unconditional Shannon entropy ($\gamma = 1$). Both the upper and the lower bound can be proved using an inequality for the distribution of the recurrence time. We also supply an alternative proof of the lower bound that makes use of an inequality for the expectation of the subword complexity. In particular, it is shown that a power-law logarithmic growth of maximal repetition with respect to the string length, recently observed for texts in natural language, may hold only if the conditional Rényi entropy rate given a sufficiently long past equals zero. According to this observation, natural language cannot be faithfully modeled by a typical hidden Markov process, a basic class of language models used in computational linguistics.
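As an illustration of the definition above, here is a minimal sketch of how the maximal repetition of a short string can be computed. The function name maximal_repetition is ours, and occurrences of the repeated substring are allowed to overlap, which is one common reading of "repeated substring"; the sketch is illustrative rather than an efficient algorithm.

```python
def maximal_repetition(w: str) -> int:
    """Length of the longest substring occurring at least twice in w
    (occurrences may overlap).  Binary search over candidate lengths,
    roughly O(n^2 log n) time, adequate for short illustrative inputs."""

    def has_repeat(k: int) -> bool:
        # Does some substring of length k occur at least twice in w?
        seen = set()
        for i in range(len(w) - k + 1):
            sub = w[i:i + k]
            if sub in seen:
                return True
            seen.add(sub)
        return False

    # Any prefix of a repeated substring is itself repeated, so the
    # predicate has_repeat(k) is monotone in k and binary search applies.
    lo, hi = 0, max(len(w) - 1, 0)   # a repeated substring has length <= n - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(mid):
            lo = mid                  # a repeat of length mid exists; try longer
        else:
            hi = mid - 1
    return lo


if __name__ == "__main__":
    print(maximal_repetition("abababa"))  # 5 -- "ababa" occurs at positions 0 and 2
    print(maximal_repetition("abcd"))     # 0 -- no substring is repeated
```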

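For reference, the standard unconditional Rényi entropy of order $\gamma$ and its Shannon limit at $\gamma = 1$ are recalled below. This is a textbook definition, not the paper's exact quantity: conditional Rényi entropy admits several non-equivalent definitions, so the specific variant conditioned on a sufficiently long past should be taken from the paper itself.

```latex
% Unconditional R\'enyi entropy of order \gamma for a discrete distribution p.
% The Shannon entropy is recovered in the limit \gamma -> 1.
\[
  H_{\gamma}(X) \;=\; \frac{1}{1-\gamma}\,\log \sum_{x} p(x)^{\gamma},
  \qquad \gamma > 0,\ \gamma \neq 1,
\]
\[
  \lim_{\gamma \to 1} H_{\gamma}(X) \;=\; H(X) \;=\; -\sum_{x} p(x)\,\log p(x).
\]
```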