The Relaxed Hilberg Conjecture: A Review and New Experimental Support

Abstract: The relaxed Hilberg conjecture states that the mutual information between two adjacent blocks of text in natural language grows as a power of the block length. The present paper reviews recent results concerning this conjecture. First, the relaxed Hilberg conjecture holds when texts repeatedly describe a random reality and Herdan's law is obeyed for the facts so described. Second, the relaxed Hilberg conjecture implies Herdan's law for set phrases, which can be linked to the better-known Herdan law for words. Third, the relaxed Hilberg conjecture is tested positively, using the Lempel-Ziv universal code, on a selection of texts in English, German, and French. Hence the relaxed Hilberg conjecture appears to be a plausible and important hypothesis about natural language.
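
For reference, the power-law growth of block mutual information mentioned in the abstract is conventionally stated as follows; the block notation and the exponent symbol $\beta$ follow the standard conventions of the Hilberg literature rather than symbols introduced in the abstract itself:

\[
  \mathbb{E}\, I\!\left(X_{1}^{n};\, X_{n+1}^{2n}\right) \;\propto\; n^{\beta}, \qquad 0 < \beta < 1,
\]

where $X_{1}^{n}$ denotes the first $n$ characters of a text and $X_{n+1}^{2n}$ the adjacent block of the same length. Hilberg's original conjecture corresponds roughly to $\beta \approx 1/2$; the relaxed version allows any exponent in $(0,1)$.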

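As a minimal illustration of the third point, the sketch below estimates block mutual information with an off-the-shelf LZ-family compressor. It uses zlib's DEFLATE (an LZ77 derivative) as a stand-in for the Lempel-Ziv universal code named in the abstract, and "sample.txt" is a placeholder corpus path; both are assumptions, not the paper's actual experimental setup.

import zlib

def compressed_length(data: bytes) -> int:
    # Length in bits of the zlib-compressed data, used as a crude
    # proxy for the Lempel-Ziv code length of the block.
    return 8 * len(zlib.compress(data, 9))

def block_mutual_information(text: bytes, n: int) -> int:
    # I(X;Y) ~= C(X) + C(Y) - C(XY) for adjacent n-byte blocks X, Y,
    # where C is the compressed length.
    x, y = text[:n], text[n:2 * n]
    return (compressed_length(x) + compressed_length(y)
            - compressed_length(x + y))

if __name__ == "__main__":
    with open("sample.txt", "rb") as f:  # placeholder corpus file
        text = f.read()
    # Print I(n) for geometrically growing block lengths; under the
    # relaxed Hilberg conjecture, I(n) grows roughly like n**beta.
    for n in (2 ** k for k in range(8, 18)):
        if 2 * n > len(text):
            break
        print(n, block_mutual_information(text, n))

Under the conjecture, the printed values of I(n) should fall roughly on a straight line in log-log coordinates, with slope beta; fitting that slope is how a power law would be read off such an experiment.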