On Hilberg's law and its links with Guiraud's law

Hilberg (1990) hypothesized that the finite-order excess entropy of a random human text is proportional to the square root of the text length. Assuming that Hilberg's hypothesis is true, we derive Guiraud's law, which states that the number of word types in a text grows at least as fast as the square root of the text length. Our derivation rests on a mathematical conjecture in coding theory and on several experiments suggesting that words can be defined approximately as the nonterminals of the shortest context-free grammar for the text. Such an operational definition of words can be applied even to texts without spaces, for which Mandelbrot's "intermittent silence" explanation of Zipf's and Guiraud's laws fails. In contrast to Mandelbrot's model, ours assumes probabilistic long-memory effects in human narration and might be capable of explaining Menzerath's law.
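
In symbols (our own gloss, not notation quoted from the paper): with E(n) the excess entropy of the first n characters of a text and V(n) the number of distinct word types it contains, the two laws read

    E(n) ∝ √n   (Hilberg's hypothesis),        V(n) ≥ c·√n   (Guiraud's law).

Computing the truly shortest context-free grammar for a string is intractable, which is why approximation algorithms are studied in [7] and [36]; any experiment must therefore rely on a greedy surrogate. The Python sketch below is a minimal recursive pair substitution in the spirit of [13], [17], and [37]. It is our illustration only, not the paper's actual procedure; the helper name pair_substitution_grammar and the use of the nonterminal count as a proxy for the word-type count V(n) are our assumptions.

    from collections import Counter

    def pair_substitution_grammar(text):
        """Greedily replace the most frequent adjacent pair of symbols with
        a fresh nonterminal until no pair occurs twice; return the final
        sequence (the start rule) and the dictionary of grammar rules."""
        seq, rules = list(text), {}
        while True:
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            (a, b), freq = pairs.most_common(1)[0]
            if freq < 2:
                break
            nt = "N%d" % len(rules)            # fresh nonterminal symbol
            rules[nt] = (a, b)
            out, i = [], 0
            while i < len(seq):                # left-to-right, non-overlapping
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    out.append(nt)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        return seq, rules

    start, rules = pair_substitution_grammar(
        "how can a clam cram in a clean cream can")
    print(len(rules))  # nonterminal count: a crude stand-in for word types

Plotting len(rules) against growing prefixes of a corpus is then one simple way to check whether the induced nonterminal vocabulary follows a Guiraud-like square-root growth.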

[1] Bulletin de la Classe des sciences, Palais des Académies, Bruxelles, 1973.

[2] Wentian Li. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Trans. Inf. Theory, 1992.

[3] Valérie Berthé. Conditional entropy of some automatic sequences. 1994.

[4] Anastasios A. Tsonis et al. Zipf's law and the structure and evolution of languages. Complexity, 1997.

[5] Craig G. Nevill-Manning. Inferring Sequential Structure. 1996.

[6] Frederick Jelinek. Statistical Methods for Speech Recognition. 1997.

[7] Eric Lehman. Approximation algorithms for grammar-based data compression. 2002.

[8] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. 1999.

[9] W. Bialek, I. Nemenman, and N. Tishby. Predictability, Complexity, and Learning. 2002.

[10] O. Kallenberg. Foundations of Modern Probability. Probability Theory and Stochastic Modelling, 2021.

[11] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. 2005.

[12] W. Ebeling et al. Entropy and Long-Range Correlations in Literary English. cond-mat/0204108, 1993.

[13] John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory, 2000.

[14] George Kingsley Zipf. Human Behavior and the Principle of Least Effort. 1949.

[15] T. Gramss. Entropy of the symbolic sequence for critical circle maps. Physical Review E, 1994.

[16] C. R. Shalizi. Causal architecture, complexity and self-organization in time series and cellular automata. 2001.

[17] Peter Grassberger. Data Compression and Entropy Estimates by Non-sequential Recursive Pair Substitution. 2002.

[18] Lothar Hoffmann et al. Beiträge zur Sprachstatistik [Contributions to language statistics]. 1979.

[19] W. Hilberg. Der bekannte Grenzwert der redundanzfreien Information in Texten - eine Fehlinterpretation der Shannonschen Experimente? [The well-known limit of redundancy-free information in texts - a misinterpretation of Shannon's experiments?]. 1990.

[20] Lukasz Debowski. A Revision of Coding Theory for Learning from Language. FGMOL, 2004.

[21] Ming Li and Paul M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Texts in Computer Science, 2019.

[22] Carl de Marcken. Unsupervised Language Acquisition. ArXiv, 1996.

[23] G. A. Miller. Some effects of intermittent silence. The American Journal of Psychology, 1957.

[24] J. Marchal. Cours d'économie politique [Course in political economy]. 1950.

[25] J. Wolff. Language Acquisition and the Discovery of Phrase Structure. Language and Speech, 1980.

[26] Lukasz Debowski. Trigram morphosyntactic tagger for Polish. Intelligent Information Systems, 2004.

[27] Benoit B. Mandelbrot. Structure Formelle des Textes et Communication [Formal structure of texts and communication]. 1954.

[28] Werner Ebeling et al. Word frequency and entropy of symbolic sequences: a dynamical perspective. 1992.

[29] Claude E. Shannon. Prediction and Entropy of Printed English. 1951.

[30] Patrick Billingsley. Probability and Measure. 1986.

[31] R. Ferrer i Cancho and Ricard V. Solé. Two Regimes in the Frequency of Words and the Origins of Complex Lexicons: Zipf's Law Revisited. J. Quant. Linguistics, 2001.

[32] Beáta Megyesi. Comparing Data-Driven Learning Algorithms for PoS Tagging of Swedish. EMNLP, 2001.

[33] J. Rissanen. Modeling by Shortest Data Description. Automatica, 1978.

[34] Daniel Ray Upper. Theory and algorithms for hidden Markov models and generalized hidden Markov models. 1998.

[35] Marc Hug. Disturbing Factors in a Linguistic Usage Test. J. Quant. Linguistics, 1997.

[36] Abhi Shelat et al. Approximation algorithms for grammar-based compression. SODA, 2002.

[37] W. Ebeling et al. Finite sample effects in sequence analysis. 1994.

[38] Martin A. Nowak et al. The evolution of syntactic communication. Nature, 2000.

[39] András Kornai. How many words are there? Glottometrics, 2002.

[40] John A. Goldsmith. Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics, 2001.

[41] Werner Ebeling et al. Entropy of symbolic sequences: the role of correlations. 1991.