On Hilberg's law and its links with Guiraud's law

Abstract: Hilberg (1990) supposed that the finite-order excess entropy of a random human text is proportional to the square root of the text length. Assuming that Hilberg's hypothesis is true, we derive Guiraud's law, which states that the number of word types in a text is greater than proportional to the square root of the text length. Our derivation is based on a mathematical conjecture in coding theory and on several experiments suggesting that words can be defined approximately as the nonterminals of the shortest context-free grammar for the text. Such an operational definition of words can be applied even to texts deprived of spaces, which do not allow for Mandelbrot's “intermittent silence” explanation of Zipf's and Guiraud's laws. In contrast to Mandelbrot's model, ours assumes some probabilistic long-memory effects in human narration and might be capable of explaining Menzerath's law.
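In symbols, the two laws paraphrased above can be written as follows (a minimal restatement under one standard formulation; the symbols H(n), h, E(n), V(N), and c are notation introduced here, not in the original):

\[ E(n) \;=\; H(n) - h\,n \;\propto\; \sqrt{n} \qquad \text{(Hilberg's hypothesis)}, \]
\[ V(N) \;\gtrsim\; c\,\sqrt{N}, \quad c > 0 \qquad \text{(Guiraud's law, as stated above)}, \]

where H(n) is the Shannon entropy of a block of n consecutive characters, h is the entropy rate, E(n) is the finite-order excess entropy, V(N) is the number of distinct word types in a text of length N, and c is a positive constant.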

[1] C. Shalizi, et al., Causal architecture, complexity and self-organization in time series and cellular automata, 2001.

[2] W. Hilberg, et al., Der bekannte Grenzwert der redundanzfreien Information in Texten - eine Fehlinterpretation der Shannonschen Experimente? [The well-known limit of redundancy-free information in texts - a misinterpretation of Shannon's experiments?], 1990.

[3] W. Bialek, et al., Predictability, Complexity, and Learning, 2002.

[4] Martin A. Nowak, et al., The evolution of syntactic communication, 2000, Nature.

[5] Benoit B. Mandelbrot, et al., Structure Formelle des Textes et Communication [Formal structure of texts and communication], 1954.

[6] Werner Ebeling, et al., Entropy of symbolic sequences: the role of correlations, 1991.

[7] Hinrich Schütze, et al., Foundations of Statistical Natural Language Processing, 1999.

[8] John A. Goldsmith, et al., Unsupervised Learning of the Morphology of a Natural Language, 2001, Computational Linguistics.

[9] O. Kallenberg, Foundations of Modern Probability, 2021, Probability Theory and Stochastic Modelling.

[10] Thomas M. Cover, et al., Elements of Information Theory, 2005.

[11] Daniel Ray Upper, et al., Theory and algorithms for hidden Markov models and generalized hidden Markov models, 1998.

[12] Lothar Hoffmann, et al., Beiträge zur Sprachstatistik [Contributions to language statistics], 1979.

[13] Marc Hug, Disturbing Factors in a Linguistic Usage Test, 1997, J. Quant. Linguistics.

[14] Lukasz Debowski, et al., A Revision of Coding Theory for Learning from Language, 2004, FGMOL.

[15] J. Wolff, et al., Language Acquisition and the Discovery of Phrase Structure, 1980, Language and Speech.

[16] Frederick Jelinek, et al., Statistical methods for speech recognition, 1997.

[17] G. Zipf, The Psycho-Biology of Language: An Introduction to Dynamic Philology, 1999.

[18] J. Rissanen, et al., Modeling by Shortest Data Description, 1978, Automatica.

[19] Damián H. Zanette, et al., New perspectives on Zipf's law: from single texts to large corpora, 2002.

[20] J. Marchal, Cours d'économie politique [Course in political economy], 1950.

[21] J. Gerard Wolff, et al., Language Acquisition and the Discovery of Phrase Structure, 1980.

[22] András Kornai, et al., How many words are there?, 2002, Glottometrics.

[23] Anastasios A. Tsonis, et al., Zipf's law and the structure and evolution of languages, 1997, Complexity.

[24] Ricard V. Solé, et al., Two Regimes in the Frequency of Words and the Origins of Complex Lexicons: Zipf's Law Revisited, 2001, J. Quant. Linguistics.

[25] G. Miller, et al., Some effects of intermittent silence, 1957, The American Journal of Psychology.

[26] En-Hui Yang, et al., Grammar-based codes: A new class of universal lossless source codes, 2000, IEEE Trans. Inf. Theory.

[27] Lukasz Debowski, Trigram morphosyntactic tagger for Polish, 2004, Intelligent Information Systems.

[28] Beáta Megyesi, Comparing Data-Driven Learning Algorithms for PoS Tagging of Swedish, 2001, EMNLP.

[29] George Kingsley Zipf, et al., Human behavior and the principle of least effort, 1949.

[30] Abhi Shelat, et al., Approximation algorithms for grammar-based compression, 2002, SODA '02.

[31] Charles Gide, et al., Cours d'économie politique [Course in political economy], 1911.

[32] Wentian Li, et al., Random texts exhibit Zipf's-law-like word frequency distribution, 1992, IEEE Trans. Inf. Theory.

[33] Valérie Berthé, et al., Conditional entropy of some automatic sequences, 1994.

[34] Carl de Marcken, et al., Unsupervised language acquisition, 1996, arXiv.

[35] Claude E. Shannon, et al., Prediction and Entropy of Printed English, 1951.

[36] Marcelo A. Montemurro, et al., Frequency-rank distribution of words in large text samples: phenomenology and models, 2002, Glottometrics.

[37] Eric Lehman, et al., Approximation algorithms for grammar-based data compression, 2002.

[38] W. Ebeling, et al., Finite sample effects in sequence analysis, 1994.

[39] J. Crutchfield, et al., Regularities unseen, randomness observed: levels of entropy convergence, 2001, Chaos.

[40] Gramss, Entropy of the symbolic sequence for critical circle maps, 1994, Physical Review E.

[41] Craig G. Nevill-Manning, et al., Inferring Sequential Structure, 1996.

[42] Werner Ebeling, et al., Word frequency and entropy of symbolic sequences: a dynamical perspective, 1992.

[43] Murray Gell-Mann, et al., Fundamental sources of unpredictability, 1997.

[44] Krzysztof Trojanowski, et al., Intelligent Information Processing and Web Mining, 2008.

[45] Ming Li, et al., An Introduction to Kolmogorov Complexity and Its Applications, 2019, Texts in Computer Science.

[46] Peter Grassberger, Data Compression and Entropy Estimates by Non-sequential Recursive Pair Substitution, 2002.

[47] W. Ebeling, et al., Entropy and Long-Range Correlations in Literary English, 1993, cond-mat/0204108.