On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts

This paper presents a new interpretation of the Zipf–Mandelbrot law in natural language, resting on two areas of information theory. First, we construct a new class of grammar-based codes and, second, we investigate properties of strongly nonergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: if a text of length n describes n^β independent facts in a repetitive way, then the text contains at least n^β / log n different words, under suitable conditions on n. In the formal statement, two modeling postulates are adopted. First, the words are understood as the nonterminal symbols of the shortest grammar-based encoding of the text. Second, the text is assumed to be emitted by a finite-energy strongly nonergodic source, whereas the facts are binary IID variables predictable in a shift-invariant way.
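Since the proposition identifies "words" with the nonterminals of a grammar-based encoding, a toy illustration may help. Below is a minimal sketch of a grammar-based code in Python: it greedily replaces the most frequent repeated digram with a fresh nonterminal, in the style of the Re-Pair heuristic. Note that computing the truly shortest grammar is intractable (the smallest grammar problem is NP-hard), so this greedy scheme only approximates the encoding the proposition refers to; the function name grammar_encode and the ("N", k) nonterminal labels are illustrative choices, not constructions from the paper.

```python
from collections import Counter

def grammar_encode(text: str):
    """Greedily build a straight-line grammar for `text`.

    Returns (start_rule, rules), where rules maps each nonterminal
    (a tuple ("N", k)) to the pair of symbols it expands to. The set
    of nonterminals plays the role of the code's "vocabulary".
    """
    seq = list(text)   # the start rule, initially all terminals
    rules = {}         # nonterminal -> (symbol, symbol)
    next_nt = 0
    while True:
        # Count adjacent symbol pairs (digrams) in the start rule.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = max(pairs.items(), key=lambda kv: kv[1])
        if count < 2:  # no digram repeats: the grammar is finished
            break
        nt = ("N", next_nt)  # fresh nonterminal, one new "word"
        next_nt += 1
        rules[nt] = pair
        # Rewrite the start rule, replacing the digram left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

start, rules = grammar_encode("abracadabra abracadabra abracadabra")
print(len(start), "symbols in the start rule")
print(len(rules), "nonterminals (the code's 'vocabulary')")
```

On repetitive inputs like the one above, the number of nonterminals grows far more slowly than the text length; this is the regime in which the proposition's lower bound of n^β / log n distinct words is informative.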
