Is Natural Language a Perigraphic Process? The Theorem about Facts and Words Revisited

As we discuss, a stationary stochastic process is nonergodic when a random persistent topic can be detected in the infinite random text sampled from the process, whereas we call the process strongly nonergodic when an infinite sequence of independent random bits, called probabilistic facts, is needed to describe this topic completely. Replacing probabilistic facts with an algorithmically random sequence of bits, called algorithmic facts, we adapt this property back to ergodic processes. Subsequently, we call a process perigraphic if the number of algorithmic facts which can be inferred from a finite text sampled from the process grows like a power of the text length. We present a simple example of such a process. Moreover, we demonstrate an assertion, which we call the theorem about facts and words. This proposition states that the number of probabilistic or algorithmic facts which can be inferred from a text drawn from a process must be roughly smaller than the number of distinct word-like strings detected in this text by means of the Prediction by Partial Matching (PPM) compression algorithm. We also observe that the number of word-like strings for a sample of plays by Shakespeare follows an empirical stepwise power law, in stark contrast to Markov processes. Hence, we suppose that natural language, considered as a stochastic process, is not only non-Markov but also perigraphic.
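To make the empirical claim about power-law vocabulary growth concrete, the following is a minimal sketch, not the paper's procedure: the paper counts word-like strings detected by the PPM compression algorithm, whereas this toy script uses simple regex tokenization as a rough proxy and fits the exponent of V(n) ~ n^beta from growing text prefixes. The file name `hamlet.txt` and the function names are illustrative assumptions, not taken from the paper.

```python
import math
import re


def vocabulary_growth(text, num_points=20):
    """Return (prefix_length, distinct_word_count) pairs for growing prefixes of the text."""
    words = re.findall(r"[a-z']+", text.lower())
    points = []
    for k in range(1, num_points + 1):
        n = len(words) * k // num_points
        points.append((n, len(set(words[:n]))))
    return points


def estimate_power_law_exponent(points):
    """Least-squares slope of log V(n) against log n, i.e. the exponent beta in V(n) ~ n^beta."""
    pairs = [(math.log(n), math.log(v)) for n, v in points if n > 0 and v > 0]
    xs, ys = zip(*pairs)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Replace with any long plain-text file, e.g. a play by Shakespeare.
    with open("hamlet.txt", encoding="utf-8") as f:
        text = f.read()
    beta = estimate_power_law_exponent(vocabulary_growth(text))
    print(f"estimated vocabulary growth exponent beta ~ {beta:.2f}")
```

For a memoryless or Markov source over a finite alphabet, this estimate tends toward zero as the text grows, since the vocabulary saturates; a persistent power-law exponent is the kind of behavior the abstract attributes to natural language.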
