A New Universal Code Helps to Distinguish Natural Language from Random Texts

Using a new universal distribution called the switch distribution, we reveal a prominent statistical difference between a text in natural language and its unigram version. For the text in natural language, the cross mutual information grows as a power law, whereas for the unigram text it grows only logarithmically. In this way, we corroborate Hilberg’s conjecture and disprove the alternative hypothesis that texts in natural language are generated by the unigram model.
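To make the experiment concrete, here is a minimal Python sketch, with the caveat that it is not the paper's implementation: a general-purpose LZMA compressor stands in for the switch distribution as the universal code, the mutual information between the two halves X and Y of a text is approximated by C(X) + C(Y) - C(XY), and the unigram version is simulated by sampling words i.i.d. from the text's own word frequencies. The input file name (novel.txt) and all helper names are illustrative assumptions.

# A rough illustration (not the paper's method): LZMA stands in for the
# switch distribution as a universal code, and mutual information between
# the two halves X, Y of a text is approximated by C(X) + C(Y) - C(XY).
import lzma
import random

def code_length_bits(text: str) -> int:
    # Compressed size in bits, used as a proxy for universal code length.
    return 8 * len(lzma.compress(text.encode("utf-8")))

def mutual_information(text: str) -> int:
    # I(X;Y) ~ C(X) + C(Y) - C(XY) for the first and second half of text.
    half = len(text) // 2
    return (code_length_bits(text[:half])
            + code_length_bits(text[half:])
            - code_length_bits(text))

def unigram_surrogate(text: str, rng: random.Random) -> str:
    # Words sampled i.i.d. from the text's own unigram distribution.
    words = text.split()
    return " ".join(rng.choices(words, k=len(words)))

if __name__ == "__main__":
    natural = open("novel.txt", encoding="utf-8").read()  # any long text (assumed file)
    surrogate = unigram_surrogate(natural, random.Random(0))
    # Hilberg's conjecture predicts roughly power-law growth of I with n
    # for the natural text; the unigram surrogate should grow only ~log n.
    for n in (10_000, 40_000, 160_000, 640_000):
        print(n, mutual_information(natural[:n]), mutual_information(surrogate[:n]))

Plotting the two curves against n on log-log axes should show the qualitative contrast, although an off-the-shelf compressor adds redundancy of its own (cf. the Lempel-Ziv redundancy analysis in [3]), which is presumably why the paper relies on the sharper switch distribution.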

[1] R. Ferrer-i-Cancho et al., "Random Texts Do Not Exhibit the Real Zipf's Law-Like Rank Distribution," PLoS ONE, 2010.

[2] W. Ebeling et al., "Entropy of symbolic sequences: the role of correlations," 1991.

[3] G. Louchard et al., "Average redundancy rate of the Lempel-Ziv code," Proceedings of the Data Compression Conference (DCC '96), 1996.

[4] W. Bialek et al., "Predictability, Complexity, and Learning," 2001.

[5] S. de Rooij et al., "Catching Up Faster in Bayesian Model Selection and Model Averaging," NIPS, 2007.

[6] N. Tishby et al., "Complexity through nonextensivity," 2001, arXiv:physics/0103076.

[7] M. Těšitelová, "Quantitative Linguistics," 1992.

[8] T. M. Cover et al., "A convergent gambling estimate of the entropy of English," IEEE Transactions on Information Theory, 1978.

[9] Ł. Dębowski, "On Hilberg's law and its links with Guiraud's law," Journal of Quantitative Linguistics, 2006.

[10] Ł. Dębowski, "Empirical Evidence for Hilberg's Conjecture in Single-Author Texts," 2012.

[11] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, 1977.

[12] G. Zipf, "The Psycho-Biology of Language: An Introduction to Dynamic Philology," 1999.

[13] Ł. Dębowski, "On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts," IEEE Transactions on Information Theory, 2008.

[14] T. M. Cover et al., "Elements of Information Theory," 2005.

[15] B. B. Mandelbrot, "Structure formelle des textes et communication," 1954.

[16] W. Ebeling et al., "Word frequency and entropy of symbolic sequences: a dynamical perspective," 1992.

[17] C. E. Shannon, "Prediction and Entropy of Printed English," 1951.

[18] J. Rissanen, "Minimum Description Length Principle," in Encyclopedia of Machine Learning, 2010.

[19] G. Miller, "Some effects of intermittent silence," The American Journal of Psychology, 1957.

[20] W. Ebeling et al., "Entropy and Long-Range Correlations in Literary English," 1993, arXiv:cond-mat/0204108.

[21] W. Hilberg, "Der bekannte Grenzwert der redundanzfreien Information in Texten - eine Fehlinterpretation der Shannonschen Experimente?" ["The known limit of redundancy-free information in texts: a misinterpretation of Shannon's experiments?"], 1990.