Algorithmic information, complexity and Zipf's law

Zipf’s law of word frequencies for language discourses is established with statistical rigor. The data depart from Zipf’s pure power-law term at low frequencies, and this departure is accounted for by a modifying exponential term. Both terms arise naturally in a model for word frequencies that draws on information theory, algorithmic coding of a text that preserves the symbol sequence, concepts from quantum statistical physics and computer science, and extremum principles. The Optimum Meaning Preserving Code (OMPC) of the discourse is realized when word frequencies follow the Modified Power Law (MPL). The model predicts a variant of the MPL for the relative frequencies of a small fixed set of symbols such as letters, phonemes and grammatical words. The OMPC can be viewed as containing an orderly part and a random part. This leads to a quantitative definition of the complexity of a string (C) that tends to 0 at the extremes of ‘all order’ and ‘all random’ but reaches its maximum (C = 1) for a mixture of both (Gell-Mann). Natural languages are found to have maximum complexity. The uniqueness of Zipf’s power law index (γ = 2) is shown to arise in four different ways, one of which depends on the scale invariance characteristic of fractal structures. It is argued that random text models are unsuitable for natural languages. It is speculated that a drastic change in the symbol frequency distribution, starting at the level of phrases, is related to the emergence of meaning and coherence in a discourse.
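The interplay of the power-law term and the exponential term can be illustrated numerically. The sketch below is not taken from the abstract, which does not state the functional form of the MPL: it assumes the parameterisation W(k) = A k^(-γ) exp(-μ/k), where W(k) is the number of distinct words occurring k times, and the names `mpl_count`, `local_log_slope`, `mu` and `scale` are hypothetical. Under that assumption the local slope on a log-log plot is -γ + μ/k, so the curve approaches the pure Zipf slope -γ for frequent words and flattens for rare ones.

```python
import math


def mpl_count(k, gamma=2.0, mu=1.0, scale=1.0):
    """Assumed MPL form: W(k) = scale * k**(-gamma) * exp(-mu / k).

    k is the number of occurrences of a word; W(k) is the (unnormalised)
    number of distinct words with that many occurrences.
    """
    return scale * k ** (-gamma) * math.exp(-mu / k)


def local_log_slope(k, gamma=2.0, mu=1.0, step=1e-4):
    """Numerical d ln W / d ln k at k, using a central difference in ln k."""
    lo = mpl_count(k * math.exp(-step), gamma, mu)
    hi = mpl_count(k * math.exp(step), gamma, mu)
    return (math.log(hi) - math.log(lo)) / (2 * step)


if __name__ == "__main__":
    # The slope should move from about -gamma + mu at k = 1
    # towards -gamma as k grows.
    for k in (1, 2, 5, 20, 100, 1000):
        print(k, round(local_log_slope(k), 3))
```

With the illustrative values γ = 2 and μ = 1, the exponential factor matters only for small k; fitting γ and μ to real counts would require a corpus and is outside the scope of this sketch.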

[1]  John Burrows,et al.  Word-Patterns and Story-Shapes: The Statistical Analysis of Narrative Style , 1987 .

[2]  David Thomas,et al.  The Art in Computer Programming , 2001 .

[3]  R. Harald Baayen,et al.  How Variable May a Constant be? Measures of Lexical Richness in Perspective , 1998, Comput. Humanit..

[4]  D. Champernowne A Model of Income Distribution , 1953 .

[5]  Ramon Ferrer i Cancho,et al.  The small world of human language , 2001, Proceedings of the Royal Society of London. Series B: Biological Sciences.

[6]  George Kingsley Zipf,et al.  Human behavior and the principle of least effort , 1949 .

[7]  M. Gell-Mann The Quark and the Jaguar: Adventures in the Simple and the Complex , 1994 .

[8]  Ricard V. Solé,et al.  Two Regimes in the Frequency of Words and the Origins of Complex Lexicons: Zipf’s Law Revisited , 2001, J. Quant. Linguistics.

[9]  Wentian Li,et al.  Random texts exhibit Zipf's-law-like word frequency distribution , 1992, IEEE Trans. Inf. Theory.

[10]  Teun A. van Dijk,et al.  Text and Context: Explorations in the Semantics and Pragmatics of Discourse , 1977 .

[11]  H. Simon,et al.  ON A CLASS OF SKEW DISTRIBUTION FUNCTIONS , 1955 .

[12]  Benoit B. Mandelbrot,et al.  A Note On a Class of Skew Distribution Functions: Analysis and Critique of a Paper by H. A. Simon , 1959, Inf. Control..

[13]  Herbert A. Simon,et al.  Reply to Dr. Mandelbrot's Post Scriptum , 1961, Inf. Control..

[14]  B. Mandelbrot On the quadratic mapping z→z²-μ for complex μ and z: The fractal structure of its M set, and scaling , 1983 .

[15]  J. Schilperoord,et al.  Linguistics , 1999 .

[16]  Herbert A. Simon Reply to "Final Note" by Benoit Mandelbrot , 1961, Inf. Control..

[17]  Benoit B. Mandelbrot,et al.  Post Scriptum to "Final Note" , 1961, Inf. Control..

[18]  S. Denisov,et al.  Fractal binary sequences: Tsallis thermodynamics and the Zipf law , 1997 .

[19]  Herbert A. Simon,et al.  Some Further Notes on a Class of Skew Distribution Functions , 1960, Inf. Control..

[20]  D. G. Champernowne,et al.  The Graduation of Income Distributions , 1952 .

[21]  Marcelo A. Montemurro,et al.  Beyond the Zipf-Mandelbrot law in quantitative linguistics , 2001, ArXiv.

[22]  B. Schapiro,et al.  Zipf's law and the effect of ranking on probability distributions , 1996 .

[23]  S. Naranan,et al.  Information theoretic models in statistical linguistics. I: A model for word frequencies , 1992 .

[24]  Rosario N. Mantegna,et al.  Can Zipf Analyses and Entropy Distinguish Between Artificial and Natural Language Texts , 1996 .

[25]  D. Zanette,et al.  At the boundary between biological and cultural evolution: the origin of surname distributions. , 2002, Journal of theoretical biology.

[26]  R. Harald Baayen,et al.  The Effects of Lexical Specialization on the Growth Curve of the Vocabulary , 1996, Comput. Linguistics.