Statistical laws in linguistics

Zipf’s law is just one of many universal laws proposed to describe statistical regularities in language. Here we review and critically discuss how these laws can be statistically interpreted, fitted, and tested (falsified). The modern availability of large databases of written text allows for tests with unprecedented statistical accuracy and for a characterization of the fluctuations around the typical behavior. We find that fluctuations are usually much larger than expected under simplifying statistical assumptions (e.g., independence and lack of correlations between observations). These simplifications also appear in standard statistical tests, so the large fluctuations can be erroneously interpreted as a falsification of the law. Instead, we argue that linguistic laws are only meaningful (falsifiable) if accompanied by a model for which the fluctuations can be computed (e.g., a generative model of the text). The large fluctuations we report show that the constraints imposed by linguistic laws on the creative process of text generation are not as tight as one might expect.
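
To make the notions of fitting and of fluctuations "expected under independence" concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes a truncated Zipf law f(r) ∝ r^(-gamma) over a fixed number of word types, draws tokens independently, and recovers gamma by maximum likelihood. All function names, the exponent, and the sample sizes are illustrative assumptions. For simplicity the true rank of each token is taken as known, whereas in real texts ranks are assigned from observed frequencies, which introduces an additional bias.

```python
import math
import random
from collections import Counter


def zipf_probs(gamma, n_types):
    """Normalized probabilities p(r) proportional to r**(-gamma), r = 1..n_types."""
    weights = [r ** -gamma for r in range(1, n_types + 1)]
    z = sum(weights)
    return [w / z for w in weights]


def sample_ranks(gamma, n_types, n_tokens, seed=0):
    """Draw n_tokens ranks independently (the i.i.d. assumption) from the Zipf law."""
    rng = random.Random(seed)
    ranks = range(1, n_types + 1)
    return rng.choices(ranks, weights=zipf_probs(gamma, n_types), k=n_tokens)


def log_likelihood(gamma, counts, n_types):
    """Log-likelihood of rank counts under the truncated Zipf law."""
    n = sum(counts.values())
    sum_log_r = sum(c * math.log(r) for r, c in counts.items())
    log_z = math.log(sum(r ** -gamma for r in range(1, n_types + 1)))
    return -gamma * sum_log_r - n * log_z


def fit_gamma(counts, n_types, lo=0.1, hi=4.0, iters=60):
    """Maximize the log-likelihood by ternary search (it is concave in gamma)."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_likelihood(m1, counts, n_types) < log_likelihood(m2, counts, n_types):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


if __name__ == "__main__":
    true_gamma, n_types, n_tokens = 1.0, 1000, 20000  # illustrative values
    fits = []
    for seed in range(5):
        counts = Counter(sample_ranks(true_gamma, n_types, n_tokens, seed=seed))
        fits.append(fit_gamma(counts, n_types))
    # The spread of these estimates is the fluctuation expected if tokens were independent.
    print("fitted exponents:", [round(g, 3) for g in fits])
```

The spread of the fitted exponents across seeds gives the size of fluctuations under the independence assumption. The abstract's point is that real texts fluctuate considerably more than this, so a test calibrated on such a null model can reject the law for the wrong reason.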
