Stochastic model for the vocabulary growth in natural languages

Max Planck Institute for the Physics of Complex Systems, 01187 Dresden, Germany

(Received 4 December 2012; revised manuscript received 20 March 2013; published 14 May 2013)

We propose a stochastic model for the number of different words in a given database which incorporates the dependence on the database size and historical changes. The main feature of our model is the existence of two different classes of words: (i) a finite number of core words, which have higher frequency and do not affect the probability of a new word to be used, and (ii) the remaining virtually infinite number of noncore words, which have lower frequency and, once used, reduce the probability of a new word to be used in the future. Our model relies on a careful analysis of the Google Ngram database of books published in the last centuries, and its main consequence is the generalization of Zipf’s and Heaps’ law to two-scaling regimes. We confirm that these generalizations yield the best simple description of the data among generic descriptive models and that the two free parameters depend only on the language but not on the database. From the point of view of our model, the main change on historical time scales is the composition of the specific words included in the finite list of core words, which we observe to decay exponentially in time with a rate of approximately 30 words per year for English.
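The two-class mechanism described in the abstract can be illustrated with a toy simulation. The sketch below is not the model of the paper: the functional form of the new-word probability, the rule deciding whether a new word is core, and all parameter names and values (`core_size`, `p0`, `n0`, `p_core`) are assumptions chosen only for illustration. It keeps the two qualitative ingredients stated above: a finite pool of core words that leave the new-word probability unchanged, and an unbounded supply of noncore words that each lower it once used.

```python
import random


def simulate_vocabulary_growth(n_tokens, core_size=50, p0=0.5, n0=100.0,
                               p_core=0.5, seed=0):
    """Toy two-class vocabulary growth process (illustrative assumptions only).

    Returns (growth, n_core, n_noncore), where growth[t] is the number of
    distinct words seen after t + 1 tokens.
    """
    rng = random.Random(seed)
    tokens = []      # word ids in order of use; a uniform draw from this
                     # list reuses a word with probability proportional
                     # to its current frequency
    n_core = 0       # distinct core words used so far (capped at core_size)
    n_noncore = 0    # distinct noncore words used so far (unbounded)
    growth = []

    for _ in range(n_tokens):
        # Assumed form: only noncore words reduce the new-word probability.
        p_new = p0 / (1.0 + n_noncore / n0)

        if not tokens or rng.random() < p_new:
            word = n_core + n_noncore  # fresh id for the new word
            # Assumed rule: a new word is core with probability p_core
            # until the finite core pool is exhausted.
            if n_core < core_size and rng.random() < p_core:
                n_core += 1
            else:
                n_noncore += 1
        else:
            word = rng.choice(tokens)  # reuse an existing word

        tokens.append(word)
        growth.append(n_core + n_noncore)

    return growth, n_core, n_noncore


if __name__ == "__main__":
    growth, n_core, n_noncore = simulate_vocabulary_growth(20000)
    print("distinct words:", growth[-1], "core:", n_core, "noncore:", n_noncore)
```

Plotting `growth` against the token index on log-log axes gives a rough feel for how the vocabulary growth slows once noncore words dominate. The exponents and the crossover produced by this toy are not expected to match those reported for the Google Ngram data.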
