The predictive capabilities of mathematical models for the type-token relationship in English language corpora

Abstract We investigate the predictive capability of mathematical models of the type-token relationship applied to the vocabulary growth profiles of a selection of English language documents. We compare the existing Good-Toulmin and Heaps formulae with an alternative approach based on Bernoulli trial word selection from a fixed finite vocabulary, using the Zipf and Zipf-Mandelbrot probability distributions. We make two major observations. First, while the Zipf-Mandelbrot model predicts vocabulary growth better than the Zipf model, the optimized parameters of the Zipf model correlate better than those of the Zipf-Mandelbrot model with statistics gleaned independently from the data. Second, the mean of the predictions of the Zipf-Mandelbrot, Good-Toulmin and Heaps models provides a more consistent and unbiased prediction of vocabulary than any individual model alone.
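As a rough illustration of the kind of models being compared, the sketch below computes the expected number of distinct word types after N tokens when each token is drawn independently from a fixed finite vocabulary with Zipf-Mandelbrot probabilities, alongside the standard Heaps power-law form. The abstract does not give the exact formulation used, so the function names and the parameters `a`, `b`, `k`, `beta` and `vocab_size` are illustrative assumptions rather than the authors' notation; setting `b = 0` reduces the distribution to the plain Zipf case.

```python
def zipf_mandelbrot_probs(vocab_size, a=1.0, b=0.0):
    """Probabilities p_r proportional to 1 / (r + b)**a for ranks r = 1..vocab_size.

    Assumed parameterisation; b = 0 gives the plain Zipf distribution.
    """
    weights = [1.0 / (rank + b) ** a for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_types(n_tokens, probs):
    """Expected number of distinct types after n_tokens independent draws.

    A word with probability p is still unseen after n draws with
    probability (1 - p)**n, so it contributes 1 - (1 - p)**n.
    """
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)


def heaps_types(n_tokens, k, beta):
    """Heaps' law in its usual power-law form: V(N) = k * N**beta."""
    return k * n_tokens ** beta


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for demonstration.
    probs = zipf_mandelbrot_probs(vocab_size=50_000, a=1.1, b=2.7)
    for n in (1_000, 10_000, 100_000):
        print(n, round(expected_types(n, probs)), round(heaps_types(n, 10.0, 0.6)))
```

In this sketch the finite-vocabulary curve saturates at `vocab_size` as N grows, whereas the Heaps curve grows without bound, which is one reason the two families can extrapolate differently.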
