The Effects of Lexical Specialization on the Growth Curve of the Vocabulary

Under the urn model, the number of different words expected to appear in, for example, the first half of a text is known to overestimate the number actually observed. This paper examines the source of this overestimation bias. It is shown that the bias does not arise from sentence-bound syntactic constraints but is a direct consequence of topic cohesion in discourse. The nonrandom, clustered appearance of lexically specialized words, often the key words of the text, explains the main trends in the overestimation bias both quantitatively and qualitatively. The effects of nonrandomness are so strong that they also introduce an overestimation bias in distributions of units derived from words, such as syllables and digrams. Nonrandom word usage also affects the accuracy of the Good-Turing frequency estimates, which reveal a strong underestimation bias for the lowest frequencies. A heuristic adjusted frequency estimate is proposed that, at least for novel-sized texts, is considerably more accurate.
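
The comparison between expected and observed vocabulary size can be illustrated with a minimal sketch. It assumes the urn model means drawing m tokens at random, without replacement, from the complete text, so that a word with frequency f is missed with probability C(N-f, m) / C(N, m). The function names and the toy text are hypothetical and are not taken from the paper, which may use a different form of the expectation (for instance a binomial approximation).

```python
from collections import Counter
from math import comb


def expected_vocab_size(tokens, m):
    """Expected number of distinct words among m tokens drawn at random,
    without replacement, from `tokens` (urn-model assumption)."""
    n = len(tokens)
    if not 0 <= m <= n:
        raise ValueError("m must lie between 0 and len(tokens)")
    # Frequency spectrum: how many word types occur exactly f times.
    spectrum = Counter(Counter(tokens).values())
    total = comb(n, m)
    # A type with frequency f is absent from the sample with
    # probability C(n - f, m) / C(n, m).
    return sum(v * (1 - comb(n - f, m) / total) for f, v in spectrum.items())


def observed_vocab_size(tokens, m):
    """Number of distinct words actually found in the first m tokens."""
    return len(set(tokens[:m]))


# Toy text in which each word is clustered, as a topic-bound word might be.
text = ["a", "a", "b", "b"]
half = len(text) // 2
expected = expected_vocab_size(text, half)   # 2 * (1 - 1/6) = 5/3
observed = observed_vocab_size(text, half)   # only "a" occurs in the first half
```

In this toy case the urn model expects about 1.67 different words in the first half, while only 1 is observed. That is the direction of the bias the abstract describes, although a four-token example says nothing about its size in real texts.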
