The Probability Distribution of Textual Vocabulary in the English Language

Abstract: The probability of textual vocabulary is defined as the sum of the probabilities of the individual lemmas occurring in a text. This sum is 1 within the text itself but is normally less than 1 when measured in a different text. If the text is expanded, the probability of the original textual vocabulary would be smaller than 1 in the expanded text. However, the present study reveals that, as the text continues to expand, this probability does not keep decreasing monotonically; instead, it quickly reaches a point from which it stabilizes despite further expansion of the text. In addition, the probability of a text's vocabulary occurring in other texts is not affected by the length of those texts. Mathematical models are formulated that capture the distribution of the probability of textual vocabulary in the English language.
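The definition in the abstract can be illustrated with a minimal sketch. It assumes that whitespace-separated tokens stand in for lemmas (the study itself works with lemmas, so a real analysis would lemmatize first), and the function name `vocabulary_probability` and the toy sentences are hypothetical, chosen only for illustration. The sketch sums the relative frequencies, measured in a target text, of every item that belongs to the vocabulary of a source text.

```python
from collections import Counter


def vocabulary_probability(source_tokens, target_tokens):
    """Summed relative frequency, in the target text, of all items
    that belong to the vocabulary of the source text.

    Assumption: tokens are used as stand-ins for lemmas.
    """
    vocabulary = set(source_tokens)
    counts = Counter(target_tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("target text is empty")
    covered = sum(n for item, n in counts.items() if item in vocabulary)
    return covered / total


# Toy data (hypothetical): an original text and an expanded version of it.
original = "the cat sat on the mat".split()
expanded = original + "a dog ran to the cat".split()

# Within the text itself the probabilities sum to 1.
p_self = vocabulary_probability(original, original)

# In the expanded text, new items take up part of the probability mass.
p_expanded = vocabulary_probability(original, expanded)

print(p_self, p_expanded)
```

In this toy case 8 of the 12 tokens in the expanded text belong to the original vocabulary, so the second value is 8/12. The example only illustrates the definition; it does not reproduce the stabilization effect reported in the study, which concerns much longer texts.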
