Text Length, Vocabulary Size and Text Coverage Constancy

Abstract This paper examines the dynamic relationship between vocabulary size, text length and text coverage in English, where text coverage is the ratio of the number of words in a text (or collection of texts) covered by a given vocabulary set to the total number of words in that text (or collection). The results show that, on average, for texts between 50 and 1,000,000 words in length, the coverage provided by the same vocabulary set is not significantly affected by text length. In addition, the relationship between text coverage and vocabulary size can be captured by the re-parametrized mathematical models of Altmann, Tuldava, and Köhler and Martináková-Rendeková.
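
The definition of text coverage given in the abstract can be illustrated with a minimal sketch. It assumes that a text is already split into word tokens and that matching is case-insensitive on exact word forms; the function name `text_coverage`, the toy sentence and the toy vocabulary are hypothetical and are not taken from the paper, which may count words differently (for example by word family or lemma).

```python
def text_coverage(tokens, vocabulary):
    """Return the share of running words in `tokens` found in `vocabulary`.

    Assumption: matching is case-insensitive on exact word forms.
    """
    if not tokens:
        raise ValueError("text must contain at least one token")
    vocab = {word.lower() for word in vocabulary}
    # Count every running word (token) whose form is in the vocabulary set.
    covered = sum(1 for token in tokens if token.lower() in vocab)
    return covered / len(tokens)


# Toy data for illustration only.
text = "the cat sat on the mat and the dog sat too".split()
vocab = {"the", "sat", "on", "and"}
coverage = text_coverage(text, vocab)
print(f"coverage = {coverage:.3f}")
```

Here 7 of the 11 running words belong to the vocabulary set, so the coverage is 7/11. Applying the same function to samples of different lengths drawn from one corpus would be one way to inspect how coverage by a fixed vocabulary set behaves as text length changes.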
