Paper information - An Asymptotic Model for the English Hapax/Vocabulary Ratio

According to the literature, hapax legomena (words that occur only once) account for roughly 50% of the vocabulary of an English text or collection of texts. This apparent constancy is puzzling. The 100-million-word British National Corpus was used to study the phenomenon. The results reveal that the hapax/vocabulary ratio follows a U-shaped pattern: as text size increases, the ratio first decreases, but once the text reaches about 3,000,000 words it begins to increase steadily. A computer simulation shows that, as text size continues to increase, the hapax/vocabulary ratio would approach 1.
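
To make the measured quantity concrete, the following is a minimal sketch of how a hapax/vocabulary ratio could be computed for growing prefixes of a text. It is not the author's procedure: the function names `hapax_vocabulary_ratio` and `ratio_curve`, the whitespace tokenization, and the toy sentence are all assumptions made for illustration.

```python
from collections import Counter


def hapax_vocabulary_ratio(tokens):
    """Return (number of word types occurring once) / (number of word types)."""
    counts = Counter(tokens)
    vocabulary = len(counts)
    if vocabulary == 0:
        return 0.0  # convention chosen here for an empty text
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / vocabulary


def ratio_curve(tokens, step):
    """Return (text_size, ratio) pairs, sampled every `step` tokens and at the end."""
    counts = Counter()
    hapaxes = 0
    curve = []
    for i, tok in enumerate(tokens, start=1):
        counts[tok] += 1
        if counts[tok] == 1:
            hapaxes += 1      # a new type starts out as a hapax
        elif counts[tok] == 2:
            hapaxes -= 1      # a second occurrence removes it from the hapaxes
        if i % step == 0 or i == len(tokens):
            curve.append((i, hapaxes / len(counts)))
    return curve


# Toy input; a real study would use a tokenized corpus instead (assumption).
sample = "the cat sat on the mat and the dog sat too".split()
print(hapax_vocabulary_ratio(sample))
print(ratio_curve(sample, 4))
```

Plotting the pairs returned by `ratio_curve` against text size for a large corpus is one way to inspect how the ratio changes as the text grows. The toy sentence is far too short to show the U-shaped pattern described above.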

Fan Fengxiang
