Empirical studies of corpora involve measuring several quantities for the purpose of comparing corpora, building language models, or making generalizations about specific linguistic phenomena in a language. Quantities such as average word length are stable across sample sizes and hence can be reliably estimated from sufficiently large samples. However, quantities such as vocabulary size change with sample size, so measurements made on a given sample must be extrapolated to obtain estimates over larger, unseen samples. In this work, we propose a novel nonparametric estimator of vocabulary size. Our main result is a proof of the statistical consistency of the estimator, the first such result in the literature. Finally, we compare our proposal with state-of-the-art estimators (both parametric and nonparametric) on large standard corpora; apart from showing the favorable performance of our estimator, we also observe that the classical Good-Turing estimator consistently underestimates the vocabulary size.
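The abstract does not spell out the proposed estimator, so the following is only a minimal illustrative sketch of the two quantities it discusses: the vocabulary growth curve (distinct word types as a function of sample size, which does not stabilize) and the classical Good-Turing estimate of the probability mass of unseen types. The function names, the Zipf-like toy corpus, and the parameters are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter

def vocabulary_growth(tokens, step=1000):
    """Empirical vocabulary size (number of distinct word types) as the
    sample grows; unlike average word length, this keeps increasing."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

def good_turing_missing_mass(tokens):
    """Classical Good-Turing estimate of the total probability of unseen
    word types: (number of words seen exactly once) / (sample size)."""
    counts = Counter(tokens)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(tokens)

if __name__ == "__main__":
    # Toy corpus drawn from a Zipf-like distribution (synthetic data,
    # used only to illustrate the quantities named in the abstract).
    random.seed(0)
    vocab = [f"w{i}" for i in range(1, 5001)]
    weights = [1.0 / i for i in range(1, 5001)]
    sample = random.choices(vocab, weights=weights, k=20000)

    print(vocabulary_growth(sample, step=5000))
    print("Good-Turing unseen mass:", good_turing_missing_mass(sample))
```

Running the sketch shows the vocabulary count still climbing at the largest sample size, which is why extrapolation (rather than direct measurement) is needed to estimate vocabulary size on larger, unseen samples.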