In this paper I address the practical concern of predicting how much training data is sufficient for a statistical language learning system. First, I briefly review earlier results and show how these can be combined to bound the expected accuracy of a mode-based learner as a function of the volume of training data. I then develop a more accurate estimate of the expected accuracy function under the assumption that inputs are uniformly distributed. Since this estimate is expensive to compute, I also give a close but cheaply computable approximation to it. Finally, I report on a series of simulations exploring the effects of inputs that are not uniformly distributed. Although these results are based on simplistic assumptions, they are a tentative step toward a useful theory of data requirements for SLL systems.
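The relationship the abstract describes — the expected accuracy of a mode-based learner as a function of training volume — can be illustrated with a small Monte Carlo sketch. The setup below is an assumption for illustration only, not the paper's model: inputs fall into `n_bins` equally likely types (the uniform-distribution case), each type has a binary outcome whose majority value occurs with probability `p_correct`, and the learner predicts each type's observed mode, guessing on ties or unseen types. All names and parameter values are hypothetical.

```python
import random

def simulate_accuracy(n_train, n_bins=20, p_correct=0.7, trials=2000, seed=0):
    """Monte Carlo estimate of a mode-based learner's expected accuracy.

    Training instances are spread uniformly over n_bins input types; in
    each type the majority outcome occurs with probability p_correct.
    The learner predicts the observed mode of each type (random guess on
    ties or unseen types), and accuracy is the probability its prediction
    matches the true majority outcome.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        # Draw a training set: count majority-outcome observations per bin.
        majority = [0] * n_bins
        totals = [0] * n_bins
        for _ in range(n_train):
            b = rng.randrange(n_bins)
            totals[b] += 1
            if rng.random() < p_correct:
                majority[b] += 1
        # Evaluate on one uniformly drawn input type per trial.
        b = rng.randrange(n_bins)
        if majority[b] * 2 > totals[b]:
            correct += 1                     # learned mode is the true mode
        elif majority[b] * 2 == totals[b]:
            correct += rng.random() < 0.5    # tie or unseen type: guess

    return correct / trials

for n in (10, 50, 200, 1000):
    print(n, round(simulate_accuracy(n), 3))
```

Under these toy assumptions the estimated accuracy climbs toward 1.0 as the training volume grows, since each type eventually accumulates enough instances for its sample mode to match the true mode; the paper's analysis makes this kind of curve precise.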
Mark Lauer. How much is enough?: Data requirements for statistical NLP. arXiv, 1995.