In this paper I address the practical concern of predicting how much training data is sufficient for a statistical language learning system. First, I briefly review earlier results and show how these can be combined to bound the expected accuracy of a mode-based learner as a function of the volume of training data. I then develop a more accurate estimate of the expected accuracy function under the assumption that inputs are uniformly distributed. Since this estimate is expensive to compute, I also give a close but cheaply computable approximation to it. Finally, I report on a series of simulations exploring the effects of inputs that are not uniformly distributed. Although these results are based on simplistic assumptions, they are a tentative step toward a useful theory of data requirements for SLL systems.
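The relationship the abstract describes — the expected accuracy of a mode-based learner as a function of training volume — can be illustrated with a small Monte Carlo sketch. The setup below is an assumption for illustration only, not the paper's model: inputs fall into `n_bins` equally likely types (the uniform-distribution case), each type has a binary outcome whose majority value occurs with probability `p_correct`, and the learner predicts each type's observed mode, guessing on ties or unseen types. All names and parameter values are hypothetical.

```python
import random

def simulate_accuracy(n_train, n_bins=20, p_correct=0.7, trials=2000, seed=0):
    """Monte Carlo estimate of a mode-based learner's expected accuracy.

    Training instances are spread uniformly over n_bins input types; in
    each type the majority outcome occurs with probability p_correct.
    The learner predicts the observed mode of each type (random guess on
    ties or unseen types), and accuracy is the probability its prediction
    matches the true majority outcome.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        # Draw a training set: count majority-outcome observations per bin.
        majority = [0] * n_bins
        totals = [0] * n_bins
        for _ in range(n_train):
            b = rng.randrange(n_bins)
            totals[b] += 1
            if rng.random() < p_correct:
                majority[b] += 1
        # Evaluate on one uniformly drawn input type per trial.
        b = rng.randrange(n_bins)
        if majority[b] * 2 > totals[b]:
            correct += 1                     # learned mode is the true mode
        elif majority[b] * 2 == totals[b]:
            correct += rng.random() < 0.5    # tie or unseen type: guess

    return correct / trials

for n in (10, 50, 200, 1000):
    print(n, round(simulate_accuracy(n), 3))
```

Under these toy assumptions the estimated accuracy climbs toward 1.0 as the training volume grows, since each type eventually accumulates enough instances for its sample mode to match the true mode; the paper's analysis makes this kind of curve precise.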
Mark Lauer. How much is enough?: Data requirements for statistical NLP. arXiv, 1995.