Paper Information - A Very Very Large Corpus Doesn’t Always Yield Reliable Estimates

Banko and Brill (2001) suggested that the development of very large training corpora may be more effective for progress in empirical Natural Language Processing than improving methods that use existing smaller training corpora. This work tests their claim by exploring whether a very large corpus can eliminate the sparseness problems associated with estimating unigram probabilities. We do this by empirically investigating the convergence behaviour of unigram probability estimates on a one billion word corpus. As expected, with one billion words many of our estimates do converge to their eventual value. However, we also find that for some words no such convergence occurs. This leads us to conclude that simply relying upon large corpora is not in itself sufficient: we must pay attention to the statistical modelling as well.
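As a rough illustration of what "convergence behaviour of unigram probability estimates" can mean in practice, the sketch below tracks the relative-frequency estimate count(w)/n for one word as more tokens are read. This is not the authors' procedure: the function name `running_unigram_estimates`, the checkpoint sizes, and the toy token list are all assumptions made for illustration, and a real experiment would stream a corpus far too large to hold in a list.

```python
def running_unigram_estimates(tokens, word, checkpoints):
    """Return (n, count/n) for `word` after the first n tokens, for each checkpoint n.

    Hypothetical helper for illustration only; the estimate is the plain
    relative frequency (maximum-likelihood unigram estimate), unsmoothed.
    """
    # Keep only checkpoints that fall inside the token sequence, in order.
    wanted = sorted({c for c in checkpoints if 0 < c <= len(tokens)})
    estimates = []
    count = 0
    next_idx = 0
    for n, tok in enumerate(tokens, start=1):
        if tok == word:
            count += 1
        if next_idx < len(wanted) and n == wanted[next_idx]:
            estimates.append((n, count / n))
            next_idx += 1
    return estimates


# Toy data (an assumption, not from the paper): 13 whitespace-separated tokens.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
estimates = running_unigram_estimates(tokens, "the", [4, 8, 13])
for n, p in estimates:
    print(n, p)
```

If the sequence of estimates settles down as n grows, the estimate has converged in the informal sense used above; an estimate that keeps drifting at every checkpoint has not.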

James R. Curran | Miles Osborne

[1] Mark Stevenson, et al. The Reuters Corpus Volume 1 - from Yesterday’s News to Tomorrow’s Language Resources, 2002, LREC.

[2] Stanley F. Chen, et al. An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.

[3] Treebank Penn, et al. Linguistic Data Consortium, 1999.

[4] Martin Volk, et al. Exploiting the WWW as a corpus to resolve PP attachment ambiguities, 2001.

[5] John D. Lafferty, et al. A Model of Lexical Attraction and Repulsion, 1997, ACL.

[6] V. V. Petrov. Limit Theorems of Probability Theory: Sequences of Independent Random Variables, 1995.

[7] Frank Keller, et al. Using the Web to Overcome Data Sparseness, 2002, EMNLP.

[8] James R. Curran, et al. Scaling Context Space, 2002, ACL.

[9] Michele Banko, et al. Scaling to Very Very Large Corpora for Natural Language Disambiguation, 2001, ACL.