Sample Selection for Statistical Parsing

Corpus-based statistical parsing relies on using large quantities of annotated text as training examples. Building this kind of resource is expensive and labor-intensive. This work proposes to use sample selection to find helpful training examples and reduce human effort spent on annotating less informative ones. We consider several criteria for predicting whether unlabeled data might be a helpful training example. Experiments are performed across two syntactic learning tasks and within the single task of parsing across two learning models to compare the effect of different predictive criteria. We find that sample selection can significantly reduce the size of annotated training corpora and that uncertainty is a robust predictive criterion that can be easily applied to different learning models.

[1]  Alexander Dekhtyar,et al.  Information Retrieval , 2018, Lecture Notes in Computer Science.

[2]  Aravind K. Joshi,et al.  Tree Adjunct Grammars , 1975, J. Comput. Syst. Sci..

[3]  R. Larsen An introduction to mathematical statistics and its applications / Richard J. Larsen, Morris L. Marx , 1986 .

[4]  Steve Young,et al.  Applications of stochastic context-free grammars using the Inside-Outside algorithm , 1990 .

[5]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[6]  Fernando Pereira,et al.  Inside-Outside Reestimation From Partially Bracketed Corpora , 1992, HLT.

[7]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[8]  Richard C. Waters,et al.  Stochastic Lexicalized Context-Free Grammar , 1993, IWPT.

[9]  David M. Magerman Natural Language Parsing as Statistical Pattern Recognition , 1994, ArXiv.

[10]  David D. Lewis,et al.  Heterogeneous Uncertainty Sampling for Supervised Learning , 1994, ICML.

[11]  Eric Brill,et al.  A Rule-Based Approach to Prepositional Phrase Attachment Disambiguation , 1994, COLING.

[12]  Michael Collins,et al.  Prepositional Phrase Attachment through a Backed-off Model , 1995, VLC@ACL.

[13]  Shlomo Argamon,et al.  Minimizing Manual Annotation Cost in Supervised Training from Corpora , 1996, ACL.

[14]  Michael Collins,et al.  Three Generative, Lexicalised Models for Statistical Parsing , 1997, ACL.

[15]  Kentaro Inui,et al.  Selective Sampling for Example-based Word Sense Disambiguation , 1998, CL.

[16]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[17]  Rebecca Hwa An Empirical Evaluation of Probabilistic Lexicalized Tree Insertion Grammars , 1998, COLING-ACL.

[18]  Adwait Ratnaparkhi,et al.  Statistical Models for Unsupervised Prepositional Phrase Attachment , 1998, ACL.

[19]  Raymond J. Mooney,et al.  Active Learning for Natural Language Parsing and Information Extraction , 1999, ICML.

[20]  Eric Brill,et al.  Bagging and Boosting a Treebank Parser , 2000, ANLP.

[21]  David Yarowsky,et al.  Rule Writing or Annotation: Cost-efficient Resource Usage for Base Noun Phrase Chunking , 2000, ACL.

[22]  Eugene Charniak,et al.  A Maximum-Entropy-Inspired Parser , 2000, ANLP.

[23]  Rebecca Hwa,et al.  Sample Selection for Statistical Grammar Induction , 2000, EMNLP.

[24]  Rebecca Hwa,et al.  Learning probabilistic lexicalized grammars for natural language processing , 2001 .

[25]  Michele Banko,et al.  Scaling to Very Very Large Corpora for Natural Language Disambiguation , 2001, ACL.

[26]  Rebecca Hwa,et al.  On minimizing training corpus for parser acquisition , 2001, CoNLL.

[27]  Claire Cardie,et al.  Limitations of Co-Training for Natural Language Learning from Large Datasets , 2001, EMNLP.

[28]  Anoop Sarkar,et al.  Applying Co-Training Methods to Statistical Parsing , 2001, NAACL.

[29]  Min Tang,et al.  Active Learning for Statistical Natural Language Parsing , 2002, ACL.

[30]  Michael Collins,et al.  Head-Driven Statistical Models for Natural Language Parsing , 2003, CL.

[31]  Mark Steedman,et al.  Example Selection for Bootstrapping Statistical Parsers , 2003, NAACL.

[32]  Anoop Sarkar,et al.  Corrected Co-training for Statistical Parsers , 2003 .

[33]  Mark Steedman,et al.  Bootstrapping statistical parsers from small datasets , 2003, EACL.

[34]  Vladimir Solmon,et al.  The estimation of stochastic context-free grammars using the Inside-Outside algorithm , 2003 .

[35]  H. Sebastian Seung,et al.  Selective Sampling Using the Query by Committee Algorithm , 1997, Machine Learning.

[36]  David A. Cohn,et al.  Improving generalization with active learning , 1994, Machine Learning.