Automatic Acquisition of Training Data for Statistical Parsers

The limitations of existing data sets for training parsers have led to a need for additional data. However, manually annotating data in the amount and range required is prohibitively expensive. For a number of simple facts of the kind sought in Question Answering, we compile a corpus of sentences extracted from the Web that contain the fact keywords. We parse these sentences with a state-of-the-art parser, constraining the analysis of the more complex sentences using information from the simpler ones. This allows us to create additional annotated sentences automatically, which we then use to augment our existing training data.
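The corpus-compilation step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `tokenize` and `select_fact_sentences`, the example fact (Mozart, 1756), and the use of token count as a stand-in for sentence complexity. It only shows how sentences containing all fact keywords might be collected and ordered so that simpler sentences come before more complex ones; the parsing and constraint steps are not modelled.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def select_fact_sentences(sentences, keywords):
    """Keep sentences containing every fact keyword, shortest first.

    Token count is used here as a crude proxy for complexity
    (an assumption for this sketch, not the paper's measure).
    """
    wanted = {k.lower() for k in keywords}
    hits = [s for s in sentences if wanted <= set(tokenize(s))]
    return sorted(hits, key=lambda s: len(tokenize(s)))


# Hypothetical Web sentences for the fact (Mozart, 1756).
sentences = [
    "Mozart, the celebrated composer, was born in Salzburg in 1756.",
    "Mozart was born in 1756.",
    "Beethoven was born in 1770.",
]

ranked = select_fact_sentences(sentences, ["Mozart", "1756"])
for s in ranked:
    print(s)
```

In a fuller pipeline, the analysis of the first (simplest) sentences would supply the constraints applied when parsing the later, more complex ones.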
