Adapting a Lexicalized-Grammar Parser to Contrasting Domains

Most state-of-the-art wide-coverage parsers are trained on newspaper text and lose accuracy in other domains, making parser adaptation a pressing issue. In this paper we demonstrate that a CCG parser can be adapted to two new domains, biomedical text and questions for a QA system, by using manually annotated training data at the POS and lexical category levels only. This approach achieves parser accuracy comparable to that on newspaper data without the need for annotated parse trees in the new domain. We find that retraining at the lexical category level yields a larger performance increase for questions than for biomedical text, and we analyze the two datasets to investigate why different domains might behave differently for parser adaptation.
