Exploiting Heterogeneous Treebanks for Parsing

We address the problem of using heterogeneous treebanks for parsing by breaking it down into two sub-problems: converting the grammar formalisms of the treebanks to a common one, and parsing on the resulting homogeneous treebanks. First, we propose to employ an iteratively trained target-grammar parser to perform grammar formalism conversion, eliminating the predefined heuristic rules required by previous methods. Then we provide two strategies to refine the conversion results, and adopt a corpus weighting technique for parsing on the homogeneous treebanks. Results on the Penn Treebank show that our conversion method achieves a 42% error reduction over the previous best result. Evaluation on the Penn Chinese Treebank indicates that a converted dependency treebank helps constituency parsing, and that the use of unlabeled data by self-training further increases the parsing F-score to 85.2%, a 6% error reduction over the previous best result.
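The error-reduction figures above can be read with the usual relative error reduction formula, sketched below. This is an illustrative assumption about how such figures are conventionally computed (error taken as 100 minus the score); the abstract does not spell out the formula, and the function names are hypothetical.

```python
def relative_error_reduction(baseline_score: float, new_score: float) -> float:
    """Fraction of the baseline's error (100 - score) removed by the new system.

    Assumes scores are percentages and that error is defined as 100 - score.
    """
    baseline_error = 100.0 - baseline_score
    if baseline_error <= 0:
        raise ValueError("baseline score must be below 100")
    return (new_score - baseline_score) / baseline_error


def implied_baseline(new_score: float, reduction: float) -> float:
    """Invert the formula: the baseline score that a given reduction would imply."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    new_error = 100.0 - new_score
    return 100.0 - new_error / (1.0 - reduction)


if __name__ == "__main__":
    # Generic illustration: moving from 80.0 to 85.0 removes a quarter of the error.
    print(relative_error_reduction(80.0, 85.0))
    # Baseline implied by the abstract's figures; only approximate,
    # because the reported 6% is itself a rounded value.
    print(round(implied_baseline(85.2, 0.06), 2))
```

Under this reading, a small absolute gain in F-score can correspond to a noticeably larger relative reduction in error when the baseline is already high.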
