Training Parsers on Incompatible Treebanks

We consider the problem of training a statistical parser when multiple treebanks are available but are annotated according to different linguistic conventions. To address this problem, we present two simple adaptation methods: the first uses a shared feature representation when parsing multiple treebanks, and the second uses guided parsing, where the output of one parser provides features for a second one. To evaluate and analyze the adaptation methods, we train parsers on treebank pairs in four languages: German, Swedish, Italian, and English. We see significant improvements for all eight treebanks when training on the full training sets. However, the clearest benefits are seen when we consider smaller training sets. Our experiments were carried out with unlabeled dependency parsers, but the methods can easily be generalized to other feature-based parsers.
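To make the two ideas concrete, here is a minimal sketch of how they might look at the feature level. It is an illustration under stated assumptions, not the implementation used in the experiments: the abstract does not specify the feature scheme, so the shared representation is shown as a simple feature-duplication scheme (each feature gets one copy shared across treebanks and one copy specific to the source treebank), and guided parsing is shown as a single indicator feature. The function names, feature names, and treebank label are hypothetical.

```python
from typing import Dict, List

Features = Dict[str, float]


def shared_features(base: Features, treebank: str) -> Features:
    """Assumed scheme: duplicate every feature into a copy shared by all
    treebanks and a copy specific to the treebank the example comes from."""
    out: Features = {}
    for name, value in base.items():
        out[f"shared:{name}"] = value        # weight learned from both treebanks
        out[f"{treebank}:{name}"] = value    # weight learned from this treebank only
    return out


def guided_features(base: Features, head: int, dep: int,
                    guide_heads: List[int]) -> Features:
    """Assumed scheme: add an indicator telling the second parser whether the
    guide parser (trained on the other treebank) proposed the arc head -> dep.
    guide_heads[i] is the head the guide parser predicted for token i;
    index 0 is a placeholder for the artificial root."""
    out = dict(base)
    agrees = guide_heads[dep] == head
    out[f"guide_has_arc={agrees}"] = 1.0
    return out


# Hypothetical arc features for a candidate arc in a two-token sentence.
arc = {"pos_pair=NN_VB": 1.0}
print(shared_features(arc, "tiger"))
print(guided_features(arc, head=2, dep=1, guide_heads=[-1, 2, 0]))
```

In both cases the parser itself is unchanged; only the feature vector is extended, which is why such methods carry over to any feature-based parser.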
