Self-Training PCFG Grammars with Latent Annotations Across Languages

We investigate the effectiveness of self-training PCFG grammars with latent annotations (PCFG-LA) for parsing languages with different amounts of labeled training data. Compared to Charniak's lexicalized parser, the PCFG-LA parser was more effectively adapted to a language for which parsing has been less well developed (i.e., Chinese) and benefited more from self-training. We show for the first time that self-training is able to significantly improve the performance of the PCFG-LA parser, a single generative parser, on both small and large amounts of labeled training data. Our approach achieves state-of-the-art parsing accuracies for a single parser on both English (91.5%) and Chinese (85.2%).

[1]  Eugene Charniak,et al.  Statistical Parsing with a Context-Free Grammar and Word Statistics , 1997, AAAI/IAAI.

[2]  Eugene Charniak,et al.  A Maximum-Entropy-Inspired Parser , 2000, ANLP.

[3]  David Chiang,et al.  Two Statistical Parsing Models Applied to the Chinese Treebank , 2000, ACL 2000.

[4]  Roger Levy,et al.  Is it Harder to Parse Chinese, or the Chinese Treebank? , 2003, ACL.

[5]  Michael Collins,et al.  Head-Driven Statistical Models for Natural Language Parsing , 2003, CL.

[6]  Mark Steedman,et al.  Bootstrapping statistical parsers from small datasets , 2003, EACL.

[7]  Eugene Charniak,et al.  Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking , 2005, ACL.

[8]  M. A. R T A P A L,et al.  The Penn Chinese TreeBank: Phrase structure annotation of a large corpus , 2005, Natural Language Engineering.

[9]  Jun'ichi Tsujii,et al.  Probabilistic CFG with Latent Annotations , 2005, ACL.

[10]  Roger Levy,et al.  Tregex and Tsurgeon: tools for querying and manipulating tree data structures , 2006, LREC.

[11]  Eugene Charniak,et al.  Effective Self-Training for Parsing , 2006, NAACL.

[12]  Dan Klein,et al.  Learning Accurate, Compact, and Interpretable Tree Annotation , 2006, ACL.

[13]  Dan Klein,et al.  Improved Inference for Unlexicalized Parsing , 2007, NAACL.

[14]  Ari Rappoport,et al.  Self-Training for Enhancement and Domain Adaptation of Statistical Parsers Trained on Small Datasets , 2007, ACL.

[15]  Wen Wang,et al.  Mandarin Part-of-Speech Tagging and Discriminative Reranking , 2007, EMNLP.

[16]  Eugene Charniak,et al.  When is Self-Training Effective for Parsing? , 2008, COLING.

[17]  Christopher D. Manning,et al.  Optimizing Chinese Word Segmentation for Machine Translation Performance , 2008, WMT@ACL.

[18]  Dan Klein,et al.  Sparse Multi-Scale Grammars for Discriminative Latent Variable Parsing , 2008, EMNLP.

[19]  Xavier Carreras,et al.  Simple Semi-supervised Dependency Parsing , 2008, ACL.

[20]  Dale Schuurmans,et al.  Semi-Supervised Convex Training for Dependency Parsing , 2008, ACL.

[21]  Liang Huang,et al.  Forest Reranking: Discriminative Parsing with Non-Local Features , 2008, ACL.

[22]  M. Harper,et al.  Chinese Statistical Parsing , 2009 .