Augmented Parsing of Unknown Word by Graph-Based Semi-Supervised Learning

This paper presents a novel method that uses graph-based semi-supervised learning (SSL) to improve the syntactic parsing of unknown words. Unlike conventional approaches, which use hand-crafted rules, rich morphological features, or a character-based model to handle unknown words, this method relies on graph-based label propagation. It yields larger improvements for grammars trained on a small amount of labeled data together with a large amount of unlabeled data. A transductive graph-based SSL method is employed to propagate POS tags and derive emission distributions from the labeled data to the unlabeled data. The derived distributions are then incorporated into the parsing process. The proposed method effectively augments the original supervised parsing model, contributing absolute improvements of 2.28% and 1.72% in POS tagging accuracy and syntactic parsing accuracy, respectively, on the Penn Chinese Treebank.
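The idea of propagating POS information over a graph can be illustrated with a minimal sketch. It assumes a plain iterative label propagation scheme in which labeled (seed) nodes stay clamped and each unlabeled node takes the weighted average of its neighbors' tag distributions. This is not necessarily the transductive algorithm used in the paper. The function name `propagate_labels`, the toy graph, its edge weights, and the two-tag set are all hypothetical.

```python
def propagate_labels(weights, seeds, num_tags, iterations=50):
    """Simple label propagation with clamped seed nodes (illustrative only).

    weights: dict mapping node -> {neighbor: edge weight}; every neighbor
             must also appear as a key.
    seeds:   dict mapping labeled node -> tag distribution (list of floats).
    Returns a dict mapping every node -> tag distribution.
    """
    uniform = [1.0 / num_tags] * num_tags
    # Unlabeled nodes start from a uniform distribution over tags.
    dist = {node: list(seeds.get(node, uniform)) for node in weights}

    for _ in range(iterations):
        updated = {}
        for node, neighbors in weights.items():
            if node in seeds:
                # Labeled nodes keep their observed distribution.
                updated[node] = list(seeds[node])
                continue
            total = sum(neighbors.values())
            if total == 0:
                # Isolated node: nothing to propagate, keep current value.
                updated[node] = dist[node]
                continue
            # Weighted average of the neighbors' current distributions.
            updated[node] = [
                sum(w * dist[m][t] for m, w in neighbors.items()) / total
                for t in range(num_tags)
            ]
        dist = updated
    return dist


# Hypothetical toy graph: two known words and two unknown words.
# Tag order (assumed for this example): index 0 = NN, index 1 = VV.
graph = {
    "known_noun": {"unk_a": 0.9},
    "known_verb": {"unk_a": 0.1},
    "unk_a": {"known_noun": 0.9, "known_verb": 0.1, "unk_b": 1.0},
    "unk_b": {"unk_a": 1.0},
}
seeds = {"known_noun": [1.0, 0.0], "known_verb": [0.0, 1.0]}

result = propagate_labels(graph, seeds, num_tags=2)
for word in ("unk_a", "unk_b"):
    print(word, [round(p, 3) for p in result[word]])
```

In a setting like the one the abstract describes, the propagated distributions for unknown words would then be used as emission estimates inside the parser; how that incorporation is done is not specified here and is left out of the sketch.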
