Chinese Statistical Parsing

This chapter describes several issues that are fundamental to achieving accurate Chinese parsing, given the available Chinese resources and the challenges of the GALE processing pipeline. For GALE, our parsing algorithm is expected to accurately parse a wide range of materials, from newswire text, which tends to be grammatically well formed, to n-best ASR outputs, many of which are poorly formed sentences. To address this challenge, we have re-implemented and enhanced the Berkeley parser to handle unknown Chinese words efficiently, parse difficult sentences robustly, and operate more efficiently. We also address three related issues: training the parser for several different genres given a limited number of available training trees; matching word segmentation to the treebank segmentation standard, which is important for accurate parsing; and standardizing tokenization to handle the variety of input that reaches the parser. Understanding and handling these issues is a prerequisite for achieving adequate parsing performance. Finally, we investigate self-training with automatically labeled in-domain data to enhance parsing performance given the limited number of trees in the Chinese treebanks.
