Chinese Statistical Parsing

This chapter describes several issues that are fundamental to achieving accurate Chinese parsing given the available Chinese resources and the challenges of the GALE processing pipeline. For GALE, the parsing algorithm must accurately parse a wide range of material, from newswire text, which tends to be grammatically well formed, to n-best ASR outputs, many of which are poorly formed sentences. To address this challenge, we have re-implemented and enhanced the Berkeley parser to handle unknown Chinese words efficiently, parse difficult sentences robustly, and operate more efficiently overall. We also address three related issues: training the parser for several different genres given a limited number of available training trees, matching word segmentation to the treebank segmentation standard to support accurate parsing, and standardizing tokenization to manage the kinds of input the parser will receive. Understanding and handling these issues is a prerequisite for achieving adequate parsing performance. Finally, we investigate self-training with automatically labeled in-domain data to improve parsing performance given the limited number of trees in the Chinese treebanks.

M. Harper | Zhongqiang Huang
