Paper Information - Automatic Adaptation of Annotation Standards: Chinese Word Segmentation and POS Tagging - A Case Study

Manually annotated corpora are valuable but scarce resources, yet for many annotation tasks such as treebanking and sequence labeling there exist multiple corpora with different and incompatible annotation guidelines or standards. This seems to be a great waste of human efforts, and it would be nice to automatically adapt one annotation standard to another. We present a simple yet effective strategy that transfers knowledge from a differently annotated corpus to the corpus with desired annotation. We test the efficacy of this method in the context of Chinese word segmentation and part-of-speech tagging, where no segmentation and POS tagging standards are widely accepted due to the lack of morphology in Chinese. Experiments show that adaptation from the much larger People's Daily corpus to the smaller but more popular Penn Chinese Treebank results in significant improvements in both segmentation and tagging accuracies (with error reductions of 30.2% and 14%, respectively), which in turn helps improve Chinese parsing accuracy.
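The "error reduction" figures quoted in the abstract are relative quantities. The sketch below shows how such a figure is conventionally computed, assuming error is defined as one minus an accuracy-style score. The function name and the input scores are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def relative_error_reduction(baseline_score: float, adapted_score: float) -> float:
    """Return the fraction of baseline errors removed by the adapted model.

    Both arguments are accuracy-style scores in [0, 1]; error is taken
    to be 1 - score (an assumption of this sketch).
    """
    baseline_err = 1.0 - baseline_score
    adapted_err = 1.0 - adapted_score
    if baseline_err <= 0.0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_err - adapted_err) / baseline_err


# Hypothetical scores for illustration only (not results from the paper).
print(f"{relative_error_reduction(0.90, 0.93):.1%}")
```

Under this definition, a small absolute gain on an already high score can correspond to a large relative error reduction, which is why the two kinds of figure should not be compared directly.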

Qun Liu | Wenbin Jiang | Liang Huang
