Automatic Adaptation of Annotation Standards: Chinese Word Segmentation and POS Tagging - A Case Study

Manually annotated corpora are valuable but scarce resources, yet for many annotation tasks such as treebanking and sequence labeling there exist multiple corpora with different and incompatible annotation guidelines or standards. This seems to be a great waste of human efforts, and it would be nice to automatically adapt one annotation standard to another. We present a simple yet effective strategy that transfers knowledge from a differently annotated corpus to the corpus with desired annotation. We test the efficacy of this method in the context of Chinese word segmentation and part-of-speech tagging, where no segmentation and POS tagging standards are widely accepted due to the lack of morphology in Chinese. Experiments show that adaptation from the much larger People's Daily corpus to the smaller but more popular Penn Chinese Treebank results in significant improvements in both segmentation and tagging accuracies (with error reductions of 30.2% and 14%, respectively), which in turn helps improve Chinese parsing accuracy.

[1]  Sabine Buchholz,et al.  CoNLL-X Shared Task on Multilingual Dependency Parsing , 2006, CoNLL.

[2]  Mark Steedman,et al.  CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank , 2007, CL.

[3]  Anoop Sarkar,et al.  Applying Co-Training Methods to Statistical Parsing , 2001, NAACL.

[4]  Thorsten Brants,et al.  The LinGO Redwoods Treebank: Motivation and Preliminary Applications , 2002, COLING.

[5]  Hal Daumé,et al.  Frustratingly Easy Domain Adaptation , 2007, ACL.

[6]  M. A. R T A P A L,et al.  The Penn Chinese TreeBank: Phrase structure annotation of a large corpus , 2005, Natural Language Engineering.

[7]  Nianwen Xue,et al.  Chinese Word Segmentation as LMR Tagging , 2003, SIGHAN.

[8]  Eric Brill,et al.  Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging , 1995, CL.

[9]  John Blitzer,et al.  Domain Adaptation with Structural Correspondence Learning , 2006, EMNLP.

[10]  David Chiang,et al.  Hierarchical Phrase-Based Translation , 2007, CL.

[11]  Hwee Tou Ng,et al.  Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based? , 2004, EMNLP.

[12]  David Chiang,et al.  Two Statistical Parsing Models Applied to the Chinese Treebank , 2000, ACL 2000.

[13]  Fernando Pereira,et al.  Online Learning of Approximate Dependency Parsing Algorithms , 2006, EACL.

[14]  Eric P. Xing,et al.  Stacking Dependency Parsers , 2008, EMNLP.

[15]  Daniel Marcu,et al.  Domain Adaptation for Statistical Classifiers , 2006, J. Artif. Intell. Res..

[16]  Andy Way,et al.  Automatic annotation of the Penn-treebank with LFG f-structureinformation , 2002 .

[17]  Qun Liu,et al.  A Cascaded Linear Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging , 2008, ACL.

[18]  Qun Liu,et al.  Parsing the Penn Chinese Treebank with Semantic Knowledge , 2005, IJCNLP.

[19]  Brian Roark,et al.  Incremental Parsing with the Perceptron Algorithm , 2004, ACL.

[20]  Jianfeng Gao,et al.  Adaptive Chinese Word Segmentation , 2004, ACL.

[21]  Fei Xia The Part-Of-Speech Tagging Guidelines for the Penn Chinese Treebank (3.0) , 2000 .

[22]  Michael Collins,et al.  Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[23]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[24]  Stephen Clark,et al.  Chinese Segmentation with a Word-Based Perceptron Algorithm , 2007, ACL.

[25]  Joakim Nivre,et al.  Integrating Graph-Based and Transition-Based Dependency Parsers , 2008, ACL.

[26]  Koby Crammer,et al.  Online Large-Margin Training of Dependency Parsers , 2005, ACL.