Joint Chinese Word Segmentation and POS Tagging on Heterogeneous Annotated Corpora with Multiple Task Learning

Chinese word segmentation and part-ofspeech tagging (S&T) are fundamental steps for more advanced Chinese language processing tasks. Recently, it has attracted more and more research interests to exploit heterogeneous annotation corpora for Chinese S&T. In this paper, we propose a unified model for Chinese S&T with heterogeneous annotation corpora. We first automatically construct a loose and uncertain mapping between two representative heterogeneous corpora, Penn Chinese Treebank (CTB) and PKU’s People’s Daily (PPD). Then we regard the Chinese S&T with heterogeneous corpora as two “related” tasks and train our model on two heterogeneous corpora simultaneously. Experiments show that our method can boost the performances of both of the heterogeneous corpora by using the shared information, and achieves significant improvements over the state-of-the-art methods.

[1]  Qun Liu,et al.  Automatic Adaptation of Annotation Standards: Chinese Word Segmentation and POS Tagging - A Case Study , 2009, ACL/IJCNLP.

[2]  Thorsten Joachims,et al.  Learning structural SVMs with latent variables , 2009, ICML '09.

[3]  Qun Liu,et al.  Iterative Annotation Transformation with Predict-Self Reestimation for Chinese Word Segmentation , 2012, EMNLP-CoNLL.

[4]  Xiao Chen,et al.  The Fourth International Chinese Language Processing Bakeoff: Chinese Word Segmentation, Named Entity Recognition and Chinese POS Tagging , 2008, IJCNLP.

[5]  Hwee Tou Ng,et al.  Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based? , 2004, EMNLP.

[6]  Xuanjing Huang,et al.  FudanNLP: A Toolkit for Chinese Natural Language Processing , 2013, ACL.

[7]  Koby Crammer,et al.  Online Passive-Aggressive Algorithms , 2003, J. Mach. Learn. Res..

[8]  Xuanjing Huang,et al.  Joint Segmentation and Tagging with Coupled Sequences Labeling , 2012, COLING.

[9]  Ben Taskar,et al.  An End-to-End Discriminative Approach to Machine Translation , 2006, ACL.

[10]  Qun Liu,et al.  A Cascaded Linear Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging , 2008, ACL.

[11]  Xuanjing Huang,et al.  A Unified Model for Joint Chinese Word Segmentation and POS Tagging with Heterogeneous Annotation Corpora , 2013, 2013 International Conference on Asian Language Processing.

[12]  Koby Crammer,et al.  Ultraconservative Online Algorithms for Multiclass Problems , 2001, J. Mach. Learn. Res..

[13]  F. Xia,et al.  The Part-Of-Speech Tagging Guidelines for the Penn Chinese Treebank (3.0) , 2000 .

[14]  Jianfeng Gao,et al.  Adaptive Chinese Word Segmentation , 2004, ACL.

[15]  Tong Zhang,et al.  A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , 2005, J. Mach. Learn. Res..

[16]  Weiwei Sun,et al.  A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging , 2011, ACL.

[17]  Shai Ben-David,et al.  Exploiting Task Relatedness for Mulitple Task Learning , 2003, COLT.

[18]  Weiwei Sun,et al.  Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations , 2012, ACL.

[19]  Michael Collins,et al.  Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[20]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.