Improving Chinese Word Segmentation and POS Tagging with Semi-supervised Methods Using Large Auto-Analyzed Data

This paper presents a simple yet effective semi-supervised method for improving Chinese word segmentation and POS tagging. We introduce novel features derived from large auto-analyzed data and use them to enhance a simple pipelined system. The auto-analyzed data are generated by running a baseline system over unlabeled data. We evaluate the approach in a series of experiments on the Penn Chinese Treebanks and show that the new features yield substantial performance gains in every experiment. Furthermore, the results of the proposed method exceed the best results previously reported in the literature.
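The general idea of deriving features from auto-analyzed data can be sketched as follows. This is a minimal illustration under stated assumptions, not the feature set used in the paper: the baseline here is a toy maximum-matching segmenter, the tags are a simple B/I scheme, and the bucket thresholds and all function names (`baseline_segment`, `collect_counts`, `auto_features`) are hypothetical.

```python
from collections import Counter


def baseline_segment(sentence, lexicon, max_len=4):
    """Toy stand-in for a baseline system: greedy forward maximum matching."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            if n == 1 or sentence[i:i + n] in lexicon:
                words.append(sentence[i:i + n])
                i += n
                break
    return words


def to_tags(words):
    """Turn a word sequence into (character, tag) pairs with B/I tags."""
    return [(ch, "B" if k == 0 else "I") for w in words for k, ch in enumerate(w)]


def collect_counts(unlabeled, lexicon):
    """Auto-analyze unlabeled sentences and count (character, tag) pairs."""
    counts = Counter()
    for sentence in unlabeled:
        counts.update(to_tags(baseline_segment(sentence, lexicon)))
    return counts


def bucket(count, high=3, low=1):
    """Map a raw count to a coarse label (thresholds are assumptions)."""
    if count >= high:
        return "H"
    if count >= low:
        return "L"
    return "N"


def auto_features(sentence, counts):
    """Per-character features that a supervised tagger could consume."""
    return [
        {f"auto_{tag}": bucket(counts[(ch, tag)]) for tag in ("B", "I")}
        for ch in sentence
    ]
```

In a setup like this, the bucketed labels would be appended to the ordinary supervised features of the segmenter or tagger, which is then retrained on the labeled treebank data. Bucketing rather than using raw counts is one way to keep the features robust to noise in the automatic analysis.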
