A Study of the Effectiveness of Suffixes for Chinese Word Segmentation

We investigate whether suffix-related features can significantly improve the performance of character-based approaches to Chinese word segmentation (CWS). Because suffixes are highly productive in forming new words, and out-of-vocabulary (OOV) words are the main source of CWS errors, many researchers expect suffix information to improve performance further. Following this expectation, we tested several suffix-related features in both generative and discriminative approaches. However, our experimental results show that adding suffix-related features to the widely adopted surface features hardly yields any significant improvement, contrary to the common supposition. Error analysis reveals that the main cause of this surprising finding is the conflict between the degree of reliability and the coverage rate of suffix-related features.
