Which is More Suitable for Chinese Word Segmentation, the Generative Model or the Discriminative One?

Because the traditional word-based n-gram model, a generative approach, cannot handle out-of-vocabulary (OOV) words in the test set, the character-based discriminative approach has been widely adopted in recent years. However, the discriminative model, though more robust to OOV words, fails to deliver satisfactory performance on in-vocabulary (IV) words that have been observed before. An analysis of the word-based approach shows that its excellent performance on IV words stems from its ability to capture the dependency between adjacent characters within a word, a dependency that humans are believed to rely on when segmenting text. To incorporate this intra-word character dependency, this paper proposes a character-based approach with a generative model. Experiments on the second SIGHAN Bakeoff show that the proposed model not only achieves a good balance between IV and OOV words, but also outperforms the above-mentioned well-known approaches under similar conditions.
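
The abstract does not spell out the model, so the following is only a minimal sketch of what a character-based generative segmenter could look like. It assumes a bigram over (character, tag) pairs with B/M/E/S position tags, additive smoothing, and Viterbi decoding; the class name `CharTagBigram`, the helper `to_pairs`, the tag set, the n-gram order, and the smoothing constant are all illustrative assumptions, not details taken from the paper.

```python
import math
from collections import defaultdict


def to_pairs(words):
    """Turn a segmented sentence into (character, tag) pairs (assumed B/M/E/S tags)."""
    pairs = []
    for w in words:
        if len(w) == 1:
            pairs.append((w, "S"))
        else:
            tags = ["B"] + ["M"] * (len(w) - 2) + ["E"]
            pairs.extend(zip(w, tags))
    return pairs


class CharTagBigram:
    """Hypothetical generative model: P(pairs) ~ product of P(pair_i | pair_{i-1})."""

    START = ("<s>", "<s>")
    # Tag transitions that keep the tag sequence a valid segmentation.
    ALLOWED = {"<s>": "BS", "B": "ME", "M": "ME", "E": "BS", "S": "BS"}

    def __init__(self, alpha=0.1):
        self.alpha = alpha                # additive smoothing constant (assumption)
        self.bigram = defaultdict(int)    # counts of (previous pair, pair)
        self.context = defaultdict(int)   # counts of previous pair
        self.vocab = set()                # distinct (character, tag) pairs seen

    def train(self, corpus):
        for words in corpus:
            prev = self.START
            for pair in to_pairs(words):
                self.bigram[(prev, pair)] += 1
                self.context[prev] += 1
                self.vocab.add(pair)
                prev = pair

    def logp(self, prev, pair):
        v = len(self.vocab) + 1           # one extra slot for unseen pairs
        num = self.bigram.get((prev, pair), 0) + self.alpha
        den = self.context.get(prev, 0) + self.alpha * v
        return math.log(num / den)

    def segment(self, text):
        if not text:
            return []
        # best[tag] = (log-probability, tag path) of the best path ending in tag
        best = {"<s>": (0.0, [])}
        prev_char = "<s>"
        for ch in text:
            nxt = {}
            for prev_tag, (score, path) in best.items():
                for tag in self.ALLOWED[prev_tag]:
                    s = score + self.logp((prev_char, prev_tag), (ch, tag))
                    if tag not in nxt or s > nxt[tag][0]:
                        nxt[tag] = (s, path + [tag])
            best, prev_char = nxt, ch
        # A sentence must end on a word-final tag.
        finals = [v for t, v in best.items() if t in "ES"]
        _, tags = max(finals, key=lambda v: v[0])
        words, cur = [], ""
        for ch, tag in zip(text, tags):
            cur += ch
            if tag in "ES":
                words.append(cur)
                cur = ""
        return words


model = CharTagBigram()
model.train([["我们", "喜欢", "北京"], ["我", "喜欢", "你们"]])
print(model.segment("我们喜欢北京"))
```

Because each probability is conditioned on the previous character-tag pair, the sketch scores adjacent characters inside a word jointly, which is the intra-word dependency the abstract refers to, while still assigning non-zero probability to unseen characters.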
