A MMSM-based Hybrid Method for Chinese MicroBlog Word Segmentation

After years of research, Chinese word segmentation has achieved quite high precision on formal-style text. However, segmentation performance on MicroBlog corpora remains unsatisfactory. In this paper we describe a scheme for Chinese MicroBlog word segmentation that integrates character-based and word-based information in the directed graph generated by the MMSM model. Word-level information is effective for analyzing known words, while character-level information is useful for analyzing unknown words. We also propose a multi-chain unequal-states CRF model, which has two state chains with unequal states and can therefore recognize POS tags simultaneously. The hybrid model proved effective and has been adopted in a real-world system.
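To make the idea of combining word-level and character-level candidates in one directed graph more concrete, the following is a minimal sketch, not the MMSM model itself, which the abstract does not specify. It assumes a hypothetical lexicon that maps known words to costs, adds a single-character edge at every position so that unknown strings remain segmentable, and picks the cheapest path through the graph by dynamic programming. The names `build_lattice`, `best_path`, `segment`, `char_cost`, and `max_len`, as well as all cost values, are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, float]  # (end position, cost of the token text[start:end])


def build_lattice(text: str, lexicon: Dict[str, float],
                  char_cost: float = 5.0, max_len: int = 4) -> List[List[Edge]]:
    """Build a directed graph over character positions.

    edges[i] lists every candidate token starting at position i.
    Costs are hypothetical; lower means more plausible.
    """
    n = len(text)
    edges: List[List[Edge]] = [[] for _ in range(n)]
    for i in range(n):
        # Character-level edge: always present, so unknown words can be covered.
        edges[i].append((i + 1, lexicon.get(text[i], char_cost)))
        # Word-level edges: only for multi-character strings found in the lexicon.
        for j in range(i + 2, min(n, i + max_len) + 1):
            word = text[i:j]
            if word in lexicon:
                edges[i].append((j, lexicon[word]))
    return edges


def best_path(text: str, edges: List[List[Edge]]) -> List[str]:
    """Return the lowest-cost segmentation via dynamic programming."""
    n = len(text)
    inf = float("inf")
    cost = [inf] * (n + 1)
    back = [-1] * (n + 1)
    cost[0] = 0.0
    for i in range(n):
        if cost[i] == inf:
            continue
        for j, c in edges[i]:
            if cost[i] + c < cost[j]:
                cost[j] = cost[i] + c
                back[j] = i
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens: List[str] = []
    j = n
    while j > 0:
        i = back[j]
        tokens.append(text[i:j])
        j = i
    return tokens[::-1]


def segment(text: str, lexicon: Dict[str, float]) -> List[str]:
    return best_path(text, build_lattice(text, lexicon))


if __name__ == "__main__":
    toy_lexicon = {"中文": 1.0, "微博": 1.0, "分词": 1.0}  # hypothetical costs
    print(segment("中文微博分词", toy_lexicon))
```

In this sketch known words win because their edges are cheaper than a run of single-character edges, while any character outside the lexicon still gets a path through the graph. A real system would presumably replace the fixed costs with scores from trained models.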
