Using context and semantic resources for cross-domain word segmentation

Chinese word segmentation (CWS) plays a fundamental role in Chinese language processing, because almost all Chinese language processing tasks assume segmented input. After many years of active research, most systems reported in evaluation tasks achieve impressive results. However, most of these results are limited to test corpora from a specific domain; once a system is applied to a different domain, its accuracy plummets. For this reason, domain-adaptive word segmentation was introduced into the Bakeoffs. In this paper, we propose a new joint decoding strategy that combines character-based and word-based conditional random field models and uses the part-of-speech of dictionary words as important features of a segmentation path. Moreover, in line with the characteristics of cross-domain segmentation, context information is used to guide CWS. In addition, because synonyms occur in similar contexts, semantic information can be used to recall some out-of-vocabulary (OOV) words. Several experiments on the simplified Chinese test data from SIGHAN Bakeoff 2010 show that the method is effective. Except for the literature domain, the F-scores are higher than the best performance in the corresponding open test. In addition, OOV recall reaches 70.7%, 84.3%, 79.0%, and 86.2% across the test domains, respectively.
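The F-score and OOV recall quoted above are not defined in the abstract. The following is a minimal sketch of how these metrics are commonly computed for word segmentation, by matching character spans between a gold segmentation and a predicted one. It is not the paper's own scoring script: the function names (`to_spans`, `score`), the example sentence, and the toy vocabulary are all assumptions made for illustration.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def score(gold, pred, vocab):
    """Return (precision, recall, f_score, oov_recall) for one sentence.

    gold, pred: lists of words covering the same character sequence.
    vocab: set of words assumed to be known from training (hypothetical).
    """
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0

    # OOV recall: share of gold words outside the vocabulary whose
    # exact span also appears in the predicted segmentation.
    oov_total = oov_hit = 0
    start = 0
    for word in gold:
        span = (start, start + len(word))
        if word not in vocab:
            oov_total += 1
            if span in pred_spans:
                oov_hit += 1
        start += len(word)
    oov_recall = oov_hit / oov_total if oov_total else 0.0
    return precision, recall, f_score, oov_recall


# Illustrative (made-up) example: the predicted segmentation splits an OOV word.
gold = ["我", "喜欢", "自然语言", "处理"]
pred = ["我", "喜欢", "自然", "语言", "处理"]
vocab = {"我", "喜欢", "自然", "语言", "处理"}
precision, recall, f_score, oov_recall = score(gold, pred, vocab)
```

In this toy case three of the five predicted words match gold spans, so precision is 0.6 and recall is 0.75; the single OOV word is split incorrectly, so OOV recall is 0. A corpus-level score would accumulate the counts over all sentences before dividing.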
