An Empirical Study of Automatic Chinese Word Segmentation for Spoken Language Understanding and Named Entity Recognition

Word segmentation is usually regarded as the first step in many Chinese natural language processing tasks, yet its impact on these subsequent tasks is relatively under-studied. For example, how can the mismatch problem be solved when an existing word segmenter is applied to new data? Does a better word segmenter yield better performance on the subsequent NLP task? In this work, we make an initial attempt to answer these questions on two related subsequent tasks: semantic slot filling in spoken language understanding and named entity recognition. We propose three techniques to address the mismatch problem: using word segmentation outputs as additional features, adaptation with partial-learning, and exploiting n-best word segmentation lists. Experimental results demonstrate the effectiveness of these techniques for both tasks, and we achieve error reductions of about 11% for spoken language understanding and 24% for named entity recognition over the baseline systems.
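As an illustration of the first technique, the sketch below shows one plausible way to turn a segmenter's output into per-character features for a character-level sequence labeler. The B/I/E/S tag scheme, the function names `segmentation_to_tags` and `char_features`, and the feature keys are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List


def segmentation_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one position tag per character.

    Assumed scheme: S = single-character word, B = word begin,
    I = word inside, E = word end.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens from the segmenter
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def char_features(words: List[str]) -> List[Dict[str, str]]:
    """Build one feature dict per character.

    The segmentation tag is added next to plain character features,
    so the labeler can use the segmenter's output as a soft hint
    rather than as a hard tokenization.
    """
    chars = [ch for word in words for ch in word]
    tags = segmentation_to_tags(words)
    features = []
    for i, (ch, tag) in enumerate(zip(chars, tags)):
        features.append({
            "char": ch,
            "prev_char": chars[i - 1] if i > 0 else "<BOS>",
            "next_char": chars[i + 1] if i + 1 < len(chars) else "<EOS>",
            "seg_tag": tag,  # hypothetical feature name
        })
    return features


if __name__ == "__main__":
    # Hypothetical segmenter output for a short query.
    segmented = ["我", "想", "去", "北京", "大学"]
    for feats in char_features(segmented):
        print(feats)
```

Because the segmentation enters only as a feature, an error by the segmenter on new data can in principle be outweighed by the other features, which is the motivation for treating the output as additional evidence instead of a fixed preprocessing step.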