Is Word Segmentation Necessary for Deep Learning of Chinese Representations?

Segmenting text into words is usually the first step in processing Chinese text, but its necessity has rarely been explored. In this paper, we ask the fundamental question of whether Chinese word segmentation (CWS) is necessary for deep learning-based Chinese natural language processing (NLP). We benchmark neural word-based models, which rely on word segmentation, against neural char-based models, which do not, on four end-to-end NLP tasks: language modeling, machine translation, sentence matching/paraphrase, and text classification. Through direct comparisons between these two types of models, we find that char-based models consistently outperform word-based models. Based on these observations, we conduct comprehensive experiments to study why word-based models underperform char-based models in these tasks. We show that word-based models are more vulnerable to data sparsity and the presence of out-of-vocabulary (OOV) words, and are thus more prone to overfitting. We hope this paper encourages researchers in the community to rethink the necessity of word segmentation in deep learning-based Chinese NLP.
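To make the OOV argument concrete, the following is a minimal sketch on a hypothetical, hand-segmented toy corpus. It is not the paper's data or experimental setup, and the helper names (`flatten_words`, `flatten_chars`, `oov_rate`) are illustrative assumptions. It only shows how the same test text can contain unseen tokens at the word level while every character has already appeared in training.

```python
# Toy illustration (hypothetical data): OOV rate at word level vs. char level.
# Sentences are already segmented by hand; no real segmenter is used here.

train = [
    ["我", "喜欢", "学习", "语言"],
    ["他", "研究", "自然", "语言", "处理"],
]
test = [
    ["他", "喜欢", "语言学", "研究"],
    ["我", "学", "自然语言"],
]


def flatten_words(sentences):
    """Return the word-level token sequence of a segmented corpus."""
    return [word for sent in sentences for word in sent]


def flatten_chars(sentences):
    """Return the char-level token sequence, ignoring word boundaries."""
    return [ch for sent in sentences for word in sent for ch in word]


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


word_oov = oov_rate(flatten_words(train), flatten_words(test))
char_oov = oov_rate(flatten_chars(train), flatten_chars(test))

print(f"word-level OOV rate: {word_oov:.2f}")
print(f"char-level OOV rate: {char_oov:.2f}")
```

In this toy case three of the seven test words ("语言学", "学", "自然语言") are unseen as words, while all of their characters occur in the training sentences, so the char-level OOV rate is zero. Real corpora are far larger, and the gap will depend on the segmenter and the domain.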
