Is Chinese Word Segmentation a Solved Task? Rethinking Neural Chinese Word Segmentation

The performance of Chinese Word Segmentation (CWS) systems has gradually reached a plateau with the rapid development of deep neural networks, especially the successful use of large pre-trained models. In this paper, we take stock of what has been achieved and rethink what is left to do in the CWS task. Methodologically, we propose a fine-grained evaluation for existing CWS systems, which not only allows us to diagnose the strengths and weaknesses of existing models (under the in-dataset setting), but also enables us to quantify the discrepancy between different criteria and alleviate the negative transfer problem in multi-criteria learning. Strategically, although we do not aim to propose a novel model in this paper, our comprehensive experiments on eight models and seven datasets, together with a thorough analysis, point to some promising directions for future research. We make all code publicly available and release an interface that can quickly evaluate and diagnose users' models: this https URL.
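As a rough illustration of what a fine-grained evaluation can look like, the sketch below breaks the usual word-level F1 score down by word length, so that a model's weakness on longer words is not hidden by a single aggregate number. The choice of word length as the attribute, the function names (`to_spans`, `bucketed_f1`), and the `max_len` cutoff are assumptions made for this example and are not taken from the paper or its released interface.

```python
from collections import defaultdict

def to_spans(words):
    """Convert a segmented sentence (list of words) into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def f1(correct, n_pred, n_gold):
    """Standard F1 from counts; returns 0.0 when precision and recall are both zero."""
    p = correct / n_pred if n_pred else 0.0
    r = correct / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def bucketed_f1(gold_sents, pred_sents, max_len=4):
    """Word-level F1 per word-length bucket (hypothetical attribute).

    Words of length >= max_len share the last bucket.
    """
    stats = defaultdict(lambda: [0, 0, 0])  # bucket -> [correct, n_pred, n_gold]
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = to_spans(gold), to_spans(pred)
        for s, e in p:
            bucket = min(e - s, max_len)
            stats[bucket][1] += 1
            if (s, e) in g:  # a predicted word is correct only if both boundaries match
                stats[bucket][0] += 1
        for s, e in g:
            stats[min(e - s, max_len)][2] += 1
    return {b: f1(*counts) for b, counts in sorted(stats.items())}

# Toy example: the prediction wrongly splits the two-character word "喜欢".
gold = [["我", "喜欢", "北京"]]
pred = [["我", "喜", "欢", "北京"]]
scores = bucketed_f1(gold, pred)
```

In this toy case the over-segmentation lowers precision in the length-1 bucket and recall in the length-2 bucket, which is the kind of per-attribute signal an aggregate F1 score would blur.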
