Robust Chinese Word Segmentation with Contextualized Word Representations

In recent years, neural-network-based methods have substantially improved the accuracy of Chinese word segmentation. However, the error rate on out-of-vocabulary (OOV) words remains high. We use a simple bidirectional LSTM architecture and a large-scale pretrained language model to generate high-quality contextualized character representations. These representations mitigate the ambiguity of character meaning that is widespread in Chinese, and thereby effectively reduce the OOV error rate. The approach achieves state-of-the-art performance on many datasets.
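To make the OOV measure concrete, the following is a minimal sketch of one common way to score segmentation on out-of-vocabulary words: OOV recall, the fraction of gold words unseen in training whose exact character span is recovered by the segmenter. The function names, the toy vocabulary, and the example sentence are illustrative assumptions and do not come from the paper, which may define its OOV metric differently.

```python
def to_spans(words):
    """Convert a list of words into a list of (start, end) character spans."""
    spans, start = [], 0
    for w in words:
        spans.append((start, start + len(w)))
        start += len(w)
    return spans


def oov_recall(gold_words, pred_words, train_vocab):
    """Fraction of gold OOV words whose exact span appears in the prediction.

    A word counts as OOV if it is absent from train_vocab (an assumed
    set of words seen during training).
    """
    pred_spans = set(to_spans(pred_words))
    oov_total = oov_hit = 0
    for word, span in zip(gold_words, to_spans(gold_words)):
        if word not in train_vocab:
            oov_total += 1
            oov_hit += span in pred_spans
    # By convention here, return 0.0 when the sentence has no OOV words.
    return oov_hit / oov_total if oov_total else 0.0


# Hypothetical toy data: "乌鲁木齐" is not in the training vocabulary.
train_vocab = {"我", "喜欢", "北京"}
gold = ["我", "喜欢", "乌鲁木齐"]
pred_wrong = ["我", "喜欢", "乌鲁", "木齐"]   # OOV word split in two
pred_right = ["我", "喜欢", "乌鲁木齐"]       # OOV word recovered

recall_wrong = oov_recall(gold, pred_wrong, train_vocab)
recall_right = oov_recall(gold, pred_right, train_vocab)
```

Comparing spans rather than word strings ensures that a predicted word only counts when it covers the same characters at the same position as the gold word.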
