Improving Chinese Word Segmentation with Wordhood Memory Networks

Contextual features play an important role in Chinese word segmentation (CWS). Wordhood information, one such contextual feature, has proven useful in many conventional character-based segmenters. However, it has received less attention in recent neural models, and designing a framework that properly integrates wordhood information from different wordhood measures into existing neural frameworks remains challenging. In this paper, we therefore propose a neural framework, WMSeg, which uses memory networks to incorporate wordhood information into several popular encoder-decoder combinations for CWS. Experimental results on five benchmark datasets indicate that the memory mechanism successfully models wordhood information for neural segmenters and helps WMSeg achieve state-of-the-art performance on all of them. Further experiments and analyses also demonstrate the robustness of the proposed framework with respect to different wordhood measures and the efficiency of wordhood information in cross-domain experiments.
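
To make "wordhood measure" concrete, the following is a minimal sketch of one well-known measure, accessor variety (AV), which scores a character n-gram by the smaller of its counts of distinct left and right neighbouring characters. The abstract does not say which measures WMSeg uses, so the choice of AV is an assumption made for illustration. The function name `accessor_variety`, the `max_len` parameter, and the toy corpus are likewise hypothetical, and treating each sentence boundary as a single context marker is a simplification.

```python
from collections import defaultdict


def accessor_variety(corpus, max_len=3):
    """Return {ngram: AV score} for every n-gram of 1..max_len characters.

    AV(s) = min(number of distinct left neighbours of s,
                number of distinct right neighbours of s).
    """
    left = defaultdict(set)
    right = defaultdict(set)
    for sent in corpus:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                gram = sent[i:j]
                # Simplification (assumption): a sentence boundary is one marker.
                left[gram].add(sent[i - 1] if i > 0 else "<s>")
                right[gram].add(sent[j] if j < n else "</s>")
    return {g: min(len(left[g]), len(right[g])) for g in left}


# Toy corpus, invented for this example.
corpus = ["我喜欢北京", "他去北京了", "北京很大"]
scores = accessor_variety(corpus)
print(scores["北京"])  # 3: three distinct neighbours on each side
print(scores["北"])    # 1: always followed by the same character
```

A high score suggests that an n-gram occurs in varied contexts and is therefore more word-like. Scores of this kind are the sort of signal that a memory module could attach to candidate n-grams around each character.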
