Subword Encoding in Lattice LSTM for Chinese Word Segmentation

We investigate a lattice LSTM network for Chinese word segmentation (CWS) to utilize words or subwords. It integrates the character sequence features with all subsequences information matched from a lexicon. The matched subsequences serve as information shortcut tunnels which link their start and end characters directly. Gated units are used to control the contribution of multiple input links. Through formula derivation and comparison, we show that the lattice LSTM is an extension of the standard LSTM with the ability to take multiple inputs. Previous lattice LSTM model takes word embeddings as the lexicon input, we prove that subword encoding can give the comparable performance and has the benefit of not relying on any external segmentor. The contribution of lattice LSTM comes from both lexicon and pretrained embeddings information, we find that the lexicon information contributes more than the pretrained embeddings information through controlled experiments. Our experiments show that the lattice structure with subword encoding gives competitive or better results with previous state-of-the-art methods on four segmentation benchmarks. Detailed analyses are conducted to compare the performance of word encoding and subword encoding in lattice LSTM. We also investigate the performance of lattice LSTM structure under different circumstances and when this model works or fails.

[1]  Hai Zhao,et al.  Fast and Accurate Neural Word Segmentation for Chinese , 2017, ACL.

[2]  Xu Sun,et al.  Dependency-based Gated Recursive Neural Network for Chinese Word Segmentation , 2016, ACL.

[3]  Rongrong Ji,et al.  Lattice-Based Recurrent Neural Network Encoders for Neural Machine Translation , 2016, AAAI.

[4]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[5]  Philip Gage,et al.  A new algorithm for data compression , 1994 .

[6]  Yue Zhang,et al.  Word-Context Character Embeddings for Chinese Word Segmentation , 2017, EMNLP.

[7]  Yue Zhang,et al.  Chinese NER Using Lattice LSTM , 2018, ACL.

[8]  Nianwen Xue,et al.  Chinese Word Segmentation as Character Tagging , 2003, ROCLING/IJCLCLP.

[9]  Baobao Chang,et al.  Max-Margin Tensor Neural Network for Chinese Word Segmentation , 2014, ACL.

[10]  Hai Zhao,et al.  Neural Word Segmentation Learning for Chinese , 2016, ACL.

[11]  Ji Ma,et al.  State-of-the-art Chinese Word Segmentation with Bi-LSTMs , 2018, EMNLP.

[12]  Yue Zhang,et al.  NCRF++: An Open-source Neural Sequence Labeling Toolkit , 2018, ACL.

[13]  Peng Qian,et al.  Overview of the NLPCC-ICCPOL 2016 Shared Task: Chinese Word Segmentation for Micro-Blog Texts , 2016, NLPCC/ICCPOL.

[14]  Benjamin Heinzerling,et al.  BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages , 2017, LREC.

[15]  Thomas Emerson,et al.  The Second International Chinese Word Segmentation Bakeoff , 2005, IJCNLP.

[16]  Hai Zhao,et al.  Effective Tag Set Selection in Chinese Word Segmentation via Conditional Random Field Modeling , 2006, PACLIC.

[17]  Yue Zhang,et al.  Neural Word Segmentation with Rich Pretraining , 2017, ACL.

[18]  Bo Xu,et al.  Convolutional Neural Network with Word Embeddings for Chinese Word Segmentation , 2017, IJCNLP.

[19]  Yijia Liu,et al.  Exploring Segment Representations for Neural Segmentation Models , 2016, IJCAI.

[20]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[21]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[22]  Hongyu Guo,et al.  DAG-Structured Long Short-Term Memory for Semantic Compositionality , 2016, NAACL.

[23]  Matthias Sperber,et al.  Neural Lattice-to-Sequence Models for Uncertain Inputs , 2017, EMNLP.

[24]  Weiwei Sun Word-based and Character-based Word Segmentation Models: Comparison and Combination , 2010, COLING.

[25]  Xuanjing Huang,et al.  Gated Recursive Neural Network for Chinese Word Segmentation , 2015, ACL.

[26]  Chilin Shih,et al.  A Stochastic Finite-State Word-Segmentation Algorithm for Chinese , 1994, ACL.

[27]  Eiichiro Sumita,et al.  Subword-based Tagging by Conditional Random Fields for Chinese Word Segmentation , 2006, NAACL.

[28]  Xiaoqing Zheng,et al.  Deep Learning for Chinese Word Segmentation and POS Tagging , 2013, EMNLP.

[29]  Rico Sennrich,et al.  Neural Machine Translation of Rare Words with Subword Units , 2015, ACL.

[30]  Lawrence D. Jackel,et al.  Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.

[31]  Xuanjing Huang,et al.  Long Short-Term Memory Neural Networks for Chinese Word Segmentation , 2015, EMNLP.

[32]  Stephen Clark,et al.  Chinese Segmentation with a Word-Based Perceptron Algorithm , 2007, ACL.

[33]  Andrew McCallum,et al.  Chinese Segmentation and New Word Detection using Conditional Random Fields , 2004, COLING.

[34]  Xuanjing Huang,et al.  DAG-based Long Short-Term Memory for Neural Word Segmentation , 2017, ArXiv.

[35]  Edward Fredkin,et al.  Trie memory , 1960, Commun. ACM.

[36]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[37]  Erhard W. Hinrichs,et al.  Accurate Linear-Time Chinese Word Segmentation via Embedding Matching , 2015, ACL.

[38]  Yue Zhang,et al.  Transition-Based Neural Word Segmentation , 2016, ACL.

[39]  Van Nostrand,et al.  Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm , 1967 .

[40]  Fei Xia,et al.  The Penn Chinese TreeBank: Phrase structure annotation of a large corpus , 2005, Natural Language Engineering.