A Two-Stage Deep Neural Network for Sequence Labeling

State-of-the-art sequence labeling systems require large amounts of task-specific knowledge in the form of handcrafted features and data pre-processing, and they are typically built on news corpora. An English-as-a-second-language (ESL) corpus, in contrast, is collected from articles written by English learners; such text is full of grammatical mistakes, which makes sequence labeling considerably more difficult. We propose a two-stage deep neural network architecture for sequence labeling that enables the higher layer to make use of the coarse-grained labeling information produced by the lower layer. We evaluate our model on three datasets covering three sequence labeling tasks: the Penn Treebank WSJ corpus for part-of-speech (POS) tagging, the CoNLL-2003 corpus for named entity recognition (NER), and the CoNLL-2013 corpus for grammatical error correction (GEC). We obtain state-of-the-art performance on all three datasets: 97.60% accuracy for POS tagging, 91.38% F1 for NER, and, for GEC, 38% F1 for determiner error correction and 28.89% F1 for preposition error correction. We also evaluate our system on the ESL corpus PiGai for POS tagging and obtain 96.73% accuracy. The implementation of our network is publicly available.
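
To make the two-stage idea concrete, below is a minimal sketch in PyTorch of one way such an architecture can be wired: a lower BiLSTM predicts coarse-grained labels, and the upper BiLSTM consumes the word representation together with the coarse-label distribution. The layer sizes, the use of plain BiLSTMs, and the class names here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TwoStageTagger(nn.Module):
    """Hypothetical two-stage tagger: coarse labels inform fine labels."""

    def __init__(self, vocab_size, emb_dim, hidden_dim,
                 n_coarse_labels, n_fine_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Stage 1: lower BiLSTM producing coarse-grained labels.
        self.lower = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        self.coarse_out = nn.Linear(2 * hidden_dim, n_coarse_labels)
        # Stage 2: upper BiLSTM that sees the word embedding concatenated
        # with the coarse-label distribution from stage 1.
        self.upper = nn.LSTM(emb_dim + n_coarse_labels, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.fine_out = nn.Linear(2 * hidden_dim, n_fine_labels)

    def forward(self, tokens):                      # tokens: (batch, seq)
        x = self.embed(tokens)                      # (batch, seq, emb_dim)
        h_low, _ = self.lower(x)                    # (batch, seq, 2*hidden)
        coarse_logits = self.coarse_out(h_low)      # (batch, seq, n_coarse)
        coarse_probs = coarse_logits.softmax(dim=-1)
        # The higher layer makes use of the lower layer's coarse labeling.
        h_up, _ = self.upper(torch.cat([x, coarse_probs], dim=-1))
        fine_logits = self.fine_out(h_up)           # (batch, seq, n_fine)
        return coarse_logits, fine_logits
```

In this sketch both stages can be trained jointly by summing a coarse and a fine cross-entropy loss, in the spirit of supervising low-level tasks at lower layers; the paper's actual training setup may differ.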
