A Two-Stage Deep Neural Network for Sequence Labeling

State-of-the-art sequence labeling systems require large amounts of task-specific knowledge in the form of handcrafted features and data pre-processing, and they are typically built on news corpora. An English-as-a-second-language (ESL) corpus, in contrast, is collected from articles written by English learners; such text is full of grammatical mistakes, which makes sequence labeling considerably more difficult. We propose a two-stage deep neural network architecture for sequence labeling that enables the higher layer to make use of the coarse-grained labeling information produced by the lower layer. We evaluate our model on three datasets covering three sequence labeling tasks: the Penn Treebank WSJ corpus for part-of-speech (POS) tagging, the CoNLL-2003 corpus for named entity recognition (NER), and the CoNLL-2013 corpus for grammatical error correction (GEC). We obtain state-of-the-art performance on all three datasets: 97.60% accuracy for POS tagging, 91.38% F1 for NER, and, for GEC, 38% F1 for determiner error correction and 28.89% F1 for preposition error correction. We also evaluate our system on the ESL corpus PiGai for POS tagging and obtain 96.73% accuracy. The implementation of our network is publicly available.
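
To make the two-stage idea concrete, below is a minimal sketch in PyTorch of one way such an architecture can be wired: a lower BiLSTM predicts coarse-grained labels, and the upper BiLSTM consumes the word representation together with the coarse-label distribution. The layer sizes, the use of plain BiLSTMs, and the class names here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TwoStageTagger(nn.Module):
    """Hypothetical two-stage tagger: coarse labels inform fine labels."""

    def __init__(self, vocab_size, emb_dim, hidden_dim,
                 n_coarse_labels, n_fine_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Stage 1: lower BiLSTM producing coarse-grained labels.
        self.lower = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        self.coarse_out = nn.Linear(2 * hidden_dim, n_coarse_labels)
        # Stage 2: upper BiLSTM that sees the word embedding concatenated
        # with the coarse-label distribution from stage 1.
        self.upper = nn.LSTM(emb_dim + n_coarse_labels, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.fine_out = nn.Linear(2 * hidden_dim, n_fine_labels)

    def forward(self, tokens):                      # tokens: (batch, seq)
        x = self.embed(tokens)                      # (batch, seq, emb_dim)
        h_low, _ = self.lower(x)                    # (batch, seq, 2*hidden)
        coarse_logits = self.coarse_out(h_low)      # (batch, seq, n_coarse)
        coarse_probs = coarse_logits.softmax(dim=-1)
        # The higher layer makes use of the lower layer's coarse labeling.
        h_up, _ = self.upper(torch.cat([x, coarse_probs], dim=-1))
        fine_logits = self.fine_out(h_up)           # (batch, seq, n_fine)
        return coarse_logits, fine_logits
```

In this sketch both stages can be trained jointly by summing a coarse and a fine cross-entropy loss, in the spirit of supervising low-level tasks at lower layers; the paper's actual training setup may differ.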
