An Empirical Exploration of Skip Connections for Sequential Tagging

In this paper, we empirically explore the effects of various kinds of skip connections in stacked bidirectional LSTMs for sequential tagging. We investigate three kinds of skip connections into LSTM cells: (a) skip connections to the gates, (b) skip connections to the internal states, and (c) skip connections to the cell outputs. We present comprehensive experiments showing that skip connections to cell outputs outperform the other two. Furthermore, we observe that using gated identity functions as skip mappings works well. Based on these novel skip connections, we successfully train deep stacked bidirectional LSTM models and obtain state-of-the-art results on CCG supertagging and comparable results on POS tagging.
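
As a rough illustration of variant (c), below is a minimal NumPy sketch of a single LSTM step in which a gated identity connection carries a lower layer's output directly into the cell output. The parameter names (W_s, b_s, and so on), the choice of what is skipped in (h_below), and the exact placement of the skip gate are assumptions made for illustration only, not the authors' exact formulation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step_with_output_skip(x_t, h_prev, c_prev, h_below, params):
    """One LSTM step for layer l at time t, with a gated identity skip
    connection added to the cell output (a sketch of variant (c)).

    x_t     : input to this layer at time t (output of layer l-1)
    h_prev  : this layer's hidden state at time t-1
    c_prev  : this layer's cell state at time t-1
    h_below : vector carried in by the skip connection (hypothetical choice:
              the hidden state of a lower layer)
    params  : dict of weight matrices and bias vectors (hypothetical names)
    """
    concat = np.concatenate([x_t, h_prev])

    i = sigmoid(params["W_i"] @ concat + params["b_i"])   # input gate
    f = sigmoid(params["W_f"] @ concat + params["b_f"])   # forget gate
    o = sigmoid(params["W_o"] @ concat + params["b_o"])   # output gate
    g = np.tanh(params["W_g"] @ concat + params["b_g"])   # candidate cell

    c = f * c_prev + i * g                                 # new cell state

    # Gated identity skip: gate s decides how much of the lower layer's
    # output is passed straight through to this layer's cell output.
    s = sigmoid(params["W_s"] @ concat + params["b_s"])
    h = o * np.tanh(c) + s * h_below

    return h, c

In a stacked bidirectional setting, a step like this would be applied in both directions at every layer, with the skip gate letting gradients bypass intermediate layers when that helps optimization.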
