Gradual Learning of Deep Recurrent Neural Networks

Recurrent Neural Networks (RNNs) achieve state-of-the-art results in many sequence-to-sequence modeling tasks. However, RNNs are difficult to train and tend to suffer from overfitting. Motivated by the Data Processing Inequality (DPI), we formulate the multi-layered network as a Markov chain and introduce a training method that combines gradual training of the network with layer-wise gradient clipping. We found that applying our methods, together with previously introduced regularization and optimization techniques, improves state-of-the-art architectures on language modeling tasks.
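The following is a minimal sketch (not the authors' code) of the two ideas named in the abstract: gradual training, in which recurrent layers are added one at a time and the deepened model continues training from the shallower one, and layer-wise gradient clipping, in which each layer's gradient norm is clipped separately rather than clipping a single global norm. All module names, shapes, and hyperparameters below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StackedLSTM(nn.Module):
    """Stacked LSTM language model whose depth can be grown one layer at a time."""

    def __init__(self, vocab_size=10000, emb_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.layers = nn.ModuleList([nn.LSTM(emb_dim, hidden_dim, batch_first=True)])
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def add_layer(self, hidden_dim=256):
        # Gradual training: stack a new LSTM layer on top of the existing ones.
        self.layers.append(nn.LSTM(hidden_dim, hidden_dim, batch_first=True))

    def forward(self, tokens):
        x = self.embed(tokens)
        for lstm in self.layers:
            x, _ = lstm(x)
        return self.decoder(x)

def clip_gradients_layerwise(model, max_norm=0.25):
    # Layer-wise gradient clipping: clip the gradient norm of each layer
    # independently instead of clipping the whole model's gradient at once.
    for module in [model.embed, *model.layers, model.decoder]:
        nn.utils.clip_grad_norm_(module.parameters(), max_norm)

# Illustrative training loop over random token batches (placeholder for a real corpus).
model = StackedLSTM()
criterion = nn.CrossEntropyLoss()

for phase in range(3):                                   # grow to 3 layers, one phase per depth
    optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
    for step in range(100):                              # train the current depth for a while
        tokens = torch.randint(0, 10000, (32, 35))       # (batch, seq_len)
        targets = torch.randint(0, 10000, (32, 35))
        logits = model(tokens)
        loss = criterion(logits.reshape(-1, 10000), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        clip_gradients_layerwise(model)
        optimizer.step()
    if phase < 2:
        model.add_layer()                                # deepen the network for the next phase
```

Recreating the optimizer at the start of each phase ensures the newly added layer's parameters are included; how the paper actually schedules the depth growth and chooses clipping thresholds is not specified here.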
