The exploding gradient problem demystified - definition, prevalence, impact, origin, tradeoffs, and solutions

Whereas it is commonly believed that techniques such as Adam, batch normalization and, more recently, SeLU nonlinearities "solve" the exploding gradient problem, we show that this is not the case in general: in a range of popular MLP architectures, exploding gradients exist and limit the depth to which networks can be effectively trained, both in theory and in practice. We explain why exploding gradients occur and highlight the *collapsing domain problem*, which can arise in architectures that avoid exploding gradients. ResNets have significantly lower gradients and can thus circumvent the exploding gradient problem, enabling the effective training of much deeper networks; we show that this is a direct consequence of the Pythagorean equation. By noticing that *any neural network is a residual network*, we devise the *residual trick*, which reveals that introducing skip connections simplifies the network mathematically, and that this simplicity may be the major cause of their success.
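To make the gradient contrast concrete, below is a minimal NumPy sketch (illustrative only, not code or an experiment from the paper) that compares the spectral norm of the layer-to-input Jacobian in a plain deep tanh MLP with the same layers rewritten in the residual form h_{l+1} = h_l + f_l(h_l). The width, depth, random seed, and Gaussian weight scale are arbitrary assumptions made for this example; the point is simply to show how one could measure the per-depth gradient growth produced by the two architectures.

```python
# Illustrative sketch only (not code from the paper): track how the spectral
# norm of the layer-to-input Jacobian evolves with depth in a plain tanh MLP
# versus the same layers wrapped in identity skip connections (the form
# h_{l+1} = h_l + f_l(h_l) used by the "residual trick").
# Width, depth, and the Gaussian weight scale below are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
width, depth = 64, 50
x = rng.standard_normal(width)
Ws = [rng.standard_normal((width, width)) * np.sqrt(2.0 / width) for _ in range(depth)]

def jacobian_norms(residual: bool):
    h = x.copy()
    J = np.eye(width)                     # Jacobian of h with respect to the input x
    norms = []
    for W in Ws:
        pre = W @ h
        layer_J = np.diag(1.0 - np.tanh(pre) ** 2) @ W   # d tanh(Wh) / dh
        if residual:
            h = h + np.tanh(pre)          # residual form: h_{l+1} = h_l + f_l(h_l)
            J = (np.eye(width) + layer_J) @ J
        else:
            h = np.tanh(pre)              # plain form: h_{l+1} = f_l(h_l)
            J = layer_J @ J
        norms.append(np.linalg.norm(J, 2))  # spectral norm (largest singular value)
    return norms

plain, res = jacobian_norms(False), jacobian_norms(True)
for l in range(9, depth, 10):
    print(f"layer {l + 1:3d}   plain |J| = {plain[l]:10.3e}   residual |J| = {res[l]:10.3e}")
```

Printing the accumulated Jacobian norm every ten layers makes the two growth curves directly comparable for whatever width, depth, and weight scale one chooses.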
