The exploding gradient problem demystified - definition, prevalence, impact, origin, tradeoffs, and solutions

Whereas it is commonly believed that techniques such as Adam, batch normalization and, more recently, SeLU nonlinearities "solve" the exploding gradient problem, we show that this is not the case in general: in a range of popular MLP architectures, exploding gradients exist and limit the depth to which networks can be effectively trained, both in theory and in practice. We explain why exploding gradients occur and highlight the *collapsing domain problem*, which can arise in architectures that avoid exploding gradients. ResNets have significantly lower gradients and can thus circumvent the exploding gradient problem, enabling the effective training of much deeper networks; we show that this is a direct consequence of the Pythagorean equation. By noticing that *any neural network is a residual network*, we devise the *residual trick*, which reveals that introducing skip connections simplifies the network mathematically, and that this simplicity may be the major cause of their success.
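To make the gradient contrast concrete, below is a minimal NumPy sketch (illustrative only, not code or an experiment from the paper) that compares the spectral norm of the layer-to-input Jacobian in a plain deep tanh MLP with the same layers rewritten in the residual form h_{l+1} = h_l + f_l(h_l). The width, depth, random seed, and Gaussian weight scale are arbitrary assumptions made for this example; the point is simply to show how one could measure the per-depth gradient growth produced by the two architectures.

```python
# Illustrative sketch only (not code from the paper): track how the spectral
# norm of the layer-to-input Jacobian evolves with depth in a plain tanh MLP
# versus the same layers wrapped in identity skip connections (the form
# h_{l+1} = h_l + f_l(h_l) used by the "residual trick").
# Width, depth, and the Gaussian weight scale below are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
width, depth = 64, 50
x = rng.standard_normal(width)
Ws = [rng.standard_normal((width, width)) * np.sqrt(2.0 / width) for _ in range(depth)]

def jacobian_norms(residual: bool):
    h = x.copy()
    J = np.eye(width)                     # Jacobian of h with respect to the input x
    norms = []
    for W in Ws:
        pre = W @ h
        layer_J = np.diag(1.0 - np.tanh(pre) ** 2) @ W   # d tanh(Wh) / dh
        if residual:
            h = h + np.tanh(pre)          # residual form: h_{l+1} = h_l + f_l(h_l)
            J = (np.eye(width) + layer_J) @ J
        else:
            h = np.tanh(pre)              # plain form: h_{l+1} = f_l(h_l)
            J = layer_J @ J
        norms.append(np.linalg.norm(J, 2))  # spectral norm (largest singular value)
    return norms

plain, res = jacobian_norms(False), jacobian_norms(True)
for l in range(9, depth, 10):
    print(f"layer {l + 1:3d}   plain |J| = {plain[l]:10.3e}   residual |J| = {res[l]:10.3e}")
```

Printing the accumulated Jacobian norm every ten layers makes the two growth curves directly comparable for whatever width, depth, and weight scale one chooses.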
