How to Start Training: The Effect of Initialization and Architecture

We investigate the effects of initialization and architecture on the start of training in deep ReLU nets. We identify two common failure modes for early training in which the mean and variance of activations are poorly behaved. For each failure mode, we give a rigorous proof of when it occurs at initialization and how to avoid it. The first failure mode, exploding or vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in. The second failure mode, exponentially large variance of activation length, can be avoided by keeping the sum of the reciprocals of the layer widths constant. We demonstrate empirically that these theoretical results predict when networks are able to start training. In particular, many popular initializations fail our criteria, whereas correct initialization and architecture allow much deeper networks to be trained.
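To make the two prescriptions concrete, here is a minimal sketch (not the authors' code) of a fully connected ReLU net in PyTorch whose weights are drawn from a symmetric distribution with variance 2/fan-in, and whose widths are chosen so that the sum of reciprocal layer widths stays small as depth grows. The layer widths and helper names are illustrative assumptions, not values taken from the paper.

```python
import math
import torch.nn as nn

def init_he_symmetric(layer: nn.Linear) -> None:
    """Draw weights i.i.d. from N(0, 2/fan_in) (a symmetric distribution
    with variance 2/fan-in) and zero the biases."""
    fan_in = layer.in_features
    nn.init.normal_(layer.weight, mean=0.0, std=math.sqrt(2.0 / fan_in))
    nn.init.zeros_(layer.bias)

def make_relu_net(widths):
    """Build a deep fully connected ReLU net with the above initialization.
    Per the second criterion, sum(1/width) over hidden layers should be kept
    small (roughly constant) as depth grows, e.g. by widening deeper nets."""
    layers = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        linear = nn.Linear(n_in, n_out)
        init_he_symmetric(linear)
        layers += [linear, nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # no ReLU after the output layer

# Illustrative example: 20 hidden layers of width 256 keep
# sum(1/n_j) = 20/256 modest, so activation-length variance stays controlled.
net = make_relu_net([784] + [256] * 20 + [10])
```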
