Vanishing Curvature and the Power of Adaptive Methods in Randomly Initialized Deep Networks

This paper revisits the so-called vanishing gradient phenomenon, which commonly arises in deep, randomly initialized neural networks. Leveraging an in-depth analysis of neural chains, we first show that vanishing gradients cannot be circumvented when the network width grows sublinearly in the depth (i.e., slower than O(depth)), even under the popular Xavier and He initializations. Second, we extend the analysis to second-order derivatives and show that random i.i.d. initialization also gives rise to Hessian matrices whose eigenspectra vanish as networks grow in depth. Whenever this happens, optimizers start out on a very flat, saddle-point-like plateau that is particularly hard to escape with stochastic gradient descent (SGD), whose escape time is inversely related to curvature. We believe this observation is crucial for fully understanding (a) the historical difficulty of training deep networks with vanilla SGD, (b) the success of adaptive gradient methods, which naturally adapt to curvature and thus quickly escape flat plateaus, and (c) the effectiveness of modern architectural components such as residual connections and normalization layers.
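
To make the first claim concrete, below is a minimal empirical sketch (not taken from the paper): it builds plain He-initialized ReLU chains of fixed width and increasing depth in PyTorch and reports the median first-layer gradient norm over a handful of random initializations. The helper make_mlp and all hyperparameters (width 32, batch of 256, quadratic objective) are illustrative assumptions; in the width << depth regime the reported norms would be expected to collapse toward zero as depth grows.

```python
# Illustrative sketch (assumed setup, not the paper's exact experiment):
# measure how the typical first-layer gradient norm at He initialization
# behaves as a fixed-width ReLU chain gets deeper.
import torch
import torch.nn as nn

def make_mlp(depth, width, dim_in=32, dim_out=1):
    # Plain chain of `depth` bias-free linear layers with ReLU in between,
    # each weight matrix drawn from the He (Kaiming) normal initialization.
    dims = [dim_in] + [width] * (depth - 1) + [dim_out]
    layers = []
    for i in range(depth):
        lin = nn.Linear(dims[i], dims[i + 1], bias=False)
        nn.init.kaiming_normal_(lin.weight, nonlinearity='relu')  # He init
        layers.append(lin)
        if i < depth - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

torch.manual_seed(0)
x = torch.randn(256, 32)                      # fixed random input batch
for depth in (5, 20, 80, 320):
    norms = []
    for _ in range(20):                       # sample several random inits
        net = make_mlp(depth, width=32)
        loss = net(x).pow(2).mean()           # simple quadratic objective
        loss.backward()
        norms.append(net[0].weight.grad.norm().item())
    median = torch.tensor(norms).median().item()
    print(f"depth={depth:4d}  median ||dL/dW_1|| = {median:.3e}")
```

The median (rather than the mean) is reported because He initialization preserves gradient norms only in expectation; the typical realization, which is what an individual training run sees, shrinks once the depth far exceeds the width.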
