The Effect of Gradient Noise on the Energy Landscape of Deep Networks

We analyze the regularization properties of additive gradient noise in the training of deep networks by posing it as the problem of finding the ground state of the Hamiltonian of a spherical spin glass in an external magnetic field. We show that, depending on the magnitude of the magnetic field, the energy landscape of the Hamiltonian changes dramatically: from a highly non-convex landscape with exponentially many critical points, to a regime with only polynomially many critical points, and finally it "trivializes" to exactly one minimum. This phenomenon, known as topology trivialization in the physics literature, can be leveraged to devise annealing schemes for the additive noise such that training starts in the polynomial regime and the energy landscape gradually morphs back into the original one as training progresses. We demonstrate through experiments on fully-connected and convolutional neural networks that annealing schemes based on trivialization accelerate training and also improve generalization error.
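For context, the Hamiltonian referred to in the abstract is, up to normalization, the standard spherical p-spin model with an external field of strength h; this is the conventional form from the random-matrix complexity literature, and the paper's exact scaling may differ:

$$ H_{N,p}(\sigma) \;=\; -\frac{1}{N^{(p-1)/2}} \sum_{i_1,\dots,i_p=1}^{N} J_{i_1 \dots i_p}\, \sigma_{i_1} \cdots \sigma_{i_p} \;-\; h \sum_{i=1}^{N} \sigma_i, \qquad \|\sigma\|_2^2 = N, \quad J_{i_1 \dots i_p} \sim \mathcal{N}(0,1)\ \text{i.i.d.} $$

Here h plays the role of the noise magnitude: a large field trivializes the landscape to a single minimum, while a small field recovers the exponentially complex landscape.

The annealing scheme the abstract describes can be sketched as follows: additive Gaussian noise is injected into the gradient with a standard deviation that decays over training, so optimization begins on the simplified (trivialized or polynomial) landscape and gradually returns to the original one. This is a minimal illustrative sketch; the schedule `noise_std`, its constants, and the toy quadratic loss are assumptions for illustration, not the paper's actual choices.

```python
import numpy as np

def noise_std(t, eta0=0.3, gamma=0.55):
    # Illustrative annealing schedule (assumed, not from the paper): start with
    # large noise (simplified landscape) and decay it so the landscape gradually
    # reverts to the original, noiseless one as training progresses.
    return eta0 / (1.0 + t) ** gamma

def sgd_step_with_annealed_noise(w, grad, lr, t, rng):
    # Additive gradient noise: perturb the gradient with zero-mean Gaussian
    # noise whose scale shrinks with the iteration count t.
    noisy_grad = grad + rng.normal(0.0, noise_std(t), size=grad.shape)
    return w - lr * noisy_grad

# Usage on a toy quadratic loss L(w) = 0.5 * ||w||^2, whose gradient is w.
rng = np.random.default_rng(0)
w = rng.normal(size=10)
for t in range(1000):
    w = sgd_step_with_annealed_noise(w, grad=w, lr=0.1, t=t, rng=rng)
```

In practice the same idea applies to any stochastic-gradient update: the only change is the iteration-dependent noise scale added to the gradient before the parameter step.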
