Entropy-SGD: biasing gradient descent into wide valleys

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian, with very few positive or negative eigenvalues. We leverage this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in sharp valleys. Conceptually, the algorithm resembles two nested loops of SGD, where Langevin dynamics in the inner loop is used to estimate the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and, using uniform stability, obtain improved generalization bounds over SGD under certain assumptions. Experiments on convolutional and recurrent networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of both generalization error and training time.
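
The nested-loop structure described above can be summarized in a short sketch. Below is a minimal NumPy illustration of one Entropy-SGD update, roughly following the paper's notation (coupling strength gamma, L inner Langevin steps, step sizes eta and eta', noise scale eps, averaging weight alpha). The toy least-squares loss, the particular hyperparameter values, and the helper `minibatch_grad` are illustrative assumptions rather than the authors' implementation, and the sketch omits the gradual increase ("scoping") of gamma used in the paper.

```python
# Minimal sketch of one Entropy-SGD update on a toy least-squares problem.
# The toy loss and hyperparameter values are illustrative assumptions; they
# stand in for the stochastic gradients of a deep network's training loss.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: linear regression, so mini-batch gradients are cheap to compute.
X = rng.normal(size=(512, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=512)


def minibatch_grad(w, batch_size=32):
    """Stochastic gradient of the mean-squared-error loss on a random mini-batch."""
    idx = rng.integers(0, X.shape[0], size=batch_size)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size


def entropy_sgd_step(x, gamma=1.0, L=20, eta=0.1, eta_prime=0.1,
                     eps=1e-4, alpha=0.75):
    """One outer Entropy-SGD update.

    Inner loop: L steps of stochastic gradient Langevin dynamics on x',
    biased toward the current weights x by the coupling term
    gamma/2 * ||x - x'||^2, while tracking an exponential average mu of x'.
    Outer step: x <- x - eta * gamma * (x - mu), a step along the Monte Carlo
    estimate of the negative local-entropy gradient.
    """
    x_prime = x.copy()
    mu = x.copy()
    for _ in range(L):
        g = minibatch_grad(x_prime) + gamma * (x_prime - x)   # loss + coupling
        noise = np.sqrt(eta_prime) * eps * rng.normal(size=x.shape)
        x_prime = x_prime - eta_prime * g + noise              # Langevin step
        mu = (1 - alpha) * mu + alpha * x_prime                # running average
    return x - eta * gamma * (x - mu)                          # outer update


w = np.zeros(20)
for step in range(200):
    w = entropy_sgd_step(w)
print("final training loss:", np.mean((X @ w - y) ** 2))
```

The key design point is that the outer iterate moves toward the running mean mu of the inner Langevin samples, since gamma * (x - mu) approximates the gradient of the local entropy; the inner noise and coupling together bias the search toward wide, flat regions rather than sharp valleys.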
