Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

We study the properties of common loss surfaces through their Hessian matrix. In particular, in the context of deep learning, we empirically show that the spectrum of the Hessian is composed of two parts: (1) a bulk centered near zero and (2) outliers away from the bulk. We present numerical evidence and mathematical justifications for the following conjectures laid out by Sagun et al. (2016): fixing the data and increasing the number of parameters merely scales the bulk of the spectrum; fixing the dimension and changing the data (for instance, adding more clusters or making the data less separable) only affects the outliers. We believe that our observations have striking implications for non-convex optimization in high dimensions. First, the flatness of such landscapes (which can be measured by the singularity of the Hessian) implies that classical notions of basins of attraction may be quite misleading, and that the discussion of wide versus narrow basins may need a new perspective built around over-parametrization and redundancy, which can create large connected components at the bottom of the landscape. Second, the dependence of the small number of large eigenvalues on the data distribution can be linked to the spectrum of the covariance matrix of the gradients of the model outputs. With this in mind, we may reevaluate the connections within the data-architecture-algorithm framework of a model, in the hope of shedding light on the geometry of high-dimensional, non-convex spaces in modern applications. In particular, we present a case that links the two observations: small-batch and large-batch gradient descent appear to converge to different basins of attraction, but we show that they are in fact connected through their flat regions and therefore belong to the same basin.
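
As a rough illustration of the kind of experiment the abstract describes, the sketch below (not the authors' code; it assumes PyTorch and a hypothetical toy setup with two Gaussian clusters and a small ReLU classifier) computes the full loss Hessian of an over-parametrized model via double backpropagation and inspects its eigenvalues, where one would expect a bulk concentrated near zero plus a handful of larger, data-dependent outliers.

```python
# Minimal sketch: full Hessian eigenspectrum of a small over-parametrized model.
# Assumptions: PyTorch is available; data, architecture, and sizes are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data: two Gaussian clusters in 2-D (the "data" knob from the abstract).
n_per_cluster = 100
x = torch.cat([torch.randn(n_per_cluster, 2) + 2.0,
               torch.randn(n_per_cluster, 2) - 2.0])
y = torch.cat([torch.zeros(n_per_cluster, dtype=torch.long),
               torch.ones(n_per_cluster, dtype=torch.long)])

# Small over-parametrized model (the "number of parameters" knob from the abstract).
model = nn.Sequential(nn.Linear(2, 30), nn.ReLU(), nn.Linear(30, 2))
loss_fn = nn.CrossEntropyLoss()

params = [p for p in model.parameters() if p.requires_grad]
n_params = sum(p.numel() for p in params)

# Full-batch loss and first-order gradient, kept in the graph for double backprop.
loss = loss_fn(model(x), y)
grads = torch.autograd.grad(loss, params, create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])

# Build the Hessian row by row: row i is d(grad_i)/d(theta).
hessian = torch.zeros(n_params, n_params)
for i in range(n_params):
    row = torch.autograd.grad(flat_grad[i], params,
                              retain_graph=True, allow_unused=True)
    hessian[i] = torch.cat([
        (r if r is not None else torch.zeros_like(p)).reshape(-1)
        for r, p in zip(row, params)
    ])

# Eigenvalues: a bulk near zero and a few outliers away from it.
eigvals = torch.linalg.eigvalsh(hessian)
print("smallest eigenvalues:", eigvals[:5].tolist())
print("largest eigenvalues: ", eigvals[-5:].tolist())
```

For realistically sized networks the full Hessian is far too large to materialize; in that regime one would instead rely on Hessian-vector products in the style of Pearlmutter [20] combined with iterative eigensolvers to probe the extreme eigenvalues.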

[1] Naftali Tishby, et al. Opening the Black Box of Deep Neural Networks via Information, 2017, arXiv.

[2] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, arXiv.

[3] H. Yau, et al. On the principal components of sample covariance matrices, 2014, arXiv:1404.0788.

[4] Panos M. Pardalos, et al. Optimization methods for computing global minima of nonconvex potential energy functions, 1994, J. Glob. Optim.

[5] Yann LeCun, et al. Explorations on high dimensional landscapes, 2014, ICLR.

[6] Yoshua Bengio, et al. Three Factors Influencing Minima in SGD, 2017, arXiv.

[7] Hossein Mobahi, et al. Training Recurrent Neural Networks by Diffusion, 2016, arXiv.

[8] Joan Bruna, et al. Topology and Geometry of Half-Rectified Network Optimization, 2016, ICLR.

[9] Hossein Mobahi, et al. On the Link between Gaussian Homotopy Continuation and Convex Envelopes, 2015, EMMCVPR.

[10] L. Bottou. Stochastic Gradient Learning in Neural Networks, 1991.

[11] Hossein Mobahi, et al. A Theoretical Analysis of Optimization by Gaussian Continuation, 2015, AAAI.

[12] Joan Bruna, et al. Topology and Geometry of Deep Rectified Network Optimization Landscapes, 2016.

[13] Klaus-Robert Müller, et al. Efficient BackProp, 2012, Neural Networks: Tricks of the Trade.

[14] Christian Borgs, et al. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes, 2016, Proceedings of the National Academy of Sciences.

[15] James Martens, et al. New Insights and Perspectives on the Natural Gradient Method, 2014, J. Mach. Learn. Res.

[16] Yoshua Bengio, et al. Learning Deep Architectures for AI, 2007, Found. Trends Mach. Learn.

[17] Michael I. Jordan, et al. Gradient Descent Converges to Minimizers, 2016, arXiv.

[18] Georgios Piliouras, et al. Gradient Descent Only Converges to Minimizers: Non-Isolated Critical Points and Invariant Regions, 2016, ITCS.

[19] Yann LeCun, et al. Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond, 2016, arXiv:1611.07476.

[20] Barak A. Pearlmutter. Fast Exact Multiplication by the Hessian, 1994, Neural Computation.

[21] Shai Shalev-Shwartz, et al. On Graduated Optimization for Stochastic Non-Convex Problems, 2015, ICML.

[22] Léon Bottou. Large-Scale Machine Learning with Stochastic Gradient Descent, 2010, COMPSTAT.

[23] V. Marčenko, et al. Distribution of Eigenvalues for Some Sets of Random Matrices, 1967.

[24] S. Péché, et al. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices, 2004, arXiv:math/0403022.

[25] Antonio Auffinger, et al. Random Matrices and Complexity of Spin Glasses, 2010, arXiv:1003.1129.

[26] James Martens. Deep learning via Hessian-free optimization, 2010, ICML.

[27] A. Montanari, et al. The landscape of empirical risk for nonconvex losses, 2016, The Annals of Statistics.

[28] Jorge Nocedal, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.

[29] H. Scheraga, et al. On the multiple-minima problem in the conformational analysis of molecules: deformation of the potential energy hypersurface by the diffusion equation method, 1989.

[30] Tom Schaul, et al. No more pesky learning rates, 2012, ICML.

[31] Razvan Pascanu, et al. Sharp Minima Can Generalize For Deep Nets, 2017, ICML.

[32] Stefano Soatto, et al. Entropy-SGD: biasing gradient descent into wide valleys, 2016, ICLR.

[33] Yann LeCun, et al. Singularity of the Hessian in Deep Learning, 2016, arXiv.

[34] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.