An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

To understand the dynamics of optimization in deep neural networks, we develop a tool to study the evolution of the entire Hessian spectrum throughout the optimization process. Using this tool, we examine several hypotheses from the deep learning literature concerning smoothness, curvature, and sharpness. We then thoroughly analyze a crucial structural feature of the spectra: in non-batch-normalized networks, we observe the rapid appearance of large isolated eigenvalues, along with a surprising concentration of the gradient in the corresponding eigenspaces; in batch-normalized networks, these two effects are almost entirely absent. We characterize these effects and explain, through both theory and experiments, how they affect optimization speed. As part of this work, we adapt advanced tools from numerical linear algebra to allow scalable and accurate estimation of the entire Hessian spectrum of ImageNet-scale neural networks; this technique may be of independent interest in other applications.
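The spectrum estimation referred to above is based on stochastic Lanczos quadrature, in which the Hessian is accessed only through Hessian-vector products (for a neural network, typically via Pearlmutter's trick inside an autodiff framework). The sketch below is a minimal, illustrative NumPy implementation of that general technique, not the authors' code: a small dense symmetric matrix with one planted outlier eigenvalue stands in for the Hessian so the example runs on its own, and names such as slq_density, num_probes, lanczos_steps, and sigma are assumptions made for this example.

import numpy as np


def lanczos(matvec, dim, m, rng):
    """Run m steps of Lanczos with full reorthogonalization.

    Returns the m x m symmetric tridiagonal matrix T whose eigenvalues
    (Ritz values) and squared first eigenvector components define a Gauss
    quadrature rule associated with the random starting vector.
    """
    Q = np.zeros((dim, m))
    alpha = np.zeros(m)
    beta = np.zeros(m)

    q = rng.standard_normal(dim)
    q /= np.linalg.norm(q)
    Q[:, 0] = q

    for j in range(m):
        w = matvec(Q[:, j])
        alpha[j] = Q[:, j] @ w
        # Full reorthogonalization against all previous Lanczos vectors;
        # costly, but needed for accurate Ritz values and weights.
        w -= Q[:, : j + 1] @ (Q[:, : j + 1].T @ w)
        w -= Q[:, : j + 1] @ (Q[:, : j + 1].T @ w)  # second pass for stability
        if j + 1 < m:
            beta[j] = np.linalg.norm(w)
            if beta[j] < 1e-10:
                break
            Q[:, j + 1] = w / beta[j]

    return np.diag(alpha) + np.diag(beta[: m - 1], 1) + np.diag(beta[: m - 1], -1)


def slq_density(matvec, dim, grid, num_probes=10, lanczos_steps=90, sigma=0.05, seed=0):
    """Estimate the spectral density on `grid` by averaging Gaussian-smoothed
    quadrature rules obtained from several random starting vectors."""
    rng = np.random.default_rng(seed)
    density = np.zeros_like(grid)
    for _ in range(num_probes):
        T = lanczos(matvec, dim, lanczos_steps, rng)
        nodes, vecs = np.linalg.eigh(T)
        weights = vecs[0, :] ** 2  # quadrature weights: squared first components
        for node, w in zip(nodes, weights):
            density += w * np.exp(-((grid - node) ** 2) / (2 * sigma ** 2))
    density /= num_probes * np.sqrt(2 * np.pi) * sigma
    return density


if __name__ == "__main__":
    # Stand-in "Hessian": a random symmetric matrix plus one planted outlier
    # eigenvalue, mimicking the isolated eigenvalues discussed in the paper.
    rng = np.random.default_rng(1)
    n = 500
    A = rng.standard_normal((n, n)) / np.sqrt(n)
    H = (A + A.T) / 2
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    H += 5.0 * np.outer(u, u)

    grid = np.linspace(-3.0, 7.0, 400)
    est = slq_density(lambda v: H @ v, n, grid)
    print("estimated density mass:", est.sum() * (grid[1] - grid[0]))  # close to 1

In a real experiment, the lambda above would be replaced by a Hessian-vector product routine for the training loss, and the smoothing width and number of probes trade off resolution against compute.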
