Geometry of Neural Network Loss Surfaces via Random Matrix Theory

Understanding the geometry of neural network loss surfaces is important for developing improved optimization algorithms and for building a theoretical understanding of why deep learning works. In this paper, we study this geometry through the distribution of eigenvalues of the Hessian matrix at critical points of varying energy. We introduce an analytical framework and a set of tools from random matrix theory that allow us to compute an approximation of this distribution under a set of simplifying assumptions. The shape of the spectrum depends strongly on the energy and on another key parameter, ϕ, which measures the ratio of parameters to data points. Our analysis predicts, and numerical simulations support, that for critical points of small index, the number of negative eigenvalues scales like the 3/2 power of the energy. We leave as an open problem an explanation for our observation that, in the context of a certain memorization task, the energy of minimizers is well approximated by the function 1/2 (1 − ϕ)².
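For concreteness, the two relations stated above can be written out as follows. This is a hedged restatement only: the symbols ε (energy at a critical point), α (index, i.e. fraction of negative Hessian eigenvalues), and ε_min (energy of minimizers), as well as the unspecified proportionality constant, are notational choices made here and are not taken from the abstract.

% Hedged restatement of the scaling relations described in the abstract.
% \epsilon: energy (loss) at a critical point; \phi: ratio of parameters
% to data points; \alpha: index (fraction of negative Hessian eigenvalues).
% The proportionality constant in the first relation is left unspecified.
\begin{align}
  \alpha(\epsilon) &\propto \epsilon^{3/2}
    && \text{(critical points of small index)} \\
  \epsilon_{\min}(\phi) &\approx \tfrac{1}{2}\,(1-\phi)^{2}
    && \text{(observed energy of minimizers in the memorization task)}
\end{align}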
