Appearance of random matrix theory in deep learning

We investigate the local spectral statistics of the loss-surface Hessians of artificial neural networks, and find agreement with Gaussian Orthogonal Ensemble statistics across several network architectures and datasets. These results shed new light on the applicability of Random Matrix Theory to modelling neural networks and suggest a role for it in the study of loss surfaces in deep learning.
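As a concrete illustration of the kind of comparison described above, the sketch below (not taken from the paper; the matrix size, random seed and ensemble normalisation are arbitrary choices) samples a GOE matrix and compares its consecutive eigenvalue spacing ratios with the standard GOE surmise for that statistic. In the setting of the abstract, the same statistic would be computed from Hessian eigenvalues rather than from a sampled GOE matrix.

```python
# A minimal sketch, assuming NumPy only: compare consecutive-spacing ratios of
# a sampled GOE matrix with the Wigner-like surmise for the GOE ratio
# distribution. The ratio statistic is insensitive to the local mean eigenvalue
# density, so no unfolding of the spectrum is required.
import numpy as np

def goe_matrix(n, rng):
    """Sample an n x n real symmetric matrix from the Gaussian Orthogonal Ensemble."""
    a = rng.standard_normal((n, n))
    return (a + a.T) / np.sqrt(2.0)

def restricted_spacing_ratios(eigs):
    """r~_i = min(s_i, s_{i+1}) / max(s_i, s_{i+1}) for consecutive spacings s_i."""
    s = np.diff(np.sort(eigs))
    r = s[1:] / s[:-1]
    return np.minimum(r, 1.0 / r)

def goe_ratio_surmise(r):
    """Surmise density of the restricted ratio r~ on [0, 1] for GOE (beta = 1)."""
    return 2.0 * (27.0 / 8.0) * (r + r**2) / (1.0 + r + r**2) ** 2.5

rng = np.random.default_rng(0)
eigs = np.linalg.eigvalsh(goe_matrix(2000, rng))
ratios = restricted_spacing_ratios(eigs)

# The mean restricted ratio is ~0.53 for GOE and 2*ln(2) - 1 ~ 0.39 for a
# Poisson (uncorrelated) spectrum, so it separates the two hypotheses cleanly.
print(f"empirical <r~> = {ratios.mean():.3f}")
print(f"GOE reference ~ 0.53, Poisson reference ~ {2 * np.log(2) - 1:.3f}")

# Crude histogram check against the surmise density.
hist, edges = np.histogram(ratios, bins=20, range=(0.0, 1.0), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
print("max |empirical - surmise| over bins:",
      float(np.max(np.abs(hist - goe_ratio_surmise(centres)))))
```

The restricted ratio is used rather than the raw spacing distribution because it avoids the unfolding step entirely, which is one reason this family of local statistics is convenient for empirical Hessian spectra.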
