Analytic Study of Families of Spurious Minima in Two-Layer ReLU Neural Networks

We study the optimization problem associated with fitting two-layer ReLU neural networks with respect to the squared loss, where labels are generated by a target network. We make use of the rich symmetry structure to develop a novel set of tools for studying families of spurious minima. In contrast to existing approaches which operate in limiting regimes, our technique directly addresses the nonconvex loss landscape for a finite number of inputs d and neurons k, and provides analytic, rather than heuristic, information. In particular, we derive analytic estimates for the loss at different minima, and prove that, modulo O(d^{-1/2}) terms, the Hessian spectrum concentrates near small positive constants, with the exception of Θ(d) eigenvalues which grow linearly with d. We further show that the Hessian spectra at global and spurious minima coincide to O(d^{-1/2}) order, thus challenging our ability to argue about statistical generalization through local curvature. Lastly, our technique provides the exact fractional dimensionality at which families of critical points turn from saddles into spurious minima. This makes it possible to study the creation and annihilation of spurious minima using powerful tools from equivariant bifurcation theory.

One of the outstanding conundrums of deep learning concerns the ability of simple gradient-based methods to successfully train neural networks despite the nonconvexity of the associated optimization problems. Indeed, generic nonconvex optimization landscapes can exhibit wide and flat basins of attraction around poor local minima, which may lead to a complete failure of such methods. The way in which the nonconvex problems associated with neural networks deviate from generic ones is currently not well understood. In particular, much of the dynamics of gradient-based methods follows from the curvature of the loss landscape around local minima. It is therefore vital to study the local geometry of spurious (i.e., non-global local) and global minima in order to understand the mechanism which drives gradient-based methods towards minima of high quality. However, establishing the very existence of spurious minima already seems to be beyond the reach of existing analytic tools, let alone rigorously arguing about their height, curvature, and structure, which is the aim of this work. In this paper, we focus on two-layer ReLU neural networks of the form

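Since the displayed expression for the network is not reproduced here, the following is a minimal sketch of the kind of objective studied in this line of work: a student two-layer ReLU network fit to a teacher (target) network of the same form under the squared loss, with standard Gaussian inputs. The unit second-layer weights, the function name student_teacher_sq_loss, and the Monte Carlo estimation of the population loss are illustrative assumptions, not the authors' exact parameterization.

```python
import numpy as np

def student_teacher_sq_loss(W, V, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the population squared loss
    (1/2) E_x[(sum_i relu(w_i . x) - sum_j relu(v_j . x))^2]
    for standard Gaussian inputs x ~ N(0, I_d).

    W : (k, d) student weights; V : (k_teacher, d) teacher (target) weights.
    """
    d = W.shape[1]
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, d))          # inputs x ~ N(0, I_d)
    student = np.maximum(X @ W.T, 0.0).sum(axis=1)   # sum_i relu(w_i . x)
    teacher = np.maximum(X @ V.T, 0.0).sum(axis=1)   # sum_j relu(v_j . x)
    return 0.5 * np.mean((student - teacher) ** 2)

# Example: teacher with orthonormal rows, random student initialization.
d, k = 20, 5
V = np.eye(k, d)                                     # target network weights
W = np.random.default_rng(1).standard_normal((k, d)) / np.sqrt(d)
print(student_teacher_sq_loss(W, V))                 # loss at a random point
print(student_teacher_sq_loss(V, V))                 # 0 at the global minimum W = V
```

For Gaussian inputs the expectation above is available in closed form (via arc-cosine kernel identities), so the Monte Carlo estimate here is only meant to make the setting concrete, not to reproduce the paper's analytic computations.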