Pure and Spurious Critical Points: a Geometric Study of Linear Networks

The critical locus of the loss function of a neural network is determined by the geometry of the functional space and by the parameterization of this space by the network's weights. We introduce a natural distinction between pure critical points, which depend only on the functional space, and spurious critical points, which arise from the parameterization. We apply this perspective to revisit and extend the literature on the loss function of linear neural networks. For this type of network, the functional space is either the set of all linear maps from input to output space or a determinantal variety, i.e., a set of linear maps with bounded rank. We use geometric properties of determinantal varieties to derive new results on the landscape of linear networks with different loss functions and different parameterizations. Our analysis clearly illustrates that the absence of "bad" local minima in the loss landscape of linear networks is due to two distinct phenomena that apply in different settings: it holds for arbitrary smooth convex losses in the case of architectures that can express all linear maps ("filling architectures"), but only for the quadratic loss when the functional space is a determinantal variety ("non-filling architectures"). Without any assumption on the architecture, smooth convex losses may lead to landscapes with many bad local minima.
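
To make the filling/non-filling distinction concrete, here is a minimal sketch of the setting; the notation $d_0, \dots, d_h$ and $\mu$ is introduced here for illustration rather than taken from the paper. A linear network with layer widths $d_0, d_1, \dots, d_h$ and weight matrices $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ parameterizes linear maps through the matrix multiplication map
$$
\mu(W_1, \dots, W_h) \;=\; W_h W_{h-1} \cdots W_1 \;\in\; \mathbb{R}^{d_h \times d_0},
$$
whose image is the set of matrices of rank at most $r = \min(d_0, d_1, \dots, d_h)$. If $r = \min(d_0, d_h)$, the image is all of $\mathbb{R}^{d_h \times d_0}$ (the filling case); if $r < \min(d_0, d_h)$, the image is the determinantal variety of matrices of rank at most $r$ (the non-filling case), and critical points of the loss restricted to this variety (pure critical points) must be distinguished from critical points created by the parameterization $\mu$ itself (spurious critical points).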
