Over-Parameterized Deep Neural Networks Have No Strict Local Minima For Any Continuous Activations

In this paper, we study the loss surface of over-parameterized fully connected deep neural networks. We prove that for any continuous activation function, the loss function has no bad strict local minimum, both in the regular sense and in the sense of sets. This result holds for any convex and differentiable loss function, and it requires only that the data samples be distinct in at least one dimension. Furthermore, we show that bad local minima do exist for a class of activation functions, so without further assumptions it is impossible to prove that every local minimum is a global minimum.
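
To make the statement concrete, here is a minimal sketch of the setting in illustrative notation of our own (the symbols $W_h$, $\sigma$, $\ell$, and the rough over-parameterization condition below are expository assumptions, not quoted from the paper):

% Illustrative notation: H hidden layers with weights W_1, ..., W_{H+1},
% a continuous activation \sigma applied entrywise, and a convex,
% differentiable per-sample loss \ell on n samples (x_i, y_i) that are
% distinct in at least one input dimension.
\[
  f_\theta(x) \;=\; W_{H+1}\,\sigma\bigl(W_H \cdots \sigma(W_1 x)\cdots\bigr),
  \qquad
  L(\theta) \;=\; \sum_{i=1}^{n} \ell\bigl(f_\theta(x_i),\, y_i\bigr).
\]
\[
  \bar\theta \text{ is a \emph{strict} local minimum if }
  L(\theta) > L(\bar\theta) \text{ for all } \theta \neq \bar\theta
  \text{ in some neighborhood of } \bar\theta,
\]
\[
  \text{and it is \emph{bad} if } L(\bar\theta) > \inf_{\theta} L(\theta).
\]

Under the paper's over-parameterization condition (roughly, a hidden layer whose width scales with the number of samples $n$; the precise requirement is stated in the paper), the main result says that no bad strict local minimum of $L$ exists.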
