Sub-Optimal Local Minima Exist for Neural Networks with Almost All Non-Linear Activations

Does over-parameterization eliminate sub-optimal local minima for neural networks? An affirmative answer was given by a classical result in [59] for wide 1-hidden-layer neural networks. A few recent works have extended the setting to multi-layer networks, but none of them has proved that every local minimum is global. Why has this result never been extended to deep networks? In this paper, we show that the task is impossible because the original result for 1-hidden-layer networks in [59] cannot hold. More specifically, we prove that for any multi-layer network with generic input data and non-linear activation functions, sub-optimal local minima can exist, no matter how wide the network is (as long as the last hidden layer has at least two neurons). While the result of [59] assumes sigmoid activation, our counter-example covers a large set of activation functions (dense in the set of continuous functions), indicating that the limitation is not due to the specific activation. Our result suggests that "no bad local minima" may be unable to explain the benefit of over-parameterization for training neural networks.
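To make the claim concrete, here is a minimal formal sketch in notation of our own choosing (the symbols f, L, ℓ, and θ* below are illustrative and are not the paper's notation); it restates the abstract's statement rather than proving it. Consider a multi-layer network f(x; θ) with a non-linear activation σ whose last hidden layer has at least two neurons, trained on generic data (x_1, y_1), ..., (x_n, y_n) with empirical loss

\[
  L(\theta) \;=\; \sum_{i=1}^{n} \ell\bigl(f(x_i;\theta),\, y_i\bigr).
\]

The claim, paraphrased, is that for a set of activations σ dense in the continuous functions there exists a parameter θ* satisfying

\[
  L(\theta) \;\ge\; L(\theta^{*}) \ \ \text{for all } \theta \text{ in a neighborhood of } \theta^{*},
  \qquad \text{yet} \qquad
  L(\theta^{*}) \;>\; \inf_{\theta} L(\theta),
\]

i.e., θ* is a sub-optimal local minimum, regardless of how wide the hidden layers are.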

[1] R. Srikant, et al. Understanding the Loss Surface of Neural Networks for Binary Classification, 2018, ICML.

[2] Oriol Vinyals, et al. Qualitatively characterizing neural network optimization problems, 2014, ICLR.

[3] Arthur Jacot, et al. Neural tangent kernel: convergence and generalization in neural networks (invited paper), 2018, NeurIPS.

[4] Andrea Montanari, et al. On the Connection Between Learning Two-Layers Neural Networks and Tensor Decomposition, 2018, AISTATS.

[5] Zhi-Quan Luo, et al. Guaranteed Matrix Completion via Non-Convex Factorization, 2014, IEEE Transactions on Information Theory.

[6] Yuanzhi Li, et al. Convergence Analysis of Two-layer Neural Networks with ReLU Activation, 2017, NIPS.

[7] Li Zhang, et al. Depth creates no more spurious local minima, 2019, ArXiv.

[8] Matus Telgarsky, et al. Benefits of Depth in Neural Networks, 2016, COLT.

[9] Elad Hoffer, et al. Exponentially vanishing sub-optimal local minima in multilayer neural networks, 2017, ICLR.

[10] R. Srikant, et al. Why Deep Neural Networks for Function Approximation?, 2016, ICLR.

[11] Justin A. Sirignano, et al. Mean Field Analysis of Neural Networks: A Law of Large Numbers, 2018, SIAM J. Appl. Math.

[12] Javad Lavaei, et al. How Much Restricted Isometry is Needed In Nonconvex Matrix Recovery?, 2018, NeurIPS.

[13] B. Mityagin. The Zero Set of a Real Analytic Function, 2015, Mathematical Notes.

[14] Peter Auer, et al. Exponentially many local minima for single neurons, 1995, NIPS.

[15] Suvrit Sra, et al. Small nonlinearities in activation functions create bad local minima in neural networks, 2018, ICLR.

[16] Dawei Li, et al. Over-Parameterized Deep Neural Networks Have No Strict Local Minima For Any Continuous Activations, 2018, ArXiv.

[17] Tengyu Ma, et al. Matrix Completion has No Spurious Local Minimum, 2016, NIPS.

[18] Pramod Viswanath, et al. Learning One-hidden-layer Neural Networks under General Input Distributions, 2018, AISTATS.

[19] Francis Bach, et al. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport, 2018, NeurIPS.

[20] R. Srikant, et al. Adding One Neuron Can Eliminate All Bad Local Minima, 2018, NeurIPS.

[21] René Vidal, et al. Global Optimality in Neural Network Training, 2017, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Amir Globerson, et al. Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs, 2017, ICML.

[23] Ohad Shamir, et al. Depth-Width Tradeoffs in Approximating Natural Functions with Neural Networks, 2016, ICML.

[24] Ruoyu Sun, et al. Optimization for deep learning: theory and algorithms, 2019, ArXiv.

[25] Shai Shalev-Shwartz, et al. SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data, 2017, ICLR.

[26] Andrew Gordon Wilson, et al. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs, 2018, NeurIPS.

[27] Yuanzhi Li, et al. A Convergence Theory for Deep Learning via Over-Parameterization, 2018, ICML.

[28] Razvan Pascanu, et al. Local minima in training of deep networks, 2017, ArXiv.

[29] Adel Javanmard, et al. Theoretical Insights Into the Optimization Landscape of Over-Parameterized Shallow Neural Networks, 2017, IEEE Transactions on Information Theory.

[30] Nathan Srebro, et al. Global Optimality of Local Search for Low Rank Matrix Recovery, 2016, NIPS.

[31] Mahdi Soltanolkotabi, et al. Learning ReLUs via Gradient Descent, 2017, NIPS.

[32] Tengyu Ma, et al. Learning One-hidden-layer Neural Networks with Landscape Design, 2017, ICLR.

[33] Thomas Laurent, et al. The Multilinear Structure of ReLU Networks, 2017, ICML.

[34] Dacheng Tao, et al. Piecewise linear activations substantially shape the loss surfaces of neural networks, 2020, ICLR.

[35] Roi Livni, et al. On the Computational Efficiency of Training Neural Networks, 2014, NIPS.

[36] Levent Sagun, et al. The jamming transition as a paradigm to understand the loss landscape of deep neural networks, 2018, Physical Review E.

[37] Thomas Laurent, et al. Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global, 2017, ICML.

[38] Matthias Hein, et al. On the loss landscape of a class of deep neural networks with no bad local valleys, 2018, ICLR.

[39] Haihao Lu, et al. Depth Creates No Bad Local Minima, 2017, ArXiv.

[40] Yi Zhou, et al. Critical Points of Neural Networks: Analytical Forms and Landscape Properties, 2017, ArXiv.

[41] Joan Bruna, et al. Topology and Geometry of Half-Rectified Network Optimization, 2016, ICLR.

[42] Jason D. Lee, et al. On the Power of Over-parametrization in Neural Networks with Quadratic Activation, 2018, ICML.

[43] Fred A. Hamprecht, et al. Essentially No Barriers in Neural Network Energy Landscape, 2018, ICML.

[44] J. Zico Kolter, et al. Uniform convergence may be unable to explain generalization in deep learning, 2019, NeurIPS.

[45] Kenji Kawaguchi, et al. Deep Learning without Poor Local Minima, 2016, NIPS.

[46] X. H. Yu, et al. On the local minima free condition of backpropagation learning, 1995, IEEE Trans. Neural Networks.

[47] Grant M. Rotskoff, et al. Neural Networks as Interacting Particle Systems: Asymptotic Convexity of the Loss Landscape and Universal Scaling of the Approximation Error, 2018, ArXiv.

[48] Hao Li, et al. Visualizing the Loss Landscape of Neural Nets, 2017, NeurIPS.

[49] Joan Bruna, et al. Spurious Valleys in Two-layer Neural Network Optimization Landscapes, 2018, ArXiv (1802.06384).

[50] Liwei Wang, et al. Gradient Descent Finds Global Minima of Deep Neural Networks, 2018, ICML.

[51] David Lopez-Paz, et al. Easing non-convex optimization with neural networks, 2018, ICLR.

[52] Gang Wang, et al. Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization, 2018, IEEE Transactions on Signal Processing.

[53] Inderjit S. Dhillon, et al. Recovery Guarantees for One-hidden-layer Neural Networks, 2017, ICML.

[54] Marco Gori, et al. Optimal learning in artificial neural networks: A review of theoretical results, 1996, Neurocomputing.

[55] Quynh Nguyen, et al. On Connected Sublevel Sets in Deep Learning, 2019, ICML.

[56] Eduardo D. Sontag, et al. Backpropagation Can Give Rise to Spurious Local Minima Even for Networks without Hidden Layers, 1989, Complex Syst.

[57] Yuan Cao, et al. Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks, 2018, ArXiv.

[58] David Tse, et al. Porcupine Neural Networks: (Almost) All Local Optima are Global, 2017, ArXiv.

[59] Rina Panigrahy, et al. Convergence Results for Neural Networks via Electrodynamics, 2017, ITCS.

[60] Yuandong Tian, et al. An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis, 2017, ICML.

[61] Andrea Montanari, et al. A mean field view of the landscape of two-layer neural networks, 2018, Proceedings of the National Academy of Sciences.

[62] Samet Oymak, et al. Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks, 2019, IEEE Journal on Selected Areas in Information Theory.