Adding One Neuron Can Eliminate All Bad Local Minima

One of the main difficulties in analyzing neural networks is the non-convexity of the loss function, which may have many bad local minima. In this paper, we study the loss landscape of neural networks for binary classification tasks. Under mild assumptions, we prove that after adding one special neuron with a skip connection to the output, or one special neuron per layer, every local minimum is a global minimum.
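
To make the augmentation concrete, the sketch below shows one way the construction can be instantiated: the network output is augmented with a single exponential unit a*exp(w^T x + b) fed directly from the input via a skip connection, and the training loss adds an L2 penalty on the unit's output weight a, so that the unit is inactive (a = 0) at any local minimum. The exponential activation, the choice of logistic loss, and all names (AugmentedClassifier, lam, augmented_loss) are illustrative assumptions, not code released with the paper.

```python
# Hedged sketch of the "one extra neuron" construction described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AugmentedClassifier(nn.Module):
    """Original classifier f(x; theta) plus one exponential skip-connection neuron."""

    def __init__(self, base_net: nn.Module, input_dim: int, lam: float = 0.1):
        super().__init__()
        self.base_net = base_net                    # original network f(x; theta)
        self.a = nn.Parameter(torch.zeros(1))       # output weight of the extra neuron
        self.w = nn.Parameter(torch.zeros(input_dim))
        self.b = nn.Parameter(torch.zeros(1))
        self.lam = lam                              # regularization strength lambda

    def forward(self, x):
        # Augmented output: f~(x) = f(x; theta) + a * exp(w^T x + b)
        skip = self.a * torch.exp(x @ self.w + self.b)
        return self.base_net(x).squeeze(-1) + skip

    def regularizer(self):
        # (lambda / 2) * a^2 penalizes the auxiliary neuron; with a = 0 the
        # augmented network reduces exactly to the original one.
        return 0.5 * self.lam * self.a.pow(2).sum()


def augmented_loss(model, x, y):
    # y in {-1, +1}; a convex margin loss (logistic loss shown) plus the penalty.
    margin = y * model(x)
    return F.softplus(-margin).mean() + model.regularizer()
```

In this sketch the extra neuron starts inactive (a = 0), so training begins from the original network's loss surface; the claim in the abstract concerns the landscape of the augmented objective, namely that its local minima are all global.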
