Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima

We consider the problem of learning a one-hidden-layer neural network with a non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j\sigma(\mathbf{w}^T\mathbf{Z}_j)$, in which both the convolutional weights $\mathbf{w}$ and the output weights $\mathbf{a}$ are parameters to be learned. When the labels are the outputs of a teacher network with the same architecture and fixed weights $(\mathbf{w}^*, \mathbf{a}^*)$, we prove that with Gaussian input $\mathbf{Z}$ the loss has a spurious local minimizer. Surprisingly, despite this spurious local minimizer, gradient descent with weight normalization from randomly initialized weights provably recovers the true parameters with constant probability, which can be boosted to probability $1$ with multiple restarts. We also show that, with constant probability, the same procedure converges to the spurious local minimum instead, so this local minimum plays a non-trivial role in the dynamics of gradient descent. Furthermore, a quantitative analysis shows that the gradient descent dynamics have two phases: convergence starts off slow, but becomes much faster after several iterations.
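
To make the setting concrete, below is a minimal NumPy sketch of the teacher-student setup and of gradient descent with weight normalization on the empirical squared loss. The patch count `k`, filter dimension `p`, sample size, learning rate, and iteration budget are illustrative assumptions rather than the paper's protocol, and the empirical loss stands in for the population loss analyzed in the theory. Depending on the random seed, a run may recover the teacher's parameters (up to the model's positive scaling invariance) or end up near the spurious local minimizer.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(Z, w, a):
    """f(Z, w, a) = sum_j a_j * relu(w^T Z_j); Z holds k non-overlapping patches per sample."""
    return relu(np.einsum('nkp,p->nk', Z, w)) @ a

def run_gd(k=8, p=10, n=10000, steps=3000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Teacher network with fixed weights (w*, a*); w* is normalized for convenience only.
    w_star = rng.normal(size=p)
    w_star /= np.linalg.norm(w_star)
    a_star = rng.normal(size=k)
    # Gaussian input, viewed as k non-overlapping patches of dimension p, with teacher labels.
    Z = rng.normal(size=(n, k, p))
    y = forward(Z, w_star, a_star)

    # Student parameters with weight normalization: w = g * v / ||v||, random initialization.
    v = rng.normal(size=p)
    g = 1.0
    a = rng.normal(size=k)

    for t in range(steps):
        norm_v = np.linalg.norm(v)
        w_dir = v / norm_v
        w = g * w_dir
        pre = np.einsum('nkp,p->nk', Z, w)   # w^T Z_j for every sample and patch
        h = relu(pre)
        r = h @ a - y                        # residuals
        # Gradients of L = (1/2n) * sum_i (f(Z_i, w, a) - y_i)^2.
        grad_a = h.T @ r / n
        grad_w = np.einsum('n,nk,nkp->p', r, (pre > 0) * a, Z) / n
        # Chain rule through the weight-normalization reparameterization w = g * v / ||v||.
        grad_g = w_dir @ grad_w
        grad_v = (g / norm_v) * (grad_w - (w_dir @ grad_w) * w_dir)
        v -= lr * grad_v
        g -= lr * grad_g
        a -= lr * grad_a

    # The model is invariant to w -> c*w, a -> a/c for c > 0, so compare up to that scaling.
    w = g * v / np.linalg.norm(v)
    cos = w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star))
    c = np.linalg.norm(w_star) / np.linalg.norm(w)
    print(f"seed {seed}: cos(w, w*) = {cos:.4f}, ||a/c - a*|| = {np.linalg.norm(a / c - a_star):.4f}")

if __name__ == "__main__":
    # Multiple restarts: some seeds recover the teacher, others may approach the spurious minimizer.
    for s in range(5):
        run_gd(seed=s)
```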
