The Global Optimization Geometry of Shallow Linear Neural Networks

We examine the squared-error loss landscape of shallow linear neural networks. We show, under significantly milder assumptions than in previous work, that the corresponding optimization problems have benign geometric properties: there are no spurious local minima, and the Hessian at every saddle point has at least one negative eigenvalue. This means that at every saddle point there is a direction of negative curvature that algorithms can exploit to further decrease the objective value. These geometric properties imply that many local search algorithms (such as gradient descent, which is widely used for training neural networks) can provably solve the training problem with global convergence.
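As a concrete reference point, a minimal formalization of the kind of training problem described above is the two-layer linear model with data matrix X and targets Y (the notation, dimensions, and any assumptions on the data here are illustrative; the paper's exact conditions are not reproduced in this excerpt):

$$
\min_{W_1 \in \mathbb{R}^{r \times d},\; W_2 \in \mathbb{R}^{m \times r}} \; f(W_1, W_2) \;=\; \tfrac{1}{2}\,\bigl\lVert\, Y - W_2 W_1 X \,\bigr\rVert_F^2 .
$$

Read in this notation, the stated geometric result says that every critical point of f is either a global minimizer or a strict saddle, i.e., a point at which the Hessian \nabla^2 f(W_1, W_2) has at least one strictly negative eigenvalue; this strict-saddle property is what allows saddle-escaping local search methods to reach a global minimum.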
