Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion, and Blind Deconvolution

Recent years have seen a flurry of activity in the design of provably efficient nonconvex procedures for statistical estimation. Because the empirical loss is highly nonconvex, state-of-the-art procedures often require explicit regularization (e.g., trimming, regularized cost functions, projection) in order to guarantee fast convergence. For vanilla procedures such as gradient descent, by contrast, prior theory either recommends highly conservative learning rates to avoid overshooting or lacks performance guarantees altogether. This paper uncovers a striking phenomenon in nonconvex optimization: even in the absence of explicit regularization, gradient descent enforces proper regularization implicitly under a variety of statistical models. In fact, gradient descent follows a trajectory that stays within a basin of benign geometry, consisting of points incoherent with the sampling mechanism. This “implicit regularization” allows gradient descent to proceed far more aggressively without overshooting, which in turn yields substantial computational savings. Focusing on three fundamental statistical estimation problems, namely phase retrieval, low-rank matrix completion, and blind deconvolution, we establish that gradient descent achieves near-optimal statistical and computational guarantees without explicit regularization. In particular, by marrying statistical modeling with generic optimization theory, we develop a general recipe for analyzing the trajectories of iterative algorithms via a leave-one-out perturbation argument. As a by-product, for noisy matrix completion, we demonstrate that gradient descent achieves near-optimal error control, measured entrywise and in the spectral norm, which may be of independent interest.
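To make the setting concrete, here is a minimal sketch (not the authors' implementation) of the vanilla procedure for phase retrieval under a real-valued Gaussian design: spectral initialization followed by gradient descent on the quadratic loss f(x) = (1/4m) Σᵢ ((aᵢᵀx)² − yᵢ)², with no trimming or truncation. The function name, step-size constant, and problem sizes are illustrative assumptions; the paper's point is that this unregularized iteration already converges linearly, with step sizes far more aggressive (on the order of 1/log n) than the conservative O(1/n) choices suggested by prior theory.

```python
import numpy as np

def phase_retrieval_gd(A, y, iters=500, eta=0.1):
    """Vanilla gradient descent on f(x) = (1/4m) * sum_i ((a_i^T x)^2 - y_i)^2."""
    m, n = A.shape
    # Spectral initialization: leading eigenvector of (1/m) sum_i y_i a_i a_i^T,
    # rescaled so that ||x0||^2 matches mean(y), an estimate of ||x*||^2.
    Y = (A.T * y) @ A / m
    _, eigvecs = np.linalg.eigh(Y)            # eigenvalues in ascending order
    x = np.sqrt(y.mean()) * eigvecs[:, -1]
    step = eta / y.mean()                     # step scaled by the signal energy
    for _ in range(iters):
        Ax = A @ x
        grad = A.T @ ((Ax**2 - y) * Ax) / m   # gradient of f at the current x
        x = x - step * grad
    return x

# Toy usage: recover a random signal from m = 10n quadratic measurements.
rng = np.random.default_rng(0)
n, m = 100, 1000
x_star = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2
x_hat = phase_retrieval_gd(A, y)
# The global sign is unidentifiable in phase retrieval; compare up to sign.
err = min(np.linalg.norm(x_hat - x_star), np.linalg.norm(x_hat + x_star))
print(f"relative error: {err / np.linalg.norm(x_star):.2e}")
```

The step here is normalized by the estimated signal energy mean(y) ≈ ‖x*‖², the standard Wirtinger-flow convention, so the same constant eta works regardless of the scale of x*.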
