On the Gap Between Strict-Saddles and True Convexity: An Omega(log d) Lower Bound for Eigenvector Approximation

We prove a \emph{query complexity} lower bound on rank-one principal component analysis (PCA). We consider an oracle model where, given a symmetric matrix $M \in \mathbb{R}^{d \times d}$, an algorithm is allowed to make $T$ \emph{exact} queries of the form $w^{(i)} = Mv^{(i)}$ for $i \in \{1,\dots,T\}$, where $v^{(i)}$ is drawn from a distribution that may depend arbitrarily on the past queries and measurements $\{v^{(j)},w^{(j)}\}_{1 \le j \le i-1}$. We show that for a small constant $\epsilon$, any adaptive, randomized algorithm that finds a unit vector $\widehat{v}$ satisfying $\widehat{v}^{\top}M\widehat{v} \ge (1-\epsilon)\|M\|$, even with small probability, must make $T = \Omega(\log d)$ queries. In addition to settling a widely held folk conjecture, this bound demonstrates a fundamental gap between convex optimization and "strict-saddle" non-convex optimization, of which PCA is a canonical example: in the former, first-order methods can have dimension-free iteration complexity, whereas in PCA, the iteration complexity of gradient-based methods must necessarily grow with the dimension. Our argument proceeds via a reduction to estimating the rank-one spike in a deformed Wigner model. We establish lower bounds for this model by developing a "truncated" analogue of the $\chi^2$ Bayes-risk lower bound of Chen et al.
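The query model above can be made concrete with a short sketch. It is only an illustration, not the construction used in the lower-bound proof: the names `MatVecOracle`, `deformed_wigner`, and `power_method` are hypothetical, the noise scaling (off-diagonal variance $1/d$, diagonal variance $2/d$) is an assumed normalization, and power iteration is just one familiar example of an adaptive algorithm that touches $M$ only through exact queries $w^{(i)} = Mv^{(i)}$.

```python
import math
import random


class MatVecOracle:
    """Exposes a hidden symmetric matrix M only through exact products w = M v."""

    def __init__(self, M):
        self._M = M
        self.num_queries = 0  # T in the abstract's notation

    def query(self, v):
        self.num_queries += 1
        return [sum(m * x for m, x in zip(row, v)) for row in self._M]


def unit(v):
    """Rescale v to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def deformed_wigner(d, lam, rng):
    """Return (M, u) with M = W + lam * u u^T.

    W is symmetric Gaussian noise (assumed scaling: variance 1/d off the
    diagonal, 2/d on it) and u is a uniformly random unit vector (the spike).
    """
    u = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
    M = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):
            std = math.sqrt((2.0 if i == j else 1.0) / d)
            M[i][j] = M[j][i] = rng.gauss(0.0, std) + lam * u[i] * u[j]
    return M, u


def power_method(oracle, d, T, rng):
    """Adaptive algorithm: each query is the normalized previous response."""
    v = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
    for _ in range(T):
        v = unit(oracle.query(v))
    return v


if __name__ == "__main__":
    rng = random.Random(0)
    d, lam, T = 40, 5.0, 30
    M, u = deformed_wigner(d, lam, rng)
    oracle = MatVecOracle(M)
    v_hat = power_method(oracle, d, T, rng)
    overlap = abs(sum(a * b for a, b in zip(u, v_hat)))
    print("queries used:", oracle.num_queries, "overlap with spike:", overlap)
```

In this sketch the oracle's query counter plays the role of $T$, and the overlap $|\langle \widehat{v}, u \rangle|$ measures how well the spike has been recovered. The spike strength `lam = 5.0` is deliberately large so that a few iterations suffice; the lower bound concerns how the required number of queries must grow with $d$ for any such adaptive scheme.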

[1]  A. Guionnet,et al.  An Introduction to Random Matrices , 2009 .

[2]  David P. Woodruff,et al.  Low-Rank PSD Approximation in Input-Sparsity Time , 2017, SODA.

[3]  James Demmel,et al.  Applied Numerical Linear Algebra , 1997 .

[4]  Cameron Musco,et al.  Randomized Block Krylov Methods for Stronger and Faster Approximate Singular Value Decomposition , 2015, NIPS.

[5]  Le Song,et al.  Communication Efficient Distributed Kernel Principal Component Analysis , 2015, KDD.

[6]  Sanjeev Arora,et al.  Computing a nonnegative matrix factorization -- provably , 2011, STOC '12.

[7]  Ohad Shamir,et al.  Fundamental Limits of Online and Distributed Algorithms for Statistical Learning and Estimation , 2013, NIPS.

[8]  John Darzentas,et al.  Problem Complexity and Method Efficiency in Optimization , 1983 .

[9]  Xiaodong Li,et al.  Phase Retrieval via Wirtinger Flow: Theory and Algorithms , 2014, IEEE Transactions on Information Theory.

[10]  L. Addario-Berry,et al.  On Combinatorial Testing Problems , 2010 .

[11]  David P. Woodruff,et al.  On Sketching Matrix Norms and the Top Singular Vector , 2014, SODA.

[12]  Sanjeev Arora,et al.  Simple, Efficient, and Neural Algorithms for Sparse Coding , 2015, COLT.

[13]  Nathan Srebro,et al.  Fast maximum margin matrix factorization for collaborative prediction , 2005, ICML.

[14]  Anastasios Kyrillidis,et al.  Dropping Convexity for Faster Semi-definite Optimization , 2015, COLT.

[15]  Nicolas Boumal,et al.  Nonconvex Phase Synchronization , 2016, SIAM J. Optim..

[16]  David P. Woodruff,et al.  On approximating functions of the singular values in a stream , 2016, STOC.

[17]  Ohad Shamir,et al.  Oracle Complexity of Second-Order Methods for Finite-Sum Problems , 2016, ICML.

[18]  Martin J. Wainwright,et al.  Information-theoretic lower bounds on the oracle complexity of convex optimization , 2009, NIPS.

[19]  Michael I. Jordan,et al.  How to Escape Saddle Points Efficiently , 2017, ICML.

[20]  Xi Chen,et al.  On Bayes Risk Lower Bounds , 2014, J. Mach. Learn. Res..

[21]  R. Castro.  Adaptive sensing performance lower bounds for sparse signal detection and support estimation , 2012, 1206.0648.

[22]  Tengyu Ma,et al.  Finding approximate local minima faster than gradient descent , 2016, STOC.

[23]  Anima Anandkumar,et al.  Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods , 2017 .

[24]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[25]  Cristopher Moore,et al.  Random k-SAT: Two Moments Suffice to Cross a Sharp Threshold , 2003, SIAM J. Comput..

[26]  Daniel A. Spielman,et al.  Spectral Graph Theory and its Applications , 2007, 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07).

[27]  A. Bandeira,et al.  Sharp nonasymptotic bounds on the norm of random matrices with independent entries , 2014, 1408.6185.

[28]  Max Simchowitz,et al.  Low-rank Solutions of Linear Matrix Equations via Procrustes Flow , 2015, ICML.

[29]  Ian T. Jolliffe,et al.  Principal Component Analysis , 2002, International Encyclopedia of Statistical Science.

[30]  Prateek Jain,et al.  Phase Retrieval Using Alternating Minimization , 2013, IEEE Transactions on Signal Processing.

[31]  Michael I. Jordan,et al.  Gradient Descent Only Converges to Minimizers , 2016, COLT.

[32]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[33]  Martin J. Wainwright,et al.  Distance-based and continuum Fano inequalities with applications to statistical estimation , 2013, ArXiv.

[34]  Alessandro Lazaric,et al.  Best-Arm Identification in Linear Bandits , 2014, NIPS.

[35]  Rui M. Castro,et al.  Adaptive Compressed Sensing for Support Recovery of Structured Sparse Sets , 2014, IEEE Transactions on Information Theory.

[36]  Sham M. Kakade,et al.  Faster Eigenvector Computation via Shift-and-Invert Preconditioning , 2016, ICML.

[37]  Emmanuel J. Candès,et al.  On the Fundamental Limits of Adaptive Sensing , 2011, IEEE Transactions on Information Theory.

[38]  Marc Lelarge,et al.  Fundamental limits of symmetric low-rank matrix estimation , 2016, Probability Theory and Related Fields.

[39]  David P. Woodruff,et al.  Lower Bounds for Adaptive Sparse Recovery , 2012, SODA.

[40]  David P. Woodruff,et al.  Low rank approximation with entrywise l1-norm error , 2017, STOC.

[41]  Roman Vershynin,et al.  Introduction to the non-asymptotic analysis of random matrices , 2010, Compressed Sensing.

[42]  Friedrich Liese.  phi-divergences, sufficiency, Bayes sufficiency, and deficiency , 2012, Kybernetika.

[43]  David J. Field,et al.  Sparse coding with an overcomplete basis set: A strategy employed by V1? , 1997, Vision Research.

[44]  Christos Boutsidis,et al.  Optimal principal component analysis in distributed and streaming models , 2015, STOC.

[45]  I. Csiszár.  A class of measures of informativity of observation channels , 1972 .

[46]  John Wright,et al.  When Are Nonconvex Problems Not Scary? , 2015, ArXiv.

[47]  Ohad Shamir,et al.  On Lower and Upper Bounds in Smooth and Strongly Convex Optimization , 2016, J. Mach. Learn. Res..

[48]  Max Simchowitz,et al.  The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime , 2017, COLT.

[49]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[50]  Yair Carmon,et al.  Gradient Descent Efficiently Finds the Cubic-Regularized Non-Convex Newton Step , 2016, ArXiv.

[51]  Sébastien Bubeck,et al.  Convex Optimization: Algorithms and Complexity , 2014, Found. Trends Mach. Learn..

[52]  David P. Woodruff,et al.  On deterministic sketching and streaming for sparse recovery and norm estimation , 2014 .

[53]  Aurélien Garivier,et al.  Optimal Best Arm Identification with Fixed Confidence , 2016, COLT.

[54]  David P. Woodruff,et al.  Weighted low rank approximations with provable guarantees , 2016, STOC.

[55]  Tengyu Ma,et al.  Matrix Completion has No Spurious Local Minimum , 2016, NIPS.

[56]  John Wright,et al.  Complete Dictionary Recovery Over the Sphere I: Overview and the Geometric Picture , 2015, IEEE Transactions on Information Theory.

[57]  Furong Huang,et al.  Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition , 2015, COLT.

[58]  Aditya Guntuboyina.  Lower Bounds for the Minimax Risk Using $f$-Divergences, and Applications , 2011, IEEE Transactions on Information Theory.

[59]  Kasper Green Larsen,et al.  Time Lower Bounds for Nonadaptive Turnstile Streaming Algorithms , 2014, STOC.

[60]  Dennis M. Wilkinson,et al.  Large-Scale Parallel Collaborative Filtering for the Netflix Prize , 2008, AAIM.

[61]  Tengyu Ma,et al.  Finding Approximate Local Minima for Nonconvex Optimization in Linear Time , 2016, ArXiv.

[62]  John D. Lafferty,et al.  A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements , 2015, NIPS.

[63]  Andrew Chi-Chih Yao,et al.  Probabilistic computations: Toward a unified measure of complexity , 1977, 18th Annual Symposium on Foundations of Computer Science (sfcs 1977).

[64]  D. Féral,et al.  The Largest Eigenvalue of Rank One Deformation of Large Wigner Matrices , 2006, math/0605624.

[65]  Rajeev Motwani,et al.  The PageRank Citation Ranking: Bringing Order to the Web , 1999, WWW 1999.

[66]  Nathan Srebro,et al.  Global Optimality of Local Search for Low Rank Matrix Recovery , 2016, NIPS.

[67]  Ohad Shamir,et al.  Communication-efficient Algorithms for Distributed Stochastic Principal Component Analysis , 2017, ICML.

[68]  Tengyu Ma,et al.  Online Learning of Eigenvectors , 2015, ICML.

[69]  Martin J. Wainwright,et al.  Information-Theoretic Lower Bounds on the Oracle Complexity of Stochastic Convex Optimization , 2010, IEEE Transactions on Information Theory.

[70]  Alexandre B. Tsybakov,et al.  Introduction to Nonparametric Estimation , 2008, Springer series in statistics.

[71]  John C. Duchi,et al.  Minimax rates for memory-bounded sparse linear regression , 2015, COLT.

[72]  Jakub W. Pachocki,et al.  Optimal lower bounds for universal relation, samplers, and finding duplicates , 2017, ArXiv.

[73]  J. R. Fienup,et al.  Phase retrieval algorithms: a comparison , 1982, Applied optics.

[74]  Ohad Shamir,et al.  Fast Stochastic Algorithms for SVD and PCA: Convergence Properties and Convexity , 2015, ICML.

[75]  Gábor Lugosi,et al.  Concentration Inequalities - A Nonasymptotic Theory of Independence , 2013, Concentration Inequalities.

[76]  O. Kallenberg Foundations of Modern Probability , 2021, Probability Theory and Stochastic Modelling.

[77]  Gregory Valiant,et al.  Memory, Communication, and Statistical Queries , 2016, COLT.

[78]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[79]  David P. Woodruff,et al.  Low rank approximation and regression in input sparsity time , 2013, STOC '13.

[80]  John Wright,et al.  A Geometric Analysis of Phase Retrieval , 2016, International Symposium on Information Theory.

[81]  Robert D. Nowak,et al.  Query Complexity of Derivative-Free Optimization , 2012, NIPS.

[82]  Katta G. Murty,et al.  Some NP-complete problems in quadratic and nonlinear programming , 1987, Math. Program..

[83]  Rahul Jain,et al.  Lifting randomized query complexity to randomized communication complexity , 2017, Electron. Colloquium Comput. Complex..