Complete Dictionary Recovery Over the Sphere I: Overview and the Geometric Picture

We consider the problem of recovering a complete (i.e., square and invertible) matrix <inline-formula> <tex-math notation="LaTeX">$ A_{0}$ </tex-math></inline-formula> from <inline-formula> <tex-math notation="LaTeX">$ Y \in \mathbb R ^{n \times p}$ </tex-math></inline-formula> with <inline-formula> <tex-math notation="LaTeX">$ Y = A_{0} X_{0}$ </tex-math></inline-formula>, provided <inline-formula> <tex-math notation="LaTeX">$ X_{0}$ </tex-math></inline-formula> is sufficiently sparse. This recovery problem is central to the theoretical understanding of dictionary learning, which seeks a sparse representation for a collection of input signals and finds numerous applications in modern signal processing and machine learning. We give the first efficient algorithm that provably recovers <inline-formula> <tex-math notation="LaTeX">$ A_{0}$ </tex-math></inline-formula> when <inline-formula> <tex-math notation="LaTeX">$ X_{0}$ </tex-math></inline-formula> has <inline-formula> <tex-math notation="LaTeX">$O(n)$ </tex-math></inline-formula> nonzeros per column, under a suitable probability model for <inline-formula> <tex-math notation="LaTeX">$ X_{0}$ </tex-math></inline-formula>. In contrast, prior results based on efficient algorithms either guarantee recovery only when <inline-formula> <tex-math notation="LaTeX">$ X_{0}$ </tex-math></inline-formula> has <inline-formula> <tex-math notation="LaTeX">$O(\sqrt {n})$ </tex-math></inline-formula> nonzeros per column, or require multiple rounds of semidefinite programming relaxation to work when <inline-formula> <tex-math notation="LaTeX">$ X_{0}$ </tex-math></inline-formula> has <inline-formula> <tex-math notation="LaTeX">$O(n)$ </tex-math></inline-formula> nonzeros per column. Our algorithmic pipeline centers on solving a certain nonconvex optimization problem with a spherical constraint. In this paper, we provide a geometric characterization of the objective landscape. In particular, we show that, with high probability, the problem is highly structured: 1) there are no “spurious” local minimizers and 2) around every saddle point the objective has a direction of negative curvature. This distinctive structure makes the problem amenable to efficient optimization algorithms. In a companion paper, we design a second-order trust-region algorithm over the sphere that provably converges to a local minimizer from arbitrary initializations, despite the presence of saddle points.
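To make the setup concrete, the following is a minimal sketch of the data model and of one sphere-constrained objective one might evaluate. The abstract does not spell out the probability model or the objective, so several pieces here are assumptions made purely for illustration: an orthogonal dictionary, a Bernoulli-Gaussian coefficient matrix, a smooth log-cosh sparsity surrogate, and the parameter values `n`, `p`, `theta`, and `mu`. The function name `log_cosh_objective` is hypothetical.

```python
import numpy as np

def log_cosh_objective(q, Y, mu):
    """Assumed smooth sparsity surrogate: mean over columns y_k of mu * log cosh(q^T y_k / mu)."""
    t = (q @ Y) / mu
    # log cosh(t) = logaddexp(t, -t) - log 2; this form avoids overflow for large |t|.
    return mu * np.mean(np.logaddexp(t, -t) - np.log(2.0))

rng = np.random.default_rng(0)
n, p, theta, mu = 20, 5000, 0.1, 0.01  # illustrative values only (assumptions)

# Assumed model: orthogonal A0 and Bernoulli(theta)-Gaussian X0, observed as Y = A0 X0.
A0, _ = np.linalg.qr(rng.standard_normal((n, n)))
X0 = rng.standard_normal((n, p)) * (rng.random((n, p)) < theta)
Y = A0 @ X0

# A point on the sphere that exposes a sparse row: q^T Y equals the first row of X0.
q_target = A0[:, 0]

# A generic point on the sphere for comparison.
q_random = rng.standard_normal(n)
q_random /= np.linalg.norm(q_random)

f_target = log_cosh_objective(q_target, Y, mu)
f_random = log_cosh_objective(q_random, Y, mu)
print(f_target, f_random)
```

Under these assumptions, the surrogate is smaller at the direction that reproduces a sparse row of the coefficient matrix than at a generic direction, which is the intuition behind searching over the sphere for such directions.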
