Paper Information - Complete Dictionary Recovery Over the Sphere I: Overview and the Geometric Picture

Complete Dictionary Recovery Over the Sphere I: Overview and the Geometric Picture

We consider the problem of recovering a complete (i.e., square and invertible) matrix $A_{0}$ from $Y \in \mathbb{R}^{n \times p}$ with $Y = A_{0} X_{0}$, provided $X_{0}$ is sufficiently sparse. This recovery problem is central to the theoretical understanding of dictionary learning, which seeks a sparse representation for a collection of input signals and finds numerous applications in modern signal processing and machine learning. We give the first efficient algorithm that provably recovers $A_{0}$ when $X_{0}$ has $O(n)$ nonzeros per column, under a suitable probability model for $X_{0}$. In contrast, prior results based on efficient algorithms either only guarantee recovery when $X_{0}$ has $O(\sqrt{n})$ nonzeros per column, or require multiple rounds of semidefinite programming relaxation to work when $X_{0}$ has $O(n)$ nonzeros per column. Our algorithmic pipeline centers around solving a certain nonconvex optimization problem with a spherical constraint. In this paper, we provide a geometric characterization of the objective landscape.
In particular, we show that the problem is highly structured with high probability: 1) there are no “spurious” local minimizers and 2) around all saddle points the objective has a negative directional curvature. This distinctive structure makes the problem amenable to efficient optimization algorithms. In a companion paper, we design a second-order trust-region algorithm over the sphere that provably converges to a local minimizer from arbitrary initializations, despite the presence of saddle points.
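To make the setup concrete, the following is a minimal sketch of the data model $Y = A_{0} X_{0}$ and of a sphere-constrained sparsity objective. The abstract does not spell out the objective or the probability model, so everything specific here is an assumption made for illustration: the Bernoulli-Gaussian model for $X_{0}$, the orthogonal toy dictionary built from a Householder reflection, the smooth surrogate $\mu \log\cosh(z/\mu)$ for $|z|$, and all parameter values and function names. The sketch only compares the objective at a direction aligned with a dictionary column against a random direction on the sphere.

```python
import math
import random


def log_cosh(t):
    # Numerically stable log(cosh(t)) = |t| + log(1 + exp(-2|t|)) - log(2).
    a = abs(t)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


def householder(v):
    # Orthogonal (and symmetric) matrix I - 2 v v^T / (v^T v).
    # Assumption: used here only as a convenient toy complete dictionary.
    n = len(v)
    s = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / s
             for j in range(n)] for i in range(n)]


def sample_data(n, p, theta, rng):
    # Assumed model: each entry of X0 is nonzero with probability theta,
    # and standard Gaussian when nonzero. Columns are stored as lists.
    A0 = householder([rng.gauss(0.0, 1.0) for _ in range(n)])
    X0_cols = [[rng.gauss(0.0, 1.0) if rng.random() < theta else 0.0
                for _ in range(n)] for _ in range(p)]
    Y_cols = [[sum(A0[i][k] * x[k] for k in range(n)) for i in range(n)]
              for x in X0_cols]
    return A0, X0_cols, Y_cols


def objective(q, Y_cols, mu):
    # f(q) = (1/p) * sum_k mu * log cosh(q^T y_k / mu), a smooth proxy
    # for the average absolute value of the entries of q^T Y.
    total = 0.0
    for y in Y_cols:
        z = sum(qi * yi for qi, yi in zip(q, y))
        total += mu * log_cosh(z / mu)
    return total / len(Y_cols)


def random_unit_vector(n, rng):
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


rng = random.Random(0)
n, p, theta, mu = 10, 2000, 0.2, 0.05  # illustrative values only
A0, X0_cols, Y_cols = sample_data(n, p, theta, rng)

# q aligned with the first column of A0: q^T Y equals the first row of X0,
# which is sparse, so the objective should be small there.
q_star = [A0[i][0] for i in range(n)]
q_rand = random_unit_vector(n, rng)

f_star = objective(q_star, Y_cols, mu)
f_rand = objective(q_rand, Y_cols, mu)
print("objective at a dictionary column:", f_star)
print("objective at a random point on the sphere:", f_rand)
```

In this toy setting the aligned direction typically attains a noticeably lower objective value than a generic point on the sphere, which is the kind of structure the geometric analysis is concerned with. The sketch does not implement the trust-region method from the companion paper.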
