On the optimization landscape of tensor decompositions

Non-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-the-art results. It has become increasingly important to understand why these methods work for NP-hard problems on typical data. The landscapes of many objective functions in learning have been conjectured to have the geometric property that "all local optima are (approximately) global optima," so that they can be solved efficiently by local search algorithms. However, establishing such a property can be very difficult. In this paper, we analyze the optimization landscape of the random over-complete tensor decomposition problem, which has many applications in unsupervised learning, especially in learning latent variable models. In practice, it can be solved efficiently by gradient ascent on a non-convex objective. We show that for any small constant $\epsilon > 0$, among the set of points with function values a factor of $(1+\epsilon)$ larger than the expectation of the function, all local maxima are approximate global maxima. The best previously known result characterizes the geometry only in small neighborhoods around the true components. Our result implies that even with an initialization that is barely better than a random guess, gradient ascent is guaranteed to solve this problem. Our main techniques are the Kac-Rice formula and random matrix theory. To the best of our knowledge, this is the first time the Kac-Rice formula has been successfully applied to counting the number of local minima of a highly structured random polynomial with dependent coefficients.
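To make the local-search procedure concrete, here is a minimal sketch of projected (Riemannian) gradient ascent on the unit sphere, assuming the commonly used fourth-order objective $f(x) = \sum_i \langle a_i, x\rangle^4$ for a tensor $T = \sum_i a_i^{\otimes 4}$ with unit-norm components $a_i$; the exact objective and scaling analyzed in the paper may differ, so this is only an illustration of the style of algorithm, not the paper's method.

```python
# Sketch: gradient ascent on the sphere for over-complete tensor decomposition,
# assuming the objective f(x) = sum_i <a_i, x>^4 (an illustrative choice).
import numpy as np

def gradient_ascent_component(A, steps=2000, lr=0.05, seed=0):
    """Attempt to recover one component of T = sum_i a_i^{tensor 4}.

    A : (n, d) array whose rows a_i are the (unknown) unit-norm components.
    Returns a unit vector x; at a local maximum of f, x is expected to be
    close to +/- some a_i when the landscape has no spurious local maxima.
    """
    n, d = A.shape
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)            # random initialization on the sphere
    for _ in range(steps):
        c = A @ x                     # correlations <a_i, x>
        grad = 4.0 * A.T @ (c ** 3)   # Euclidean gradient of sum_i <a_i, x>^4
        grad -= (grad @ x) * x        # project onto the tangent space at x
        x = x + lr * grad             # ascent step
        x /= np.linalg.norm(x)        # retract back onto the unit sphere
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, n = 50, 100                    # over-complete regime: n > d
    A = rng.standard_normal((n, d))
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    x = gradient_ascent_component(A)
    print("max correlation with a true component:", np.max(np.abs(A @ x)))
```

In this sketch the random initialization carries essentially no information about the components; the point of the landscape result is that such an initialization (indeed, anything slightly better than random) already suffices for gradient ascent to reach an approximate global maximum.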
