Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition

We analyze stochastic gradient descent for optimizing non-convex functions. For many non-convex functions the goal is to find a reasonable local minimum, and the main concern is that gradient updates get trapped in saddle points. In this paper we identify a strict saddle property of non-convex problems that allows for efficient optimization. Using this property we show that stochastic gradient descent converges to a local minimum in a polynomial number of iterations. To the best of our knowledge this is the first work that gives global convergence guarantees for stochastic gradient descent on non-convex functions with exponentially many local minima and saddle points. Our analysis applies to orthogonal tensor decomposition, which is widely used in learning a rich class of latent variable models. We propose a new optimization formulation for the tensor decomposition problem that satisfies the strict saddle property. As a result we obtain the first online algorithm for orthogonal tensor decomposition with a global convergence guarantee.
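
The abstract's core algorithmic idea is that the noise in stochastic gradient updates is enough to push the iterate off the unstable manifold of a strict saddle point and into a local minimum. Below is a minimal sketch of that idea on a toy two-dimensional strict-saddle objective; the objective, step size, noise scale, and iteration count are illustrative assumptions, not the paper's formulation or constants.

    # Minimal sketch: gradient descent with injected per-step noise, of the
    # kind the abstract describes for escaping strict saddle points.
    # All hyperparameters below are illustrative assumptions.
    import numpy as np

    def noisy_sgd(grad, x0, eta=0.01, noise_scale=0.1, iters=10_000, seed=0):
        """Gradient descent with isotropic noise added at every step."""
        rng = np.random.default_rng(seed)
        x = np.array(x0, dtype=float)
        for _ in range(iters):
            g = grad(x)
            # Noise drawn uniformly from a sphere of radius noise_scale;
            # any zero-mean bounded noise plays the same escaping role.
            xi = rng.normal(size=x.shape)
            xi *= noise_scale / np.linalg.norm(xi)
            x -= eta * (g + xi)
        return x

    # Toy strict-saddle objective f(x, y) = x^2 - y^2 + y^4: the origin is a
    # strict saddle (Hessian has a negative eigenvalue), and (0, +-1/sqrt(2))
    # are local minima.  Noiseless gradient descent started at the origin
    # stays stuck there; the noisy iterates drift off and converge.
    grad = lambda v: np.array([2 * v[0], -2 * v[1] + 4 * v[1] ** 3])
    print(noisy_sgd(grad, x0=[0.0, 0.0]))  # ends near (0, +-0.707)

The tensor-decomposition application in the paper replaces this toy objective with a formulation whose saddle points are all strict in the same sense, so the same noisy update provably reaches a local minimum.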
