Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition

We analyze stochastic gradient descent for optimizing non-convex functions. For many non-convex functions the goal is to find a reasonable local minimum, and the main concern is that gradient updates get trapped in saddle points. In this paper we identify a strict saddle property of non-convex problems that allows for efficient optimization. Using this property we show that stochastic gradient descent converges to a local minimum in a polynomial number of iterations. To the best of our knowledge this is the first work that gives global convergence guarantees for stochastic gradient descent on non-convex functions with exponentially many local minima and saddle points. Our analysis applies to orthogonal tensor decomposition, which is widely used in learning a rich class of latent variable models. We propose a new optimization formulation for the tensor decomposition problem that satisfies the strict saddle property. As a result we obtain the first online algorithm for orthogonal tensor decomposition with a global convergence guarantee.
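
The abstract's core algorithmic idea is that the noise in stochastic gradient updates is enough to push the iterate off the unstable manifold of a strict saddle point and into a local minimum. Below is a minimal sketch of that idea on a toy two-dimensional strict-saddle objective; the objective, step size, noise scale, and iteration count are illustrative assumptions, not the paper's formulation or constants.

    # Minimal sketch: gradient descent with injected per-step noise, of the
    # kind the abstract describes for escaping strict saddle points.
    # All hyperparameters below are illustrative assumptions.
    import numpy as np

    def noisy_sgd(grad, x0, eta=0.01, noise_scale=0.1, iters=10_000, seed=0):
        """Gradient descent with isotropic noise added at every step."""
        rng = np.random.default_rng(seed)
        x = np.array(x0, dtype=float)
        for _ in range(iters):
            g = grad(x)
            # Noise drawn uniformly from a sphere of radius noise_scale;
            # any zero-mean bounded noise plays the same escaping role.
            xi = rng.normal(size=x.shape)
            xi *= noise_scale / np.linalg.norm(xi)
            x -= eta * (g + xi)
        return x

    # Toy strict-saddle objective f(x, y) = x^2 - y^2 + y^4: the origin is a
    # strict saddle (Hessian has a negative eigenvalue), and (0, +-1/sqrt(2))
    # are local minima.  Noiseless gradient descent started at the origin
    # stays stuck there; the noisy iterates drift off and converge.
    grad = lambda v: np.array([2 * v[0], -2 * v[1] + 4 * v[1] ** 3])
    print(noisy_sgd(grad, x0=[0.0, 0.0]))  # ends near (0, +-0.707)

The tensor-decomposition application in the paper replaces this toy objective with a formulation whose saddle points are all strict in the same sense, so the same noisy update provably reaches a local minimum.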
