Online ICA: Understanding Global Dynamics of Nonconvex Optimization via Diffusion Processes

Solving statistical learning problems often involves nonconvex optimization. Despite the empirical success of nonconvex statistical optimization methods, their global dynamics, especially convergence to the desirable local minima, remain less well understood in theory. In this paper, we propose a new analytic paradigm based on diffusion processes to characterize the global dynamics of nonconvex statistical optimization. As a concrete example, we study stochastic gradient descent (SGD) for the tensor decomposition formulation of independent component analysis (ICA). In particular, we cast the different phases of SGD as diffusion processes, i.e., solutions to stochastic differential equations. Initialized from an unstable equilibrium, the global dynamics of SGD pass through three consecutive phases: (i) an unstable Ornstein-Uhlenbeck process slowly departing from the initialization, (ii) the solution to an ordinary differential equation, which quickly evolves towards the desirable local minimum, and (iii) a stable Ornstein-Uhlenbeck process oscillating around the desirable local minimum. Our proof techniques are based on Stroock and Varadhan’s theory of weak convergence of Markov chains to diffusion processes and are of independent interest.
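The three phases described above can be observed in a small simulation. The following is a minimal sketch under stated assumptions, not the paper's exact algorithm: it runs projected online SGD on the fourth-moment objective E[(u^T x)^4] over the unit sphere, with sources drawn uniformly on [-sqrt(3), sqrt(3)] (unit variance, fourth moment 9/5 < 3, so the objective is minimized), a random orthogonal mixing matrix, and a hand-picked step size. The function name `online_ica_sgd` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def online_ica_sgd(d=5, steps=30000, lr=2e-3, seed=0):
    """Projected online SGD for a fourth-moment ICA objective (illustrative sketch).

    Returns the final unit vector u and a history of max_i |a_i^T u|, the
    alignment of u with the closest independent component direction.
    """
    rng = np.random.default_rng(seed)

    # Assumption: orthogonal mixing matrix A, obtained from a QR factorization.
    A, _ = np.linalg.qr(rng.standard_normal((d, d)))

    # Start at an unstable equilibrium: equal weight on every component.
    u = A @ (np.ones(d) / np.sqrt(d))

    half_width = np.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has unit variance
    history = np.empty(steps + 1)
    history[0] = np.max(np.abs(A.T @ u))

    for t in range(steps):
        y = rng.uniform(-half_width, half_width, size=d)  # independent sources
        x = A @ y                                         # one observed sample
        # Stochastic gradient of (1/4) (u^T x)^4 with respect to u.
        grad = (u @ x) ** 3 * x
        # Descent step (sources have fourth moment below 3), then project
        # back onto the unit sphere.
        u = u - lr * grad
        u /= np.linalg.norm(u)
        history[t + 1] = np.max(np.abs(A.T @ u))

    return u, history


if __name__ == "__main__":
    u, history = online_ica_sgd()
    print("initial alignment:", history[0])
    print("final alignment:  ", history[-1])
```

Plotting `history` against the iteration count should show a slow departure from the initial value 1/sqrt(d), a rapid rise towards 1, and then small fluctuations near 1, which loosely correspond to phases (i), (ii), and (iii). The phase boundaries depend on the step size and the seed.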
