DYNAMICAL, SYMPLECTIC AND STOCHASTIC PERSPECTIVES ON GRADIENT-BASED OPTIMIZATION

Our topic is the relationship between dynamical systems and optimization. This is a venerable, vast area in mathematics, counting among its many historical threads the study of gradient flow and the variational perspective on mechanics. We aim to build some new connections in this general area, studying aspects of gradient-based optimization from a continuous-time, variational point of view. We go beyond classical gradient flow to focus on second-order dynamics, aiming to show the relevance of such dynamics to optimization algorithms that not only converge, but converge quickly.

Although our focus is theoretical, it is important to motivate the work by considering the applied context from which it has emerged. Modern statistical data analysis often involves very large data sets and very large parameter spaces, so that computational efficiency is of paramount importance in practical applications. In such settings, the notion of efficiency is more stringent than that of classical computational complexity theory, where the distinction between polynomial complexity and exponential complexity has been a useful focus. In large-scale data analysis, algorithms need to be not merely polynomial, but linear, or nearly linear, in relevant problem parameters. Optimization theory has provided both practical and theoretical support for this endeavor. It has supplied computationally efficient algorithms, as well as analysis tools that allow rates of convergence to be determined as explicit functions of problem parameters. The dictum of efficiency has led to a focus on algorithms that are based principally on gradients of objective functions, or on estimates of gradients, given that Hessians incur quadratic or cubic complexity in the dimension of the configuration space (Bottou, 2010; Nesterov, 2012).

More broadly, the blending of inferential and computational ideas is one of the major intellectual trends of the current century, one currently referred to by terms such as "data science" and "machine learning." It is a trend that inspires the search for new mathematical concepts that allow computational and inferential desiderata to be studied jointly. For example, one would like to impose runtime budgets on data-analysis algorithms as a function of statistical quantities such as risk.
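To make the contrast between first-order and second-order dynamics concrete, here is a minimal sketch in LaTeX notation, with f a smooth convex objective on R^d, X_t the continuous-time trajectory, and x* a minimizer; the second equation is the ODE that Su, Boyd and Candès [68] derived as a continuous-time limit of Nesterov's accelerated gradient method [7], and the stated rates hold under standard convexity and smoothness assumptions:

    \dot{X}_t = -\nabla f(X_t), \qquad f(X_t) - f(x^\ast) = O(1/t) \quad \text{(gradient flow: first-order dynamics)}

    \ddot{X}_t + \frac{3}{t}\,\dot{X}_t + \nabla f(X_t) = 0, \qquad f(X_t) - f(x^\ast) = O(1/t^2) \quad \text{(second-order dynamics)}

Discretizing the first equation with an appropriately small step size recovers gradient descent and its O(1/k) rate, while suitable discretizations of the second family yield accelerated methods with the O(1/k^2) rate; how to discretize such second-order dynamics without destroying the fast rate is precisely the question addressed by the variational and symplectic perspectives developed in the works cited below [69, 79, 95].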

[1] V. G. Troitsky et al. Journal of Mathematical Analysis and Applications, 1960.
[2] Boris Polyak. Gradient methods for the minimisation of functionals. 1963.
[3] O. Mangasarian. Pseudo-convex functions. 1965.
[4] Kazuoki Azuma. Weighted sums of certain dependent random variables. 1967.
[5] Richard A. Harshman et al. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. 1970.
[6] John Darzentas et al. Problem Complexity and Method Efficiency in Optimization. 1983.
[7] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). 1983.
[8] Mihalis Yannakakis et al. How easy is local search? 26th Annual Symposium on Foundations of Computer Science (FOCS), 1985.
[9] Geoffrey E. Hinton et al. Learning representations by back-propagating errors. Nature, 1986.
[10] Jean-Francois Cardoso et al. Source separation using higher order moments. International Conference on Acoustics, Speech, and Signal Processing, 1989.
[11] Kurt Hornik et al. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 1989.
[12] S. Mitter et al. Recursive stochastic algorithms for global optimization in R^d. 1991.
[13] P. Pardalos. Complexity in numerical optimization. 1993.
[14] Saad et al. On-line learning in soft committee machines. Physical Review E, 1995.
[15] Alan M. Frieze et al. Learning linear transformations. Proceedings of the 37th Conference on Foundations of Computer Science (FOCS), 1996.
[16] David J. Field et al. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 1997.
[17] Magnus Rattray et al. Natural gradient descent for on-line learning. 1998.
[18] M. A. Hanson. Invexity and the Kuhn–Tucker theorem. 1999.
[19] Aapo Hyvärinen. Fast ICA for noisy data using Gaussian moments. ISCAS, 1999.
[20] Krzysztof C. Kiwiel et al. Convergence and efficiency of subgradient methods for quasiconvex minimization. Mathematical Programming, 2001.
[21] Tamara G. Kolda et al. Orthogonal Tensor Decompositions. SIAM Journal on Matrix Analysis and Applications, 2000.
[22] D. K. Smith et al. Numerical Optimization. Journal of the Operational Research Society, 2001.
[23] Hyeyoung Park et al. On-Line Learning Theory of Soft Committee Machines with Correlated Hidden Units: Steepest Gradient Descent and Natural Gradient Descent. cond-mat/0212006, 2002.
[24] E. Hairer et al. Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations. 2004.
[25] Stephen P. Boyd et al. Convex Optimization. Cambridge University Press, 2004.
[26] Yurii Nesterov et al. Cubic regularization of Newton method and its global performance. Mathematical Programming, 2006.
[27] Yan V. Fyodorov et al. Replica Symmetry Breaking Condition Exposed by Random Matrix Calculation of Landscape Complexity. cond-mat/0702601, 2007.
[28] A. Bray et al. Statistics of critical points of Gaussian fields on large-dimensional spaces. Physical Review Letters, 2006.
[29] Yoshua Bengio et al. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2007.
[30] P. Comon et al. Tensor decompositions, alternating least squares and other tales. 2009.
[31] Marc Teboulle et al. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2009.
[32] Ohad Shamir et al. Stochastic Convex Optimization. COLT, 2009.
[33] Vladimír Lacko et al. On decompositional algorithms for uniform sampling from n-spheres and n-balls. Journal of Multivariate Analysis, 2010.
[34] Nicholas I. M. Gould et al. On the Complexity of Steepest Descent, Newton's and Regularized Newton's Methods for Nonconvex Unconstrained Optimization Problems. SIAM Journal on Optimization, 2010.
[35] Martin J. Wainwright et al. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. NIPS, 2010.
[36] Léon Bottou et al. Large-Scale Machine Learning with Stochastic Gradient Descent. COMPSTAT, 2010.
[37] Emmanuel J. Candès et al. Tight Oracle Inequalities for Low-Rank Matrix Recovery From a Minimal Number of Noisy Random Measurements. IEEE Transactions on Information Theory, 2011.
[38] Yurii Nesterov et al. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 2012.
[39] Seungjin Choi et al. Independent Component Analysis. Handbook of Natural Computing, 2009.
[40] Ohad Shamir et al. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization. ICML, 2011.
[41] Anima Anandkumar et al. Fast Detection of Overlapping Communities via Online Tensor Methods on GPUs. arXiv, 2013.
[42] Sham M. Kakade et al. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. ITCS, 2013.
[43] Yin Tat Lee et al. Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems. IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), 2013.
[44] Ryan P. Adams et al. Contrastive Learning Using Spectral Methods. NIPS, 2013.
[45] Prateek Jain et al. Low-rank matrix completion using alternating minimization. STOC, 2013.
[46] M. Betancourt et al. The Geometric Foundations of Hamiltonian Monte Carlo. arXiv:1410.5110, 2014.
[47] Yann LeCun et al. The Loss Surface of Multilayer Networks. arXiv, 2014.
[48] Surya Ganguli et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS, 2014.
[49] Surya Ganguli et al. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR, 2013.
[50] Anima Anandkumar et al. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 2012.
[51] A. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. arXiv:1412.7392, 2014.
[52] Prateek Jain et al. Computing Matrix Squareroot via Non Convex Local Search. arXiv, 2015.
[53] Mohit Singh et al. A geometric alternative to Nesterov's accelerated gradient descent. arXiv, 2015.
[54] Sébastien Bubeck et al. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 2014.
[55] Sanjeev Arora et al. Provable ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures and Autoencoders. Algorithmica, 2012.
[56] Alexandre M. Bayen et al. Accelerated Mirror Descent in Continuous and Discrete Time. NIPS, 2015.
[57] Zhi-Quan Luo et al. Guaranteed Matrix Completion via Non-Convex Factorization. IEEE Transactions on Information Theory, 2014.
[58] Prateek Jain et al. Phase Retrieval Using Alternating Minimization. IEEE Transactions on Signal Processing, 2013.
[59] Furong Huang et al. Escaping From Saddle Points: Online Stochastic Gradient for Tensor Decomposition. COLT, 2015.
[60] Xiaodong Li et al. Phase Retrieval via Wirtinger Flow: Theory and Algorithms. IEEE Transactions on Information Theory, 2014.
[61] John Wright et al. A Geometric Analysis of Phase Retrieval. IEEE International Symposium on Information Theory (ISIT), 2016.
[62] Kenji Kawaguchi. Deep Learning without Poor Local Minima. NIPS, 2016.
[63] Mark W. Schmidt et al. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition. ECML/PKDD, 2016.
[64] Saeed Ghadimi et al. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 2013.
[65] Nicolas Boumal et al. The non-convex Burer-Monteiro approach works on smooth semidefinite programs. NIPS, 2016.
[66] John D. Lafferty et al. Convergence Analysis for Rectangular Matrix Completion Using Burer-Monteiro Factorization and Gradient Descent. arXiv, 2016.
[67] Nathan Srebro et al. Global Optimality of Local Search for Low Rank Matrix Recovery. NIPS, 2016.
[68] Stephen P. Boyd et al. A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights. Journal of Machine Learning Research, 2014.
[69] Andre Wibisono et al. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 2016.
[70] Nicolas Boumal et al. On the low-rank approach for semidefinite programs arising in synchronization and community detection. COLT, 2016.
[71] Yair Carmon et al. Accelerated Methods for Non-Convex Optimization. SIAM Journal on Optimization, 2016.
[72] Kfir Y. Levy. The Power of Normalization: Faster Evasion of Saddle Points. arXiv, 2016.
[73] Yair Carmon et al. Gradient Descent Efficiently Finds the Cubic-Regularized Non-Convex Newton Step. arXiv, 2016.
[74] É. Moulines et al. Sampling from a strongly log-concave distribution with the Unadjusted Langevin Algorithm. 2016.
[75] Michael I. Jordan et al. Gradient Descent Only Converges to Minimizers. COLT, 2016.
[76] Tengyu Ma et al. Matrix Completion has No Spurious Local Minimum. NIPS, 2016.
[77] Zeyuan Allen-Zhu et al. Variance Reduction for Faster Non-Convex Optimization. ICML, 2016.
[78] Alexander J. Smola et al. Stochastic Variance Reduction for Nonconvex Optimization. ICML, 2016.
[79] Michael I. Jordan et al. A Lyapunov Analysis of Momentum Methods in Optimization. arXiv, 2016.
[80] Tong Zhang et al. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 2013.
[81] Michael I. Jordan et al. Gradient Descent Can Take Exponential Time to Escape Saddle Points. NIPS, 2017.
[82] Yi Zheng et al. No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis. ICML, 2017.
[83] John Wright et al. Complete Dictionary Recovery Over the Sphere I: Overview and the Geometric Picture. IEEE Transactions on Information Theory, 2015.
[84] Yair Carmon et al. "Convex Until Proven Guilty": Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions. ICML, 2017.
[85] Prateek Jain et al. Accelerating Stochastic Gradient Descent. arXiv, 2017.
[86] Michael O'Neill et al. Behavior of Accelerated Gradient Methods Near Critical Points of Nonconvex Problems. 2017.
[87] Zeyuan Allen-Zhu et al. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. ITCS, 2014.
[88] Andrea Montanari et al. Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality. COLT, 2017.
[89] Anastasios Kyrillidis et al. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach. AISTATS, 2016.
[90] Michael I. Jordan et al. How to Escape Saddle Points Efficiently. ICML, 2017.
[91] Tengyu Ma et al. Finding approximate local minima faster than gradient descent. STOC, 2016.
[92] Michael I. Jordan et al. Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent. COLT, 2017.
[93] Michael I. Jordan et al. Underdamped Langevin MCMC: A non-asymptotic analysis. COLT, 2017.
[94] Stephen J. Wright et al. Complexity Analysis of Second-Order Line-Search Algorithms for Smooth Nonconvex Optimization. SIAM Journal on Optimization, 2017.
[95] Michael I. Jordan et al. On Symplectic Optimization. arXiv:1802.03653, 2018.
[96] Yurii Nesterov et al. Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, 2015.
[97] Huan Li et al. Provable accelerated gradient method for nonconvex low rank optimization. Machine Learning, 2017.
[98] Geelon So. Tensor Decompositions. Matrix and Tensor Decompositions in Signal Processing, 2021.