Accelerated First-Order Optimization Algorithms for Machine Learning

Numerical optimization serves as one of the pillars of machine learning. To meet the demands of big-data applications, considerable effort has been devoted to designing algorithms that are fast both in theory and in practice. This article provides a comprehensive survey of accelerated first-order algorithms, with a focus on stochastic methods. Specifically, it begins by reviewing the basic accelerated algorithms for deterministic convex optimization, then concentrates on their extensions to stochastic convex optimization, and finally introduces recent developments in acceleration for nonconvex optimization.
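
As a concrete illustration of the deterministic accelerated algorithms the survey starts from, the sketch below implements Nesterov's accelerated gradient method on a smooth convex least-squares problem. The problem data, step size, and iteration budget are illustrative assumptions chosen only to make the example runnable; they are not taken from the survey itself.

```python
import numpy as np

# Minimal sketch of Nesterov's accelerated gradient method for the smooth
# convex problem min_x f(x) = 0.5 * ||A x - b||^2. The data (A, b), the
# Lipschitz-constant estimate, and the iteration count are illustrative
# assumptions.

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
b = rng.standard_normal(200)

def grad(x):
    # Gradient of f(x) = 0.5 * ||A x - b||^2
    return A.T @ (A @ x - b)

L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad f (spectral norm squared)
x = np.zeros(A.shape[1])             # primary iterate x_k
y = x.copy()                         # extrapolated point y_k
t = 1.0                              # momentum parameter t_k

for _ in range(300):
    x_next = y - grad(y) / L                              # gradient step at the extrapolated point
    t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0    # update momentum parameter
    y = x_next + ((t - 1.0) / t_next) * (x_next - x)      # extrapolation (momentum) step
    x, t = x_next, t_next

print("final objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2)
```

Compared with plain gradient descent, the only change is the extrapolation step driven by the momentum sequence t_k, which is what yields the O(1/k^2) convergence rate on smooth convex problems.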
