Analysis of biased stochastic gradient descent using sequential semidefinite programs

We present a convergence rate analysis for biased stochastic gradient descent (SGD), in which individual gradient updates are corrupted by computation errors. We develop stochastic quadratic constraints that lead to a small linear matrix inequality (LMI) whose feasible points yield convergence bounds for biased SGD. Building on this LMI condition, we propose a sequential minimization approach to analyze the intricate trade-offs that couple stepsize selection, convergence rate, optimization accuracy, and robustness to gradient inaccuracy. We also construct feasible points for this LMI and obtain analytical formulas that quantify the convergence properties of biased SGD under various assumptions on the loss functions.
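
To make the LMI-based rate analysis concrete, the sketch below bisects over a candidate contraction rate and checks feasibility of a small LMI with CVXPY. It is a minimal illustrative analogue, not the paper's biased-SGD formulation: it certifies a rate for deterministic gradient descent on an m-strongly convex, L-smooth function using a single sector-type quadratic constraint on the gradient, and omits the bias and variance terms that the abstract's stochastic quadratic constraints introduce. The function names and the normalization p >= 1 are assumptions for illustration only.

```python
# A minimal sketch, NOT the paper's exact formulation: certify a contraction
# rate rho for deterministic gradient descent x+ = x - alpha*grad f(x) on an
# m-strongly convex, L-smooth f via a 2x2 LMI (Lyapunov decrease combined,
# through an S-procedure multiplier, with a sector-type quadratic constraint
# on grad f). The biased-SGD LMI described in the abstract has the same
# structure, with extra terms for the gradient-error statistics (omitted here).
import numpy as np
import cvxpy as cp

def rate_lmi_feasible(rho, alpha, m, L):
    """Return True if the LMI certifying contraction rate rho is feasible."""
    p = cp.Variable(nonneg=True)    # Lyapunov weight: V(x) = p * ||x - x*||^2
    lam = cp.Variable(nonneg=True)  # S-procedure multiplier
    A, B = 1.0, -alpha              # error dynamics: e_{k+1} = A*e_k + B*u_k, with u_k = grad f(x_k)
    # Lyapunov decrease V(e_{k+1}) <= rho^2 * V(e_k), written as a quadratic form in (e, u)
    lyap = cp.bmat([[A * A * p - rho**2 * p, A * B * p],
                    [A * B * p,              B * B * p]])
    # Gradient sector constraint (u - m*e)^T (u - L*e) <= 0, i.e. [e; u]^T M [e; u] >= 0
    M = np.array([[-m * L,        (m + L) / 2.0],
                  [(m + L) / 2.0, -1.0]])
    prob = cp.Problem(cp.Minimize(0), [lyap + lam * M << 0, p >= 1.0])
    prob.solve(solver=cp.SCS)
    return prob.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)

def smallest_certified_rate(alpha, m, L, tol=1e-4):
    """Bisect over rho in (0, 1); the smallest feasible rho is the certified rate."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rate_lmi_feasible(mid, alpha, m, L):
            hi = mid
        else:
            lo = mid
    return hi

if __name__ == "__main__":
    m, L = 1.0, 10.0
    print(smallest_certified_rate(2.0 / (m + L), m, L))  # approx (L - m)/(L + m) = 0.818...
    print(smallest_certified_rate(1.0 / L, m, L))        # approx 1 - m/L = 0.9
```

In this simplified setting the bisection recovers the classical gradient-descent rates noted in the comments. The sequential minimization described in the abstract plays an analogous role, but searches over feasible points of the biased-SGD LMI to trade off rate, accuracy, and robustness to gradient error rather than rate alone.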
