Accelerating Stochastic Gradient Descent

There is widespread sentiment that fast gradient methods (e.g., Nesterov’s acceleration, conjugate gradient, heavy ball) are ineffective for stochastic optimization due to their instability and error accumulation. Numerous works have attempted to quantify these instabilities in the face of either statistical or non-statistical errors (Paige, 1971; Proakis, 1974; Polyak, 1987; Greenbaum, 1989; Roy and Shynk, 1990; Sharma et al., 1998; d’Aspremont, 2008; Devolder et al., 2014; Yuan et al., 2016). This work considers these issues for the special case of stochastic approximation for least squares regression, and our main result refutes the conventional wisdom by showing that acceleration can be made robust to statistical errors. In particular, we introduce an accelerated stochastic gradient method that provably achieves the minimax optimal statistical risk faster than stochastic gradient descent. Critical to the analysis is a sharp characterization of accelerated stochastic gradient descent as a stochastic process. We hope this characterization offers insight into the broader question of designing simple, effective accelerated stochastic methods for more general convex and non-convex optimization problems.
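To make the setting concrete, the sketch below shows one standard way momentum can be folded into single-sample least-squares updates. It is the familiar Nesterov look-ahead form, not the specific accelerated scheme introduced and analyzed in the paper; the function name, the step size `eta`, the momentum parameter `beta`, and the toy Gaussian data model are all illustrative assumptions.

```python
import numpy as np

def accelerated_sgd_least_squares(stream, dim, n_steps, eta=0.01, beta=0.9):
    """Nesterov-style momentum SGD on a stream of least-squares samples.

    A generic sketch, not the paper's exact algorithm: each step draws
    one (x, y) pair, evaluates the single-sample squared-loss gradient
    at a momentum look-ahead point, and updates the iterate.
    """
    w = np.zeros(dim)  # current iterate
    v = np.zeros(dim)  # velocity (momentum) term
    for _ in range(n_steps):
        x, y = next(stream)
        lookahead = w + beta * v
        # Gradient of 0.5 * (<w, x> - y)^2 evaluated at the look-ahead point.
        grad = (lookahead @ x - y) * x
        v = beta * v - eta * grad
        w = w + v
    return w

# Toy usage: a noisy linear model y = <w*, x> + 0.1 * noise.
rng = np.random.default_rng(0)
dim = 10
w_star = rng.normal(size=dim)

def sample_stream():
    while True:
        x = rng.normal(size=dim)
        yield x, x @ w_star + 0.1 * rng.normal()

w_hat = accelerated_sgd_least_squares(sample_stream(), dim, n_steps=20000)
print("parameter error:", np.linalg.norm(w_hat - w_star))
```

The look-ahead gradient is what distinguishes the Nesterov form from the plain heavy-ball update, which evaluates the gradient at the current iterate `w` itself; the instabilities cited above concern precisely how such momentum recursions propagate the noise in these single-sample gradients.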

[1] E. L. Lehmann, et al. Theory of point estimation, 1950.

[2] Boris Polyak. Some methods of speeding up the convergence of iteration methods, 1964.

[3] Christopher C. Paige, et al. The computation of eigenvalues and eigenvectors of very large sparse matrices, 1971.

[4] D. Anbar. On Optimal Estimation Methods Using Stochastic Approximation Procedures, 1973.

[5] V. Fabian. Asymptotically Efficient Stochastic Approximation; The RM Case, 1973.

[6] J. Proakis, et al. Channel identification for high speed digital communications, 1974.

[7] G. Pflug. Stochastic Approximation Methods for Constrained and Unconstrained Systems (Kushner, H. J.; Clark, D. S.), 1980.

[8] John Darzentas, et al. Problem Complexity and Method Efficiency in Optimization, 1983.

[9] S. Thomas Alexander, et al. Adaptive Signal Processing, 1986, Texts and Monographs in Computer Science.

[10] D. Ruppert, et al. Efficient Estimations from a Slowly Convergent Robbins-Monro Process, 1988.

[11] A. Greenbaum. Behavior of slightly perturbed Lanczos and conjugate-gradient recurrences, 1989.

[12] John J. Shynk, et al. Analysis of the momentum LMS algorithm, 1990, IEEE Trans. Acoust. Speech Signal Process.

[13] Boris Polyak, et al. Acceleration of stochastic approximation by averaging, 1992.

[14] O. Nelles, et al. An Introduction to Optimization, 1996, IEEE Antennas and Propagation Magazine.

[15] William A. Sethares, et al. Analysis of momentum adaptive filtering algorithms, 1998, IEEE Trans. Signal Process.

[16] H. Kushner, et al. Stochastic Approximation and Recursive Algorithms and Applications, 2003.

[17] Yurii Nesterov. Introductory Lectures on Convex Optimization - A Basic Course, 2014, Applied Optimization.

[18] H. Robbins. A Stochastic Approximation Method, 1951.

[19] Léon Bottou, et al. The Tradeoffs of Large Scale Learning, 2007, NIPS.

[20] Alexandre d'Aspremont. Smooth Optimization with Approximate Gradient, 2005, SIAM J. Optim.

[21] James T. Kwok, et al. Accelerated Gradient Methods for Stochastic Optimization and Online Learning, 2009, NIPS.

[22] Eric Moulines, et al. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning, 2011, NIPS.

[23] Maxim Raginsky, et al. Information-Based Complexity, Feedback and Dynamics in Convex Programming, 2010, IEEE Transactions on Information Theory.

[24] Yurii Nesterov. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems, 2012, SIAM J. Optim.

[25] Guanghui Lan. An optimal method for stochastic composite optimization, 2011, Mathematical Programming.

[26] Saeed Ghadimi, et al. Optimal Stochastic Approximation Algorithms for Strongly Convex Stochastic Composite Optimization I: A Generic Algorithmic Framework, 2012, SIAM J. Optim.

[27] Sham M. Kakade, et al. Random Design Analysis of Ridge Regression, 2012, COLT.

[28] Martin J. Wainwright, et al. Information-Theoretic Lower Bounds on the Oracle Complexity of Stochastic Convex Optimization, 2010, IEEE Transactions on Information Theory.

[29] Eric Moulines, et al. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n), 2013, NIPS.

[30] Saeed Ghadimi, et al. Optimal Stochastic Approximation Algorithms for Strongly Convex Stochastic Composite Optimization, II: Shrinking Procedures and Optimal Algorithms, 2013, SIAM J. Optim.

[31] Shai Shalev-Shwartz, et al. Accelerated Mini-Batch Stochastic Dual Coordinate Ascent, 2013, NIPS.

[32] Deanna Needell, et al. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm, 2013, Mathematical Programming.

[33] Jonathan D. Rosenblatt, et al. On the Optimality of Averaging in Distributed Statistical Learning, 2014, arXiv:1407.2724.

[34] F. Bach, et al. Non-parametric Stochastic Approximation with Large Step Sizes, 2014, arXiv:1408.0361.

[35] Francis R. Bach, et al. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression, 2013, J. Mach. Learn. Res.

[36] Yurii Nesterov, et al. First-order methods of smooth convex optimization with inexact oracle, 2013, Mathematical Programming.

[37] Francis R. Bach, et al. Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions, 2015, AISTATS.

[38] Sham M. Kakade, et al. Competing with the Empirical Risk Minimizer in a Single Pass, 2014, COLT.

[39] Sham M. Kakade, et al. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization, 2015, ICML.

[40] Zaïd Harchaoui, et al. A Universal Catalyst for First-Order Optimization, 2015, NIPS.

[41] Nathan Srebro, et al. Tight Complexity Bounds for Optimizing Composite Objectives, 2016, NIPS.

[42] Ali H. Sayed, et al. On the influence of momentum acceleration on online learning, 2016, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[43] Prateek Jain, et al. Parallelizing Stochastic Approximation Through Mini-Batching and Tail-Averaging, 2016, arXiv.

[44] Michael I. Jordan, et al. A Lyapunov Analysis of Momentum Methods in Optimization, 2016, arXiv.

[45] Zeyuan Allen-Zhu. Katyusha: the first direct acceleration of stochastic gradient methods, 2017, STOC.