Performance of noisy Nesterov's accelerated method for strongly convex optimization problems

We study the performance of noisy gradient descent and Nesterov's accelerated method for strongly convex objective functions with Lipschitz continuous gradients. The steady-state second-order moment of the error in the iterates is analyzed when the gradient is perturbed by additive white noise with zero mean and identity covariance. For any given condition number $\kappa$, we derive explicit upper bounds on noise amplification that depend only on $\kappa$ and the problem size. We use quadratic objective functions to derive lower bounds and to demonstrate that the upper bounds are tight up to a constant factor. The established upper bound for Nesterov's accelerated method is larger than the upper bound for gradient descent by a factor of $\sqrt{\kappa}$. This gap identifies a fundamental tradeoff that comes with acceleration in the presence of stochastic uncertainties in the gradient evaluation.
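
To make the setup concrete, the sketch below simulates both noisy algorithms on a strongly convex quadratic and estimates the steady-state error variance empirically. This is an illustrative example only, not the paper's analysis: the step size $\alpha = 1/L$, the momentum parameter $\beta = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$, the eigenvalue choice, and all function and variable names are assumptions made for the demonstration.

```python
import numpy as np

# Illustrative sketch (assumed parameters, not the paper's analysis): compare the
# steady-state error variance of noisy gradient descent and a noisy Nesterov-type
# accelerated method on a strongly convex quadratic f(x) = 0.5 * x^T diag(lam) x,
# whose minimizer is x* = 0. The gradient is corrupted by zero-mean white noise
# with identity covariance, as in the abstract.

rng = np.random.default_rng(0)
n, m, L = 50, 1.0, 100.0                 # problem size, strong convexity, smoothness
kappa = L / m                            # condition number
lam = np.linspace(m, L, n)               # eigenvalues of the quadratic's Hessian
alpha = 1.0 / L                          # assumed step size
beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # assumed momentum parameter

def steady_state_variance(method, iters=100_000, burn_in=20_000):
    """Time-average of ||x^k - x*||^2 after a burn-in period."""
    x = x_prev = np.zeros(n)
    acc = 0.0
    for k in range(iters):
        w = rng.standard_normal(n)       # additive white gradient noise
        if method == "gd":
            x = x - alpha * (lam * x + w)
        else:                            # accelerated method with momentum
            y = x + beta * (x - x_prev)
            x_prev, x = x, y - alpha * (lam * y + w)
        if k >= burn_in:
            acc += np.dot(x, x)          # accumulate ||x^k - x*||^2 (x* = 0)
    return acc / (iters - burn_in)

print("gradient descent :", steady_state_variance("gd"))
print("accelerated      :", steady_state_variance("nesterov"))
```

With these standard parameter choices, the empirical variance of the accelerated iterates should exceed that of gradient descent by a factor that grows with $\sqrt{\kappa}$, which is consistent with the gap between the upper bounds described in the abstract.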
