Global convergence rate of incremental aggregated gradient methods for nonsmooth problems

We analyze the proximal incremental aggregated gradient (PIAG) method for minimizing the sum of a large number of smooth component functions, f(x) = Σ_{i=1}^{m} f_i(x), and a (possibly nonsmooth) convex regularization function r(x). Such composite optimization problems arise in a variety of machine learning applications, including regularized regression and constrained distributed optimization over sensor networks. The PIAG method forms an approximate gradient of f(x) by aggregating component gradients evaluated at outdated iterates whose delays are bounded by a finite window K, and then applies the proximal operator of the regularization function r(x) to the intermediate iterate obtained by stepping along this approximate gradient. Under the assumptions that f(x) is strongly convex and each f_i(x) is smooth with a Lipschitz-continuous gradient, we establish the first linear convergence rate result for the PIAG method and provide explicit convergence rate estimates that highlight the dependence on the condition number of the problem and on the size of the window K over which outdated component gradients are used.
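
For concreteness, the following is a minimal Python sketch of a PIAG-style loop for an l1-regularized least-squares instance, i.e., f_i(x) = 0.5*||A_i x - b_i||^2 and r(x) = lam*||x||_1. The cyclic component order (which keeps every stored gradient at most m-1 updates old), the soft-thresholding proximal step, and all function and parameter names are illustrative assumptions for this example, not the implementation or step-size rule analyzed in the paper.

    import numpy as np

    def soft_threshold(z, t):
        """Proximal operator of t * ||x||_1 (the example regularizer r)."""
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def piag_l1_least_squares(A_blocks, b_blocks, lam, step, iters, x0):
        """PIAG-style sketch for f(x) = sum_i 0.5*||A_i x - b_i||^2 + lam*||x||_1."""
        m = len(A_blocks)
        x = x0.copy()
        # Component gradients stored at (possibly outdated) iterates.
        grads = [A.T @ (A @ x - b) for A, b in zip(A_blocks, b_blocks)]
        agg = sum(grads)                      # aggregated (approximate) full gradient
        for k in range(iters):
            i = k % m                         # cyclic component selection: delays <= m-1
            new_grad = A_blocks[i].T @ (A_blocks[i] @ x - b_blocks[i])
            agg += new_grad - grads[i]        # refresh the aggregate incrementally
            grads[i] = new_grad
            # Gradient step along the aggregated gradient, then prox w.r.t. r.
            x = soft_threshold(x - step * agg, step * lam)
        return x

Only one component gradient is recomputed per iteration, so the per-iteration cost is independent of m, which is the point of incremental aggregated schemes.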
