Optimal Decentralized Distributed Algorithms for Stochastic Convex Optimization.

We consider stochastic convex optimization problems with affine constraints and develop several methods that solve them via either a primal or a dual approach. In the primal case, we use a penalization technique that makes the initial problem more amenable to first-order optimization methods. We propose algorithms based on the Similar Triangles Method with Inexact Proximal Step for convex smooth and strongly convex smooth objective functions, and methods based on the Gradient Sliding algorithm for the same problems in the non-smooth case. We prove convergence guarantees in the smooth convex case with a deterministic first-order oracle. We then propose and analyze three novel methods for stochastic convex optimization problems with affine constraints: SPDSTM, R-RRMA-AC-SA$^2$ and SSTM_sc; all three use a stochastic dual oracle. SPDSTM is a stochastic primal-dual modification of STM applied to the dual problem when the primal functional is strongly convex and Lipschitz continuous on some ball; we extend the result of Dvinskikh & Gasnikov (2019) for this method to the case when only a biased stochastic oracle is available. R-RRMA-AC-SA$^2$ is an accelerated stochastic method based on restarts of RRMA-AC-SA$^2$ from Foster et al. (2019), and SSTM_sc is a stochastic version of STM for strongly convex problems; both are applied to the dual problem when the primal functional is strongly convex, smooth and Lipschitz continuous on some ball, and both use a stochastic dual first-order oracle. We develop a convergence analysis for these methods under unbiased and biased oracles, respectively. Finally, we apply all of the aforementioned results and approaches to the decentralized distributed optimization problem and discuss the optimality of the obtained results in terms of the number of communication rounds and the number of oracle calls per node.
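For context, the problem class and the two approaches mentioned above can be summarized by the following standard formulations. This is only an illustrative sketch in our own notation (objective $f$, feasible set $Q$, constraint matrix $A$, penalty coefficient $R$, strong convexity parameter $\mu$, local objectives $f_i$, communication matrix $W$); the precise constants, scalings and constraint form used in the paper may differ.

% Illustrative sketch only (our notation, not the paper's exact statement).
% Affine-constrained stochastic convex problem:
\[
  \min_{x \in Q} \; f(x) := \mathbb{E}_{\xi}\bigl[f(x,\xi)\bigr]
  \quad \text{s.t.} \quad A x = 0 .
\]
% Primal approach: trade the constraint for a penalty, e.g. an exact
% non-smooth penalty with a sufficiently large coefficient R > 0:
\[
  \min_{x \in Q} \; f(x) + R\,\|A x\|_2 .
\]
% Dual approach: maximize the Lagrange dual; if f is \mu-strongly convex,
% then \psi has a Lipschitz gradient with constant \lambda_{\max}(A A^\top)/\mu,
% which is the property the dual methods (SPDSTM, R-RRMA-AC-SA^2, SSTM_sc) exploit:
\[
  \max_{y} \; \psi(y), \qquad
  \psi(y) := \min_{x \in Q} \bigl( f(x) + \langle y, A x \rangle \bigr).
\]
% Decentralized reformulation: m nodes with local objectives f_i communicate over a
% graph with a Laplacian-type matrix W; consensus x_1 = ... = x_m is encoded as the
% affine constraint \sqrt{W}\,\mathbf{x} = 0 on the stacked vector \mathbf{x}
% (with W acting blockwise):
\[
  \min_{\sqrt{W}\,\mathbf{x} = 0} \; \sum_{i=1}^{m} f_i(x_i),
  \qquad \mathbf{x} = (x_1^\top,\dots,x_m^\top)^\top .
\]

In this sketch the decentralized problem is an instance of the affine-constrained problem above with $A = \sqrt{W}$, and one multiplication of the stacked vector by $W$ amounts to one round of communication between neighbouring nodes; this is why the complexity of such methods is naturally measured in communication rounds and in the number of stochastic oracle calls per node.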

[1]  Alexander Gasnikov,et al.  Projected Gradient Method for Decentralized Optimization over Time-Varying Networks , 2019, 1911.08527.

[2]  R. Tyrrell Rockafellar,et al.  Convex Analysis , 1970, Princeton Landmarks in Mathematics and Physics.

[3]  A. V. Gasnikov,et al.  Universal Method for Stochastic Composite Optimization Problems , 2018 .

[4]  Angelia Nedi'c,et al.  Optimal Distributed Convex Optimization on Slowly Time-Varying Graphs , 2018, IEEE Transactions on Control of Network Systems.

[5]  Sebastian U. Stich,et al.  Stochastic Distributed Learning with Gradient Quantization and Variance Reduction , 2019, 1904.05115.

[6]  Saeed Ghadimi,et al.  Optimal Stochastic Approximation Algorithms for Strongly Convex Stochastic Composite Optimization I: A Generic Algorithmic Framework , 2012, SIAM J. Optim..

[7]  Ying Sun,et al.  Accelerated Primal-Dual Algorithms for Distributed Smooth Convex Optimization over Networks , 2020, AISTATS.

[8]  Alexander Shapiro,et al.  Lectures on Stochastic Programming: Modeling and Theory , 2009 .

[9]  Alexander Gasnikov,et al.  Universal fast gradient method for stochastic composite optimization problems , 2016 .

[10]  Yurii Nesterov,et al.  Introductory Lectures on Convex Optimization - A Basic Course , 2014, Applied Optimization.

[11]  S. Kakade,et al.  On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization , 2009 .

[12]  Umut Simsekli,et al.  Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks , 2019, ArXiv.

[13]  Peter Richtárik,et al.  Distributed Learning with Compressed Gradient Differences , 2019, ArXiv.

[14]  V. Spokoiny Parametric estimation. Finite sample theory , 2011, 1111.3029.

[15]  Michael I. Jordan,et al.  A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm , 2019, ArXiv.

[16]  Konstantin Mishchenko,et al.  Tighter Theory for Local SGD on Identical and Heterogeneous Data , 2020, AISTATS.

[17]  Zeyuan Allen-Zhu,et al.  How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD , 2018, NeurIPS.

[18]  Aharon Ben-Tal,et al.  Lectures on modern convex optimization , 1987 .

[19]  Saeed Ghadimi,et al.  Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming , 2013, SIAM J. Optim..

[20]  Ioannis Ch. Paschalidis,et al.  Asymptotic Network Independence in Distributed Optimization for Machine Learning , 2019, ArXiv.

[21]  P. Rigollet,et al.  Entropic optimal transport is maximum-likelihood deconvolution , 2018, Comptes Rendus Mathematique.

[22]  Yurii Nesterov,et al.  Smooth minimization of non-smooth functions , 2005, Math. Program..

[23]  Rong Jin,et al.  On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization , 2019, ICML.

[24]  Laurent Massoulié,et al.  Optimal Algorithms for Smooth and Strongly Convex Distributed Optimization in Networks , 2017, ICML.

[25]  Jan Vondrák,et al.  High probability generalization bounds for uniformly stable algorithms with nearly optimal rate , 2019, COLT.

[26]  Hadrien Hendrikx,et al.  Accelerated Decentralized Optimization with Local Updates for Smooth and Strongly Convex Objectives , 2018, AISTATS.

[27]  J. Slack How to make the gradient , 1994, Nature.

[28]  Mike Davies,et al.  The Practicality of Stochastic Optimization in Imaging Inverse Problems , 2020, IEEE Transactions on Computational Imaging.

[29]  Marco Canini,et al.  Natural Compression for Distributed Deep Learning , 2019, MSML.

[30]  Julien Mairal,et al.  A Generic Acceleration Framework for Stochastic Composite Optimization , 2019, NeurIPS.

[31]  Angelia Nedic,et al.  Optimal Algorithms for Distributed Optimization , 2017, ArXiv.

[32]  Mark W. Schmidt,et al.  Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron , 2018, AISTATS.

[33]  Alexander Gasnikov,et al.  Gradient Methods for Problems with Inexact Model of the Objective , 2019, MOTOR.

[34]  P. Dvurechensky,et al.  Dual approaches to the minimization of strongly convex functionals with a simple structure under affine constraints , 2017 .

[35]  Eduard A. Gorbunov,et al.  An Accelerated Method for Derivative-Free Smooth Stochastic Convex Optimization , 2018, SIAM J. Optim..

[36]  Ioannis Ch. Paschalidis,et al.  Asymptotic Network Independence in Distributed Stochastic Optimization for Machine Learning: Examining Distributed and Centralized Stochastic Gradient Descent , 2020, IEEE Signal Processing Magazine.

[37]  Suhas Diggavi,et al.  Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations , 2019, IEEE Journal on Selected Areas in Information Theory.

[38]  Xiaorui Liu,et al.  A Double Residual Compression Algorithm for Efficient Distributed Learning , 2019, AISTATS.

[39]  Ioannis Ch. Paschalidis,et al.  A Non-Asymptotic Analysis of Network Independence for Distributed Stochastic Gradient Descent , 2019, ArXiv.

[40]  Yurii Nesterov,et al.  Primal-dual subgradient methods for convex problems , 2005, Math. Program..

[41]  Alexander Gasnikov,et al.  Randomized Similar Triangles Method: A Unifying Framework for Accelerated Randomized Optimization Methods (Coordinate Descent, Directional Search, Derivative-Free Method) , 2017, ArXiv.

[42]  Tong Zhang,et al.  Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization , 2013, Mathematical Programming.

[43]  Guanghui Lan,et al.  Algorithms for stochastic optimization with expectation constraints , 2016, 1604.03887.

[44]  Martin Jaggi,et al.  Error Feedback Fixes SignSGD and other Gradient Compression Schemes , 2019, ICML.

[45]  A. Juditsky,et al.  Deterministic and Stochastic Primal-Dual Subgradient Algorithms for Uniformly Convex Minimization , 2014 .

[46]  Peter Richtárik,et al.  First Analysis of Local GD on Heterogeneous Data , 2019, ArXiv.

[47]  Ohad Shamir,et al.  Communication Complexity of Distributed Convex Learning and Optimization , 2015, NIPS.

[48]  Sebastian U. Stich,et al.  Local SGD Converges Fast and Communicates Little , 2018, ICLR.

[49]  Yi Zhou,et al.  Random gradient extrapolation for distributed and stochastic optimization , 2017, SIAM J. Optim..

[50]  Gabriel Peyré,et al.  Computational Optimal Transport , 2018, Found. Trends Mach. Learn..

[51]  Alexander Gasnikov,et al.  On Primal-Dual Approach for Distributed Stochastic Convex Optimization over Networks , 2019 .

[52]  Yurii Nesterov,et al.  Double Smoothing Technique for Large-Scale Linearly Constrained Convex Optimization , 2012, SIAM J. Optim..

[53]  Zeyuan Allen-Zhu,et al.  Katyusha: the first direct acceleration of stochastic gradient methods , 2017, STOC.

[54]  Kaiwen Zhou,et al.  Direct Acceleration of SAGA using Sampled Negative Momentum , 2018, AISTATS.

[55]  Alexander Gasnikov,et al.  Stochastic Intermediate Gradient Method for Convex Problems with Stochastic Inexact Oracle , 2016, Journal of Optimization Theory and Applications.

[56]  Cong Xu,et al.  TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning , 2017, NIPS.

[57]  Y. Nesterov,et al.  First-order methods with inexact oracle: the strongly convex case , 2013 .

[58]  Darina Dvinskikh,et al.  On the Complexity of Approximating Wasserstein Barycenter , 2019, ArXiv.

[59]  Alexey Chernov,et al.  Fast Primal-Dual Gradient Method for Strongly Convex Minimization Problems with Linear Constraints , 2016, DOOR.

[60]  Darina Dvinskikh,et al.  Decentralize and Randomize: Faster Algorithm for Wasserstein Barycenters , 2018, NeurIPS.

[61]  Anastasia A. Lagunovskaya,et al.  Gradient-free prox-methods with inexact oracle for stochastic convex optimization problems on a simplex , 2014, 1412.3890.

[62]  Peter Richtárik,et al.  SGD: General Analysis and Improved Rates , 2019, ICML 2019.

[63]  Francis Bach,et al.  SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives , 2014, NIPS.

[64]  Dan Alistarh,et al.  QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks , 2016, 1610.02132.

[65]  Yi Zhou,et al.  Communication-efficient algorithms for decentralized and stochastic optimization , 2017, Mathematical Programming.

[66]  Julien Mairal,et al.  Estimate Sequences for Stochastic Composite Optimization: Variance Reduction, Acceleration, and Robustness to Noise , 2019, J. Mach. Learn. Res..

[67]  Kilian Q. Weinberger,et al.  Optimal Convergence Rates for Convex Distributed Optimization in Networks , 2019, J. Mach. Learn. Res..

[68]  Ohad Shamir,et al.  Stochastic Convex Optimization , 2009, COLT.

[69]  Laurent Massoulié,et al.  Optimal Algorithms for Non-Smooth Distributed Optimization in Networks , 2018, NeurIPS.

[70]  Darina Dvinskikh,et al.  SA vs SAA for population Wasserstein barycenter calculation , 2020, ArXiv.

[71]  Tong Zhang,et al.  Accelerating Stochastic Gradient Descent using Predictive Variance Reduction , 2013, NIPS.

[72]  H. Robbins A Stochastic Approximation Method , 1951 .

[73]  A. Juditsky,et al.  First-Order Methods for Nonsmooth Convex Large-Scale Optimization, I: General Purpose Methods , 2010 .

[74]  Peter Richtárik,et al.  A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent , 2019, AISTATS.

[75]  Gabriel Peyré,et al.  A Smoothed Dual Approach for Variational Wasserstein Problems , 2015, SIAM J. Imaging Sci..

[76]  Olivier Devolder,et al.  Exactness, inexactness and stochasticity in first-order methods for large-scale convex optimization , 2013 .

[77]  Asuman E. Ozdaglar,et al.  A Universally Optimal Multistage Accelerated Stochastic Gradient Method , 2019, NeurIPS.

[78]  S. Guminov,et al.  Accelerated Alternating Minimization, Accelerated Sinkhorn's Algorithm and Accelerated Iterative Bregman Projections , 2019 .

[79]  Alexander Shapiro,et al.  Stochastic Approximation approach to Stochastic Programming , 2013 .

[80]  A. Gasnikov,et al.  Decentralized and Parallelized Primal and Dual Accelerated Methods for Stochastic Convex Programming Problems , 2019, 1904.09015.

[81]  Peter Richtárik,et al.  Better Communication Complexity for Local SGD , 2019, ArXiv.

[82]  Julien Mairal,et al.  Estimate Sequences for Variance-Reduced Stochastic Composite Optimization , 2019, ICML.

[83]  John N. Tsitsiklis,et al.  Parallel and distributed computation , 1989 .

[84]  Alexander V. Gasnikov,et al.  Gradient-free proximal methods with inexact oracle for convex stochastic nonsmooth optimization problems on the simplex , 2016, Automation and Remote Control.

[85]  Yurii Nesterov,et al.  Lectures on Convex Optimization , 2018 .

[86]  Rosemary Park,et al.  In Short , 2000 .

[87]  Peter Richtárik,et al.  SGD and Hogwild! Convergence Without the Bounded Gradients Assumption , 2018, ICML.

[88]  Guanghui Lan,et al.  Gradient sliding for composite optimization , 2014, Mathematical Programming.

[89]  Fanhua Shang,et al.  A Simple Stochastic Variance Reduced Algorithm with Fast Convergence Rates , 2018, ICML.

[90]  Guanghui Lan,et al.  An optimal method for stochastic composite optimization , 2011, Mathematical Programming.

[91]  Claudio Gentile,et al.  On the generalization ability of on-line learning algorithms , 2001, IEEE Transactions on Information Theory.

[92]  A. Juditsky,et al.  Large Deviations of Vector-valued Martingales in 2-Smooth Normed Spaces , 2008, 0809.0813.

[93]  Mark W. Schmidt,et al.  Minimizing finite sums with the stochastic average gradient , 2013, Mathematical Programming.

[94]  Ohad Shamir,et al.  The Complexity of Making the Gradient Small in Stochastic Convex Optimization , 2019, COLT.

[95]  Shai Ben-David,et al.  Understanding Machine Learning: From Theory to Algorithms , 2014 .

[96]  A. Gasnikov Universal gradient descent , 2017, 1711.00394.

[97]  Martin Jaggi,et al.  Sparsified SGD with Memory , 2018, NeurIPS.