An Optimal Algorithm for Decentralized Finite Sum Optimization

Modern large-scale finite-sum optimization relies on two key aspects: distribution and stochastic updates. For smooth and strongly convex problems, existing decentralized algorithms are slower than modern accelerated variance-reduced stochastic algorithms run on a single machine, and are therefore not efficient. Centralized algorithms are fast, but their scaling is limited by global aggregation steps that result in communication bottlenecks. In this work, we propose an efficient \textbf{A}ccelerated \textbf{D}ecentralized stochastic algorithm for \textbf{F}inite \textbf{S}ums named ADFS, which uses local stochastic proximal updates and decentralized communications between nodes. On $n$ machines, ADFS minimizes an objective of $nm$ samples in the same time it takes optimal single-machine algorithms to minimize an objective of $m$ samples. This scaling holds until a critical network size is reached, which depends on the communication delays, the number of samples $m$ per machine, and the network topology. We give a complexity lower bound showing that ADFS is optimal among decentralized algorithms. To derive ADFS, we first extend the accelerated proximal coordinate gradient algorithm to arbitrary sampling. We then apply this coordinate descent algorithm to a well-chosen dual problem based on an augmented graph approach, which yields the general ADFS algorithm. Finally, we illustrate the improvement of ADFS over state-of-the-art decentralized approaches with experiments.
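As a point of reference for the setting above, the decentralized finite-sum problem can be sketched as follows, where each of the $n$ nodes holds $m$ local sample losses; the notation $f_{i,j}$ for the loss of sample $j$ on node $i$ and the local regularization weight $\sigma_i$ are illustrative choices rather than the paper's exact definitions.

% Sketch of the decentralized finite-sum objective (illustrative notation):
% the global objective aggregates the nm samples held across the n nodes.
\begin{equation*}
  \min_{\theta \in \mathbb{R}^d} \;
  \sum_{i=1}^{n} \Big( \sum_{j=1}^{m} f_{i,j}(\theta)
  + \frac{\sigma_i}{2} \|\theta\|^2 \Big),
\end{equation*}
% Decentralized methods typically rewrite this as a consensus problem with one
% local copy \theta_i per node, constrained to agree along the edges E of the
% communication graph:
\begin{equation*}
  \min_{\theta_1, \dots, \theta_n \in \mathbb{R}^d} \;
  \sum_{i=1}^{n} \Big( \sum_{j=1}^{m} f_{i,j}(\theta_i)
  + \frac{\sigma_i}{2} \|\theta_i\|^2 \Big)
  \quad \text{subject to} \quad \theta_i = \theta_j
  \ \text{for all } (i,j) \in E.
\end{equation*}

The coordinate descent step of ADFS then acts on a dual of such a consensus formulation over an augmented graph; this description follows the abstract's outline and is a sketch rather than the paper's precise construction.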
