FedChain: Chained Algorithms for Near-optimal Communication Cost in Federated Learning

Federated learning (FL) aims to minimize the communication complexity of training a model over heterogeneous data distributed across many clients. A common approach is local update methods, where clients take multiple optimization steps over local data before communicating with the server (e.g., FedAvg). Local update methods can exploit similarity between clients’ data. However, in existing analyses, this comes at the cost of slow convergence in terms of the dependence on the number of communication rounds R. On the other hand, global update methods, where clients simply return a gradient vector in each round (e.g., SGD), converge faster in terms of R but fail to exploit the similarity between clients even when clients are homogeneous. We propose FedChain, an algorithmic framework that combines the strengths of local update methods and global update methods to achieve fast convergence in terms of R while leveraging the similarity between clients. Using FedChain, we instantiate algorithms that improve upon previously known rates in the general convex and Polyak–Łojasiewicz (PL) settings, and are near-optimal (via an algorithm-independent lower bound that we show) for problems that satisfy strong convexity. Empirical results support this theoretical gain over existing methods.

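To make the contrast concrete, the sketch below chains a local-update phase (FedAvg-style: several local gradient steps per round, then server averaging) into a global-update phase (minibatch SGD: one averaged-gradient step per round) on a synthetic least-squares problem. The even phase split, the step sizes, and the simple hand-off (passing the local phase's output directly to the global phase) are illustrative assumptions, not the paper's exact FedChain instantiation.

```python
# Minimal sketch of chaining a local-update method into a global-update method,
# assuming a synthetic least-squares setup; hyperparameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
M, n, d = 10, 50, 5                      # clients, samples per client, dimension

# Heterogeneous clients: each client m holds its own least-squares problem.
A = [rng.normal(size=(n, d)) for _ in range(M)]
b = [A[m] @ rng.normal(size=d) + 0.1 * rng.normal(size=n) for m in range(M)]

def local_grad(m, x):
    """Full gradient of client m's quadratic loss (1/2n)||A_m x - b_m||^2."""
    return A[m].T @ (A[m] @ x - b[m]) / n

def global_loss(x):
    """Average loss across all clients."""
    return np.mean([0.5 * np.linalg.norm(A[m] @ x - b[m]) ** 2 / n for m in range(M)])

def fedavg_phase(x, rounds, local_steps, lr):
    """Local-update phase: each round, every client runs several local gradient
    steps from the current server model, then the server averages the iterates."""
    for _ in range(rounds):
        updates = []
        for m in range(M):
            y = x.copy()
            for _ in range(local_steps):
                y -= lr * local_grad(m, y)
            updates.append(y)
        x = np.mean(updates, axis=0)
    return x

def sgd_phase(x, rounds, lr):
    """Global-update phase: each round, clients return one gradient and the
    server takes a single step with the averaged gradient (minibatch SGD)."""
    for _ in range(rounds):
        g = np.mean([local_grad(m, x) for m in range(M)], axis=0)
        x = x - lr * g
    return x

# Chaining: spend the first half of the communication budget on the local-update
# method, then hand its output to the global-update method for the second half.
R = 40
x0 = np.zeros(d)
x_half = fedavg_phase(x0, rounds=R // 2, local_steps=10, lr=0.05)
x_out = sgd_phase(x_half, rounds=R // 2, lr=0.1)
print(f"loss after chained run: {global_loss(x_out):.4f}")
```

The intent of the chaining, as described in the abstract, is that the local-update phase can make rapid early progress when clients' data are similar, while the global-update phase keeps the faster per-round convergence in R in the later rounds.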