FedDR - Randomized Douglas-Rachford Splitting Algorithms for Nonconvex Federated Composite Optimization

We develop two new algorithms, called FedDR and asyncFedDR, for solving a fundamental nonconvex composite optimization problem in federated learning. Our algorithms rely on a novel combination of a nonconvex Douglas-Rachford splitting method, randomized block-coordinate strategies, and an asynchronous implementation. They can also handle convex regularizers. Unlike recent methods in the literature, e.g., FedSplit and FedPD, our algorithms update only a subset of users at each communication round, possibly asynchronously, making them more practical. These new algorithms also achieve communication efficiency and, more importantly, can handle statistical and system heterogeneity, the two main challenges in federated learning. Our convergence analysis shows that the new algorithms match the communication complexity lower bound up to a constant factor under standard assumptions. Our numerical experiments illustrate the advantages of our methods compared to existing ones on several datasets.
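The abstract does not reproduce the update rules, so the following Python sketch is only a rough illustration, under stated assumptions, of how a Douglas-Rachford splitting step combined with random client sampling could be organized for a problem of the form min_x (1/n) * sum_i f_i(x) + g(x). The function and parameter names (dr_federated_sketch, prox_f, prox_g, alpha, eta, sample_size) and the exact ordering of the local and server steps are illustrative assumptions, not the authors' FedDR implementation.

    # Hypothetical sketch (not the authors' reference code): a Douglas-Rachford-style
    # federated update with a random subset of clients active per round, for
    #   min_x  (1/n) * sum_i f_i(x) + g(x).
    # prox_f is a list of per-client proximal operators of f_i, prox_g is the
    # proximal operator of g; alpha, eta, n_rounds, sample_size are illustrative.
    import numpy as np

    def dr_federated_sketch(prox_f, prox_g, dim, n_clients,
                            alpha=1.0, eta=0.1, n_rounds=100,
                            sample_size=5, seed=0):
        rng = np.random.default_rng(seed)
        xbar = np.zeros(dim)                 # server model
        y = np.zeros((n_clients, dim))       # per-client DR iterates
        x = np.zeros((n_clients, dim))       # per-client proximal points
        xhat = np.zeros((n_clients, dim))    # per-client reflected points
        for _ in range(n_rounds):
            # Randomized block-coordinate step: only a sampled subset of
            # clients performs local updates in this communication round.
            active = rng.choice(n_clients, size=min(sample_size, n_clients),
                                replace=False)
            for i in active:
                y[i] = y[i] + alpha * (xbar - x[i])   # relaxation toward server model
                x[i] = prox_f[i](y[i], eta)           # local proximal step on f_i
                xhat[i] = 2.0 * x[i] - y[i]           # reflected point sent to server
            # Server aggregates the reflected points and applies the prox of the
            # (possibly convex) regularizer g.
            xbar = prox_g(xhat.mean(axis=0), eta)
        return xbar

As a quick sanity check under these assumptions, with quadratic local losses f_i(x) = 0.5*||x - b_i||^2 and g = 0, one can pass prox_f[i] = lambda y, eta, b=b_i: (y + eta * b) / (1.0 + eta) and prox_g = lambda z, eta: z, and the returned xbar should approach the mean of the b_i.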

[1] Eduard A. Gorbunov et al. Local SGD: Unified Theory and New Efficient Methods, 2020, AISTATS.

[2] Manzil Zaheer et al. Federated Composite Optimization, 2020, ICML.

[3] Sebastian Caldas et al. LEAF: A Benchmark for Federated Settings, 2018, arXiv.

[4] Peter Richtárik et al. Federated Optimization: Distributed Machine Learning for On-Device Intelligence, 2016, arXiv.

[5] Farzin Haddadpour et al. Local SGD with Periodic Averaging: Tighter Analysis and Adaptive Synchronization, 2019, NeurIPS.

[6] Jianyu Wang et al. Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms, 2018, arXiv.

[7] Farzin Haddadpour et al. On the Convergence of Local Descent Methods in Federated Learning, 2019, arXiv.

[8] Indranil Gupta et al. Asynchronous Federated Optimization, 2019, arXiv.

[9] Stephen J. Wright et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, 2011, NIPS.

[10] R. Rockafellar. Monotone Operators and the Proximal Point Algorithm, 1976.

[11] Blaise Agüera y Arcas et al. Communication-Efficient Learning of Deep Networks from Decentralized Data, 2016, AISTATS.

[12] Matthew K. Tam et al. A Lyapunov-type approach to convergence of the Douglas–Rachford algorithm for a nonconvex setting, 2017, J. Glob. Optim.

[13] John N. Tsitsiklis et al. Parallel and Distributed Computation, 1989.

[14] Filip Hanzely et al. Lower Bounds and Optimal Algorithms for Personalized Federated Learning, 2020, NeurIPS.

[15] Wotao Yin et al. Acceleration of Primal–Dual Methods by Preconditioning and Simple Subproblem Procedures, 2018, Journal of Scientific Computing.

[16] Ohad Shamir et al. Is Local SGD Better than Minibatch SGD?, 2020, ICML.

[17] Peter Richtárik et al. Parallel coordinate descent methods for big data optimization, 2012, Mathematical Programming.

[18] Patrick L. Combettes et al. Stochastic Quasi-Fejér Block-Coordinate Fixed Point Iterations with Random Sweeping, 2014.

[19] Francisco Facchinei et al. Asynchronous parallel algorithms for nonconvex optimization, 2016, Mathematical Programming.

[20] Anit Kumar Sahu et al. Federated Learning: Challenges, Methods, and Future Directions, 2019, IEEE Signal Processing Magazine.

[21] Guoyin Li et al. Douglas–Rachford splitting for nonconvex optimization with application to nonconvex feasibility problems, 2014, Math. Program.

[22] Qi Dou et al. FedBN: Federated Learning on Non-IID Features via Local Batch Normalization, 2021, ICLR.

[23] Richard Nock et al. Advances and Open Problems in Federated Learning, 2021, Found. Trends Mach. Learn.

[24] Martin J. Wainwright et al. FedSplit: An algorithmic framework for fast federated optimization, 2020, NeurIPS.

[25] Xiang Li et al. On the Convergence of FedAvg on Non-IID Data, 2019, ICLR.

[26] Sebastian U. Stich et al. Local SGD Converges Fast and Communicates Little, 2018, ICLR.

[27] Guoyin Li et al. Global Convergence of Splitting Methods for Nonconvex Composite Optimization, 2014, SIAM J. Optim.

[28] Sashank J. Reddi et al. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning, 2019, ICML.

[29] Wotao Yin et al. FedPD: A Federated Learning Framework with Optimal Rates and Adaptivity to Non-IID Data, 2020, arXiv.

[30] Peter Richtárik et al. SGD and Hogwild! Convergence Without the Bounded Gradients Assumption, 2018, ICML.

[31] Tao Lin et al. Don't Use Large Mini-Batches, Use Local SGD, 2018, ICLR.

[32] Ioannis Mitliagkas et al. Parallel SGD: When does averaging help?, 2016, arXiv.

[33] Shenghuo Zhu et al. Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning, 2018, AAAI.

[34] Heinz H. Bauschke et al. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2011, CMS Books in Mathematics.

[35] Aryan Mokhtari et al. Federated Learning with Compression: Unified Analysis and Sharp Guarantees, 2020, AISTATS.

[36] Yue Zhao et al. Federated Learning with Non-IID Data, 2018, arXiv.

[37] Peter Richtárik et al. First Analysis of Local GD on Heterogeneous Data, 2019, arXiv.

[38] Ming Yan et al. ARock: An Algorithmic Framework for Asynchronous Parallel Coordinate Updates, 2015, SIAM J. Sci. Comput.

[39] Song Han et al. Deep Leakage from Gradients, 2019, NeurIPS.

[40] P. Lions et al. Splitting Algorithms for the Sum of Two Nonlinear Operators, 1979.

[41] Jakub Konecný et al. Convergence and Accuracy Trade-Offs in Federated Learning and Meta-Learning, 2021, AISTATS.

[42] Ohad Shamir et al. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method, 2013, ICML.

[43] Anit Kumar Sahu et al. Federated Optimization in Heterogeneous Networks, 2018, MLSys.

[44] Laura Wynter et al. Fed+: A Family of Fusion Algorithms for Federated Learning, 2020, arXiv.

[45] Panagiotis Patrinos et al. Douglas-Rachford Splitting and ADMM for Nonconvex Optimization: Tight Convergence Results, 2017, SIAM J. Optim.

[46] Martin Jaggi et al. Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning, 2020, arXiv:2008.03606.

[47] Nathan Srebro et al. Minibatch vs Local SGD for Heterogeneous Distributed Learning, 2020, NeurIPS.

[48] Patrick L. Combettes et al. Asynchronous block-iterative primal-dual decomposition methods for monotone inclusions, 2015, Mathematical Programming.

[49] Yoshua Bengio et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.