Breaking the centralized barrier for cross-device federated learning

Federated learning (FL) is a challenging setting for optimization because the heterogeneity of the data across different clients gives rise to a client-drift phenomenon. In fact, designing an algorithm for FL that is uniformly better than simple centralized training has been a major open problem thus far. In this work, we propose a general algorithmic framework, MIME, which i) mitigates client drift and ii) adapts arbitrary centralized optimization algorithms, such as momentum and Adam, to the cross-device federated learning setting. MIME uses a combination of control variates and server-level optimizer state (e.g., momentum) at every client-update step to ensure that each local update mimics that of the centralized method run on i.i.d. data. We prove a reduction result showing that MIME can translate the convergence of a generic algorithm in the centralized setting into convergence in the federated setting. Moreover, we show that, when combined with momentum-based variance reduction, MIME is provably faster than any centralized method, the first such result. We also perform a thorough experimental evaluation of MIME's performance on real-world datasets.
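To make the mechanism concrete, below is a minimal sketch (not the authors' reference implementation) of one MIME communication round with SGD-plus-momentum as the base optimizer. The client interface (`sample_batch`, `grad`, `full_grad`) and all variable names are hypothetical; the sketch only illustrates the two ideas stated above: an SVRG-style control-variate correction of the local gradient, and reuse of the server's (frozen) optimizer state inside every client-update step.

```python
import numpy as np

def mime_round(x, momentum, clients, lr=0.1, beta=0.9, local_steps=5):
    """One round of a MIME-style update with SGD+momentum (illustrative sketch).

    x         -- current server model (flat parameter vector)
    momentum  -- server-level momentum buffer, kept FIXED during local updates
    clients   -- objects assumed to expose:
                   .sample_batch()        -> a mini-batch of local data
                   .grad(params, batch)   -> stochastic gradient on that batch
                   .full_grad(params)     -> full local gradient
    """
    # Server-level control variate: average full gradient at the server model.
    c = np.mean([cl.full_grad(x) for cl in clients], axis=0)

    client_models = []
    for cl in clients:
        y = x.copy()
        for _ in range(local_steps):
            batch = cl.sample_batch()
            # Control-variate correction: de-bias the local gradient so the
            # step mimics a centralized step on i.i.d. data.
            g = cl.grad(y, batch) - cl.grad(x, batch) + c
            # Apply the base optimizer's rule using the *server's* frozen
            # momentum state -- the "mimicking" part of MIME.
            d = (1 - beta) * g + beta * momentum
            y = y - lr * d
        client_models.append(y)

    # Server aggregates client models and refreshes the optimizer state
    # using the full-batch gradient at the old server model.
    x_new = np.mean(client_models, axis=0)
    momentum_new = (1 - beta) * c + beta * momentum
    return x_new, momentum_new
```

In this sketch the momentum buffer is updated only at the server, never inside the local loop; swapping in a different base optimizer would amount to replacing the momentum update with that optimizer's state-update rule under the same frozen-state convention.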
