Breaking the centralized barrier for cross-device federated learning

Federated learning (FL) is a challenging setting for optimization because the heterogeneity of the data across different clients gives rise to a client-drift phenomenon. In fact, designing an algorithm for FL that is uniformly better than simple centralized training has been a major open problem thus far. In this work, we propose a general algorithmic framework, MIME, which i) mitigates client drift and ii) adapts arbitrary centralized optimization algorithms, such as momentum and Adam, to the cross-device federated learning setting. MIME uses a combination of control variates and server-level optimizer state (e.g., momentum) at every client-update step to ensure that each local update mimics that of the centralized method run on i.i.d. data. We prove a reduction result showing that MIME can translate the convergence of a generic algorithm in the centralized setting into convergence in the federated setting. Moreover, we show that, when combined with momentum-based variance reduction, MIME is provably faster than any centralized method, the first such result. We also perform a thorough experimental evaluation of MIME's performance on real-world datasets.
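To make the mechanism concrete, below is a minimal sketch (not the authors' reference implementation) of one MIME communication round with SGD-plus-momentum as the base optimizer. The client interface (`sample_batch`, `grad`, `full_grad`) and all variable names are hypothetical; the sketch only illustrates the two ideas stated above: an SVRG-style control-variate correction of the local gradient, and reuse of the server's (frozen) optimizer state inside every client-update step.

```python
import numpy as np

def mime_round(x, momentum, clients, lr=0.1, beta=0.9, local_steps=5):
    """One round of a MIME-style update with SGD+momentum (illustrative sketch).

    x         -- current server model (flat parameter vector)
    momentum  -- server-level momentum buffer, kept FIXED during local updates
    clients   -- objects assumed to expose:
                   .sample_batch()        -> a mini-batch of local data
                   .grad(params, batch)   -> stochastic gradient on that batch
                   .full_grad(params)     -> full local gradient
    """
    # Server-level control variate: average full gradient at the server model.
    c = np.mean([cl.full_grad(x) for cl in clients], axis=0)

    client_models = []
    for cl in clients:
        y = x.copy()
        for _ in range(local_steps):
            batch = cl.sample_batch()
            # Control-variate correction: de-bias the local gradient so the
            # step mimics a centralized step on i.i.d. data.
            g = cl.grad(y, batch) - cl.grad(x, batch) + c
            # Apply the base optimizer's rule using the *server's* frozen
            # momentum state -- the "mimicking" part of MIME.
            d = (1 - beta) * g + beta * momentum
            y = y - lr * d
        client_models.append(y)

    # Server aggregates client models and refreshes the optimizer state
    # using the full-batch gradient at the old server model.
    x_new = np.mean(client_models, axis=0)
    momentum_new = (1 - beta) * c + beta * momentum
    return x_new, momentum_new
```

In this sketch the momentum buffer is updated only at the server, never inside the local loop; swapping in a different base optimizer would amount to replacing the momentum update with that optimizer's state-update rule under the same frozen-state convention.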
