SCAFFOLD: Stochastic Controlled Averaging for On-Device Federated Learning

Federated learning is a key scenario in modern large-scale machine learning. In this setting, the training data remains distributed over a large number of clients, which may be phones, other mobile devices, or network sensors, and a centralized model is learned without ever transmitting client data over the network. The standard optimization algorithm in this setting is Federated Averaging (FedAvg). However, when client data are heterogeneous, as is typical in applications, FedAvg does not admit favorable convergence guarantees: the local updates computed on different clients drift apart, which also explains why FedAvg is slow to converge and hard to tune in practice. This paper presents a new Stochastic Controlled Averaging algorithm (SCAFFOLD) that uses control variates to reduce the drift between clients. We prove that the algorithm requires significantly fewer communication rounds and enjoys favorable convergence guarantees.
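
The drift-correction mechanism can be illustrated with a small sketch. The NumPy code below is our own minimal illustration of one SCAFFOLD-style communication round, not the paper's exact algorithm: it assumes every client participates in the round, and the names scaffold_round, grad_fn, lr_local, and lr_global are introduced here purely for exposition. Each client runs a few corrected local steps, y <- y - lr_local * (g_i(y) - c_i + c), and the server then averages the resulting model and control-variate updates.

    import numpy as np

    def scaffold_round(x, c, clients, local_steps=10, lr_local=0.1, lr_global=1.0):
        # x: global model parameters; c: server control variate (same shape as x).
        # clients: list of (grad_fn, c_i) pairs, where grad_fn(params) returns a
        # (stochastic) gradient and c_i is that client's control variate.
        dx_sum = np.zeros_like(x)
        dc_sum = np.zeros_like(x)
        updated_clients = []

        for grad_fn, c_i in clients:
            y = x.copy()
            for _ in range(local_steps):
                g = grad_fn(y)
                # Drift correction: subtract the local control variate and add the
                # server one so local steps track the global gradient direction.
                y = y - lr_local * (g - c_i + c)
            # Refresh the client control variate (an "option II"-style rule).
            c_i_new = c_i - c + (x - y) / (local_steps * lr_local)
            dx_sum += y - x
            dc_sum += c_i_new - c_i
            updated_clients.append((grad_fn, c_i_new))

        n = len(clients)
        x_new = x + lr_global * dx_sum / n  # server model update
        c_new = c + dc_sum / n              # server control-variate update
        #                                     (assumes full client participation)
        return x_new, c_new, updated_clients

A toy run on two heterogeneous quadratic objectives shows the corrected local steps converging to the global optimum rather than to a drifted compromise:

    # Two clients with f_i(y) = 0.5 * ||y - a_i||^2, so grad f_i(y) = y - a_i;
    # the global optimum is the average of the a_i.
    clients = [(lambda y, a=np.array([1.0, 0.0]): y - a, np.zeros(2)),
               (lambda y, a=np.array([0.0, 1.0]): y - a, np.zeros(2))]
    x, c = np.zeros(2), np.zeros(2)
    for _ in range(50):
        x, c, clients = scaffold_round(x, c, clients)
    print(x)  # approaches [0.5, 0.5] despite heterogeneous client objectives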
