Local Adaptivity in Federated Learning: Convergence and Consistency

The federated learning (FL) framework trains a machine learning model on decentralized data stored at edge client devices by periodically aggregating locally trained models. Popular FL optimization algorithms use vanilla (stochastic) gradient descent for both the local updates at clients and the global updates at the aggregating server. Recently, adaptive optimization methods such as AdaGrad have been studied for server updates. However, the effect of using adaptive optimization methods for local updates at clients is not yet understood. We show, in both theory and practice, that while local adaptive methods can accelerate convergence, they can also introduce a non-vanishing solution bias: the converged solution may differ from a stationary point of the global objective function. We propose correction techniques that overcome this inconsistency and complement local adaptive methods in FL. Extensive experiments on realistic federated training tasks show that the proposed algorithms achieve faster convergence and higher test accuracy than baselines without local adaptivity.
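
As a concrete illustration of the setup described above, the minimal sketch below simulates FL rounds in which each client runs a few AdaGrad-style local steps before the server averages the resulting models. The toy quadratic objectives, hyperparameters, and variable names are illustrative assumptions rather than the paper's experimental setup, and the sketch deliberately omits the proposed correction techniques; it only makes visible how per-client adaptive rescaling can leave the averaged model away from a stationary point of the global objective.

```python
# Sketch: federated averaging with local AdaGrad-style updates (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, num_clients, local_steps, rounds = 10, 4, 5, 50
lr, eps = 0.1, 1e-8

# Heterogeneous toy objectives f_i(w) = 0.5 * ||A_i w - b_i||^2 (assumed data).
A = [rng.normal(size=(20, d)) for _ in range(num_clients)]
b = [rng.normal(size=20) for _ in range(num_clients)]

def grad(i, w):
    # Gradient of client i's local objective.
    return A[i].T @ (A[i] @ w - b[i])

w_global = np.zeros(d)
for _ in range(rounds):
    local_models = []
    for i in range(num_clients):
        w = w_global.copy()
        accum = np.zeros(d)  # AdaGrad accumulator, reset at the start of each round
        for _ in range(local_steps):
            g = grad(i, w)
            accum += g ** 2
            w -= lr * g / (np.sqrt(accum) + eps)  # local adaptive update
        local_models.append(w)
    w_global = np.mean(local_models, axis=0)  # server update: plain averaging

# Because each client rescales its gradients differently, the averaged model can
# settle at a point where the global gradient is not zero, i.e. the solution bias
# (inconsistency) that the paper analyzes and corrects.
global_grad = np.mean([grad(i, w_global) for i in range(num_clients)], axis=0)
print("global gradient norm at the averaged solution:", np.linalg.norm(global_grad))
```

Running this prints a noticeably nonzero global gradient norm even after many rounds, whereas the same loop with plain local SGD (dropping the accumulator) drives it much closer to zero, which mirrors the inconsistency discussed in the abstract.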
