FedPD: A Federated Learning Framework With Adaptivity to Non-IID Data

Federated Learning (FL) is a popular framework for communication-efficient learning from distributed data. To utilize data at different clients without moving it to the cloud, algorithms such as Federated Averaging (FedAvg) adopt a computation-then-aggregation model, in which multiple local updates are performed on local data before aggregation. These algorithms can fail when faced with practical challenges, e.g., local data that are not independently and identically distributed (non-IID). In this paper, we first characterize the behavior of the FedAvg algorithm and show that, without strong and unrealistic assumptions on the problem structure, it can behave erratically. Aiming to design FL algorithms that are provably fast and require as few assumptions as possible, we propose a new algorithm design strategy from the primal-dual optimization perspective. Our strategy yields algorithms that handle non-convex objective functions, achieve the best possible optimization and communication complexity (in a well-defined sense), and accommodate both full-batch and mini-batch local computation models. Importantly, the proposed algorithms are communication efficient, in that the required communication decreases as the heterogeneity among the local data decreases. In the extreme case where the local data become homogeneous, only $\mathcal{O}(1)$ communication is required among the agents. To the best of our knowledge, this is the first algorithmic framework for FL that achieves all of the above properties.
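To make the computation-then-aggregation model concrete, below is a minimal sketch of a FedAvg-style round: each client runs several local gradient steps on its own data, and the server then averages the returned models. This is an illustration of the baseline scheme the abstract critiques, not of the proposed primal-dual (FedPD) algorithm; the loss function, data generation, and all names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, steps=10):
    """Run several local gradient steps on one client's data
    (a least-squares loss is used purely for illustration)."""
    w = w.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(w_global, client_data, lr=0.1, local_steps=10):
    """One communication round: every client computes locally,
    then the server averages the returned models."""
    local_models = [local_sgd(w_global, X, y, lr, local_steps)
                    for X, y in client_data]
    return np.mean(local_models, axis=0)

# Toy non-IID setup: each client's targets come from a different true model,
# so the local optima disagree and repeated local steps pull the iterates apart.
rng = np.random.default_rng(0)
d = 5
clients = []
for i in range(4):
    X = rng.normal(size=(50, d))
    w_true = rng.normal(size=d) * (i + 1)   # heterogeneous local optima
    clients.append((X, X @ w_true + 0.01 * rng.normal(size=50)))

w = np.zeros(d)
for _ in range(20):
    w = fedavg_round(w, clients)
```

With homogeneous clients, increasing `local_steps` mainly saves communication; with the heterogeneous setup above, it also biases the averaged iterate, which is the failure mode the paper's primal-dual design is meant to avoid.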
