Local Averaging Helps: Hierarchical Federated Learning and Convergence Analysis

Federated learning is an effective approach for enabling collaborative learning among edge devices without exchanging raw data. In practice, these devices may connect to local hubs, which in turn connect to a global server (aggregator). Since these hubs may have limited computation capability, it is reasonable to assume that they can perform simple averaging operations. A natural question is whether such local averaging is beneficial under different system parameters, and how much gain it provides compared to the case without such averaging. In this paper, we study hierarchical federated learning with stochastic gradient descent (HF-SGD) and provide a thorough theoretical analysis of its convergence behavior. The analysis precisely characterizes the impact of local averaging as a function of the system parameters. Since global averaging incurs a higher communication cost, we propose a strategy that decreases the global averaging frequency while increasing the local averaging frequency. Experiments validate the theoretical analysis and demonstrate the advantages of hierarchical federated learning.
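To make the two-level averaging structure concrete, the following is a minimal sketch of a hierarchical local-SGD loop of the kind described in the abstract. It is not the authors' reference implementation; the synthetic data, the parameter names (TAU_L for the number of SGD steps between hub-level averages, TAU_G for the number of hub rounds between global averages), and the learning-rate and round settings are all illustrative assumptions.

```python
# Hypothetical sketch of hierarchical local SGD (HF-SGD-style), not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data, split across hubs and clients (non-IID via a per-hub shift).
N_HUBS, CLIENTS_PER_HUB, DIM, SAMPLES = 2, 3, 5, 50
true_w = rng.normal(size=DIM)

def make_client_data(shift):
    X = rng.normal(size=(SAMPLES, DIM)) + shift          # covariate shift per hub
    y = X @ true_w + 0.1 * rng.normal(size=SAMPLES)
    return X, y

clients = [[make_client_data(shift=h) for _ in range(CLIENTS_PER_HUB)]
           for h in range(N_HUBS)]

def stoch_grad(w, X, y, batch=10):
    idx = rng.choice(len(y), size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch                  # stochastic gradient of 0.5*MSE

# TAU_L: SGD steps between cheap hub-level averages; TAU_G: hub rounds between costly global averages.
TAU_L, TAU_G, LR, GLOBAL_ROUNDS = 5, 4, 0.05, 30

w_global = np.zeros(DIM)
for _ in range(GLOBAL_ROUNDS):
    hub_models = []
    for h in range(N_HUBS):
        # Each global round, clients under hub h start from the current global model.
        w_clients = [w_global.copy() for _ in range(CLIENTS_PER_HUB)]
        for _ in range(TAU_G):                           # hub-level (local) rounds
            for _ in range(TAU_L):                       # local SGD steps on each client
                for c, (X, y) in enumerate(clients[h]):
                    w_clients[c] -= LR * stoch_grad(w_clients[c], X, y)
            w_hub = np.mean(w_clients, axis=0)           # simple averaging at the hub
            w_clients = [w_hub.copy() for _ in range(CLIENTS_PER_HUB)]
        hub_models.append(w_hub)
    w_global = np.mean(hub_models, axis=0)               # infrequent global averaging

print("distance to true model:", np.linalg.norm(w_global - true_w))
```

In this sketch, the communication-saving strategy from the abstract corresponds to choosing a small TAU_L (frequent, cheap local averaging at the hubs) together with a large TAU_G (infrequent, expensive global averaging at the server).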
