Toward Communication-Efficient Federated Learning in the Internet of Things With Edge Computing

Federated learning is an emerging concept that trains machine learning models on local distributed data sets without sending the raw data to a data center. However, in the Internet of Things (IoT), where wireless network resources are constrained, the key problem of federated learning is the communication overhead of parameter synchronization, which wastes bandwidth, increases training time, and can even degrade model accuracy. Gradient sparsification, which transmits only the significant gradients and accumulates the insignificant ones locally, has received increasing attention. However, how to preserve model accuracy under a high sparsification ratio has been largely ignored in the literature. In this article, a general gradient sparsification (GGS) framework is proposed for adaptive optimizers to correct the sparse gradient update process. It consists of two key mechanisms: 1) gradient correction and 2) batch normalization (BN) updates with local gradients. With gradient correction, the optimizer can properly treat the accumulated insignificant gradients, which helps the model converge better. Furthermore, updating the BN layers with local gradients relieves the impact of delayed gradients without increasing the communication overhead. We have conducted experiments on LeNet-5, CifarNet, DenseNet-121, and AlexNet with adaptive optimizers. Results show that even when 99.9% of the gradients are sparsified, top-1 accuracy on the validation data sets is maintained.
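
For illustration, the gradient sparsification with local accumulation described above can be sketched as follows. This is a minimal NumPy sketch of generic top-k sparsification, not the paper's GGS framework; the function name sparsify_with_accumulation, the magnitude-threshold selection rule, and the 99.9% sparsity setting are illustrative assumptions.

    import numpy as np

    def sparsify_with_accumulation(grad, residual, sparsity=0.999):
        """Top-k gradient sparsification with local accumulation.
        Hypothetical helper for illustration, not the GGS implementation."""
        # Fold in gradients that were withheld (accumulated) in earlier rounds.
        acc = grad + residual
        # Keep only the largest-magnitude (1 - sparsity) fraction of entries.
        k = max(1, int(acc.size * (1.0 - sparsity)))
        threshold = np.partition(np.abs(acc).ravel(), -k)[-k]
        mask = np.abs(acc) >= threshold
        # Significant entries are sent to the server; the rest stay local.
        sparse_grad = np.where(mask, acc, 0.0)
        new_residual = np.where(mask, 0.0, acc)
        return sparse_grad, new_residual

    # Example: roughly 99.9% of a layer's gradients are withheld and
    # accumulated locally for later rounds.
    rng = np.random.default_rng(0)
    grad = rng.normal(size=(1000, 100))
    residual = np.zeros_like(grad)
    sparse_grad, residual = sparsify_with_accumulation(grad, residual)
    print(np.count_nonzero(sparse_grad))  # about 0.1% of 100,000 entries

In the GGS setting, a server-side adaptive optimizer would additionally apply gradient correction to handle these accumulated values, and BN layers would be updated with local gradients; those steps are omitted from this sketch.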
