Machine Learning at the Wireless Edge: Distributed Stochastic Gradient Descent Over-the-Air

We study federated machine learning (ML) at the wireless edge, where power- and bandwidth-limited wireless devices with local datasets carry out distributed stochastic gradient descent (DSGD) with the help of a parameter server (PS). Standard approaches assume separate computation and communication, where local gradient estimates are compressed and transmitted to the PS over orthogonal links. Following this digital approach, we introduce D-DSGD, in which the wireless devices employ gradient quantization and error accumulation, and transmit their gradient estimates to the PS over a multiple access channel (MAC). We then introduce a novel analog scheme, called A-DSGD, which exploits the additive nature of the wireless MAC for over-the-air gradient computation, and we provide a convergence analysis for this approach. In A-DSGD, the devices first sparsify their gradient estimates and then project them to a lower-dimensional space imposed by the available channel bandwidth. These projections are sent directly over the MAC without employing any digital code. Numerical results show that A-DSGD converges faster than D-DSGD thanks to its more efficient use of the limited bandwidth and the natural alignment of the gradient estimates over the channel. The improvement is particularly compelling in the low-power and low-bandwidth regimes. We also illustrate, for a classification problem, that A-DSGD is more robust to bias in the data distribution across devices, while D-DSGD significantly outperforms other digital schemes in the literature. Finally, we observe that both D-DSGD and A-DSGD perform better as the number of devices increases, demonstrating their ability to harness the computational power of edge devices.
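
The Python sketch below is a minimal illustration of the two pipelines described above, not the paper's exact algorithms: it assumes top-k sparsification and uniform scalar quantization for the digital scheme, and a shared Gaussian random projection for the analog scheme. The function names, the dimensions (d, s, k), the number of devices, the quantization resolution, and the noise level are all illustrative choices introduced here for the example.

```python
import numpy as np

def top_k_sparsify(g, k):
    """Keep the k largest-magnitude entries of g, zero out the rest."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def d_dsgd_worker(grad, error_acc, k, num_levels):
    """Digital-style update: error accumulation + top-k + uniform quantization (sketch)."""
    g = grad + error_acc                       # fold in previously unsent residual
    g_sparse = top_k_sparsify(g, k)
    scale = np.max(np.abs(g_sparse)) + 1e-12
    q = np.round(g_sparse / scale * num_levels) / num_levels * scale
    return q, g - q                            # residual kept for the next round

def a_dsgd_worker(grad, error_acc, k, proj_matrix):
    """Analog-style update: sparsify, then project to the channel-bandwidth dimension (sketch)."""
    g = grad + error_acc
    g_sparse = top_k_sparsify(g, k)
    x = proj_matrix @ g_sparse                 # low-dimensional analog channel symbols
    return x, g - g_sparse

# Toy round with 4 devices and a 1000-dim gradient compressed to 100 channel uses.
d, s, workers, k = 1000, 100, 4, 50
A = np.random.randn(s, d) / np.sqrt(s)         # projection matrix shared by all devices
errors = [np.zeros(d) for _ in range(workers)]
grads = [np.random.randn(d) for _ in range(workers)]

# Digital scheme (per device): the quantized vector would be channel coded and
# sent over an orthogonal slice of the MAC resources.
q0, _ = d_dsgd_worker(grads[0], np.zeros(d), k, num_levels=16)

# Analog scheme: the MAC adds the transmissions "for free", so the PS receives
# the superposition of all projected gradients (plus noise), not individual signals.
tx = []
for i in range(workers):
    x, errors[i] = a_dsgd_worker(grads[i], errors[i], k, A)
    tx.append(x)
received = sum(tx) + 0.01 * np.random.randn(s)  # noisy sum over the MAC
# The PS would then recover an estimate of the summed sparse gradient from
# `received` (e.g., with a compressed-sensing decoder such as AMP) and update the model.
```

The key contrast the sketch tries to convey is that the digital scheme's quantized updates each consume orthogonal channel resources and must be decoded individually, whereas the analog transmissions are superposed by the channel itself, so the PS only ever observes a noisy version of their sum, which is exactly the quantity DSGD needs.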
