Distributed Distillation for On-Device Learning

On-device learning promises collaborative training of machine learning models across edge devices without sharing user data. In state-of-the-art on-device learning algorithms, devices communicate their model weights over a decentralized communication network. Transmitting model weights incurs a large communication overhead and restricts participation to devices with identical model architectures. To overcome these limitations, we introduce a distributed distillation algorithm in which devices communicate and learn from soft-decision (softmax) outputs, which are inherently architecture-agnostic and scale only with the number of classes. The communicated soft decisions are each model's outputs on a public, unlabeled reference dataset, which serves as a common vocabulary between devices. We prove that our algorithm converges with probability 1 to a stationary point at which all devices in the communication network distill the entire network's knowledge on the reference data, regardless of their local connections. Our analysis assumes smooth loss functions, which may be non-convex. Simulations support our theoretical findings and show that even a naive implementation of our algorithm significantly reduces the communication overhead while achieving performance comparable to the state of the art, depending on the regime. By requiring little communication overhead and allowing for cross-architecture training, we remove two main obstacles to scaling on-device learning.
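To make the exchanged quantities concrete, the sketch below illustrates one possible communication round in the spirit described above: each device broadcasts only its softmax outputs on a shared, unlabeled reference set, then trains against a consensus of its neighbors' soft decisions alongside its local labeled data. All names here (`Device`, `REF_BATCH`, the uniform neighbor averaging, and the distillation weight) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of distributed distillation over a small ring topology.
# Assumes uniform averaging of neighbor soft decisions and a KL distillation
# term; the paper's actual update and mixing rules may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 10
REF_SIZE = 256
REF_BATCH = torch.randn(REF_SIZE, 32)   # placeholder for the public, unlabeled reference data

class Device:
    def __init__(self, model, lr=1e-2, distill_weight=1.0):
        self.model = model                 # local architectures may differ across devices
        self.neighbors = []                # local communication links
        self.opt = torch.optim.SGD(model.parameters(), lr=lr)
        self.distill_weight = distill_weight
        self.soft_decisions = None         # softmax outputs on the reference data

    def broadcast(self):
        # Communicate only softmax outputs on the reference data:
        # a REF_SIZE x NUM_CLASSES matrix, independent of model architecture.
        with torch.no_grad():
            self.soft_decisions = F.softmax(self.model(REF_BATCH), dim=1)

    def local_step(self, x, y):
        # Consensus target: average of own and neighbors' soft decisions.
        targets = [self.soft_decisions] + [n.soft_decisions for n in self.neighbors]
        consensus = torch.stack(targets).mean(dim=0)

        self.opt.zero_grad()
        task_loss = F.cross_entropy(self.model(x), y)       # local labeled data
        distill_loss = F.kl_div(                            # match the consensus on reference data
            F.log_softmax(self.model(REF_BATCH), dim=1),
            consensus, reduction="batchmean")
        (task_loss + self.distill_weight * distill_loss).backward()
        self.opt.step()

# Usage: three devices with heterogeneous architectures on a ring.
models = [nn.Sequential(nn.Linear(32, h), nn.ReLU(), nn.Linear(h, NUM_CLASSES))
          for h in (16, 32, 64)]
devices = [Device(m) for m in models]
for i, d in enumerate(devices):
    d.neighbors = [devices[(i - 1) % 3], devices[(i + 1) % 3]]

for _ in range(5):                                   # a few training rounds
    for d in devices:
        d.broadcast()
    for d in devices:
        x, y = torch.randn(64, 32), torch.randint(0, NUM_CLASSES, (64,))
        d.local_step(x, y)
```

Note that each message is only `REF_SIZE x NUM_CLASSES` values, so the communication cost is independent of the number of model parameters, which is the source of the overhead reduction claimed above.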
