Federated Knowledge Distillation

Distributed learning frameworks often exchange model parameters across workers instead of revealing their raw data. A prime example is federated learning (FL), which exchanges the gradients or weights of each worker's neural network model. Under limited communication resources, however, this approach becomes extremely costly, particularly for modern deep neural networks with huge numbers of parameters. In this regard, federated distillation (FD) is a compelling distributed learning solution that exchanges only the model outputs, whose dimensions are commonly much smaller than the model size (e.g., 10 labels in the MNIST dataset). The goal of this chapter is to provide a deep understanding of FD while demonstrating its communication efficiency and applicability to a variety of tasks. Towards demystifying the operational principle of FD, the first part of this chapter provides a novel asymptotic analysis of two foundational algorithms of FD, namely knowledge distillation (KD) and co-distillation (CD), by exploiting the theory of the neural tangent kernel (NTK). The second part elaborates on a baseline implementation of FD for a classification task and illustrates its accuracy and communication efficiency compared to FL. Lastly, to demonstrate the applicability of FD to various distributed learning tasks and environments, the third part presents two selected applications: FD over asymmetric uplink-and-downlink wireless channels and FD for reinforcement learning.
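
To make the exchanged quantities concrete, the sketch below illustrates the baseline FD exchange described above for a classification task: each device uploads its per-label average output logits, the server averages these uploads across devices, and each device then uses the aggregated result as soft targets for a distillation term added to its local training loss. This is a minimal illustrative sketch rather than the chapter's reference implementation; the function names (per_label_mean_logits, distillation_loss), the use of NumPy, and the random stand-in data are assumptions introduced here purely for illustration.

```python
import numpy as np

NUM_CLASSES = 10   # e.g., the 10 labels of MNIST
NUM_DEVICES = 5

def per_label_mean_logits(logits, labels, num_classes=NUM_CLASSES):
    """Average a device's output logits per ground-truth label.
    The upload is a (num_classes x num_classes) matrix, independent of the
    model size, which is where FD's communication savings come from."""
    means = np.zeros((num_classes, num_classes))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            means[c] = logits[mask].mean(axis=0)
    return means

def softmax(z, temperature=1.0):
    z = (z - z.max(axis=-1, keepdims=True)) / temperature
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, labels, global_logits, temperature=2.0):
    """Cross entropy between the globally averaged (teacher) soft targets,
    looked up by each sample's label, and the student's softened outputs.
    In FD this term regularizes each device's ordinary local training loss."""
    teacher = softmax(global_logits[labels], temperature)
    student = softmax(student_logits, temperature)
    return -np.mean(np.sum(teacher * np.log(student + 1e-12), axis=-1))

rng = np.random.default_rng(0)

# 1) Upload: each device sends only its per-label mean logits.
uploads = []
for _ in range(NUM_DEVICES):
    logits = rng.normal(size=(1000, NUM_CLASSES))   # stand-in for local model outputs
    labels = rng.integers(0, NUM_CLASSES, size=1000)
    uploads.append(per_label_mean_logits(logits, labels))

# 2) Aggregate: the server averages the uploads into global soft targets.
global_logits = np.mean(uploads, axis=0)            # shape (NUM_CLASSES, NUM_CLASSES)

# 3) Download + distill: each device adds the distillation term to its loss.
batch_logits = rng.normal(size=(32, NUM_CLASSES))
batch_labels = rng.integers(0, NUM_CLASSES, size=32)
print(distillation_loss(batch_logits, batch_labels, global_logits))
```

In this sketch the per-round upload is a NUM_CLASSES x NUM_CLASSES matrix of floats per device, regardless of how large the local model is; this is the communication-efficiency contrast with FL that the second part of the chapter examines.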
