Personalized Federated Learning for Heterogeneous Clients with Clustered Knowledge Transfer

Personalized federated learning (FL) aims to train models that perform well for individual clients whose data and system capabilities are highly heterogeneous. Most work on personalized FL, however, assumes that all clients use the same model architecture and increases the communication cost by sending and receiving full models. This may not be feasible in realistic FL scenarios, where clients have highly heterogeneous system capabilities and limited communication resources. In this work, we propose a personalized FL framework, PERFED-CKT, in which clients can use heterogeneous model architectures and do not directly communicate their model parameters. PERFED-CKT uses clustered co-distillation, where clients transfer their knowledge via logits to other clients that have similar data distributions. We theoretically analyze the convergence and generalization properties of PERFED-CKT and empirically show that it achieves high test accuracy with several orders of magnitude lower communication cost than state-of-the-art personalized FL schemes.
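To make the mechanism concrete, below is a minimal sketch of clustered co-distillation in the spirit of the abstract, not the authors' implementation. It assumes (these details are not stated above) that knowledge is exchanged as logits computed on a shared unlabeled reference set, that the server clusters clients by k-means over those logits as a proxy for data-distribution similarity, and that names such as `distill_weight`, `temperature`, and the loaders are illustrative placeholders.

```python
# A minimal sketch of clustered co-distillation, hedged as described above.
# Assumptions (not taken from the paper): knowledge is exchanged as logits on a
# shared unlabeled reference set; the server clusters clients by k-means over
# those logits; `distill_weight`, `temperature`, loaders and models are placeholders.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def client_logits(model, ref_loader, device="cpu"):
    """A client's transferable 'knowledge': its logits on the shared reference set."""
    model.eval()
    outputs = []
    with torch.no_grad():
        for x in ref_loader:
            outputs.append(model(x.to(device)).cpu())
    return torch.cat(outputs)  # shape: [num_ref_samples, num_classes]


def cluster_clients(all_logits, num_clusters):
    """Server side: group clients whose reference-set predictions are similar."""
    flat = torch.stack([z.flatten() for z in all_logits]).numpy()
    return KMeans(n_clusters=num_clusters, n_init=10).fit_predict(flat)


def cluster_targets(all_logits, cluster_ids):
    """Average logits within each cluster to form per-cluster distillation targets."""
    targets = {}
    for c in set(cluster_ids.tolist()):
        members = [z for z, cid in zip(all_logits, cluster_ids) if cid == c]
        targets[c] = torch.stack(members).mean(dim=0)
    return targets


def local_update(model, opt, train_loader, ref_loader, target_logits,
                 distill_weight=0.5, temperature=1.0, device="cpu"):
    """Client side: supervised loss on private data, then a distillation pass
    that pulls the client's predictions toward its cluster's consensus logits."""
    model.train()
    for x, y in train_loader:
        opt.zero_grad()
        F.cross_entropy(model(x.to(device)), y.to(device)).backward()
        opt.step()
    offset = 0
    for x in ref_loader:
        batch = x.size(0)
        opt.zero_grad()
        student = F.log_softmax(model(x.to(device)) / temperature, dim=1)
        teacher = F.softmax(
            target_logits[offset:offset + batch] / temperature, dim=1).to(device)
        (distill_weight * F.kl_div(student, teacher,
                                   reduction="batchmean")).backward()
        opt.step()
        offset += batch
```

Under these assumptions, each client uploads only a logit matrix of size (reference-set size x number of classes) rather than model parameters, which is consistent with the abstract's claims of low communication cost and support for heterogeneous architectures; restricting each client's distillation targets to its own cluster is what keeps the transferred knowledge aligned with similar data distributions.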
