FedBE: Making Bayesian Model Ensemble Applicable to Federated Learning

Federated learning aims to collaboratively train a strong global model by accessing users' locally trained models but not their own data. A crucial step is therefore to aggregate local models into a global model, which has been shown to be challenging when users have non-i.i.d. data. In this paper, we propose a novel aggregation algorithm named FedBE, which takes a Bayesian inference perspective by sampling higher-quality global models and combining them via Bayesian model ensemble, leading to much more robust aggregation. We show that an effective model distribution can be constructed by simply fitting a Gaussian or Dirichlet distribution to the local models. Our empirical studies validate FedBE's superior performance, especially when users' data are not i.i.d. and when the neural networks go deeper. Moreover, FedBE is compatible with recent efforts in regularizing users' model training, making it an easily applicable module: you only need to replace the aggregation method but leave other parts of your federated learning algorithm intact.
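To make the aggregation idea concrete, below is a minimal NumPy sketch of the Gaussian variant described above: fit a diagonal Gaussian to the clients' flattened weights, sample candidate global models from it, and average their predictions on server-side unlabeled data to form a Bayesian model ensemble. The function names, the toy linear-softmax model family, and the hypothetical `predict_fn` are illustrative assumptions, not the paper's implementation; the full FedBE method additionally distills the ensemble predictions into a single global model, which this sketch omits.

```python
# Minimal sketch of FedBE-style aggregation (illustrative assumptions only):
# each local model is a flat NumPy parameter vector, and predict_fn(w, x)
# is a user-supplied (hypothetical) function returning class probabilities.
import numpy as np

def fit_gaussian(local_weights, eps=1e-8):
    """Fit a diagonal Gaussian to the stacked local model weights."""
    W = np.stack(local_weights)            # (num_clients, num_params)
    return W.mean(axis=0), W.std(axis=0) + eps

def sample_global_models(mu, sigma, num_samples, rng):
    """Draw candidate global models from the fitted weight distribution."""
    return [rng.normal(mu, sigma) for _ in range(num_samples)]

def ensemble_predict(models, predict_fn, x_unlabeled):
    """Average the predictive distributions of the sampled models
    (Bayesian model ensemble) to produce pseudo-labels for distillation."""
    probs = np.stack([predict_fn(w, x_unlabeled) for w in models])
    return probs.mean(axis=0)              # (num_examples, num_classes)

# Toy usage: 3 "clients", 2-class linear-softmax models on 4 features.
rng = np.random.default_rng(0)
local_weights = [rng.normal(size=4 * 2) for _ in range(3)]

def predict_fn(w, x):                      # hypothetical toy model family
    logits = x @ w.reshape(4, 2)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

mu, sigma = fit_gaussian(local_weights)
models = sample_global_models(mu, sigma, num_samples=10, rng=rng)
models += local_weights + [mu]             # also ensemble the local models and their mean
pseudo_labels = ensemble_predict(models, predict_fn, rng.normal(size=(5, 4)))
print(pseudo_labels.shape)                 # (5, 2)
```

In the Dirichlet variant, the sampled models would instead be convex combinations of the local models with mixing weights drawn from a Dirichlet distribution; only `sample_global_models` would change under that reading.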
