Where to Begin? On the Impact of Pre-Training and Initialization in Federated Learning

An oft-cited challenge of federated learning is the presence of heterogeneity. \emph{Data heterogeneity} refers to the fact that data from different clients may follow very different distributions. \emph{System heterogeneity} refers to the fact that client devices have different system capabilities. A considerable number of federated optimization methods address this challenge. In the literature, empirical evaluations usually start federated training from a random initialization. However, in many practical applications of federated learning, the server has access to proxy data for the training task that can be used to pre-train a model before federated training begins. We empirically study the impact of starting from a pre-trained model in federated learning using four standard federated learning benchmark datasets. Unsurprisingly, starting from a pre-trained model reduces the training time required to reach a target error rate and enables the training of models that are more accurate (by up to 40\%) than is possible when starting from random initialization. Surprisingly, we also find that starting federated learning from a pre-trained initialization reduces the effect of both data and system heterogeneity. We recommend that future work proposing and evaluating federated optimization methods evaluate performance when starting from both random and pre-trained initializations. We also believe this study raises several questions for further work on understanding the role of heterogeneity in federated optimization.
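To make the comparison concrete, the sketch below shows a FedAvg-style loop in which the server's initial model is either randomly initialized or loaded from a pre-trained checkpoint. This is a minimal illustration, not the paper's exact setup: the architecture (ResNet-18), the use of ImageNet weights as a stand-in for pre-training on proxy data, the client sampling scheme, and all hyperparameters are assumptions made for this example.

```python
# Minimal sketch of the two starting points compared above, assuming a
# FedAvg-style loop in PyTorch. Model choice, pre-training source, client
# sampling, and hyperparameters are illustrative assumptions only.
import copy
import random

import torch
import torch.nn as nn
import torchvision


def init_global_model(pretrained: bool) -> nn.Module:
    """Server-side initialization: random weights vs. a pre-trained checkpoint."""
    weights = torchvision.models.ResNet18_Weights.DEFAULT if pretrained else None
    return torchvision.models.resnet18(weights=weights)


def local_update(global_model: nn.Module, loader, lr: float = 0.01, epochs: int = 1):
    """One client's local SGD epochs, starting from the current global model."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()


def fedavg_round(global_model: nn.Module, client_loaders, clients_per_round: int = 10):
    """One communication round: sample clients, run local updates, average weights.

    Plain (unweighted) averaging is used for brevity; FedAvg proper weights each
    client by its number of local examples.
    """
    sampled = random.sample(client_loaders, min(clients_per_round, len(client_loaders)))
    updates = [local_update(global_model, loader) for loader in sampled]
    averaged = {
        k: torch.stack([u[k].float() for u in updates]).mean(dim=0).to(updates[0][k].dtype)
        for k in updates[0]
    }
    global_model.load_state_dict(averaged)
    return global_model


# Usage (hypothetical client_loaders: one DataLoader per client):
#   model = init_global_model(pretrained=True)   # pre-trained start
#   model = init_global_model(pretrained=False)  # random start
#   for _ in range(num_rounds):
#       model = fedavg_round(model, client_loaders)
```

The only difference between the two conditions studied is the argument to `init_global_model`; everything downstream of the initialization is identical, which is what isolates the effect of the starting point.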
