Efficient and Private Federated Learning with Partially Trainable Networks

Federated learning is used for decentralized training of machine learning models on a large number (millions) of edge mobile devices. It is challenging because mobile devices often have limited communication bandwidth and local computation resources. Therefore, improving the efficiency of federated learning is critical for scalability and usability. In this paper, we propose to leverage partially trainable neural networks, which freeze a portion of the model parameters during the entire training process, to reduce the communication cost with little impact on model performance. Through extensive experiments, we empirically show that Federated learning of Partially Trainable neural networks (FedPT) can result in superior communication-accuracy trade-offs, with up to a 46× reduction in communication cost at only a small loss in accuracy. Our approach also enables faster training, with a smaller memory footprint and better utility under strong differential privacy guarantees. The proposed FedPT method can be particularly interesting for pushing the limits of overparameterization in on-device learning.
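The mechanism described above is simple to sketch: fix a subset of parameters at their initial values, exclude them from local optimization, and exchange only the trainable parameters between clients and the server. The PyTorch code below is a minimal illustrative sketch under assumed choices (the toy model, the `freeze_fraction` rule, and the simplified FedAvg loop are placeholders for illustration), not the paper's implementation.

import copy
import torch
import torch.nn as nn

def build_model():
    # Toy model used only for illustration.
    return nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))

def freeze_fraction(model, frozen_fraction=0.5):
    # Freeze a fixed subset of parameter tensors for the entire training run.
    params = list(model.named_parameters())
    n_frozen = int(len(params) * frozen_fraction)
    for _, p in params[:n_frozen]:
        p.requires_grad_(False)  # never updated locally, never communicated

def trainable_state(model):
    # Only trainable parameters are exchanged between server and clients.
    return {n: p.detach().clone() for n, p in model.named_parameters() if p.requires_grad}

def fedavg_round(global_model, client_loaders, lr=0.1, local_steps=5):
    # One FedAvg-style round that aggregates only the trainable parameters.
    aggregated = None
    for loader in client_loaders:
        client = copy.deepcopy(global_model)
        opt = torch.optim.SGD([p for p in client.parameters() if p.requires_grad], lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _, (x, y) in zip(range(local_steps), loader):
            opt.zero_grad()
            loss_fn(client(x), y).backward()
            opt.step()
        update = trainable_state(client)  # frozen weights stay at their initialization
        if aggregated is None:
            aggregated = {n: v / len(client_loaders) for n, v in update.items()}
        else:
            for n, v in update.items():
                aggregated[n] += v / len(client_loaders)
    # Load the averaged trainable parameters back into the global model.
    with torch.no_grad():
        for n, p in global_model.named_parameters():
            if p.requires_grad:
                p.copy_(aggregated[n])
    return global_model

In this sketch the communication savings come from `trainable_state`: with half of the parameter tensors frozen, roughly half of the model never needs to be uploaded or downloaded in any round.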
