On the Effectiveness of Partial Variance Reduction in Federated Learning with Heterogeneous Data

Data heterogeneity across clients is a key challenge in federated learning. Prior works address this by either aligning client and server models or using control variates to correct client model drift. Although these methods achieve fast convergence on convex or simple non-convex problems, their performance on over-parameterized models such as deep neural networks falls short. In this paper, we first revisit the widely used FedAvg algorithm on deep neural networks to understand how data heterogeneity influences the gradient updates across the network layers. We observe that while the feature-extraction layers are learned efficiently by FedAvg, the substantial diversity of the final classification layers across clients impedes performance. Motivated by this, we propose to correct model drift via variance reduction applied only to the final layers. We demonstrate that this significantly outperforms existing benchmarks at a similar or lower communication cost. We furthermore provide a proof of the convergence rate of our algorithm.
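
To make the idea concrete, below is a minimal sketch (not the paper's actual code) of a client-side local update in which a SCAFFOLD-style control-variate correction is applied only to the final classification-layer parameters, while all other layers follow plain FedAvg local SGD. The function name `local_update` and the arguments `head_param_names`, `c_local_head`, and `c_global_head` are illustrative assumptions, not names taken from the paper.

```python
import torch

def local_update(model, loader, loss_fn, lr, local_steps,
                 c_global_head, c_local_head, head_param_names):
    """One client's local round: plain SGD on all layers, with a
    SCAFFOLD-style drift correction applied only to the parameters
    whose names appear in head_param_names (the classification head).

    c_global_head / c_local_head map head parameter names to control
    variates (tensors with the same shapes as the parameters)."""
    data_iter = iter(loader)
    for _ in range(local_steps):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            x, y = next(data_iter)
        loss = loss_fn(model(x), y)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for name, p in model.named_parameters():
                g = p.grad
                if name in head_param_names:
                    # variance reduction on the final layers only:
                    # g <- g - c_i + c (client vs. server control variate)
                    g = g - c_local_head[name] + c_global_head[name]
                p -= lr * g
    return model
```

After local training, the server would average the returned client models (and update the head control variates) as in standard control-variate methods; the point of the sketch is only that the correction term touches the final layers, so the extra communication is limited to those parameters.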
