Implicit Gradient Alignment in Distributed and Federated Learning

A major obstacle to achieving global convergence in distributed and federated learning is the misalignment of gradients across clients or mini-batches, caused by the heterogeneity and stochasticity of the distributed data. In this work, we show that data heterogeneity can in fact be exploited to improve generalization performance through implicit regularization. One way to alleviate the effects of heterogeneity is to encourage the alignment of gradients across different clients throughout training. Our analysis reveals that this goal can be accomplished with an optimization method that replicates the implicit regularization effect of SGD, yielding gradient alignment as well as improvements in test accuracy. Since this regularization in SGD arises entirely from the sequential use of different mini-batches during training, it is inherently absent when training with large mini-batches. To obtain the generalization benefits of this regularization while increasing parallelism, we propose GradAlign, a novel algorithm that induces the same implicit regularization while allowing arbitrarily large batches in each update. We experimentally validate the benefits of our algorithm in several distributed and federated learning settings.
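The mechanism can be made concrete with a small sketch. One common reading of SGD's implicit regularization is that sequential mini-batch updates effectively penalize the average squared per-batch gradient norm; since ||(1/n) Σ_i g_i||² = (1/n²)(Σ_i ||g_i||² + Σ_{i≠j} ⟨g_i, g_j⟩), penalizing Σ_i ||g_i||² for a fixed full gradient is equivalent to rewarding cross-client inner products, i.e., gradient alignment. The PyTorch sketch below makes such a penalty explicit in a single large-batch update. The function name `gradalign_style_update`, the coefficient `lam`, the penalty form, and the `client_batches` structure are illustrative assumptions, not the paper's exact GradAlign update.

```python
import torch

def gradalign_style_update(model, client_batches, loss_fn, lr=0.1, lam=0.05):
    # Hedged sketch: a large-batch step on the surrogate objective
    #   mean_i L_i(theta) + lam * mean_i ||grad L_i(theta)||^2,
    # where i ranges over clients (or mini-batches).
    # `client_batches` is assumed to be an iterable of (x, y) pairs, one per client.
    params = [p for p in model.parameters() if p.requires_grad]

    # Per-client gradients, kept differentiable (create_graph=True) so the
    # alignment penalty below can be differentiated a second time.
    client_grads = []
    for x, y in client_batches:
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        client_grads.append(torch.cat([g.reshape(-1) for g in grads]))

    mean_grad = torch.stack(client_grads).mean(dim=0)

    # Average squared per-client gradient norm. For a fixed full gradient,
    # ||mean_i g_i||^2 = (1/n^2) (sum_i ||g_i||^2 + sum_{i != j} <g_i, g_j>),
    # so penalizing sum_i ||g_i||^2 rewards cross-client inner products,
    # i.e., gradient alignment.
    penalty = torch.stack([g.pow(2).sum() for g in client_grads]).mean()

    # d(penalty)/d(theta) involves Hessian-vector products, which autograd
    # computes through the retained graph.
    penalty_grads = torch.autograd.grad(penalty, params)

    # Apply the combined update.
    with torch.no_grad():
        offset = 0
        for p, pg in zip(params, penalty_grads):
            n = p.numel()
            g = mean_grad[offset:offset + n].view_as(p)
            p.add_(g + lam * pg, alpha=-lr)
            offset += n
```

Differentiating the penalty requires one Hessian-vector product per client (roughly one extra backward pass each); practical large-batch methods may approximate this more cheaply, for instance via finite-difference extrapolation, but that implementation choice is orthogonal to the sketch.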
