EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback

Error feedback (EF), also known as error compensation, is an immensely popular convergence stabilization mechanism in the context of distributed training of supervised machine learning models enhanced by the use of contractive communication compression mechanisms, such as Top-k. First proposed by Seide et al. [2014] as a heuristic, EF resisted any theoretical understanding until recently [Stich et al., 2018, Alistarh et al., 2018]. While these early breakthroughs were followed by a steady stream of works offering various improvements and generalizations, the current theoretical understanding of EF is still very limited. Indeed, to the best of our knowledge, all existing analyses either i) apply to the single-node setting only, ii) rely on very strong and often unreasonable assumptions, such as global boundedness of the gradients, or iterate-dependent assumptions that cannot be checked a priori and may not hold in practice, or iii) circumvent these issues via the introduction of additional unbiased compressors, which increase the communication cost. In this work we fix all these deficiencies by proposing and analyzing a new EF mechanism, which we call EF21, and which consistently and substantially outperforms EF in practice. Moreover, our theoretical analysis relies on standard assumptions only, works in the distributed heterogeneous data setting, and leads to better and more meaningful rates. In particular, we prove that EF21 enjoys a fast O(1/T) convergence rate for smooth nonconvex problems, beating the previous bound of O(1/T^{2/3}), which was shown under a strong bounded gradients assumption. We further improve this to a fast linear rate for Polyak-Łojasiewicz functions, which is the first linear convergence result for an error feedback method not relying on unbiased compressors. Since EF has a large number of applications where it reigns supreme, we believe that our 2021 variant, EF21, can have a large impact on the practice of communication-efficient distributed learning.
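
The abstract does not spell out the update rule, so the following is only a minimal sketch, assuming one natural instantiation of error feedback with gradient-difference compression: each worker i keeps a gradient estimate g_i and communicates only a Top-k-compressed correction to it, while the server averages the estimates and takes a gradient-type step. The names top_k and ef21_style_step, the stepsize, and the toy quadratic workers are illustrative choices, not taken from the paper.

```python
import numpy as np

def top_k(v: np.ndarray, k: int) -> np.ndarray:
    """Contractive Top-k compressor: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef21_style_step(x, g_workers, grad_fns, gamma, k):
    """One iteration of the sketched error-feedback update (assumed form, see note above).

    x         : current iterate (shared by all workers)
    g_workers : list of per-worker gradient estimates g_i
    grad_fns  : list of callables returning the local gradient of f_i
    gamma     : stepsize
    k         : sparsity level of the Top-k compressor
    """
    # Server: move along the average of the maintained estimates.
    x_new = x - gamma * np.mean(g_workers, axis=0)

    # Workers: compress the *difference* between the fresh local gradient at
    # x_new and the current estimate; only the sparse correction c_i is communicated.
    for i in range(len(g_workers)):
        c_i = top_k(grad_fns[i](x_new) - g_workers[i], k)
        g_workers[i] = g_workers[i] + c_i
    return x_new, g_workers

# Toy usage on two quadratic workers f_i(x) = 0.5 * ||A_i x - b_i||^2 (illustrative data).
rng = np.random.default_rng(0)
d, n = 10, 2
A = [rng.standard_normal((20, d)) for _ in range(n)]
b = [rng.standard_normal(20) for _ in range(n)]
grad_fns = [lambda x, A=A[i], b=b[i]: A.T @ (A @ x - b) for i in range(n)]

x = np.zeros(d)
g = [grad_fns[i](x) for i in range(n)]  # initialize g_i with the true local gradients
for _ in range(500):
    # Stepsize chosen conservatively for this toy problem.
    x, g = ef21_style_step(x, g, grad_fns, gamma=0.005, k=3)

print("final gradient norm:",
      np.linalg.norm(np.mean([gf(x) for gf in grad_fns], axis=0)))
```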

[1] Zhize Li et al. A Unified Analysis of Stochastic Gradient Methods for Nonconvex Federated Optimization. arXiv preprint, 2020.

[2] Robert M. Gower et al. Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization. Journal of Optimization Theory and Applications, 2020.

[3] Chih-Jen Lin et al. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2011.

[4] Sanjeev Arora et al. On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization. ICML, 2018.

[5] Peter Richtárik et al. Distributed Second Order Methods with Fast Rates and Compressed Communication. ICML, 2021.

[6] Peter Richtárik et al. 99% of Worker-Master Communication in Distributed Optimization Is Not Needed. UAI, 2020.

[7] Sarit Khirirat et al. Distributed learning with compressed gradients. arXiv:1806.06573, 2018.

[8] Sebastian U. Stich et al. Analysis of SGD with Biased Gradient Estimators. arXiv preprint, 2020.

[9] Martin Jaggi et al. Decentralized Deep Learning with Arbitrary Communication Compression. ICLR, 2019.

[10] Wei Zhang et al. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. NIPS, 2017.

[11] Peter Richtárik et al. Uncertainty Principle for Communication Compression in Distributed and Federated Learning and the Search for an Optimal Compressor. arXiv preprint, 2020.

[12] Alex Krizhevsky et al. Learning Multiple Layers of Features from Tiny Images. 2009.

[13] Jean-Baptiste Cordonnier et al. Convex Optimization using Sparsified Stochastic Gradient Descent with Memory. 2018.

[14] Peter Richtárik et al. Error Compensated Loopless SVRG for Distributed Optimization. 2020.

[15] Qinmin Yang et al. Lazily Aggregated Quantized Gradient Innovation for Communication-Efficient Federated Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[16] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, Kluwer Academic Publishers, 2004.

[17] Martin Jaggi et al. Error Feedback Fixes SignSGD and other Gradient Compression Schemes. ICML, 2019.

[18] Dan Alistarh et al. The Convergence of Sparsified Gradient Methods. NeurIPS, 2018.

[19] Eduard A. Gorbunov et al. MARINA: Faster Non-Convex Distributed Learning with Compression. ICML, 2021.

[20] Na Li et al. On Maintaining Linear Convergence of Distributed Learning and Optimization Under Limited Communication. IEEE Transactions on Signal Processing, 2019.

[21] Jian Sun et al. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[22] Ji Liu et al. DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression. ICML, 2019.

[23] Natalia Gimelshein et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS, 2019.

[24] Peter Richtárik et al. A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent. AISTATS, 2019.

[25] Marco Canini et al. Natural Compression for Distributed Deep Learning. MSML, 2019.

[27] Tim Verbelen et al. A Survey on Distributed Machine Learning. ACM Computing Surveys, 2019.

[28] Prateek Jain et al. Non-convex Optimization for Machine Learning. Foundations and Trends in Machine Learning, 2017.

[29] Sebastian U. Stich et al. The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication. arXiv:1909.05350, 2019.

[30] Sebastian U. Stich et al. Sparsified SGD with Memory. NeurIPS, 2018.

[31] Peter Richtárik et al. Distributed Learning with Compressed Gradient Differences. arXiv preprint, 2019.

[32] Eduard A. Gorbunov et al. Linearly Converging Error Compensated SGD. NeurIPS, 2020.

[33] Frank Seide et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. INTERSPEECH, 2014.

[34] Xun Qian et al. Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization. ICML, 2020.

[35] Xiangliang Zhang et al. PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization. ICML, 2020.

[36] Sebastian U. Stich et al. Stochastic Distributed Learning with Gradient Quantization and Variance Reduction. arXiv:1904.05115, 2019.

[37] Peter Richtárik et al. On Biased Compression for Distributed Learning. arXiv preprint, 2020.

[38] Suhas Diggavi et al. Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations. IEEE Journal on Selected Areas in Information Theory, 2019.

[39] Tong Zhang et al. Error Compensated Distributed SGD Can Be Accelerated. NeurIPS, 2020.

[40] Indranil Gupta et al. CSER: Communication-efficient SGD with Error Reset. NeurIPS, 2020.

[41] Dan Alistarh et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks. arXiv:1610.02132, 2016.

[42] Ahmed M. Abdelmoniem et al. Compressed Communication for Distributed Deep Learning: Survey and Quantitative Evaluation. 2020.