EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback

First proposed by Seide et al. (2014) as a heuristic, error feedback (EF) is a very popular mechanism for enforcing convergence of distributed gradient-based optimization methods enhanced with communication compression strategies based on the application of contractive compression operators. However, existing theory of EF relies on very strong assumptions (e.g., bounded gradients), and provides pessimistic convergence rates (e.g., while the best known rate for EF in the smooth nonconvex regime, and when full gradients are compressed, is O(1/T^{2/3}), the rate of gradient descent in the same regime is O(1/T)). Recently, Richtárik et al. (2021) proposed a new error feedback mechanism, EF21, based on the construction of a Markov compressor induced by a contractive compressor. EF21 removes the aforementioned theoretical deficiencies of EF and at the same time works better in practice. In this work we propose six practical extensions of EF21, all supported by strong convergence theory: partial participation, stochastic approximation, variance reduction, proximal setting, momentum, and bidirectional compression. Several of these techniques were never analyzed in conjunction with EF before, and in cases where they were (e.g., bidirectional compression), our rates are vastly superior.
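To make the mechanism concrete, here is an illustrative sketch of the base EF21 iteration (not the authors' implementation): each worker i maintains a gradient estimator g_i, transmits only the compressed shift C(∇f_i(x) − g_i), and both the workers and the server update their estimators with these shifts. The NumPy sketch below assumes n workers with full local gradients and a Top-K contractive compressor; the names top_k and ef21 and all parameter defaults are illustrative choices, not part of the paper.

```python
import numpy as np

def top_k(v, k):
    """Top-K contractive compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef21(grads, x0, lr=0.1, k=1, steps=100):
    """Minimal EF21 sketch: n workers, full local gradients, Top-K compression.

    grads: list of callables; grads[i](x) returns worker i's gradient at x.
    Each worker i keeps a gradient estimator g[i] and only transmits the
    compressed shift c_i = C(grad_i(x) - g[i]); the server keeps their average.
    """
    n = len(grads)
    x = x0.copy()
    g = [gi(x) for gi in grads]                  # one uncompressed round to initialize
    g_avg = sum(g) / n
    for _ in range(steps):
        x = x - lr * g_avg                       # model step with the aggregated estimator
        shifts = [top_k(grads[i](x) - g[i], k)   # compressed messages sent to the server
                  for i in range(n)]
        for i in range(n):
            g[i] = g[i] + shifts[i]              # worker-side estimator update
        g_avg = g_avg + sum(shifts) / n          # server-side aggregate update from the same shifts
    return x
```

The communication saving comes from the server needing only the sparse shifts to maintain the aggregate g_avg; the six extensions proposed in the paper (partial participation, stochastic approximation, variance reduction, proximal setting, momentum, and bidirectional compression) build on this basic loop.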

[1] Amir Beck et al. First-Order Methods in Optimization, 2017.

[2] Peter Richtárik et al. FedPAGE: A Fast Local Stochastic Gradient Method for Communication-Efficient Federated Learning, 2021, arXiv.

[3] Tianbao Yang et al. Unified Convergence Analysis of Stochastic Momentum Methods for Convex and Non-convex Optimization, 2016, arXiv:1604.03257.

[4] Léon Bottou et al. Stochastic Gradient Descent Tricks, 2012, Neural Networks: Tricks of the Trade.

[5] Dong Yu et al. 1-Bit Stochastic Gradient Descent and Its Application to Data-Parallel Distributed Training of Speech DNNs, 2014, INTERSPEECH.

[6] Haibo Yang et al. Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning, 2021, ICLR.

[7] Sebastian U. Stich et al. Stochastic Distributed Learning with Gradient Quantization and Variance Reduction, 2019, arXiv:1904.05115.

[8] Peter Richtárik et al. A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning, 2020, ICLR.

[9] Nathan Srebro et al. Lower Bounds for Non-Convex Stochastic Optimization, 2019, arXiv.

[10] Eduard A. Gorbunov et al. MARINA: Faster Non-Convex Distributed Learning with Compression, 2021, ICML.

[11] Martin Jaggi et al. PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization, 2019, NeurIPS.

[12] James Demmel et al. Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes, 2019, ICLR.

[13] Eduard A. Gorbunov et al. Linearly Converging Error Compensated SGD, 2020, NeurIPS.

[14] Indranil Gupta et al. CSER: Communication-Efficient SGD with Error Reset, 2020, NeurIPS.

[15] Dan Alistarh et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks, 2016, arXiv:1610.02132.

[16] Ji Liu et al. DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression, 2019, ICML.

[17] Martin Jaggi et al. Sparsified SGD with Memory, 2018, NeurIPS.

[18] Peter Richtárik et al. On Biased Compression for Distributed Learning, 2020, arXiv.

[19] Jianyu Wang et al. Client Selection in Federated Learning: Convergence Analysis and Power-of-Choice Selection Strategies, 2020, arXiv.

[20] Léon Bottou. Curiously Fast Convergence of Some Stochastic Gradient Descent Algorithms, 2009.

[21] Chih-Jen Lin et al. LIBSVM: A Library for Support Vector Machines, 2011, TIST.

[22] Sanjeev Arora et al. On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization, 2018, ICML.

[23] Shai Ben-David et al. Understanding Machine Learning: From Theory to Algorithms, 2014.

[24] Zeyuan Allen-Zhu et al. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods, 2017, STOC.

[25] Peter Richtárik et al. Distributed Second Order Methods with Fast Rates and Compressed Communication, 2021, ICML.

[26] Dan Alistarh et al. The Convergence of Sparsified Gradient Methods, 2018, NeurIPS.

[27] Marco Canini et al. Natural Compression for Distributed Deep Learning, 2019, MSML.