On Communication Compression for Distributed Optimization on Heterogeneous Data

Lossy gradient compression, with either unbiased or biased compressors, has become a key tool to avoid the communication bottleneck in centrally coordinated distributed training of machine learning models. We analyze the performance of two standard and general classes of methods in the heterogeneous (non-iid) data setting: (i) distributed quantized SGD (D-QSGD) with arbitrary unbiased quantizers and (ii) distributed SGD with error feedback and biased compressors (D-EF-SGD). Our results indicate that D-EF-SGD is much less affected by non-iid data than D-QSGD, but both methods can suffer a slowdown when data skewness is high. We propose two alternatives that are not (or far less) affected by heterogeneous data distributions: a new method that is applicable only to strongly convex problems, and a more general approach that applies to linear compressors.
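
The abstract contrasts two compression-based update rules; the following minimal sketch, not taken from the paper, illustrates the difference on a single worker's gradient. The helper names (top_k, random_quantize), the step size eta, and the per-vector quantization scheme are illustrative assumptions, not the paper's exact construction.

```python
# Sketch of one worker step for D-QSGD (unbiased quantizer) vs. D-EF-SGD
# (biased compressor with error feedback). Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def top_k(v, k):
    """Biased compressor: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def random_quantize(v, levels=4):
    """Unbiased stochastic quantizer, E[Q(v)] = v (QSGD-style, per-vector scale)."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return v.copy()
    scaled = np.abs(v) / norm * levels
    lower = np.floor(scaled)
    q = lower + (rng.random(v.shape) < (scaled - lower))  # randomized rounding
    return np.sign(v) * q * norm / levels

def d_qsgd_step(x, grad, eta):
    """D-QSGD: each worker transmits an unbiased quantization of its gradient."""
    return x - eta * random_quantize(grad)

def d_ef_sgd_step(x, grad, memory, eta, k):
    """D-EF-SGD: compress (step + residual memory), keep the un-sent part locally."""
    update = eta * grad + memory   # add back mass that was not transmitted before
    sent = top_k(update, k)        # biased compression of the corrected update
    memory = update - sent         # error feedback: remember what was dropped
    return x - sent, memory

# Toy usage on a random gradient.
x = np.ones(10)
g = rng.normal(size=10)
mem = np.zeros(10)
x_q = d_qsgd_step(x, g, eta=0.1)
x_ef, mem = d_ef_sgd_step(x, g, mem, eta=0.1, k=3)
```

In the non-iid setting discussed in the abstract, the local memory term in d_ef_sgd_step is what allows the biased compressor to recover the full gradient information over time, which is one intuition for why error feedback is less sensitive to heterogeneous data than plain unbiased quantization.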
