Smoothness Matrices Beat Smoothness Constants: Better Communication Compression Techniques for Distributed Optimization

Large-scale distributed optimization has become the default tool for training supervised machine learning models with large numbers of parameters on large amounts of training data. Recent advances in the field provide several mechanisms for speeding up training, including compressed communication, variance reduction, and acceleration. However, none of these methods is capable of exploiting the inherently rich, data-dependent smoothness structure of the local losses beyond standard smoothness constants. In this paper, we argue that when training supervised models, smoothness matrices (information-rich generalizations of the ubiquitous smoothness constants) can and should be exploited for further dramatic gains, both in theory and practice. To further alleviate the communication burden inherent in distributed optimization, we propose a novel communication sparsification strategy that can take full advantage of the smoothness matrices associated with the local losses. To showcase the power of this tool, we describe how our sparsification technique can be adapted to three distributed optimization algorithms, DCGD [Khirirat et al., 2018], DIANA [Mishchenko et al., 2019], and ADIANA [Li et al., 2020], yielding significant savings in communication complexity. The new methods always outperform the baselines, often dramatically so.
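
For intuition on what a smoothness matrix carries beyond a smoothness constant: a loss f is matrix-smooth with matrix L if f(x + h) <= f(x) + <∇f(x), h> + (1/2) h^T L h for all x and h; the classical smoothness constant is then the largest eigenvalue of L, which discards how unevenly curvature is spread across coordinates. The short sketch below (Python/NumPy, a synthetic least-squares example of our own choosing, not the sparsification algorithm proposed in the paper) illustrates this gap: for f_i(x) = 1/(2m) * ||A_i x - b_i||^2 the smoothness matrix is L_i = (1/m) A_i^T A_i, and its diagonal is the kind of coordinate-wise information a smoothness-aware sparsifier could exploit, whereas the scalar constant treats all coordinates alike.

# Minimal illustration (assumed synthetic setup, not the paper's algorithm):
# for a least-squares loss f_i(x) = 1/(2m) * ||A_i x - b_i||^2, the Hessian is
# the constant matrix L_i = (1/m) A_i^T A_i (the "smoothness matrix"), while
# the classical smoothness constant is its largest eigenvalue.
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 50
# Ill-scaled features: each column is scaled by a different factor.
A = rng.standard_normal((m, d)) * rng.uniform(0.1, 10.0, size=d)

L_mat = A.T @ A / m                         # smoothness matrix of the local loss
L_const = np.linalg.eigvalsh(L_mat).max()   # classical smoothness constant

# A scalar-smoothness analysis treats every coordinate as equally "hard"
# (cost L_const each); the diagonal of L_mat reveals how uneven the
# coordinates actually are, which is what a matrix-aware sparsifier can use.
print("smoothness constant      :", L_const)
print("diagonal of L (min, max) :", L_mat.diagonal().min(), L_mat.diagonal().max())

On ill-conditioned data the diagonal entries of L typically differ by orders of magnitude, which is why an analysis that only sees the largest eigenvalue is overly pessimistic for most coordinates.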

[1] Ahmed M. Abdelmoniem et al. Compressed Communication for Distributed Deep Learning: Survey and Quantitative Evaluation. 2020.

[2] John C. Duchi et al. Distributed delayed stochastic optimization. 51st IEEE Conference on Decision and Control (CDC), 2012.

[3] Tong Zhang et al. Error Compensated Distributed SGD Can Be Accelerated. NeurIPS, 2020.

[4] Yurii Nesterov et al. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM J. Optim., 2012.

[5] Dan Alistarh et al. ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning. ICML, 2017.

[6] Peter Richtárik et al. Coordinate descent with arbitrary sampling II: expected separable overapproximation. Optim. Methods Softw., 2014.

[7] Junzhou Huang et al. Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization. ICML, 2018.

[8] Sebastian U. Stich et al. The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication. arXiv:1909.05350, 2019.

[9] Yurii Nesterov et al. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, 2014.

[10] Yijun Huang et al. Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization. NIPS, 2015.

[11] Martin Jaggi et al. PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization. NeurIPS, 2019.

[12] Peter Richtárik et al. Randomized Iterative Methods for Linear Systems. SIAM J. Matrix Anal. Appl., 2015.

[13] Peter Richtárik et al. Uncertainty Principle for Communication Compression in Distributed and Federated Learning and the Search for an Optimal Compressor. arXiv preprint, 2020.

[14] Ce Zhang et al. Distributed Learning Systems with First-Order Methods. Found. Trends Databases, 2020.

[15] Ji Liu et al. DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression. ICML, 2019.

[16] Peter Richtárik et al. A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent. AISTATS, 2019.

[17] Stephen J. Wright et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS, 2011.

[18] Peter Richtárik et al. On Biased Compression for Distributed Learning. arXiv preprint, 2020.

[19] Cong Xu et al. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. NIPS, 2017.

[20] Mark W. Schmidt et al. Let's Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence. 2017.

[21] Peter Richtárik et al. One Method to Rule Them All: Variance Reduction for Data, Parameters and Many New Methods. arXiv preprint, 2019.

[22] Peter Richtárik et al. Distributed Learning with Compressed Gradient Differences. arXiv preprint, 2019.

[23] Zeyuan Allen-Zhu et al. Katyusha: the first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res., 2016.

[24] Mark W. Schmidt et al. Variance-Reduced Methods for Machine Learning. Proceedings of the IEEE, 2020.

[25] Dan Alistarh et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks. arXiv:1610.02132, 2016.

[26] Eduard A. Gorbunov et al. Linearly Converging Error Compensated SGD. NeurIPS, 2020.

[27] Martin Jaggi et al. Error Feedback Fixes SignSGD and other Gradient Compression Schemes. ICML, 2019.

[28] Dan Alistarh et al. The Convergence of Sparsified Gradient Methods. NeurIPS, 2018.

[29] Ali H. Sayed et al. A Linearly Convergent Proximal Gradient Algorithm for Decentralized Optimization. NeurIPS, 2019.

[30] Peter Richtárik et al. SEGA: Variance Reduction via Gradient Sketching. NeurIPS, 2018.

[31] Peter Richtárik et al. Accelerated Coordinate Descent with Arbitrary Sampling and Best Rates for Minibatches. AISTATS, 2018.

[32] Zeyuan Allen-Zhu et al. Katyusha: the first direct acceleration of stochastic gradient methods. STOC, 2017.

[33] Peter Richtárik et al. On Stochastic Sign Descent Methods. 2019.

[34] Dong Yu et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. INTERSPEECH, 2014.

[35] Xun Qian et al. Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization. ICML, 2020.

[36] Sebastian U. Stich et al. Stochastic Distributed Learning with Gradient Quantization and Variance Reduction. arXiv:1904.05115, 2019.

[37] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). 1983.

[38] Peter Richtárik et al. Don't Jump Through Hoops and Remove Those Loops: SVRG and Katyusha are Better Without the Outer Loop. ALT, 2019.

[39] Nathan Srebro et al. Minibatch vs Local SGD for Heterogeneous Distributed Learning. NeurIPS, 2020.

[40] Martin Jaggi et al. Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication. ICML, 2019.

[41] Marco Canini et al. Natural Compression for Distributed Deep Learning. MSML, 2019.

[42] Joseph K. Bradley et al. Parallel Coordinate Descent for L1-Regularized Loss Minimization. ICML, 2011.

[43] Ohad Shamir et al. Is Local SGD Better than Minibatch SGD? ICML, 2020.

[44] Peter Richtárik et al. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 2012.

[45] Peter Richtárik et al. On optimal probabilities in stochastic coordinate descent methods. Optim. Lett., 2013.

[46] Chih-Jen Lin et al. LIBSVM: A library for support vector machines. TIST, 2011.

[47] Peter Richtárik et al. 99% of Worker-Master Communication in Distributed Optimization Is Not Needed. UAI, 2020.

[48] Peter Richtárik et al. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2011.

[49] Sarit Khirirat et al. Distributed learning with compressed gradients. arXiv:1806.06573, 2018.

[50] Kamyar Azizzadenesheli et al. signSGD: compressed optimisation for non-convex problems. ICML, 2018.

[51] Panos Kalnis et al. Scaling Distributed Machine Learning with In-Network Aggregation. NSDI, 2019.

[52] Peter Richtárik et al. Coordinate descent with arbitrary sampling I: algorithms and complexity. Optim. Methods Softw., 2014.

[53] Dimitris S. Papailiopoulos et al. ATOMO: Communication-efficient Learning via Atomic Sparsification. NeurIPS, 2018.

[54] Sashank J. Reddi et al. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. ICML, 2019.

[55] Sebastian U. Stich et al. Local SGD Converges Fast and Communicates Little. ICLR, 2018.