Stochastic distributed learning with gradient quantization and double-variance reduction

ABSTRACT We consider distributed optimization over several devices, each sending incremental model updates to a central server. This setting is considered, for instance, in federated learning. Various schemes have been designed to compress the model updates in order to reduce the overall communication cost. However, existing methods suffer from a significant slowdown due to the additional variance introduced by the compression operator and, as a result, converge only sublinearly. What is needed is a variance reduction technique for taming the variance introduced by compression. We propose the first methods that achieve linear convergence for arbitrary compression operators. For strongly convex functions with condition number κ, distributed among n machines with a finite-sum structure in which each worker holds fewer than m components, our methods converge linearly; we further (i) give analysis for the weakly convex and the non-convex cases and (ii) verify in experiments that our novel variance-reduced schemes are more efficient than the baselines. Moreover, we show theoretically that as the number of devices increases, higher compression levels become possible without affecting the overall number of communication rounds in comparison with methods that do not perform any compression. This leads to a significant reduction in communication cost. Our general analysis allows one to pick the most suitable compression for each problem, finding the right balance between additional variance and communication savings. Finally, we (iii) give analysis for arbitrary quantized updates.
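The mechanism behind these claims can be illustrated with a short simulation. The sketch below is not the paper's exact algorithm: it pairs an unbiased rand-k sparsifier (variance parameter ω = d/k − 1) with a DIANA-style per-worker shift that compresses gradient differences rather than gradients, which is what drives the compressor's variance to zero as the iterates approach the optimum; the paper's double variance reduction additionally handles the stochasticity of the local finite-sum gradients. All function names, the quadratic test problem, and the step-size choices are illustrative assumptions.

# A minimal sketch (assumed names and parameters, not the paper's exact method):
# distributed gradient descent in which each of n workers sends only a compressed
# gradient *difference* and keeps a shift h_i that learns its local gradient at
# the optimum, so the extra variance injected by compression is removed.
import numpy as np


def rand_k(x, k, rng):
    """Unbiased rand-k sparsification: keep k random coordinates, rescale by
    d/k so that E[C(x)] = x; its variance parameter is omega = d/k - 1."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out


def compressed_gd_with_shifts(A_list, b_list, lr=0.01, alpha=0.2, k=2,
                              steps=2000, seed=0):
    """Minimize (1/n) * sum_i 0.5 * ||A_i x - b_i||^2 using only compressed
    worker-to-server communication."""
    rng = np.random.default_rng(seed)
    n, d = len(A_list), A_list[0].shape[1]
    x = np.zeros(d)
    h = [np.zeros(d) for _ in range(n)]   # per-worker shifts, mirrored on the server
    for _ in range(steps):
        g_hat = np.zeros(d)
        for i in range(n):
            g_i = A_list[i].T @ (A_list[i] @ x - b_list[i])  # local gradient
            m_i = rand_k(g_i - h[i], k, rng)                 # the only vector sent to the server
            g_hat += h[i] + m_i                              # unbiased estimate of g_i
            h[i] = h[i] + alpha * m_i                        # shift update (same on worker and server)
        x -= lr * g_hat / n
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, rows, d = 5, 20, 10
    A_list = [rng.standard_normal((rows, d)) for _ in range(n)]
    b_list = [rng.standard_normal(rows) for _ in range(n)]
    x_hat = compressed_gd_with_shifts(A_list, b_list)
    A, b = np.vstack(A_list), np.concatenate(b_list)
    x_star = np.linalg.lstsq(A, b, rcond=None)[0]            # exact minimizer for comparison
    print("distance to optimum:", np.linalg.norm(x_hat - x_star))

The rand-k compressor makes the trade-off described in the abstract explicit: a smaller k means fewer coordinates sent per round but a larger ω, which forces a smaller step size unless n is large enough to average the compression noise away. Balancing these two effects is what the general analysis is about.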
