Moniqua: Modulo Quantized Communication in Decentralized SGD

Running Stochastic Gradient Descent (SGD) in a decentralized fashion has shown promising results. In this paper we propose Moniqua, a technique that allows decentralized SGD to use quantized communication. We prove that Moniqua communicates a bounded number of bits per iteration while converging at the same asymptotic rate as the original algorithm with full-precision communication. Moniqua improves upon prior work in that it (1) requires zero additional memory, (2) works with 1-bit quantization, and (3) is applicable to a variety of decentralized algorithms. We demonstrate empirically that Moniqua converges faster with respect to wall-clock time than other quantized decentralized algorithms. We also show that Moniqua is robust to very low bit budgets, allowing 1-bit-per-parameter communication without compromising validation accuracy when training ResNet20 and ResNet110 on CIFAR10.
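
To make the idea of modulo-based quantized communication concrete, the sketch below (not from the paper) shows one way a sender could transmit parameters reduced modulo a small window and a receiver could reconstruct them using its own nearby model as a reference. The window bound `theta`, the bit width, the stochastic rounding, and all function names are illustrative assumptions, not Moniqua's exact algorithm.

```python
import numpy as np

def modulo_encode(x, theta, bits=8):
    """Wrap x into the window [-theta, theta) and quantize the result.

    Illustrative assumption: theta bounds |x - x_ref| for any neighbor's
    reference model x_ref, so only the wrapped (low-order) value is sent.
    """
    width = 2.0 * theta
    wrapped = np.mod(x + theta, width) - theta                 # map into [-theta, theta)
    scale = width / (2 ** bits)                                # quantization step
    q = np.floor(wrapped / scale + np.random.rand(*x.shape))   # stochastic rounding
    return q.astype(np.int64)                                  # roughly `bits` bits per entry

def modulo_decode(q, x_ref, theta, bits=8):
    """Recover the sender's x from q using the receiver's own model x_ref.

    Valid when |x - x_ref| < theta (minus the quantization error): the true
    value is then the unique point near x_ref consistent with q modulo 2*theta.
    """
    width = 2.0 * theta
    scale = width / (2 ** bits)
    wrapped_hat = q * scale                                    # dequantized wrapped value
    # shift wrapped_hat by a multiple of the window width so it lands next to x_ref
    return wrapped_hat + width * np.round((x_ref - wrapped_hat) / width)

# Toy usage: two "workers" whose models are close to each other.
theta = 0.05
x_sender = np.random.randn(5)
x_receiver = x_sender + 0.01 * np.random.randn(5)              # a nearby neighbor model
q = modulo_encode(x_sender, theta)
x_hat = modulo_decode(q, x_receiver, theta)
print(np.max(np.abs(x_hat - x_sender)))                        # error is at most one quantization step
```

The point of the sketch is that the number of transmitted levels depends only on the window width `2 * theta`, not on the magnitude of the parameters themselves, which is the intuition behind communicating a bounded number of bits per iteration.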
