DEED: A General Quantization Scheme for Communication Efficiency in Bits

In distributed optimization, quantization is a popular technique for reducing communication. In this paper, we provide a general analysis framework for inexact gradient descent that is applicable to quantization schemes. We also propose a quantization scheme called Double Encoding and Error Diminishing (DEED). DEED achieves small communication complexity in three settings: frequent-communication large-memory, frequent-communication small-memory, and infrequent-communication (e.g., federated learning). More specifically, in the frequent-communication large-memory setting, DEED can be easily combined with Nesterov's method, so that the total number of bits required is $\tilde{O}(\sqrt{\kappa} \log(1/\epsilon))$, where $\tilde{O}$ hides numerical constants and $\log \kappa$ factors. In the frequent-communication small-memory setting, DEED combined with SGD requires only $\tilde{O}(\kappa \log(1/\epsilon))$ bits in the interpolation regime. In the infrequent-communication setting, DEED combined with Federated Averaging requires a smaller total number of bits than Federated Averaging alone. All of these algorithms converge at the same rate as their non-quantized versions while using fewer bits.
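
To make the communication-reduction idea concrete, the sketch below illustrates the generic building block such schemes rely on: an unbiased stochastic quantizer combined with a local error memory that feeds the rounding residual back into the next round, so quantization errors do not accumulate. This is a minimal illustration under assumed details (uniform stochastic quantization, a single worker's update), not the exact DEED encoding; the names stochastic_quantize and worker_step are hypothetical.

    import numpy as np

    def stochastic_quantize(v, num_levels=16, rng=np.random.default_rng()):
        # Unbiased stochastic uniform quantization onto `num_levels` levels.
        # Only the scale plus one small integer index per coordinate
        # (a few bits each) would need to be transmitted.
        scale = np.max(np.abs(v))
        if scale == 0.0:
            return np.zeros_like(v)
        normalized = np.abs(v) / scale * num_levels
        lower = np.floor(normalized)
        # Round up with probability equal to the fractional part, so the
        # quantized value equals v in expectation.
        level = lower + (rng.random(v.shape) < (normalized - lower))
        return np.sign(v) * scale * level / num_levels

    def worker_step(grad, error_memory, num_levels=16):
        # One communication round with error feedback: add back what was lost
        # in the previous round, quantize, and keep the new residual locally.
        corrected = grad + error_memory
        message = stochastic_quantize(corrected, num_levels)
        return message, corrected - message

    # Example: one worker quantizes a gradient and carries the residual forward.
    rng = np.random.default_rng(0)
    grad = rng.normal(size=8)
    message, residual = worker_step(grad, np.zeros(8))

The error memory is what makes the per-round quantization loss "diminish" over time rather than compound; the coarser the quantizer (fewer levels), the fewer bits per round but the larger the residual carried to the next round.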
