vqSGD: Vector Quantized Stochastic Gradient Descent

In this work, we present a family of vector quantization schemes, \emph{vqSGD} (Vector-Quantized Stochastic Gradient Descent), that provides an asymptotic reduction in the communication cost with convergence guarantees in first-order distributed optimization. In the process, we derive the following fundamental information-theoretic fact: $\Theta(\frac{d}{R^2})$ bits are necessary and sufficient to describe an unbiased estimator $\hat{g}(g)$ for any $g$ in the $d$-dimensional unit sphere, under the constraint that $\|\hat{g}(g)\|_2 \le R$ almost surely. In particular, we consider a randomized scheme based on the convex hull of a point set that returns an unbiased estimator of a $d$-dimensional gradient vector with almost surely bounded norm. We provide multiple efficient, near-optimal instances of our scheme that require only $o(d)$ bits of communication, at the expense of a tolerable increase in estimation error. These instances are obtained using the properties of binary error-correcting codes and provide a smooth trade-off between the communication cost and the estimation error of quantization. Furthermore, we show that \emph{vqSGD} also offers strong privacy guarantees.
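
To make the convex-hull construction concrete, the sketch below illustrates the general recipe: pick a point set whose convex hull contains every admissible gradient, write the gradient as a convex combination of those points, sample one point with probability equal to its coefficient, and communicate only its index. The specific point set used here, the scaled cross-polytope vertices $\{\pm R\, e_i\}$, and all function names are illustrative assumptions for this sketch, not the paper's near-optimal code-based instances.

```python
import numpy as np

def quantize_cross_polytope(g, R=None, rng=None):
    """Sketch of an unbiased convex-hull quantizer: sample one vertex of the
    scaled cross-polytope {+-R * e_i} so that the expectation of the decoded
    vertex equals g.  Requires ||g||_1 <= R (e.g. R = sqrt(d) if ||g||_2 <= 1)."""
    rng = np.random.default_rng() if rng is None else rng
    d = g.shape[0]
    R = np.sqrt(d) if R is None else R
    assert np.abs(g).sum() <= R * (1 + 1e-9), "g must lie in the convex hull"

    # Coefficient of +R*e_i comes from the positive part of g_i / R (and
    # symmetrically for -R*e_i); leftover probability mass is spread uniformly
    # over all 2d vertices, which sum to zero, so unbiasedness is preserved.
    p_plus = np.maximum(g, 0.0) / R
    p_minus = np.maximum(-g, 0.0) / R
    leftover = max(0.0, 1.0 - (p_plus.sum() + p_minus.sum()))
    probs = np.concatenate([p_plus, p_minus]) + leftover / (2 * d)
    probs /= probs.sum()                      # guard against floating-point drift

    return rng.choice(2 * d, p=probs)         # only ceil(log2(2d)) bits to send


def dequantize_cross_polytope(idx, d, R):
    """Decode an index back into the corresponding vertex +-R * e_i."""
    v = np.zeros(d)
    if idx < d:
        v[idx] = R
    else:
        v[idx - d] = -R
    return v


# Quick unbiasedness check: the average of many decoded samples approaches g.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.normal(size=16)
    g /= np.linalg.norm(g)                    # gradient on the unit sphere
    d = g.shape[0]
    R = np.sqrt(d)
    samples = [dequantize_cross_polytope(quantize_cross_polytope(g, R, rng), d, R)
               for _ in range(100000)]
    print(np.linalg.norm(np.mean(samples, axis=0) - g))   # small (unbiased)
```

Only the sampled index, i.e. $\lceil \log_2 2d \rceil$ bits, is sent per gradient; averaging the decoded vertices across many workers concentrates around the true mean gradient, while the decoded vector satisfies $\|\hat{g}\|_2 = R$ almost surely, matching the constraint stated above.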
