RATQ: A Universal Fixed-Length Quantizer for Stochastic Optimization

We present Rotated Adaptive Tetra-iterated Quantizer (RATQ), a fixed-length quantizer for gradients in first-order stochastic optimization. RATQ is easy to implement and involves only a Hadamard transform computation and adaptive uniform quantization with appropriately chosen dynamic ranges. For noisy gradients with almost surely bounded Euclidean norms, we establish an information-theoretic lower bound on the optimization accuracy achievable with finite-precision gradients and show that RATQ almost attains this lower bound. For mean-square-bounded noisy gradients, we use a gain-shape quantizer that quantizes the Euclidean norm (the gain) separately and applies RATQ to the normalized unit-norm vector (the shape). We establish lower bounds on the performance of any optimization procedure and any shape quantizer when used with a uniform gain quantizer. Finally, we propose an adaptive gain quantizer which, when used with RATQ as the shape quantizer, outperforms uniform gain quantization and is, in fact, close to optimal. As a by-product, we show that our fixed-length quantizer RATQ performs almost as well as the optimal variable-length quantizers for distributed mean estimation. We also obtain an efficient quantizer for Gaussian vectors that attains a rate very close to the Gaussian rate-distortion function and is, in fact, universal for subgaussian input vectors.
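
The sketch below illustrates the rotate-then-adaptively-quantize structure described above: a randomized Hadamard rotation followed by coordinate-wise dithered uniform quantization whose dynamic range is chosen adaptively per coordinate. It is a minimal sketch under stated assumptions, not the paper's algorithm: the range list `ranges`, the number of levels `k`, and the helper names (`hadamard_transform`, `adaptive_quantize`, `ratq_like`) are illustrative placeholders rather than the tetra-iterated range schedule and bit allocation analyzed in the paper, and the dimension is assumed to be a power of two so a plain Walsh-Hadamard transform applies.

```python
# A minimal sketch of rotate-then-adaptively-quantize. The dynamic-range
# schedule and bit allocation below are illustrative placeholders, not the
# parameters from the paper.
import numpy as np


def hadamard_transform(x):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two)."""
    y = x.copy()
    n = len(y)
    h = 1
    while h < n:
        y = y.reshape(-1, 2 * h)
        a, b = y[:, :h].copy(), y[:, h:].copy()
        y[:, :h] = a + b
        y[:, h:] = a - b
        y = y.reshape(-1)
        h *= 2
    return y / np.sqrt(n)


def rotate(x, signs):
    """Randomized Hadamard rotation R = (1/sqrt(n)) H D, with D = diag(signs)."""
    return hadamard_transform(x * signs)


def uniform_quantize(y, M, k, rng):
    """Dithered uniform quantization onto k levels in [-M, M].
    Coordinates outside [-M, M] are flagged as overflow."""
    overflow = np.abs(y) > M
    step = 2 * M / (k - 1)
    z = (y + M) / step
    lower = np.floor(z)
    q = lower + (rng.random(len(y)) < (z - lower))  # randomized rounding
    return np.where(overflow, 0.0, -M + step * q), overflow


def adaptive_quantize(y, ranges, k, rng):
    """Per coordinate, use the smallest dynamic range in `ranges` that avoids
    overflow (the 'adaptive' part; `ranges` stands in for the paper's schedule)."""
    out = np.zeros_like(y)
    unresolved = np.ones(len(y), dtype=bool)
    for M in ranges:
        q, overflow = uniform_quantize(y, M, k, rng)
        newly = unresolved & ~overflow
        out[newly] = q[newly]
        unresolved &= overflow
    return out


def ratq_like(x, ranges, k, seed=0):
    """Rotate, adaptively quantize, rotate back."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=len(x))
    y = rotate(x, signs)
    q = adaptive_quantize(y, ranges, k, rng)
    # The normalized Hadamard matrix is orthogonal and symmetric, so the
    # inverse rotation is D applied after (1/sqrt(n)) H.
    return hadamard_transform(q) * signs


if __name__ == "__main__":
    d = 64                                      # power of two for the plain WHT
    g = np.random.default_rng(1).normal(size=d)
    g /= np.linalg.norm(g)                      # unit-norm "shape" vector
    ranges = [0.5, 1.0, 2.0, 4.0]               # illustrative dynamic ranges
    g_hat = ratq_like(g, ranges, k=16)
    print("relative error:", np.linalg.norm(g - g_hat) / np.linalg.norm(g))
```

For the gain-shape scheme mentioned in the abstract, one would additionally transmit a quantized version of the Euclidean norm (the gain) and apply a routine like the one above only to the normalized vector, as is done for `g` in the usage example.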
