SKCompress: compressing sparse and nonuniform gradient in distributed machine learning

Distributed machine learning (ML) has been extensively studied to meet the explosive growth of training data. A wide range of machine learning models are trained with a family of first-order optimization algorithms, notably stochastic gradient descent (SGD). The core operation of SGD is the calculation of gradients. When executing SGD in a distributed environment, the workers need to exchange local gradients over the network. To reduce the communication cost, a category of quantization-based compression algorithms transforms the gradients into a binary format at the cost of a small loss of precision. Although the existing approaches work well for dense gradients, we find that they are ill-suited for the many cases where the gradients are sparse and nonuniformly distributed. In this paper, we ask: is there a compression framework that can efficiently handle sparse and nonuniform gradients? We propose a general compression framework, called SKCompress, that compresses both the values and the keys of sparse gradients. Our first contribution is a sketch-based method that compresses the gradient values. Sketches are a class of algorithms that approximate the distribution of a data stream with a probabilistic data structure. We first use a quantile sketch to generate splits, sort the gradient values into buckets, and encode them with their bucket indexes. Our second contribution is a new sketch algorithm, MinMaxSketch, which compresses the bucket indexes. MinMaxSketch builds a set of hash tables and resolves hash collisions with a MinMax strategy. Since the bucket indexes are nonuniformly distributed, we further adopt Huffman coding to compress the output of MinMaxSketch. To compress the keys of sparse gradients, the third contribution of this paper is a delta-binary encoding method that calculates the increments of the gradient keys and encodes them in binary format. We further propose an adaptive prefix that assigns different sizes to different gradient keys, saving additional space. We also analyze the correctness and the error bounds of the proposed methods. To the best of our knowledge, this is the first effort that utilizes data sketches to compress gradients in ML. We implement a prototype system in a real cluster of our industrial partner Tencent Inc. and show that our method is up to 12× faster than existing methods.
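
A minimal sketch may make the value-compression step concrete. The snippet below is not the paper's implementation: it replaces the streaming quantile sketch with exact NumPy quantiles, and the function names `compress_values` and `decompress_values` are hypothetical; it only illustrates the idea of bucketing gradient values by quantile splits and sending the bucket indexes.

```python
import numpy as np

def compress_values(grad_values, num_buckets=256):
    """Quantize gradient values into quantile buckets.

    Returns the bucket splits and, for each value, the index of the
    bucket it falls into. A real implementation would build the splits
    with a streaming quantile sketch instead of exact quantiles.
    """
    # Candidate splits: evenly spaced quantiles of the observed values.
    quantiles = np.linspace(0.0, 1.0, num_buckets + 1)[1:-1]
    splits = np.quantile(grad_values, quantiles)
    # Each value is represented by the index of its bucket
    # (a single byte when num_buckets <= 256).
    bucket_idx = np.searchsorted(splits, grad_values).astype(np.uint8)
    return splits, bucket_idx

def decompress_values(splits, bucket_idx):
    """Reconstruct approximate gradient values from bucket indexes,
    using the midpoint of each bucket as the representative value."""
    edges = np.concatenate(([splits[0]], splits, [splits[-1]]))
    mids = (edges[:-1] + edges[1:]) / 2.0
    return mids[bucket_idx]
```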
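
The collision handling in MinMaxSketch can be sketched as follows, under an assumption about what the MinMax strategy means: keep the smaller bucket index when an insert collides, and take the maximum over the hash rows when querying. The class name, parameters, and hashing scheme below are placeholders; the paper's actual data structure and analysis differ in detail.

```python
import numpy as np

class MinMaxSketch:
    """Hash tables of bucket indexes, loosely modeled on the MinMaxSketch
    idea (min-merge on insert, max over rows on query)."""

    def __init__(self, depth=3, width=1 << 16, num_buckets=256, seed=7):
        self.depth = depth
        self.width = width
        rng = np.random.default_rng(seed)
        # One random multiplier per hash row.
        self.seeds = rng.integers(1, 2**31 - 1, size=depth)
        # Initialize every cell to the largest bucket index so that the
        # min-merge on insert is well defined.
        self.table = np.full((depth, width), num_buckets - 1, dtype=np.uint8)

    def _hash(self, key, row):
        return (key * int(self.seeds[row]) + row) % self.width

    def insert(self, key, bucket_idx):
        for r in range(self.depth):
            c = self._hash(key, r)
            # Min strategy: on collision, keep the smaller bucket index.
            self.table[r, c] = min(int(self.table[r, c]), bucket_idx)

    def query(self, key):
        # Each cell can only have been pushed down by collisions, so the
        # max over the rows is the tightest available estimate.
        return max(int(self.table[r, self._hash(key, r)]) for r in range(self.depth))
```

Because the bucket indexes stored this way remain nonuniformly distributed, a further entropy coder such as Huffman coding can shrink them, which matches the pipeline described in the abstract.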
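
For the key side, the delta-binary idea can be illustrated with a small sketch. It assumes the keys are sorted non-negative integer feature indices; `encode_keys` and `decode_keys` are hypothetical names, and the one-byte width marker stands in for the paper's adaptive prefix, which packs this information more compactly.

```python
def encode_keys(keys):
    """Delta-encode sorted gradient keys, using the smallest byte width
    that holds each increment. Each increment is preceded by a one-byte
    width marker in this simplified version."""
    out = bytearray()
    prev = 0
    for k in sorted(keys):
        delta = k - prev
        prev = k
        # Choose the smallest number of bytes that holds the increment.
        nbytes = 1
        while delta >= (1 << (8 * nbytes)):
            nbytes += 1
        out.append(nbytes - 1)                    # adaptive width marker
        out += delta.to_bytes(nbytes, "little")   # the increment itself
    return bytes(out)

def decode_keys(data):
    """Invert encode_keys: read the width marker, read that many bytes,
    and add the increment to the running key."""
    keys, prev, i = [], 0, 0
    while i < len(data):
        nbytes = data[i] + 1
        prev += int.from_bytes(data[i + 1 : i + 1 + nbytes], "little")
        keys.append(prev)
        i += 1 + nbytes
    return keys
```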
