On the Convergence of Quantized Parallel Restarted SGD for Serverless Learning

With growing data volumes and increasing concerns about data privacy, Stochastic Gradient Descent (SGD) based distributed training of deep neural networks has been widely recognized as a promising approach. Compared with server-based architectures, serverless architectures with All-Reduce (AR) and Gossip paradigms can alleviate network congestion. To further reduce the communication overhead, we develop Quantized-PR-SGD, a novel compression approach for serverless learning that integrates quantization and parallel restarted (PR) techniques to compress the exchanged information and to reduce the synchronization frequency, respectively. Establishing theoretical guarantees for the proposed compression scheme is challenging because the precision loss incurred by quantization and the gradient deviation incurred by PR interact with each other. Moreover, in the Gossip paradigm, accumulated errors that are not strictly controlled can prevent training from converging. Therefore, we bound the accumulated errors according to the synchronization mode and network topology in order to analyze the convergence properties of Quantized-PR-SGD. For both the AR and Gossip paradigms, our theoretical results show that Quantized-PR-SGD converges at a rate of $O(1/\sqrt{NM})$ for non-convex objectives, where $N$ is the total number of iterations and $M$ is the number of nodes. This indicates that Quantized-PR-SGD retains the same order of convergence rate despite compression and achieves linear speedup with respect to the number of nodes. Empirical studies on various machine learning models demonstrate that, compared with PR-SGD, communication overhead is reduced by 90\% and convergence is accelerated by up to 3.2$\times$ under low-bandwidth networks.
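To make the scheme concrete, the following is a minimal single-process sketch of Quantized-PR-SGD in the All-Reduce paradigm: M workers each take H local SGD steps, stochastically quantize their local models, and then restart from the average of the quantized models. The least-squares objective, the quantize helper, and all hyper-parameters here are illustrative assumptions for the sketch, not the paper's implementation.

```python
# Minimal simulation of Quantized-PR-SGD (All-Reduce paradigm), under assumed
# details: M workers run H local SGD steps on a least-squares objective, then
# exchange stochastically quantized models and restart from their average.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: minimize (1/2) * ||A x - b||^2 over x.
d, n_samples = 20, 1000
A = rng.normal(size=(n_samples, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n_samples)

def stochastic_grad(x, batch=32):
    """Mini-batch gradient of the least-squares loss (the local oracle)."""
    idx = rng.integers(0, n_samples, size=batch)
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / batch

def quantize(v, levels=16):
    """Unbiased stochastic uniform quantization of vector v (illustrative)."""
    scale = np.max(np.abs(v)) + 1e-12
    u = np.abs(v) / scale * levels                 # scaled magnitudes in [0, levels]
    low = np.floor(u)
    q = low + (rng.random(v.shape) < (u - low))    # randomized rounding, E[q] = u
    return np.sign(v) * q * scale / levels

# Quantized-PR-SGD: M workers, H local steps between quantized All-Reduce rounds.
M, H, T, lr = 8, 10, 500, 0.01
x_local = np.zeros((M, d))                         # each worker's model copy
for t in range(T):
    for m in range(M):                             # one local SGD step per worker
        x_local[m] -= lr * stochastic_grad(x_local[m])
    if (t + 1) % H == 0:                           # synchronization round
        q_models = np.stack([quantize(x_local[m]) for m in range(M)])
        avg = q_models.mean(axis=0)                # All-Reduce of quantized models
        x_local[:] = avg                           # restart all workers from the average

print("final error:", np.linalg.norm(x_local[0] - x_true))
```

In a real deployment the averaging step would be a quantized All-Reduce (or a Gossip exchange with neighbors on the network topology) rather than an in-memory mean, but the local-step / quantize / restart structure is the same.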
