Stochastic Sign Descent Methods: New Algorithms and Better Theory

Various gradient compression schemes have been proposed to mitigate the communication cost in distributed training of large-scale machine learning models. Sign-based methods, such as signSGD (Bernstein et al., 2018), have recently gained popularity because of their simple compression rule and their connection to adaptive gradient methods such as ADAM. In this paper, we analyze sign-based methods for non-convex optimization in three key settings: (i) standard single node, (ii) parallel with shared data, and (iii) distributed with partitioned data. For the single-node case, we generalize the previous analysis of signSGD by relying on intuitive bounds on success probabilities and allowing even biased estimators. We then extend the analysis to the parallel setting within a parameter server framework, where exponentially fast noise reduction with respect to the number of nodes is guaranteed, while maintaining 1-bit compression in both directions and using small mini-batch sizes. Next, we identify a fundamental issue that prevents signSGD from converging in the distributed environment. To resolve this issue, we propose a new sign-based method, Stochastic Sign Descent with Momentum (SSDM), which converges under the standard bounded variance assumption with the optimal asymptotic rate. We validate several aspects of our theoretical findings with numerical experiments.
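To make the compression rule concrete, below is a minimal NumPy sketch of the two baseline settings referenced in the abstract: single-node signSGD and parallel signSGD with majority vote on a parameter server. The toy quadratic objective, noise model, step size, and worker count are illustrative placeholders and not the paper's experimental setup; SSDM itself is specified in the body of the paper and is not reproduced here.

```python
# Minimal sketch of sign-based updates: single-node signSGD and
# majority-vote aggregation in a parameter server setup.
# The quadratic objective, noise scale, step size, and worker count
# are illustrative placeholders, not the paper's configuration.
import numpy as np

rng = np.random.default_rng(0)
dim, num_workers, steps, lr = 10, 5, 200, 0.01
x_star = rng.normal(size=dim)          # minimizer of the toy objective

def stochastic_grad(x):
    """Gradient of 0.5 * ||x - x_star||^2 plus Gaussian noise."""
    return (x - x_star) + rng.normal(scale=0.5, size=x.shape)

# --- Single-node signSGD: step along the sign of a stochastic gradient.
x = np.zeros(dim)
for _ in range(steps):
    x -= lr * np.sign(stochastic_grad(x))

# --- Parallel signSGD with majority vote: each worker sends the sign of
# its stochastic gradient (1 bit per coordinate); the server broadcasts
# back the coordinate-wise majority sign (again 1 bit per coordinate).
y = np.zeros(dim)
for _ in range(steps):
    worker_signs = np.stack(
        [np.sign(stochastic_grad(y)) for _ in range(num_workers)]
    )
    majority = np.sign(worker_signs.sum(axis=0))
    y -= lr * majority

print(np.linalg.norm(x - x_star), np.linalg.norm(y - x_star))
```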

[1] Martin A. Riedmiller, et al. A direct adaptive method for faster backpropagation learning: the RPROP algorithm, 1993, IEEE International Conference on Neural Networks.

[2] Philipp Hennig, et al. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients, 2017, ICML.

[3] Saeed Ghadimi, et al. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming, 2013, SIAM J. Optim.

[4] H. Robbins. A Stochastic Approximation Method, 1951.

[5] Zhiwei Steven Wu, et al. Distributed Training with Heterogeneous Data: Bridging Median- and Mean-Based Algorithms, 2019, NeurIPS.

[6] Mingyi Hong, et al. signSGD via Zeroth-Order Oracle, 2019, ICLR.

[7] Dan Alistarh, et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks, 2016, arXiv:1610.02132.

[8] Nam Sung Kim, et al. GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training, 2018, NeurIPS.

[9] O. Papaspiliopoulos. High-Dimensional Probability: An Introduction with Applications in Data Science, 2020.

[10] Ji Liu, et al. Gradient Sparsification for Communication-Efficient Distributed Optimization, 2017, NeurIPS.

[11] Volkan Cevher, et al. Stochastic Spectral Descent for Restricted Boltzmann Machines, 2015, AISTATS.

[12] Ji Liu, et al. DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression, 2019, ICML.

[13] Jürgen Schmidhuber, et al. Deep learning in neural networks: An overview, 2014, Neural Networks.

[14] Martin Jaggi, et al. PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization, 2019, NeurIPS.

[15] William J. Dally, et al. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training, 2017, ICLR.

[16] Mark W. Schmidt, et al. Minimizing finite sums with the stochastic average gradient, 2013, Mathematical Programming.

[17] Martin Jaggi, et al. Error Feedback Fixes SignSGD and other Gradient Compression Schemes, 2019, ICML.

[18] Sanjiv Kumar, et al. Adaptive Methods for Nonconvex Optimization, 2018, NeurIPS.

[19] Nikko Strom, et al. Scalable distributed DNN training using commodity GPU cloud computing, 2015, INTERSPEECH.

[20] Dimitris S. Papailiopoulos, et al. ATOMO: Communication-efficient Learning via Atomic Sparsification, 2018, NeurIPS.

[21] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[22] Cong Xu, et al. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, 2017, NIPS.

[23] Ashok Cutkosky, et al. Momentum Improves Normalized SGD, 2020, ICML.

[24] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[25] Sarit Khirirat, et al. Distributed learning with compressed gradients, 2018, arXiv:1806.06573.

[26] Kamyar Azizzadenesheli, et al. signSGD: compressed optimisation for non-convex problems, 2018, ICML.

[27] Peter Richtárik, et al. Distributed Learning with Compressed Gradient Differences, 2019, arXiv.

[28] Mark W. Schmidt, et al. Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron, 2018, AISTATS.

[29] Mark W. Schmidt, et al. Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition, 2013, arXiv:1308.6370.

[30] I. Shevtsova. On the absolute constants in the Berry-Esseen-type inequalities, 2011, arXiv:1111.6554.

[31] Yann LeCun, et al. Large Scale Online Learning, 2003, NIPS.

[32] Robert M. Gower, et al. SGD with Arbitrary Sampling: General Analysis and Improved Rates, 2019, ICML.

[33] Dan Alistarh, et al. ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning, 2017, ICML.

[34] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, arXiv.

[35] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.

[36] Kamyar Azizzadenesheli, et al. signSGD with Majority Vote is Communication Efficient and Fault Tolerant, 2018, ICLR.

[37] Dong Yu, et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs, 2014, INTERSPEECH.

[38] Nathan Srebro, et al. The Marginal Value of Adaptive Gradient Methods in Machine Learning, 2017, NIPS.