signSGD with Majority Vote is Communication Efficient and Byzantine Fault Tolerant

Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses $32\times$ less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. We model adversaries as workers that may compute a stochastic gradient estimate and manipulate it, but may not coordinate with other adversaries. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that, unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. On the practical side, we built our distributed training system in PyTorch. Benchmarking against the state-of-the-art collective communications library (NCCL), our framework---with the parameter server housed entirely on one machine---led to a 25% reduction in time for training ResNet-50 on ImageNet when using 15 AWS p3.2xlarge machines.
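The aggregation rule at the heart of the algorithm is simple enough to sketch. Below is a minimal illustrative sketch in PyTorch of the server-side majority-vote update, not the authors' distributed implementation: the function name, the fixed learning rate, and the tie-handling convention (a zero vote leaves the coordinate unchanged) are assumptions made here for clarity.

    import torch

    def majority_vote_signsgd_step(param, worker_grads, lr=1e-3):
        """Server-side majority-vote aggregation for signSGD.

        Each worker sends only the elementwise sign of its stochastic
        gradient; the server sums the signs, takes the sign of the sum
        (the majority vote), and applies a fixed-size step.
        """
        signs = torch.stack([g.sign() for g in worker_grads])  # (num_workers, *param.shape)
        vote = signs.sum(dim=0).sign()                          # elementwise majority (ties -> 0)
        return param - lr * vote

    # Hypothetical demo: 7 honest workers plus 3 adversaries that invert
    # their gradient estimates; the vote still follows the honest majority.
    torch.manual_seed(0)
    true_grad = torch.randn(5)
    honest = [true_grad + 0.1 * torch.randn(5) for _ in range(7)]
    adversaries = [-g for g in honest[:3]]
    w = torch.zeros(5)
    w = majority_vote_signsgd_step(w, honest + adversaries, lr=0.1)

The demo with sign-inverting adversaries is meant to illustrate the robustness claim: since each worker contributes only one vote per coordinate, a minority of adversarial workers cannot, on average, flip the direction chosen by the honest majority.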
