IntML: Natural Compression for Distributed Deep Learning

Distributed machine learning has become common practice, driven by increasing model complexity and the sheer size of real-world datasets. While GPUs have massively increased compute power, networks have not improved at the same pace. As a result, in large deployments with many parallel workers, distributed ML training is increasingly network-bound [9]. Parallelization techniques such as mini-batch Stochastic Gradient Descent (SGD) alternate computation phases with model-update exchanges among workers. In the widely used synchronous setting, every node communicates hundreds of MBs of gradient values at the end of every SGD iteration, so network performance often has a substantial impact on overall training time.

To prevent the network from becoming a bottleneck, several prior works [10, 1, 13, 12, 6, 3] have proposed lossy compression methods that reduce the communicated data volume. These include sparsification methods, which communicate only a fraction of the original gradients, and quantization methods, which use fewer bits to represent each gradient. The main challenge in designing compression methods is that the more aggressively we compress, the more information is lost: the compressed gradients differ more from the original ones, which increases their statistical variance. Higher variance implies slower convergence [1, 8], i.e., more communication rounds. Compression therefore trades off the communication cost per iteration against the number of communication rounds.

In this work, we introduce a new, remarkably simple yet theoretically and practically effective compression technique, which we call natural compression (Cnat). It is applied individually to each gradient value and works by randomized rounding to the nearest (negative or positive) power of two. Cnat is "natural" because, for a value stored as a floating-point number, the nearest power of two can be obtained cheaply by ignoring the mantissa; our scheme thus communicates only the exponents and signs of the original floats. Importantly, natural compression enjoys a provably small variance. The interested reader can find a complete theoretical analysis in our technical report [5].

[Figure: number-line illustration of natural compression; the value 2.5 lies between the powers of two 2 and 4 and is rounded down to 2 with probability 3/4 or up to 4 with probability 1/4.]
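To make the rounding rule concrete, the following is a minimal NumPy sketch of the randomized rounding described above; the function name natural_compression and the internal use of float64 are illustrative choices, not part of the original text. For a value whose magnitude lies in [2^a, 2^(a+1)], it rounds down to 2^a with probability (2^(a+1) - |t|) / 2^a and up to 2^(a+1) otherwise; with these probabilities the rounded value equals t in expectation.

```python
import numpy as np

def natural_compression(t, rng=None):
    """Randomized rounding of each entry to a neighbouring power of two (sign kept).

    For |t| in [2^a, 2^(a+1)], round down to 2^a with probability
    (2^(a+1) - |t|) / 2^a and up to 2^(a+1) otherwise; with these
    probabilities the rounded value equals t in expectation.
    """
    rng = np.random.default_rng() if rng is None else rng
    t = np.asarray(t, dtype=np.float64)
    sign = np.sign(t)
    mag = np.abs(t)
    out = np.zeros_like(mag)
    nz = mag > 0                                  # zeros stay zero
    a = np.floor(np.log2(mag[nz]))                # exponent of the lower power of two
    low, high = 2.0 ** a, 2.0 ** (a + 1)
    p_up = (mag[nz] - low) / low                  # probability of rounding up
    out[nz] = np.where(rng.random(p_up.shape) < p_up, high, low)
    return sign * out
```

For instance, natural_compression(np.array([2.5])) returns 2.0 with probability 3/4 and 4.0 with probability 1/4, matching the example in the figure above.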
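The remark about ignoring the mantissa can also be illustrated at the bit level. The sketch below is again an assumption-laden illustration rather than the authors' implementation: for a normal IEEE-754 float32, the 23 mantissa bits encode exactly where the value falls between its two neighbouring powers of two, so they can be reused as the rounding-up probability, and the output keeps only the sign and exponent bits.

```python
import numpy as np

def natural_compression_float32(x, rng=None):
    """Bit-level sketch of the same rounding for float32 inputs.

    Assumes normal (or zero) values; subnormals, infinities and NaNs are
    not handled. A normal float x = +/- 2^a * (1 + m / 2^23) satisfies
    (|x| - 2^a) / 2^a = m / 2^23, i.e. the mantissa is the round-up probability.
    """
    rng = np.random.default_rng() if rng is None else rng
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    sign = bits & np.uint32(0x80000000)            # 1 sign bit
    exponent = bits & np.uint32(0x7F800000)        # 8 exponent bits
    mantissa = bits & np.uint32(0x007FFFFF)        # 23 mantissa bits
    p_up = mantissa / np.float64(2 ** 23)          # fraction between 2^a and 2^(a+1)
    bump = np.where(rng.random(bits.shape) < p_up,
                    np.uint32(0x00800000),         # increment the exponent field by one
                    np.uint32(0))
    return (sign | (exponent + bump)).view(np.float32)  # mantissa dropped entirely
```

Under this sketch, each compressed value is fully described by its sign and exponent (9 of the 32 bits), which is what lets natural compression shrink the communicated volume without any expensive encoding or decoding step.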