IntML: Natural Compression for Distributed Deep Learning

Distributed machine learning has become common practice, driven by increasing model complexity and the sheer size of real-world datasets. While GPUs have massively increased compute power, networks have not improved at the same pace. As a result, in large deployments with many parallel workers, distributed ML training is increasingly network-bound [9]. Parallelization techniques such as mini-batch Stochastic Gradient Descent (SGD) alternate computation phases with model-update exchanges among workers. In the widely used synchronous setting, every node communicates hundreds of MBs of gradient values at the end of every SGD iteration, so network performance often has a substantial impact on overall training time.

To prevent the network from becoming a bottleneck, several prior works [10, 1, 13, 12, 6, 3] have proposed lossy compression methods that reduce the communicated data volume. These include sparsification methods, which communicate only a fraction of the original gradients, and quantization methods, which use fewer bits to represent each gradient. The main challenge in designing compression methods is that the more aggressively we compress, the more information is lost: the compressed gradients differ more from the original ones, which increases their statistical variance. Higher variance implies slower convergence [1, 8], i.e., more communication rounds. Compression therefore trades off the communication cost per iteration against the number of communication rounds.

In this work, we introduce a new, remarkably simple yet theoretically and practically effective compression technique, which we call natural compression (Cnat). It is applied individually to each gradient value and works by randomized rounding to the nearest (negative or positive) power of two. Cnat is "natural" because, for a value stored as a floating-point number, the nearest power of two can be obtained cheaply by ignoring the mantissa; our scheme thus communicates only the exponents and signs of the original floats. Importantly, natural compression enjoys a provably small variance. The interested reader can find a complete theoretical analysis in our technical report [5].

[Figure: number-line illustration of natural compression; the value 2.5 lies between the powers of two 2 and 4 and is rounded down to 2 with probability 3/4 or up to 4 with probability 1/4.]
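To make the rounding rule concrete, the following is a minimal NumPy sketch of the randomized rounding described above; the function name natural_compression and the internal use of float64 are illustrative choices, not part of the original text. For a value whose magnitude lies in [2^a, 2^(a+1)], it rounds down to 2^a with probability (2^(a+1) - |t|) / 2^a and up to 2^(a+1) otherwise; with these probabilities the rounded value equals t in expectation.

```python
import numpy as np

def natural_compression(t, rng=None):
    """Randomized rounding of each entry to a neighbouring power of two (sign kept).

    For |t| in [2^a, 2^(a+1)], round down to 2^a with probability
    (2^(a+1) - |t|) / 2^a and up to 2^(a+1) otherwise; with these
    probabilities the rounded value equals t in expectation.
    """
    rng = np.random.default_rng() if rng is None else rng
    t = np.asarray(t, dtype=np.float64)
    sign = np.sign(t)
    mag = np.abs(t)
    out = np.zeros_like(mag)
    nz = mag > 0                                  # zeros stay zero
    a = np.floor(np.log2(mag[nz]))                # exponent of the lower power of two
    low, high = 2.0 ** a, 2.0 ** (a + 1)
    p_up = (mag[nz] - low) / low                  # probability of rounding up
    out[nz] = np.where(rng.random(p_up.shape) < p_up, high, low)
    return sign * out
```

For instance, natural_compression(np.array([2.5])) returns 2.0 with probability 3/4 and 4.0 with probability 1/4, matching the example in the figure above.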
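The remark about ignoring the mantissa can also be illustrated at the bit level. The sketch below is again an assumption-laden illustration rather than the authors' implementation: for a normal IEEE-754 float32, the 23 mantissa bits encode exactly where the value falls between its two neighbouring powers of two, so they can be reused as the rounding-up probability, and the output keeps only the sign and exponent bits.

```python
import numpy as np

def natural_compression_float32(x, rng=None):
    """Bit-level sketch of the same rounding for float32 inputs.

    Assumes normal (or zero) values; subnormals, infinities and NaNs are
    not handled. A normal float x = +/- 2^a * (1 + m / 2^23) satisfies
    (|x| - 2^a) / 2^a = m / 2^23, i.e. the mantissa is the round-up probability.
    """
    rng = np.random.default_rng() if rng is None else rng
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    sign = bits & np.uint32(0x80000000)            # 1 sign bit
    exponent = bits & np.uint32(0x7F800000)        # 8 exponent bits
    mantissa = bits & np.uint32(0x007FFFFF)        # 23 mantissa bits
    p_up = mantissa / np.float64(2 ** 23)          # fraction between 2^a and 2^(a+1)
    bump = np.where(rng.random(bits.shape) < p_up,
                    np.uint32(0x00800000),         # increment the exponent field by one
                    np.uint32(0))
    return (sign | (exponent + bump)).view(np.float32)  # mantissa dropped entirely
```

Under this sketch, each compressed value is fully described by its sign and exponent (9 of the 32 bits), which is what lets natural compression shrink the communicated volume without any expensive encoding or decoding step.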