Huffman Coding Based Encoding Techniques for Fast Distributed Deep Learning

Distributed stochastic optimization algorithms equipped with gradient compression techniques, such as codebook quantization, are increasingly popular and are considered state-of-the-art for training large deep neural network (DNN) models. However, communicating the quantized gradients over a network requires efficient encoding techniques. For this, practitioners generally use Elias-based encoding techniques without considering their computational overhead or the resulting data volume. In this paper, we propose several Huffman-coding-based lossless encoding techniques that exploit different characteristics of the quantized gradients during distributed DNN training. We then demonstrate their effectiveness on five DNN models across three datasets and compare them against the classic state-of-the-art Elias-based encoding techniques. Our results show that the proposed Huffman-based encoders (RLH, SH, and SHS) reduce the encoded data volume by up to 5.1×, 4.32×, and 3.8×, respectively, compared to the Elias-based encoders.
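To make the core idea concrete, below is a minimal sketch of plain Huffman coding applied to a ternary-quantized gradient. It is not the paper's RLH, SH, or SHS encoder; the ternary quantizer and the toy gradient are illustrative assumptions, chosen only to show why entropy coding shrinks the communicated volume when the symbol distribution is heavily skewed toward zero.

```python
# Minimal sketch (not the paper's RLH/SH/SHS encoders): plain Huffman coding
# of a ternary-quantized gradient. The quantizer and toy data are assumptions
# made for illustration only.
import heapq
from collections import Counter
from itertools import count

def huffman_code(symbols):
    """Build a prefix code (symbol -> bitstring) from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:                        # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    tiebreak = count()                        # breaks frequency ties in the heap
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)       # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}   # prepend the new root bit
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Ternary quantization (as in TernGrad-style schemes) leaves mostly zeros,
# so Huffman coding assigns the zero symbol a very short codeword.
quantized = [0] * 90 + [1] * 6 + [-1] * 4     # toy quantized gradient, 100 symbols
code = huffman_code(quantized)                # e.g. 0 -> 1-bit code, +1/-1 -> 2-bit codes
encoded_bits = sum(len(code[s]) for s in quantized)
print(code)
print(f"{encoded_bits} bits encoded vs. {32 * len(quantized)} bits at float32")
```

The tiebreak counter only prevents Python from comparing the code dictionaries when two subtrees have equal frequency; the resulting prefix code is still a valid minimum-redundancy code, so the receiver can decode the bitstream unambiguously once the codebook is shared. The paper's encoders refine this basic building block by exploiting further structure of the quantized gradients.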
