Auto-Precision Scaling for Distributed Deep Learning

In recent years, large-batch optimization has become key to distributed deep learning. However, large-batch optimization is difficult: naively porting a small-batch training recipe often leads to a significant loss in test accuracy. Some researchers have suggested that large-batch optimization hurts generalization, and further conjectured that large-batch training requires higher floating-point precision to reach comparable generalization performance. To investigate this question, we conduct an open study in this paper. Our goal is to determine how many bits large-batch training actually needs. Doing so requires a system for customized-precision experiments, but state-of-the-art systems have limitations that lower the efficiency of developers and researchers. We therefore design and implement our own system, CPD: a high-performance system for customized-precision distributed deep learning. In our experiments, applications often lose accuracy at very low precision (e.g., 8 bits or 4 bits). To address this, we propose the APS (Auto-Precision-Scaling) algorithm, a layer-wise adaptive scheme for shifting gradients. With APS, large-batch training converges with only 4 bits.
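To make the idea of layer-wise gradient shifting concrete, the following is a minimal Python sketch of one possible scaling step, written under our own assumptions rather than taken from the paper: before a layer's gradients are cast to a low-precision format and communicated, they are shifted by a per-layer power-of-two factor chosen from their maximum magnitude, and the shift is undone afterwards. The function name `aps_scale_layer`, the format parameters, and the fixed-point rounding stand-in are all hypothetical; the paper's actual 4-bit format and APS rule may differ.

```python
import numpy as np

def aps_scale_layer(grad, mantissa_bits=1, exponent_bits=2):
    """Hypothetical sketch of layer-wise auto-precision scaling.

    Shifts one layer's gradients by a power-of-two factor so the largest
    magnitude lands near the top of an assumed low-precision range,
    simulates the low-precision rounding, then undoes the shift.
    """
    # Largest magnitude the assumed low-precision format can represent
    # (illustrative choice; not the format used in the paper).
    max_representable = 2.0 ** (2 ** (exponent_bits - 1))

    g_max = np.max(np.abs(grad))
    if g_max == 0.0:
        return grad.copy()

    # Per-layer power-of-two shift so g_max maps just below max_representable.
    shift = 2.0 ** np.floor(np.log2(max_representable / g_max))

    # Crude fixed-point stand-in for the low-precision cast and all-reduce.
    step = 2.0 ** mantissa_bits
    quantized = np.round(grad * shift * step) / step

    # Undo the shift before the optimizer consumes the gradients.
    return quantized / shift
```

Because the shift is a power of two and chosen per layer, small-magnitude gradients in early layers are not flushed to zero by the narrow format, which is the failure mode the abstract attributes to plain 8-bit or 4-bit training.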
