moDNN: Memory Optimal Deep Neural Network Training on Graphics Processing Units

Graphics processing units (GPUs) have been widely adopted to accelerate the training of deep neural networks (DNNs). Although the computational performance of GPUs keeps improving, the memory capacity of modern GPUs remains quite limited, which restricts the sizes of DNNs that can be trained on them and hence poses a serious challenge. This paper introduces a framework, referred to as moDNN (memory optimal DNN training on GPUs), to optimize memory usage in DNN training. moDNN supports automatic tuning of DNN training code to match any given memory budget (no smaller than the theoretical lower bound). By taking full advantage of overlapping computation with data transfers, we develop new heuristics that judiciously schedule data offloading and prefetching transfers, together with convolution algorithm selection, to optimize memory usage. We further devise a new sub-batch size selection method that also greatly reduces memory usage. moDNN reduces memory usage by up to 59× compared with an ideal case that assumes the GPU memory is large enough to hold all data. When executing moDNN on a GPU with 12 GB of memory, the training time increases by only 3 percent, a much smaller overhead than that incurred by the best known prior approach, vDNN [1]. Furthermore, we propose an optimization strategy for moDNN on multiple GPUs, again exploiting the overlap of data transfers with GPU computation. The results show that a 3.7× speedup is attained on four GPUs.
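
To make the offload/prefetch overlap concrete, the following is a minimal CUDA sketch, not the paper's implementation: it overlaps a placeholder layer kernel on a compute stream with an asynchronous offload of a feature map to pinned host memory and a later prefetch back to the GPU on a separate transfer stream. Names such as `layer_forward`, `compute_stream`, `xfer_stream`, and `h_offload` are illustrative assumptions; in moDNN the device buffer would also be freed after offloading and reallocated before the prefetch, which is omitted here.

```cuda
// Hedged sketch: overlap computation with offload/prefetch transfers using
// two CUDA streams and an event. All names are illustrative, not from moDNN.
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            return 1;                                                      \
        }                                                                  \
    } while (0)

// Placeholder "layer" standing in for a real cuDNN convolution call.
__global__ void layer_forward(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];   // dummy computation
}

int main() {
    const int n = 1 << 22;                     // elements in one feature map
    const size_t bytes = n * sizeof(float);

    float *d_in, *d_out, *h_offload;
    CHECK(cudaMalloc((void**)&d_in, bytes));
    CHECK(cudaMalloc((void**)&d_out, bytes));
    CHECK(cudaMallocHost((void**)&h_offload, bytes));  // pinned memory enables async copies

    cudaStream_t compute_stream, xfer_stream;
    CHECK(cudaStreamCreate(&compute_stream));
    CHECK(cudaStreamCreate(&xfer_stream));

    cudaEvent_t in_ready;
    CHECK(cudaEventCreate(&in_ready));

    // 1. Run the layer's forward pass on the compute stream.
    layer_forward<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_in, d_out, n);
    // Mark the point after which d_in is no longer needed by the computation.
    CHECK(cudaEventRecord(in_ready, compute_stream));

    // 2. Offload the input feature map to host memory on the transfer stream,
    //    overlapped with whatever runs next on the compute stream.
    CHECK(cudaStreamWaitEvent(xfer_stream, in_ready, 0));
    CHECK(cudaMemcpyAsync(h_offload, d_in, bytes,
                          cudaMemcpyDeviceToHost, xfer_stream));

    // 3. Later (e.g., before backpropagation reaches this layer), prefetch the
    //    offloaded data back to the GPU, again without stalling computation.
    CHECK(cudaMemcpyAsync(d_in, h_offload, bytes,
                          cudaMemcpyHostToDevice, xfer_stream));

    CHECK(cudaDeviceSynchronize());

    CHECK(cudaEventDestroy(in_ready));
    CHECK(cudaStreamDestroy(compute_stream));
    CHECK(cudaStreamDestroy(xfer_stream));
    CHECK(cudaFreeHost(h_offload));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    return 0;
}
```

In moDNN terms, the scheduling heuristics decide when such offload and prefetch transfers are issued relative to the forward and backward passes so that they hide behind computation; the sketch only shows the underlying overlap mechanism.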

[1] Natalia Gimelshein, et al. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design, 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[2] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Mordecai Avriel, et al. Nonlinear programming, 1976.

[4] Tara N. Sainath, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups, 2012, IEEE Signal Processing Magazine.

[5] Yann LeCun, et al. Fast Training of Convolutional Networks through FFTs, 2013, ICLR.

[6] Javier Romero, et al. Coupling Adaptive Batch Sizes with Learning Rates, 2016, UAI.

[7] Zheng Zhang, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015, ArXiv.

[8] Ran El-Yaniv, et al. Binarized Neural Networks, 2016, NIPS.

[9] Ronald L. Rivest, et al. Introduction to Algorithms, 1990.

[10] H. T. Kung, et al. I/O complexity: The red-blue pebble game, 1981, STOC '81.

[11] Igor Carron, et al. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, 2016.

[12] Dumitru Erhan, et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Quoc V. Le, et al. On optimization methods for deep learning, 2011, ICML.

[14] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.

[15] Tianqi Chen, et al. Training Deep Nets with Sublinear Memory Cost, 2016, ArXiv.

[16] Pritish Narayanan, et al. Deep Learning with Limited Numerical Precision, 2015, ICML.

[17] Martín Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, 2016, ArXiv.

[18] Hassan Foroosh, et al. Sparse Convolutional Neural Networks, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Ronald L. Rivest, et al. Introduction to Algorithms, 3rd Edition, 2009.

[20] Xiaoming Chen, et al. moDNN: Memory optimal DNN training on GPUs, 2018, 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[21] Yoav Goldberg, et al. A Primer on Neural Network Models for Natural Language Processing, 2015, J. Artif. Intell. Res.

[22] Zenglin Xu, et al. Efficient Communications in Training Large Scale Neural Networks, 2017, ACM Multimedia.

[23] Clément Farabet, et al. Torch7: A Matlab-like Environment for Machine Learning, 2011, NIPS 2011.

[24] John Tran, et al. cuDNN: Efficient Primitives for Deep Learning, 2014, ArXiv.

[25] Jeffrey Scott Vitter, et al. External memory algorithms and data structures: dealing with massive data, 2001, ACM Computing Surveys (CSUR).

[26] Song Han, et al. Learning both Weights and Connections for Efficient Neural Network, 2015, NIPS.

[27] Mohak Shah, et al. Comparative Study of Deep Learning Software Frameworks, 2015, ArXiv abs/1511.06435.

[28] Razvan Pascanu, et al. Theano: A CPU and GPU Math Compiler in Python, 2010, SciPy.

[29] Sachin S. Talathi, et al. Fixed Point Quantization of Deep Convolutional Networks, 2015, ICML.

[30] Trevor Darrell, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[31] Carter Bays, et al. A comparison of next-fit, first-fit, and best-fit, 1977, CACM.

[32] Thomas Brox, et al. U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015, MICCAI.

[33] Andrew Lavin, et al. Fast Algorithms for Convolutional Neural Networks, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[35] Quanquan C. Liu. Red-blue and standard pebble games: complexity and applications in the sequential and parallel models, 2017.

[36] Trevor Darrell, et al. Caffe: Convolutional Architecture for Fast Feature Embedding, 2014, ACM Multimedia.

[37] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[38] Alex Graves, et al. Memory-Efficient Backpropagation Through Time, 2016, NIPS.

[39] C. Charalambous. Conjugate gradient algorithm for efficient training of artificial neural networks, 1990.

[40] Peter B. Galvin, et al. Operating System Concepts, 4th Ed., 1993.