Training Deeper Models by GPU Memory Optimization on TensorFlow

With the advent of big data, readily available GPGPUs, and progress in neural network modeling techniques, training deep learning models on GPUs has become a popular choice. However, due to the inherent complexity of deep learning models and the limited memory resources of modern GPUs, training deep models is still nontrivial, especially when the model is too big for a single GPU. In this paper, we propose a general dataflow-graph-based GPU memory optimization strategy, “swap-out/in”, which uses host memory as a larger memory pool to overcome the limits of GPU memory. In addition, dedicated optimization strategies are proposed for the memory-consuming sequence-to-sequence (Seq2Seq) models. These strategies are integrated into TensorFlow seamlessly and incur no accuracy loss. Extensive experiments show significant reductions in memory usage: the maximum training batch size can be increased by a factor of 2 to 30 for a fixed model and system configuration.
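
The paper's swap-out/in pass rewrites the TensorFlow dataflow graph automatically; the snippet below is only a minimal hand-written sketch of the underlying idea, assuming TensorFlow graph mode and a single GPU. The function name `swap_out_in` and the device strings are illustrative assumptions, not part of the paper's implementation.

```python
import tensorflow as tf

def swap_out_in(gpu_tensor):
    """Illustrative sketch: park a large activation in host memory ("swap out")
    and copy it back to the GPU when a downstream op consumes it ("swap in").
    This is not the paper's automatic graph-rewriting pass, only the idea."""
    with tf.device('/cpu:0'):
        host_copy = tf.identity(gpu_tensor)   # swap-out: device-to-host copy
    with tf.device('/gpu:0'):
        restored = tf.identity(host_copy)     # swap-in: host-to-device copy
    return restored
```

In graph mode the runtime schedules the host-to-device copy only when the consuming operation runs, which is what frees GPU memory between the forward and backward passes; the paper automates the choice of which tensors to treat this way instead of requiring manual placement as above.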
