MemFlow: Memory-Aware Distributed Deep Learning

As the number of layers and the amount of training data increase, the trend is to train deep neural networks in parallel across devices. In such scenarios, neural network training is increasingly bottlenecked by the high memory requirements of intermediate results, or feature maps, which are produced during the forward pass and consumed during the backward pass. We recognize that the best-performing device parallelization configurations should account for memory usage in addition to the canonical metric of computation time. To this end, we introduce MemFlow, an optimization framework for distributed deep learning that jointly optimizes memory usage and computation time when searching for a parallelization strategy. MemFlow consists of: (i) a task graph annotated with memory usage estimates; (ii) a memory-aware execution simulator; and (iii) a Markov chain Monte Carlo search algorithm that considers varying degrees of recomputation, i.e., discarding feature maps during the forward pass and recomputing them during the backward pass. Our experiments demonstrate that under memory constraints, MemFlow readily locates valid and superior parallelization strategies that are unattainable with previous frameworks.
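To make the search procedure concrete, below is a minimal, hypothetical sketch of a memory-aware MCMC search in the spirit described above. It is not the authors' implementation: the strategy representation, the toy cost model in simulate(), and constants such as MEMORY_BUDGET, NUM_LAYERS, and NUM_DEVICES are all assumptions chosen for illustration. Candidate strategies are proposed by random mutation, scored by a simulated execution time, and rejected outright when their estimated peak memory exceeds the per-device budget.

```python
# Hypothetical sketch of a memory-aware MCMC search over parallelization
# strategies. All names and the cost model are illustrative assumptions,
# not the MemFlow implementation.
import math
import random

MEMORY_BUDGET = 16e9   # assumed per-device memory budget, in bytes
NUM_LAYERS = 32
NUM_DEVICES = 4

def random_strategy():
    """A strategy assigns each layer a device and a recompute flag."""
    return [(random.randrange(NUM_DEVICES), random.random() < 0.5)
            for _ in range(NUM_LAYERS)]

def mutate(strategy):
    """Propose a neighbor by re-randomizing one layer's placement/recompute choice."""
    s = list(strategy)
    i = random.randrange(NUM_LAYERS)
    s[i] = (random.randrange(NUM_DEVICES), random.random() < 0.5)
    return s

def simulate(strategy):
    """Toy stand-in for a memory-aware simulator: returns (time, peak_memory).
    Recomputation trades extra compute time for lower feature-map memory."""
    time = 0.0
    per_device_mem = [0.0] * NUM_DEVICES
    for layer, (dev, recompute) in enumerate(strategy):
        compute = 1.0 + 0.1 * layer          # assumed per-layer compute cost
        feat_map = 0.5e9                      # assumed activation size in bytes
        time += compute * (2.0 if recompute else 1.0)
        per_device_mem[dev] += 0.0 if recompute else feat_map
    return time, max(per_device_mem)

def mcmc_search(iters=10_000, temperature=2.0):
    current = random_strategy()
    cur_time, _ = simulate(current)
    best, best_time = None, float("inf")
    for _ in range(iters):
        proposal = mutate(current)
        time, mem = simulate(proposal)
        if mem > MEMORY_BUDGET:
            continue                          # infeasible: violates memory budget
        # Metropolis acceptance on simulated execution time.
        if time < cur_time or random.random() < math.exp((cur_time - time) / temperature):
            current, cur_time = proposal, time
            if time < best_time:
                best, best_time = proposal, time
    return best, best_time

if __name__ == "__main__":
    strategy, t = mcmc_search()
    if strategy is not None:
        print(f"best simulated time under the memory budget: {t:.2f}")
```

The key design point the sketch captures is that memory feasibility acts as a hard constraint on the proposal acceptance step, while execution time remains the objective being minimized.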
