A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs

We propose ParDNN, an automatic, generic, and non-intrusive partitioning strategy for large DNN models that do not fit into the memory of a single device. ParDNN decides a placement of the DNN's underlying computational-graph operations across multiple devices so that the devices' memory constraints are met and the training time is minimized. ParDNN is completely independent of the deep-learning aspects of a DNN and requires no modification to either the model or the systems-level implementation of its operation kernels. It partitions DNNs with billions of parameters and hundreds of thousands of operations in seconds to a few minutes. Our experiments with TensorFlow on 16 GPUs demonstrate efficient training of five very large models while achieving superlinear scaling of both batch size and training throughput. Compared with related work (Mesh-TensorFlow and gradient checkpointing), ParDNN either outperforms it or qualitatively improves upon it.
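For intuition, the placement problem described above can be pictured with a small sketch: assign each operation of a computational graph to a device so that per-device memory limits are respected while the estimated schedule length stays short. The sketch below is a toy greedy list-scheduling heuristic with hypothetical op names, costs, and memory footprints; it is not ParDNN's actual algorithm, only an illustration of what a memory-constrained placement computes.

```python
from collections import defaultdict

# Toy example only: op names, costs, and memory sizes are made up.
# Each op: (compute_cost, memory_bytes, predecessors), listed in topological order.
graph = {
    "embed":   (2.0, 4e9, []),
    "layer_1": (5.0, 6e9, ["embed"]),
    "layer_2": (5.0, 6e9, ["layer_1"]),
    "logits":  (1.0, 2e9, ["layer_2"]),
}
device_capacity = {"gpu:0": 8e9, "gpu:1": 8e9, "gpu:2": 8e9}
comm_cost = 1.5                      # flat penalty for an edge that crosses devices

placement, finish_time = {}, {}
free_mem = dict(device_capacity)
device_ready = defaultdict(float)    # earliest time each device becomes idle

for op, (cost, mem, preds) in graph.items():
    best_dev, best_finish = None, float("inf")
    for dev, free in free_mem.items():
        if free < mem:               # respect the per-device memory constraint
            continue
        # Start after the device is idle and after all inputs are available
        # (adding a communication penalty for inputs placed on other devices).
        start = device_ready[dev]
        for p in preds:
            penalty = 0.0 if placement[p] == dev else comm_cost
            start = max(start, finish_time[p] + penalty)
        if start + cost < best_finish:
            best_dev, best_finish = dev, start + cost
    placement[op] = best_dev
    free_mem[best_dev] -= mem
    device_ready[best_dev] = best_finish
    finish_time[op] = best_finish

print(placement)                     # e.g. {'embed': 'gpu:0', 'layer_1': 'gpu:1', ...}
```

A real partitioner must additionally account for tensor lifetimes, execution order within each device, and actual transfer volumes; the sketch collapses all of these into a flat communication penalty.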

[1] Bora Uçar, et al. Acyclic Partitioning of Large Directed Acyclic Graphs, 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).

[2] Olatunji Ruwase, et al. ZeRO: Memory Optimization Towards Training A Trillion Parameter Models, 2019, SC.

[3] Alexander M. Rush, et al. Character-Aware Neural Language Models, 2015, AAAI.

[4] M. A. Cleveland, et al. The Problem With Critical Path Scheduling Algorithms, 1996.

[5] François Pellegrini, et al. Distillating knowledge about SCOTCH, 2009, Combinatorial Scientific Computing.

[6] Taro Sekiyama, et al. Profile-guided memory optimization for deep neural networks, 2018, ArXiv.

[7] Frank D. Anger, et al. Scheduling Precedence Graphs in Systems with Interprocessor Communication Times, 1989, SIAM J. Comput.

[8] Martín Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, 2016, ArXiv.

[9] Vipin Kumar, et al. Multilevel Graph Partitioning Schemes, 1995, ICPP.

[10] Razvan Pascanu, et al. Theano: Deep Learning on GPUs with Python, 2012.

[11] Toshio Endo, et al. ooc_cuDNN: Accommodating convolutional neural networks over GPU memory capacity, 2017, 2017 IEEE International Conference on Big Data (Big Data).

[12] Nikhil R. Devanur, et al. PipeDream: generalized pipeline parallelism for DNN training, 2019, SOSP.

[13] Jean Roman, et al. SCOTCH: A Software Package for Static Mapping by Dual Recursive Bipartitioning of Process and Architecture Graphs, 1996, HPCN Europe.

[14] Guigang Zhang, et al. Deep Learning, 2016, Int. J. Semantic Comput.

[15] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[16] Tao Yang, et al. A Comparison of Clustering Heuristics for Scheduling Directed Acyclic Graphs on Multiprocessors, 1992, J. Parallel Distributed Comput.

[17] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, ArXiv.

[18] Minjie Wang, et al. Supporting Very Large Models using Automatic Dataflow Graph Partitioning, 2018, EuroSys.

[19] Ling Yuan, et al. A Novel Task-Duplication Based Clustering Algorithm for Heterogeneous Computing Environments, 2019, IEEE Transactions on Parallel and Distributed Systems.

[20] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.

[21] Anne Benoit, et al. A Scalable Clustering-Based Task Scheduler for Homogeneous Processors Using DAG Partitioning, 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[22] Nikos Komodakis, et al. Wide Residual Networks, 2016, BMVC.

[23] Kurt Keutzer, et al. Integrated Model, Batch, and Domain Parallelism in Training Neural Networks, 2017, SPAA.

[24] Quoc V. Le, et al. A Hierarchical Model for Device Placement, 2018, ICLR.

[25] Kiyokuni Kawachiya, et al. Profiling based out-of-core Hybrid method for large neural networks: poster, 2019, PPoPP.

[26] Jing-Chiou Liou, et al. A comparison of general approaches to multiprocessor scheduling, 1997, Proceedings 11th International Parallel Processing Symposium.

[27] Ishfaq Ahmad, et al. Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs to Multiprocessors, 1996, IEEE Trans. Parallel Distributed Syst.

[28] Samy Bengio, et al. Device Placement Optimization with Reinforcement Learning, 2017, ICML.

[29] Charles R. Qi, et al. Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks, 2018, ICML.

[30] Yunbo Wang, et al. Eidetic 3D LSTM: A Model for Video Prediction and Beyond, 2019, ICLR.

[31] Zheng Zhang, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015, ArXiv.

[32] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[33] Ishfaq Ahmad, et al. Bubble scheduling: A quasi dynamic algorithm for static allocation of tasks to parallel architectures, 1995, Proceedings of the Seventh IEEE Symposium on Parallel and Distributed Processing.

[34] Mohammad Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, ArXiv.

[35] Alex Krizhevsky, et al. One weird trick for parallelizing convolutional neural networks, 2014, ArXiv.

[36] Vivek Sarkar, et al. Partitioning and scheduling parallel programs for execution on multiprocessors, 1987.

[37] Prabhat, et al. CosmoFlow: Using Deep Learning to Learn the Universe at Scale, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.

[38] Dustin Tran, et al. Mesh-TensorFlow: Deep Learning for Supercomputers, 2018, NeurIPS.

[39] David A. Bader, et al. Graph Partitioning and Graph Clustering, 2013.

[40] Jian Wang, et al. Comparative analysis of list scheduling algorithms on homogeneous multi-processors, 2016, 2016 8th IEEE International Conference on Communication Software and Networks (ICCSN).

[41] Bruce Hendrickson. Graph Partitioning, 2011, Encyclopedia of Parallel Computing.

[42] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.

[43] Alexander Aiken, et al. Beyond Data and Model Parallelism for Deep Neural Networks, 2018, SysML.

[44] Geoffrey E. Hinton, et al. Generating Text with Recurrent Neural Networks, 2011, ICML.

[45] Hai Jin, et al. Capuchin: Tensor-based GPU Memory Management for Deep Learning, 2020, ASPLOS.

[46] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[47] Marc Snir, et al. Channel and filter parallelism for large-scale CNN training, 2019, SC.

[48] Samy Bengio, et al. Tensor2Tensor for Neural Machine Translation, 2018, AMTA.

[49] J. B. G. Frenk, et al. Heuristic for the 0-1 Min-Knapsack Problem, 1991, Acta Cybern.

[50] Oliver Sinnen, et al. List-Scheduling versus Cluster-Scheduling, 2018, IEEE Transactions on Parallel and Distributed Systems.

[51] Tao Yang, et al. DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors, 1994, IEEE Trans. Parallel Distributed Syst.

[52] Kurt Keutzer, et al. Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization, 2019, MLSys.

[53] Nitish Srivastava, et al. Unsupervised Learning of Video Representations using LSTMs, 2015, ICML.

[54] Tianqi Chen, et al. Training Deep Nets with Sublinear Memory Cost, 2016, ArXiv.

[55] Mauro Cettolo, et al. The IWSLT 2016 Evaluation Campaign, 2016, IWSLT.

[56] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.

[57] Dhabaleswar K. Panda, et al. OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, 2018, 2018 IEEE 25th International Conference on High Performance Computing (HiPC).

[58] Tao Yang, et al. Scheduling and code generation for parallel architectures, 1993.

[59] Ann Bies, et al. The Penn Treebank: Annotating Predicate Argument Structure, 1994, HLT.

[60] Peter M. Fenwick, et al. A new data structure for cumulative frequency tables, 1994, Softw. Pract. Exp.

[61] Oliver Sinnen, et al. Task Scheduling for Parallel Systems, 2007, Wiley Series on Parallel and Distributed Computing.

[62] Sung Jo Kim. A general approach to multiprocessor scheduling, 1988.

[63] Natalia Gimelshein, et al. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design, 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[64] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).