A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs

We propose ParDNN, an automatic, generic, and non-intrusive partitioning strategy for large DNN models that do not fit into a single device's memory. ParDNN decides a placement of the DNN's underlying computational-graph operations across multiple devices so that the devices' memory constraints are met and the training time is minimized. ParDNN is completely independent of the deep-learning aspects of a DNN and requires no modification at either the model level or the system-level implementation of operation kernels. It partitions DNNs with billions of parameters and hundreds of thousands of operations in seconds to a few minutes. Our experiments with TensorFlow on 16 GPUs demonstrate efficient training of five very large models while achieving super-linear scaling of both batch size and training throughput. Compared with related work (Mesh-TensorFlow and gradient checkpointing), ParDNN either outperforms them or qualitatively improves upon them.
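To make the placement idea concrete, below is a minimal sketch of how an operation-to-device assignment can be applied to a TensorFlow graph without touching the model-building code. The `placement` map, the op-name prefixes, and the two-GPU split are illustrative assumptions only, not ParDNN's actual output or algorithm.

```python
import tensorflow as tf

# Hypothetical placement produced by a graph partitioner: it maps
# operation-name prefixes to devices.  The prefixes and the two-GPU
# split are illustrative assumptions.
placement = {
    "encoder": "/GPU:0",
    "decoder": "/GPU:1",
}

def assign_device(op):
    """Return the device for a tf.Operation based on its name prefix."""
    for prefix, device in placement.items():
        if op.name.startswith(prefix):
            return device
    return "/GPU:0"  # default for ops the placement does not mention

# Non-intrusive placement: the model code itself is unchanged; the
# assignment function is installed on the graph before ops are created.
graph = tf.Graph()
with graph.as_default(), graph.device(assign_device):
    x = tf.compat.v1.placeholder(tf.float32, [None, 1024], name="encoder/input")
    w1 = tf.random.normal([1024, 4096], name="encoder/w1")
    h = tf.matmul(x, w1, name="encoder/matmul")
    w2 = tf.random.normal([4096, 1024], name="decoder/w2")
    y = tf.matmul(h, w2, name="decoder/matmul")

# Inspect where each operation was assigned.
for op in graph.get_operations():
    print(f"{op.name:20s} -> {op.device}")
```

Because TensorFlow honors per-operation device annotations at run time and inserts the required cross-device transfers automatically, a purely graph-level partitioner of this kind can stay independent of the model definition and of the operation-kernel implementations.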
