A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs