[1] Bora Uçar, et al. Acyclic Partitioning of Large Directed Acyclic Graphs, 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).
[2] Olatunji Ruwase, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2019, SC.
[3] Alexander M. Rush, et al. Character-Aware Neural Language Models, 2015, AAAI.
[4] M. A. Cleveland, et al. The Problem With Critical Path Scheduling Algorithms, 1996.
[5] François Pellegrini, et al. Distillating knowledge about SCOTCH, 2009, Combinatorial Scientific Computing.
[6] Taro Sekiyama, et al. Profile-guided memory optimization for deep neural networks, 2018, ArXiv.
[7] Frank D. Anger, et al. Scheduling Precedence Graphs in Systems with Interprocessor Communication Times, 1989, SIAM J. Comput.
[8] Martín Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, 2016, ArXiv.
[9] Vipin Kumar, et al. Multilevel Graph Partitioning Schemes, 1995, ICPP.
[10] Razvan Pascanu, et al. Theano: Deep Learning on GPUs with Python, 2012.
[11] Toshio Endo, et al. ooc_cuDNN: Accommodating convolutional neural networks over GPU memory capacity, 2017, 2017 IEEE International Conference on Big Data (Big Data).
[12] Nikhil R. Devanur, et al. PipeDream: generalized pipeline parallelism for DNN training, 2019, SOSP.
[13] Jean Roman, et al. SCOTCH: A Software Package for Static Mapping by Dual Recursive Bipartitioning of Process and Architecture Graphs, 1996, HPCN Europe.
[14] Guigang Zhang, et al. Deep Learning, 2016, Int. J. Semantic Comput.
[15] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[16] Tao Yang, et al. A Comparison of Clustering Heuristics for Scheduling Directed Acyclic Graphs on Multiprocessors, 1992, J. Parallel Distributed Comput.
[17] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, ArXiv.
[18] Minjie Wang, et al. Supporting Very Large Models using Automatic Dataflow Graph Partitioning, 2018, EuroSys.
[19] Ling Yuan, et al. A Novel Task-Duplication Based Clustering Algorithm for Heterogeneous Computing Environments, 2019, IEEE Transactions on Parallel and Distributed Systems.
[20] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.
[21] Anne Benoit, et al. A Scalable Clustering-Based Task Scheduler for Homogeneous Processors Using DAG Partitioning, 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[22] Nikos Komodakis, et al. Wide Residual Networks, 2016, BMVC.
[23] Kurt Keutzer, et al. Integrated Model, Batch, and Domain Parallelism in Training Neural Networks, 2017, SPAA.
[24] Quoc V. Le, et al. A Hierarchical Model for Device Placement, 2018, ICLR.
[25] Kiyokuni Kawachiya, et al. Profiling based out-of-core Hybrid method for large neural networks: poster, 2019, PPoPP.
[26] Jing-Chiou Liou, et al. A comparison of general approaches to multiprocessor scheduling, 1997, Proceedings 11th International Parallel Processing Symposium.
[27] Ishfaq Ahmad, et al. Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs to Multiprocessors, 1996, IEEE Trans. Parallel Distributed Syst.
[28] Samy Bengio, et al. Device Placement Optimization with Reinforcement Learning, 2017, ICML.
[29] Charles R. Qi, et al. Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks, 2018, ICML.
[30] Yunbo Wang, et al. Eidetic 3D LSTM: A Model for Video Prediction and Beyond, 2019, ICLR.
[31] Zheng Zhang, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015, ArXiv.
[32] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[33] Ishfaq Ahmad, et al. Bubble scheduling: A quasi dynamic algorithm for static allocation of tasks to parallel architectures, 1995, Proceedings of the Seventh IEEE Symposium on Parallel and Distributed Processing.
[34] Mohammad Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, ArXiv.
[35] Alex Krizhevsky, et al. One weird trick for parallelizing convolutional neural networks, 2014, ArXiv.
[36] Vivek Sarkar, et al. Partitioning and scheduling parallel programs for execution on multiprocessors, 1987.
[37] Prabhat, et al. CosmoFlow: Using Deep Learning to Learn the Universe at Scale, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[38] Dustin Tran, et al. Mesh-TensorFlow: Deep Learning for Supercomputers, 2018, NeurIPS.
[39] David A. Bader, et al. Graph Partitioning and Graph Clustering, 2013.
[40] Jian Wang, et al. Comparative analysis of list scheduling algorithms on homogeneous multi-processors, 2016, 2016 8th IEEE International Conference on Communication Software and Networks (ICCSN).
[41] Bruce Hendrickson. Graph Partitioning, 2011, Encyclopedia of Parallel Computing.
[42] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.
[43] Alexander Aiken, et al. Beyond Data and Model Parallelism for Deep Neural Networks, 2018, SysML.
[44] Geoffrey E. Hinton, et al. Generating Text with Recurrent Neural Networks, 2011, ICML.
[45] Hai Jin, et al. Capuchin: Tensor-based GPU Memory Management for Deep Learning, 2020, ASPLOS.
[46] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[47] Marc Snir, et al. Channel and filter parallelism for large-scale CNN training, 2019, SC.
[48] Samy Bengio, et al. Tensor2Tensor for Neural Machine Translation, 2018, AMTA.
[49] J. B. G. Frenk, et al. Heuristics for the 0-1 Min-Knapsack Problem, 1991, Acta Cybern.
[50] Oliver Sinnen, et al. List-Scheduling versus Cluster-Scheduling, 2018, IEEE Transactions on Parallel and Distributed Systems.
[51] Tao Yang, et al. DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors, 1994, IEEE Trans. Parallel Distributed Syst.
[52] Kurt Keutzer, et al. Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization, 2019, MLSys.
[53] Nitish Srivastava, et al. Unsupervised Learning of Video Representations using LSTMs, 2015, ICML.
[54] Tianqi Chen, et al. Training Deep Nets with Sublinear Memory Cost, 2016, ArXiv.
[55] Mauro Cettolo, et al. The IWSLT 2016 Evaluation Campaign, 2016, IWSLT.
[56] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.
[57] Dhabaleswar K. Panda, et al. OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, 2018, 2018 IEEE 25th International Conference on High Performance Computing (HiPC).
[58] Tao Yang, et al. Scheduling and code generation for parallel architectures, 1993.
[59] Ann Bies, et al. The Penn Treebank: Annotating Predicate Argument Structure, 1994, HLT.
[60] Peter M. Fenwick, et al. A new data structure for cumulative frequency tables, 1994, Softw. Pract. Exp.
[61] Oliver Sinnen, et al. Task Scheduling for Parallel Systems, 2007, Wiley Series on Parallel and Distributed Computing.
[62] Sung Jo Kim. A general approach to multiprocessor scheduling, 1988.
[63] Natalia Gimelshein, et al. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design, 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[64] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).