Abhinav Jangda | Todd Mytkowicz | Madan Musuvathi | Youshan Miao | Olli Saarikivi | Saeed Maleki | Guodong Liu | Amir Hossein Nodehi Sabet | Jun Huang
[1] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[2] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1994.
[3] Frédo Durand, et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, 2013, PLDI.
[4] Aparna Chandramowlishwaran, et al. Pencil: A Pipelined Algorithm for Distributed Stencils, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
[5] Michel Steuwer, et al. LIFT: A functional data-parallel IR for high-performance GPU code generation, 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[6] Dhabaleswar K. Panda, et al. Designing Dynamic and Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation and Communication, 2017, ISC.
[7] Edgar Gabriel, et al. Maximizing Communication–Computation Overlap Through Automatic Parallelization and Run-time Tuning of Non-blocking Collective Operations, 2017, International Journal of Parallel Programming.
[8] Alexandre Denis, et al. MPI Overlap: Benchmark and Analysis, 2016, 2016 45th International Conference on Parallel Processing (ICPP).
[9] Pavan Balaji, et al. MPI+ULT: Overlapping Communication and Computation with User-Level Threads, 2015, 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems.
[10] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.
[11] Nikhil R. Devanur, et al. PipeDream: generalized pipeline parallelism for DNN training, 2019, SOSP.
[12] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[13] Eduard Ayguadé, et al. Overlapping communication and computation by using a hybrid MPI/SMPSs approach, 2010, ICS '10.
[14] Olatunji Ruwase, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
[15] Alexander Aiken, et al. Beyond Data and Model Parallelism for Deep Neural Networks, 2018, SysML.
[16] Torsten Hoefler, et al. dCUDA: Hardware Supported Overlap of Computation and Communication, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[17] Sergei Gorlatch, et al. High performance stencil code generation with Lift, 2018, CGO.
[18] Amar Phanishayee, et al. Efficient Large-Scale Language Model Training on GPU Clusters, 2021, ArXiv.
[19] André F. T. Martins, et al. Marian: Fast Neural Machine Translation in C++, 2018, ACL.
[20] Uday Bondhugula. Compiling affine loop nests for distributed-memory parallel architectures, 2013, SC13: International Conference for High Performance Computing, Networking, Storage and Analysis.
[21] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[22] Rastislav Bodik, et al. Fireiron: A Data-Movement-Aware Scheduling Language for GPUs, 2020, PACT.
[23] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.
[24] Shoaib Kamil, et al. Distributed Halide, 2016, PPoPP.
[25] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.
[26] Mohammad Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, ArXiv.
[27] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, ArXiv.
[28] Dustin Tran, et al. Mesh-TensorFlow: Deep Learning for Supercomputers, 2018, NeurIPS.
[29] Nectarios Koziris, et al. A pipelined schedule to minimize completion time for loop tiling with computation and communication overlapping, 2003, J. Parallel Distributed Comput.
[30] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.
[31] Gihan R. Mudalige, et al. Loop Tiling in Large-Scale Stencil Codes at Run-Time with OPS, 2017, IEEE Transactions on Parallel and Distributed Systems.
[32] Samuel Williams, et al. Compiler generation and autotuning of communication-avoiding operators for geometric multigrid, 2013, 20th Annual International Conference on High Performance Computing.