Abhinav Jangda | Todd Mytkowicz | Madan Musuvathi | Youshan Miao | Olli Saarikivi | Saeed Maleki | Guodong Liu | Amir Hossein Nodehi Sabet | Jun Huang
[1] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[2] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1994.
[3] Frédo Durand, et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, 2013, PLDI.
[4] Aparna Chandramowlishwaran, et al. Pencil: A Pipelined Algorithm for Distributed Stencils, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
[5] Michel Steuwer, et al. LIFT: A functional data-parallel IR for high-performance GPU code generation, 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[6] Dhabaleswar K. Panda, et al. Designing Dynamic and Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation and Communication, 2017, ISC.
[7] Edgar Gabriel, et al. Maximizing Communication–Computation Overlap Through Automatic Parallelization and Run-time Tuning of Non-blocking Collective Operations, 2017, International Journal of Parallel Programming.
[8] Alexandre Denis, et al. MPI Overlap: Benchmark and Analysis, 2016, 2016 45th International Conference on Parallel Processing (ICPP).
[9] Pavan Balaji, et al. MPI+ULT: Overlapping Communication and Computation with User-Level Threads, 2015, 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems.
[10] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.
[11] Nikhil R. Devanur, et al. PipeDream: generalized pipeline parallelism for DNN training, 2019, SOSP.
[12] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[13] Eduard Ayguadé, et al. Overlapping communication and computation by using a hybrid MPI/SMPSs approach, 2010, ICS '10.
[14] Olatunji Ruwase, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
[15] Alexander Aiken, et al. Beyond Data and Model Parallelism for Deep Neural Networks, 2018, SysML.
[16] Torsten Hoefler, et al. dCUDA: Hardware Supported Overlap of Computation and Communication, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[17] Sergei Gorlatch, et al. High performance stencil code generation with Lift, 2018, CGO.
[18] Amar Phanishayee, et al. Efficient Large-Scale Language Model Training on GPU Clusters, 2021, ArXiv.
[19] André F. T. Martins, et al. Marian: Fast Neural Machine Translation in C++, 2018, ACL.
[20] Uday Bondhugula. Compiling affine loop nests for distributed-memory parallel architectures, 2013, SC13: International Conference for High Performance Computing, Networking, Storage and Analysis.
[21] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[22] Rastislav Bodik, et al. Fireiron: A Data-Movement-Aware Scheduling Language for GPUs, 2020, PACT.
[23] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.
[24] Shoaib Kamil, et al. Distributed Halide, 2016, PPoPP.
[25] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.
[26] Mohammad Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, ArXiv.
[27] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, ArXiv.
[28] Dustin Tran, et al. Mesh-TensorFlow: Deep Learning for Supercomputers, 2018, NeurIPS.
[29] Nectarios Koziris, et al. A pipelined schedule to minimize completion time for loop tiling with computation and communication overlapping, 2003, J. Parallel Distributed Comput.
[30] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.
[31] Gihan R. Mudalige, et al. Loop Tiling in Large-Scale Stencil Codes at Run-Time with OPS, 2017, IEEE Transactions on Parallel and Distributed Systems.
[32] Samuel Williams, et al. Compiler generation and autotuning of communication-avoiding operators for geometric multigrid, 2013, 20th Annual International Conference on High Performance Computing.