Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL

Large ML models and datasets have necessitated the use of multi-GPU systems for distributed model training. To harness the power offered by multi-GPU systems, it is critical to eliminate bottlenecks in inter-GPU communication, a problem made challenging by the heterogeneous nature of interconnects. In this work, we present TACCL, a synthesizer of collective communication algorithms for large-scale multi-GPU systems. TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms. TACCL is built on top of the NVIDIA Collective Communication Library (NCCL), allowing it to serve as a drop-in replacement for GPU communication in frameworks like PyTorch with minimal changes. TACCL generates algorithms for communication primitives such as ALLGATHER, ALLTOALL, and ALLREDUCE that are up to 3× faster than NCCL's. Using TACCL's algorithms speeds up end-to-end training of an internal mixture-of-experts model by 17%. By decomposing the optimization problem into parts and leveraging the symmetry in multi-GPU topologies, TACCL synthesizes collectives for up to 80 GPUs in less than 3 minutes, at least two orders of magnitude faster than other state-of-the-art synthesis-based collective communication libraries.
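To make the "drop-in replacement" claim concrete, the sketch below shows the standard torch.distributed collective calls (ALLREDUCE, ALLGATHER, ALLTOALL) issued over the NCCL backend. This is an illustrative assumption, not code from the paper: the point is that a TACCL-backed runtime would sit beneath this layer, so application code like the following would not need to change. The setup (torchrun-style environment variables, tensor sizes) is hypothetical.

```python
# Illustrative sketch only: standard torch.distributed collectives over the NCCL
# backend. TACCL is described as a drop-in replacement for NCCL, so code at this
# level would stay unchanged; the specific setup below is an assumption.
import os
import torch
import torch.distributed as dist

def run_collectives():
    # Assumes a torchrun-style launch that sets RANK, WORLD_SIZE, and LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device(f"cuda:{os.environ.get('LOCAL_RANK', rank)}")
    torch.cuda.set_device(device)

    # ALLREDUCE: sum a gradient-sized buffer across all ranks.
    grad = torch.ones(1 << 20, device=device)
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    # ALLGATHER: collect one shard from every rank.
    shard = torch.full((1024,), float(rank), device=device)
    gathered = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(gathered, shard)

    # ALLTOALL: exchange equal-sized chunks, as in mixture-of-experts routing.
    send = torch.arange(world_size * 256, dtype=torch.float32, device=device)
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)

    dist.destroy_process_group()

if __name__ == "__main__":
    run_collectives()
```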
