Leveraging Network-level parallelism with Multiple Process-Endpoints for MPI Broadcast

The Message Passing Interface (MPI) has been the dominant programming model for developing scalable, high-performance parallel applications. Collective operations provide group communication in a portable and efficient manner and are used by a large number of applications across different domains. Optimizing collective operations is key to achieving good performance speedups and portability. Broadcast, or one-to-all communication, is one of the most commonly used collectives in MPI applications. However, existing broadcast algorithms do not effectively utilize the high degree of parallelism and the increased message-rate capabilities offered by modern architectures. In this paper, we address these challenges and propose a Scalable Multi-Endpoint broadcast algorithm that combines hierarchical communication with multiple endpoints per node for high performance and scalability. We evaluate the proposed algorithm against state-of-the-art designs in other MPI libraries, including MVAPICH2, Intel MPI, and Spectrum MPI. We demonstrate the benefits of the proposed algorithm at the benchmark and application levels at scale on four different hardware architectures, including Intel Cascade Lake, Intel Skylake, AMD EPYC, and IBM POWER9, with InfiniBand and Omni-Path interconnects. Compared to other state-of-the-art designs, our proposed design shows up to 2.5x performance improvement at the microbenchmark level on 128 nodes. We also observe up to 37% improvement in broadcast communication latency for SPEC MPI scientific applications.

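The abstract describes the design only at a high level, but its hierarchical component maps naturally onto standard MPI primitives. The sketch below is a minimal, single-endpoint illustration of such a hierarchical broadcast, assuming the broadcast root is global rank 0; the helper name hier_bcast is hypothetical, and the paper's actual multi-endpoint design, which drives several network endpoints per node, is not shown.

```c
/* Minimal sketch of a hierarchical (node-leader) broadcast.
 * Assumption: the broadcast root is global rank 0, so the lowest rank
 * on each node acts as that node's leader. This is only an
 * illustration of the hierarchical idea, not the paper's design. */
#include <mpi.h>

static int hier_bcast(void *buf, int count, MPI_Datatype dtype,
                      MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int rank, node_rank;

    MPI_Comm_rank(comm, &rank);

    /* Group ranks that share a node into an intra-node communicator. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* The lowest rank on each node (node_rank == 0) joins the
     * inter-node leader communicator; everyone else gets MPI_COMM_NULL. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank,
                   &leader_comm);

    /* Phase 1: broadcast across nodes among the leaders. */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(buf, count, dtype, 0, leader_comm);

    /* Phase 2: each leader broadcasts within its node over shared memory. */
    MPI_Bcast(buf, count, dtype, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}
```

A multi-endpoint extension along the lines the abstract describes would presumably elect several leaders per node and partition the message or the inter-node tree among them, so that the node's aggregate message rate is exercised rather than that of a single endpoint.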