Optimal Algorithms for Half-Duplex Inter-Group All-to-All Broadcast on Fully Connected and Ring Topologies

Half-duplex inter-group collective communications are bipartite message transfer patterns in which the processes of a sender group pass messages to the processes of a receiver group. These communication patterns serve as basic operations in scientific application workflows. In this paper, we present optimal parallel algorithms for half-duplex inter-group all-to-all broadcast under the bidirectional communication constraint on fully connected and ring topologies. We implement the algorithms using MPI communication functions and evaluate them on the Cori supercomputer. For the fully connected topology, we compare our algorithms against production MPI libraries. For the ring topology, we implement our algorithms with the MPI_Sendrecv function to emulate a ring network and compare them against the intra-group Allgather algorithm emulated in the same environment. Message sizes ranging from 32 KB to 4 MB are used in the evaluation. The proposed algorithms for the fully connected topology are up to 5 times faster than the root-gathering algorithm adopted by MPICH, and the proposed algorithms for the ring topology are up to 1.4 times faster than the intra-group Allgather algorithm.
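
The ring-topology experiments emulate the network with point-to-point MPI_Sendrecv calls, and the baseline is the intra-group ring Allgather run in the same emulated environment. The following minimal C sketch illustrates only that Sendrecv-based ring Allgather emulation; it is not the paper's proposed inter-group algorithm, and the buffer layout and 32 KB block size are illustrative assumptions.

/* Minimal sketch (illustrative, not the paper's proposed algorithm):
 * an intra-group ring Allgather emulated with MPI_Sendrecv. Each of
 * the nprocs ranks contributes one block; after nprocs - 1 steps every
 * rank holds all blocks. The 32 KB block size is an assumption taken
 * from the smallest evaluated message size. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int block_size = 32 * 1024;                 /* 32 KB per block */
    char *buf = malloc((size_t)nprocs * block_size);
    memset(buf + (size_t)rank * block_size, rank, block_size);  /* my own block */

    int right = (rank + 1) % nprocs;                  /* ring neighbors */
    int left  = (rank - 1 + nprocs) % nprocs;

    /* Step i: forward the block received in step i-1 to the right
     * neighbor while receiving the next block from the left neighbor. */
    for (int i = 0; i < nprocs - 1; i++) {
        int send_block = (rank - i + nprocs) % nprocs;
        int recv_block = (rank - i - 1 + nprocs) % nprocs;
        MPI_Sendrecv(buf + (size_t)send_block * block_size, block_size,
                     MPI_CHAR, right, 0,
                     buf + (size_t)recv_block * block_size, block_size,
                     MPI_CHAR, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* buf now holds the blocks of all nprocs ranks. */
    free(buf);
    MPI_Finalize();
    return 0;
}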
