Scalable Algorithms for MPI Intergroup Allgather and Allgatherv

Abstract: MPI intergroup collective communication defines message transfer patterns between two disjoint groups of MPI processes. Such patterns arise in coupled applications and in modern scientific application workflows, often with large data sizes. However, current implementations in production MPI libraries adopt the "root gathering algorithm", which does not achieve optimal communication transfer time. In this paper, we propose algorithms for the intergroup Allgather and Allgatherv communication operations under the single-port communication constraint. We implement the new algorithms using MPI point-to-point and standard intra-communicator collective communication functions, and we evaluate their performance on the Cori supercomputer at NERSC. With message sizes per compute node ranging from 64 KBytes to 8 MBytes, our experiments show performance improvements of up to 23.67 times on 256 compute nodes over the implementations in production MPI libraries.
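For context, the operation the paper optimizes is the MPI intergroup Allgather, i.e. MPI_Allgather invoked on an inter-communicator, where each process receives one block from every process of the remote group. The minimal C sketch below is not the paper's proposed algorithm; it only illustrates that call pattern. The even/odd split of MPI_COMM_WORLD, the leader ranks passed to MPI_Intercomm_create, and the block size are illustrative assumptions, and at least two processes are assumed.

/* Minimal sketch of an intergroup Allgather on an MPI inter-communicator.
 * This is illustrative only and does not reflect the paper's new algorithms.
 * Assumes at least two MPI processes. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Split the world into two disjoint groups (even/odd ranks, assumed). */
    int color = world_rank % 2;
    MPI_Comm local_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &local_comm);

    /* Build the inter-communicator: local leader is local rank 0;
     * the remote group's leader is world rank 1 (odd group) or 0 (even group). */
    int remote_leader = (color == 0) ? 1 : 0;
    MPI_Comm intercomm;
    MPI_Intercomm_create(local_comm, 0, MPI_COMM_WORLD, remote_leader,
                         /*tag=*/0, &intercomm);

    /* Intergroup Allgather semantics: each process contributes one block and
     * receives one block from every process of the *remote* group. */
    int remote_size;
    MPI_Comm_remote_size(intercomm, &remote_size);

    const int count = 4;   /* elements per block (illustrative assumption) */
    int *sendbuf = malloc((size_t)count * sizeof(int));
    int *recvbuf = malloc((size_t)remote_size * count * sizeof(int));
    for (int i = 0; i < count; i++)
        sendbuf[i] = world_rank;

    MPI_Allgather(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT, intercomm);

    free(sendbuf);
    free(recvbuf);
    MPI_Comm_free(&intercomm);
    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}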
