Collective operations in NEC's high-performance MPI libraries

We give an overview of the algorithms and implementations of some of the most important MPI (Message Passing Interface) collective operations in the high-performance MPI libraries MPI/SX and MPI/ES. The infrastructure of MPI/SX makes it easy to incorporate new algorithms, as well as algorithms for common special cases (e.g. a single SX node, or a single MPI process per SX node). The libraries employ algorithms that are among the best known, and exploit special hardware features of the SX architecture and the internode crossbar switch (IXS) wherever possible. We discuss in more detail the implementation of MPI_Barrier, MPI_Bcast, the MPI reduction collectives, MPI_Alltoall, and the gather/scatter collectives. Performance figures and comparisons to straightforward algorithms are given for a large SX-8 system and for the Earth Simulator. The measurements show excellent absolute performance and demonstrate the scalability of MPI/SX and MPI/ES to systems with large numbers of nodes.
