Optimizing MPI Collectives for X1

Traditionally, MPI collective operations have been built on point-to-point messages, with possible optimizations for system topology and communication protocol. The Cray X1's scatter/gather hardware and shared-memory mapping features allow significantly different approaches to MPI collectives, yielding substantial performance gains over standard methods, especially for short messages and high process counts. This paper describes some of the algorithms used, their implementation features, and relevant performance data.
