STAR-MPI: self tuned adaptive routines for MPI collective operations

Message Passing Interface (MPI) collective communication routines are widely used in parallel applications. For a collective communication routine to achieve high performance across different applications and platforms, it must adapt to both the system architecture and the application workload. Current MPI implementations do not support such software adaptability and consequently fail to achieve high performance on many platforms. In this paper, we present STAR-MPI (Self Tuned Adaptive Routines for MPI collective operations), a set of MPI collective communication routines that are capable of adapting to the system architecture and the application workload. For each operation, STAR-MPI maintains a set of communication algorithms that can potentially be efficient in different situations. As an application executes, a STAR-MPI routine applies the Automatic Empirical Optimization of Software (AEOS) technique at run time to dynamically select the best-performing algorithm for the application on the platform. We describe the techniques used in STAR-MPI, analyze its overheads, and evaluate its performance with applications and benchmarks. The results of our study indicate that STAR-MPI is robust and efficient: it is able to find efficient algorithms with reasonable overhead, and it outperforms traditional MPI implementations by a large margin in many cases.
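The AEOS-style selection loop described above can be sketched in a few lines of C. The following is a minimal illustration under stated assumptions, not the actual STAR-MPI implementation: the routine name `star_allreduce`, the two candidate algorithms, and the tuning parameters `NUM_ALGS` and `TRIALS_PER_ALG` are all hypothetical names introduced for the example. It demonstrates the idea from the abstract: each candidate algorithm is timed over a few real invocations of the collective, and once every candidate has been measured, the routine commits to the fastest one for the remainder of the run.

```c
/*
 * Minimal sketch of AEOS-style run-time algorithm selection for an MPI
 * collective, in the spirit of STAR-MPI.  Names (star_allreduce, NUM_ALGS,
 * TRIALS_PER_ALG) are illustrative, not the actual STAR-MPI API.
 */
#include <mpi.h>
#include <float.h>

#define NUM_ALGS       2   /* candidate algorithms for this operation */
#define TRIALS_PER_ALG 8   /* real invocations used to measure each one */

/* Candidate 1: the native implementation. */
static int alg_native(const void *sb, void *rb, int n, MPI_Datatype t,
                      MPI_Op op, MPI_Comm comm)
{
    return MPI_Allreduce(sb, rb, n, t, op, comm);
}

/* Candidate 2: reduce to root, then broadcast the result. */
static int alg_reduce_bcast(const void *sb, void *rb, int n, MPI_Datatype t,
                            MPI_Op op, MPI_Comm comm)
{
    int err = MPI_Reduce(sb, rb, n, t, op, 0, comm);
    if (err != MPI_SUCCESS) return err;
    return MPI_Bcast(rb, n, t, 0, comm);
}

typedef int (*allreduce_fn)(const void *, void *, int, MPI_Datatype,
                            MPI_Op, MPI_Comm);

static allreduce_fn algs[NUM_ALGS] = { alg_native, alg_reduce_bcast };

/* Adaptive allreduce: probe each candidate on real invocations, then
 * commit to the one with the lowest accumulated time. */
int star_allreduce(const void *sb, void *rb, int n, MPI_Datatype t,
                   MPI_Op op, MPI_Comm comm)
{
    static int    calls = 0;
    static int    best  = -1;
    static double cost[NUM_ALGS] = { 0.0 };

    if (best >= 0)                       /* tuning finished: use winner */
        return algs[best](sb, rb, n, t, op, comm);

    int cur = calls / TRIALS_PER_ALG;    /* candidate being measured */

    double t0  = MPI_Wtime();
    int err    = algs[cur](sb, rb, n, t, op, comm);
    double mine = MPI_Wtime() - t0;

    /* Score by the slowest process's time; the MAX-reduction also makes
     * the measurement identical on all ranks, so every process later
     * commits to the same winner (required for a collective). */
    double worst;
    MPI_Allreduce(&mine, &worst, 1, MPI_DOUBLE, MPI_MAX, comm);
    cost[cur] += worst;

    if (++calls == NUM_ALGS * TRIALS_PER_ALG) {   /* pick the winner */
        double min = DBL_MAX;
        for (int i = 0; i < NUM_ALGS; i++)
            if (cost[i] < min) { min = cost[i]; best = i; }
    }
    return err;
}
```

This sketch simplifies in two notable ways: the real STAR-MPI routines maintain a larger set of candidate algorithms per operation, and they tune per call site and message size rather than assuming, as the static counters above do, that every invocation is comparable. The extra MPI_Allreduce used for timing adds overhead only during the tuning phase, which is consistent with the paper's point that measurement costs are paid once and amortized over the run.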
