A decomposition approach for optimizing the performance of MPI libraries

MPI provides a portable message-passing interface for many parallel execution platforms, but it can lead to inefficiencies on some platforms and for some applications. In this article, we show that the performance of both standard and vendor-specific MPI libraries can be improved by an orthogonal organization of the processors in 2D or 3D meshes and by decomposing the collective communication operations into several phases. We describe an adaptive approach with a configuration phase that determines, for a specific execution platform and a specific MPI library, which decomposition leads to the best performance; the best choice may also depend on the number of processors and the size of the messages to be transferred. The decomposition approach has been implemented as a library extension that is called for each activation of a collective MPI operation. This has the advantage that neither the application programs nor the MPI library need to be changed, while still leading to significant performance improvements for many collective MPI operations.
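To make the decomposition idea concrete, the following is a minimal pure-Python simulation (not the paper's actual library, and the helper names are ours) of how a broadcast over p = rows × cols processes arranged in a 2D mesh can be split into two phases: a broadcast along the root's column, followed by broadcasts along each row. In a real MPI implementation, the row and column groups would be hypothetical sub-communicators created with `MPI_Comm_split`, and each phase would be a `MPI_Bcast` on one of them.

```python
def mesh_coords(rank, cols):
    """Map a linear rank to (row, col) coordinates in a row-major 2D mesh."""
    return divmod(rank, cols)

def two_phase_broadcast(values, root, rows, cols):
    """Simulate broadcasting values[root] to all ranks of a rows x cols mesh.

    Phase 1: the root sends along its own column.
    Phase 2: every rank in that column sends along its row.
    Returns the buffer each rank would hold afterwards.
    """
    assert len(values) == rows * cols
    buf = list(values)
    _, root_col = mesh_coords(root, cols)

    # Phase 1: column broadcast -- every rank in the root's column
    # receives the root's value.
    for r in range(rows):
        buf[r * cols + root_col] = buf[root]

    # Phase 2: row broadcasts -- each column rank forwards along its row.
    for r in range(rows):
        src = r * cols + root_col
        for c in range(cols):
            buf[r * cols + c] = buf[src]

    return buf
```

For example, `two_phase_broadcast(list(range(6)), root=0, rows=2, cols=3)` yields a buffer of all zeros: every rank ends up with the root's value after only two rounds of smaller broadcasts, which is the kind of phase structure whose parameters (mesh shape, phase count) the configuration phase would tune per platform.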
