Self-Consistent MPI Performance Guidelines

Message passing using the Message-Passing Interface (MPI) is at present the most widely adopted framework for programming parallel applications for distributed memory and clustered parallel systems. For reasons of (universal) implementability, the MPI standard does not state any specific performance guarantees, but users expect MPI implementations to deliver good and consistent performance in the sense of efficient utilization of the underlying parallel (communication) system. For performance portability reasons, users also naturally desire communication optimizations performed on one parallel platform with one MPI implementation to be preserved when switching to another MPI implementation on another platform. We address the problem of ensuring performance consistency and portability by formulating performance guidelines and conditions that are desirable for good MPI implementations to fulfill. Instead of prescribing a specific performance model (which may be realistic on some systems, under some MPI protocol and algorithm assumptions, etc.), we formulate these guidelines by relating the performance of various aspects of the semantically strongly interrelated MPI standard to each other. Common-sense expectations, for instance, suggest that no MPI function should perform worse than a combination of other MPI functions that implement the same functionality, no specialized function should perform worse than a more general function that can implement the same functionality, no function with weak semantic guarantees should perform worse than a similar function with stronger semantics, and so on. Such guidelines may enable implementers to provide higher quality MPI implementations, minimize performance surprises, and eliminate the need for users to make special, nonportable optimizations by hand. We introduce and semiformalize the concept of self-consistent performance guidelines for MPI, and provide a (nonexhaustive) set of such guidelines in a form that could be automatically verified by benchmarks and experiment management tools. We present experimental results that show cases where guidelines are not satisfied in common MPI implementations, thereby indicating room for improvement in today's MPI implementations.
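To make the idea concrete, the sketch below shows how one such guideline could be checked by a benchmark in the spirit the abstract describes: MPI_Allreduce should not be slower than the composition MPI_Reduce followed by MPI_Bcast, which implements the same functionality. This is only an illustrative sketch, not the paper's benchmark code; the repetition count REPS, the message size COUNT, the timing scheme (barrier-synchronized, maximum over ranks), and the helper functions time_op, allreduce_op, and reduce_bcast_op are all assumptions introduced here.

/* Sketch: check the guideline MPI_Allreduce(n) <= MPI_Reduce(n) + MPI_Bcast(n).
 * Illustrative only; REPS, COUNT and the timing scheme are assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS  100     /* assumed repetition count */
#define COUNT 65536   /* assumed message size in doubles */

/* Time one collective pattern: barrier-synchronized start, average over
 * REPS repetitions, completion time taken as the maximum over all ranks. */
static double time_op(void (*op)(double *, double *, int, MPI_Comm),
                      double *in, double *out, MPI_Comm comm) {
    MPI_Barrier(comm);
    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++)
        op(in, out, COUNT, comm);
    double local = MPI_Wtime() - t0, global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
    return global / REPS;
}

/* The specialized function ... */
static void allreduce_op(double *in, double *out, int n, MPI_Comm comm) {
    MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, comm);
}

/* ... and a composition of other MPI functions with the same semantics. */
static void reduce_bcast_op(double *in, double *out, int n, MPI_Comm comm) {
    MPI_Reduce(in, out, n, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Bcast(out, n, MPI_DOUBLE, 0, comm);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *in  = malloc(COUNT * sizeof(double));
    double *out = malloc(COUNT * sizeof(double));
    for (int i = 0; i < COUNT; i++) in[i] = 1.0;

    double t_allreduce = time_op(allreduce_op, in, out, MPI_COMM_WORLD);
    double t_composed  = time_op(reduce_bcast_op, in, out, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("MPI_Allreduce:          %g s\n", t_allreduce);
        printf("MPI_Reduce + MPI_Bcast: %g s\n", t_composed);
        printf("guideline %s at this message size\n",
               t_allreduce <= t_composed ? "holds" : "is violated");
    }

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}

A full verification along the lines proposed in the paper would of course sweep over message sizes, process counts, and communicators rather than a single configuration, and would be driven by an experiment management tool rather than a standalone program.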
