Toward Performance Models of MPI Implementations for Understanding Application Scaling Issues

Designing and tuning parallel applications with MPI, particularly at large scale, requires understanding the performance implications of different choices of algorithms and implementation options. Which algorithm is better depends in part on the performance of the different possible communication approaches, which in turn can depend on both the system hardware and the MPI implementation. In the absence of detailed performance models for different MPI implementations, application developers often must select methods and tune codes without the means to realistically estimate the achievable performance and rationally defend their choices. In this paper, we advocate the construction of more useful performance models that take into account limitations on network-injection rates and effective bisection bandwidth. Since collective communication plays a crucial role in enabling scalability, we also provide analytical models for scalability of collective communication algorithms, such as broadcast, allreduce, and all-to-all. We apply these models to an IBM Blue Gene/P system and compare the analytical performance estimates with experimentally measured values.
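The analytical models the abstract refers to are not reproduced here, but the general style of such collective-communication models can be sketched with the classic latency-bandwidth ("Hockney") cost T = α + βm for an m-byte message, composed over the steps of an algorithm. The α/β values and the specific algorithm choices below are illustrative assumptions, not parameters or results from this paper:

```python
import math

# Illustrative model parameters (assumed, not measured):
# alpha = per-message latency in seconds, beta = seconds per byte.
ALPHA = 2e-6
BETA = 1e-9

def ptp_time(m, alpha=ALPHA, beta=BETA):
    """Point-to-point time for an m-byte message: T = alpha + beta * m."""
    return alpha + beta * m

def bcast_binomial(p, m, alpha=ALPHA, beta=BETA):
    """Binomial-tree broadcast among p processes:
    ceil(log2 p) sequential point-to-point steps of the full message."""
    return math.ceil(math.log2(p)) * (alpha + beta * m)

def allreduce_reduce_scatter(p, m, alpha=ALPHA, beta=BETA):
    """Reduce-scatter + allgather allreduce (bandwidth-efficient for large m):
    about 2*log2(p) latency terms and roughly 2*(p-1)/p * m bytes moved
    per process; the local reduction cost is omitted for simplicity."""
    return 2 * math.log2(p) * alpha + 2 * (p - 1) / p * m * beta
```

Models of this form predict, for example, that broadcast latency grows logarithmically in the process count while large-message allreduce time is dominated by the bandwidth term; the paper's contribution is to refine such estimates with injection-rate and effective-bisection-bandwidth limits.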
