A Study of Process Arrival Patterns for MPI Collective Operations

A process arrival pattern describes the timing with which different processes arrive at an MPI collective operation, and it can have a significant impact on the operation's performance. In this work, we characterize process arrival patterns in a set of MPI programs on two common cluster platforms, use a micro-benchmark to study process arrival patterns in MPI programs with balanced loads, and investigate the impact of different process arrival patterns on collective algorithms. Our results show that (1) the differences between the times when different processes arrive at a collective operation are usually large enough to affect performance; (2) application developers generally cannot control the process arrival patterns in their MPI programs in a cluster environment: balancing loads at the application level does not balance the process arrival patterns; and (3) the performance of collective communication algorithms is sensitive to process arrival patterns. These results indicate that the process arrival pattern is an important factor that must be taken into consideration when developing and optimizing MPI collective routines. We propose a scheme that achieves high performance under different process arrival patterns, and demonstrate that by explicitly considering process arrival patterns, MPI collective routines more efficient than the current ones can be obtained.
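To make the notion concrete, the sketch below (an illustration, not taken from the paper) quantifies a process arrival pattern from per-process arrival timestamps at a collective call. The function name `arrival_pattern_stats` and the example timestamps are hypothetical; in practice such timestamps would be gathered with per-process timers (e.g. `MPI_Wtime`) just before the collective.

```python
def arrival_pattern_stats(arrival_times):
    """Given the arrival times (in seconds) of each process at a
    collective operation, return the total imbalance time of the
    pattern and how long each process idles before the last arrival.

    This is a hypothetical helper for illustration; the arrival
    times themselves would come from per-process timers.
    """
    last = max(arrival_times)    # time when the slowest process arrives
    first = min(arrival_times)   # time when the fastest process arrives
    imbalance = last - first     # total imbalance time of the pattern
    # How long each process waits before all peers have arrived:
    waits = [last - t for t in arrival_times]
    return imbalance, waits

# Example: four processes arriving at slightly different times.
times = [0.000, 0.012, 0.003, 0.020]
imbalance, waits = arrival_pattern_stats(times)
```

When such an imbalance is comparable to or larger than the time to transfer one message, it can change which collective algorithm performs best, which is why the abstract argues arrival patterns must be considered when tuning collective routines.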
