Analyzing the Impact of Overlap, Offload, and Independent Progress for Message Passing Interface Applications

The overlap of computation and communication has long been considered a significant performance benefit for applications. Similarly, the ability of the Message Passing Interface (MPI) to make independent progress (that is, to progress outstanding communication operations while the application is not in the MPI library) is also believed to yield performance benefits. Using an intelligent network interface to offload the work required to support overlap and independent progress is thought to be an ideal solution, but the benefits of this approach have not been studied in depth at the application level. Such analysis is complicated by the fact that most MPI implementations do not sufficiently support overlap or independent progress. Recent work demonstrated a quantifiable advantage for an MPI implementation that uses offload to provide overlap and independent progress. That study was conducted on two different platforms, each with two MPI implementations (one with and one without independent progress), so identical network hardware and virtually identical software stacks were compared. Furthermore, one of the platforms, ASCI Red, allows features such as overlap and offload to be separated. This paper therefore extends the previous work by further qualifying the source of the performance advantage: offload, overlap, or independent progress.
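
Below is a minimal sketch, in C (the usual language of MPI codes), of the overlap pattern the abstract refers to: communication is posted with nonblocking calls, computation proceeds, and completion is waited on afterward. The ring exchange, message size, and dummy compute loop are illustrative assumptions, not taken from the paper; whether the transfer actually advances during the compute loop depends on the MPI implementation's support for independent progress (for example, via NIC offload).

```c
/* Minimal sketch of computation/communication overlap with nonblocking MPI.
 * The buffer size and compute kernel are placeholders, not from the paper.
 * Build with: mpicc overlap.c -o overlap */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)   /* message length in doubles (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc(N * sizeof(double));
    double *recvbuf = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) sendbuf[i] = (double)rank;

    int right = (rank + 1) % size;          /* simple ring exchange */
    int left  = (rank - 1 + size) % size;
    MPI_Request reqs[2];

    /* Post the communication first ... */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... then compute while the messages are (ideally) in flight.
     * With independent progress, e.g., provided by NIC offload, the
     * transfer advances even though no MPI call is made inside this loop;
     * without it, much of the transfer may be deferred until MPI_Waitall. */
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += sendbuf[i] * 0.5;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0)
        printf("compute result %f, first received value %f\n", sum, recvbuf[0]);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Without independent progress, much of the data movement in a pattern like this is deferred until the MPI_Waitall call, which is precisely the effect the paper sets out to quantify at the application level.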
