Unifying UPC and MPI runtimes: experience with MVAPICH

Unified Parallel C (UPC) is an emerging parallel programming language based on a shared-memory paradigm, while MPI has been the dominant and widely ported parallel programming model for the past two decades. Real-life scientific applications represent a large investment by domain scientists, and many scientists choose MPI because it is considered low-risk. It is therefore unlikely that entire applications will be rewritten in the emerging UPC language (or the PGAS paradigm) in the near future; it is far more likely that parts of these applications will be converted to the newer models. This requires the underlying system software to support both UPC and MPI simultaneously. Unfortunately, the current state of the art in UPC and MPI interoperability leaves much to be desired, both in performance and in ease of use. In this paper, we propose an "Integrated Native Communication Runtime" (INCR) for MPI and UPC communication on InfiniBand clusters. Our library is capable of supporting UPC and MPI communication simultaneously. It is based on the widely used MVAPICH (MPI over InfiniBand) Aptus runtime, which is known to scale to tens of thousands of cores. Our evaluation reveals that INCR delivers equal or better performance than the existing UPC runtime, GASNet, over InfiniBand verbs. With the UPC NAS benchmarks CG and MG (class B) at 128 processes, INCR outperforms the current GASNet implementation by 10% and 23%, respectively.
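
To make the interoperability requirement concrete, below is a minimal sketch of the kind of hybrid UPC + MPI program such a unified runtime is meant to support: UPC threads publish data through the partitioned global address space, and the same processes then participate in an MPI collective. The reduction kernel, file name, and compile command are illustrative assumptions, not code from the paper; the point is only that both programming models issue communication from one process and, with a unified runtime such as INCR, can share the underlying InfiniBand resources.

```c
/* hybrid.c -- hypothetical hybrid UPC + MPI kernel (illustrative only).
 * Assumes a UPC compiler whose runtime interoperates with MPI,
 * e.g. something like: upcc hybrid.c -o hybrid (flags are installation-specific). */
#include <upc.h>
#include <mpi.h>
#include <stdio.h>

/* One slot per UPC thread in the global address space. */
shared double partial[THREADS];

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* UPC side: each thread publishes a local contribution. */
    partial[MYTHREAD] = (double)MYTHREAD;
    upc_barrier;

    /* MPI side: a collective over the same set of processes.
     * With a unified runtime, both calls go through one communication layer. */
    double local = partial[MYTHREAD], sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (MYTHREAD == 0)
        printf("sum = %f\n", sum);

    MPI_Finalize();
    return 0;
}
```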

[1] Dan Bonachea. GASNet Specification, v1.1, 2002.

[2] David H. Bailey, et al. The NAS Parallel Benchmarks, 1991, Int. J. High Perform. Comput. Appl.

[3] Dhabaleswar K. Panda, et al. MVAPICH-Aptus: Scalable high-performance multi-transport MPI over InfiniBand, 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[4] Sayantan Sur, et al. High performance MPI design using unreliable datagram for ultra-scale InfiniBand clusters, 2007, ICS '07.

[5] David B. Loveman. High performance Fortran, 1993, IEEE Parallel & Distributed Technology: Systems & Applications.

[6] Dan Bonachea, et al. A new DMA registration strategy for pinning-based high performance networks, 2003, Proceedings International Parallel and Distributed Processing Symposium.

[7] Dhabaleswar K. Panda, et al. High Performance RDMA-Based MPI Implementation over InfiniBand, 2003, ICS '03.

[8] Dhabaleswar K. Panda, et al. Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach, 2007, Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07).

[9] Katherine A. Yelick, et al. Optimizing bandwidth limited problems using one-sided communication and overlap, 2005, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[10] Dhabaleswar K. Panda, et al. Scalable MPI design over InfiniBand using eXtended Reliable Connection, 2008, 2008 IEEE International Conference on Cluster Computing.

[11] William Gropp, et al. Designing a Common Communication Subsystem, 2005, PVM/MPI.

[12] Katherine A. Yelick, et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[13] Sayantan Sur, et al. Shared receive queue based scalable MPI design for InfiniBand clusters, 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[14] Vivek Sarkar, et al. X10: an object-oriented approach to non-uniform cluster computing, 2005, OOPSLA '05.

[15] Bradford L. Chamberlain, et al. Parallel Programmability and the Chapel Language, 2007, Int. J. High Perform. Comput. Appl.

[16] Jason Duell, et al. Problems with using MPI 1.1 and 2.0 as compilation targets for parallel language implementations, 2004, Int. J. High Perform. Comput. Netw.

[17] José Nelson Amaral, et al. Shared memory programming for large scale machines, 2006, PLDI '06.