Near-Optimal Rendezvous Protocols for RDMA-Enabled Clusters

Optimizing Message Passing Interface (MPI) point-to-point communication for large messages is critical because such operations account for most of the communication in MPI applications. Remote Direct Memory Access (RDMA) allows one-sided data transfer and provides great flexibility in designing efficient communication protocols for large messages. However, achieving high performance on RDMA-enabled clusters remains challenging because of the complexity of both the communication protocols and the protocol invocation scenarios. In this work, we investigate a profile-driven, compiler-assisted protocol customization approach for efficient communication on RDMA-enabled clusters. We analyze existing protocols and show that they are not ideal in many situations. By leveraging RDMA capabilities, we develop a set of protocols that together provide near-optimal performance across all protocol invocation scenarios, so that protocol customization achieves near-optimal performance when the appropriate protocol is used for each communication. Finally, we evaluate the potential benefits of protocol customization using micro-benchmarks and application benchmarks. The results demonstrate that the proposed protocols can substantially outperform traditional rendezvous protocols in many situations and that protocol customization can significantly improve MPI communication performance.
