Zero-copy protocol for MPI using InfiniBand unreliable datagram

Memory copies are widely regarded as detrimental to the overall performance of applications. High-performance systems make every effort to reduce the number of memory copies, especially those incurred during message passing. State-of-the-art message-passing libraries, such as MPI implementations, utilize user-level networking protocols to reduce or eliminate memory copies. InfiniBand is an emerging user-level networking technology that is gaining rapid acceptance in several domains, including HPC. To eliminate message copies while transferring large messages, MPI libraries over InfiniBand employ "zero-copy" protocols that use remote direct memory access (RDMA). RDMA is available only in the connection-oriented transports of InfiniBand, such as reliable connection (RC). However, the unreliable datagram (UD) transport of InfiniBand has been shown to scale much better than the RC transport in terms of memory usage. Ideally, it should be possible to perform zero-copy message transfers over scalable transports such as UD. In this paper, we present the design of a novel zero-copy protocol built directly over the scalable UD transport, thereby achieving the twin objectives of scalability and high performance. Our analysis shows that unidirectional messaging bandwidth can be within 9% of what is achievable over RC for messages of 64 KB and above. Application benchmark evaluation shows that our design delivers a 21% speedup for the LAMMPS in.rhodo dataset over a copy-based approach, with performance within 1% of RC.
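As a concrete illustration (not taken from the paper's implementation), the sketch below uses the standard libibverbs API to show how a receive can be posted directly into an application buffer on a UD queue pair, so that an incoming datagram lands in place without an intermediate copy. On UD, every received datagram is preceded by a 40-byte Global Routing Header (GRH), so the scatter/gather list pairs a small scratch entry for the GRH with the user buffer itself. The function name and the prior setup of the protection domain and queue pair are assumptions for the example; error handling, registration caching, and fragmentation of large messages are omitted.

/* Minimal sketch (assumed helper, not the authors' protocol): post a
 * receive directly into an application buffer on an InfiniBand UD QP.
 * The 40-byte GRH that precedes every UD receive is absorbed by a
 * separate scratch scatter entry, so the payload lands in app_buf
 * with no intermediate copy.  pd and qp are assumed to be set up
 * elsewhere; memory registrations would normally be cached.
 */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_zero_copy_recv(struct ibv_pd *pd, struct ibv_qp *qp,
                        void *app_buf, size_t len)
{
    static char grh_scratch[40];                  /* absorbs the UD GRH */
    struct ibv_mr *grh_mr = ibv_reg_mr(pd, grh_scratch, sizeof(grh_scratch),
                                       IBV_ACCESS_LOCAL_WRITE);
    struct ibv_mr *buf_mr = ibv_reg_mr(pd, app_buf, len,
                                       IBV_ACCESS_LOCAL_WRITE);
    if (!grh_mr || !buf_mr)
        return -1;

    struct ibv_sge sge[2] = {
        { .addr = (uintptr_t)grh_scratch, .length = sizeof(grh_scratch),
          .lkey = grh_mr->lkey },
        { .addr = (uintptr_t)app_buf, .length = (uint32_t)len,
          .lkey = buf_mr->lkey },                 /* payload lands here */
    };

    struct ibv_recv_wr wr, *bad_wr;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = (uintptr_t)app_buf;              /* identify the buffer */
    wr.sg_list = sge;
    wr.num_sge = 2;

    return ibv_post_recv(qp, &wr, &bad_wr);
}

Because a UD datagram cannot exceed the path MTU (at most 4 KB), a zero-copy transfer of a large message would post many such receives, one per fragment, and handle reliability in software; RC, by contrast, could move the whole message with a single RDMA write at the cost of per-connection state.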
