High performance RDMA-based MPI implementation over InfiniBand

Although the InfiniBand Architecture is relatively new in high performance computing, it offers many features that help improve the performance of communication subsystems. One of these features is Remote Direct Memory Access (RDMA). In this paper, we propose a new design of MPI over InfiniBand that brings the benefit of RDMA not only to large messages, but also to small and control messages. We also achieve better scalability by exploiting application communication patterns and by combining send/receive operations with RDMA operations. Our RDMA-based MPI implementation currently delivers a latency of 6.8 microseconds for small messages and a peak bandwidth of 871 million bytes (831 megabytes) per second. Performance evaluation at the MPI level shows that, for small messages, our RDMA-based design can reduce latency by 24%, increase bandwidth by over 104%, and reduce host overhead by up to 22%. For large messages, we improve performance by reducing the time spent transferring control messages. We also show that our new design benefits MPI collective communication and the NAS Parallel Benchmarks.
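The two peak-bandwidth figures quoted above describe the same rate in different units. The short check below is a sketch under one assumption that the abstract does not state explicitly: that the second figure counts a megabyte as 2^20 bytes. The function name is hypothetical and used only for illustration.

```python
def bytes_per_s_to_mib_per_s(rate_bytes_per_s: float) -> float:
    """Convert a rate in bytes/s to units of 2**20 bytes/s (assumed meaning of 'Mega Bytes')."""
    return rate_bytes_per_s / 2**20


# Peak bandwidth from the abstract, in decimal million bytes per second.
peak_bytes_per_s = 871e6

# Rounds to 831 under the 2**20 assumption, matching the parenthesized figure.
print(round(bytes_per_s_to_mib_per_s(peak_bytes_per_s)))
```

Under that assumption the two numbers agree after rounding, so they should be read as one measurement and not as two separate results.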
