Fast and scalable MPI-level broadcast using InfiniBand's hardware multicast support

Summary form only given. Modern high-performance applications require efficient and scalable collective communication operations. Currently, most collective operations are implemented on top of point-to-point operations. We propose to use hardware multicast in InfiniBand to design fast and scalable broadcast operations in MPI. InfiniBand supports multicast only with its unreliable datagram (UD) transport service, which makes it difficult for an upper layer such as MPI to use directly. To bridge the semantic gap between MPI_Bcast and InfiniBand hardware multicast, we have designed and implemented a substrate on top of InfiniBand that provides functionality such as reliability, in-order delivery, and large-message handling. By using a sliding-window-based design, we improve MPI_Bcast latency by moving most of the substrate's overhead out of the communication critical path. By using optimizations such as a new co-root-based scheme and delayed ACKs, we further balance and reduce this overhead. We have also addressed many detailed design issues, including buffer management, efficient handling of out-of-order and duplicate messages, timeout and retransmission, flow control, and RDMA-based ACK communication. Our performance evaluation shows that on an 8-node cluster testbed, hardware-multicast-based designs can improve MPI broadcast latency by up to 58% and broadcast throughput by up to 112%. The proposed solutions are also much more tolerant of process skew than the current point-to-point-based implementation. We have also developed analytical models for our multicast-based schemes and validated them with experimental numbers. Our analytical models show that, with the new designs, one can achieve an MPI broadcast latency of 20.0 µs for small messages and 40.0 µs for a one-MTU message (around 1836 bytes of data payload) on a 1024-node cluster.
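To illustrate the kind of receiver-side logic the substrate described above must provide over UD multicast, here is a minimal, hypothetical sketch of sliding-window reliability with in-order delivery, duplicate suppression, and cumulative (delayed) ACKs. All names and the window policy are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: receiver-side sliding-window reliability over an
# unreliable-datagram multicast. Packets carry sequence numbers; the
# receiver buffers out-of-order packets within a fixed window, delivers
# contiguous prefixes in order, drops duplicates, and returns a
# cumulative ACK only when the in-order prefix advances (delayed ACK).
class ReliableMulticastReceiver:
    def __init__(self, window_size=16):
        self.window_size = window_size
        self.expected_seq = 0      # next in-order sequence number
        self.out_of_order = {}     # seq -> payload, buffered inside window
        self.delivered = []        # payloads passed up to the MPI layer

    def on_packet(self, seq, payload):
        """Process one multicast packet; return a cumulative ACK seq or None."""
        if seq < self.expected_seq:
            # Duplicate (e.g. sender retransmitted): re-ACK, do not redeliver.
            return self.expected_seq - 1
        if seq >= self.expected_seq + self.window_size:
            # Outside the window: drop silently; sender will retransmit.
            return None
        self.out_of_order[seq] = payload
        # Drain any now-contiguous prefix, preserving in-order delivery.
        while self.expected_seq in self.out_of_order:
            self.delivered.append(self.out_of_order.pop(self.expected_seq))
            self.expected_seq += 1
        # Delayed ACK: acknowledge only when the in-order prefix grew.
        if seq < self.expected_seq:
            return self.expected_seq - 1
        return None
```

For example, receiving sequence numbers 0, 2, 1 delivers the payloads in order 0, 1, 2, with no ACK sent for the gap-creating packet 2; a retransmitted packet 0 afterwards is recognized as a duplicate and simply re-ACKed. In the paper's design, such ACKs travel back over RDMA rather than the multicast channel.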
