Efficient Shared Memory and RDMA Based Design for MPI_Allgather over InfiniBand

MPI_Allgather is an important collective operation used in applications such as matrix multiplication and basic linear algebra kernels. With next-generation systems going multi-core, deployed clusters will host a high process count per node. Traditional Allgather implementations use two separate channels: a network channel for communication across nodes and a shared memory channel for intra-node communication. An important drawback of this approach is that communication buffers are not shared across the two channels, which results in extra copying of data within a node and thus sub-optimal performance, especially for collectives involving a large number of processes with a high process density per node. In this paper, we propose a design that eliminates these extra copy costs by sharing the communication buffers for both intra-node and inter-node communication. Further, we optimize performance by overlapping network operations with intra-node shared memory copies. On a 32-node, 2-way cluster, we observe an improvement of up to a factor of two for MPI_Allgather compared to the original implementation. We also observe overlap benefits of up to 43% for the 32x2 process configuration.
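
To make the collective being optimized concrete, the sketch below shows a minimal MPI_Allgather call in C. It only illustrates the semantics of the operation (each rank contributes a block and receives the concatenation of all blocks); the block size, datatypes, and use of MPI_COMM_WORLD are illustrative assumptions, not details of the paper's shared-buffer or RDMA design.

    /* Minimal MPI_Allgather usage sketch.
     * Build with an MPI C compiler (e.g. mpicc) and launch with mpirun. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int block = 4;  /* elements contributed per rank (illustrative) */
        int *sendbuf = malloc(block * sizeof(int));
        int *recvbuf = malloc((size_t)block * size * sizeof(int));

        for (int i = 0; i < block; i++)
            sendbuf[i] = rank * block + i;  /* rank-specific payload */

        /* After the call, every rank holds size * block elements,
         * i.e. the blocks of all ranks in rank order. */
        MPI_Allgather(sendbuf, block, MPI_INT,
                      recvbuf, block, MPI_INT, MPI_COMM_WORLD);

        if (rank == 0) {
            for (int i = 0; i < block * size; i++)
                printf("%d ", recvbuf[i]);
            printf("\n");
        }

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

In the design described above, the gathered blocks would reside in buffers shared between the intra-node (shared memory) and inter-node (RDMA) paths, avoiding the extra intra-node copy that a two-channel implementation of this call would otherwise incur.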
