Scaling alltoall collective on multi-core systems

MPI_Alltoall is one of the most communication-intensive collective operations used in many parallel applications. Recently, the supercomputing arena has witnessed phenomenal growth of commodity clusters built with InfiniBand and multi-core systems. In this context, it is important to optimize this operation for these emerging clusters to allow for good application scaling. However, optimizing MPI_Alltoall on these systems is not a trivial task. The InfiniBand architecture allows for varying implementations of the network protocol stack: for example, the protocol can be fully on-loaded to a host processing core, off-loaded onto the NIC, or handled by any combination of the two. Understanding the characteristics of these different implementations is critical to optimizing a communication-intensive operation such as MPI_Alltoall. In this paper, we systematically study these different architectures and propose new schemes for MPI_Alltoall tailored to them. Specifically, we demonstrate that no single scheme performs optimally on all of these architectures. For example, on-loaded implementations can exploit multiple cores to achieve better network utilization, whereas offload interfaces can use aggregation to avoid congestion on multi-core systems. We employ shared-memory aggregation techniques in these schemes and elucidate their impact on multi-core systems. The proposed design reduces MPI_Alltoall time by 55% for 512-byte messages and speeds up the CPMD application by 33%.
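
To make the aggregation idea concrete, the sketch below illustrates one common form of node-level aggregation for alltoall: a single "leader" rank per node collects the messages of its node-local peers, performs one large alltoall among leaders only, and scatters the results back. This is not the paper's implementation; it uses intra-node MPI gather/scatter in place of dedicated shared-memory buffers, and it assumes ranks are block-distributed across nodes with the same number of ranks per node. The function name and buffer layout are illustrative only.

```c
/* Hypothetical sketch: leader-based aggregation for an alltoall exchange on
 * multi-core nodes.  Assumes block distribution of ranks across nodes and an
 * equal number of ranks per node. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Exchanges 'msg' bytes between every pair of ranks in 'world'. */
static void alltoall_leader_aggregated(const char *sendbuf, char *recvbuf,
                                       int msg, MPI_Comm world)
{
    int rank, nprocs;
    MPI_Comm_rank(world, &rank);
    MPI_Comm_size(world, &nprocs);

    /* Node-local communicator; node rank 0 acts as the leader. */
    MPI_Comm node;
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &node);
    int nrank, nsize;
    MPI_Comm_rank(node, &nrank);
    MPI_Comm_size(node, &nsize);

    /* Communicator containing only the leaders. */
    MPI_Comm leaders;
    MPI_Comm_split(world, nrank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    int nnodes = nprocs / nsize;            /* equal ranks per node assumed  */
    size_t row = (size_t)nprocs * msg;      /* one rank's full send buffer   */

    char *gathered = NULL, *packed = NULL, *exchanged = NULL, *sorted = NULL;
    if (nrank == 0) {
        gathered  = malloc(row * nsize);
        packed    = malloc(row * nsize);
        exchanged = malloc(row * nsize);
        sorted    = malloc(row * nsize);
    }

    /* 1. Aggregate: the leader collects every local rank's send buffer.     */
    MPI_Gather(sendbuf, (int)row, MPI_BYTE,
               gathered, (int)row, MPI_BYTE, 0, node);

    if (nrank == 0) {
        /* 2. Pack so that all data destined for node j is contiguous:
         *    packed[j][local src i][local dst k], blocks of 'msg' bytes.    */
        for (int j = 0; j < nnodes; j++)
            for (int i = 0; i < nsize; i++)
                memcpy(packed + ((size_t)j * nsize + i) * nsize * msg,
                       gathered + (size_t)i * row + (size_t)j * nsize * msg,
                       (size_t)nsize * msg);

        /* 3. One large alltoall among leaders only (fewer, bigger messages). */
        MPI_Alltoall(packed, nsize * nsize * msg, MPI_BYTE,
                     exchanged, nsize * nsize * msg, MPI_BYTE, leaders);

        /* 4. Re-sort by local destination k, ordered by global source rank:
         *    sorted[k][src node p][src local i], blocks of 'msg' bytes.     */
        for (int k = 0; k < nsize; k++)
            for (int p = 0; p < nnodes; p++)
                for (int i = 0; i < nsize; i++)
                    memcpy(sorted + (size_t)k * row
                                  + ((size_t)p * nsize + i) * msg,
                           exchanged + ((size_t)p * nsize + i) * nsize * msg
                                     + (size_t)k * msg,
                           (size_t)msg);
    }

    /* 5. Scatter each local rank's result back from the leader.             */
    MPI_Scatter(sorted, (int)row, MPI_BYTE,
                recvbuf, (int)row, MPI_BYTE, 0, node);

    if (nrank == 0) {
        free(gathered); free(packed); free(exchanged); free(sorted);
        MPI_Comm_free(&leaders);
    }
    MPI_Comm_free(&node);
}
```

The trade-off this sketch captures is the one discussed above: aggregation replaces many small inter-node messages with a few large ones, which helps offload interfaces avoid congestion, at the cost of extra intra-node copies and serialization at the leader.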
