Paper Information - Optimizing MPI Alltoall Communication of Large Messages in Multicore Clusters

Optimizing MPI Alltoall Communication of Large Messages in Multicore Clusters

MPI Alltoall communication is widely used in high performance computing (HPC) applications. In Alltoall communication, each process sends a distinct message to every other participating process. In multicore clusters, the processes within a node contend simultaneously for that node's network resources during Alltoall communication. Alltoall communication of large messages also requires many small synchronization messages. Under contention, the latency of these messages is orders of magnitude larger than without contention. As a result, the synchronization overhead increases significantly and accounts for a large proportion of the total latency of Alltoall communication. In this paper, we analyse the considerable overhead of synchronization messages. Based on the analysis, we present an optimization that reduces the number of synchronization messages from 3N to 2√N. Evaluations on a 240-core cluster show that performance improves by a nearly constant ratio, which is determined mainly by message size and is independent of system scale. The performance of Alltoall communication is improved by 25% for 32 KB and 64 KB messages. For an FFT application, performance is improved by 20%.
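The Alltoall exchange pattern and the two message counts quoted in the abstract can be illustrated with a minimal sketch. It simulates the data movement in plain Python rather than calling MPI, and it does not reproduce the paper's optimization, which the abstract does not describe. The function names `alltoall` and `sync_message_counts` are hypothetical, and treating N as the number of processes is an assumption, since the abstract does not define N.

```python
import math


def alltoall(send_buffers):
    """Simulate Alltoall semantics without MPI.

    send_buffers[src][dst] is the distinct block that process `src`
    sends to process `dst`. The result recv[dst][src] is the block
    that process `dst` received from process `src`.
    """
    n = len(send_buffers)
    return [[send_buffers[src][dst] for src in range(n)] for dst in range(n)]


def sync_message_counts(n):
    """Evaluate the two expressions quoted in the abstract: 3N and 2*sqrt(N).

    Assumption: N is read here as the number of processes.
    """
    return 3 * n, 2 * math.sqrt(n)


# Four simulated processes, each holding one labelled block per destination.
send = [[f"{src}->{dst}" for dst in range(4)] for src in range(4)]
recv = alltoall(send)

# Compare the two counts for N = 16 (an arbitrary illustrative value).
before, after = sync_message_counts(16)
print(recv[2])
print(before, after)
```

In this sketch each simulated process ends up holding one block from every process, which is the transpose of the send layout. The count comparison only evaluates the two expressions from the abstract and says nothing about how the reduction is achieved.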

Qiang Li | Ninghui Sun | Zhigang Huo
