SLOAVx: Scalable LOgarithmic AlltoallV Algorithm for Hierarchical Multicore Systems

Scientific applications use collective communication operations in the Message Passing Interface (MPI) for global synchronization and data exchange. Alltoall and AlltoallV are two important collective operations: MPI jobs use them to exchange messages among all MPI processes. AlltoallV is a generalization of Alltoall that supports messages of varying sizes. However, the existing MPI AlltoallV implementation has linear complexity, i.e., each process must send messages to all other processes in the job. Such linear complexity can result in suboptimal scalability when MPI applications are deployed on millions of cores. To address this challenge, in this paper we introduce a new Scalable LOgarithmic AlltoallV algorithm, named SLOAV, for the MPI AlltoallV collective operation. SLOAV performs a global exchange of small messages of different sizes in a logarithmic number of rounds. Furthermore, given the prevalence of multicore systems with shared memory, we design a hierarchical AlltoallV algorithm on top of SLOAV, referred to as SLOAVx, that leverages shared memory within each node. Compared to SLOAV, SLOAVx significantly reduces inter-node communication, improving overall system performance and mitigating the impact of message latency. We have implemented and embedded both algorithms in Open MPI. Our evaluation on large-scale computer systems shows that for an 8-byte, 1024-process MPI AlltoallV operation, SLOAV reduces latency by as much as 86.4% compared to the state of the art, and SLOAVx further reduces SLOAV's message latency by up to 83.1% on multicore systems. In addition, experiments with the NAS Parallel Benchmarks (NPB) demonstrate that our algorithms are effective for real-world applications.
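
The logarithmic-round idea underlying SLOAV can be illustrated with the classic Bruck-style Alltoall for fixed-size blocks, which completes the global exchange in ceil(log2 P) communication rounds instead of P-1. The sketch below is a minimal, self-contained illustration of that log-round pattern only; it is not the authors' SLOAV/SLOAVx implementation, which generalizes the scheme to variable-size (AlltoallV) messages and adds the shared-memory hierarchy. The helper name bruck_alltoall_int and the one-integer-per-peer payload are illustrative assumptions.

/* bruck_alltoall.c -- minimal sketch of the classic Bruck log-round
 * Alltoall for fixed-size blocks (one int per process pair).
 * Build: mpicc -o bruck_alltoall bruck_alltoall.c
 * Run:   mpirun -np 8 ./bruck_alltoall
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Exchange one int per process pair in ceil(log2(P)) rounds instead of P-1. */
static void bruck_alltoall_int(const int *sendbuf, int *recvbuf, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int *tmp  = malloc(size * sizeof(int));
    int *pack = malloc(size * sizeof(int));

    /* Phase 1: local rotation so the block destined to rank (rank+j)%size
     * sits at index j. */
    for (int j = 0; j < size; j++)
        tmp[j] = sendbuf[(rank + j) % size];

    /* Phase 2: log-round exchange.  In the round with distance pof2 = 2^k,
     * every block whose index has bit k set travels pof2 ranks forward;
     * after all rounds each block has covered exactly the distance to its
     * destination rank. */
    for (int pof2 = 1; pof2 < size; pof2 <<= 1) {
        int dst = (rank + pof2) % size;
        int src = (rank - pof2 + size) % size;

        int nblocks = 0;
        for (int j = 0; j < size; j++)
            if (j & pof2)
                pack[nblocks++] = tmp[j];

        MPI_Sendrecv_replace(pack, nblocks, MPI_INT, dst, 0, src, 0,
                             comm, MPI_STATUS_IGNORE);

        nblocks = 0;
        for (int j = 0; j < size; j++)
            if (j & pof2)
                tmp[j] = pack[nblocks++];
    }

    /* Phase 3: inverse rotation -- the block at index j originated at rank
     * (rank - j) mod size. */
    for (int j = 0; j < size; j++)
        recvbuf[(rank - j + size) % size] = tmp[j];

    free(tmp);
    free(pack);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    for (int d = 0; d < size; d++)
        sendbuf[d] = rank * 100 + d;        /* value this rank sends to rank d */

    bruck_alltoall_int(sendbuf, recvbuf, MPI_COMM_WORLD);

    /* After the exchange, recvbuf[s] should hold s*100 + rank. */
    for (int s = 0; s < size; s++)
        if (recvbuf[s] != s * 100 + rank)
            fprintf(stderr, "rank %d: wrong value from rank %d\n", rank, s);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

The data packed in each round grows with the number of blocks carrying the relevant bit, which is why this family of algorithms pays off for small messages: latency (round count) dominates, and SLOAV trades a modest amount of extra data movement for exponentially fewer rounds.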
