A Novel Functional Partitioning Approach to Design High-Performance MPI-3 Non-blocking Alltoallv Collective on Multi-core Systems

Non-blocking collectives have recently been standardized by the Message Passing Interface (MPI) Forum. However, intelligent designs offered by MPI communication runtimes are likely to be the key factors that drive their adoption. While hardware-based solutions for non-blocking collective operations have shown promise, they require specialized hardware support and currently have several performance and scalability limitations. Alternatively, researchers have proposed software-based Functional Partitioning solutions for non-blocking collectives that rely on spare cores in each node to progress the operations. However, these designs also require additional memory resources and involve expensive copy operations. Such factors limit the overall performance and scalability benefits associated with using non-blocking collectives in MPI. In this paper, we propose a high-performance, shared-memory-backed, user-level approach based on functional partitioning to design MPI-3 non-blocking collectives. Our approach relies on using one "Communication Servlet" (CS) thread per node to seamlessly execute non-blocking collective operations on behalf of the application processes. Our design also eliminates the need for additional memory resources and expensive copy operations between the application processes and the CS. We demonstrate that our solution can deliver near-perfect computation/communication overlap for large-message, dense collective operations, such as MPI_Ialltoallv, while using just one core per node. We also study the benefits of our approach with a popular parallel 3D-FFT kernel (P3DFFT), which has been re-designed to use the MPI_Ialltoallv operation. We observe that our proposed designs can improve the performance of the P3DFFT kernel by up to 27% with 2,048 processes on the TACC Stampede system.
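To make the overlap pattern concrete, the following is a minimal C sketch of the standard MPI-3 usage model the paper targets: an application issues MPI_Ialltoallv, performs independent computation, and completes the collective with MPI_Wait. This is not the paper's Communication Servlet implementation; the chunk size and the do_independent_work() routine are illustrative placeholders.

#include <mpi.h>
#include <stdlib.h>

/* Placeholder for the application's compute phase that overlaps the collective. */
static void do_independent_work(void) { }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Illustrative per-peer counts and displacements for a vector all-to-all. */
    const int chunk = 1024;
    int *sendcounts = malloc(size * sizeof(int));
    int *recvcounts = malloc(size * sizeof(int));
    int *sdispls    = malloc(size * sizeof(int));
    int *rdispls    = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) {
        sendcounts[i] = recvcounts[i] = chunk;
        sdispls[i]    = rdispls[i]    = i * chunk;
    }
    double *sendbuf = malloc((size_t)size * chunk * sizeof(double));
    double *recvbuf = malloc((size_t)size * chunk * sizeof(double));
    for (int i = 0; i < size * chunk; i++) sendbuf[i] = (double)rank;

    /* Start the non-blocking collective ... */
    MPI_Request req;
    MPI_Ialltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                   recvbuf, recvcounts, rdispls, MPI_DOUBLE,
                   MPI_COMM_WORLD, &req);

    /* ... overlap it with independent computation ... */
    do_independent_work();

    /* ... and complete it before using recvbuf. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sendbuf); free(recvbuf);
    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    MPI_Finalize();
    return 0;
}

The degree to which do_independent_work() actually hides the communication depends on how the MPI runtime progresses the collective; the paper's contribution is to provide that progression through a dedicated per-node CS thread backed by shared memory.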
