Enabling Efficient Multithreaded MPI Communication through a Library-Based Implementation of MPI Endpoints

Modern high-speed interconnection networks are designed with capabilities to support communication from multiple processor cores. The MPI endpoints extension has been proposed to ease process and thread count tradeoffs by enabling multithreaded MPI applications to efficiently drive independent network communication. In this work, we present the first implementation of the MPI endpoints interface and demonstrate the first applications running on this new interface. We use a novel library-based design that can be layered on top of any existing, production MPI implementation. Our approach uses proxy processes to isolate threads in an MPI job, eliminating threading overheads within the MPI library and allowing threads to achieve process-like communication performance. We evaluate the performance advantages of our implementation through several benchmarks and kernels. Performance results for the Lattice QCD Dslash kernel indicate that endpoints provides up to 2.9× improvement in communication performance and 1.87× overall performance improvement over a highly optimized hybrid MPI+OpenMP baseline on 128 processors.
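
For concreteness, the sketch below illustrates the usage model that the endpoints interface targets: a hybrid MPI+OpenMP program creates several endpoint communicators per process, and each thread then drives communication through its own endpoint as though it were an independent MPI rank. The entry point shown here, MPIX_Comm_create_endpoints, follows the signature proposed for the endpoints extension but is used only for illustration; it is not part of the MPI standard, and the library-based implementation described in this work would expose an equivalent call layered over an existing MPI library.

```c
/* Minimal sketch of the MPI endpoints usage model (assumptions: the proposed
 * extension MPIX_Comm_create_endpoints is available with the signature shown;
 * names and constants here are illustrative, not a standard API). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NUM_EP 4  /* endpoints (and OpenMP threads) per MPI process -- illustrative */

int main(int argc, char **argv)
{
    int provided;
    MPI_Comm ep_comm[NUM_EP];

    /* Threads issue MPI calls concurrently, so request full thread support. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* Proposed extension (hypothetical name): collectively create NUM_EP
     * ranks ("endpoints") per process in a new endpoints communicator. */
    MPIX_Comm_create_endpoints(MPI_COMM_WORLD, NUM_EP, MPI_INFO_NULL, ep_comm);

    #pragma omp parallel num_threads(NUM_EP)
    {
        int tid = omp_get_thread_num();
        int ep_rank, ep_size;

        /* Each thread attaches to its own endpoint and communicates as if it
         * were an independent MPI process. */
        MPI_Comm_rank(ep_comm[tid], &ep_rank);
        MPI_Comm_size(ep_comm[tid], &ep_size);

        /* Example: a ring exchange driven independently by every thread. */
        int sendbuf = ep_rank, recvbuf = -1;
        MPI_Sendrecv(&sendbuf, 1, MPI_INT, (ep_rank + 1) % ep_size, 0,
                     &recvbuf, 1, MPI_INT, (ep_rank - 1 + ep_size) % ep_size, 0,
                     ep_comm[tid], MPI_STATUS_IGNORE);
    }

    for (int i = 0; i < NUM_EP; i++)
        MPI_Comm_free(&ep_comm[i]);

    MPI_Finalize();
    return 0;
}
```

Because each thread holds a distinct rank in the endpoints communicator, the underlying library is free to map every endpoint onto its own network resources (in this work, a proxy process), avoiding the lock contention that arises when many threads share a single MPI rank.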
