MT-MPI: multithreaded MPI for many-core environments

Many-core architectures, such as the Intel Xeon Phi, provide dozens of cores and hundreds of hardware threads. To utilize such architectures, application programmers are increasingly turning to hybrid programming models, in which multiple threads interact with the MPI library (frequently called "MPI+X" models). A common mode of operation for such applications uses multiple threads to parallelize the computation, while one of the threads also issues MPI operations (i.e., the MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED thread-safety modes). In MPI+OpenMP applications, this is achieved, for example, by placing MPI calls in OpenMP critical sections or outside the OpenMP parallel regions. However, such a model often means that the OpenMP threads are active only during the parallel computation phase and idle during the MPI calls, wasting computational resources. In this paper, we present MT-MPI, an internally multithreaded MPI implementation that transparently coordinates with the threading runtime system to utilize the application's idle threads inside the MPI library. It is designed in the context of OpenMP and requires modifications to both the MPI implementation and the OpenMP runtime to share appropriate information between them. We demonstrate the benefit of such internal parallelism for various aspects of MPI processing, including derived-datatype communication, shared-memory communication, and network I/O operations.
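
For concreteness, the funneled pattern described above looks roughly like the sketch below (not taken from the paper; the MPI_Allreduce and the loop body are illustrative placeholders). All threads share the computation, but only the master thread communicates, so the remaining threads sit idle during the MPI call.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000  /* problem size; illustrative only */

int main(int argc, char **argv)
{
    int provided, rank;
    double sum = 0.0, global = 0.0;

    /* Request FUNNELED thread support: only the main thread will
     * make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Computation phase: every OpenMP thread participates. */
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += (double)(i + rank) * 0.5;  /* stand-in for real work */

        /* Communication phase: only the master (main) thread calls
         * MPI; its peers wait at the explicit barrier below. These
         * idle threads are the wasted resource MT-MPI reclaims by
         * using them inside the MPI library. */
        #pragma omp master
        MPI_Allreduce(&sum, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        #pragma omp barrier
    }

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```

Under MPI_THREAD_SERIALIZED, the same call could instead sit in an OpenMP critical (or single) construct; either way, the threads not issuing the call contribute nothing during communication, which is exactly the gap MT-MPI targets.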
