MPC: A Unified Parallel Runtime for Clusters of NUMA Machines

Over the last decade, the Message Passing Interface (MPI) has become a very successful parallel programming environment for distributed-memory architectures such as clusters. However, the architecture of cluster nodes is currently evolving from small symmetric shared-memory multiprocessors towards massively multicore, Non-Uniform Memory Access (NUMA) hardware. Although regular MPI implementations use numerous optimizations to achieve zero-copy, cache-oblivious data transfers within shared-memory nodes, they may still prevent applications from reaching most of the hardware's performance, simply because the scheduling of heavyweight processes is not flexible enough to dynamically fit the underlying hardware topology. This explains why several research efforts have investigated hybrid approaches that mix message passing between nodes with memory sharing inside nodes, such as MPI+OpenMP solutions [1,2]. However, these approaches require significant programming effort to adapt or rewrite existing MPI applications. In this paper, we present the MultiProcessor Communications environment (MPC), which aims at providing programmers with an efficient runtime system for their existing MPI, POSIX thread, or hybrid MPI+thread applications. The key idea is to use user-level threads instead of processes on multiprocessor cluster nodes, which increases scheduling flexibility, gives better control over memory allocation, and allows the communication flows with other nodes to be scheduled more efficiently. Most existing MPI applications can run over MPC with no modification. We obtained substantial gains (up to 20%) by using MPC instead of a regular MPI runtime on several scientific applications.
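To make the targeted programming model concrete, below is a minimal hybrid MPI + POSIX-thread example (an illustrative sketch written for this summary, not code taken from the paper): each MPI rank spawns a few worker threads that fill a local array, and the ranks then combine their partial sums with MPI_Allreduce. Since MPC exposes the standard MPI and pthread interfaces, this is the kind of unmodified application the runtime is intended to execute.

/* Illustrative hybrid MPI + POSIX-thread example (sketch only, not from
 * the MPC paper).  Each MPI rank spawns NTHREADS worker threads that
 * fill part of a local array; the ranks then reduce their partial sums. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define CHUNK    1000

static double local_data[NTHREADS * CHUNK];

static void *worker(void *arg)
{
    long id = (long)arg;                 /* thread index within this rank */
    for (int i = 0; i < CHUNK; i++)
        local_data[id * CHUNK + i] = (double)(id + 1);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    /* Request full thread support; the runtime reports the level it
     * actually provides. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    pthread_t threads[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < NTHREADS * CHUNK; i++)
        local_sum += local_data[i];
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}

With a conventional MPI installation this compiles with an MPI C compiler wrapper and the pthread library; under MPC, the same source would be built and launched with the MPC toolchain, which maps the ranks onto user-level threads inside each node.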

[1] Erik D. Demaine et al. A Threads-Only MPI Implementation for the Development of Parallel Programs, 1997.

[2] Mitsuhisa Sato et al. Design and Implementation of OpenMPD: An OpenMP-Like Programming Language for Distributed Memory Systems, 2007, IWOMP.

[3] Aad J. van der Steen et al. Overview of recent supercomputers, 2008.

[4] Kathryn S. McKinley et al. Composing high-performance memory allocators, 2001, PLDI '01.

[5] B. Després et al. 3D Finite Volume simulation of acoustic waves in the earth atmosphere, 2009.

[6] H. Jourdren. HERA: A Hydrodynamic AMR Platform for Multi-Physics Simulations, 2005.

[7] Barbara Chapman. A Practical Programming Model for the Multi-Core Era: 3rd International Workshop on OpenMP, IWOMP 2007, Beijing, China, June 3-7, 2007, Proceedings, 2008, IWOMP.

[8] Kathryn S. McKinley et al. Hoard: a scalable memory allocator for multithreaded applications, 2000, ASPLOS.

[9] Marc Pérache. Contribution à l'élaboration d'environnements de programmation dédiés au calcul scientifique hautes performances, 2006.

[10] Josep Torrellas et al. False Sharing and Spatial Locality in Multiprocessor Caches, 1994, IEEE Trans. Computers.

[11] Paul R. C. Kent et al. Development and performance of a mixed OpenMP/MPI quantum Monte Carlo code, 2000.

[12] Mark Bull et al. Development of mixed mode MPI/OpenMP applications, 2001, Sci. Program.

[13] Dhabaleswar K. Panda et al. Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics, 2003, ACM/IEEE SC 2003 Conference (SC'03).

[14] Franck Cappello et al. MPI versus MPI+OpenMP on the IBM SP for the NAS Benchmarks, 2000, ACM/IEEE SC 2000 Conference (SC'00).

[15] Raymond Namyst. PM2 : un environnement pour une conception portable et une exécution efficace des applications parallèles irrégulières, 1997.

[16] Laxmikant V. Kalé et al. Adaptive MPI, 2003, LCPC.