An MPI Halo-Cell Implementation for Zero-Copy Abstraction

In the race toward Exascale, the advent of many-core processors is driving parallel architectures toward systems of much higher concurrency, but with relatively less memory per thread. This shift raises concerns about how well current-generation HPC software will adapt to this new landscape. In this paper, we study domain decomposition over an increasing number of memory regions as an example problem where a negative impact on computation performance can arise. We identify the parameters that drive scalability for this problem, and then model the halo-cell ratio on common mesh topologies to study its memory and communication implications. Such analysis argues for the use of shared-memory parallelism, such as OpenMP, to address the performance problems that could occur. In contrast, we propose an original solution based entirely on MPI programming semantics that nonetheless provides the performance advantages of hybrid parallel programming. Our approach transparently replaces halo-cell transfers with pointer exchanges when MPI tasks run on the same node, effectively removing memory copies. The results we present demonstrate gains in memory and computation time on Xeon Phi, compared to OpenMP-only and MPI-only configurations, using a representative domain-decomposition benchmark.
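
The abstract does not give the exact form of the halo-cell model; as a minimal illustration only, assuming a cubic block decomposition with a fixed ghost-cell width (my assumption, not necessarily the topologies studied in the paper), the ratio of halo cells to interior cells behaves as follows:

% Halo-to-interior cell ratio for a d-dimensional block of edge n cells
% with ghost-cell width g (illustrative, not the paper's exact model).
\[
  r(n, g, d) \;=\; \frac{(n + 2g)^{d} - n^{d}}{n^{d}}
             \;=\; \left(1 + \frac{2g}{n}\right)^{d} - 1
             \;\approx\; \frac{2dg}{n} \qquad (n \gg g).
\]
% Splitting a global mesh of N cells over P subdomains gives n = (N/P)^{1/d},
% so r grows like P^{1/d}: more, smaller subdomains spend proportionally
% more memory and communication on halo cells.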

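The pointer-exchange mechanism described above lives inside the MPI runtime and is transparent to the application, so the abstract exposes no API for it. The sketch below only emulates the idea with standard MPI-3 shared-memory windows; every name in it (halo_view_t, get_halo, node_comm, node_win) is hypothetical rather than the paper's interface.

/* Hypothetical sketch: hand out a pointer to a neighbour's boundary cells
 * instead of copying them when that neighbour lives on the same node,
 * falling back to a regular halo transfer otherwise. */
#include <mpi.h>
#include <stdlib.h>

typedef struct {
    double *remote_halo;   /* direct pointer to the neighbour's cells (same node)  */
    double *local_copy;    /* buffer filled by an actual transfer (different node) */
} halo_view_t;

/* Obtain a view of 'count' boundary cells owned by 'neighbour' (rank in comm). */
halo_view_t get_halo(MPI_Comm comm, MPI_Comm node_comm, MPI_Win node_win,
                     double *my_boundary, int neighbour, int count)
{
    halo_view_t view = { NULL, NULL };

    /* Translate the neighbour's global rank into the node-local communicator. */
    MPI_Group world_grp, node_grp;
    int node_rank;
    MPI_Comm_group(comm, &world_grp);
    MPI_Comm_group(node_comm, &node_grp);
    MPI_Group_translate_ranks(world_grp, 1, &neighbour, node_grp, &node_rank);

    if (node_rank != MPI_UNDEFINED) {
        /* Same node: query the base address of the neighbour's shared-memory
         * segment and use it directly -- a pointer exchange, no memory copy
         * (offsets into the boundary region omitted for brevity). */
        MPI_Aint size;
        int disp;
        MPI_Win_shared_query(node_win, node_rank, &size, &disp, &view.remote_halo);
    } else {
        /* Different node: ordinary halo exchange with a real copy. */
        view.local_copy = malloc(count * sizeof(double));
        MPI_Sendrecv(my_boundary, count, MPI_DOUBLE, neighbour, 0,
                     view.local_copy, count, MPI_DOUBLE, neighbour, 0,
                     comm, MPI_STATUS_IGNORE);
    }
    MPI_Group_free(&world_grp);
    MPI_Group_free(&node_grp);
    return view;
}

In this sketch, node_comm would come from MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, ...), node_win from MPI_Win_allocate_shared over each rank's boundary cells, and some synchronization (e.g., MPI_Win_sync plus a barrier on node_comm) is still required before reading a neighbour's cells; the paper's runtime-level implementation removes even this boilerplate from the application.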