Extreme-scale computing services over MPI: Experiences, observations and features proposal for next-generation message passing interface

The Message Passing Interface (MPI) is one of the most portable high-performance computing (HPC) programming models, with platform-optimized implementations typically delivered with new HPC systems. For distributed services requiring portable, high-performance, user-level network access, MPI therefore promises to be an attractive alternative to custom network portability layers, platform-specific methods, or portable but less performant interfaces such as BSD sockets. In this paper, we present our experiences in using MPI as the network transport for a large-scale distributed storage system. We discuss the features of MPI that facilitate adoption as well as the aspects that require workarounds. Based on our use cases, we derive a wish list for both MPI implementations and the MPI Forum aimed at easing the adoption of MPI by large-scale persistent services. The proposals in this wish list go beyond the needs of distributed services alone; we contend that they will also benefit mainstream HPC applications at extreme scales.
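To make the persistent-service scenario concrete, the sketch below shows one plausible way such a service could accept dynamically connecting clients using MPI's dynamic process management (MPI_Open_port / MPI_Comm_accept). This is an illustrative assumption, not the transport design described in the paper; the single-request echo loop and the 256-byte buffer are hypothetical simplifications.

```c
/* Minimal sketch (assumption, not the paper's implementation): a persistent
 * MPI "service" process that publishes a port, accepts one client over
 * dynamic process management, and echoes a single request. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char port[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, port);            /* obtain a connectable port name */
    printf("service listening on: %s\n", port);    /* clients pass this string to MPI_Comm_connect */

    MPI_Comm client;
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);

    /* Echo one request: receive a NUL-terminated string and send it back. */
    char buf[256];
    MPI_Status st;
    MPI_Recv(buf, (int)sizeof buf, MPI_CHAR, 0, 0, client, &st);
    MPI_Send(buf, (int)strlen(buf) + 1, MPI_CHAR, 0, 0, client);

    MPI_Comm_disconnect(&client);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}
```

A client would obtain the port name out of band (or via MPI_Publish_name / MPI_Lookup_name) and call MPI_Comm_connect to get an intercommunicator to the service; a real service would loop over accepts and handle many concurrent clients.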
