Extreme-scale computing services over MPI: Experiences, observations and features proposal for next-generation message passing interface

The Message Passing Interface (MPI) is one of the most portable high-performance computing (HPC) programming models, with platform-optimized implementations typically delivered with new HPC systems. For distributed services requiring portable, high-performance, user-level network access, MPI therefore promises to be an attractive alternative to custom network portability layers, platform-specific methods, or portable but less performant interfaces such as BSD sockets. In this paper, we present our experiences in using MPI as the network transport for a large-scale distributed storage system. We discuss the features of MPI that facilitate adoption as well as the aspects that require workarounds. Based on our use cases, we derive a wish list for both MPI implementations and the MPI Forum aimed at easing the adoption of MPI by large-scale persistent services. The proposals in this wish list go beyond the needs of distributed services alone; we contend that they will also benefit mainstream HPC applications at extreme scales.
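To make the persistent-service scenario concrete, the sketch below shows one plausible way such a service could accept dynamically connecting clients using MPI's dynamic process management (MPI_Open_port / MPI_Comm_accept). This is an illustrative assumption, not the transport design described in the paper; the single-request echo loop and the 256-byte buffer are hypothetical simplifications.

```c
/* Minimal sketch (assumption, not the paper's implementation): a persistent
 * MPI "service" process that publishes a port, accepts one client over
 * dynamic process management, and echoes a single request. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char port[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, port);            /* obtain a connectable port name */
    printf("service listening on: %s\n", port);    /* clients pass this string to MPI_Comm_connect */

    MPI_Comm client;
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);

    /* Echo one request: receive a NUL-terminated string and send it back. */
    char buf[256];
    MPI_Status st;
    MPI_Recv(buf, (int)sizeof buf, MPI_CHAR, 0, 0, client, &st);
    MPI_Send(buf, (int)strlen(buf) + 1, MPI_CHAR, 0, 0, client);

    MPI_Comm_disconnect(&client);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}
```

A client would obtain the port name out of band (or via MPI_Publish_name / MPI_Lookup_name) and call MPI_Comm_connect to get an intercommunicator to the service; a real service would loop over accepts and handle many concurrent clients.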
