NR-MPI: A Non-stop and Fault Resilient MPI

Fault resilience has become a major issue for HPC systems, particularly in view of future exascale systems, which will consist of millions of CPU cores and other components. Fault-tolerant MPI has been proposed to support software-level fault tolerance approaches. However, widely used MPI implementations such as MPICH and MVAPICH2 provide only limited support for fault tolerance. This paper proposes NR-MPI, a Non-stop and Fault Resilient MPI, which implements the semantics of FT-MPI on top of MPICH. Specifically, the paper focuses on failure detection within the MPI library, online recovery of communicators after multiple failures, and a user-friendly extension of the programming interface. Furthermore, to support failure recovery of applications, NR-MPI implements data backup and restore interfaces based on double in-memory checkpoint/restart. We conduct experiments with the NPB benchmarks on the TH-1A supercomputer. The experimental results show that NR-MPI-based fault-tolerant programs can recover from failures online without restarting, and that the overhead is small even for applications running on tens of thousands of cores.
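
To make the idea of double in-memory checkpoint/restart concrete, the following is a minimal single-process simulation, not NR-MPI code. It assumes the common buddy scheme in which every process keeps one checkpoint copy in its own memory and a second copy in the memory of a neighbouring process; the abstract does not specify how NR-MPI places its copies. All names (`Rank`, `buddy_of`, `backup`, `fail`, `restore`) are hypothetical and do not correspond to the NR-MPI interface.

```python
import copy


class Rank:
    """Hypothetical stand-in for one MPI process (not an NR-MPI type)."""

    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.data = None        # live application state
        self.local_ckpt = None  # copy 1: this rank's checkpoint, in its own memory
        self.buddy_ckpt = None  # copy 2 of the *previous* rank's checkpoint


def buddy_of(rank_id, size):
    """Assumed placement: rank r stores its second copy on rank (r + 1) % size."""
    return (rank_id + 1) % size


def backup(ranks):
    """Take a checkpoint: each rank saves its state locally and on its buddy."""
    size = len(ranks)
    for r in ranks:
        r.local_ckpt = copy.deepcopy(r.data)
        ranks[buddy_of(r.rank_id, size)].buddy_ckpt = copy.deepcopy(r.data)


def fail(ranks, failed_id):
    """Simulate a process failure: the replacement starts with empty memory."""
    ranks[failed_id] = Rank(failed_id)


def restore(ranks, failed_ids):
    """Roll every rank back to the last checkpoint without restarting the job."""
    size = len(ranks)
    # Pass 1: recover application state. Survivors use their local copy;
    # replaced ranks fetch the copy held by their buddy.
    for r in ranks:
        if r.rank_id in failed_ids:
            holder = ranks[buddy_of(r.rank_id, size)]
            if holder.rank_id in failed_ids:
                raise RuntimeError(
                    f"rank {r.rank_id} and its buddy both failed; checkpoint lost"
                )
            r.data = copy.deepcopy(holder.buddy_ckpt)
            r.local_ckpt = copy.deepcopy(r.data)
        else:
            r.data = copy.deepcopy(r.local_ckpt)
    # Pass 2: re-create the buddy copies that were lost on replaced ranks.
    for r in ranks:
        ranks[buddy_of(r.rank_id, size)].buddy_ckpt = copy.deepcopy(r.local_ckpt)
```

In this sketch a checkpoint survives any set of failures in which no rank fails together with its buddy, which is the usual trade-off of keeping exactly two in-memory copies instead of writing to stable storage.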
