Evaluating User-Level Fault Tolerance for MPI Applications

The User Level Failure Mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in MPI. Previous work has evaluated the performance of the interface, yet questions about its programmability and applicability remain unanswered. In this paper, we present our experience using ULFM in a case study (a large molecular dynamics application) to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for applications with work-decomposition flexibility (e.g., master-slave), it provides few benefits for more general (e.g., bulk synchronous) MPI applications.
