An Evaluation of User-Level Failure Mitigation Support in MPI

As computing platforms grow toward extreme scale, the requirements for application fault tolerance grow with them. Techniques that address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support they are bound to fail. This paper discusses the failure-free overhead and the recovery impact of the User-Level Failure Mitigation proposal presented to the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications and yields satisfactory recovery times when failures occur.
