Toward Local Failure Local Recovery Resilience Model using MPI-ULFM

The current system reaction to the loss of a single MPI process is to kill all the remaining processes and restart the application from the most recent checkpoint. This approach will become unfeasible for future extreme scale systems. We address this issue using an emerging resilient computing model called Local Failure Local Recovery (LFLR) that provides application developers with the ability to recover locally and continue application execution when a process is lost. We discuss the design of our software framework to enable the LFLR model using MPI-ULFM and demonstrate the resilient version of MiniFE that achieves a scalable recovery from process failures.

[1]  Thomas Naughton,et al.  A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI , 2011, EuroMPI.

[2]  Jack J. Dongarra,et al.  Building and Using a Fault-Tolerant MPI Implementation , 2004, Int. J. High Perform. Comput. Appl..

[3]  Bronis R. de Supinski,et al.  Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[4]  Kurt B. Ferreira,et al.  Fault-tolerant linear solvers via selective reliability , 2012, ArXiv.

[5]  Thomas Hérault,et al.  An evaluation of User-Level Failure Mitigation support in MPI , 2012, Computing.

[6]  Tamara G. Kolda,et al.  An overview of the Trilinos project , 2005, TOMS.

[7]  Sandia Report,et al.  Improving Performance via Mini-applications , 2009 .

[8]  Peter Arbenz,et al.  A fault tolerant implementation of Multi-Level Monte Carlo methods , 2013, PARCO.

[9]  Michael A. Heroux Toward resilient algorithms and applications , 2013, FTXS '13.

[10]  Jason Duell,et al.  The design and implementation of Berkeley Lab's linuxcheckpoint/restart , 2005 .

[11]  Jeffrey F. Naughton,et al.  Low-Latency, Concurrent Checkpointing for Parallel Programs , 1994, IEEE Trans. Parallel Distributed Syst..

[12]  John T. Daly,et al.  A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..

[13]  Jack J. Dongarra,et al.  Algorithm-based diskless checkpointing for fault tolerant matrix operations , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[14]  Mark A. Taylor,et al.  Architecture of LA-MPI, a network-fault-tolerant MPI , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[15]  Kai Li,et al.  Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..

[16]  Eric Roman A Survey of Checkpoint / Restart Implementations , 2002 .

[17]  Thomas Hérault,et al.  Post-failure recovery of MPI communication capability , 2013, Int. J. High Perform. Comput. Appl..

[18]  J. Duell The design and implementation of Berkeley Lab's linux checkpoint/restart , 2005 .

[19]  R. C. Whaley,et al.  LAPACK Working Note 94: A User''s Guide to the BLACS v1.0 , 1995 .