Fault tolerance of MPI applications in exascale systems: The ULFM solution

Abstract

The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications on future exascale systems, not only to ensure that their executions complete but also to reduce their energy consumption. Although the Message Passing Interface (MPI) is the most popular programming model for distributed-memory HPC systems, it does not yet provide any fault-tolerance constructs with which users can handle failures; without them, recovery requires aborting and re-spawning the entire application. The User Level Failure Mitigation (ULFM) interface proposed in the MPI Forum opens new opportunities in this field, enabling the implementation of resilient MPI applications, system runtimes, and programming-language constructs that can detect and react to failures without aborting execution. This paper presents a global overview of the resilience interfaces provided by the ULFM specification, covers archetypal usage patterns and building blocks, and surveys the wide variety of application-driven solutions that have exploited them in recent years. The breadth and variety of approaches in the literature show that ULFM provides the flexibility needed to implement efficient fault-tolerant MPI applications. All of the surveyed solutions rely on application-driven recovery mechanisms, which reduce overhead and deliver the level of efficiency required on future exascale platforms.
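As a concrete illustration of the archetypal usage pattern surveyed here, the following minimal sketch combines the main ULFM building blocks (MPI_ERRORS_RETURN error handling, MPIX_Comm_revoke, MPIX_Comm_shrink, and MPIX_Comm_agree) into a shrinking-recovery loop. It assumes a ULFM-enabled MPI implementation that exposes the MPIX_* extensions (for example, the ULFM-enabled Open MPI, which declares them in mpi-ext.h); the helper do_iteration(), the NUM_STEPS constant, and the placement of the checkpoint/restore step are hypothetical placeholders rather than part of the ULFM specification.

/*
 * Minimal sketch of the archetypal ULFM detect/revoke/shrink/agree pattern.
 * Assumes a ULFM-enabled MPI; do_iteration() and NUM_STEPS are hypothetical
 * placeholders for the application's own work loop and recovery logic.
 */
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_Comm_revoke, MPIX_Comm_shrink, MPIX_Comm_agree */

#define NUM_STEPS 100

/* Hypothetical application step: returns an MPI error code. */
static int do_iteration(MPI_Comm comm)
{
    /* ... application computation and communication ... */
    return MPI_Barrier(comm);   /* stands in for the real communication */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm work_comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &work_comm);

    /* Return error codes to the application instead of aborting on failure. */
    MPI_Comm_set_errhandler(work_comm, MPI_ERRORS_RETURN);

    for (int step = 0; step < NUM_STEPS; ) {
        int rc = do_iteration(work_comm);

        /* Survivors agree on whether the step succeeded everywhere; the
           agreement itself raises an error if a participant has failed. */
        int ok = (rc == MPI_SUCCESS);
        int arc = MPIX_Comm_agree(work_comm, &ok);

        if (arc == MPI_SUCCESS && ok) {
            step++;                     /* step completed on every rank */
            continue;
        }

        /* A failure was detected: make it globally known and rebuild the
           communicator with the surviving processes only. */
        MPIX_Comm_revoke(work_comm);
        MPI_Comm shrunk;
        MPIX_Comm_shrink(work_comm, &shrunk);
        MPI_Comm_free(&work_comm);
        work_comm = shrunk;
        MPI_Comm_set_errhandler(work_comm, MPI_ERRORS_RETURN);

        /* Application-driven recovery (e.g., restoring data from a
           checkpoint or redistributing work) would go here before the
           failed step is retried. */
    }

    MPI_Comm_free(&work_comm);
    MPI_Finalize();
    return 0;
}

Whether recovery continues with the shrunken communicator, as in this sketch, or re-spawns replacement processes (e.g., with MPI_Comm_spawn) to restore the original size is an application-level design choice; the survey covers solutions of both kinds.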
