Practical resilient cases for FA-MPI, a transactional fault-tolerant MPI

The MPI standard offers little support for applications that must survive failures. FA-MPI (Fault-Aware MPI) provides extensions to the MPI standard designed to enable data-parallel applications to achieve resilience without sacrificing scalability. FA-MPI introduces transactions as a novel extension to the MPI message-passing model; transactions support failure detection, isolation, mitigation, and recovery through application-driven policies. Because extracting maximum performance from modern machines increasingly depends on overlapping communication and I/O with computation through non-blocking operations, we emphasize fault-tolerant, non-blocking communication operations plus a set of nestable, lightweight transactional TryBlock API extensions that exploit system and application hierarchy. This strategy enables applications to run to completion with higher probability than they otherwise would. We modified two proxy applications, MiniFE and LULESH, to use FA-MPI semantics. Finally, we present performance and overhead results for 1K MPI processes.
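
To make the TryBlock idea concrete, the sketch below shows how a halo exchange built from standard non-blocking MPI calls might be wrapped in a transaction that is completed and tested as a unit, then retried under an application-driven policy. The FAMPI_TryBlock_* names, their MPI-only stand-in implementations, and the exchange_halo routine are illustrative assumptions for exposition, not the published FA-MPI interface.

    /* Minimal sketch, assuming a hypothetical TryBlock-style API.
     * The FAMPI_* functions below are trivial MPI-only stand-ins;
     * real FA-MPI would perform scalable failure detection and
     * isolation inside the "finish" step. */
    #include <mpi.h>

    /* Hypothetical TryBlock handle: here just the communicator it covers. */
    typedef struct { MPI_Comm comm; } FAMPI_TryBlock;

    static void FAMPI_TryBlock_start(MPI_Comm comm, FAMPI_TryBlock *tb)
    {
        tb->comm = comm;                 /* open the transaction scope */
    }

    /* Complete all operations posted inside the block and report whether
     * any of them failed, so the caller can apply its recovery policy. */
    static int FAMPI_TryBlock_finish(FAMPI_TryBlock *tb,
                                     MPI_Request *reqs, int nreqs)
    {
        int rc = MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
        (void)tb;
        return (rc == MPI_SUCCESS) ? 0 : 1;
    }

    /* Halo exchange wrapped in a TryBlock: post non-blocking operations,
     * then complete and test the whole block, retrying on failure. */
    void exchange_halo(MPI_Comm comm, double *send, double *recv,
                       int n, int left, int right)
    {
        MPI_Request reqs[2];
        FAMPI_TryBlock tb;
        int failed;

        do {
            FAMPI_TryBlock_start(comm, &tb);
            MPI_Irecv(recv, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
            MPI_Isend(send, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);
            failed = FAMPI_TryBlock_finish(&tb, reqs, 2);
            /* On failure, a real application would restore state from its
             * last consistent snapshot before retrying. */
        } while (failed);
    }

Because the block is nestable, an application can enclose such per-phase transactions inside a larger transaction that spans, say, an entire timestep, matching the system and application hierarchy the abstract refers to.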
