Design and Evaluation of FA-MPI, a Transactional Resilience Scheme for Non-blocking MPI

With the rapid scale out of supercomputers comes a corresponding higher failure frequency. Fault-tolerant methods have evolved to adapt to high rates of failure, but the behavior of MPI, the most widely used scalable programming middleware, is insufficient when confronting such failures. We present FA-MPI (Fault-Aware MPI), a set of extensions to the MPI standard designed to enable applications to implement a wide range of fault-tolerant methods. FA-MPI introduces transactional concepts to the MPI programming model for the first time to address failure detection, isolation, mitigation, and recovery via application-driven policies. To reach the maximum achievable performance of these scalable machines, overlapping communication and I/O with computation through non-blocking operations (while reducing jitter) are design themes of growing importance. Therefore, we emphasize fault tolerant, non-blocking communication operations combined with a set of nest able lightweight transactional Try Block API extensions architected to exploit system and application hierarchy both for failure detection and recovery. This is to enable applications to run to completion with higher probability than otherwise. Scaling up and out and fault-free overhead are key concerns that can be managed by tuning transaction granularity, we provide a simulation of FA-MPI in a stencil 3D program to illustrate this. Supported failure models include but are not limited to process failures, a key difference from other proposed fault-tolerant extensions to MPI. Restriction to non-blocking operations is a current limitation as compared to other proposed approaches insofar as legacy applications are concerned, but FA-MPI aligns well with future-looking applications emphasizing Exascale. And, tools to evolve legacy MPI programs to this fault-aware paradigm will soon bridge that portability gap.

[1]  David Fiala Detection and correction of silent data corruption for large-scale high-performance computing , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[2]  Anthony Skjellum,et al.  MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware , 2004, Cluster Computing.

[3]  Marcos K. Aguilera,et al.  Heartbeat: A Timeout-Free Failure Detector for Quiescent Reliable Communication , 1997, WDAG.

[4]  Jason Duell,et al.  The Lam/Mpi Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..

[5]  Wesley Bland,et al.  User Level Failure Mitigation in MPI , 2012, Euro-Par Workshops.

[6]  W. Jia,et al.  Fault-tolerant scaleable multicast algorithm with piggybacking approach on logical process ring , 1998 .

[7]  Jinsuk Chung,et al.  Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems , 2012, HiPC 2012.

[8]  James H. Laros,et al.  Evaluating the viability of process replication reliability for exascale systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[9]  Bruce Jacob,et al.  The structural simulation toolkit , 2006, PERV.

[10]  Thomas Naughton,et al.  A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI , 2011, EuroMPI.

[11]  Andreas Reuter,et al.  Transaction Processing: Concepts and Techniques , 1992 .

[12]  Bianca Schroeder,et al.  Understanding failures in petascale computers , 2007 .

[13]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[14]  Greg Bronevetsky,et al.  Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance , 2011, EuroMPI.

[15]  Robbert van Renesse,et al.  A Gossip-Style Failure Detection Service , 2009 .