Post-failure recovery of MPI communication capability

As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequences of failures for MPI communications. Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect failures and resume communications afterward. In this paper, we present a set of extensions to MPI that allow communication capabilities to be restored, while maintaining the extreme level of performance to which MPI users have become accustomed. The motivations behind the design choices are weighed against alternatives, a task that requires simultaneously considering MPI from the viewpoints of both the user and the implementor. The usability of the interfaces for expressing advanced recovery techniques is then discussed, including the difficult issue of enabling separate software layers to coordinate their recovery.
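To give a concrete feel for what "restoring communication capability" looks like in application code, the following is a minimal sketch of a shrink-based recovery pattern. It assumes the ULFM-style extensions that grew out of this line of work (MPIX_Comm_revoke, MPIX_Comm_shrink, and the `<mpi-ext.h>` header as provided by Open MPI's ULFM builds); it is illustrative only, not the definitive interface described in the paper, and error handling is deliberately simplified.

```c
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ prototypes (assumed: Open MPI ULFM builds) */
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm comm = MPI_COMM_WORLD, newcomm;
    int rank, rc;

    MPI_Init(&argc, &argv);
    /* Report failures to the application instead of aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(comm, &rank);

    rc = MPI_Barrier(comm);            /* any communication may now fail */
    if (rc != MPI_SUCCESS) {
        /* Propagate knowledge of the failure to all ranks, then rebuild
         * a working communicator that excludes the failed processes. */
        MPIX_Comm_revoke(comm);
        MPIX_Comm_shrink(comm, &newcomm);
        comm = newcomm;
        MPI_Comm_rank(comm, &rank);
        printf("rank %d continues on the shrunken communicator\n", rank);
    }

    MPI_Finalize();
    return 0;
}
```

The key design point illustrated here is that recovery is driven by the application: a failure surfaces as an error code on an ordinary MPI call, and the application decides when to revoke and rebuild the communicator, so fault-free execution pays no extra synchronization cost.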
