Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications

With the increase in scale and architectural complexity of supercomputers, the management of failures has become integral to successfully executing long-running high-performance computing applications. In many instances, failures have a localized scope, usually impacting only a subset of the resources in use, yet widely used recovery strategies (such as checkpoint/restart) fail to exploit this locality and instead rely on global, synchronous recovery actions. Even with local rollback recovery, in which only the fault-impacted processes are restarted from a checkpoint, the consistency of further progress in the execution is achieved through the replay of communication from a message log. This theoretically sound approach encounters practical limitations: the presence of collective operations forces a synchronous recovery that prevents survivor processes from continuing their execution, removing any possibility of overlapping further computation with the recovery; and the amount of resources required at the recovering peers can be untenable. In this work, we solve both problems by implementing an asynchronous, receiver-driven replay of point-to-point and collective communications, and by exploiting remote-memory access (RMA) capabilities to access the message logs. This new protocol is evaluated in an implementation of local rollback over the User Level Failure Mitigation (ULFM) fault-tolerant Message Passing Interface (MPI). It reduces the recovery time of the failed processes by 59% on average, while the time spent in recovery by the survivor processes is reduced by 95% compared to an equivalent global rollback protocol, thus living up to the promise of a truly localized impact of recovery actions.
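The following minimal sketch illustrates the core replay idea in MPI one-sided terms: each process exposes its sender-side payload log through an RMA window, and a restarted receiver replays a reception by pulling the logged payload with MPI_Get, without the surviving sender having to participate actively in the replay. The constants SLOT_DOUBLES and LOG_SLOTS, the fixed slot layout, and the barrier standing in for the restart of rank 1 are illustrative assumptions for this sketch, not the paper's ULFM-based implementation.

/* Sketch: receiver-driven replay of a logged point-to-point payload
 * via MPI one-sided communication (illustrative assumptions only). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define SLOT_DOUBLES 4   /* assumed fixed payload size per logged message */
#define LOG_SLOTS    8   /* assumed capacity of the sender-side log */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }

    /* Each process exposes its sender-side payload log through an RMA window,
     * so a restarted peer can pull logged messages directly. */
    double msg_log[LOG_SLOTS * SLOT_DOUBLES];
    memset(msg_log, 0, sizeof(msg_log));
    MPI_Win win;
    MPI_Win_create(msg_log, sizeof(msg_log), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        /* Survivor: the payload of a message previously sent to rank 1
         * has been recorded in slot 0 of its log. */
        double payload[SLOT_DOUBLES] = {1.0, 2.0, 3.0, 4.0};
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        memcpy(&msg_log[0], payload, sizeof(payload));
        MPI_Win_unlock(0, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);  /* stands in for "rank 1 has restarted" */

    if (rank == 1) {
        /* Restarted receiver: replay the reception by reading slot 0 of the
         * survivor's log, instead of waiting for the message to be re-sent. */
        double replayed[SLOT_DOUBLES];
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Get(replayed, SLOT_DOUBLES, MPI_DOUBLE,
                0 /* target rank */, 0 /* displacement of slot 0 */,
                SLOT_DOUBLES, MPI_DOUBLE, win);
        MPI_Win_unlock(0, win);   /* completes the MPI_Get */
        printf("rank 1 replayed payload: %.1f %.1f %.1f %.1f\n",
               replayed[0], replayed[1], replayed[2], replayed[3]);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Because the survivor owning the log stays passive while the restarted receiver pulls the payload, it is free to continue its own execution, which is the property that enables an asynchronous, receiver-driven recovery.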
