Resilient Optimistic Termination Detection for the Async-Finish Model

Driven by increasing core counts and decreasing mean time to failure in supercomputers, HPC runtime systems must improve support for dynamic task-parallel execution and resilience to failures. The async-finish task model, adapted for distributed systems as the asynchronous partitioned global address space (APGAS) programming model, provides a simple way to decompose a computation into nested task groups, each managed by a ‘finish’ that signals the termination of all tasks within the group.
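To make the nested task-group idea concrete, here is a minimal single-process sketch in Python. It is an illustration under stated assumptions, not the paper's protocol: the names `Finish`, `async_`, and `demo` are hypothetical, tasks are plain threads rather than activities spread across distributed places, and nothing here handles failures. It shows only the basic contract, namely that a finish returns once every task spawned under it, directly or transitively, has terminated.

```python
import threading


class Finish:
    """Hypothetical task group: leaving the `with` block waits for all its tasks."""

    def __init__(self):
        self._pending = 0                     # tasks spawned but not yet terminated
        self._cond = threading.Condition()

    def async_(self, fn, *args):
        # Count the task before it starts, so the waiter can never miss it.
        with self._cond:
            self._pending += 1

        def run():
            try:
                fn(*args)
            finally:
                # Signal termination of this task to the enclosing finish.
                with self._cond:
                    self._pending -= 1
                    if self._pending == 0:
                        self._cond.notify_all()

        threading.Thread(target=run).start()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Block until every task in this group has terminated.
        with self._cond:
            self._cond.wait_for(lambda: self._pending == 0)
        return False                          # do not swallow exceptions


def demo():
    results = []
    lock = threading.Lock()

    def record(item):
        with lock:
            results.append(item)

    def child(i):
        # Nested group: the inner finish waits only for its own two tasks.
        with Finish() as inner:
            inner.async_(record, (i, "a"))
            inner.async_(record, (i, "b"))
        record((i, "done"))

    # Outer group: returns only after all three children, and therefore
    # all of their nested groups, have terminated.
    with Finish() as outer:
        for i in range(3):
            outer.async_(child, i)
    return results
```

Calling `demo()` returns nine records. The interleaving across children varies from run to run, but each `(i, "done")` entry always appears after that child's `"a"` and `"b"` entries, because the inner finish has already observed their termination. A resilient finish would additionally need to keep this count correct when a process hosting tasks fails, which this sketch does not attempt.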
