Integrating Inter-Node Communication with a Resilient Asynchronous Many-Task Runtime System

Achieving fault tolerance is one of the most significant challenges of exascale computing due to the projected increase in soft/transient failures. While past work on software-based resilience techniques has typically focused on traditional bulk-synchronous parallel programming models, we believe that Asynchronous Many-Task (AMT) programming models are better suited to enabling resiliency, since they provide explicit abstractions of data and tasks that contribute to increased asynchrony and latency tolerance. In this paper, we extend our past work on enabling application-level resilience in single-node AMT programs by integrating the capability to perform asynchronous MPI communication, thereby enabling resiliency across multiple nodes. We also enable resilience against fail-stop errors, in which case our runtime manages all re-execution of tasks and communication without user intervention. Our results show that, by offloading communication to dedicated communication workers, we can add communication operations to resilient programs with low overhead, and that we can recover from fail-stop errors transparently, thereby enhancing productivity.
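The abstract describes offloading MPI communication to dedicated communication workers so that computation tasks never block on message progress. The paper's actual runtime API is not shown in this excerpt, so the following C++/MPI sketch only illustrates that general pattern under assumed names (CommWorker and CommRequest are hypothetical, not the paper's API): computation code enqueues outgoing messages, and a single communication thread drives them with nonblocking MPI calls.

```cpp
// Minimal sketch of communication offload (hypothetical names, not the paper's API):
// computation code enqueues sends; one dedicated communication worker thread
// drains the queue with MPI_Isend and polls completion with MPI_Test.
#include <mpi.h>
#include <atomic>
#include <iterator>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

struct CommRequest {              // hypothetical descriptor for one outgoing message
    std::vector<double> payload;
    int dest;
    int tag;
};

class CommWorker {                // hypothetical dedicated communication worker
public:
    void enqueue(CommRequest req) {
        std::lock_guard<std::mutex> lk(mu_);
        pending_.push(std::move(req));
    }
    void run() {                  // progress loop, runs on its own thread
        std::vector<std::pair<MPI_Request, std::vector<double>>> inflight;
        while (!done_.load() || !empty() || !inflight.empty()) {
            // Start any newly enqueued sends without blocking computation.
            CommRequest req;
            while (pop(req)) {
                MPI_Request h;
                MPI_Isend(req.payload.data(), (int)req.payload.size(),
                          MPI_DOUBLE, req.dest, req.tag, MPI_COMM_WORLD, &h);
                inflight.emplace_back(h, std::move(req.payload));
            }
            // Poll in-flight sends so MPI can make progress; free completed ones.
            for (auto it = inflight.begin(); it != inflight.end();) {
                int flag = 0;
                MPI_Test(&it->first, &flag, MPI_STATUS_IGNORE);
                it = flag ? inflight.erase(it) : std::next(it);
            }
            std::this_thread::yield();
        }
    }
    void shutdown() { done_.store(true); }
private:
    bool empty() { std::lock_guard<std::mutex> lk(mu_); return pending_.empty(); }
    bool pop(CommRequest& out) {
        std::lock_guard<std::mutex> lk(mu_);
        if (pending_.empty()) return false;
        out = std::move(pending_.front());
        pending_.pop();
        return true;
    }
    std::mutex mu_;
    std::queue<CommRequest> pending_;
    std::atomic<bool> done_{false};
};

int main(int argc, char** argv) {
    int provided;
    // MPI_THREAD_MULTIPLE allows the communication worker to call MPI
    // concurrently with any MPI use on other threads.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    CommWorker comm;
    std::thread comm_thread([&] { comm.run(); });

    if (size >= 2) {
        if (rank == 0) {
            // A computation task produces data and hands the send off to the
            // communication worker instead of calling MPI itself.
            comm.enqueue({std::vector<double>(1024, 1.0), /*dest=*/1, /*tag=*/0});
        } else if (rank == 1) {
            std::vector<double> buf(1024);
            MPI_Recv(buf.data(), 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    comm.shutdown();
    comm_thread.join();
    MPI_Finalize();
    return 0;
}
```

In the paper's setting, the same separation also supports transparent re-execution: because tasks hand communication to the runtime rather than issuing it directly, re-executed tasks and their messages can be managed without user intervention.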
