Towards Distributed Software Resilience in Asynchronous Many-Task Programming Models

Exceptions and errors caused by hardware failures in mission-critical applications carry a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures is likely to increase. Designing applications to be resilient is therefore critical to preserving the reliability of results while staying within power budgets. In this paper, we discuss software resilience in asynchronous many-task programming models (AMTs) at both local and distributed scale. We choose HPX to prototype our resiliency designs. We implement two resiliency APIs that we expose to application developers, namely task replication and task replay. Task replication executes n copies of a task asynchronously; task replay reschedules a task up to n times until a valid result is returned. Furthermore, we expose algorithm-based fault tolerance (ABFT) through user-provided predicates (e.g., checksums) that validate the returned results. We benchmark the resiliency schemes on both synthetic and real-world applications at local and distributed scale and show that most of the added execution time arises from the replay, replication, or data movement of the tasks rather than from the boilerplate code added to achieve resilience.
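
To make the two APIs concrete, the following is a minimal, self-contained sketch of the task-replay and task-replication patterns combined with a user-provided validation predicate. It is written against plain std::async/std::future rather than the actual HPX resiliency API; the function names async_replay_validate and async_replicate_validate, the checksum-style predicate, and the toy task are illustrative assumptions, not the paper's implementation.

// Sketch of task replay and task replication with a validation predicate,
// using only standard C++ facilities. Illustrative only; not the HPX API.
#include <cstddef>
#include <future>
#include <iostream>
#include <stdexcept>
#include <vector>

// Task replay: re-run the task up to n times until the predicate accepts
// the result; throw if every attempt fails validation.
template <typename Pred, typename F>
auto async_replay_validate(std::size_t n, Pred pred, F f)
{
    return std::async(std::launch::async, [=]() {
        for (std::size_t attempt = 0; attempt < n; ++attempt)
        {
            auto result = f();
            if (pred(result))
                return result;    // first validated attempt wins
        }
        throw std::runtime_error("task replay: no valid result after n attempts");
    });
}

// Task replication: launch n copies of the task concurrently and return the
// first copy (in launch order) whose result passes the predicate.
template <typename Pred, typename F>
auto async_replicate_validate(std::size_t n, Pred pred, F f)
{
    return std::async(std::launch::async, [=]() {
        using result_type = decltype(f());
        std::vector<std::future<result_type>> copies;
        copies.reserve(n);
        for (std::size_t i = 0; i < n; ++i)
            copies.push_back(std::async(std::launch::async, f));

        for (auto& copy : copies)
        {
            auto result = copy.get();
            if (pred(result))
                return result;    // accept the first validated copy
        }
        throw std::runtime_error("task replication: all copies failed validation");
    });
}

int main()
{
    // Checksum-style predicate: here the "valid" result is simply the expected value.
    auto pred = [](int value) { return value == 42; };
    auto task = [] { return 40 + 2; };

    auto replayed   = async_replay_validate(3, pred, task);
    auto replicated = async_replicate_validate(3, pred, task);

    std::cout << "replay: "      << replayed.get()   << '\n'
              << "replication: " << replicated.get() << '\n';
}

The sketch also shows the trade-off between the two schemes: replay attempts run one after another until the predicate accepts a result, whereas replication launches all n copies up front and accepts the first validated one, which is the source of the additional execution time measured in the benchmarks described above.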
