Models for Resilience Design Patterns

Resilience plays an important role in supercomputers by keeping operation correct and efficient in the presence of faults, errors, and failures. Resilience design patterns offer blueprints for applying resilience technologies effectively. Prior work developed initial efficiency and performance models for resilience design patterns. This paper extends that work by (1) describing performance, reliability, and availability models for all structural resilience design patterns, (2) providing more detailed models that include flowcharts and state diagrams, and (3) introducing the Resilience Design Pattern Modeling (RDPM) tool, which calculates and plots the performance, reliability, and availability metrics of individual patterns and pattern combinations.
