Models for Resilience Design Patterns
[1] Franck Cappello, et al. Addressing failures in exascale computing, 2014, Int. J. High Perform. Comput. Appl.
[2] Christian Engelmann, et al. A Pattern Language for High-Performance Computing Resilience, 2017, EuroPLoP.
[3] Christian Engelmann, et al. Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.2), 2017.
[4] Kishor S. Trivedi, et al. Reliability and Performability Techniques and Tools: A Survey, 1993, MMB.
[5] Christian Engelmann, et al. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems, 2009.
[6] Franck Cappello, et al. Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales, 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[7] Petar Radojkovic. Towards resilient EU HPC systems: a blueprint, 2019, CF.
[8] Hoang Pham, et al. Reliability Modeling, Analysis and Optimization, 2006, Series on Quality, Reliability and Engineering Statistics.
[9] Saurabh Gupta, et al. Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems, 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[10] Christian Engelmann, et al. Towards New Metrics for High-Performance Computing Resilience, 2017, FTXS '17.
[11] Christian Engelmann, et al. Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing, 2018, ICPE.
[12] Kurt B. Ferreira, et al. An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart, 2016, FTXS@HPDC.
[13] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.
[14] Christian Engelmann, et al. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale, 2016, Supercomput. Front. Innov.
[15] D. Quinlan, et al. Inter-Agency Workshop on HPC Resilience at Extreme Scale, National Security Agency Advanced Computing Systems, February 21-24, 2012. Coordinating Representatives: John Daly (DOD), Bill Harrod (DOE/SC), Thuc Hoang (DOE/NNSA), 2012.
[16] Christian Engelmann, et al. Pattern-Based Modeling of High-Performance Computing Resilience, 2017, Euro-Par Workshops.
[17] Carl E. Landwehr, et al. Basic concepts and taxonomy of dependable and secure computing, 2004, IEEE Transactions on Dependable and Secure Computing.
[18] Christian Engelmann, et al. Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery, 2018, 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).
[19] John W. Young, et al. A first order approximation to the optimum checkpoint interval, 1974, CACM.
[20] A. Singh, et al. Fault-tolerant systems, 1990, Computer.
[21] Rolf Riesen, et al. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing, 2012, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.