When Amdahl Meets Young/Daly
[1] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[2] Franck Cappello,et al. Toward Exascale Resilience: 2014 update , 2014, Supercomput. Front. Innov..
[3] Kurt B. Ferreira,et al. Fault-tolerant iterative methods via selective reliability , 2011 .
[4] G. Amdahl,et al. Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).
[5] Nitin H. Vaidya,et al. A case for two-level distributed recovery schemes , 1995, SIGMETRICS '95/PERFORMANCE '95.
[6] Thomas Hérault,et al. Unified model for assessing checkpointing protocols at extreme‐scale , 2014, Concurr. Comput. Pract. Exp..
[7] L. Alvisi,et al. A Survey of Rollback-Recovery Protocols , 2002 .
[8] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[9] T. J. O'Gorman. The effect of cosmic rays on the soft error rate of a DRAM at ground level , 1994 .
[10] Franck Cappello,et al. FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[11] Zhiling Lan,et al. Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart , 2015, IEEE Transactions on Computers.
[12] Laxmikant V. Kalé,et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[13] James H. Laros,et al. Evaluating the viability of process replication reliability for exascale systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[14] Aaas News,et al. Book Reviews , 1893, Buffalo Medical and Surgical Journal.
[15] Laxmikant V. Kalé,et al. ACR: Automatic checkpoint/restart for soft and hard error protection , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[16] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..
[17] Yves Robert,et al. Fault-Tolerance Techniques for High-Performance Computing , 2015 .
[18] Zizhong Chen,et al. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods , 2013, PPoPP '13.
[19] E. N. Elnozahy,et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.
[20] Yves Robert,et al. Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[21] Hans P. Muhlfeld,et al. Cosmic ray soft error rates of 16-Mb DRAM memory chips , 1998, IEEE J. Solid State Circuits.
[22] Franck Cappello,et al. Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications , 2015, HPDC.
[23] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..
[24] David Fiala. Detection and correction of silent data corruption for large-scale high-performance computing , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[25] Padma Raghavan,et al. Fault tolerant preconditioned conjugate gradient for sparse linear system solution , 2012, ICS '12.
[26] Franck Cappello,et al. Detecting silent data corruption through data dynamic monitoring for scientific applications , 2014, PPoPP '14.
[27] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[28] Y. Robert,et al. Fault-Tolerance Techniques for High-Performance Computing , 2015, Computer Communications and Networks.
[29] Franck Cappello,et al. Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[30] Austin R. Benson,et al. Silent error detection in numerical time-stepping schemes , 2015, Int. J. High Perform. Comput. Appl..
[31] Robert E. Lyons,et al. The Use of Triple-Modular Redundancy to Improve Computer Reliability , 1962, IBM J. Res. Dev..
[32] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[33] Bronis R. de Supinski,et al. Soft error vulnerability of iterative linear algebra methods , 2008, ICS '08.
[34] Franck Cappello,et al. Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation , 2015, 2015 IEEE International Conference on Cluster Computing.
[35] George Bosilca,et al. Algorithm-based fault tolerance applied to high performance computing , 2009, J. Parallel Distributed Comput..
[36] Thomas Hérault,et al. Performance and reliability trade-offs for the double checkpointing algorithm , 2014, Int. J. Netw. Comput..
[37] Patrick M. Widener,et al. Canaries in a Coal Mine: Using Application-Level Checkpoints to Detect Memory Failures , 2015, Euro-Par Workshops.
[38] Richard W. Vuduc,et al. Self-stabilizing iterative solvers , 2013, ScalA '13.
[39] Christian Engelmann,et al. Combining Partial Redundancy and Checkpointing for HPC , 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.
[40] Luís Moura Silva,et al. Using two-level stable storage for efficient checkpointing , 1998, IEE Proc. Softw..
[41] Xian-He Sun,et al. Optimizing HPC Fault-Tolerant Environment: An Analytical Approach , 2010, 2010 39th International Conference on Parallel Processing.