Rejuvenating Shadows: Fault Tolerance with Forward Recovery