CARE: compiler-assisted recovery from soft failures
Santosh Pande | Chao Chen | Greg Eisenhauer | Qiang Guan