Characterizing a Detection Strategy for Transient Faults in HPC
Armando Eduardo De Giusti | Patricia Mabel Pesado | Claudia Cecilia Russo | Emilio Luque Fadón | Enzo Rucci | Marcelo Naiouf | Dolores Rexachs del Rosario | Guillermo Eugenio Feierherd | Diego Miguel Montezanti
[1] Kunle Olukotun, et al. The Future of Microprocessors, 2005, ACM Queue.
[2] Franck Cappello, et al. Toward Exascale Resilience: 2014 update, 2014, Supercomput. Front. Innov.
[3] James H. Laros, et al. Evaluating the viability of process replication reliability for exascale systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[4] Christian Engelmann, et al. Redundant Execution of HPC Applications with MR-MPI, 2011.
[5] Andrew A. Chien, et al. The future of microprocessors, 2011, Commun. ACM.
[6] Laxmikant V. Kalé, et al. ACR: Automatic checkpoint/restart for soft and hard error protection, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[7] Franck Cappello, et al. Addressing failures in exascale computing, 2014, Int. J. High Perform. Comput. Appl.
[8] Emilio Luque, et al. SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters, 2012, CLEI Electron. J.
[9] Christian Engelmann, et al. Combining Partial Redundancy and Checkpointing for HPC, 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.
[10] Narayan Desai, et al. Co-analysis of RAS Log and Job Log on Blue Gene/P, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[11] Christian Engelmann, et al. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems, 2009.
[12] Bronis R. de Supinski, et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[13] James H. Laros, et al. rMPI: increasing fault resiliency in a message-passing environment, 2011.
[14] David I. August, et al. SWIFT: software implemented fault tolerance, 2005, International Symposium on Code Generation and Optimization.
[15] Sarita V. Adve, et al. Low-cost program-level detectors for reducing silent data corruptions, 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[16] David Fiala. Detection and correction of silent data corruption for large-scale high-performance computing , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[17] Tipp Moseley, et al. PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures, 2009, IEEE Transactions on Dependable and Secure Computing.
[18] Armando Eduardo De Giusti, et al. A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters, 2013.
[19] Zizhong Chen. Algorithm-based recovery for iterative methods without checkpointing , 2011, HPDC '11.
[20] Osman S. Unsal, et al. Fault tolerance for multi-threaded applications by leveraging hardware transactional memory, 2013, CF '13.
[21] Andrew A. Chien, et al. When is multi-version checkpointing needed?, 2013, FTXS '13.
[22] Frank Mueller, et al. Evaluating the Impact of SDC on the GMRES Iterative Solver, 2013, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[23] Dong Li, et al. Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.