An Efficient Silent Data Corruption Detection Method with Error-Feedback Control and Even Sampling for HPC Applications
[1] Franck Cappello, et al. Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales, 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[2] Christian Engelmann, et al. A Tunable, Software-Based DRAM Error Detection and Correction Library for HPC, 2011, Euro-Par Workshops.
[3] Keun Soo Yim. Characterization of Impact of Transient Faults and Detection of Data Corruption Errors in Large-Scale N-Body Programs Using Graphics Processing Units, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[4] Rakesh Kumar, et al. Algorithmic approaches to low overhead fault detection for sparse linear algebra, 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[5] Daniel S. Katz, et al. Tests and Tolerances for High-Performance Software-Implemented Fault Detection, 2003, IEEE Trans. Computers.
[6] Franck Cappello, et al. Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[7] Daniel S. Katz, et al. Software-implemented fault detection for high-performance space applications, 2000, Proceeding International Conference on Dependable Systems and Networks. DSN 2000.
[8] Austin R. Benson, et al. Silent error detection in numerical time-stepping schemes, 2015, Int. J. High Perform. Comput. Appl.
[9] Robert A. van de Geijn, et al. Fault-tolerant high-performance matrix multiplication: theory and practice, 2001, 2001 International Conference on Dependable Systems and Networks.
[10] Gilbert T. Walker, et al. On Periodicity in Series of Related Terms, 1931.
[11] Israel Koren, et al. Application-level fault tolerance in the orbital thermal imaging spectrometer, 2004, 10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings.
[12] Franck Cappello, et al. FTI: High performance Fault Tolerance Interface for hybrid systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[13] Franck Cappello, et al. Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications, 2015, HPDC.
[14] Zizhong Chen, et al. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods, 2013, PPoPP '13.
[15] G. Yule. On a Method of Investigating Periodicities in Disturbed Series, with Special Reference to Wolfer's Sunspot Numbers, 1927.
[16] Tiranee Achalakul, et al. Failure Prediction of Data Centers Using Time Series and Fault Tree Analysis, 2012, 2012 IEEE 18th International Conference on Parallel and Distributed Systems.
[17] Robert A. van de Geijn, et al. Fault-Tolerant High-Performance Matrix Multiplication, 2004.
[18] Ravishankar K. Iyer, et al. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters, 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[19] F. Cappello, et al. Toward Effective Detection of Silent Data Corruptions for HPC Applications, 2014.
[20] Rolf Riesen, et al. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing, 2012, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.