Soft Error Detection for Iterative Applications Using Offline Training