Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale
Franck Cappello | Yves Robert | Anne Benoit | Padma Raghavan | Aurélien Cavelan | Hongyang Sun
[1] Christian Engelmann,et al. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems , 2009 .
[2] Zhibo Wu,et al. Thread-level redundancy fault tolerant CMP based on relaxed input replication , 2011, 2011 6th International Conference on Computer Sciences and Convergence Information Technology (ICCIT).
[3] Henri Casanova,et al. Using group replication for resilience on exascale systems , 2014, Int. J. High Perform. Comput. Appl..
[4] Franck Cappello,et al. Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation , 2015, 2015 IEEE International Conference on Cluster Computing.
[5] James L. Walsh,et al. IBM experiments in soft fails in computer electronics (1978-1994) , 1996, IBM J. Res. Dev..
[6] Y. Robert,et al. Fault-Tolerance Techniques for High-Performance Computing , 2015, Computer Communications and Networks.
[7] Franck Cappello,et al. Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[8] Yves Robert,et al. Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors , 2016, TOPC.
[9] Yves Robert,et al. Which Verification for Soft Error Detection? , 2015, 2015 IEEE 22nd International Conference on High Performance Computing (HiPC).
[10] Austin R. Benson,et al. Silent error detection in numerical time-stepping schemes , 2015, Int. J. High Perform. Comput. Appl..
[11] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[12] E. Tronci,et al. 1996 , 1997, Affair of the Heart.
[13] T. J. O'Gorman. The effect of cosmic rays on the soft error rate of a DRAM at ground level , 1994 .
[14] Seetharami R. Seelam,et al. Modeling the Impact of Checkpoints on Next-Generation Systems , 2007, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007).
[15] Mikyung Kang,et al. Programming Models and Development Software for a Space-Based Many-Core Processor , 2011, 2011 IEEE Fourth International Conference on Space Mission Challenges for Information Technology.
[16] Richard W. Vuduc,et al. Self-stabilizing iterative solvers , 2013, ScalA '13.
[17] Padma Raghavan,et al. Fault tolerant preconditioned conjugate gradient for sparse linear system solution , 2012, ICS '12.
[18] Henri Casanova,et al. On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing , 2015, Future Gener. Comput. Syst..
[19] Rolf Riesen,et al. Transparent Redundant Computing with MPI , 2010, EuroMPI.
[20] Franck Cappello,et al. Detecting silent data corruption through data dynamic monitoring for scientific applications , 2014, PPoPP '14.
[21] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[22] Franck Cappello,et al. Toward Exascale Resilience: 2014 update , 2014, Supercomput. Front. Innov..
[23] Christian Engelmann,et al. Redundant Execution of HPC Applications with MR-MPI , 2011 .
[24] Israel Koren,et al. Application-level fault tolerance in the orbital thermal imaging spectrometer , 2004, 10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings..
[25] Michael C. Huang,et al. Supporting highly-decoupled thread-level redundancy for parallel programs , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.
[26] Franck Cappello,et al. Exploiting Spatial Smoothness in HPC Applications to Detect Silent Data Corruption , 2015, 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems.
[27] Jaspal Subhlok,et al. VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes , 2009, PVM/MPI.
[28] Franck Cappello,et al. Improving the trust in results of numerical simulations and scientific data analytics , 2015 .
[29] Christian Engelmann,et al. Combining Partial Redundancy and Checkpointing for HPC , 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.
[30] Robert E. Lyons,et al. The Use of Triple-Modular Redundancy to Improve Computer Reliability , 1962, IBM J. Res. Dev..
[31] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[32] John Shalf,et al. The International Exascale Software Project roadmap , 2011, Int. J. High Perform. Comput. Appl..
[33] Franck Cappello,et al. Toward Exascale Resilience , 2009, Int. J. High Perform. Comput. Appl..
[34] Bronis R. de Supinski,et al. Soft error vulnerability of iterative linear algebra methods , 2007, ICS '08.
[35] Franck Cappello,et al. Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..
[36] E. N. Elnozahy,et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.
[37] Franck Cappello,et al. Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications , 2015, HPDC.
[38] David Fiala. Detection and correction of silent data corruption for large-scale high-performance computing , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[39] B R de Supinski,et al. Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System , 2010 .
[40] George Bosilca,et al. Algorithm-based fault tolerance applied to high performance computing , 2009, J. Parallel Distributed Comput..
[41] James H. Laros,et al. Evaluating the viability of process replication reliability for exascale systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[42] Laxmikant V. Kalé,et al. ACR: Automatic checkpoint/restart for soft and hard error protection , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[43] James H. Laros,et al. Does partial replication pay off? , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).
[44] Yves Robert,et al. Fault-Tolerance Techniques for High-Performance Computing , 2015 .
[45] Sathish S. Vadhiyar,et al. ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability , 2012, ICCS.
[46] Omer Subasi,et al. Programmer-directed partial redundancy for resilient HPC , 2015, Conf. Computing Frontiers.
[47] Bianca Schroeder,et al. Understanding failures in petascale computers , 2007 .
[48] Kurt B. Ferreira,et al. Fault-tolerant iterative methods via selective reliability. , 2011 .
[49] Zhiling Lan,et al. Reliability-aware scalability models for high performance computing , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.
[50] Frank Mueller,et al. Evaluating the Impact of SDC on the GMRES Iterative Solver , 2013, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[51] Carl E. Landwehr,et al. Basic concepts and taxonomy of dependable and secure computing , 2004, IEEE Transactions on Dependable and Secure Computing.
[52] Bongjae Kim,et al. Using replication and checkpointing for reliable task management in computational Grids , 2010, 2010 International Conference on High Performance Computing & Simulation.
[53] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..