Resilience for Stencil Computations with Latent Errors