Exploiting asynchrony from exact forward recovery for DUE in iterative solvers
Eduard Ayguadé | Mateo Valero | Jesús Labarta | Marc Casas | Miquel Moretó | Luc Jaulmes
[1] Zizhong Chen. Algorithm-based recovery for iterative methods without checkpointing, 2011, HPDC '11.
[2] Eduardo Pinheiro, et al. DRAM errors in the wild: a large-scale field study, 2009, SIGMETRICS '09.
[3] Osman S. Unsal, et al. NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart, 2015, 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.
[4] Franck Cappello, et al. Toward Exascale Resilience, 2009, Int. J. High Perform. Comput. Appl.
[5] Milo M. K. Martin, et al. SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery, 2002, Proceedings 29th Annual International Symposium on Computer Architecture.
[6] A. Kleen. mcelog: memory error handling in user space, 2010.
[7] Emmanuel Agullo, et al. Towards resilient parallel linear Krylov solvers: recover-restart strategies, 2013.
[8] Doe Hyun Yoon, et al. Virtualized and flexible ECC for main memory, 2010, ASPLOS XV.
[9] Richard Barrett, et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 1994, Other Titles in Applied Mathematics.
[10] Omer Subasi, et al. Leveraging a Task-based Asynchronous Dataflow Substrate for Efficient and Scalable Resiliency, 2014.
[11] Jack Dongarra, et al. HPCG Benchmark Technical Specification, 2013.
[12] Thomas Hérault, et al. Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI, 2013, Concurr. Comput. Pract. Exp.
[13] Dong Tang, et al. Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults, 2006, International Conference on Dependable Systems and Networks (DSN'06).
[14] Narayanan Vijaykrishnan, et al. The effect of threshold voltages on the soft error rate [memory and logic circuits], 2004, International Symposium on Signals, Circuits and Systems. Proceedings, SCS 2003. (Cat. No.03EX720).
[16] Irving S. Reed, et al. A class of multiple-error-correcting codes and the decoding scheme, 1954, Trans. IRE Prof. Group Inf. Theory.
[17] Alejandro Duran, et al. OmpSs: a Proposal for Programming Heterogeneous Multi-Core Architectures, 2011, Parallel Process. Lett.
[18] Bronis R. de Supinski, et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[19] Todd M. Austin, et al. A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor, 2003, MICRO.
[20] Kurt B. Ferreira, et al. Fault-tolerant iterative methods via selective reliability, 2011.
[21] Mark Anders, et al. Near-threshold voltage (NTV) design: Opportunities and challenges, 2012, DAC Design Automation Conference 2012.
[23] Timothy A. Davis, et al. The University of Florida sparse matrix collection, 2011, TOMS.
[24] Zizhong Chen, et al. Correcting soft errors online in LU factorization, 2013, HPDC '13.
[25] J. Shewchuk. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, 1994.
[26] George Bosilca, et al. Recovery Patterns for Iterative Methods in a Parallel Unstable Environment, 2007, SIAM J. Sci. Comput.
[27] Xin Li, et al. A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility, 2010, USENIX Annual Technical Conference.
[28] B. R. de Supinski, et al. Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System, 2010.
[29] Zizhong Chen, et al. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods, 2013, PPoPP '13.
[30] Yun Zhou, et al. The Reliability Wall for Exascale Supercomputing, 2012, IEEE Transactions on Computers.
[31] Ron Brightwell, et al. Cooperative Application/OS DRAM Fault Recovery, 2011, Euro-Par Workshops.
[32] Tamara G. Kolda, et al. An overview of the Trilinos project, 2005, TOMS.
[33] Mattan Erez, et al. Bamboo ECC: Strong, safe, and flexible codes for reliable computer memory, 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[34] Jacob A. Abraham, et al. Algorithm-Based Fault Tolerance for Matrix Operations, 1984, IEEE Transactions on Computers.
[35] Martin Schulz, et al. Fault resilience of the algebraic multi-grid solver, 2012, ICS '12.
[36] James H. Laros, et al. Evaluating the viability of process replication reliability for exascale systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[37] Henri Casanova, et al. Checkpointing strategies for parallel jobs, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[38] Franck Cappello, et al. Toward Exascale Resilience: 2014 update, 2014, Supercomput. Front. Innov.
[39] Jack J. Dongarra, et al. Fault Tolerant MPI for the HARNESS Meta-computing System, 2001, International Conference on Computational Science.
[40] Alejandro Duran, et al. Towards an Error Model for OpenMP, 2010, IWOMP.
[41] HPCG Technical Specification, Sandia Report, 2013.