Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors
[1] Hans P. Muhlfeld,et al. Cosmic ray soft error rates of 16-Mb DRAM memory chips , 1998, IEEE J. Solid State Circuits.
[2] Nitin H. Vaidya,et al. A case for two-level distributed recovery schemes , 1995, SIGMETRICS '95/PERFORMANCE '95.
[3] Michael C. Huang,et al. Supporting highly-decoupled thread-level redundancy for parallel programs , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.
[4] John Shalf,et al. DOE Advanced Scientific Computing Advisory Subcommittee (ASCAC) Report: Top Ten Exascale Research Challenges , 2014 .
[5] Omer Subasi,et al. Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications , 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).
[6] Jaspal Subhlok,et al. VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes , 2009, PVM/MPI.
[7] E. N. Elnozahy,et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.
[8] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..
[9] Franck Cappello,et al. Toward Exascale Resilience , 2009, Int. J. High Perform. Comput. Appl..
[10] Bongjae Kim,et al. Using replication and checkpointing for reliable task management in computational Grids , 2010, 2010 International Conference on High Performance Computing & Simulation.
[11] Yves Robert,et al. Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[12] James H. Laros,et al. Evaluating the viability of process replication reliability for exascale systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[13] Franck Cappello,et al. Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model , 2017, IEEE Transactions on Parallel and Distributed Systems.
[14] Laxmikant V. Kalé,et al. ACR: Automatic checkpoint/restart for soft and hard error protection , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[15] Robert E. Lyons,et al. The Use of Triple-Modular Redundancy to Improve Computer Reliability , 1962, IBM J. Res. Dev..
[16] Luís Moura Silva,et al. Using two-level stable storage for efficient checkpointing , 1998, IEE Proc. Softw..
[17] Yves Robert,et al. Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors , 2016, TOPC.
[18] Henri Casanova,et al. On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing , 2015, Future Gener. Comput. Syst..
[19] Domenico Talia,et al. Workflow Systems for Science: Concepts and Tools , 2013 .
[20] Christian Engelmann,et al. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems , 2009 .
[21] Zhibo Wu,et al. Thread-level redundancy fault tolerant CMP based on relaxed input replication , 2011, 2011 6th International Conference on Computer Sciences and Convergence Information Technology (ICCIT).
[22] Thomas Hérault,et al. Unified model for assessing checkpointing protocols at extreme‐scale , 2014, Concurr. Comput. Pract. Exp..
[23] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[24] Emma S. Buneci. Qualitative Performance Analysis for Large-Scale Scientific Workflows , 2008 .
[25] T. J. O'Gorman. The effect of cosmic rays on the soft error rate of a DRAM at ground level , 1994 .
[26] Özalp Babaoglu,et al. On the Optimum Checkpoint Selection Problem , 1984, SIAM J. Comput..
[27] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[28] Daniel A. Reed,et al. Fault Tolerance and Recovery of Scientific Workflows on Computational Grids , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[29] James L. Walsh,et al. IBM experiments in soft fails in computer electronics (1978-1994) , 1996, IBM J. Res. Dev..
[30] Bertram Ludäscher,et al. Scientific Workflows and Provenance: Introduction and Research Opportunities , 2012, Datenbank-Spektrum.
[31] P. Cochat,et al. Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.
[32] Mikyung Kang,et al. Programming Models and Development Software for a Space-Based Many-Core Processor , 2011, 2011 IEEE Fourth International Conference on Space Mission Challenges for Information Technology.
[33] Zhiling Lan,et al. Reliability-aware scalability models for high performance computing , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.
[34] Bianca Schroeder,et al. Understanding failures in petascale computers , 2007 .
[35] Laxmikant V. Kalé,et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[36] G. Amdahl,et al. Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).
[37] Aaas News,et al. Book Reviews , 1893, Buffalo Medical and Surgical Journal.
[38] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..
[39] Dinesh P. Mehta,et al. Meta-Algorithms for Scheduling a Chain of Coarse-Grained Tasks on an Array of Reconfigurable FPGAs , 2013, VLSI Design.
[40] James H. Laros,et al. Does partial replication pay off? , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).
[41] Yves Robert,et al. Fault-Tolerance Techniques for High-Performance Computing , 2015 .
[42] Sathish S. Vadhiyar,et al. ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability , 2012, ICCS.
[43] Omer Subasi,et al. Programmer-directed partial redundancy for resilient HPC , 2015, Conf. Computing Frontiers.
[44] Christian Engelmann,et al. Combining Partial Redundancy and Checkpointing for HPC , 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.
[45] Franck Cappello,et al. Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale , 2017, FTXS '17.
[46] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[47] Franck Cappello,et al. Toward Exascale Resilience: 2014 update , 2014, Supercomput. Front. Innov..
[48] Christian Engelmann,et al. Redundant Execution of HPC Applications with MR-MPI , 2011 .
[49] Yves Robert,et al. Combining Checkpointing and Replication for Reliable Execution of Linear Workflows , 2018, 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[50] Zhiling Lan,et al. Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart , 2015, IEEE Transactions on Computers.
[51] Osman Unsal,et al. Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer , 2016 .
[52] L. Alvisi,et al. A Survey of Rollback-Recovery Protocols , 2002 .
[53] Huntington W. Curtis,et al. Accelerated testing for cosmic soft-error rate , 1996, IBM J. Res. Dev..