When is the Right Time to Start the Fault Tolerance Protection?
[1] Emilio Luque, et al. What is Missing in Current Checkpoint Interval Models?, 2011, 2011 31st International Conference on Distributed Computing Systems.
[2] L. Alvisi, et al. A Survey of Rollback-Recovery Protocols, 2002.
[3] Franck Cappello, et al. FTI: High performance Fault Tolerance Interface for hybrid systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[4] Bran Selic, et al. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, 2013, The Journal of Supercomputing.
[5] Franck Cappello, et al. Toward Exascale Resilience: 2014 update, 2014, Supercomput. Front. Innov.
[6] Thomas Hérault, et al. MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI, 2006, Int. J. High Perform. Comput. Appl.
[7] Gene Cooperman, et al. DMTCP: Transparent checkpointing for cluster computations and the desktop, 2007, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[8] B R de Supinski, et al. Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System, 2010.
[9] Franck Cappello, et al. Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model, 2017, IEEE Transactions on Parallel and Distributed Systems.
[10] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2010, IEEE Trans. Dependable Secur. Comput.
[11] Bronis R. de Supinski, et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[12] Franck Cappello, et al. Addressing failures in exascale computing, 2014, Int. J. High Perform. Comput. Appl.
[13] Emilio Luque, et al. Hybrid Message Logging. Combining advantages of Sender-based and Receiver-based Approaches, 2014, ICCS.
[14] Emilio Luque, et al. Fault tolerance at system level based on RADIC architecture, 2015, J. Parallel Distributed Comput.
[15] Nuria Losada, et al. Resilient MPI applications using an application-level checkpointing framework and ULFM, 2016, The Journal of Supercomputing.
[16] Emilio Luque, et al. Parallel Application Signature for Performance Analysis and Prediction, 2015, IEEE Transactions on Parallel and Distributed Systems.
[17] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.
[18] Thomas Hérault, et al. Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization, 2013, Euro-Par.