Toward Local Failure Local Recovery Resilience Model using MPI-ULFM
[1] Thomas Naughton, et al. A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI, 2011, EuroMPI.
[2] Jack J. Dongarra, et al. Building and Using a Fault-Tolerant MPI Implementation, 2004, Int. J. High Perform. Comput. Appl.
[3] Bronis R. de Supinski, et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[4] Kurt B. Ferreira, et al. Fault-tolerant linear solvers via selective reliability, 2012, ArXiv.
[5] Thomas Hérault, et al. An evaluation of User-Level Failure Mitigation support in MPI, 2012, Computing.
[6] Tamara G. Kolda, et al. An overview of the Trilinos project, 2005, TOMS.
[7] Sandia Report, et al. Improving Performance via Mini-applications, 2009.
[8] Peter Arbenz, et al. A fault tolerant implementation of Multi-Level Monte Carlo methods, 2013, PARCO.
[9] Michael A. Heroux. Toward resilient algorithms and applications, 2013, FTXS '13.
[10] Jason Duell, et al. The design and implementation of Berkeley Lab's Linux checkpoint/restart, 2005.
[11] Jeffrey F. Naughton, et al. Low-Latency, Concurrent Checkpointing for Parallel Programs, 1994, IEEE Trans. Parallel Distributed Syst.
[12] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.
[13] Jack J. Dongarra, et al. Algorithm-based diskless checkpointing for fault tolerant matrix operations, 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.
[14] Mark A. Taylor, et al. Architecture of LA-MPI, a network-fault-tolerant MPI, 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[15] Kai Li, et al. Diskless Checkpointing, 1998, IEEE Trans. Parallel Distributed Syst.
[16] Eric Roman. A Survey of Checkpoint/Restart Implementations, 2002.
[17] Thomas Hérault, et al. Post-failure recovery of MPI communication capability, 2013, Int. J. High Perform. Comput. Appl.
[18] J. Duell. The design and implementation of Berkeley Lab's Linux checkpoint/restart, 2005.
[19] R. C. Whaley, et al. LAPACK Working Note 94: A User's Guide to the BLACS v1.0, 1995.