Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand
Dhabaleswar K. Panda | Wei Huang | Qi Gao | Matthew J. Koop
[1] Chao Wang,et al. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[2] Jack Dongarra,et al. Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems , 2004 .
[3] Dhabaleswar K. Panda,et al. Design and implementation of MPICH2 over InfiniBand with RDMA support , 2003, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[4] Jeffrey S. Vetter,et al. Communication characteristics of large-scale scientific applications for contemporary cluster architectures , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.
[5] L. Alvisi,et al. A Survey of Rollback-Recovery Protocols , 2002 .
[6] Thomas Hérault,et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[7] Andrew Lumsdaine,et al. The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[8] Jason Duell,et al. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .
[9] F. Cappello,et al. Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI , 2006, ACM/IEEE SC 2006 Conference (SC'06).
[10] Song Jiang,et al. Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[11] Dhabaleswar K. Panda,et al. Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand , 2006, 2006 International Conference on Parallel Processing (ICPP'06).
[12] J. Duell. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .
[13] Jason Duell,et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl.
[14] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[15] Srinivasan Parthasarathy,et al. Parallel algorithms for mining frequent structural motifs in scientific data , 2004, ICS '04.
[16] B. Bouteiller,et al. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging , 2003, ACM/IEEE SC 2003 Conference (SC'03).
[17] Thomas Hérault,et al. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[18] Thomas Hérault,et al. Impact of event logger on causal message logging protocols for fault tolerant MPI , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[19] 2007 International Conference on Parallel Processing , 2007 .