Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI
Franck Cappello | Thomas Hérault | Darius Buntinas | Pierre Lemarinier | Camille Coti | Ala Rezmerita | Laurence Pilard | Eric Rodriguez
[1] Willy Zwaenepoel, et al. The performance of consistent checkpointing, 1992, Proceedings 11th Symposium on Reliable Distributed Systems.
[2] J. Duell. The design and implementation of Berkeley Lab's Linux checkpoint/restart, 2005.
[3] Franck Cappello, et al. Grid'5000: a large scale and highly reconfigurable grid experimental testbed, 2005, The 6th IEEE/ACM International Workshop on Grid Computing, 2005.
[4] Daniel Marques, et al. Automated application-level checkpointing of MPI programs, 2003, PPoPP '03.
[5] Charng-Da Lu, et al. Reliability challenges in large systems, 2006, Future Gener. Comput. Syst.
[6] Franck Cappello, et al. Grid'5000: a large scale, reconfigurable, controlable and monitorable Grid platform, 2005.
[7] Daniel Marques, et al. Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs, 2004, Proceedings of the ACM/IEEE SC2004 Conference.
[8] Jason Duell, et al. The design and implementation of Berkeley Lab's Linux checkpoint/restart, 2005.
[9] Lorenzo Alvisi, et al. An analysis of communication induced checkpointing, 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing.
[10] Achour Mostéfaoui, et al. Communication-Induced Determination of Consistent Snapshots, 1999, IEEE Trans. Parallel Distributed Syst.
[11] Jack Dongarra, et al. MPI: The Complete Reference, 1996.
[12] Leslie Lamport, et al. Distributed snapshots: determining global states of distributed systems, 1985, TOCS.
[13] Miron Livny, et al. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System, 1997.
[14] Thomas Hérault, et al. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI, 2004, 2004 IEEE International Conference on Cluster Computing.
[15] Guillaume Mercier, et al. Implementation and Shared-Memory Evaluation of MPICH2 over the Nemesis Communication Subsystem, 2006, PVM/MPI.
[16] Anthony Skjellum, et al. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard, 1996, Parallel Comput.
[17] Armin R. Mikler, et al. NetPIPE: A Network Protocol Independent Performance Evaluator, 1996.
[18] Brian Randell, et al. System structure for software fault tolerance, 1975, IEEE Transactions on Software Engineering.
[19] George Bosilca, et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation, 2004, PVM/MPI.
[20] David H. Bailey, et al. The NAS Parallel Benchmarks, 1991, Int. J. High Perform. Comput. Appl.
[21] Jason Duell, et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing, 2005, Int. J. High Perform. Comput. Appl.
[22] Thomas Hérault, et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes, 2002, ACM/IEEE SC 2002 Conference (SC'02).
[23] Robert E. Strom, et al. Optimistic recovery in distributed systems, 1985, TOCS.
[24] Greg Burns, et al. LAM: An Open Cluster Environment for MPI, 2002.