Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
Thomas Hérault | Franck Cappello | Aurélien Bouteiller | Pierre Lemarinier | Géraud Krawezik
[1] Lorenzo Alvisi,et al. The relative overhead of piggybacking in causal message logging protocols , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).
[2] Carl Kesselman,et al. Generalized communicators in the Message Passing Interface , 1996, Proceedings. Second MPI Developer's Conference.
[3] Heon Young Yeom,et al. An efficient algorithm for causal message logging , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).
[4] Greg Burns,et al. LAM: An Open Cluster Environment for MPI , 2002 .
[5] Robert E. Strom,et al. Optimistic recovery in distributed systems , 1985, TOCS.
[6] Anthony Skjellum,et al. A high-performance, portable implementation of the MPI message passing interface standard , 1996 .
[7] David H. Bailey,et al. The NAS Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..
[8] Jason Duell,et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..
[9] Anthony Skjellum,et al. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard , 1996, Parallel Comput..
[10] Willy Zwaenepoel,et al. Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit , 1992, IEEE Trans. Computers.
[11] Jack J. Dongarra,et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.
[12] A. Bouteiller,et al. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging , 2003, ACM/IEEE SC 2003 Conference (SC'03).
[13] Lorenzo Alvisi,et al. Message logging: pessimistic, optimistic, and causal , 1995, Proceedings of 15th International Conference on Distributed Computing Systems.
[14] Armin R. Mikler,et al. NetPIPE: A Network Protocol Independent Performance Evaluator , 1996 .
[15] Thomas Hérault,et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[16] Harrick M. Vin,et al. The Cost of Recovery in Message Logging Protocols , 2000, IEEE Trans. Knowl. Data Eng..
[17] Harrick M. Vin,et al. Egida: an extensible toolkit for low-overhead fault-tolerance , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).
[18] Franck Cappello,et al. Coordinated checkpoint versus message log for fault tolerant MPI , 2004, 2003 Proceedings IEEE International Conference on Cluster Computing.
[19] Harrick M. Vin,et al. The cost of recovery in message logging protocols , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).
[20] Jack Dongarra,et al. MPI: The Complete Reference , 1996 .
[21] Miron Livny,et al. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System , 1997 .
[22] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.