Improved message logging versus improved coordinated checkpointing for fault tolerant MPI

Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection and recovery for message passing systems with different impact on application performance and the capacity to tolerate a high fault rate. In a recent paper, we have demonstrated that the main differences between pessimistic sender based message logging and coordinated checkpointing are: 1) the communication latency and 2) the performance penalty in case of faults. Pessimistic message logging increases the latency, due to additional blocking control messages. When faults occur at a high rate, coordinated checkpointing implies a higher performance penalty than message logging due to a higher stress on the checkpoint server. We extend this study to improved versions of message logging and coordinated checkpoint protocols which respectively reduces the latency overhead of pessimistic message logging and the server stress of coordinated checkpoint. We detail the protocols and their implementation into the new MPICH-V fault tolerant framework. We compare their performance against the previous versions and we compare the novel message logging protocols against the improved coordinated checkpointing one using the NAS benchmark on a typical high performance cluster equipped with a high speed network. The contribution of This work is twofold: a) an original message logging protocol and an improved coordinated checkpointing protocol and b) the comparison between them.

[1]  Lorenzo Alvisi,et al.  The relative overhead of piggybacking in causal message logging protocols , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).

[2]  Carl Kesselman,et al.  Generalized communicators in the Message Passing Interface , 1996, Proceedings. Second MPI Developer's Conference.

[3]  Heon Young Yeom,et al.  An efficient algorithm for causal message logging , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).

[4]  Greg Burns,et al.  LAM: An Open Cluster Environment for MPI , 2002 .

[5]  Robert E. Strom,et al.  Optimistic recovery in distributed systems , 1985, TOCS.

[6]  SkjellumAnthony,et al.  A high-performance, portable implementation of the MPI message passing interface standard , 1996 .

[7]  David H. Bailey,et al.  The Nas Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..

[8]  Jason Duell,et al.  The Lam/Mpi Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..

[9]  Anthony Skjellum,et al.  A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard , 1996, Parallel Comput..

[10]  Willy Zwaenepoel,et al.  Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit , 1992, IEEE Trans. Computers.

[11]  Jack J. Dongarra,et al.  FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.

[12]  B. Bouteiller,et al.  MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[13]  Lorenzo Alvisi,et al.  Message logging: pessimistic, optimistic, and causal , 1995, Proceedings of 15th International Conference on Distributed Computing Systems.

[14]  Armin R. Mikler,et al.  NetPIPE: A Network Protocol Independent Performance Evaluator , 1996 .

[15]  Thomas Hérault,et al.  MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[16]  Harrick M. Vin,et al.  The Cost of Recovery in Message Logging Protocols , 2000, IEEE Trans. Knowl. Data Eng..

[17]  Harrick M. Vin,et al.  Egida: an extensible toolkit for low-overhead fault-tolerance , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[18]  Franck Cappello,et al.  Coordinated checkpoint versus message log for fault tolerant MPI , 2004, 2003 Proceedings IEEE International Conference on Cluster Computing.

[19]  Harrick M. Vin,et al.  The cost of recovery in message logging protocols , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).

[20]  Jack Dongarra,et al.  MPI: The Complete Reference , 1996 .

[21]  Miron Livny,et al.  Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System , 1997 .

[22]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.