Performance Evaluation of Consistent Recovery Protocols Using MPICH-GF

This paper presents an implementation of several consistent recovery protocols at the abstract device level of MPICH and compares their performance. We ran experiments with three NAS Parallel Benchmark applications using class C data sets on state-of-the-art hardware. The notable result is that the causal message logging protocol incurs the highest recovery cost for communication-intensive applications, since it suffers from the concentrated load of replaying many messages simultaneously. Receiver-based optimistic message logging has the lowest recovery cost, at the price of extensive disk access overhead during failure-free execution. Overall, coordinated checkpointing appears to be the most practical choice among the three.
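To make the failure-free disk overhead of receiver-based logging concrete, the sketch below wraps MPI_Recv so that every delivered message is appended, with its envelope, to a per-rank log before the application proceeds; on recovery the process could replay from this log instead of asking senders to retransmit. This is an illustrative sketch only: the wrapper name logged_recv, the log file naming, and hooking at the MPI API level (rather than at MPICH-GF's abstract device layer, as in the paper) are assumptions made for the example.

/* Illustrative sketch of receiver-based message logging as a thin wrapper
 * over MPI_Recv.  Not the MPICH-GF implementation; MPICH-GF intercepts
 * messages inside the abstract device layer instead. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static FILE *msg_log = NULL;   /* per-process log of received messages */

/* Open a per-rank log file; a recovering process would replay from it. */
static void open_msg_log(int rank)
{
    char name[64];
    snprintf(name, sizeof(name), "msglog.%d", rank);   /* hypothetical naming */
    msg_log = fopen(name, "wb");
    if (!msg_log) { perror("fopen"); MPI_Abort(MPI_COMM_WORLD, 1); }
}

/* Receive a message and append envelope + payload to the local log.
 * An optimistic protocol could defer or batch the flush; flushing here
 * makes the per-message disk cost of receiver-based logging explicit. */
static int logged_recv(void *buf, int count, MPI_Datatype type,
                       int src, int tag, MPI_Comm comm, MPI_Status *status)
{
    int rc = MPI_Recv(buf, count, type, src, tag, comm, status);
    if (rc == MPI_SUCCESS && msg_log) {
        int recvd, tsize;
        MPI_Get_count(status, type, &recvd);   /* elements actually received */
        MPI_Type_size(type, &tsize);
        int envelope[3] = { status->MPI_SOURCE, status->MPI_TAG, recvd * tsize };
        fwrite(envelope, sizeof(envelope), 1, msg_log);
        fwrite(buf, 1, (size_t)recvd * (size_t)tsize, msg_log);
        fflush(msg_log);                       /* the failure-free overhead */
    }
    return rc;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    open_msg_log(rank);

    /* Trivial exchange: rank 0 sends one integer, rank 1 receives and logs it. */
    if (rank == 0 && size > 1) {
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload;
        MPI_Status st;
        logged_recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
        printf("rank 1 received %d and logged it\n", payload);
    }

    fclose(msg_log);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run on two or more ranks, the sketch shows why every received message translates into a synchronous write: this is the disk-access cost that the abstract attributes to receiver-based logging in failure-free runs.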
