Egida: an extensible toolkit for low-overhead fault-tolerance

We discuss the design and implementation of Egida, an object-oriented toolkit designed to support transparent rollback-recovery. Egida exports a simple specification language that can be used to express arbitrary rollback-recovery protocols. From this specification, Egida automatically synthesizes an implementation of the specified protocol by gluing together the appropriate objects from an available library of "building blocks". Egida is extensible and facilitates rapid implementation of rollback-recovery protocols with minimal programming effort. We have integrated Egida with the MPICH implementation of the MPI standard. Existing MPI applications can take advantage of Egida without any modifications: fault tolerance is achieved transparently, and all that is needed is a simple re-link of the MPI application with Egida.
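The idea of synthesizing a protocol from a specification can be illustrated with a toy sketch. This is not Egida's specification language or API, neither of which is shown in this abstract. The dictionary-based spec, the block names (`log_message`, `take_checkpoint`), and the `synthesize` function are all hypothetical, chosen only to show how a declarative description might be turned into a running protocol by looking up and chaining blocks from a library.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, list]
Block = Callable[[State, Any], None]

# Hypothetical building blocks; real rollback-recovery blocks would
# write to stable storage, track determinants, and so on.
def log_message(state: State, payload: Any) -> None:
    """Append a received message to an in-memory log."""
    state["log"].append(payload)

def take_checkpoint(state: State, payload: Any) -> None:
    """Record a snapshot of the current log as a checkpoint."""
    state["checkpoints"].append(list(state["log"]))

# Hypothetical library mapping block names to implementations.
LIBRARY: Dict[str, Block] = {
    "log_message": log_message,
    "take_checkpoint": take_checkpoint,
}

def synthesize(spec: Dict[str, List[str]]) -> Callable[[str, State, Any], None]:
    """Build an event dispatcher from a spec of the form
    {event name: [block names to run, in order]}."""
    handlers = {
        event: [LIBRARY[name] for name in names]
        for event, names in spec.items()
    }

    def dispatch(event: str, state: State, payload: Any = None) -> None:
        # Run every block bound to this event; unknown events are ignored.
        for block in handlers.get(event, []):
            block(state, payload)

    return dispatch

# Example spec (assumed format): log on every receive, checkpoint on a timer.
spec = {"receive": ["log_message"], "timer": ["take_checkpoint"]}
protocol = synthesize(spec)
```

Under these assumptions, changing the protocol means editing `spec` rather than the block code, which is the kind of separation the abstract describes.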
