Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations

In this paper we present an approach to reliable distributed computing that incorporates fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. In our model, fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance of message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on this model, a software library prototype called Libra has been implemented to support fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface, including message-passing and file I/O primitives, that hides from the application level the complexity of both fault-tolerant network communication and the checkpointing and recovery of user files. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.
