Distributed recovery with K-optimistic logging

Fault-tolerance techniques based on checkpointing and message logging are increasingly used in real-world applications to reduce service downtime. Most industrial applications have chosen pessimistic logging because it allows fast and localized recovery; the price they pay is higher failure-free overhead. In this paper, we introduce the concept of K-optimistic logging, where K is a degree of optimism that can be used to fine-tune the tradeoff between failure-free overhead and recovery efficiency. Traditional pessimistic logging and optimistic logging then become the two extremes of the spectrum spanned by K-optimistic logging. Our approach is to prove that only dependencies on states that may be lost upon a failure need to be tracked on-line, so transitive dependency tracking can be performed with a variable-size vector. The size of the vector piggybacked on a message then indicates the number of processes whose failures may revoke the message, and K is the system-imposed upper bound on that vector size.
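The variable-size vector and the bound K can be illustrated with a minimal sketch. It is not the protocol from the paper: the function names (`merge`, `prune`, `can_release`), the process labels, and the `stable` table (the highest state interval each process is assumed to have already logged) are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; names and data layout are assumptions, not the paper's protocol.
# A dependency vector maps a process id to the highest state interval of that
# process on which the current state (or a message) depends.

def merge(local, incoming):
    """Transitive dependency tracking: keep the highest interval per process."""
    merged = dict(local)
    for pid, idx in incoming.items():
        merged[pid] = max(merged.get(pid, -1), idx)
    return merged

def prune(vector, stable):
    """Drop entries for intervals already logged, since they cannot be lost."""
    return {pid: idx for pid, idx in vector.items() if idx > stable.get(pid, -1)}

def can_release(vector, k):
    """A message may be sent only if at most k process failures can revoke it."""
    return len(vector) <= k

# Assumed scenario: P0 has logged everything up to interval 3.
stable = {"P0": 3}
vec = merge({"P1": 2}, {"P0": 3, "P2": 5})  # dependencies after a delivery
vec = prune(vec, stable)                    # the entry for P0 is omitted
```

In this scenario the pruned vector has two entries, so `can_release(vec, 2)` holds while `can_release(vec, 1)` does not, and the message would have to wait until more of its dependencies are logged. Under these assumptions, K = 0 behaves like pessimistic logging (a message is released only when no failure can revoke it), and setting K to the number of processes never delays a message, as in optimistic logging.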
