Distributed Recovery Units: An Approach for Hybrid and Adaptive Distributed Recovery

Traditionally, distributed recovery schemes have been designed for systems consisting of multiple recovery units. Each recovery unit (RU) resides on a single processor and fails and recovers as a whole. This report introduces the "distributed recovery unit (DRU)" abstraction as an approach to designing "hybrid" and "adaptive" recovery schemes for distributed systems. The distributed system is viewed as a collection of DRUs, each DRU consisting of one or more RUs. This report presents a new recovery scheme based on the DRU abstraction. The proposed approach combines coordinated checkpointing with independent checkpointing and optimistic message logging to obtain a recovery scheme that can effectively trade overhead during failure-free operation against overhead during recovery.
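The structural view described above (a system as a collection of DRUs, each holding one or more RUs) can be illustrated with a minimal Python sketch. It is not taken from the report: the class names, the `build_system` and `is_traditional` helpers, and the rule that an RU belongs to exactly one DRU are all assumptions made for illustration. The sketch models only the grouping and says nothing about which checkpointing or logging technique the report applies at which level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryUnit:
    """An RU: resides on a single processor and fails/recovers as a whole."""
    name: str
    processor: str


@dataclass(frozen=True)
class DistributedRecoveryUnit:
    """A DRU: a named group of one or more RUs (hypothetical representation)."""
    name: str
    members: tuple

    def __post_init__(self):
        # The abstract states that a DRU consists of one or more RUs.
        if len(self.members) == 0:
            raise ValueError(f"DRU {self.name!r} must contain at least one RU")


def build_system(drus):
    """Return a map from RU name to the name of the DRU that contains it.

    Assumption (not stated in the abstract): each RU belongs to exactly
    one DRU, so a repeated RU name is rejected.
    """
    owner = {}
    for dru in drus:
        for ru in dru.members:
            if ru.name in owner:
                raise ValueError(f"RU {ru.name!r} appears in more than one DRU")
            owner[ru.name] = dru.name
    return owner


def is_traditional(drus):
    """True when every DRU holds a single RU, i.e. the grouping adds nothing."""
    return all(len(dru.members) == 1 for dru in drus)


# Example: three RUs on three processors, grouped into two DRUs.
p = RecoveryUnit("P", "cpu0")
q = RecoveryUnit("Q", "cpu1")
r = RecoveryUnit("R", "cpu2")
system = [
    DistributedRecoveryUnit("D1", (p, q)),
    DistributedRecoveryUnit("D2", (r,)),
]
ownership = build_system(system)
```

Under these assumptions, a system in which every DRU contains exactly one RU reduces to the traditional view of a set of independent recovery units, while larger DRUs group RUs that a recovery scheme could then treat as a unit.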
