Paper Information - Distributed Recovery Units: An Approach for Hybrid and Adaptive Distributed Recovery

Traditionally, distributed recovery schemes have been designed for systems consisting of multiple recovery units. Each recovery unit (RU) resides on a single processor and can fail and recover as a whole. This report introduces the "distributed recovery unit (DRU)" abstraction as an approach for the design of "hybrid" and "adaptive" recovery schemes for distributed systems. The distributed system is viewed as a collection of DRUs, each DRU consisting of one or more RUs. This report presents a new recovery scheme based on the DRU abstraction. The proposed approach combines coordinated checkpointing with independent checkpointing and optimistic message logging to obtain a recovery scheme that can effectively trade the overhead during failure-free operation against the overhead during recovery.

N. Vaidya
