Paper information - Concurrent rollback for crash recovery in extended hypercube networks

Concurrent rollback for crash recovery in extended hypercube networks

Recovering from processor failures is an important problem in the design and development of reliable systems. We present a concurrent rollback algorithm for extended hypercube networks that recovers from crash failures with low message and time complexity. An extended hypercube network is a hierarchical, recursive structure with low diameter. By appending only O(1) additional information to each message, recovery requires fewer than O(N log N) message exchanges and O(log^2 N) elapsed time, where N is the number of processors in the extended hypercube network. The algorithm can be used to recover from the failure of an arbitrary number of processors.
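
To give a feel for how the stated bounds scale, the following minimal sketch evaluates the growth terms N log N and log^2 N for a few network sizes. It is only an illustration: the function name `recovery_bounds` is hypothetical, the base-2 logarithm is an assumption (the abstract does not fix a base), and the constant factors hidden by the O-notation are ignored, so the numbers are not actual message or time counts from the paper.

```python
import math


def recovery_bounds(n: int) -> tuple[float, float]:
    """Return (n * log2(n), log2(n) ** 2) for a network of n processors.

    These are only the growth terms of the bounds quoted in the abstract;
    the base-2 logarithm and the omission of constant factors are assumptions.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    log_n = math.log2(n)
    return n * log_n, log_n ** 2


# Compare the two growth terms for a few power-of-two network sizes.
for n in (64, 1024, 4096):
    message_term, time_term = recovery_bounds(n)
    print(f"N={n}: N log N = {message_term:.0f}, log^2 N = {time_term:.0f}")
```

Under these assumptions the time term grows very slowly relative to the message term: going from 64 to 4096 processors multiplies N log N by 128 but log^2 N by only 4.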

[1] A. Prasad Sistla, et al. Efficient distributed recovery using message logging, 1989, PODC '89.

[2] Leslie Lamport, et al. Distributed snapshots: determining global states of distributed systems, 1985, TOCS.

[3] Willy Zwaenepoel, et al. Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit, 1992, IEEE Trans. Computers.

[4] Ge-Ming Chiu, et al. Efficient Rollback-Recovery Technique in Distributed Computing Systems, 1996, IEEE Trans. Parallel Distributed Syst.

[5] Robert E. Strom, et al. Optimistic recovery in distributed systems, 1985, TOCS.

[6] Fred B. Schneider, et al. Byzantine generals in action: implementing fail-stop processors, 1984, TOCS.

[7] Mohan Kumar, et al. Extended Hypercube: A Hierarchical Interconnection Network of Hypercubes, 1992, IEEE Trans. Parallel Distributed Syst.

[8] Leslie Lamport, et al. Time, clocks, and the ordering of events in a distributed system, 1978, CACM.

[9] David B. Johnson, et al. Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing, 1988, J. Algorithms.

[10] S. Venkatesan, et al. Crash recovery with little overhead, 1991, Proceedings of the 11th International Conference on Distributed Computing Systems.