Paper Information - Use of Common Time Base for Checkpointing and Rollback Recovery in a Distributed System

An approach to checkpointing and rollback recovery in a distributed computing system using a common time base is proposed. A common time base is established in the system using a hardware clock synchronization algorithm. This common time base is coupled with the idea of pseudo-recovery points to develop a checkpointing algorithm that has the following advantages: a shorter wait for commitment when establishing recovery lines, fewer messages to be exchanged, and lower memory requirements. These advantages are assessed quantitatively by developing a probabilistic model.

Parameswaran Ramanathan | Kang G. Shin
