Use of Common Time Base for Checkpointing and Rollback Recovery in a Distributed System

An approach to checkpointing and rollback recovery in a distributed computing system using a common time base is proposed. A common time base is established in the system using a hardware clock synchronization algorithm. This common time base is coupled with the idea of pseudo-recovery points to develop a checkpointing algorithm that has the following advantages: reduced wait for commitment for establishing recovery lines, fewer messages to be exchanged, and less memory requirement. These advantages are assessed quantitatively by developing a probabilistic model. >

[1]  Parameswaran Ramanathan,et al.  Clock Synchronization of a Large Multiprocessor System in the Presence of Malicious Faults , 1987, IEEE Transactions on Computers.

[2]  P. M. Melliar-Smith,et al.  A program structure for error detection and recovery , 1974, Symposium on Operating Systems.

[3]  Danny Dolev,et al.  On the minimal synchronism needed for distributed consensus , 1983, 24th Annual Symposium on Foundations of Computer Science (sfcs 1983).

[4]  Nancy A. Lynch,et al.  Impossibility of distributed consensus with one faulty process , 1983, PODS '83.

[5]  Joep L. W. Kessels Two Designs of a Fault-Tolerant Clocking System , 1984, IEEE Transactions on Computers.

[6]  Nancy A. Lynch,et al.  Impossibility of distributed consensus with one faulty process , 1985, JACM.

[7]  Brian Randell System structure for software fault tolerance , 1975 .

[8]  RICHARD KOO,et al.  Checkpointing and Rollback-Recovery for Distributed Systems , 1986, IEEE Transactions on Software Engineering.

[9]  Insup Lee,et al.  Adding Time to Synchronous Process Communications , 1987, IEEE Transactions on Computers.

[10]  K. H. Kim,et al.  Approaches to Mechanization of the Conversation Scheme Based on Monitors , 1982, IEEE Transactions on Software Engineering.

[11]  Kang G. Shin,et al.  Optimal Checkpointing of Real-Time Tasks , 1987, IEEE Transactions on Computers.

[12]  Kang G. Shin,et al.  Ensuring Fault Tolerance of Phase-Locked Clocks , 1985, IEEE Transactions on Computers.

[13]  Brian Randell,et al.  Reliability Issues in Computing System Design , 1978, CSUR.

[14]  Kang G. Shin,et al.  Evaluation of Error Recovery Blocks Used for Cooperating Processes , 1984, IEEE Transactions on Software Engineering.

[15]  Leslie Lamport,et al.  Using Time Instead of Timeout for Fault-Tolerant Distributed Systems. , 1984, TOPL.

[16]  Kang G. Shin,et al.  Design and Evaluation of a Fault-Tolerant Multiprocessor Using Hardware Recovery Blocks , 1984, IEEE Transactions on Computers.