Checkpointing and rollback recovery in a distributed system using common time base

An approach to checkpointing and rollback recovery in a distributed computing system using a common time base is proposed. First, a common time base is established in the system using a hardware clock synchronization algorithm. This common time base is then coupled with a pseudorecovery block approach to develop a checkpointing algorithm that has the following advantages: (i) maximum process autonomy, (ii) no wait for commitment when establishing recovery lines, (iii) fewer messages exchanged, and (iv) lower memory requirements.
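As a rough illustration of why a common time base can remove the commitment wait, the sketch below has each process save its state on its own whenever the shared clock crosses a period boundary, so that checkpoints with the same index form a candidate recovery line without any coordination messages. It is not the algorithm from the paper: the `Process` class, the `period` parameter, and the `recovery_line` helper are hypothetical names, clocks are assumed to be perfectly synchronized, and the handling of messages in flight during clock skew (the role of the pseudorecovery blocks) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that checkpoints on a shared time base."""
    name: str
    period: float                 # assumed checkpoint interval, in common-clock units
    state: int = 0
    checkpoints: dict = field(default_factory=dict)  # checkpoint index -> saved state

    def tick(self, common_time: float) -> None:
        # Save state once per period, decided locally from the common clock.
        index = int(common_time // self.period)
        if index not in self.checkpoints:
            self.checkpoints[index] = self.state

    def rollback(self, index: int) -> None:
        # Restore the state saved at the given checkpoint index.
        self.state = self.checkpoints[index]


def recovery_line(processes):
    """Return the latest checkpoint index that every process has saved."""
    shared = set.intersection(*(set(p.checkpoints) for p in processes))
    return max(shared)


# Two processes read slightly different instants of the same clock.
p, q = Process("p", 10.0), Process("q", 10.0)
p.tick(0.0); q.tick(0.0)          # both save checkpoint 0
p.state, q.state = 5, 7
p.tick(10.2); q.tick(10.4)        # both save checkpoint 1, no messages exchanged
p.state, q.state = 99, 42         # work done after the checkpoint

line = recovery_line([p, q])
for proc in (p, q):
    proc.rollback(line)
```

In this simplified setting, every process reaches the same checkpoint index independently, which is the intuition behind advantages (i) to (iii); a real system would also have to bound clock skew and account for messages sent near a checkpoint boundary.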
