Experimental evaluation of concurrent checkpointing and rollback-recovery algorithms

The performance of two distributed checkpointing and recovery algorithms, the synchronous checkpointing algorithm (SA) and the independent checkpointing algorithm (ICA), is evaluated. The evaluation is based on a detailed C-language implementation of previously published algorithms. The experiments use a benchmark that simulates a variety of application requirements and varies the number of processes running on one or more machines, the size of the processes to be checkpointed, the size of control messages, and the frequency of message exchanges during normal processing. The elapsed time and the CPU time needed to run a single instance of a checkpoint or rollback are measured. The experiments are repeated for various combinations of concurrent checkpoint and rollback executions. The number of messages needed for synchronization is also computed. It is found that the time a process spends processing control messages contributes significantly to the elapsed time in both algorithms. Elapsed times for recovery are found to be comparable for the two algorithms when the number of checkpoints is small.
