Analysis of failure recovery schemes for distributed shared-memory systems

Checkpointing and rollback recovery is a technique used to minimise the loss of computation when failures occur. When a process rolls back and re-executes from its last checkpoint, the cost (loss) incurred by redoing the lost computation may be larger than the cost of executing the original computation. In addition to the delay in completion time, other performance metrics (e.g. user satisfaction in real-time or online transaction applications) may also be degraded by unexpected failure and recovery. The paper determines how a redo overhead factor, which accounts for this unexpected execution overhead, affects the performance of a recovery scheme. It analyses the performance of three recoverable schemes, each incorporating the redo overhead factor: a multiple-fault-tolerant scheme using checkpointing and rollback recovery, a single-fault-tolerant scheme, and a two-level scheme.
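To make the role of the redo overhead factor concrete, the following is a minimal sketch and not the paper's analytical model. It assumes a simplified accounting in which each failure discards the work done since the last checkpoint, and re-executing that work costs `redo_factor` times its original cost. The function name, its parameters, and the choice to ignore restart latency and failures during re-execution are all assumptions made for illustration.

```python
def completion_time(useful_work, num_checkpoints, checkpoint_cost,
                    lost_work, redo_factor=1.0):
    """Illustrative completion time under checkpointing and rollback.

    useful_work     -- time to run the computation once, failure-free
    num_checkpoints -- number of checkpoints taken during the run
    checkpoint_cost -- time spent taking one checkpoint
    lost_work       -- list of work amounts lost at each failure
                       (work done since the last checkpoint)
    redo_factor     -- hypothetical multiplier (>= 1) on the cost of
                       re-executing lost work; 1.0 means redoing costs
                       the same as the original execution
    """
    if redo_factor < 1.0:
        raise ValueError("redo_factor is assumed to be at least 1.0")

    checkpoint_overhead = num_checkpoints * checkpoint_cost
    # Each lost segment has already been executed once (counted in
    # useful_work) and must be re-executed at the inflated cost.
    redo_overhead = sum(redo_factor * lost for lost in lost_work)
    return useful_work + checkpoint_overhead + redo_overhead


# Two failures losing 10 and 5 time units of work, respectively.
baseline = completion_time(100, 4, 2, [10, 5], redo_factor=1.0)
inflated = completion_time(100, 4, 2, [10, 5], redo_factor=1.5)
print(baseline, inflated)
```

Under these assumptions, raising the factor from 1.0 to 1.5 increases only the redo term, which is why the factor matters more for schemes that lose more work per failure.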
