Performance Optimization of Checkpointing Schemes with Task Duplication

This thesis deals with fault-tolerant schemes that combine checkpointing, which shortens recovery time after failures, with task duplication, which provides fault detection. Until now there was no known analytical method for these schemes, and their performance was evaluated by simulation. The thesis presents a new analysis technique for checkpointing schemes with task duplication, which gives an easy-to-use method for studying their performance. Several applications of the technique are given, such as finding the optimal interval between checkpoints and comparing different aspects of the performance of existing schemes. One of the conclusions reached from studying existing schemes is that the system on which a scheme is implemented can have a major effect on the scheme's performance. The thesis describes new checkpointing schemes that use two types of checkpoints: compare checkpoints and store checkpoints. The two types can be used to tune a scheme to the system it runs on and enable efficient use of system resources. Analysis results show that using two types of checkpoints can significantly improve the performance of checkpointing schemes. Experimental results, obtained on the Intel Paragon parallel computer and on a cluster of workstations, confirm that tuning checkpointing schemes to the specific systems they run on can significantly improve their performance. Another way to improve performance is to exploit changes in the checkpointing cost when deciding where to place checkpoints. A new on-line algorithm is presented that uses past and present knowledge to decide whether or not to place a checkpoint. Analysis of the new scheme shows that the total execution-time overhead under the proposed algorithm is significantly smaller than the overhead under fixed checkpoint intervals. Although the proposed on-line algorithm uses only knowledge of the past and present, its behavior is close to that of the optimal off-line algorithm, which has complete knowledge of the checkpointing cost at all possible locations.
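To make the idea of an optimal interval between checkpoints concrete, the following is a minimal sketch under a deliberately simplified model. It is not the analysis technique of the thesis. The assumptions are ours: two replicas run the same task, each replica is hit by faults as a Poisson process with rate `lam`, faults are detected only when the two states are compared at a checkpoint, a mismatch forces the whole interval to be repeated, no faults occur during the checkpoint operation itself, and the number of intervals is allowed to be fractional. Under these assumptions an interval of length `tau` with checkpoint cost `cost` is repeated a geometrically distributed number of times, so its expected duration is `(tau + cost) * exp(2 * lam * tau)`. All function and parameter names are hypothetical.

```python
import math


def expected_time(work, tau, cost, lam):
    """Expected completion time for `work` units under the simplified model.

    Assumed model (not taken from the thesis): two replicas, per-replica
    Poisson fault rate `lam`, detection only at checkpoint comparison,
    rollback to the previous checkpoint on a mismatch.
    """
    intervals = work / tau                     # fractional count allowed for simplicity
    p_clean = math.exp(-2.0 * lam * tau)       # neither replica faults during one interval
    return intervals * (tau + cost) / p_clean  # mean of a geometric number of attempts


def optimal_interval(cost, lam):
    """Interval that minimizes expected_time for this model only.

    Setting the derivative of (1 + cost/tau) * exp(2*lam*tau) to zero gives
    2*lam*tau**2 + 2*lam*cost*tau - cost = 0; this returns its positive root.
    """
    return (-cost + math.sqrt(cost * cost + 2.0 * cost / lam)) / 2.0


if __name__ == "__main__":
    work, cost, lam = 10_000.0, 5.0, 1e-4      # illustrative values only
    best = optimal_interval(cost, lam)
    total = expected_time(work, best, cost, lam)
    print(f"interval ~ {best:.1f}, expected time ~ {total:.1f}")
```

The sketch shows the trade-off in this toy model: short intervals pay the checkpoint cost too often, while long intervals make each rollback more expensive and more likely.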
