Another Two-Level Failure Recovery Scheme: Performance Impact of Checkpoint Placement and Checkpoint Latency

This report deals with the design and evaluation of a "two-level" failure recovery scheme for distributed systems. In our previous work [30, 32], we motivated a "two-level" recovery approach that tolerates the more probable failures with a low overhead, and the less probable failures with a possibly higher overhead. The two-level approach can achieve a smaller overhead than traditional recovery schemes. The contributions of this report are summarized below. We present and evaluate a "two-level" recovery scheme that is suitable for a network of workstations, each workstation having a local disk. The recovery scheme presented in the report can tolerate transient processor failures with a low overhead, while other failures require a larger overhead. The report presents an analysis of the average (expected) task completion time using the proposed scheme. This scheme has been implemented on a workstation cluster. Our analysis indicates that the proposed two-level recovery scheme can achieve better performance than existing "one-level" recovery schemes. The report also evaluates the impact of checkpoint latency on the performance of the recovery scheme. To our knowledge, no analysis of the performance impact of checkpoint latency has been carried out previously. Experimental measurements of checkpoint latency and checkpoint overhead for four applications are presented.

References [32, 30] present material related to this report. The interested reader can obtain these references via anonymous ftp from ftp.cs.tamu.edu:/pub/vaidya. This report was revised several times in January 1995. The purpose of these revisions was to add Sections 10 and 11, and to revise Section 1.

[1] John W. Young, et al. A first order approximation to the optimum checkpoint interval, 1974, CACM.

[2] K. Mani Chandy, et al. Analytic models for rollback and recovery strategies in data base systems, 1975, IEEE Transactions on Software Engineering.

[3] Erol Gelenbe, et al. A model of roll-back recovery with multiple checkpoints, 1976, ICSE '76.

[4] Erol Gelenbe, et al. Performance of rollback recovery systems under intermittent failures, 1978, CACM.

[5] Richard D. Schlichting, et al. Fail-stop processors: an approach to designing fault-tolerant computing systems, 1983, TOCS.

[6] Terry Williams, et al. Probability and Statistics with Reliability, Queueing and Computer Science Applications, 1983.

[7] Andreas Reuter, et al. Performance analysis of recovery techniques, 1984, TODS.

[8] Performance analysis of checkpointing strategies, 1984, TOCS.

[9] Leslie Lamport, et al. Distributed snapshots: determining global states of distributed systems, 1985, TOCS.

[10] G. V. Kulkarni, et al. Effects of Checkpointing and Queueing on Program Performance, 1987.

[11] Kang G. Shin, et al. Optimal Checkpointing of Real-Time Tasks, 1987, IEEE Transactions on Computers.

[12] Kewal K. Saluja, et al. An experimental study to determine task size for rollback recovery systems, 1988.

[13] Jacques Malenfant, et al. Computing Optimal Checkpointing Strategies for Rollback and Recovery Systems, 1988, IEEE Transactions on Computers.

[14] Taesoon Park, et al. Checkpointing and rollback-recovery in distributed systems, 1989.

[15] Victor F. Nicola, et al. Comparative Analysis of Different Models of Checkpointing and Recovery, 1990, IEEE Transactions on Software Engineering.

[16] Darrell D. E. Long, et al. A study of the reliability of Internet sites, 1991, Proceedings Tenth Symposium on Reliable Distributed Systems.

[17] Willy Zwaenepoel, et al. The performance of consistent checkpointing, 1992, Proceedings 11th Symposium on Reliable Distributed Systems.

[18] Vincenzo Grassi, et al. On the Optimal Checkpointing of Critical Tasks and Transaction-Oriented Systems, 1992, IEEE Transactions on Software Engineering.

[19] Sachin Garg, et al. Analysis of an Improved Distributed Checkpointing Algorithm, 1993.

[20] Nitin Hemant Vaidya, et al. Low-cost schemes for fault tolerance, 1993.

[21] Lorenzo Alvisi, et al. Nonblocking and orphan-free message logging protocols, 1992, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.

[22] Peter Steenkiste, et al. Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery, 1993.

[23] James S. Plank. Efficient checkpointing on MIMD architectures, 1993.

[24] Mark A. Franklin, et al. Distributed computing systems and checkpointing, 1993, Proceedings The 2nd International Symposium on High Performance Distributed Computing.

[25] David A. Patterson, et al. Computer Organization & Design: The Hardware/Software Interface, 1993.

[26] Dhiraj K. Pradhan, et al. Roll-Forward Checkpointing Scheme: A Novel Fault-Tolerant Architecture, 1994, IEEE Transactions on Computers.

[27] Jeffrey F. Naughton, et al. Low-Latency, Concurrent Checkpointing for Parallel Programs, 1994, IEEE Transactions on Parallel and Distributed Systems.

[28] Jehoshua Bruck, et al. Analysis of checkpointing schemes for multiprocessor systems, 1994, Proceedings of IEEE 13th Symposium on Reliable Distributed Systems.

[29] Kai Li, et al. Libckpt: Transparent Checkpointing under UNIX, 1995, USENIX.

[30] Nitin H. Vaidya, et al. A case for two-level distributed recovery schemes, 1995, SIGMETRICS '95/PERFORMANCE '95.