The impact of recovery mechanisms on the likelihood of saving corrupted state

Recovery systems must save state before a failure occurs to enable the system to recover from the failure. However, recovery will fail if the recovery system saves any state corrupted by the fault. The frequency and comprehensiveness of how a recovery system saves state has a major effect on how often the recovery system inadvertently saves corrupted state. This paper explores and measures that effect. We measure how often software faults in the application and operating system cause real applications to save corrupted state when using different types of recovery systems. We find that generic recovery techniques, such as checkpointing and logging, work well for faults in the operating system. However, we find that they do not work well for faults in the application because the very actions taken to enable recovery often corrupt the state upon which successful recovery depends.

[1]  Jacob A. Abraham,et al.  FERRARI: A Flexible Software-Based Fault and Error Injection System , 1995, IEEE Trans. Computers.

[2]  Ravishankar K. Iyer,et al.  FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior under Faults , 1993, IEEE Trans. Software Eng..

[3]  Peter M. Chen,et al.  The Design and Verification of the Rio File Cache , 2001, IEEE Trans. Computers.

[4]  Peter M. Chen,et al.  How fail-stop are faulty programs? , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).

[5]  Henrique Madeira,et al.  Joint evaluation of performance and robustness of a COTS DBMS through fault-injection , 2000, Proceeding International Conference on Dependable Systems and Networks. DSN 2000.

[6]  Peter M. Chen,et al.  Exploring failure transparency and the limits of generic recovery , 2000, OSDI.

[7]  Ravishankar K. Iyer,et al.  Faults, symptoms, and software fault tolerance in the Tandem GUARDIAN90 operating system , 1993, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.

[8]  Daniel P. Siewiorek,et al.  Fault Injection Experiments Using FIAT , 1990, IEEE Trans. Computers.

[9]  Mark Sullivan,et al.  A comparison of software defects in database management systems and operating systems , 1992, [1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing.

[10]  Fred B. Schneider,et al.  Hypervisor-based fault tolerance , 1996, TOCS.

[11]  Kai Li,et al.  Libckpt: Transparent Checkpointing under UNIX , 1995, USENIX.

[12]  Peter M. Chen,et al.  An evaluation of the recovery-related properties of software faults , 2000 .