How fail-stop are faulty programs?

Most fault-tolerant systems are designed to stop faulty programs before they write permanent data or communicate with other processes. This halt-on-failure property forms the core of the fail-stop model. Unfortunately, little experimental data exists on whether program failures actually follow the fail-stop model. This paper describes a tool, based on the SimOS complete-machine simulator, that traces how faults propagate through memory, disk, and functions. Using this tool on the Postgres database system, we conduct a controlled experiment to measure how often faulty programs violate the fail-stop model. We find that a significant fraction of faults (7%) violate the fail-stop model by writing incorrect data to stable storage before halting. We then apply Postgres' transaction mechanism to undo recent changes before a crash and find that transactions reduce fail-stop violations by a factor of 3.
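As an illustration of the criterion stated above, the following minimal sketch classifies a single fault-injection run from its event trace. It is not the paper's tool: the `Event` record, the `corrupt` flag (meaning "this write differs from the fault-free run"), and the choice to compute the rate over crashing runs are all assumptions made for the example, and the traces in the usage note are invented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class Outcome(Enum):
    NO_CRASH = "no_crash"      # the run never halted
    FAIL_STOP = "fail_stop"    # halted before any incorrect write reached disk
    VIOLATION = "violation"    # incorrect data reached disk before the halt


@dataclass(frozen=True)
class Event:
    # Hypothetical trace record; kind is "fault", "disk_write", or "crash".
    kind: str
    # Only meaningful for disk writes: True if the written data is incorrect.
    corrupt: bool = False


def classify_run(events: Iterable[Event]) -> Outcome:
    """Classify one run by scanning its events in time order."""
    fault_active = False
    wrote_incorrect = False
    for ev in events:
        if ev.kind == "fault":
            fault_active = True
        elif ev.kind == "disk_write" and fault_active and ev.corrupt:
            wrote_incorrect = True
        elif ev.kind == "crash":
            return Outcome.VIOLATION if wrote_incorrect else Outcome.FAIL_STOP
    return Outcome.NO_CRASH


def violation_rate(runs: List[List[Event]]) -> float:
    """Fraction of crashing runs that violated fail-stop (assumed denominator)."""
    outcomes = [classify_run(r) for r in runs]
    crashes = [o for o in outcomes if o is not Outcome.NO_CRASH]
    if not crashes:
        return 0.0
    return sum(o is Outcome.VIOLATION for o in crashes) / len(crashes)
```

For example, a trace of `fault`, then a corrupt `disk_write`, then `crash` is classified as `Outcome.VIOLATION`, while `fault` followed directly by `crash` is `Outcome.FAIL_STOP`. A transaction mechanism that rolls back recent writes could be modeled by clearing the `corrupt` flag on writes that are undone before the crash.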
