Paper information - How fail-stop are faulty programs?

How fail-stop are faulty programs?

Most fault-tolerant systems are designed to stop faulty programs before they write permanent data or communicate with other processes. This property (halt-on-failure) forms the core of the fail-stop model. Unfortunately, little experimental data exists on whether or not program failures follow the fail-stop model. This paper describes a tool, based on the SimOS complete-machine simulator, that can trace how faults propagate through memory, disk, and functions. Using this tool on the Postgres database system, we conduct a controlled experiment to measure how often faulty programs violate the fail-stop model. We find that a significant number of faults (7%) violate the fail-stop model by writing incorrect data to stable storage before halting. We then apply Postgres' transaction mechanism to undo recent changes before a crash and find that transactions reduce fail-stop violations by a factor of 3.

Peter M. Chen | Subhachandra Chandra
