Treating Bugs as Allergies: A Safe Method for Surviving Software Failures

Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Previous approaches for surviving software failures suffer from several limitations, including requiring application restructuring, failing to address deterministic software bugs, unsafely speculating on program execution, and requiring a long recovery time. This paper proposes an innovative, safe technique, called Rx, that can quickly recover programs from many types of common software bugs, both deterministic and non-deterministic. Our idea, inspired by allergy treatment in real life, is to rollback the program to a recent checkpoint upon a software failure, and then to reexecute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the "allergen" from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.

[1]  Brian Randell,et al.  System structure for software fault tolerance , 1975, IEEE Transactions on Software Engineering.

[2]  Brian Randell,et al.  Reliability Issues in Computing System Design , 1978, CSUR.

[3]  Joel F. Bartlett,et al.  A NonStop kernel , 1981, SOSP.

[4]  Algirdas Avizienis,et al.  The N-Version Approach to Fault-Tolerant Software , 1985, IEEE Transactions on Software Engineering.

[5]  Jim Gray,et al.  Why Do Computers Stop and What Can Be Done About It? , 1986, Symposium on Reliability in Distributed Software and Database Systems.

[6]  Wolfgang Graetsch,et al.  Fault tolerance under UNIX , 1989, TOCS.

[7]  Mark Sullivan,et al.  Software defects and their impact on system availability-a study of field failures in operating systems , 1991, [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.

[8]  W. Kent Fuchs,et al.  Progressive retry for software error recovery in distributed systems , 1993, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.

[9]  Yennun Huang,et al.  Software rejuvenation: analysis, module and applications , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[10]  Kishor S. Trivedi,et al.  On the analysis of software rejuvenation policies , 1997, Proceedings of COMPASS '97: 12th Annual Conference on Computer Assurance.

[11]  Matteo Sereno,et al.  Fine grained software rejuvenation models , 1998, Proceedings. IEEE International Computer Performance and Dependability Symposium. IPDS'98 (Cat. No.98TB100248).

[12]  Werner Vogels,et al.  The Design and Architecture of the Microsoft Cluster Service , 1998 .

[13]  Kenneth P. Birman,et al.  The design and architecture of the Microsoft Cluster Service-a practical approach to high-availability and scalability , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).

[14]  Crispan Cowan,et al.  StackGuard: Automatic Adaptive Detection and Prevention of Buffer-Overflow Attacks , 1998, USENIX Security Symposium.

[15]  Katherine Guo,et al.  Scalability of the microsoft cluster service , 1998 .

[16]  Evan Marcus,et al.  Blueprints for high availability , 2000 .

[17]  Peter M. Chen,et al.  Whither generic recovery from application faults? A fault study using open-source software , 2000, Proceeding International Conference on Dependable Systems and Networks. DSN 2000.

[18]  Evan Marcus,et al.  Blueprints for high availability: designing resilient distributed systems , 2000 .

[19]  George Candea,et al.  Recursive restartability: turning the reboot sledgehammer into a scalpel , 2001, Proceedings Eighth Workshop on Hot Topics in Operating Systems.

[20]  Dhiraj K. Pradhan,et al.  Roll-Forward and Rollback Recovery: Performance-Reliability Trade-Off , 1997, IEEE Trans. Computers.

[21]  George Candea,et al.  Reducing recovery time in a small recursively restartable system , 2002, Proceedings International Conference on Dependable Systems and Networks.

[22]  Srikanth Kandula,et al.  Flashback: A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging , 2004, USENIX Annual Technical Conference, General Track.

[23]  Daniel M. Roy,et al.  A dynamic technique for eliminating buffer overflow vulnerabilities (and other memory errors) , 2004, 20th Annual Computer Security Applications Conference.

[24]  Daniel M. Roy,et al.  Enhancing Server Availability and Security Through Failure-Oblivious Computing , 2004, OSDI.

[25]  Srikanth Kandula,et al.  Flashback: A Light-weight Rollback and Deterministic Replay Extension for Software Debugging , 2004 .

[26]  Yuanyuan Zhou,et al.  SafeMem: exploiting ECC-memory for detecting memory leaks and memory corruption during production runs , 2005, 11th International Symposium on High-Performance Computer Architecture.

[27]  Angelos D. Keromytis,et al.  Building a Reactive Immune System for Software Services , 2005, USENIX Annual Technical Conference, General Track.