Increasing software reliability through rollback and on-line fault repair

We propose a new paradigm for increasing the reliability of a software system by combining reactive and proactive approaches. The proposed approach employs rollback and restart to mask transient failures, and on-line software version change to remove faults from the software. A model for reliability analysis of a system employing the proposed approach is presented. The analysis shows that a substantial gain in reliability can be obtained by employing the proposed approach. A prototype system that incorporates the proposed approach is also described.
