Error Recovery in Multi-Version Software

Abstract: In multi-version software, design faults in individual versions are tolerated by comparing the results of several diverse versions. Since every version is likely to contain some design faults, recovery and reconfiguration mechanisms are needed to restore versions as they fail. Community Error Recovery, an error recovery algorithm for multi-version software, has been designed and implemented on the UCLA DEDIX system. It consists of two levels: the cc-point level is responsible for local error recovery, and the recovery point level is responsible for global error recovery. Failed versions are recovered by injecting "consensus" data values supplied by the decision algorithm of the multi-version software (MVS) system. Providing two levels minimizes both the disturbance to the system and the restrictions on the implementation of diverse versions. Markov models have been developed to evaluate the reliability of multi-version software running on DEDIX with and without Community Error Recovery.
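
The local (cc-point) recovery idea described above can be illustrated with a minimal sketch. It is not the DEDIX implementation: the function names `consensus` and `cc_point_recover` are hypothetical, and the decision algorithm is assumed here to be an exact strict-majority vote over one value per version, which may differ from the decision algorithm actually used.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def consensus(values: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value held by a strict majority of versions, else None.

    Assumption: exact-match majority voting stands in for the decision
    algorithm; the real system may vote differently.
    """
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def cc_point_recover(states: List[Hashable]) -> Optional[List[int]]:
    """Inject the consensus value into every disagreeing version.

    `states` holds one value per version at a comparison point and is
    updated in place. Returns the indices of the versions that were
    overwritten, or None when no consensus exists and local recovery
    cannot be attempted.
    """
    agreed = consensus(states)
    if agreed is None:
        return None
    recovered = [i for i, value in enumerate(states) if value != agreed]
    for i in recovered:
        states[i] = agreed  # overwrite the failed version's value
    return recovered
```

For example, with three versions holding `[42, 42, 17]`, the call overwrites the third version's value with 42 and reports index 2 as recovered. With `[1, 2, 3]` there is no majority, so the values are left untouched and the caller would have to fall back on some other mechanism.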
