Assessing the reliability impacts of software fault-tolerance mechanisms

Telecommunications systems must meet stringent reliability requirements for system availability and defect rate. High software reliability is achieved through a combination of approaches: fault avoidance, fault removal, and the implementation of fault-tolerance mechanisms. This paper focuses on software fault-tolerance mechanisms and analyzes their impact on software reliability. Based on field data on how frequently some of these mechanisms are invoked, we present an escalating recovery model for predicting their impact on lost calls. The key parameters of the model are the software fault recovery coverage factor, the proportion of successful recoveries at each level, and the number of calls lost at each recovery level. The output of the model is the distribution and the average of the number of lost calls per software error. The applicability of this model to systems with high reliability has been validated; its applicability to less reliable systems is an area for future work.
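
To make the structure of such a model concrete, the following is a minimal sketch of one way an escalating recovery calculation could be set up. It is not the paper's formulation. It assumes that a covered error enters recovery at the lowest level, that each level either succeeds or escalates to the next, that call losses accumulate across the levels attempted, that the final level always succeeds, and that an uncovered error incurs a fixed loss. The function names, the parameter names, and all numeric values are hypothetical and are not field data.

```python
from collections import defaultdict


def lost_call_distribution(coverage, success_probs, calls_lost, uncovered_loss):
    """Return {lost_calls: probability} for a single software error.

    Illustrative assumptions (not taken from the paper):
      coverage         probability that the error is handled by the
                       recovery mechanisms at all
      success_probs[i] probability that level i succeeds, given that
                       recovery escalated to level i; the last level is
                       assumed to always succeed
      calls_lost[i]    calls lost by attempting level i; losses accumulate
                       as recovery escalates
      uncovered_loss   calls lost when the error is not covered
    """
    if len(success_probs) != len(calls_lost):
        raise ValueError("success_probs and calls_lost must have equal length")
    if not success_probs or success_probs[-1] != 1.0:
        raise ValueError("the last recovery level must succeed with probability 1.0")

    dist = defaultdict(float)
    dist[uncovered_loss] += 1.0 - coverage

    reach = coverage   # probability of reaching the current level
    cumulative = 0     # calls lost so far along the escalation path
    for p_success, lost in zip(success_probs, calls_lost):
        cumulative += lost
        dist[cumulative] += reach * p_success
        reach *= 1.0 - p_success
    return dict(dist)


def mean_lost_calls(dist):
    """Average number of lost calls per software error."""
    return sum(calls * prob for calls, prob in dist.items())


if __name__ == "__main__":
    # Hypothetical three-level escalation; all values are made up.
    dist = lost_call_distribution(
        coverage=0.9,
        success_probs=[0.8, 0.5, 1.0],
        calls_lost=[0, 10, 100],
        uncovered_loss=1000,
    )
    for calls, prob in sorted(dist.items()):
        print(f"{calls:>5} lost calls: probability {prob:.3f}")
    print(f"mean lost calls per error: {mean_lost_calls(dist):.1f}")
```

Under these assumptions the distribution has one outcome per recovery level plus one for uncovered errors, and the average is the probability-weighted sum of the cumulative losses. A model fitted to field data would replace the made-up parameters with measured invocation frequencies and per-level losses.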