Software Fault Tolerance in Safety-Critical Applications

Software fault tolerance has primarily been aimed at increasing overall software reliability. Unfortunately, it is impossible to provide general techniques that tolerate all faults with very high confidence, and this paper presents some of the available experimental evidence on this point. In some situations, however, a more limited form of fault tolerance may be all that is needed: the program must be able to prevent unsafe states (but not necessarily all incorrect states), or to detect them and recover to a safe (but not necessarily correct) state. This approach is application-specific: the fault-tolerance facilities are designed for the particular application. The paper briefly describes how this can be accomplished. Although this approach requires more specific analysis of the problem than the more general ones, it has the advantage of allowing partial verification of the adequacy of the fault tolerance used (e.g., it is possible to show that certain hazardous states cannot be caused by software faults), and it will therefore aid in certifying and licensing software that can potentially have catastrophic consequences. That is, the approach provides greater confidence about a more limited goal than do more general approaches. These techniques can also be used to tailor more general fault-tolerance techniques, such as recovery blocks, and to aid in writing acceptance tests that will ensure safety. Even with these techniques, it may not be possible to build systems with very low acceptable risk using software components.
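
As a rough illustration of a recovery block whose acceptance test checks safety rather than correctness, consider the minimal Python sketch below. It is not taken from the paper: the heater-setpoint scenario, the threshold SAFE_MAX_TEMP_C, the fallback SAFE_FALLBACK, and the functions is_safe, primary_controller, backup_controller, and recovery_block are all hypothetical names chosen for the example. The alternates are assumed to be pure functions, so the state restoration that a full recovery block performs before trying the next alternate is omitted.

```python
from typing import Callable, Sequence

# Hypothetical application-specific hazard limit and safe fallback command.
SAFE_MAX_TEMP_C = 80.0   # setpoints above this are assumed to be hazardous
SAFE_FALLBACK = 0.0      # "heater off": safe, but not necessarily correct


def is_safe(setpoint: float) -> bool:
    """Acceptance test that checks safety only, not correctness.

    NaN fails both comparisons, so it is rejected as well.
    """
    return 0.0 <= setpoint <= SAFE_MAX_TEMP_C


def primary_controller(reading: float) -> float:
    """Hypothetical primary alternate with a deliberate fault (wrong gain)."""
    return reading * 10.0


def backup_controller(reading: float) -> float:
    """Hypothetical simpler alternate that clamps its own output."""
    return min(reading + 5.0, 75.0)


def recovery_block(
    alternates: Sequence[Callable[[float], float]],
    acceptance_test: Callable[[float], bool],
    safe_state: float,
    reading: float,
) -> float:
    """Try each alternate in order and return the first safe result.

    If every alternate raises or fails the acceptance test, fall back to
    the designated safe state instead of propagating an unsafe value.
    """
    for alternate in alternates:
        try:
            result = alternate(reading)
        except Exception:
            continue  # a crashing alternate is treated like a failed test
        if acceptance_test(result):
            return result
    return safe_state


if __name__ == "__main__":
    command = recovery_block(
        [primary_controller, backup_controller], is_safe, SAFE_FALLBACK, 20.0
    )
    print(command)
```

In this sketch the faulty primary alternate proposes an unsafe setpoint, the acceptance test rejects it, and the backup alternate's result is used. If no alternate passes, the block returns the safe fallback, which corresponds to recovering to a safe but not necessarily correct state.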