A foundation for fault tolerant components

Monolithic approaches to the design, implementation, verification and maintenance of systems tend to yield intractable systems, since behavior in one part of the system can have unexpected consequences elsewhere. Compositional approaches with locality are popular because a complex system is reduced to a composition of smaller, more manageable components: each component can be understood locally, and the properties of the system can be derived from the composition. At present, the compositional approach is suitable only for ideal (fault-free) components or for those that mask the effect of faults. If the effect of faults in a component is exposed, the system might no longer behave correctly. The principle of locality is then violated, since the effect of the fault can ripple through the entire system, forcing us to deal with the system monolithically. This study gives an approach to fault tolerance that is both compositional and local, and that accommodates tolerances other than masking. A component is a data abstraction that can be affected by faults: in the presence of faults, its behavior may change. To be tolerant, a component must refine (or implement) an ideal specification in the absence of faults and must refine a tolerance specification in the presence of faults. Tolerance specifications include masking (the fault effect is hidden), failsafe (safety is not violated) and nonmasking (the fault effect is exposed). Components can be composed: a component acting as a client can use different subcomponents, and its behavior changes accordingly. We gain locality of reasoning as follows: if a client using an ideal subcomponent specification meets its own specification, then the subcomponent specification can be replaced by any refinement of it, and the client will still meet its specification. This is done for both the ideal and the tolerance cases, giving an ideal system as some composition that refines its ideal specification in the absence of faults and refines its tolerance specification in the presence of faults. The main contributions are the use of refinement to define component fault tolerance, the locality that comes from designing systems against component specifications, and the recognition that ideal and tolerance specifications must be satisfied separately.
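
The two refinement obligations and the locality argument can be illustrated with a small sketch. It is not the formalism of the study: it assumes, for illustration only, that a specification is a finite set of allowed traces and that refinement is set inclusion. The names `refines`, `Component`, `is_tolerant`, `client`, the storage-cell traces and the failsafe specification are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Assumption: a behavior is a finite trace of events, and a specification
# is the set of traces it allows. Refinement is then set inclusion.
Trace = Tuple[str, ...]
Behaviors = FrozenSet[Trace]


def refines(impl: Behaviors, spec: Behaviors) -> bool:
    """impl refines spec if every trace of impl is allowed by spec."""
    return impl <= spec


@dataclass(frozen=True)
class Component:
    ideal: Behaviors   # traces possible in the absence of faults
    faulty: Behaviors  # traces possible in the presence of faults


def is_tolerant(c: Component, ideal_spec: Behaviors,
                tolerance_spec: Behaviors) -> bool:
    """Both obligations are checked separately, as the abstract requires."""
    return refines(c.ideal, ideal_spec) and refines(c.faulty, tolerance_spec)


# Hypothetical storage cell: write 1, then read it back.
GOOD: Trace = ("write 1", "read 1")
ERROR: Trace = ("write 1", "read error")   # fault detected, no wrong value
WRONG: Trace = ("write 1", "read 0")       # fault exposed as a wrong value

IDEAL_SPEC: Behaviors = frozenset({GOOD})
# Failsafe tolerance: a read may fail, but never returns a wrong value.
FAILSAFE_SPEC: Behaviors = frozenset({GOOD, ERROR})

checked_cell = Component(ideal=frozenset({GOOD}),
                         faulty=frozenset({GOOD, ERROR}))
unchecked_cell = Component(ideal=frozenset({GOOD}),
                           faulty=frozenset({GOOD, WRONG}))


def client(sub: Behaviors) -> Behaviors:
    """Hypothetical client: halts on a read error, otherwise continues.

    It maps each subcomponent trace to one client trace, so it is monotonic:
    a smaller set of subcomponent traces gives a smaller set of client traces.
    """
    return frozenset(
        t + (("halt",) if t[-1] == "read error" else ("continue",))
        for t in sub
    )


print(is_tolerant(checked_cell, IDEAL_SPEC, FAILSAFE_SPEC))
print(is_tolerant(unchecked_cell, IDEAL_SPEC, FAILSAFE_SPEC))
# Locality: the client was analysed against FAILSAFE_SPEC; swapping in any
# refinement of that specification keeps the client within the same bound.
print(refines(client(checked_cell.faulty), client(FAILSAFE_SPEC)))
```

In this toy model, locality follows from the client being monotonic with respect to set inclusion. The study's own notions of component, refinement and composition may differ from this simplification.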