A formal approach to fault-tolerance in distributed real-time systems

In many distributed systems, real-time and fault-tolerance requirements are inseparably intertwined. On the one hand, time redundancy or timing information is often used to obtain a certain degree of fault-tolerance. A typical example is the use of time-outs by communication protocols. On the other hand, systems with hard real-time requirements often also have several fault-tolerance requirements. A space shuttle that cannot adjust its course in time, due to a short failure in one of its processors, should not be lost forever. Another complication arises if one considers the functional behaviour of a system together with hard real-time and fault-tolerance requirements. Usually, there exists a trade-off between time, fault-tolerance and functionality within one system. This may be exploited to obtain graceful degradation, if it is sufficient to guarantee the response time of only some of the services of a system in the presence of faults.
From the above it follows that the design of distributed systems with hard real-time and fault-tolerance requirements is a difficult task, leading to complicated and sometimes opaque designs. This calls for the use of formal methods during the design and verification steps of the development stage, especially because these systems are often embedded in environments in which an error can have disastrous consequences.
Several methods for verifying fault-tolerance properties have already been proposed, e.g. in [Cris85], [ScSc83], [Schn86] and [JMS 87]. Methods to verify distributed real-time systems are presented in e.g. [ZwLe85], [ShLa87] and [ItoWi89]. However, none of these methods deals with fault-tolerance and real-time properties simultaneously. Inspired by the observations in the preceding paragraphs, our research involves the construction of a proof system for fault-tolerant distributed real-time systems.
We aim at a compositional proof system, which means that the correctness of a program can be verified by looking at the specifications of its component parts, without referring to their internal structure. This enables the verification of large systems in a stepwise manner by verifying smaller subsystems. Currently we are working on a compositional system to derive real-time and fault-tolerance properties of distributed programs that communicate via synchronous message passing. In this proof theory one can derive formulae of the form S sat φ, where S is a program and φ is an assertion. Below we briefly describe the programming language, the assertion language and the proof system. A simple example of a Voter system illustrates our ideas.