The introduction of fault-tolerance in a hierarchical operating system

A general method for introducing fault-tolerance in a hierarchical operating system is presented. First, a hierarchically structured conventional (non-fault-tolerant) operating system is described. To transform it into a fault-tolerant system, each conventional machine is augmented with an Error Detection and Recovery (EDR) mechanism, yielding a corresponding fault-tolerant machine. From the standpoint of fault-tolerance, three types of machines are identified: physical, kernel, and process. The EDR mechanism makes a conventional machine fault-tolerant by transforming its conventional operations into fault-tolerant operations, and a set of EDR operations is defined to provide this transformation. A model for fault-tolerant operations is developed such that known fault-tolerance techniques (e.g. recovery block and N-version programming) can be represented as particular cases. The resulting general fault-tolerant operating system is a hierarchy of fault-tolerant machines, with the physical type machines at the bottom, the kernel type machines above them, and the process type machines at the top.
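To make the two techniques named above concrete, the following is a minimal illustrative sketch in Python. It is not the model developed in this work: the function names (`recovery_block`, `n_version`), the use of a dictionary as the machine state, and the example square-root operations are all assumptions introduced here for illustration. It shows only the commonly described behaviour of each technique: a recovery block tries alternates in sequence against an acceptance test and rolls the state back after each rejected attempt, while N-version programming runs every version and adjudicates by majority vote.

```python
import copy
from collections import Counter


class OperationFailed(Exception):
    """Raised when no acceptable result can be produced (hypothetical name)."""


def recovery_block(state, alternates, acceptance_test):
    """Try alternates in order; restore the state after each rejected attempt."""
    checkpoint = copy.deepcopy(state)            # recovery point
    for alternate in alternates:
        try:
            result = alternate(state)
            if acceptance_test(result):          # error detection
                return result
        except Exception:
            pass                                 # an exception counts as a detected error
        state.clear()                            # backward recovery:
        state.update(copy.deepcopy(checkpoint))  # roll back to the recovery point
    raise OperationFailed("all alternates were rejected")


def n_version(state, versions):
    """Run every version on a private copy of the state; return the majority result.

    Results must be hashable so that they can be counted.
    """
    results = []
    for version in versions:
        try:
            results.append(version(copy.deepcopy(state)))
        except Exception:
            continue                             # a crashed version casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) // 2:           # strict majority of all versions
            return value
    raise OperationFailed("no majority among versions")


# Hypothetical conventional operations used only to exercise the sketch.
def faulty_sqrt(state):
    state["x"] = -1.0                            # corrupts the state, then returns garbage
    return -1.0


def good_sqrt(state):
    return state["x"] ** 0.5
```

In this sketch, calling `recovery_block({"x": 9.0}, [faulty_sqrt, good_sqrt], lambda r: r >= 0)` rejects the first alternate, restores the state, and then accepts the second. Both functions share the same outline (run one or more variants of a conventional operation, detect errors, recover or adjudicate), which is the sense in which a single model of a fault-tolerant operation could cover both as particular cases.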