The authors present a technique for maximizing the redundancy level of tasks and tolerating hardware faults by majority voting in a network of workstations. The idea is to dynamically compute the number of copies allocated to each task according to the number of sites and the tasks' criticality parameters. This technique makes maximum use of the resources available in the distributed system: it reduces resource idleness and increases task redundancy. Fault dormancy and error latency are thereby reduced. The technique, called the saturation technique, is compared with similar approaches. A detailed description is given, together with simulation results showing the advantages and the cost of implementing the saturation technique. The authors outline the structure of a suitable distributed operating system, including the execution model and task designation, to support the execution of multiple copies of tasks. The fault assumptions are discussed, and the different phases of a distributed scheduler are detailed.
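The idea of deriving the copy count from the number of sites and the task criticalities, then masking faults by majority voting, can be illustrated with a minimal sketch. The abstract does not give the actual allocation rule, so everything below is an assumption made for illustration: the names `saturate` and `majority_vote`, the one-copy-per-site capacity model, and the greedy rule that hands spare sites to the task with the highest criticality per existing copy.

```python
from collections import Counter


def saturate(criticality, n_sites):
    """Return a copy count per task that uses as many sites as possible.

    Assumed rule (not taken from the paper): every task starts with one
    copy, then spare sites are handed out two at a time to the task whose
    criticality per existing copy is highest. Adding copies in pairs keeps
    every count odd, which suits majority voting.
    """
    if not criticality:
        return {}
    if len(criticality) > n_sites:
        raise ValueError("not enough sites for one copy per task")

    copies = {task: 1 for task in criticality}
    free = n_sites - len(copies)  # assumption: each site runs one copy
    while free >= 2:
        # Pick the task that is currently least protected for its weight.
        target = max(copies, key=lambda t: criticality[t] / copies[t])
        copies[target] += 2
        free -= 2
    return copies


def majority_vote(results):
    """Return the value produced by a strict majority of the copies."""
    value, count = Counter(results).most_common(1)[0]
    if 2 * count > len(results):
        return value
    raise ValueError("no majority among the copies")


allocation = saturate({"nav": 4, "log": 1}, n_sites=7)
voted = majority_vote([42, 42, 7])  # one faulty copy is outvoted
```

With seven sites and these hypothetical weights, the sketch assigns five copies to `nav` and one to `log`, leaving a single site idle because copies are added in pairs. A real scheduler would also have to place the copies on distinct sites and recompute the allocation as sites fail or recover, which this sketch does not attempt.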