This paper addresses an important aspect of the problem of designing fault-tolerant distributed computing systems. The concepts involved in "backward error recovery", i.e. restoring a system, or some part of a system, to a previous state which it is hoped or believed preceded the occurrence of any existing errors, are formalised and generalised so as to apply to concurrent, e.g. distributed, systems. Since a great deal of independence may exist between the activities in a distributed system, the system can be restored to a state that could have existed rather than to one that actually existed. The formalisation is based on what we term "Occurrence Graphs", which represent the cause-effect relationships between the events that occur when a system is operational and indicate the existing possibilities for state restoration. A protocol is presented which could be used in each node of a distributed computing system to provide system recoverability even in the face of multiple faults.
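The idea of restoring to a state that "could have existed" can be illustrated with a small sketch. It is not the paper's formalism or protocol: the graph encoding, the event names (a1, b2, ...) and the function events_to_undo are all assumptions made for illustration. The sketch treats a cause-effect graph as a mapping from each event to its direct effects, undoes a faulty event together with everything causally dependent on it, and keeps the rest.

```python
from collections import deque

def events_to_undo(effects, faulty):
    """Return the faulty events plus every event causally dependent on them.

    `effects` maps an event to the events it directly causes (hypothetical
    encoding of a cause-effect graph, assumed acyclic).
    """
    undone = set(faulty)
    queue = deque(faulty)
    while queue:
        event = queue.popleft()
        for successor in effects.get(event, ()):
            if successor not in undone:
                undone.add(successor)
                queue.append(successor)
    return undone

# Hypothetical system with three activities A, B and C.
# Edges point from cause to effect; a2 -> b2 models a message from A to B.
effects = {
    "a1": ["a2"],
    "a2": ["a3", "b2"],
    "b1": ["b2"],
    "b2": ["b3"],
    "c1": ["c2"],
}
all_events = {"a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2"}

undone = events_to_undo(effects, ["a2"])  # suppose a2 is found to be erroneous
restored = all_events - undone            # events that survive the restoration
```

In this example activity C is independent of the error, so it keeps c2 while A and B are rolled back to a1 and b1. The combined state (a1, b1, c2) need never have occurred at any single instant of the actual execution, yet it is causally consistent, since no surviving event depends on an undone one.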
[1] Brian Randell, et al. Consistent State Restoration in Distributed Systems, 1977.
[2] Gregor von Bochmann, et al. A Unified Method for the Specification and Verification of Protocols, 1977, IFIP Congress.
[3] Brian Randell. Reliable Computing Systems, 1978, Advanced Course: Operating Systems.
[4] David B. Lomet, et al. Process structuring, synchronization, and recovery using atomic actions, 1977, Language Design for Reliable Software.
[5] D. B. Lomet. Process structuring, synchronization, and recovery using atomic actions, 1977.
[6] Brian Randell, et al. System structure for software fault tolerance, 1975, IEEE Transactions on Software Engineering.