Multitolerance in Distributed Reset

A reset of a distributed system is safe if it does not complete "prematurely," i.e., without having reset some process in the system. Safe resets are possible in the presence of certain faults, such as process fail-stops and repairs, but are not always possible in the presence of more general faults, such as arbitrary transients. In this paper, we design a bounded-memory distributed-reset program that possesses two tolerances: (1) in the presence of fail-stops and repairs, it always executes resets safely, and (2) in the presence of a finite number of transient faults, it eventually executes resets safely. Designing this multitolerance in the reset program introduces the novel concern of designing a safety detector that is itself multitolerant. A broad application of our multitolerant safety detector is to make any total program likewise multitolerant.
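The safety condition stated above can be illustrated with a small centralized sketch. This is not the distributed program designed in the paper; it only records which processes are up and which have been reset, and checks whether completing the reset now would be premature. The class `ResetWave`, its method names, and the modeling choices (a fail-stopped process leaves the system, and a repaired process rejoins without having been reset) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ResetWave:
    """Hypothetical bookkeeping for one reset (illustrative only)."""
    up: Set[str]                                    # processes currently in the system
    reset_done: Set[str] = field(default_factory=set)

    def mark_reset(self, p: str) -> None:
        # Record that an up process has been reset; ignore failed processes.
        if p in self.up:
            self.reset_done.add(p)

    def fail_stop(self, p: str) -> None:
        # Assumption: a fail-stopped process is no longer part of the system.
        self.up.discard(p)
        self.reset_done.discard(p)

    def repair(self, p: str) -> None:
        # Assumption: a repaired process rejoins and still needs to be reset.
        self.up.add(p)

    def completion_is_safe(self) -> bool:
        # Completing is safe only if every process in the system was reset.
        return self.up <= self.reset_done
```

In this sketch, a reset that has reached only two of three up processes is not safe to complete; it becomes safe if the third process fail-stops, and unsafe again if that process is repaired before being reset.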
