Tolerance to unbounded Byzantine faults

An ideal approach to dealing with faults in large-scale distributed systems is to contain their effects as locally as possible and, additionally, to ensure some type of tolerance within each fault-affected locality. Existing results using this approach accommodate only limited faults (such as crashes) or assume that fault occurrence is bounded in space and/or time. In this paper, we define local tolerance with respect to arbitrary faults (such as Byzantine faults) whose occurrence may be unbounded in space and in time, and we explore when it is possible and when it is impossible. Our positive results include programs for graph coloring and dining philosophers, with proofs that the size of their tolerance locality is optimal. The type of tolerance achieved within fault-affected localities is self-stabilization. That is, starting from an arbitrary state of the distributed system, each non-faulty process eventually reaches a state from which it behaves correctly as long as the only faults that occur henceforth (regardless of their number) are outside the locality of this process.
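The notion of local tolerance described above can be illustrated with a small simulation. The sketch below is not the paper's program; it is a generic greedy recoloring rule, run under an assumed central-daemon, round-robin schedule on a hypothetical five-node path. The names `simulate`, `step_correct`, and `smallest_free_color`, the graph, and the Byzantine behavior (node 0 always copies its neighbor's color) are all assumptions made for illustration.

```python
def smallest_free_color(node, graph, color):
    """Return the smallest non-negative color not used by any neighbor."""
    used = {color[n] for n in graph[node]}
    c = 0
    while c in used:
        c += 1
    return c


def step_correct(node, graph, color):
    """Correct process: recolor only when in conflict with a neighbor.

    Returns True if the node changed its color.
    """
    if any(color[n] == color[node] for n in graph[node]):
        color[node] = smallest_free_color(node, graph, color)
        return True
    return False


def simulate(graph, color, byzantine, rounds):
    """Run a round-robin schedule, one process acting at a time.

    Returns, for each node, the last round in which it changed color
    (-1 if it never changed).
    """
    last_change = {v: -1 for v in graph}
    for r in range(rounds):
        for v in sorted(graph):
            if v in byzantine:
                # Assumed adversarial behavior: copy the first neighbor's
                # color to recreate a conflict whenever possible.
                target = color[graph[v][0]]
                if color[v] != target:
                    color[v] = target
                    last_change[v] = r
            elif step_correct(v, graph, color):
                last_change[v] = r
    return last_change


# Hypothetical path 0 - 1 - 2 - 3 - 4, with node 0 Byzantine.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
color = {v: 0 for v in graph}  # arbitrary initial state: every edge conflicts
last_change = simulate(graph, color, byzantine={0}, rounds=20)
```

In this run, node 1 (adjacent to the faulty node) keeps recoloring in every round, while nodes 2, 3, and 4 settle during the first round and never change again. A correct process only ever moves to a color that is free in its neighborhood, so under this schedule it never creates a new conflict; only the faulty node does, which keeps the disturbance within distance 1 in this example.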
