Self-Stabilizing Failure Detector Algorithms

This paper revisits the interconnection of self-stabilization and fault-tolerance. Self-stabilizing algorithms are able to recover from arbitrary system states, provided that from some point in time on no further faults occur. Fault-tolerance, on the other hand, refers to algorithms that cope with systems in which a bounded part of the system (e.g., at most f out of n processes) may fail permanently. In previous work [16] we considered the interconnection of these two paradigms, i.e., algorithms that recover from arbitrary states despite permanent faults. We have shown that in certain settings, problems such as failure detection cannot be solved. This paper presents ways to circumvent this impossibility result.

[1]  Martin Hutle.  An efficient failure detector for sparsely connected networks , 2004, Parallel and Distributed Computing and Networks.

[2]  Nancy A. Lynch,et al.  Impossibility of distributed consensus with one faulty process , 1985, JACM.

[3]  Sam Toueg,et al.  Unreliable failure detectors for reliable distributed systems , 1996, JACM.

[4]  Shlomi Dolev,et al.  Self Stabilization , 2004, J. Aerosp. Comput. Inf. Commun.

[5]  Gérard Le Lann,et al.  How to Implement a Time-Free Perfect Failure Detector in Partially Synchronous Systems , 2005.

[6]  Nancy A. Lynch,et al.  Impossibility of distributed consensus with one faulty process , 1983, PODS '83.

[7]  Nancy A. Lynch,et al.  Bounds on the time to reach agreement in the presence of timing uncertainty , 1991, STOC '91.

[8]  Danny Dolev,et al.  Linear Time Byzantine Self-Stabilizing Clock Synchronization , 2003, OPODIS.

[9]  Jennifer L. Welch,et al.  Self-Stabilizing Clock Synchronization in the Presence of Byzantine Faults (Preliminary Version) , 1995.

[10]  Josef Widder.  Distributed Computing in the Presence of Bounded Asynchrony , 2004.

[11]  Nancy A. Lynch,et al.  Consensus in the presence of partial synchrony , 1988, JACM.

[12]  Joffroy Beauquier,et al.  Fault-tolerance and self-stabilization: impossibility results and solutions using self-stabilizing failure detectors , 1997, Int. J. Syst. Sci..

[13]  Achour Mostéfaoui,et al.  Asynchronous implementation of failure detectors , 2003, International Conference on Dependable Systems and Networks (DSN 2003), Proceedings.

[14]  Vassos Hadzilacos,et al.  Tolerating Transient and Permanent Failures (Extended Abstract) , 1993, WDAG.

[15]  Danny Dolev,et al.  On the minimal synchronism needed for distributed consensus , 1983, 24th Annual Symposium on Foundations of Computer Science (sfcs 1983).

[16]  Mohamed G. Gouda,et al.  Stabilizing Communication Protocols , 1991, IEEE Trans. Computers.