Randomization can be a healer: consensus with dynamic omission failures

Wireless ad-hoc networks are increasingly used in diverse contexts, ranging from casual meetings to disaster recovery operations. A promising approach is to model these networks as distributed systems prone to dynamic communication failures. This model captures transitory disconnections caused by phenomena such as interference and collisions, and permits efficient use of the wireless broadcast medium. The model, however, is bound by the impossibility result of Santoro and Widmayer, which states that, even with strong synchrony assumptions, there is no deterministic solution to any non-trivial form of agreement if n − 1 or more messages can be lost per communication round in a system with n processes. In this paper we propose a novel way to circumvent this impossibility result by employing randomization. We present a consensus protocol that ensures safety in the presence of an unrestricted number of omission faults and guarantees progress in rounds where such faults are bounded by $f \leq \lceil \frac{n}{2} \rceil (n - k) + k - 2$, where k is the number of processes required to decide, eventually assuring termination with probability 1.
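To make the progress bound concrete, the following minimal sketch evaluates the right-hand side of the inequality, ⌈n/2⌉(n − k) + k − 2, for a few system sizes and prints it next to the n − 1 loss threshold from the Santoro and Widmayer result. The function names `progress_bound` and `deterministic_threshold` are hypothetical, and the check 1 ≤ k ≤ n is an assumption made for this illustration; the abstract does not state the admissible range of k.

```python
def progress_bound(n: int, k: int) -> int:
    """Right-hand side of the bound from the abstract:
    ceil(n / 2) * (n - k) + k - 2.

    n: number of processes; k: number of processes required to decide.
    The range check on k is an assumption of this sketch, not a claim
    taken from the paper.
    """
    if not 1 <= k <= n:
        raise ValueError("expected 1 <= k <= n")
    half_up = (n + 1) // 2  # integer form of ceil(n / 2)
    return half_up * (n - k) + k - 2


def deterministic_threshold(n: int) -> int:
    """Per-round message losses at which the Santoro-Widmayer
    impossibility result applies: n - 1."""
    return n - 1


if __name__ == "__main__":
    # Illustrative (n, k) pairs chosen for this sketch only.
    for n, k in [(4, 3), (4, 4), (10, 6)]:
        print(f"n={n:2d} k={k:2d}  "
              f"progress bound f <= {progress_bound(n, k):2d}  "
              f"deterministic threshold = {deterministic_threshold(n)}")
```

For example, with n = 10 and k = 6 the expression gives 5 · 4 + 6 − 2 = 24 omissions per round, whereas the deterministic impossibility already applies at 9 lost messages per round.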

[1]  Nicola Santoro,et al.  Time is Not a Healer , 1989, STACS.

[2]  Michael J. Fischer,et al.  The Consensus Problem in Unreliable Distributed Systems (A Brief Survey) , 1983, FCT.

[3]  Miguel Correia,et al.  Solving vector consensus with a wormhole , 2005, IEEE Transactions on Parallel and Distributed Systems.

[4]  Michael Ben-Or,et al.  Another advantage of free choice (Extended Abstract): Completely asynchronous agreement protocols , 1983, PODC '83.

[5]  Roy Friedman,et al.  Failure detectors in omission failure environments , 1997, PODC '97.

[6]  Miguel Correia,et al.  Experimental Comparison of Local and Shared Coin Randomized Consensus Protocols , 2006, 2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06).

[7]  E. A. Akkoyunlu,et al.  Some constraints and tradeoffs in the design of network communications , 1975, SOSP.

[8]  Nancy A. Lynch,et al.  Impossibility of distributed consensus with one faulty process , 1985, JACM.

[9]  André Schiper,et al.  Tolerating corrupted communication , 2007, PODC '07.

[10]  Michel Raynal,et al.  A note on a simple equivalence between round-based synchronous and asynchronous models , 2005, 11th Pacific Rim International Symposium on Dependable Computing (PRDC'05).

[11]  Sam Toueg,et al.  Unreliable failure detectors for reliable distributed systems , 1996, JACM.

[12]  Sam Toueg,et al.  Distributed agreement in the presence of processor and communication faults , 1986, IEEE Transactions on Software Engineering.

[13]  Nancy A. Lynch,et al.  Consensus and collision detectors in radio networks , 2008, Distributed Computing.

[14]  Danny Dolev,et al.  On the minimal synchronism needed for distributed consensus , 1983, 24th Annual Symposium on Foundations of Computer Science (sfcs 1983).

[15]  Leslie Lamport,et al.  Reaching Agreement in the Presence of Faults , 1980, JACM.

[16]  André Schiper,et al.  The Heard-Of model: computing in distributed systems with benign faults , 2009, Distributed Computing.

[17]  Marcos K. Aguilera,et al.  Failure detection and consensus in the crash-recovery model , 1998, Distributed Computing.

[18]  Ran Canetti,et al.  Fast asynchronous Byzantine agreement with optimal resilience , 1993, STOC.

[19]  Miguel Correia,et al.  RITAS: Services for Randomized Intrusion Tolerance , 2011, IEEE Transactions on Dependable and Secure Computing.

[20]  Achour Mostéfaoui,et al.  Consensus in asynchronous systems where processes can crash and recover , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems.

[21]  André Schiper,et al.  Consensus in the Crash-Recover Model , 1997 .

[22]  Nicola Santoro,et al.  Agreement in synchronous networks with ubiquitous faults , 2007, Theor. Comput. Sci.

[24]  Leslie Lamport.  Lower bounds for asynchronous consensus , 2003 .

[25]  Michael O. Rabin,et al.  Randomized byzantine generals , 1983, 24th Annual Symposium on Foundations of Computer Science (sfcs 1983).

[26]  Victor Shoup,et al.  Random Oracles in Constantinople: Practical Asynchronous Byzantine Agreement Using Cryptography , 2000, Journal of Cryptology.

[27]  Idit Keidar,et al.  Impossibility Results and Lower Bounds for Consensus under Link Failures , 2008, SIAM J. Comput.

[28]  Seif Haridi,et al.  Distributed Algorithms , 1992, Lecture Notes in Computer Science.

[29]  Gabriel Bracha,et al.  An asynchronous [(n - 1)/3]-resilient consensus protocol , 1984, PODC '84.

[30]  Nancy A. Lynch,et al.  A Tradeoff Between Safety and Liveness for Randomized Coordinated Attack , 1996, Inf. Comput.

[31]  Jim Gray,et al.  Notes on Data Base Operating Systems , 1978, Advanced Course: Operating Systems.

[32]  Leslie Lamport,et al.  The Byzantine Generals Problem , 1982, TOPLAS.