Randomization Can Be a Healer: Consensus with Dynamic Omission Failures

Wireless ad-hoc networks are increasingly used in diverse contexts, ranging from casual meetings to disaster recovery operations. A promising approach is to model these networks as distributed systems prone to dynamic communication failures. This model captures transitory disconnections caused by phenomena such as interference and collisions, and permits efficient use of the wireless broadcast medium. The model, however, is bound by the impossibility result of Santoro and Widmayer, which states that, even under strong synchrony assumptions, there is no deterministic solution to any non-trivial form of agreement if n - 1 or more messages can be lost per communication round in a system with n processes. In this paper we propose a novel way to circumvent this impossibility result by employing randomization. We present a consensus protocol that ensures safety in the presence of an unrestricted number of omission faults and guarantees progress in rounds where such faults are bounded by f ≤ ⌈n/2⌉(n - k) + k - 2, where k is the number of processes required to decide. The protocol thus eventually terminates with probability 1.
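To make the progress bound concrete, the following minimal sketch evaluates f = ⌈n/2⌉(n - k) + k - 2 for a few system sizes and prints it next to the n - 1 message-loss threshold of the Santoro and Widmayer result. The function name `max_omissions_for_progress` and the range check 1 ≤ k ≤ n are illustrative assumptions, not part of the paper; the abstract does not state which values of k the protocol admits.

```python
def max_omissions_for_progress(n: int, k: int) -> int:
    """Evaluate the abstract's bound f = ceil(n/2) * (n - k) + k - 2.

    n: number of processes; k: number of processes required to decide.
    The check 1 <= k <= n is an assumption made for this sketch only.
    """
    if n < 1 or not (1 <= k <= n):
        raise ValueError("expected n >= 1 and 1 <= k <= n")
    ceil_half_n = (n + 1) // 2  # integer form of ceil(n / 2)
    return ceil_half_n * (n - k) + k - 2


if __name__ == "__main__":
    for n, k in [(4, 4), (4, 3), (10, 10), (10, 6)]:
        f = max_omissions_for_progress(n, k)
        print(f"n={n:2d} k={k:2d} bound f={f:2d}  (n - 1 = {n - 1})")
```

When every process must decide (k = n), the expression reduces to n - 2, one fewer than the n - 1 losses per round at which deterministic agreement becomes impossible. Lowering k raises the bound, since each process that need not decide contributes ⌈n/2⌉ - 1 further tolerated omissions.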

[1]  Victor Shoup,et al.  Random Oracles in Constantinople: Practical Asynchronous Byzantine Agreement Using Cryptography , 2000, Journal of Cryptology.

[2]  Nancy A. Lynch,et al.  Distributed Algorithms , 1992, Lecture Notes in Computer Science.

[3]  Michael J. Fischer,et al.  The Consensus Problem in Unreliable Distributed Systems (A Brief Survey) , 1983, FCT.

[4]  Nancy A. Lynch,et al.  Consensus and collision detectors in radio networks , 2008, Distributed Computing.

[5]  Miguel Correia,et al.  Experimental Comparison of Local and Shared Coin Randomized Consensus Protocols , 2006, 2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06).

[6]  Gabriel Bracha,et al.  An asynchronous ⌊(n - 1)/3⌋-resilient consensus protocol , 1984, PODC '84.

[7]  Miguel Correia,et al.  Solving vector consensus with a wormhole , 2005, IEEE Transactions on Parallel and Distributed Systems.

[8]  Nicola Santoro,et al.  Time is Not a Healer , 1989, STACS.

[9]  Michel Raynal,et al.  A note on a simple equivalence between round-based synchronous and asynchronous models , 2005, 11th Pacific Rim International Symposium on Dependable Computing (PRDC'05).

[10]  Nancy A. Lynch,et al.  A Tradeoff Between Safety and Liveness for Randomized Coordinated Attack , 1996, Inf. Comput.

[11]  Jim Gray,et al.  Notes on Data Base Operating Systems , 1978, Advanced Course: Operating Systems.

[12]  André Schiper,et al.  The Heard-Of model: computing in distributed systems with benign faults , 2009, Distributed Computing.

[13]  Achour Mostéfaoui,et al.  Consensus in asynchronous systems where processes can crash and recover , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).

[14]  Michael Ben-Or,et al.  Another advantage of free choice (Extended Abstract): Completely asynchronous agreement protocols , 1983, PODC '83.

[15]  Nicola Santoro,et al.  Agreement in synchronous networks with ubiquitous faults , 2007, Theor. Comput. Sci..

[16]  Sam Toueg,et al.  Distributed agreement in the presence of processor and communication faults , 1986, IEEE Transactions on Software Engineering.

[17]  Leslie Lamport,et al.  The Byzantine Generals Problem , 1982, TOPLAS.

[18]  Danny Dolev,et al.  On the minimal synchronism needed for distributed consensus , 1983, 24th Annual Symposium on Foundations of Computer Science (sfcs 1983).

[19]  Idit Keidar,et al.  Impossibility Results and Lower Bounds for Consensus under Link Failures , 2008, SIAM J. Comput..

[20]  Leslie Lamport,et al.  Lower bounds for asynchronous consensus , 2006, Distributed Computing.

[21]  Roy Friedman,et al.  Failure detectors in omission failure environments , 1997, PODC '97.

[22]  Marcos K. Aguilera,et al.  Failure detection and consensus in the crash-recovery model , 2000, Distributed Computing.

[23]  André Schiper,et al.  Consensus in the Crash-Recover Model , 1997.

[24]  Michael O. Rabin,et al.  Randomized byzantine generals , 1983, 24th Annual Symposium on Foundations of Computer Science (sfcs 1983).

[25]  Leslie Lamport,et al.  Reaching Agreement in the Presence of Faults , 1980, JACM.

[26]  E. A. Akkoyunlu,et al.  Some constraints and tradeoffs in the design of network communications , 1975, SOSP.

[27]  André Schiper,et al.  Tolerating corrupted communication , 2007, PODC '07.

[29]  Ran Canetti,et al.  Fast asynchronous Byzantine agreement with optimal resilience , 1993, STOC.

[30]  L. Stockmeyer,et al.  On the minimal synchronism needed for distributed consensus , 1983.

[31]  Miguel Correia,et al.  RITAS: Services for Randomized Intrusion Tolerance , 2011, IEEE Transactions on Dependable and Secure Computing.

[32]  Nancy A. Lynch,et al.  Impossibility of distributed consensus with one faulty process , 1985, JACM.

[33]  Sam Toueg,et al.  Unreliable failure detectors for reliable distributed systems , 1996, JACM.