Optimal time randomized consensus—making resilient algorithms fast in practice

In practice, the design of distributed systems is often geared towards optimizing the time complexity of algorithms in "normal" executions, i.e., ones in which at most a small number of failures occur, while at the same time building in safety provisions to protect against many failures. In this paper we present an optimally fast and highly resilient shared-memory randomized consensus algorithm that runs in only O(log n) expected time if only a small number of failures occur, and takes at most polynomial expected time for any number of failures. Every previously known resilient algorithm required polynomial expected time even if no faults occurred. Using the novel consensus algorithm, we show a method for speeding up resilient algorithms: for any decision problem on n processors, given a highly resilient algorithm as a black box, it modularly generates an algorithm with the same strong properties that runs in only O(log n) expected time in executions where no failures occur.

1 Introduction

1.1 Motivation

This paper addresses the issue of designing highly resilient algorithms that perform optimally when only a small number of failures occur. These algorithms can be viewed as bridging the gap between the theoretical goal of having an algorithm with good running time even when the system exhibits extremely pathological behavior, and the practical goal (cf. [19]) of having an algorithm that runs optimally on "normal executions," namely, ones in which no failures or only a small number of failures occur. There has recently been a growing interest in devising algorithms that can be proven to have such properties [7, 11, 13, 22, 16]. This approach was introduced in the context of asynchronous shared-memory algorithms by Attiya, Lynch and Shavit [7].
The consensus problem for asynchronous shared-memory systems (defined below) provides a paradigmatic illustration of the problem: for reliable systems there is a trivial algorithm that runs in constant time, but there is provably no deterministic algorithm that is resilient to even a single faulty process [17].

¹ [11, 13, 22, 16] treat it in the context of synchronous message-passing systems.
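To make the problem concrete: in binary consensus, each of n processes starts with a preference in {0, 1}, and all non-faulty processes must eventually decide on the same value, which must be some process's input. The sketch below is an illustrative lock-step simulation of the classic "agree or flip a coin" idea behind randomized consensus; it is not the paper's algorithm, and the function name and round structure are my own assumptions.

```python
import random

def randomized_consensus(prefs, seed=0):
    """Toy lock-step randomized binary consensus (illustrative only).

    Each round, every process posts its preference to a shared board
    and reads all preferences. If it sees a single value, it decides;
    otherwise it adopts the outcome of a local coin flip. Termination
    holds with probability 1 because eventually all coins agree.
    """
    rng = random.Random(seed)
    n = len(prefs)
    prefs = list(prefs)
    decided = [None] * n
    while any(d is None for d in decided):
        board = list(prefs)        # snapshot of the shared registers
        seen = set(board)
        for i in range(n):
            if decided[i] is not None:
                continue
            if len(seen) == 1:
                decided[i] = board[i]         # unanimity: decide
            else:
                prefs[i] = rng.randint(0, 1)  # disagreement: flip
    return decided
```

In a genuinely asynchronous model the local coin admits adversarial schedules with exponential expected time, which is why the literature (e.g. [1, 9, 11]) develops shared-coin constructions; the lock-step simulation above only illustrates the interface and the agreement/validity conditions.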

[1] James Aspnes et al. Time- and space-efficient randomized consensus. PODC '90, 1990.

[2] Nancy A. Lynch et al. On Describing the Behavior and Implementation of Distributed Systems. Theor. Comput. Sci., 1979.

[3] Nancy A. Lynch et al. Impossibility of distributed consensus with one faulty process. PODS '83, 1983.

[4] Danny Dolev et al. On the minimal synchronism needed for distributed consensus. 24th Annual Symposium on Foundations of Computer Science (FOCS 1983), 1983.

[5] Nancy A. Lynch et al. Efficiency of Synchronous Versus Asynchronous Distributed Systems. J. ACM, 1983.

[6] Butler W. Lampson et al. Hints for Computer System Design. IEEE Software, 1983.

[7] Serge A. Plotkin. Sticky bits and universality of consensus. PODC '89, 1989.

[8] James H. Anderson et al. The Virtue of Patience: Concurrent Programming with and Without Waiting. 1990.

[9] Nir Shavit et al. Bounded polynomial randomized consensus. PODC '89, 1989.

[10] Brian A. Coan et al. Simultaneity Is Harder than Agreement. Inf. Comput., 1991.

[11] Maurice Herlihy et al. Fast Randomized Consensus Using Shared Memory. J. Algorithms, 1990.

[12] Amos Israeli et al. On processor coordination using asynchronous hardware. PODC '87, 1987.

[13] Karl R. Abrahamson. On achieving consensus using a shared memory. PODC '88, 1988.

[14] Soma Chaudhuri et al. Agreement is harder than consensus: set consensus problems in totally asynchronous systems. PODC '90, 1990.

[15] Michael O. Rabin et al. Randomized Byzantine generals. 24th Annual Symposium on Foundations of Computer Science (FOCS 1983), 1983.

[16] Yoram Moses et al. Knowledge and Common Knowledge in a Byzantine Environment I: Crash Failures. TARK, 1986.

[17] Nancy A. Lynch et al. Impossibility of distributed consensus with one faulty process. J. ACM, 1985.