Common knowledge and consistent simultaneous coordination

Summary. There is a very close relationship between common knowledge and simultaneity in synchronous distributed systems. The analysis of several well-known problems in terms of common knowledge has led to round-optimal protocols for these problems, including Reliable Broadcast, Distributed Consensus, and the Distributed Firing Squad problem. These problems require that the correct processors coordinate their actions in some way but place no restrictions on the behaviour of the faulty processors. In systems with benign processor failures, however, it is reasonable to require that the actions of a faulty processor be consistent with those of the correct processors, assuming it performs any action at all. We consider problems requiring consistent, simultaneous coordination. We then analyze these problems in terms of common knowledge in several failure models. The analysis of these stronger problems requires a stronger definition of common knowledge, and we study the relationship between these two definitions. In many cases, the two definitions are actually equivalent, and simple modifications of previous solutions yield round-optimal solutions to these problems. When the definitions differ, however, we show that such problems cannot be solved, even in failure-free executions.