Simulating fail-stop in asynchronous distributed systems

The fail-stop failure model appears frequently in the distributed systems literature. However, in an asynchronous distributed system, the fail-stop model cannot be implemented. In particular, it is impossible to reliably detect crash failures in an asynchronous system. In this paper, we show that it is possible to specify and implement a failure model that is indistinguishable from the fail-stop model from the point of view of any process within an asynchronous system. We give necessary conditions for a failure model to be indistinguishable from the fail-stop model, and derive lower bounds on the amount of process replication needed to implement such a failure model. We present a simple one-round protocol for implementing one such failure model, which we call simulated fail-stop.<<ETX>>

[1]  David K. Gifford,et al.  Weighted voting for replicated data , 1979, SOSP '79.

[2]  Kenneth P. Birman,et al.  Understanding partitions and the 'no partition' assumption , 1993, 1993 4th Workshop on Future Trends of Distributed Computing Systems.

[3]  Yair Amir,et al.  Transis: A Communication Sub-system for High Availability , 1992 .

[4]  Yair Amir,et al.  Transis: a communication subsystem for high availability , 1992, [1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing.

[5]  K. Birman,et al.  Understanding Partitions and the \ No Partition " , 1993 .

[6]  Sam Toueg,et al.  The weakest failure detector for solving consensus , 1992, PODC '92.

[7]  Sam Toueg,et al.  Unreliable failure detectors for asynchronous systems (preliminary version) , 1991, PODC '91.

[8]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.

[9]  Sam Toueg,et al.  Unreliable Failure Detectors for Asynchronous Systems , 1991 .

[10]  Nancy A. Lynch,et al.  Impossibility of distributed consensus with one faulty process , 1985, JACM.

[11]  Fred Kröger,et al.  Temporal Logic of Programs , 1987, EATCS Monographs on Theoretical Computer Science.

[12]  Nancy A. Lynch,et al.  Impossibility of distributed consensus with one faulty process , 1983, PODS '83.

[13]  Dale Skeen Determining the last process to fail , 1983, PODS '83.

[14]  Keith Marzullo,et al.  Simulating fail-stop in asynchronous distributed systems , 1994, PODC '94.

[15]  Richard D. Schlichting,et al.  Fail-stop processors: an approach to designing fault-tolerant computing systems , 1983, TOCS.

[16]  Fred B. Schneider,et al.  Byzantine generals in action: implementing fail-stop processors , 1984, TOCS.

[17]  Kenneth P. Birman,et al.  Exploiting virtual synchrony in distributed systems , 1987, SOSP '87.

[18]  Kenneth P. Birman,et al.  Using process groups to implement failure detection in asynchronous environments , 1991, PODC '91.

[19]  Leslie Lamport,et al.  Time, clocks, and the ordering of events in a distributed system , 1978, CACM.

[20]  K. Mani Chandy,et al.  How processes learn , 1985, PODC '85.