The Need for Realistic Failure Models in Protocol Design

Fault-tolerant algorithms are often designed under the assumption that no more than t out of n processes or components can fail. This approach was pioneered by the SIFT project [22] and has since been widely applied to the design of algorithms for real critical systems, e.g., air traffic control [6], and for other highly available services such as file servers [15]. The assumption is so common that most fault-tolerant algorithms found in the literature today adopt it without any justification (e.g., [14, 19]). Its popularity stems from the simple abstraction that the t out of n model provides for reasoning about failure-prone environments and system reliability. With this assumption it is fairly easy to design and verify protocols, and also to express lower and upper bounds.

Unfortunately, when adopting this assumption, we often forget how it relates to system reliability. In real systems, reliability is typically expressed as the probability that the system meets its specification. A more refined model expresses survivability as a range of probabilities that the system meets each of a number of different degraded specifications [13]. When characterizing the possible failures as t out of n, one is implicitly expressing an upper bound on system reliability: the probability that t or fewer failures occur throughout the time the algorithm runs.

Forgetting the relation between the t out of n assumption and system reliability can lead to a foolish design. For example, some consensus protocols (e.g., [7]) are structured so that, once t failures have been detected, the protocol proceeds under the assumption that no further failures will occur. A more sensible design would have the protocol become more cautious under these circumstances: if t failures have occurred, then the failure analysis used to compute t may itself have been faulty, and so further failures may be more likely. Such foolishness can also occur in a more subtle manner.

In this position paper we remind ourselves of the relation between failure assumptions and system reliability. We bring to the surface some implicit assumptions made by the t out of n failure characterization and discuss their limitations. We briefly discuss failure assumptions that address some of these limitations. We argue that not all the shortcomings of this model have adequate solutions yet.
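
To make the link between t out of n and reliability concrete, the following is a minimal sketch of the probability that at most t of n processes fail during a run. It assumes that every process fails independently and with the same probability p, which is an assumption introduced here for illustration; it is exactly the kind of implicit assumption that may not hold in a real system. The function name and parameters are hypothetical.

```python
from math import comb


def prob_at_most_t_failures(n: int, t: int, p: float) -> float:
    """Probability that at most t of n processes fail during one run.

    Assumption (for this sketch only): each process fails independently
    with the same probability p, so the failure count is binomial.
    """
    if not 0 <= t <= n:
        raise ValueError("require 0 <= t <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("require 0 <= p <= 1")
    # Sum the binomial probabilities of exactly k failures for k = 0..t.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(t + 1))


if __name__ == "__main__":
    # Example: n = 4 processes, tolerate t = 1 failure, p = 0.01 per run.
    print(prob_at_most_t_failures(4, 1, 0.01))
```

For n = 4, t = 1, and p = 0.01 the result is roughly 0.9994. If failures are correlated, the true probability of exceeding t failures can be much higher than this independent-failure estimate suggests.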

[1] Idit Keidar, et al. Availability study of dynamic voting algorithms, 2001, Proceedings 21st International Conference on Distributed Computing Systems.

[2] Özalp Babaoglu, et al. On the reliability of consensus-based fault-tolerant distributed computing systems, 1987, TOCS.

[3] Flaviu Cristian, et al. Fault-tolerance in air traffic control systems, 1996, TOCS.

[4] William H. Sanders, et al. Probabilistic verification of a synchronous round-based consensus protocol, 1997, Proceedings of SRDS'97: 16th IEEE Symposium on Reliable Distributed Systems.

[5] J. Goldberg, et al. SIFT: Design and analysis of a fault-tolerant computer for aircraft control, 1978, Proceedings of the IEEE.

[6] Michael Dahlin, et al. End-to-end WAN service availability, 2001, TNET.

[7] Leslie Lamport, et al. The part-time parliament, 1998, TOCS.

[8] Danny Dolev, et al. Early stopping in Byzantine agreement, 1990, JACM.

[9] Dhiraj K. Pradhan, et al. Consensus With Dual Failure Modes, 1991, IEEE Trans. Parallel Distributed Syst.

[10] Donald F. Towsley, et al. Measurement and modelling of the temporal dependence in packet loss, 1999, IEEE INFOCOM '99: Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies.

[11] Sushil Jajodia, et al. Dynamic voting algorithms for maintaining the consistency of a replicated database, 1990, TODS.

[12] J. Rushby, et al. Formal verification of an interactive consistency algorithm for the Draper FTP architecture under a hybrid fault model, 1994, Proceedings of COMPASS'94 - 1994 IEEE 9th Annual Conference on Computer Assurance.

[13] Ing-Ray Chen, et al. Analyzing dynamic voting using Petri nets, 1996, Proceedings 15th Symposium on Reliable Distributed Systems.

[14] Yin Zhang, et al. The Stationarity of Internet Path Properties: Routing, Loss, and Throughput, 2000.

[15] Seif Haridi, et al. Distributed Algorithms, 1992, Lecture Notes in Computer Science.

[16] Idit Keidar, et al. A client-server oriented algorithm for virtually synchronous group membership in WANs, 2000, Proceedings 20th IEEE International Conference on Distributed Computing Systems.

[17] John C. Knight, et al. Towards a Definition of Survivability, 2000.

[18] Nancy A. Lynch, et al. Designing a Caching-Based Reliable Multicast Protocol, 2001.

[19] Chandramohan A. Thekkath, et al. Petal: distributed virtual disks, 1996, ASPLOS VII.

[20] Kenneth P. Birman, et al. Bimodal multicast, 1999, TOCS.

[21] M. Handley. An Examination of MBone Performance, 1997.