Lower Bounds for a Primary–Backup Implementation of a Bofo Service

One way to implement a fault-tolerant service is the primary-backup, or primary-copy, approach [1]. With this approach, a service is implemented by a collection of servers. One server is designated as the primary; the others are called backups. Clients send requests to the primary, and any responses to requests come from the primary. If the primary fails, a failover occurs, after which one of the backups assumes the role of the primary. With the primary-backup approach, a request from a client to the service can be lost if it is sent to a faulty primary. However, the periods during which requests can be lost are bounded by the length of time that elapses between the failure of the primary and the takeover by a backup. Such behavior is an instance of what we call a bofo service (bounded outage finitely often): an (i,Δ)–bofo service is one in which requests that are not processed fall into at most i intervals of time, each interval having a length of at most Δ. Thus, in an (i,Δ)–bofo service, even though some requests might be lost by the service, both the number of outages and the length of each outage are bounded. There exist lower bounds constraining i and Δ for implementing an (i,Δ)–bofo service using the primary-backup approach. These lower bounds are a function of message delivery delays and the number of failures that can be tolerated. The kinds of failures that need to be tolerated constrain the degree of replication and the worst-case response time to client requests.
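
The (i,Δ)–bofo condition can be illustrated with a small check over a finite trace of lost requests. The sketch below is only an illustration, not the paper's formal definition: it assumes each unprocessed request is identified by a single timestamp, that outage intervals are closed, and that the names `min_outage_intervals` and `is_bofo` are hypothetical helpers introduced here. It uses the standard greedy covering of points on a line by intervals of fixed length, which yields the minimum number of intervals.

```python
from typing import Iterable


def min_outage_intervals(lost_times: Iterable[float], delta: float) -> int:
    """Minimum number of closed intervals of length delta that cover
    every timestamp in lost_times (greedy sweep from left to right)."""
    if delta < 0:
        raise ValueError("delta must be non-negative")
    count = 0
    cover_end = None  # right end of the interval currently in use
    for t in sorted(lost_times):
        # Open a new interval at the first timestamp not yet covered.
        if cover_end is None or t > cover_end:
            count += 1
            cover_end = t + delta
    return count


def is_bofo(lost_times: Iterable[float], i: int, delta: float) -> bool:
    """True if the lost requests fit into at most i intervals,
    each of length at most delta (assumed reading of (i, delta)-bofo)."""
    return min_outage_intervals(lost_times, delta) <= i


if __name__ == "__main__":
    # Hypothetical trace: two bursts of lost requests around two failovers.
    lost = [1.0, 1.5, 2.0, 10.0, 10.4]
    print(is_bofo(lost, i=2, delta=1.0))
    print(is_bofo(lost, i=1, delta=1.0))
```

In this hypothetical trace, the two bursts of lost requests fit into two intervals of length 1.0 but not into one, so the trace is consistent with a (2, 1.0)–bofo service under the assumptions above and not with a (1, 1.0)–bofo service.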

[1] Peter A. Barrett, et al. Using passive replicates in Delta-4 to provide dependable distributed computing, 1989, The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[2] Fred B. Schneider, et al. Implementing fault-tolerant services using the state machine approach: a tutorial, 1990, CSUR.

[3] Joel F. Bartlett, et al. A NonStop kernel, 1981, SOSP.

[4] J. D. Day, et al. A principle for resilient sharing of distributed resources, 1976, ICSE '76.

[5] Hector Garcia-Molina, et al. Database Processing with Triple Modular Redundancy, 1986, Symposium on Reliability in Distributed Software and Database Systems.

[6] Timothy P. Mann, et al. An Algorithm for Data Replication, 1989.

[7] Kenneth P. Birman, et al. Exploiting virtual synchrony in distributed systems, 1987, SOSP '87.

[8] Liuba Shrira, et al. A replicated Unix file system, 1990, Proceedings. Workshop on the Management of Replicated Data.

[9] Anupam Bhide, et al. A Highly Available Network File Server, 1991, USENIX Winter.

[10] Keith Marzullo, et al. Supplying high availability with a standard network file system, 1988, Proceedings. The 8th International Conference on Distributed.

[11] Leslie Lamport, et al. The Byzantine Generals Problem, 1982, TOPL.