Paper information - Chasing the FLP impossibility result in a LAN: or, How robust can a fault tolerant server be?

Chasing the FLP impossibility result in a LAN: or, How robust can a fault tolerant server be?

Fault tolerance can be achieved in distributed systems by replication. However, Fischer, Lynch and Paterson (1985) have proven an impossibility result about consensus in the asynchronous system model, and similar impossibility results exist for atomic broadcast and group membership. We investigate, with the aid of an experiment conducted in a LAN, whether these impossibility results set limits on the robustness of a replicated server exposed to extremely high loads. The experiment consists of client processes that send requests to a replicated server (three replicas) using an atomic broadcast primitive. It has parameters that allow us to control the load on the hosts and the network, as well as the timeout value used by our heartbeat failure detection mechanism. Our main observation is that the atomic broadcast algorithm never stops delivering messages, not even under arbitrarily high load and very small timeout values (1 ms). So, by trying to illustrate the practical impact of impossibility results, we discovered that we had implemented a very robust replicated service.
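The abstract names a heartbeat failure detection mechanism with a configurable timeout but does not describe its implementation. As a rough illustration only, a timeout-based heartbeat detector might look like the sketch below. The class name, method names, replica identifiers, and the millisecond timestamps are all hypothetical and are not taken from the paper. The sketch shows why a very small timeout matters: a heartbeat delayed by load causes a wrong suspicion, which is withdrawn once the late heartbeat arrives.

```python
class HeartbeatFailureDetector:
    """Hypothetical sketch: suspect a process whose last heartbeat is older than the timeout."""

    def __init__(self, processes, timeout_ms, now_ms=0):
        self.timeout_ms = timeout_ms
        # Treat start-up time as the last heartbeat seen from every monitored process.
        self.last_heartbeat_ms = {p: now_ms for p in processes}

    def on_heartbeat(self, process, now_ms):
        # Record the arrival time of a heartbeat message.
        self.last_heartbeat_ms[process] = now_ms

    def suspects(self, now_ms):
        # A process is suspected while its silence exceeds the timeout.
        return {
            p
            for p, seen in self.last_heartbeat_ms.items()
            if now_ms - seen > self.timeout_ms
        }


# Three replicas monitored with a 1 ms timeout (assumed values, in milliseconds).
fd = HeartbeatFailureDetector(["r1", "r2", "r3"], timeout_ms=1)
fd.on_heartbeat("r1", now_ms=10)
fd.on_heartbeat("r2", now_ms=10)
fd.on_heartbeat("r3", now_ms=8)            # r3's next heartbeat is delayed by load

wrong_suspicion = fd.suspects(now_ms=10)   # r3 has been silent for 2 ms
fd.on_heartbeat("r3", now_ms=11)           # the late heartbeat finally arrives
after_recovery = fd.suspects(now_ms=11)    # the suspicion is withdrawn
```

Passing the clock value explicitly keeps the sketch deterministic; a real detector would read a monotonic clock and receive heartbeats over the network.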

[1] Fred B. Schneider, et al. Implementing fault-tolerant services using the state machine approach: a tutorial, 1990, CSUR.

[2] Sam Toueg, et al. Unreliable failure detectors for reliable distributed systems, 1996, JACM.

[3] Nancy A. Lynch, et al. Impossibility of distributed consensus with one faulty process, 1983, PODS '83.

[4] Péter Urbán, et al. Neko: a single environment to simulate and prototype distributed algorithms, 2001, Proceedings 15th International Conference on Information Networking.

[5] André Schiper. Early consensus in an asynchronous system with a weak failure detector, 1997, Distributed Computing.

[6] Sam Toueg, et al. Fault-tolerant broadcasts and related problems, 1993.

[7] Marcos K. Aguilera, et al. On the quality of service of failure detectors, 2000, Proceedings International Conference on Dependable Systems and Networks. DSN 2000.

[8] Fred B. Schneider, et al. Replication management using the state-machine approach, 1993.

[9] Flaviu Cristian, et al. The Timed Asynchronous Distributed System Model, 1999, IEEE Trans. Parallel Distributed Syst.

[10] Bernadette Charron-Bost, et al. On the impossibility of group membership, 1996, PODC '96.

[11] Achour Mostéfaoui, et al. Solving Consensus Using Chandra-Toueg's Unreliable Failure Detectors: A General Quorum-Based Approach, 1999, DISC.

[12] Fred B. Schneider, et al. The primary-backup approach, 1993.