Comparing Atomic Broadcast Algorithms in High Latency Networks

Since the introduction of the concept of failure detectors, several consensus and atomic broadcast algorithms based on these detectors have been published. The performance of these algorithms is often governed by a trade-off between the number of communication steps and the number of messages needed to reach a decision. Some algorithms reach a decision in few communication steps but require more messages to do so. Others save messages at the expense of an additional communication step to diffuse the decision to all processes in the system. This trade-off is heavily influenced by the network latency and the message processing times. Performance evaluations of these algorithms, in both simulated and real environments, have been published. These evaluations often consider a symmetrical setup: all processes are on the same network and have identical peer-to-peer latencies. In this paper, we evaluate the performance of three consensus and atomic broadcast algorithms using failure detectors in several wide area networks. We specifically focus on the case of a system with three processes, two of which are on a local area network and the third on a distant site, and examine how this setting affects the performance of all three algorithms.
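
The trade-off between communication steps and message count can be illustrated with a deliberately simplified cost model. The sketch below is not taken from the paper: the function `decision_latency`, the two hypothetical algorithm profiles, and all latency and processing figures are assumptions chosen only to show how the ranking of two algorithms can flip between a low-latency and a high-latency network.

```python
def decision_latency(steps, messages, link_latency_ms, processing_ms):
    """Toy estimate of the time to reach a decision, in milliseconds.

    Assumption (not from the paper): every communication step costs one
    link latency, and every message costs a fixed processing time, with
    all message processing serialized.
    """
    return steps * link_latency_ms + messages * processing_ms


# Hypothetical algorithm profiles: (communication steps, messages).
FEW_STEPS = (2, 12)      # decides quickly but sends more messages
FEW_MESSAGES = (3, 6)    # saves messages, pays one extra step

# Hypothetical network settings: (link latency ms, processing time ms).
LAN = (0.1, 0.05)
WAN = (50.0, 0.05)

for name, net in (("LAN", LAN), ("WAN", WAN)):
    a = decision_latency(*FEW_STEPS, *net)
    b = decision_latency(*FEW_MESSAGES, *net)
    winner = "few-steps" if a < b else "few-messages"
    print(f"{name}: few-steps={a:.2f} ms, few-messages={b:.2f} ms -> {winner}")
```

Under these assumed figures, message processing dominates on the local network, so the profile with fewer messages comes out ahead, whereas on the high-latency link the extra communication step dominates and the profile with fewer steps wins. The model ignores asymmetric placement of processes, which is the case the paper focuses on.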
