Evaluating the running time of a communication round over the Internet

We study the running time of distributed algorithms deployed in a widely distributed setting over the Internet using TCP. We consider a simple primitive that corresponds to a communication round in which every host sends information to every other host; this primitive occurs in numerous distributed algorithms. We experiment with four algorithms that typically implement this primitive. We run our experiments on ten hosts at geographically dispersed locations over the Internet. We observe that message loss has a large impact on algorithm running times, and as a result leader-based algorithms usually outperform decentralized ones.
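To give a rough intuition for why message loss might favor leader-based rounds, the following sketch compares message counts for two illustrative ways of implementing the round. Both shapes are assumptions made for illustration, since the abstract does not name the four algorithms: a decentralized all-to-all exchange, and a leader-based exchange in which every host sends to a leader that then sends the combined information back. The independent per-message loss rate and the function names are likewise hypothetical and are not measurements or a model taken from the study.

```python
def messages_all_to_all(n: int) -> int:
    """Messages in one decentralized round: each of n hosts sends to the other n - 1."""
    return n * (n - 1)


def messages_leader_based(n: int) -> int:
    """Messages in one leader-based round: n - 1 hosts send to the leader,
    then the leader sends the combined information to the n - 1 others."""
    return 2 * (n - 1)


def prob_no_loss(messages: int, loss_rate: float) -> float:
    """Probability that none of `messages` is lost, assuming (hypothetically)
    that each message is lost independently with probability `loss_rate`."""
    return (1.0 - loss_rate) ** messages


if __name__ == "__main__":
    n = 10             # ten hosts, as in the experiments described above
    loss_rate = 0.01   # assumed value, for illustration only
    for name, count in [
        ("all-to-all", messages_all_to_all(n)),
        ("leader-based", messages_leader_based(n)),
    ]:
        p = prob_no_loss(count, loss_rate)
        print(f"{name}: {count} messages, P(no loss) = {p:.3f}")
```

Under these assumptions the decentralized round sends far more messages, so it is more likely that at least one of them is lost and must wait for a TCP retransmission, even though the leader-based round takes two communication steps instead of one.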
