On failure detection algorithms in overlay networks

One of the key reasons overlay networks are seen as an excellent platform for large-scale distributed systems is their resilience in the presence of node failures. This resilience relies on accurate and timely detection of node failures. Despite the prevalent use of keep-alive algorithms in overlay networks to detect node failures, their tradeoffs and the circumstances in which they might best be suited are not well understood. In this paper, we study, via analysis, simulation, and implementation, how the design of various keep-alive approaches affects their performance in terms of node failure detection time, probability of false positives, control overhead, and packet loss rate. We find that among the class of keep-alive algorithms that share information, the maintenance of backpointer state substantially improves detection time and packet loss rate. The improvement in detection time between the baseline and sharing algorithms becomes more pronounced as the size of the neighbor set increases. Finally, sharing of information allows a network to tolerate a higher churn rate than the baseline does.
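
To make the tradeoff between detection time and false positives concrete, the following is a minimal sketch of a baseline keep-alive detector. It is not the paper's algorithm: the class name `KeepAliveDetector`, the threshold `max_missed`, and the independent-loss assumption in `false_positive_probability` are all illustrative assumptions. A neighbor is declared failed after `max_missed` consecutive probe periods without a reply, so a larger threshold lowers the chance of a false positive but lengthens detection time.

```python
from dataclasses import dataclass, field


@dataclass
class KeepAliveDetector:
    """Hypothetical baseline keep-alive detector (illustrative only).

    A neighbor is declared failed after `max_missed` consecutive
    probe periods in which it did not reply.
    """
    max_missed: int = 3
    missed: dict = field(default_factory=dict)  # neighbor id -> consecutive misses

    def add_neighbor(self, node_id):
        self.missed[node_id] = 0

    def on_probe_period(self, replies):
        """Advance one probe period.

        `replies` is the set of neighbors that answered this period.
        Returns the neighbors newly declared failed, which are also
        removed from the monitored set.
        """
        failed = []
        for node_id in list(self.missed):
            if node_id in replies:
                self.missed[node_id] = 0
            else:
                self.missed[node_id] += 1
                if self.missed[node_id] >= self.max_missed:
                    failed.append(node_id)
                    del self.missed[node_id]
        return failed


def false_positive_probability(loss_rate, max_missed):
    """Chance that `max_missed` consecutive probes to a live neighbor are all lost.

    Assumes independent losses with probability `loss_rate` per probe
    round trip, which is a simplification: real packet loss is often
    temporally correlated.
    """
    return loss_rate ** max_missed
```

Under these assumptions, a detector with `max_missed=3` and a probe period of T seconds needs roughly 3T to notice a failure, while a live neighbor behind a path with 10% independent loss is wrongly declared failed in a given three-probe window with probability 0.1 cubed.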
