Minimizing churn in distributed systems

A pervasive requirement of distributed systems is to deal with churn: change in the set of participating nodes due to joins, graceful leaves, and failures. A high churn rate can increase costs or decrease service quality. This paper studies how to reduce churn by selecting which subset of a set of available nodes to use.

First, we compare the performance of a range of node selection strategies on five real-world traces. Among our findings is that the simple strategy of picking a uniform-random replacement whenever a node fails performs surprisingly well. We explain its performance through analysis in a stochastic model.

Second, we show that a class of strategies, which we call "Preference List" strategies, arises commonly as a result of optimizing for a metric other than churn, and produces high churn relative to more randomized strategies under realistic node failure patterns. Using this insight, we demonstrate and explain differences in performance for designs that incorporate varying degrees of randomization. We give examples from a variety of protocols, including anycast, overlay multicast, and distributed hash tables. In many cases, simply adding some randomization can go a long way towards reducing churn.
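To make the two kinds of strategy concrete, the following is a minimal sketch, not the paper's simulator or its traces. It assumes a toy model in which every node fails and recovers independently with fixed per-step probabilities, and it reads a "Preference List" strategy as "always use the k most-preferred available nodes under a fixed ranking". The function name `simulate`, its parameters, and the churn measure (number of nodes newly added to the in-use set) are all hypothetical choices made for illustration.

```python
import random


def simulate(strategy, n=50, k=10, steps=2000,
             p_fail=0.02, p_recover=0.1, seed=0):
    """Return how many times a node is newly added to the in-use set.

    Toy model (an assumption, not the paper's): each node flips between
    up and down independently at every step.
    """
    trace_rng = random.Random(seed)      # drives failures and recoveries
    pick_rng = random.Random(seed + 1)   # drives random replacement only
    up = [True] * n                      # availability of each node
    in_use = set(range(k))               # same starting set for both strategies
    churn = 0
    for _ in range(steps):
        # Advance the availability of every node by one step.
        for i in range(n):
            if up[i]:
                if trace_rng.random() < p_fail:
                    up[i] = False
            elif trace_rng.random() < p_recover:
                up[i] = True

        if strategy == "preference":
            # Fixed ranking by node id: use the k lowest-numbered live nodes.
            new_set = set([i for i in range(n) if up[i]][:k])
        elif strategy == "random":
            # Keep surviving nodes; fill vacancies uniformly at random.
            new_set = {i for i in in_use if up[i]}
            spare = [i for i in range(n) if up[i] and i not in new_set]
            need = min(k - len(new_set), len(spare))
            new_set.update(pick_rng.sample(spare, need))
        else:
            raise ValueError("unknown strategy: " + strategy)

        churn += len(new_set - in_use)   # count nodes that were swapped in
        in_use = new_set
    return churn


if __name__ == "__main__":
    for name in ("random", "preference"):
        print(name, simulate(name))
```

Because the failure trace uses its own random generator, both strategies see the same sequence of failures and recoveries for a given seed. In this sketch the preference strategy swaps a node in whenever a better-ranked node recovers, in addition to replacing failed nodes, while random replacement changes the set only when an in-use node fails.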

[1] Scott Shenker, et al. Minimizing churn in distributed systems, 2006, SIGCOMM.
