An epidemic approach to dependable key-value substrates

The sheer volume of data handled by today's Internet services demands uncompromising scalability from persistence substrates. Such demands have been successfully addressed by highly decentralized key-value stores, invariably governed by a distributed hash table. The availability of these structured overlays rests on the assumption of a moderately stable environment. However, as systems grow to unprecedented numbers of nodes, faults and churn become the norm rather than the exception, precluding rigid control over the network's organization. In this position paper we outline the major ideas of a novel architecture designed to handle today's very large scale demand and its inherent dynamism. The approach relies on the well-known reliability and scalability properties of epidemic protocols to minimize the impact of churn. We identify several challenges that such an approach implies and speculate on possible solutions to ensure data availability and adequate access performance.
