Protector: A Probabilistic Failure Detector for Cost-Effective Peer-to-Peer Storage

Maintaining a given level of data redundancy is a fundamental requirement of peer-to-peer (P2P) storage systems: to ensure the desired data availability, additional replicas must be created when peers fail. Since the majority of failures in P2P networks are transient (i.e., peers return with data intact), an intelligent system can significantly reduce replication costs by not replicating data following transient failures. Reliably distinguishing permanent from transient failures, however, is a challenging task, because peers are unresponsive to probes in both cases. In this paper, we propose Protector, an algorithm that enables efficient replication policies by estimating the number of “remaining replicas” for each object, including those temporarily unavailable due to transient failures. Protector dramatically improves detection accuracy by exploiting two opportunities. First, it leverages failure patterns to predict the likelihood that a peer (and the data it hosts) has permanently failed given its current downtime. Second, it estimates the replication level across groups of replicas (or fragments), thereby balancing false positives for some peers against false negatives for others. Extensive simulations based on both synthetic and real traces show that Protector closely approximates the performance of a perfect “oracle” failure detector and significantly outperforms timeout-based detectors across a wide range of parameters. Finally, we design, implement, and deploy an efficient P2P storage system called AmazingStore by combining Protector with structured P2P overlays. Our experience shows that Protector enables efficient long-term data maintenance in P2P storage systems.
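The two ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's actual estimator: it assumes, purely for illustration, that transient downtimes are exponentially distributed and that a fixed fraction of failures is permanent. The function names and the parameters `prior_permanent` and `mean_transient_hours` are hypothetical. The sketch applies Bayes' rule to get the probability that a peer is permanently gone given its downtime, then sums the survival probabilities over a replica group to estimate the number of remaining replicas.

```python
import math

def prob_permanent(downtime_hours, prior_permanent=0.1, mean_transient_hours=8.0):
    """P(permanent failure | peer has been down for downtime_hours).

    Assumption (not from the paper): transient downtimes are exponential
    with the given mean, and prior_permanent is the fraction of failures
    that are permanent.
    """
    # Probability that a transient failure would still be down after this long.
    still_down_if_transient = math.exp(-downtime_hours / mean_transient_hours)
    # Bayes' rule; a permanent failure is still down with probability 1.
    return prior_permanent / (
        prior_permanent + (1.0 - prior_permanent) * still_down_if_transient
    )

def expected_remaining_replicas(downtimes):
    """Expected number of replicas that still exist in a group.

    downtimes holds one entry per replica: None for a responsive peer,
    otherwise the hours the peer has been unresponsive.
    """
    return sum(1.0 if d is None else 1.0 - prob_permanent(d) for d in downtimes)

def needs_repair(downtimes, target_replicas):
    """Trigger replication only when the group-level estimate drops below target."""
    return expected_remaining_replicas(downtimes) < target_replicas
```

Deciding at the group level rather than per peer lets an overestimate for one peer offset an underestimate for another, which is the balancing effect the abstract describes. For example, `needs_repair([None, None, 1.0], 2.5)` evaluates the estimate for a group with two live replicas and one peer that has been down for an hour.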
