The Neutralizer: a self‐configurable failure detector for minimizing distributed storage maintenance cost

To achieve high data availability or reliability efficiently, a distributed storage system must determine whether an observed node failure is permanent or transient and, when necessary, generate new replicas to restore the desired level of replication. Given the unpredictability of network dynamics, however, distinguishing permanent from transient failures is extremely difficult. Although timeout‐based detectors can help avoid mistaking transient failures for permanent ones, it remains unclear how timeout values should be selected to achieve a good tradeoff between detection latency and accuracy. In this paper, we address this fundamental tradeoff from several perspectives. First, we explore the impact of different timeout values on maintenance cost by examining the probabilities of the false positives and false negatives they produce. Second, we propose a self‐configurable failure detector called the Neutralizer, based on the idea of counteracting false positives with false negatives. The Neutralizer can enable the system to maintain a desired replication level on average with the least amount of bandwidth. We conduct extensive simulations using real trace data from a widely deployed peer‐to‐peer system and synthetic traces based on PlanetLab and Microsoft PCs; the results show a significant reduction in aggregate bandwidth usage when the Neutralizer is applied, especially in environments with low average node availability. Overall, we demonstrate that the Neutralizer closely approximates the performance of a perfect ‘oracle’ detector in many cases. Copyright © 2008 John Wiley & Sons, Ltd.
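The tradeoff between detection latency and accuracy in a plain timeout‐based detector can be illustrated with a minimal sketch. This is not the Neutralizer algorithm, which the abstract does not specify. It assumes only that a node is declared permanently failed once it has been unreachable for longer than the timeout, so every transient outage that outlasts the timeout becomes a false positive. The function names and the downtime values are hypothetical and chosen purely for illustration.

```python
def false_positive_rate(transient_downtimes, timeout):
    """Fraction of transient outages that exceed the timeout.

    Each such outage would be wrongly declared permanent and would
    trigger an unnecessary replica repair.
    """
    if not transient_downtimes:
        return 0.0
    wrong = sum(1 for d in transient_downtimes if d > timeout)
    return wrong / len(transient_downtimes)


def sweep(transient_downtimes, timeouts):
    """Pair each candidate timeout with its false positive rate."""
    return [(t, false_positive_rate(transient_downtimes, t)) for t in timeouts]


# Hypothetical transient outage durations, in hours (illustrative only).
downtimes_hours = [0.5, 1, 2, 4, 8, 12, 24, 48]

for timeout, fpr in sweep(downtimes_hours, [1, 6, 24, 72]):
    # The timeout is also the delay before a truly permanent failure is acted on.
    print(f"timeout={timeout}h  detection latency={timeout}h  false positive rate={fpr:.3f}")
```

In this toy data, lengthening the timeout lowers the false positive rate but delays the reaction to truly permanent failures by the same amount, which is the tension the abstract describes.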
