Analysis of durability in replicated distributed storage systems

In this paper, we investigate the roles of replication versus repair in achieving durability in large-scale distributed storage systems. Specifically, we address two fundamental questions: how does the lifetime of an object depend on the degree of replication and the rate of repair, and how is lifetime maximized when resources are constrained? In addition, in real systems, when a node becomes unavailable, it is uncertain whether the outage is temporary or permanent; we analyze the use of timeouts as a mechanism for making this determination. Finally, we explore the importance of memory in repair mechanisms, and show that under certain cost conditions, memoryless systems, which are inherently less complex, perform just as well.
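To make the first question concrete, the following is a minimal sketch, not the model used in the paper. It assumes a simple birth-death Markov chain in which each of the k live replicas fails independently at an exponential rate `fail_rate`, and a single lost replica is restored at a time at an exponential rate `repair_rate`. The function name, the parameter names, and the one-repair-at-a-time assumption are all hypothetical choices made for illustration. Under these assumptions, the expected time until all replicas are lost follows from a short recursion.

```python
def mean_lifetime(n, fail_rate, repair_rate):
    """Expected time until all n replicas are lost (illustrative model only).

    Assumptions (not taken from the paper):
      - state k = number of live replicas, starting at k = n
      - failures: k -> k-1 at rate k * fail_rate
      - repairs:  k -> k+1 at rate repair_rate, only when 0 < k < n
      - state 0 is absorbing (the object is lost)
    """
    if n < 1 or fail_rate <= 0 or repair_rate < 0:
        raise ValueError("need n >= 1, fail_rate > 0, repair_rate >= 0")

    # d_k = expected time to first move from state k down to state k-1.
    # For k < n:  d_k = (1 + repair_rate * d_{k+1}) / (k * fail_rate).
    # For k = n there is no repair, which the recursion covers with d_{n+1} = 0.
    d_above = 0.0
    total = 0.0
    for k in range(n, 0, -1):
        d_k = (1.0 + repair_rate * d_above) / (k * fail_rate)
        total += d_k
        d_above = d_k
    return total


if __name__ == "__main__":
    # Compare a few replication degrees and repair rates.
    for n in (1, 2, 3):
        for mu in (0.0, 1.0, 10.0):
            print(n, mu, mean_lifetime(n, fail_rate=1.0, repair_rate=mu))
```

In this toy model, raising either the number of replicas or the repair rate lengthens the expected lifetime, which is the trade-off the abstract refers to when resources are constrained.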
