Efficient Replica Maintenance for Distributed Storage Systems

This paper considers replication strategies for storage systems that aggregate the disks of many nodes spread over the Internet. Maintaining replication in such systems can be prohibitively expensive, since every transient network or host failure could potentially trigger copying a server's worth of data across the Internet to restore replication levels. The paper's analysis yields four insights for designing an efficient replication algorithm. First, durability can be provided separately from availability; the former is less expensive to ensure and a more useful goal for many wide-area applications. Second, a durability algorithm must focus on creating new copies of data objects faster than permanent disk failures destroy them; careful choice of which nodes hold which data can reduce repair time. Third, increasing the number of replicas of each object does not help a system tolerate a higher disk failure probability, but it does help tolerate bursts of failures. Finally, ensuring that the system reuses replicas that recover after temporary failures is critical to efficiency. Based on these insights, the paper proposes the Carbonite replication algorithm for keeping data durable at low cost. A simulation of Carbonite storing 1 TB of data over a 365-day trace of PlanetLab activity shows that Carbonite keeps all data durable while using 44% more network traffic than a hypothetical system that responds only to permanent failures. In comparison, Total Recall and DHash require almost a factor of two more network traffic than this hypothetical system.
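As a concrete illustration of the last two insights, the sketch below shows what a Carbonite-style maintenance loop for a single object might look like. It is a minimal sketch in Python, not the paper's implementation: the placement set, the is_reachable() probe, and the copy_object() transfer are hypothetical helpers assumed to be provided by the surrounding storage system, and replication_level stands in for the target number of replicas. The loop creates a new copy only when fewer than the target number of replicas are reachable, and it never forgets replicas on nodes that are temporarily offline, so those copies are reintegrated when the nodes return.

import random


class CarboniteLikeMaintainer:
    """Minimal sketch of a Carbonite-style maintenance loop for one object.

    Illustrative only, not the paper's code: `placement` (the fixed set of
    candidate nodes), `is_reachable(node)`, and `copy_object(src, dst)` are
    hypothetical helpers assumed to be supplied by the surrounding system;
    `replication_level` plays the role of the target replication level.
    """

    def __init__(self, placement, replication_level=3):
        self.placement = list(placement)   # candidate nodes for this object
        self.r_l = replication_level       # desired number of reachable replicas
        self.holders = set()               # nodes ever known to hold a replica

    def record_copy(self, node):
        """Register a replica created by the insertion path or by repair."""
        self.holders.add(node)

    def maintain(self, is_reachable, copy_object):
        """Run one round of maintenance."""
        # Count replicas on nodes that are currently up. Replicas on nodes
        # that disappeared are never forgotten: when such a node returns from
        # a transient failure, is_reachable() becomes true again and its copy
        # is reintegrated, which is what keeps repair traffic low.
        reachable = {n for n in self.holders if is_reachable(n)}

        # Repair only when fewer than r_l replicas are reachable, i.e. react
        # to an apparent loss of durability rather than to every transient
        # outage individually.
        while len(reachable) < self.r_l:
            candidates = [n for n in self.placement
                          if n not in self.holders and is_reachable(n)]
            if not candidates or not reachable:
                break                      # no destination or no live source
            src = random.choice(list(reachable))
            dst = candidates[0]
            copy_object(src, dst)          # make one new replica
            self.record_copy(dst)
            reachable.add(dst)

The key design choice in this sketch is that holders only ever grows: a node that vanishes is treated as possibly transient, and if it returns its replica counts again, so over time the system accumulates enough extra copies that transient failures rarely trigger new transfers.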
