Design and evaluation of distributed wide-area on-line archival storage systems

As the volume of digital assets increases, systems that ensure the durability, integrity, and accessibility of digital data become increasingly important. Distributed on-line archival storage systems are designed for this very purpose. This thesis explores several important challenges pertaining to fault tolerance, repair, and integrity that must be addressed to build such systems.

The first part of this thesis explores how to maintain durability via fault tolerance and repair, and presents insights on how to do so efficiently. Fault tolerance ensures that data is not lost due to server failure. Replication is the canonical solution for data fault tolerance; the challenge is knowing how many replicas to create and where to store them. Fault tolerance alone, however, is not sufficient to prevent data loss, as the last replica will eventually fail. Thus, repair is required to replace replicas lost to failure. The system must monitor servers, detect failures, and create replicas in response. The problem is that not every server failure results in loss of data, so the system can be tricked into creating replicas unnecessarily. The challenge is knowing when to create replicas. Both fault tolerance and repair are required to prevent the last replica from being lost and, hence, to maintain data durability.

The second part of this thesis explores how to ensure the integrity of data. Integrity ensures that the state of data stored in the system always reflects changes made by the owner. It includes non-repudiably binding the owner to the data and ensuring that only the owner can modify data, that returned data is the same as the data stored, and that the last write is returned in subsequent reads. The challenge is efficiency, since requiring cryptography and consistency in the wide area can easily be prohibitive. We exploit a secure log to ensure integrity efficiently. We demonstrate how the narrow interface of a secure, append-only log simplifies the design of distributed wide-area storage systems. The system inherits the security and integrity properties of the log. We describe how to replicate the log for increased durability while ensuring consistency among the replicas. We present a repair algorithm that maintains sufficient replication levels as machines fail. The design also uses aggregation to improve efficiency. Although simple, this interface is powerful enough to implement a variety of interesting applications.

Finally, we apply the insights and architecture to a prototype called Antiquity. Antiquity efficiently maintains the durability and integrity of data. It has been running in the wide area on 400+ PlanetLab servers, where we maintain the consistency, durability, and integrity of nearly 20,000 logs totaling more than 84 GB of data despite the constant churn of servers (a quarter of the servers experience a failure every hour).
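To make the idea of an append-only log with verifiable integrity more concrete, the following is a minimal sketch of a hash-chained log, a common way to build such a structure. It is an illustration under stated assumptions, not Antiquity's actual design: the names `AppendOnlyLog`, `entry_hash`, `verify`, and `GENESIS` are hypothetical, SHA-256 is an arbitrary choice, and the owner's signature over the log head (which would provide the non-repudiable binding of owner to data) is omitted.

```python
import hashlib

# Hypothetical fixed starting value for the head of an empty log.
GENESIS = b"\x00" * 32


def entry_hash(prev_hash: bytes, payload: bytes) -> bytes:
    """Hash an entry together with the head that preceded it."""
    return hashlib.sha256(prev_hash + payload).digest()


class AppendOnlyLog:
    """Illustrative hash-chained log; the only mutation is append()."""

    def __init__(self):
        self.entries = []    # list of (prev_hash, payload) pairs
        self.head = GENESIS  # hash that commits to every entry so far

    def append(self, payload: bytes) -> bytes:
        # Each entry records the head it extends, then the head advances.
        self.entries.append((self.head, payload))
        self.head = entry_hash(self.head, payload)
        return self.head


def verify(entries, expected_head: bytes) -> bool:
    """Recompute the chain and compare it against a trusted head."""
    h = GENESIS
    for prev_hash, payload in entries:
        if prev_hash != h:
            return False  # entry does not extend the preceding state
        h = entry_hash(h, payload)
    return h == expected_head
```

In this sketch, a reader who trusts only the head hash can check entries returned by any replica: modifying, dropping, or reordering an entry changes the recomputed head. In a deployed system the head would presumably be signed by the owner so that only the owner can extend the log; that step is left out here because it needs a public-key library.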
