Decentralized Storage Consistency via Versioning Servers (CMU-CS-02-180)

This paper describes a consistency protocol that exploits versioning storage-nodes. The protocol provides linearizability with the possibility of read aborts in an asynchronous system that may suffer client and storage-node crash failures. The protocol supports both replication and erasure coding (which precludes post hoc repair of partial-writes), and avoids the excess work of two-phase commits. Versioning storage-nodes allow the protocol to avoid excess communication in the common case of no write sharing and no failures of writing clients.

We thank the members and companies of the PDL Consortium (including EMC, Hewlett-Packard, Hitachi, IBM, Intel, Microsoft, Network Appliance, Panasas, Seagate, Sun, and Veritas) for their interest, insights, feedback, and support. We thank IBM and Intel for hardware grants supporting our research efforts. This material is based on research sponsored by the Air Force Research Laboratory, under agreement number F49620-01-1-0433, and by DARPA/ITO’s OASIS program, under Air Force contract number F30602-99-2-0539-AFRL. Garth Goodson was supported by an IBM Fellowship. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government.
