Flexible, wide-area storage for distributed systems using semantic cues

There is a growing set of Internet-based services that are too big, or too important, to run at a single site. Examples include Web services for e-mail, video and image hosting, and social networking. Splitting such services over multiple sites can increase capacity, improve fault tolerance, and reduce network delays to clients. These services often need storage infrastructure to share data among the sites.

This dissertation explores the use of a new file system (WheelFS) specifically designed to be the storage infrastructure for wide-area distributed services. WheelFS allows applications to adjust the semantics of their data via semantic cues, which provide application control over consistency, failure handling, and file and replica placement. This dissertation describes a particular set of semantic cues that reflect the specific challenges that storing data over the wide-area network entails: high-latency and low-bandwidth links, coupled with increased node and link failures, when compared to local-area networks. By augmenting a familiar POSIX interface with support for semantic cues, WheelFS provides a wide-area distributed storage system intended to help multi-site applications share data and gain fault tolerance, in the form of a distributed file system. Its design allows applications to adjust the tradeoff between prompt visibility of updates from other sites and the ability for sites to operate independently despite failures and long delays.

WheelFS is implemented as a user-level file system and is deployed on PlanetLab and Emulab. Six applications (an all-pairs-pings script, a distributed Web cache, an email service, large file distribution, distributed compilation, and protein sequence alignment software) demonstrate that WheelFS’s file system interface simplifies construction of distributed applications by allowing reuse of existing software. These applications would perform poorly with the strict semantics implied by a traditional file system interface, but by providing cues to WheelFS they are able to achieve good performance. Measurements show that applications built on WheelFS deliver comparable performance to services such as CoralCDN and BitTorrent that use specialized wide-area storage systems.

Thesis Supervisor: M. Frans Kaashoek
Title: Professor

Thesis Supervisor: Robert Morris
Title: Professor

Thesis Supervisor: Jinyang Li
Title: Assistant Professor, NYU
