Ceph: reliable, scalable, and high-performance distributed storage

As the size and performance requirements of storage systems have increased, file system designers have looked to new architectures to facilitate system scalability. The emerging object-based storage paradigm diverges from server-based (e.g., NFS) and SAN-based storage systems by coupling processors and memory with disk drives, allowing systems to delegate low-level file system operations (e.g., allocation and scheduling) to object storage devices (OSDs) and decouple I/O (read/write) from metadata (file open/close) operations. However, even recent object-based systems inherit a variety of decades-old architectural choices going back to early UNIX file systems, limiting their ability to scale effectively. This dissertation shows that device intelligence can be leveraged to provide reliable, scalable, and high-performance file service in a dynamic cluster environment. It presents a distributed metadata management architecture that provides excellent performance and scalability by adapting to highly variable system workloads while tolerating arbitrary node crashes. A flexible and robust data distribution function places data objects in a large, dynamic cluster of storage devices, simplifying metadata and facilitating system scalability, while providing a uniform distribution of data, protection from correlated device failure, and efficient data migration. This placement algorithm facilitates the creation of a reliable and scalable object storage service that distributes the complexity of consistent data replication, failure detection, and recovery across a heterogeneous cluster of semi-autonomous devices. These architectural components, which have been implemented in the Ceph distributed file system, are evaluated under a variety of workloads that show superior I/O performance, scalable metadata management, and failure recovery.
