Ceph: a scalable, high-performance distributed file system

We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection, and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general-purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
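To make concrete the idea of replacing allocation tables with a placement function, the following minimal sketch computes an object's replica locations from its name and the list of OSDs alone, so any client can locate data without a lookup table. It is not the CRUSH algorithm: it uses rendezvous (highest-random-weight) hashing as a simple stand-in, and it ignores device weights, failure domains, and cluster hierarchy. The names `place`, `_score`, and the `osdN` identifiers are hypothetical and chosen only for illustration.

```python
import hashlib


def _score(object_name: str, osd_id: str) -> int:
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{object_name}:{osd_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Return the OSDs that should hold the object's replicas.

    Assumption: rendezvous hashing stands in for CRUSH here. Every
    participant that knows the OSD list computes the same answer,
    so no central allocation table is consulted.
    """
    if replicas > len(osds):
        raise ValueError("not enough OSDs for the requested replica count")
    ranked = sorted(osds, key=lambda osd: _score(object_name, osd), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    cluster = [f"osd{i}" for i in range(10)]
    print(place("file.0001", cluster))
```

A useful property of this stand-in is that removing an OSD that does not hold a given object leaves that object's placement unchanged, which loosely mirrors the goal of limiting data movement when the cluster changes.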
