File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution

For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is the preferred choice for most distributed file systems today because it lets them benefit from the convenience and maturity of battle-tested code. Ceph's experience, however, shows that this comes at a high price. First, developing a zero-overhead transaction mechanism is challenging. Second, metadata performance at the local level can significantly affect performance at the distributed level. Third, supporting emerging storage hardware is painstakingly slow. Ceph addressed these issues with BlueStore, a new backend designed to run directly on raw storage devices. In only two years since its inception, BlueStore outperformed previously established backends and was adopted by 70% of users in production. By running in user space and fully controlling the I/O stack, it has enabled space-efficient metadata and data checksums, fast overwrites of erasure-coded data, and inline compression; it has also decreased performance variability and avoided a series of performance pitfalls of local file systems. Finally, it makes the adoption of backwards-incompatible storage hardware possible, an important trait in a changing storage landscape that is learning to embrace hardware diversity.
