Lunule: an agile and judicious metadata load balancer for CephFS

For a decade, the Ceph distributed file system (CephFS) has been widely used to serve ever-growing data volumes in key fields ranging from Internet services to AI computing. To scale out metadata access, CephFS adopts a dynamic subtree partitioning method, splitting the hierarchical namespace and distributing subtrees across multiple metadata servers. However, this method suffers from a severe imbalance problem that can result in poor performance due to inaccurate imbalance prediction, disregard for workload characteristics, and unnecessary or invalid migration activities. To eliminate these inefficiencies, we propose Lunule, a novel CephFS metadata load balancer that employs an imbalance factor model to accurately determine when to trigger rebalancing and to tolerate benign imbalance. Lunule further adopts a workload-aware migration planner to appropriately select subtree migration candidates. Compared to baselines, Lunule achieves better load balance, increases metadata throughput by up to 315.8%, and shortens tail job completion time by up to 64.6% for five real-world workloads and their mixture, respectively. In addition, Lunule is capable of handling metadata cluster expansion and client workload growth, and scales linearly on a cluster of 16 MDSs.
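To make the trigger decision concrete, the sketch below shows one common way such an "is the cluster imbalanced enough to act?" check can be expressed: comparing a dispersion metric of per-MDS load (here, the coefficient of variation) against a threshold so that small, benign imbalances are tolerated. This is an illustrative assumption, not Lunule's actual imbalance factor model; the threshold value and the notion of "load" are hypothetical.

```python
# Illustrative sketch only: a coefficient-of-variation-style trigger check,
# NOT Lunule's actual imbalance factor model. Threshold and load definition
# are assumptions for demonstration.
from statistics import mean, pstdev

def should_rebalance(mds_loads, cv_threshold=0.3):
    """Return True if per-MDS loads are skewed enough to justify migration."""
    avg = mean(mds_loads)
    if avg == 0:
        return False  # an idle cluster is trivially balanced
    cv = pstdev(mds_loads) / avg  # coefficient of variation of the loads
    return cv > cv_threshold

# Example: one hot MDS among four triggers rebalancing; a mild skew does not.
print(should_rebalance([100, 105, 98, 400]))  # True
print(should_rebalance([100, 105, 98, 110]))  # False
```

A threshold-based check like this tolerates benign imbalance by design: migration is only considered once the skew across metadata servers exceeds a configurable bound, rather than reacting to every transient fluctuation.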
