Lunule: an agile and judicious metadata load balancer for CephFS

For a decade, the Ceph distributed file system (CephFS) has been widely used to serve ever-growing data volumes in key fields ranging from Internet services to AI computing. To scale out metadata access, CephFS adopts a dynamic subtree partitioning method, splitting the hierarchical namespace and distributing subtrees across multiple metadata servers. However, this method suffers from a severe imbalance problem that can result in poor performance due to inaccurate imbalance prediction, disregard for workload characteristics, and unnecessary or invalid migration activities. To eliminate these inefficiencies, we propose Lunule, a novel CephFS metadata load balancer that employs an imbalance factor model to accurately determine when to trigger rebalancing and to tolerate benign imbalance. Lunule further adopts a workload-aware migration planner to appropriately select subtree migration candidates. Compared to baselines, Lunule achieves better load balance, increases metadata throughput by up to 315.8%, and shortens tail job completion time by up to 64.6% for five real-world workloads and their mixture, respectively. In addition, Lunule is capable of handling metadata cluster expansion and client workload growth, and scales linearly on a cluster of 16 MDSs.
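To make the trigger decision concrete, the sketch below shows one common way such an "is the cluster imbalanced enough to act?" check can be expressed: comparing a dispersion metric of per-MDS load (here, the coefficient of variation) against a threshold so that small, benign imbalances are tolerated. This is an illustrative assumption, not Lunule's actual imbalance factor model; the threshold value and the notion of "load" are hypothetical.

```python
# Illustrative sketch only: a coefficient-of-variation-style trigger check,
# NOT Lunule's actual imbalance factor model. Threshold and load definition
# are assumptions for demonstration.
from statistics import mean, pstdev

def should_rebalance(mds_loads, cv_threshold=0.3):
    """Return True if per-MDS loads are skewed enough to justify migration."""
    avg = mean(mds_loads)
    if avg == 0:
        return False  # an idle cluster is trivially balanced
    cv = pstdev(mds_loads) / avg  # coefficient of variation of the loads
    return cv > cv_threshold

# Example: one hot MDS among four triggers rebalancing; a mild skew does not.
print(should_rebalance([100, 105, 98, 400]))  # True
print(should_rebalance([100, 105, 98, 110]))  # False
```

A threshold-based check like this tolerates benign imbalance by design: migration is only considered once the skew across metadata servers exceeds a configurable bound, rather than reacting to every transient fluctuation.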
