The State of the Art of Metadata Managements in Large-Scale Distributed File Systems: Scalability, Performance and Availability

File system metadata is the data that maintains the namespace, permission semantics, and the location of file data blocks. Operations on metadata can account for up to 80% of all file system operations, so the performance of the metadata service significantly affects the overall performance of the file system. A large-scale distributed file system (DFS) is a storage system composed of multiple storage devices spread across different sites to accommodate data files and, in most cases, to provide users with location-independent access interfaces. Large-scale DFSs have been widely deployed as a substrate for a plethora of computing systems, and their metadata management efficiency is therefore crucial to a massive number of applications, especially with the advent of the big data age, which puts tremendous pressure on the underlying storage systems. This paper surveys state-of-the-art research on metadata services in large-scale distributed file systems from three perspectives commonly used to characterize DFSs: high scalability, high performance, and high availability, with a special focus on the major challenges of each and the mainstream technologies developed to address them. The paper also identifies and analyzes several existing problems in this research, which may serve as a reference for related studies.
