DeltaFS: A Scalable No-Ground-Truth Filesystem For Massively-Parallel Computing

High-performance computing (HPC) is known for its use of massive concurrency. Yet a parallel filesystem's control plane struggles to make use of the available cores when every client process must globally synchronize and serialize its metadata mutations with those of other clients. We present DeltaFS, a new paradigm for distributed filesystem metadata. DeltaFS allows jobs to self-commit their namespace changes to logs, avoiding the cost of global synchronization. Follow-up jobs selectively merge the logs produced by previous jobs as needed, a principle we term No Ground Truth, which allows for efficient data sharing. By avoiding unnecessary synchronization of metadata operations, DeltaFS improves metadata operation throughput by up to 98X by leveraging parallelism on the nodes where job processes run, and this speedup grows as job size increases. DeltaFS also enables efficient inter-job communication, reducing overall workflow runtime by improving client metadata operation latency by up to 49X and resource usage by up to 52X.
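
The log-and-merge idea described above can be illustrated with a minimal sketch. This is not the DeltaFS implementation or its API: the `JobLog` class, the `merge_view` function, the record format, and the example paths are all hypothetical, chosen only to show how a job might record namespace changes privately and how a later job might build its own view from the logs it selects.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record format: ("create" | "delete", path).
Record = Tuple[str, str]


class JobLog:
    """Namespace changes that one job commits on its own, with no global lock."""

    def __init__(self, job_id: str) -> None:
        self.job_id = job_id
        self.records: List[Record] = []

    def create(self, path: str) -> None:
        self.records.append(("create", path))

    def delete(self, path: str) -> None:
        self.records.append(("delete", path))


def merge_view(logs: Iterable[JobLog]) -> Dict[str, str]:
    """Replay only the selected logs, in the given order, into a private view.

    Returns a mapping from path to the id of the job that created it.
    """
    view: Dict[str, str] = {}
    for log in logs:
        for op, path in log.records:
            if op == "create":
                view[path] = log.job_id
            elif op == "delete":
                view.pop(path, None)
    return view


# Three independent jobs, each writing only to its own log.
sim = JobLog("sim")
sim.create("/out/particles.0")
sim.create("/out/particles.1")

cleanup = JobLog("cleanup")
cleanup.delete("/out/particles.0")

unrelated = JobLog("unrelated")
unrelated.create("/scratch/tmp")

# A follow-up job merges only the logs it cares about.
analysis_view = merge_view([sim, cleanup])
```

In this sketch the follow-up job never sees `/scratch/tmp`, because it did not select the `unrelated` log; no single authoritative namespace exists, only the views that jobs assemble for themselves.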
