Cudele: An API and Framework for Programmable Consistency and Durability in a Global Namespace

Developers of HPC and data-center-scale applications are abandoning POSIX IO because the file system metadata synchronization and serialization overheads incurred in providing strong consistency and durability are too costly, and often unnecessary, for their applications. Unfortunately, designing file systems with weaker consistency or durability semantics excludes applications that rely on stronger guarantees, forcing developers to rewrite their applications or deploy them on a different system. We present a framework and API that lets administrators specify their consistency and durability requirements and dynamically assign them to subtrees in the same namespace, so that subtrees can be optimized over time and space for different workloads. We show speedups similar to those reported in related work, but more importantly, we show performance improvements when we custom-fit subtree semantics to applications such as checkpoint-restart (91.7x speedup), user home directories (0.03 standard deviation from optimal), and users checking for partial results (2% overhead).
