The Full Path to Full-Path Indexing

Full-path indexing can improve I/O efficiency for workloads that operate on data organized using traditional, hierarchical directories, because data is placed on persistent storage in scan order. Prior results indicate, however, that renames in a local file system with full-path indexing are prohibitively expensive. This paper shows how to use full-path indexing in a file system to realize fast directory scans, writes, and renames. The paper introduces a range-rename mechanism for efficient key-space changes in a write-optimized dictionary. This mechanism is encapsulated in the key-value API and simplifies the overall file system design. We implemented this mechanism in BetrFS, an in-kernel, local file system for Linux. This new version, BetrFS 0.4, performs recursive greps 1.5x faster and random writes 1.2x faster than BetrFS 0.3, while its renames are competitive with indirection-based file systems for a range of sizes. BetrFS 0.4 outperforms BetrFS 0.3, as well as traditional file systems, such as ext4, XFS, and ZFS, across a variety of workloads.
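To make the trade-off concrete, the following is a minimal sketch of full-path indexing over an in-memory sorted key list. It is not the BetrFS implementation or its range-rename mechanism: the class name `FullPathIndex` and its methods are hypothetical, and the rename shown here is the naive approach that rewrites every key under the source directory, which illustrates why renames are costly without a dedicated key-space operation.

```python
import bisect


class FullPathIndex:
    """Toy full-path index (hypothetical, for illustration only).

    Keys are full paths kept in sorted order, so all entries under one
    directory are adjacent and a directory scan is a single range query.
    """

    def __init__(self):
        self._keys = []   # sorted list of full paths
        self._vals = {}   # full path -> file data

    def put(self, path, data):
        if path not in self._vals:
            bisect.insort(self._keys, path)
        self._vals[path] = data

    def _range(self, prefix):
        # Every key starting with "dir/" sorts in ["dir/", "dir0"),
        # because "0" is the character right after "/".
        upper = prefix[:-1] + chr(ord("/") + 1)
        lo = bisect.bisect_left(self._keys, prefix)
        hi = bisect.bisect_left(self._keys, upper)
        return lo, hi

    def scan(self, directory):
        """Return all paths below `directory`, in key (scan) order."""
        lo, hi = self._range(directory.rstrip("/") + "/")
        return self._keys[lo:hi]

    def rename(self, src, dst):
        """Naive directory rename: rewrite every key under `src`.

        The work is proportional to the number of entries in the subtree,
        which is the cost a range-rename operation is meant to avoid.
        Returns the number of keys rewritten.
        """
        src_prefix = src.rstrip("/") + "/"
        dst_prefix = dst.rstrip("/") + "/"
        lo, hi = self._range(src_prefix)
        moved = self._keys[lo:hi]
        del self._keys[lo:hi]
        for old in moved:
            new = dst_prefix + old[len(src_prefix):]
            self.put(new, self._vals.pop(old))
        return len(moved)
```

In this sketch a scan touches one contiguous key range, while a rename touches every key in the subtree; an indirection-based design would instead update a single directory entry.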
