RFS: An LSM-Tree-Based File System for Enhanced Microdata Performance

File system workloads are increasingly write-heavy. The growing RAM capacity of modern nodes allows many reads to be satisfied from memory, while writes must still be persisted to disk. Today's sophisticated local file systems, such as Ext4, XFS and Btrfs, are optimized for reads but perform poorly on workloads dominated by microdata (metadata and tiny files). In this paper we present RFS, an LSM-tree-based file system that aims to take advantage of the write optimization of the LSM-tree to provide enhanced microdata performance while offering matching performance for large files. RFS incrementally partitions the namespace into several metadata columns on a per-directory basis, preserving disk locality for directories and reducing the write amplification of LSM-trees. A write-ordered log-structured layout is used to store small files efficiently, rather than embedding the contents of small files into inodes. We also propose an optimization of global Bloom filters for efficient point lookups. Experiments show that our library version of RFS handles microwrite-intensive workloads 2-10 times faster than existing solutions such as Ext4, Btrfs and XFS.

key words: local file systems, log-structured merge tree, microdata, namespace partitioning, write-ordered write-optimized store
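To illustrate why a Bloom filter helps point lookups in an LSM-tree-based store, the following is a minimal sketch and not the RFS implementation. The abstract does not specify the design of the global Bloom filters, so everything here is an assumption made for illustration: the class names `BloomFilter` and `FilteredStore`, the filter size, the number of hash functions, and the use of a Python dict as a stand-in for on-disk tables.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        # Force the step to be odd so it is never zero.
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


class FilteredStore:
    """Hypothetical store: a dict stands in for the on-disk tables."""

    def __init__(self):
        self.filter = BloomFilter()
        self.table = {}
        self.table_reads = 0  # counts lookups that reach the table

    def put(self, path: str, value: bytes) -> None:
        self.filter.add(path)
        self.table[path] = value

    def get(self, path: str):
        if not self.filter.might_contain(path):
            return None  # filter rules the key out: skip the table read
        self.table_reads += 1
        return self.table.get(path)


store = FilteredStore()
store.put("/home/a/file1", b"tiny")
print(store.get("/home/a/file1"))
print(store.get("/home/a/missing"))
```

A Bloom filter never reports a stored key as absent, so a negative answer lets the lookup skip the table entirely. A positive answer may be a false positive, in which case the table is still consulted and returns nothing.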
