Efficient Compactions between Storage Tiers with PrismDB

In recent years, emerging storage hardware technologies have pursued divergent goals: better performance or lower cost-per-bit. Correspondingly, data systems that employ these technologies are optimized to be either fast (but expensive) or cheap (but slow). We take a different approach: by combining multiple tiers of fast and low-cost storage technologies within the same system, we can achieve a Pareto-efficient balance between performance and cost-per-bit. This paper presents the design and implementation of PrismDB, a novel key-value store based on log-structured merge trees that exploits a full spectrum of heterogeneous storage technologies (from 3D XPoint to QLC NAND). We introduce the notion of "read-awareness" to log-structured merge trees, which allows hot objects to be pinned to faster storage, achieving better tiering and hot-cold separation of objects. Compared to the standard use of RocksDB on flash in datacenters today, PrismDB's average throughput on heterogeneous storage is 2.3$\times$ higher and its tail latency is more than an order of magnitude better, using hardware that is half the cost.
