KVell: the design and implementation of a fast persistent key-value store

Modern block-addressable NVMe SSDs provide much higher bandwidth and similar performance for random and sequential access. Persistent key-value stores (KVs) designed for earlier storage devices, using either Log-Structured Merge (LSM) or B trees, do not take full advantage of these new devices. Logic to avoid random accesses, expensive operations for keeping data sorted on disk, and synchronization bottlenecks make these KVs CPU-bound on NVMe SSDs. We present a new persistent KV design. Unlike earlier designs, no attempt is made at sequential access, and data is not sorted when stored on disk. A shared-nothing philosophy is adopted to avoid synchronization overhead. Together with batching of device accesses, these design decisions make for read and write performance close to device bandwidth. Finally, maintaining an inexpensive partial sort in memory produces adequate scan performance. We implement this design in KVell, the first persistent KV able to utilize modern NVMe SSDs at maximum bandwidth. We compare KVell against available state-of-the-art LSM and B tree KVs, both with synthetic benchmarks and production workloads. KVell achieves throughput at least 2x that of its closest competitor on read-dominated workloads, and 5x on write-dominated workloads. For workloads that contain mostly scans, KVell performs comparably to or better than its competitors. KVell provides maximum latencies an order of magnitude lower than the best of its competitors, even on scan-based workloads.
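
The design points named in the abstract (unsorted on-disk data, a sorted in-memory index, and shared-nothing partitions) can be illustrated with a toy, purely in-memory sketch. This is not KVell's code: the names `Worker` and `ToyStore`, the use of a Python list to stand in for on-disk slots, and the CRC32-based assignment of keys to workers are all assumptions made for illustration. Batching of device accesses and real I/O are not modeled.

```python
import bisect
import zlib


class Worker:
    """One shared-nothing partition: it owns its index and its slots."""

    def __init__(self):
        self.slots = []        # stands in for on-disk slots; never kept sorted
        self.sorted_keys = []  # in-memory sorted keys, used only for scans
        self.location = {}     # in-memory map: key -> slot number

    def put(self, key, value):
        slot = self.location.get(key)
        if slot is None:
            # New key: take the next free slot, regardless of key order.
            self.location[key] = len(self.slots)
            self.slots.append((key, value))
            bisect.insort(self.sorted_keys, key)
        else:
            # Existing key: overwrite in place (a random write on a real device).
            self.slots[slot] = (key, value)

    def get(self, key):
        slot = self.location.get(key)
        return None if slot is None else self.slots[slot][1]

    def keys_from(self, start, limit):
        # First `limit` keys >= start in this partition, in sorted order.
        i = bisect.bisect_left(self.sorted_keys, start)
        return self.sorted_keys[i:i + limit]


class ToyStore:
    def __init__(self, n_workers=4):
        self.workers = [Worker() for _ in range(n_workers)]

    def _owner(self, key):
        # Assumption: keys are routed by a deterministic hash of the key.
        return self.workers[zlib.crc32(key.encode()) % len(self.workers)]

    def put(self, key, value):
        self._owner(key).put(key, value)

    def get(self, key):
        return self._owner(key).get(key)

    def scan(self, start, limit):
        # Merge each partition's sorted candidates, then fetch the values.
        candidates = []
        for worker in self.workers:
            candidates.extend(worker.keys_from(start, limit))
        candidates.sort()
        return [(k, self.get(k)) for k in candidates[:limit]]
```

In this sketch a point lookup or update touches exactly one partition and needs no lock shared with the others, while a scan consults every partition's in-memory index and then reads values from wherever they happen to sit, which is why it relies on random reads being cheap.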
