Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging

In this paper, we show that all mainstream LSM-tree based key-value stores in the literature and in industry are suboptimal with respect to how they trade off among the I/O costs of updates, point lookups, and range lookups, as well as the cost of storage, measured as space-amplification. The reason is that they perform expensive merge operations in order to (1) bound the number of runs that a lookup has to probe, and (2) remove obsolete entries to reclaim space. However, most of these merge operations reduce point lookup cost, long range lookup cost, and space-amplification by a negligible amount. To address this problem, we expand the LSM-tree design space with Lazy Leveling, a new design that prohibits merge operations at every level of the LSM-tree except the largest. We show that Lazy Leveling improves the worst-case cost complexity of updates while maintaining the same bounds on point lookup cost, long range lookup cost, and space-amplification. To be able to navigate between Lazy Leveling and other designs, we make the LSM-tree design space fluid by introducing Fluid LSM-tree, a generalization of LSM-tree that can be parameterized to assume all existing LSM-tree designs. We show how to fluidly transition from Lazy Leveling to (1) designs that are more optimized for updates by merging less at the largest level, and (2) designs that are more optimized for small range lookups by merging more at all other levels. We put everything together to design Dostoevsky, a key-value store that navigates the entire Fluid LSM-tree design space based on the application workload and hardware to maximize throughput using a novel closed-form performance model. We implemented Dostoevsky on top of RocksDB, and we show that it strictly dominates state-of-the-art LSM-tree based key-value stores in terms of performance and space-amplification.
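The parameterization described above can be illustrated with a small sketch. It is not the authors' implementation: the class and function names are hypothetical, and the two knobs (a bound on the number of runs at the smaller levels and a separate bound at the largest level) are an assumption about how a Fluid LSM-tree might be configured. The abstract does not name these parameters. Under that assumption, leveling, tiering, and Lazy Leveling become three settings of the same structure, and the number of runs a lookup may have to probe follows directly.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FluidLSMConfig:
    """Hypothetical configuration of a Fluid LSM-tree (names are assumptions)."""
    size_ratio: int        # capacity ratio between adjacent levels
    levels: int            # number of levels in the tree
    max_runs_small: int    # assumed bound on runs at every level but the largest
    max_runs_largest: int  # assumed bound on runs at the largest level

    def max_runs_per_level(self) -> List[int]:
        # Levels are listed from smallest to largest.
        return [self.max_runs_small] * (self.levels - 1) + [self.max_runs_largest]

    def max_runs_probed(self) -> int:
        # Worst-case number of runs a lookup may have to probe.
        return sum(self.max_runs_per_level())


def leveling(size_ratio: int, levels: int) -> FluidLSMConfig:
    # Merge eagerly everywhere: at most one run per level.
    return FluidLSMConfig(size_ratio, levels, 1, 1)


def tiering(size_ratio: int, levels: int) -> FluidLSMConfig:
    # Merge lazily everywhere: runs accumulate at every level.
    return FluidLSMConfig(size_ratio, levels, size_ratio - 1, size_ratio - 1)


def lazy_leveling(size_ratio: int, levels: int) -> FluidLSMConfig:
    # Merge lazily at smaller levels, eagerly at the largest level only.
    return FluidLSMConfig(size_ratio, levels, size_ratio - 1, 1)


if __name__ == "__main__":
    for name, make in [("leveling", leveling), ("tiering", tiering),
                       ("lazy leveling", lazy_leveling)]:
        cfg = make(4, 3)
        print(name, cfg.max_runs_per_level(), cfg.max_runs_probed())
```

With a size ratio of 4 and 3 levels, the sketch yields run bounds of `[1, 1, 1]` for leveling, `[3, 3, 3]` for tiering, and `[3, 3, 1]` for Lazy Leveling. The largest level holds most of the data, so keeping it as a single run is what lets Lazy Leveling retain the space-amplification and long range lookup bounds the abstract mentions while merging less elsewhere.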
