Optimal Bloom Filters and Adaptive Merging for LSM-Trees

In this article, we show that key-value stores backed by a log-structured merge-tree (LSM-tree) exhibit an intrinsic tradeoff between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal and difficult-to-tune tradeoff among these metrics. We trace the problem to the fact that modern key-value stores suboptimally co-tune the merge policy, the buffer size, and the Bloom filters’ false-positive rates across the LSM-tree’s different levels. We present Monkey, an LSM-tree-based key-value store that strikes the optimal balance between the costs of updates and lookups for any given main memory budget. The core insight is that worst-case lookup cost is proportional to the sum of the false-positive rates of the Bloom filters across all levels of the LSM-tree. In contrast to state-of-the-art key-value stores, which assign a fixed number of bits per element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize the sum of their false-positive rates. We show analytically that Monkey reduces the asymptotic complexity of the worst-case lookup I/O cost, and we verify empirically, using an implementation on top of RocksDB, that Monkey reduces lookup latency by an increasing margin as the data volume grows (50-80% for the data sizes we experimented with). Furthermore, we map the design space onto a closed-form model that enables adapting the merging frequency and memory allocation to strike the best tradeoff among lookup cost, update cost, and main memory, depending on the workload (proportion of lookups and updates), the dataset (number and size of entries), and the underlying hardware (main memory available, disk vs. flash). We show how to use this model to answer what-if design questions about how changes in environmental parameters impact performance and how to adapt the design of the key-value store for optimal performance.
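The core insight above can be illustrated with a small numerical sketch. This is not Monkey's implementation. It assumes the standard Bloom filter approximation p = exp(-(bits/entries) * ln(2)^2) with an optimal number of hash functions, and it treats the sum of per-level false-positive rates as the quantity to minimize. Under those assumptions, a Lagrange-multiplier argument gives p_i = min(1, mu * n_i) for a level holding n_i entries, with mu chosen so that the filters exactly fit the memory budget. The function names, the level sizes, and the budget of 5 bits per entry are all hypothetical.

```python
import math

LN2_SQ = math.log(2) ** 2


def bits_for_fpr(p, entries):
    """Bits a Bloom filter needs for false-positive rate p (optimal hash count assumed)."""
    return -entries * math.log(p) / LN2_SQ


def uniform_fprs(level_sizes, total_bits):
    """Baseline: every level gets the same number of bits per entry."""
    bits_per_entry = total_bits / sum(level_sizes)
    p = math.exp(-bits_per_entry * LN2_SQ)
    return [p for _ in level_sizes]


def optimized_fprs(level_sizes, total_bits):
    """Minimize sum(p_i) under the memory budget: p_i = min(1, mu * n_i)."""

    def memory(mu):
        # Total bits used when each level's rate is min(1, mu * n); decreasing in mu.
        return sum(bits_for_fpr(min(1.0, mu * n), n) for n in level_sizes)

    hi = 1.0 / min(level_sizes)  # all rates are 1 here, so no memory is used
    lo = hi
    while memory(lo) < total_bits:  # shrink mu until the budget is exceeded
        lo /= 2.0
    for _ in range(200):  # bisect on a log scale; hi always stays within budget
        mid = math.sqrt(lo * hi)
        if memory(mid) > total_bits:
            lo = mid
        else:
            hi = mid
    return [min(1.0, hi * n) for n in level_sizes]


if __name__ == "__main__":
    levels = [1_000, 10_000, 100_000, 1_000_000]  # hypothetical: size ratio 10
    budget = 5 * sum(levels)                      # hypothetical: 5 bits per entry overall
    print("uniform   sum of FPRs:", sum(uniform_fprs(levels, budget)))
    print("optimized sum of FPRs:", sum(optimized_fprs(levels, budget)))
```

With the same total memory, the optimized allocation gives smaller levels much lower false-positive rates and lets the largest level carry most of the error, which lowers the sum that the abstract identifies as governing worst-case lookup cost.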
