From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees

We introduce BOURBON, a log-structured merge (LSM) tree that uses machine learning to provide fast lookups. We base the design and implementation of BOURBON on empirically grounded principles that we derive through careful analysis of LSM design. BOURBON employs greedy piecewise linear regression to learn key distributions, enabling fast lookups with minimal computation, and applies a cost-benefit strategy to decide when learning will be worthwhile. Through a series of experiments on both synthetic and real-world datasets, we show that BOURBON improves lookup performance by 1.23x-1.78x compared to state-of-the-art production LSMs.
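To make the idea of learning a key distribution with greedy piecewise linear regression more concrete, the following is a minimal sketch, not BOURBON's implementation. It assumes a sorted array of unique integer keys and uses a simple "shrinking cone" greedy fit, which may differ from the exact algorithm the system uses. The names `greedy_plr`, `lookup`, and `delta` (the error bound on the predicted position) are hypothetical and chosen only for illustration.

```python
import math
from bisect import bisect_left, bisect_right


def greedy_plr(keys, delta):
    """Greedily fit line segments mapping key -> position in `keys`.

    Assumes `keys` is sorted and strictly increasing. Each segment is
    (start_key, slope, start_pos); within a segment the predicted position
    of every fitted key is within `delta` of its true index.
    """
    segments = []
    n = len(keys)
    i = 0
    while i < n:
        x0, y0 = keys[i], i
        lo, hi = 0.0, math.inf  # feasible slope range (the "cone")
        j = i + 1
        while j < n:
            dx = keys[j] - x0
            # Slopes that keep point j within delta of a line through (x0, y0).
            new_lo = max(lo, (j - y0 - delta) / dx)
            new_hi = min(hi, (j - y0 + delta) / dx)
            if new_lo > new_hi:
                break  # cone is empty: close this segment, start a new one
            lo, hi = new_lo, new_hi
            j += 1
        slope = 0.0 if hi == math.inf else (lo + hi) / 2
        segments.append((x0, slope, y0))
        i = j
    return segments


def lookup(keys, segments, starts, key, delta):
    """Return the index of `key` in `keys`, or None if it is absent.

    `starts` is the list of segment start keys, precomputed by the caller.
    """
    s = bisect_right(starts, key) - 1
    if s < 0:
        return None  # key is smaller than every stored key
    x0, slope, y0 = segments[s]
    pred = y0 + slope * (key - x0)
    # Search only a small window around the prediction; the extra slot on
    # each side is slack for floating-point rounding.
    lo = max(0, math.floor(pred - delta) - 1)
    hi = min(len(keys), math.ceil(pred + delta) + 2)
    idx = bisect_left(keys, key, lo, hi)
    if idx < hi and keys[idx] == key:
        return idx
    return None


keys = sorted(set((x * x) % 10007 for x in range(1, 500)))
delta = 4
segments = greedy_plr(keys, delta)
starts = [seg[0] for seg in segments]
found = [lookup(keys, segments, starts, k, delta) for k in keys]
```

In this sketch a lookup costs one binary search over the segment start keys plus a search over a window of roughly `2 * delta` positions, instead of a binary search over the whole key array. A larger `delta` produces fewer segments but a wider final search window.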
