From Auto-tuning One Size Fits All to Self-designed and Learned Data-intensive Systems

We survey new opportunities to design data systems, data structures, and algorithms that can adapt to both data and query workloads. Data keeps growing, hardware keeps changing, and new applications appear ever more frequently. One size does not fit all, and data-intensive applications want to balance and control memory requirements, read costs, write costs, and monetary costs on the cloud. This calls for tailored data systems, storage, and computation solutions that match the exact requirements of the scenario at hand. Such systems should be "synthesized" quickly and nearly automatically, removing human system designers and administrators from the loop as much as possible, to keep up with the quick evolution of applications and workloads. In addition, such systems should "learn" from both past and current system performance and workload patterns to keep adapting their design. We survey new trends in 1) self-designed and 2) learned data systems, and how these technologies can apply to relational, NoSQL, and big data systems as well as to broad data science applications. We focus on both recent research advances and practical applications of this technology, as well as numerous open research opportunities that come from their fusion. We specifically highlight recent work on data structures, algorithms, and query optimization, and how machine-learning-inspired designs, together with a detailed mapping of the possible design space of solutions, can drive innovation to create tailored systems. We also position and connect with past seminal system designs and research in auto-tuning, modular/extensible, and adaptive data systems to highlight the new challenges as well as the opportunities to combine past and new technologies.
