Learning Data Structure Alchemy

We propose a solution, based on first principles and AI, to the decades-old problem of data structure design. Instead of working on individual designs, each of which helps only in a small set of environments, we propose constructing an engine, a Data Alchemist, that learns how to blend fine-grained data structure design principles to automatically synthesize brand new data structures.

1 Computing Instead of Inventing Data Structures

[Figure: labels recovered from extraction: read, update, and memory performance trade-offs; data structures; databases; access patterns; hardware; cloud costs; key-value pairs; table, LSM, hash, B-tree; machine]
