The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models

Data structures are critical in any data-driven scenario, but they are notoriously hard to design: the design space is massive, and performance depends on workload and hardware, both of which evolve continuously. We present a design engine, the Data Calculator, which enables interactive and semi-automated design of data structures. It brings two innovations. First, it offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. This allows for a structured description of the universe of possible data structure designs that can be synthesized as combinations of those primitives. The second innovation is the computation of performance using learned cost models. These models are trained on diverse hardware and data profiles and capture the cost properties of fundamental data access primitives (e.g., random access). With these models, we synthesize the performance cost of complex operations on arbitrary data structure designs without having to 1) implement the data structure, 2) run the workload, or even 3) access the target hardware. We demonstrate that the Data Calculator can assist data structure designers and researchers by accurately answering rich what-if design questions in a few seconds or minutes, i.e., computing how the performance (response time) of a given data structure design is impacted by variations in 1) the design, 2) the hardware, 3) the data, and 4) the query workload. This makes it easy to test numerous designs and ideas before embarking on lengthy implementation, deployment, and hardware acquisition steps. We also demonstrate that the Data Calculator can synthesize entirely new designs, auto-complete partial designs, and detect suboptimal design choices.
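
To make the idea of cost synthesis concrete, the following is a minimal sketch, not the Data Calculator's actual implementation. It assumes a design can be described as a list of levels, each with a fanout and an in-node search primitive, and that each access primitive has a cost model mapping input size to latency. The names `COST_MODELS`, `Level`, and `synthesize_lookup_cost` are hypothetical, and the coefficients are placeholders standing in for models that would be learned from benchmarks on real hardware.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List

# A cost model maps an input size (number of entries) to latency in ns.
CostModel = Callable[[int], float]

# Placeholder models (assumption): in a real system these would be fitted
# to measurements on the target hardware, not hand-written constants.
COST_MODELS: Dict[str, CostModel] = {
    "random_access": lambda n: 100.0,                       # one pointer hop
    "binary_search": lambda n: 5.0 * math.log2(max(n, 2)),  # sorted node
    "scan": lambda n: 1.0 * n,                              # unsorted node
}


@dataclass
class Level:
    """One level of a hypothetical tree-like design."""
    fanout: int   # entries stored in each node at this level
    search: str   # access primitive used to search inside a node


def synthesize_lookup_cost(levels: List[Level],
                           models: Dict[str, CostModel]) -> float:
    """Estimate point-lookup latency by summing primitive costs per level."""
    total = 0.0
    for level in levels:
        total += models["random_access"](1)           # reach the node
        total += models[level.search](level.fanout)   # search inside it
    return total


# What-if question: does sorting the leaf level pay off for lookups?
unsorted_leaf = [Level(16, "binary_search"), Level(16, "binary_search"),
                 Level(64, "scan")]
sorted_leaf = [Level(16, "binary_search"), Level(16, "binary_search"),
               Level(64, "binary_search")]

print(synthesize_lookup_cost(unsorted_leaf, COST_MODELS))
print(synthesize_lookup_cost(sorted_leaf, COST_MODELS))
```

Under these placeholder models, both designs are compared without implementing either structure or running a workload, which is the kind of what-if comparison the abstract describes. Swapping in models trained on a different machine would change the estimates without touching the design description.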
