Online Sketch-based Query Optimization

Cost-based query optimization remains a critical task in relational databases even after decades of research and industrial development. Query optimizers rely on a large range of statistical synopses – including attribute-level histograms and table-level samples – for accurate cardinality estimation. As the complexity of selection predicates and the number of join predicates increase, two problems arise. First, statistics cannot be incrementally composed to effectively estimate the cost of the sub-plans generated in plan enumeration. Second, small errors are propagated exponentially through join operators, which can lead to severely sub-optimal plans. In this paper, we introduce COMPASS, a novel query optimization paradigm for in-memory databases based on a single type of statistics—Fast-AGMS sketches. In COMPASS, query optimization and execution are intertwined. Selection predicates and sketch updates are pushed-down and evaluated online during query optimization. This allows Fast-AGMS sketches to be computed only over the relevant tuples—which enhances cardinality estimation accuracy. Plan enumeration is performed over the query join graph by incrementally composing attribute-level sketches—not by building a separate sketch for every sub-plan. We prototype COMPASS in MapD – an open-source parallel database – and perform extensive experiments over the complete JOB benchmark. The results prove that COMPASS generates better execution plans – both in terms of cardinality and runtime – compared to four other database systems. Overall, COMPASS achieves a speedup ranging from 1.35X to 11.28X in cumulative query execution time over the considered competitors.

[1]  Tong Yang,et al.  SF-Sketch: A Two-Stage Sketch for Data Streams , 2017, IEEE Transactions on Parallel and Distributed Systems.

[2]  Gustavo Alonso,et al.  Augmented Sketch: Faster and More Accurate Stream Processing , 2016, SIGMOD Conference.

[3]  Nitesh V. Chawla,et al.  A Black-Box Approach to Query Cardinality Estimation , 2007, CIDR.

[4]  Florin Rusu,et al.  Sketching Sampled Data Streams , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[5]  Jeffrey F. Naughton,et al.  Sampling-Based Query Re-Optimization , 2016, SIGMOD Conference.

[6]  Noga Alon,et al.  Tracking join and self-join sizes in limited storage , 1999, PODS '99.

[7]  Jennifer Widom,et al.  Database Systems: The Complete Book , 2001 .

[8]  Wen-Chi Hou,et al.  CS2: a new database synopsis for query estimation , 2013, SIGMOD '13.

[9]  Tim Kraska,et al.  Neo: A Learned Query Optimizer , 2019, Proc. VLDB Endow..

[10]  Florin Rusu,et al.  Sketches for size of join estimation , 2008, TODS.

[11]  David J. DeWitt,et al.  Efficient mid-query re-optimization of sub-optimal query execution plans , 1998, SIGMOD '98.

[12]  Florin Rusu,et al.  Pseudo-random number generation for sketch-based estimations , 2007, TODS.

[13]  Rajeev Rastogi,et al.  Processing complex aggregate queries over data streams , 2002, SIGMOD '02.

[14]  Guido Moerkotte,et al.  Errata for "Analysis of two existing and one new dynamic programming algorithm for the generation of optimal bushy join trees without cross products" , 2006, Proc. VLDB Endow..

[15]  Volker Markl,et al.  Estimating Join Selectivities using Bandwidth-Optimized Kernel Density Models , 2017, Proc. VLDB Endow..

[16]  Stavros Christodoulakis,et al.  On the propagation of errors in the size of join results , 1991, SIGMOD '91.

[17]  Dan Suciu,et al.  Pessimistic Cardinality Estimation: Tighter Upper Bounds for Intermediate Join Cardinalities , 2019, SIGMOD Conference.

[18]  Surajit Chaudhuri,et al.  An overview of query optimization in relational systems , 1998, PODS.

[19]  Volker Markl,et al.  Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation , 2015, SIGMOD Conference.

[20]  Magdalena Balazinska,et al.  An Empirical Analysis of Deep Learning for Cardinality Estimation , 2019, ArXiv.

[21]  Rajeev Rastogi,et al.  Sketch-Based Multi-Query Processing over Data Streams , 2004, Data Stream Management.

[22]  Florin Rusu,et al.  Statistical analysis of sketch estimators , 2007, SIGMOD '07.

[23]  Jens Teubner,et al.  Pipelined Query Processing in Coprocessor Environments , 2018, SIGMOD Conference.

[24]  Olga Papaemmanouil,et al.  Towards a Hands-Free Query Optimizer through Deep Learning , 2018, CIDR.

[25]  Yannis E. Ioannidis,et al.  Selectivity Estimation Without the Attribute Value Independence Assumption , 1997, VLDB.

[26]  A. Meister GPU-accelerated join-order optimization , 2015 .

[27]  Tim Kraska,et al.  The Case for Learned Index Structures , 2018 .

[28]  Immanuel Trummer,et al.  SkinnerDB: Regret-Bounded Query Evaluation via Reinforcement Learning , 2018, Proc. VLDB Endow..

[29]  Nick Koudas,et al.  Multi-Attribute Selectivity Estimation Using Deep Learning , 2019, ArXiv.

[30]  Martin L. Kersten,et al.  MonetDB: Two Decades of Research in Column-oriented Database Architectures , 2012, IEEE Data Eng. Bull..

[31]  Andreas Kipf,et al.  Estimating Cardinalities with Deep Sketches , 2019, SIGMOD Conference.

[32]  David Maier,et al.  Rapid bushy join-order optimization with Cartesian products , 1996, SIGMOD '96.

[33]  Andreas Kipf,et al.  Learned Cardinalities: Estimating Correlated Joins with Deep Learning , 2018, CIDR.

[34]  Volker Markl,et al.  LEO: An autonomic query optimizer for DB2 , 2003, IBM Syst. J..

[35]  Guido Moerkotte,et al.  Heuristic and randomized optimization for the join ordering problem , 1997, The VLDB Journal.

[36]  Joseph M. Hellerstein,et al.  Eddies: continuously adaptive query processing , 2000, SIGMOD '00.

[37]  Graham Cormode,et al.  Sketching Streams Through the Net: Distributed Approximate Query Tracking , 2005, VLDB.

[39]  Wolfgang Lehner,et al.  Simplicity Done Right for Join Ordering , 2021, CIDR.

[40]  Calisto Zuzarte,et al.  Cardinality estimation using neural networks , 2015, CASCON.

[41]  Alex Suhan,et al.  Exact Selectivity Computation for Modern In-Memory Database Query Optimization , 2019, ArXiv.

[42]  Srikanth Kandula,et al.  Selectivity Estimation for Range Predicates using Lightweight Models , 2019, Proc. VLDB Endow..

[43]  Viktor Leis,et al.  How Good Are Query Optimizers, Really? , 2015, Proc. VLDB Endow..

[44]  Viktor Leis,et al.  Query optimization through the looking glass, and what we found running the Join Order Benchmark , 2017, The VLDB Journal.

[45]  Florin Rusu,et al.  Fast range-summable random variables for efficient aggregate estimation , 2006, SIGMOD Conference.

[46]  Patricia G. Selinger,et al.  Access path selection in a relational database management system , 1979, SIGMOD '79.

[47]  Viktor Leis,et al.  Cardinality Estimation Done Right: Index-Based Join Sampling , 2017, CIDR.

[48]  Bingsheng He,et al.  Relational query coprocessing on graphics processors , 2009, TODS.

[49]  Gunter Saake,et al.  Challenges for a GPU-Accelerated Dynamic Programming Approach for Join-Order Optimization , 2016, GvD.

[50]  Ion Stoica,et al.  Learning to Optimize Join Queries With Deep Reinforcement Learning , 2018, ArXiv.

[51]  David Vengerov,et al.  Join Size Estimation Subject to Filter Conditions , 2015, Proc. VLDB Endow..

[52]  Noga Alon,et al.  The space complexity of approximating the frequency moments , 1996, STOC '96.

[53]  Guy M. Lohman,et al.  Is query optimization a 'solved' problem? , 1989 .

[54]  Riham Abdel Kader,et al.  ROX: run-time optimization of XQueries , 2009, SIGMOD Conference.

[55]  Wolfgang Lehner,et al.  Cardinality estimation with local deep learning models , 2019, aiDM@SIGMOD.

[56]  Guido Moerkotte,et al.  Improved Selectivity Estimation by Combining Knowledge from Sampling and Synopses , 2018, Proc. VLDB Endow..

[57]  Olga Papaemmanouil,et al.  Deep Reinforcement Learning for Join Order Enumeration , 2018, aiDM@SIGMOD.

[58]  Marina Papatriantafilou,et al.  Delegation sketch: a parallel design with support for fast and accurate concurrent operations , 2020, EuroSys.

[59]  Gunter Saake,et al.  GPU-Accelerated Database Systems: Survey and Open Challenges , 2014, Trans. Large Scale Data Knowl. Centered Syst..