A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration

Query optimizer is at the heart of the database systems. Cost-based optimizer studied in this paper is adopted in almost all current database systems. A cost-based optimizer introduces a plan enumeration algorithm to find a (sub)plan, and then uses a cost model to obtain the cost of that plan, and selects the plan with the lowest cost. In the cost model, cardinality, the number of tuples through an operator, plays a crucial role. Due to the inaccuracy in cardinality estimation, errors in cost model, and the huge plan space, the optimizer cannot find the optimal execution plan for a complex query in a reasonable time. In this paper, we first deeply study the causes behind the limitations above. Next, we review the techniques used to improve the quality of the three key components in the cost-based optimizer, cardinality estimation, cost model, and plan enumeration. We also provide our insights on the future directions for each of the above aspects.

[1]  Yossi Matias,et al.  Bifocal sampling for skew-resistant join size estimation , 1996, SIGMOD '96.

[2]  Christoph Koch,et al.  Solving the Join Ordering Problem via Mixed Integer Linear Programming , 2015, SIGMOD Conference.

[3]  Yannis Manolopoulos,et al.  A Bi-objective Cost Model for Database Queries in a Multi-cloud Environment , 2014, MEDES.

[4]  Volker Markl,et al.  Estimating Join Selectivities using Bandwidth-Optimized Kernel Density Models , 2017, Proc. VLDB Endow..

[5]  Christian S. Jensen,et al.  Efficiently adapting graphical models for selectivity estimation , 2012, The VLDB Journal.

[6]  Rajeev Rastogi,et al.  Processing Data-Stream Join Aggregates Using Skimmed Sketches , 2004, EDBT.

[7]  Guido Moerkotte,et al.  Dynamic programming strikes back , 2008, SIGMOD Conference.

[8]  Alekh Jindal,et al.  Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings , 2020, SIGMOD Conference.

[9]  Yoon-Min Nam Nam,et al.  SPRINTER: A Fast n-ary Join Query Processing Method for Complex OLAP Queries , 2020, SIGMOD Conference.

[10]  Jeffrey F. Naughton,et al.  Sampling-Based Query Re-Optimization , 2016, SIGMOD Conference.

[11]  Eli Upfal,et al.  Learning-based Query Performance Modeling and Prediction , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[12]  Jyoti Leeka,et al.  Incorporating Super-Operators in Big-Data Query Optimizers , 2019, Proc. VLDB Endow..

[13]  Liwei Wang,et al.  Deep Reinforcement Learning-Based Approach to Tackle Topic-Aware Influence Maximization , 2020, Data Science and Engineering.

[14]  Andreas Kipf,et al.  Estimating Cardinalities with Deep Sketches , 2019, SIGMOD Conference.

[15]  Ronitt Rubinfeld,et al.  Approximating and testing k-histogram distributions in sub-linear time , 2012, PODS '12.

[16]  Sanjay Chawla,et al.  ML-based Cross-Platform Query Optimization , 2020, 2020 IEEE 36th International Conference on Data Engineering (ICDE).

[17]  Archana Ganapathi,et al.  Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[18]  Xudong Lin,et al.  A Cardinality Estimation Approach Based on Two Level Histograms , 2015, J. Inf. Sci. Eng..

[19]  Todd Eavis,et al.  Rk-hist: an r-tree based histogram for multi-dimensional selectivity estimation , 2007, CIKM '07.

[20]  Olga Papaemmanouil,et al.  Deep Reinforcement Learning for Join Order Enumeration , 2018, aiDM@SIGMOD.

[21]  Donald Kossmann,et al.  Iterative dynamic programming: a new class of query optimization algorithms , 2000, TODS.

[22]  Barzan Mozafari,et al.  QuickSel: Quick Selectivity Learning with Mixture Models , 2018, SIGMOD Conference.

[23]  Guy M. Lohman,et al.  Is query optimization a 'solved' problem? , 1989 .

[24]  Yu Chen,et al.  Two-Level Sampling for Join Size Estimation , 2017, SIGMOD Conference.

[25]  Felix Naumann,et al.  Cardinality Estimation: An Experimental Survey , 2017, Proc. VLDB Endow..

[26]  Cyrus Shahabi,et al.  Entropy-based histograms for selectivity estimation , 2013, CIKM.

[27]  Wolfgang Lehner,et al.  Cardinality estimation with local deep learning models , 2019, aiDM@SIGMOD.

[28]  Torsten Suel,et al.  Optimal Histograms with Quality Guarantees , 1998, VLDB.

[29]  Dimitrios Gunopulos,et al.  Selectivity estimators for multidimensional range queries over real attributes , 2005, The VLDB Journal.

[30]  Sudipto Guha,et al.  Approximation and streaming algorithms for histogram construction problems , 2006, TODS.

[31]  Surajit Chaudhuri,et al.  An overview of query optimization in relational systems , 1998, PODS.

[32]  P. Flajolet,et al.  HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm , 2007 .

[33]  Nick Koudas,et al.  Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries , 2020, SIGMOD Conference.

[34]  Kinji Ono,et al.  Cost estimation of user-defined methods in object-relational database systems , 1999, SGMD.

[35]  Guoliang Li,et al.  Reinforcement Learning with Tree-LSTM for Join Order Selection , 2020, 2020 IEEE 36th International Conference on Data Engineering (ICDE).

[36]  Jeffrey F. Naughton,et al.  Fixed-precision estimation of join selectivity , 1993, PODS '93.

[37]  Guido Moerkotte,et al.  A new, highly efficient, and easy to implement top-down join enumeration algorithm , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[38]  Guoliang Li,et al.  An End-to-End Learning-based Cost Estimator , 2019, Proc. VLDB Endow..

[39]  Ben Taskar,et al.  Selectivity estimation using probabilistic models , 2001, SIGMOD '01.

[40]  Christian S. Jensen,et al.  Lightweight graphical models for selectivity estimation without independence assumptions , 2011, Proc. VLDB Endow..

[41]  Toshihide Ibaraki,et al.  On the optimal nesting order for computing N-relational joins , 1984, TODS.

[42]  Peter J. Haas,et al.  Improved histograms for selectivity estimation of range predicates , 1996, SIGMOD '96.

[43]  Xi Chen,et al.  Deep Unsupervised Cardinality Estimation , 2019, Proc. VLDB Endow..

[44]  Franck Morvan,et al.  An Approach Based on Bayesian Networks for Query Selectivity Estimation , 2019, DASFAA.

[45]  Xi Chen,et al.  NeuroCard , 2020, Proc. VLDB Endow..

[46]  Nicolas Bruno,et al.  Polynomial heuristics for query optimization , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[47]  Guido Moerkotte,et al.  Top down plan generation: From theory to practice , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[48]  Jeffrey Scott Vitter,et al.  SASH: A Self-Adaptive Histogram Set for Dynamically Changing Workloads , 2003, VLDB.

[49]  Feilong Liu,et al.  Forecasting the cost of processing multi-join queries via hashing for main-memory databases , 2015, SoCC.

[50]  Zhen He,et al.  Self-tuning UDF Cost Modeling Using the Memory-Limited Quadtree , 2004, EDBT.

[51]  Martin L. Kersten,et al.  Generic Database Cost Models for Hierarchical Memory Systems , 2002, VLDB.

[52]  Surajit Chaudhuri,et al.  Robust Estimation of Resource Consumption for SQL Queries using Statistical Techniques , 2012, Proc. VLDB Endow..

[53]  Magdalena Balazinska,et al.  An Empirical Analysis of Deep Learning for Cardinality Estimation , 2019, ArXiv.

[54]  Guy M. Lohman,et al.  Measuring the Complexity of Join Enumeration in Query Optimization , 1990, VLDB.

[55]  Peter J. Haas,et al.  ISOMER: Consistent Histogram Construction Using Query Feedback , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[56]  Roland H. C. Yap,et al.  Local Search in Histogram Construction , 2010, AAAI.

[57]  Hiren Patel,et al.  Towards a Learning Optimizer for Shared Clouds , 2018, Proc. VLDB Endow..

[58]  Carsten Binnig,et al.  DeepDB , 2019, Proc. VLDB Endow..

[59]  Jianhua Feng,et al.  Query performance prediction for concurrent queries using graph embedding , 2020, Proc. VLDB Endow..

[60]  Chengliang Chai,et al.  Database Meets Artificial Intelligence: A Survey , 2020, IEEE Transactions on Knowledge and Data Engineering.

[61]  Chinmay Hegde,et al.  Fast and Near-Optimal Algorithms for Approximating Distributions by Histograms , 2015, PODS.

[62]  Ion Stoica,et al.  Learning to Optimize Join Queries With Deep Reinforcement Learning , 2018, ArXiv.

[63]  Guido Moerkotte,et al.  Effective and Robust Pruning for Top-Down Join Enumeration Algorithms , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[64]  Immanuel Trummer,et al.  SkinnerDB: Regret-Bounded Query Evaluation via Reinforcement Learning , 2018, Proc. VLDB Endow..

[65]  David Vengerov,et al.  Join Size Estimation Subject to Filter Conditions , 2015, Proc. VLDB Endow..

[66]  Xuemin Lin,et al.  Selectivity Estimation on Set Containment Search , 2019, Data Science and Engineering.

[67]  Thomas Neumann,et al.  Adaptive Optimization of Very Large Join Queries , 2018, SIGMOD Conference.

[68]  Zhen He,et al.  Self-tuning cost modeling of user-defined functions in an object-relational DBMS , 2005, TODS.

[69]  Jingren Zhou,et al.  SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[70]  Guido Moerkotte,et al.  Errata for "Analysis of two existing and one new dynamic programming algorithm for the generation of optimal bushy join trees without cross products" , 2006, Proc. VLDB Endow..

[71]  Jeffrey F. Naughton,et al.  End-biased Samples for Join Cardinality Estimation , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[72]  S. Sudarshan,et al.  Optimizing Join Enumeration in Transformation-based Query Optimizers , 2014, Proc. VLDB Endow..

[73]  Florin Rusu,et al.  Sketches for size of join estimation , 2008, TODS.

[74]  M. Seetha Lakshmi,et al.  Selectivity Estimation in Extensible Databases - A Neural Network Approach , 1998, VLDB.

[75]  Roland H. C. Yap,et al.  Fast and effective histogram construction , 2009, CIKM.

[76]  Guido Moerkotte,et al.  Counter Strike: Generic Top-Down Join Enumeration for Hypergraphs , 2013, Proc. VLDB Endow..

[77]  David Maier,et al.  Rapid bushy join-order optimization with Cartesian products , 1996, SIGMOD '96.

[78]  Frank Wm. Tompa,et al.  Optimal top-down join enumeration , 2007, SIGMOD '07.

[79]  Adith Swaminathan,et al.  Active Learning for ML Enhanced Database Systems , 2020, SIGMOD Conference.

[80]  Andreas Kipf,et al.  Learned Cardinalities: Estimating Correlated Joins with Deep Learning , 2018, CIDR.

[81]  Thomas Neumann,et al.  Query simplification: graceful degradation for join-order optimization , 2009, SIGMOD Conference.

[82]  Chee-Yong Chan,et al.  Improved Correlated Sampling for Join Size Estimation , 2020, 2020 IEEE 36th International Conference on Data Engineering (ICDE).

[83]  Guido Moerkotte,et al.  Heuristic and randomized optimization for the join ordering problem , 1997, The VLDB Journal.

[84]  Nitesh V. Chawla,et al.  A Black-Box Approach to Query Cardinality Estimation , 2007, CIDR.

[85]  A. Swami Optimization of Large Join Queries: Combining Heuristic and Combinatorial Techniques , 1989, SIGMOD Conference.

[86]  Wen-Chi Hou,et al.  CS2: a new database synopsis for query estimation , 2013, SIGMOD '13.

[87]  Tim Kraska,et al.  Neo: A Learned Query Optimizer , 2019, Proc. VLDB Endow..

[88]  Dan Suciu,et al.  Consistent Histograms In The Presence of Distinct Value Counts , 2009, Proc. VLDB Endow..

[89]  Jeffrey F. Naughton,et al.  Predicting query execution time: Are optimizer cost models really unusable? , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[90]  Olga Papaemmanouil,et al.  Plan-Structured Deep Neural Network Models for Query Performance Prediction , 2019, Proc. VLDB Endow..

[91]  Goetz Graefe,et al.  The Volcano optimizer generator: extensibility and efficient search , 1993, Proceedings of IEEE 9th International Conference on Data Engineering.

[92]  Patricia G. Selinger,et al.  Access path selection in a relational database management system , 1979, SIGMOD '79.

[93]  Calisto Zuzarte,et al.  Cardinality estimation using neural networks , 2015, CASCON.

[94]  Kinji Ono,et al.  A Neural Networks Approach for Query Cost Evaluation , 1997 .

[95]  Srikanth Kandula,et al.  Selectivity Estimation for Range Predicates using Lightweight Models , 2019, Proc. VLDB Endow..

[96]  Viktor Leis,et al.  How Good Are Query Optimizers, Really? , 2015, Proc. VLDB Endow..

[97]  Kurt Stockinger,et al.  Join Query Optimization with Deep Reinforcement Learning Algorithms , 2019, ArXiv.

[98]  Wolfgang Lehner,et al.  Machine Learning-based Cardinality Estimation in DBMS on Pre-Aggregated Data , 2020, ArXiv.

[99]  Yannis E. Ioannidis,et al.  The History of Histograms (abridged) , 2003, VLDB.

[100]  Xuemin Lin,et al.  Selectivity Estimation on Set Containment Search , 2019, DASFAA.

[101]  Dan Suciu,et al.  Pessimistic Cardinality Estimation: Tighter Upper Bounds for Intermediate Join Cardinalities , 2019, SIGMOD Conference.

[102]  Volker Markl,et al.  Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation , 2015, SIGMOD Conference.

[103]  Graham Cormode,et al.  An improved data stream summary: the count-min sketch and its applications , 2004, J. Algorithms.

[104]  Neoklis Polyzotis,et al.  Graph-based synopses for relational selectivity estimation , 2006, SIGMOD Conference.