QuickSel: Quick Selectivity Learning with Mixture Models

Estimating the selectivity of a query is a key step in almost any cost-based query optimizer. Most of today's databases rely on histograms or samples that are periodically refreshed by re-scanning the data as the underlying data changes. Since frequent scans are costly, these statistics are often stale and lead to poor selectivity estimates. As an alternative to scans, query-driven histograms have been proposed, which refine the histograms based on the actual selectivities of the observed queries. Unfortunately, these approaches are either too costly to use in practice (i.e., they require an exponential number of buckets) or quickly lose their advantage as they observe more queries. In this paper, we propose a selectivity learning framework, called QuickSel, which falls into the query-driven paradigm but does not use histograms. Instead, it builds an internal model of the underlying data, which can be refined significantly faster (e.g., in only 1.9 milliseconds for 300 queries). This fast refinement allows QuickSel to continuously learn from each query and yield increasingly accurate selectivity estimates over time. Unlike query-driven histograms, QuickSel relies on a mixture model and a new optimization algorithm for training its model. Our extensive experiments on two real-world datasets confirm that, given the same target accuracy, QuickSel is 34.0x to 179.4x faster than state-of-the-art query-driven histograms, including ISOMER and STHoles. Further, given the same space budget, QuickSel is 26.8% and 91.8% more accurate than periodically updated histograms and samples, respectively.
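
To make the query-driven idea concrete, the following is a minimal sketch of learning a mixture model from observed selectivities. It is not QuickSel's actual algorithm: the abstract does not specify the component type or the optimizer, so this sketch assumes one-dimensional uniform components and an ordinary least-squares fit with a soft sum-to-one penalty. The names `overlap_fraction`, `fit_weights`, `estimate`, and the `penalty` parameter are hypothetical.

```python
import numpy as np


def overlap_fraction(query, component):
    """Fraction of a uniform component's mass that falls inside a range query."""
    q_lo, q_hi = query
    c_lo, c_hi = component
    overlap = max(0.0, min(q_hi, c_hi) - max(q_lo, c_lo))
    return overlap / (c_hi - c_lo)


def fit_weights(components, observed, penalty=1e3):
    """Least-squares fit of mixture weights to (query, selectivity) feedback.

    Assumption: a heavily weighted extra row softly enforces that the
    weights sum to 1. Non-negativity is not enforced in this sketch.
    """
    A = np.array([[overlap_fraction(q, c) for c in components]
                  for q, _ in observed])
    s = np.array([sel for _, sel in observed])
    A_aug = np.vstack([A, penalty * np.ones(len(components))])
    s_aug = np.append(s, penalty)
    weights, *_ = np.linalg.lstsq(A_aug, s_aug, rcond=None)
    return weights


def estimate(query, components, weights):
    """Estimated selectivity: weighted sum of per-component overlaps."""
    return float(sum(w * overlap_fraction(query, c)
                     for w, c in zip(weights, components)))


# Two hypothetical components covering the domain [0, 1].
components = [(0.0, 0.5), (0.5, 1.0)]
# Hypothetical feedback: observed selectivities of two past range queries.
observed = [((0.0, 0.5), 0.8), ((0.5, 1.0), 0.2)]

weights = fit_weights(components, observed)
print(estimate((0.25, 0.75), components, weights))
```

Each new observed query adds one row to the linear system, so the model can be refit after every query without rescanning the data, which is the property the abstract attributes to the query-driven paradigm.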
