The VC-Dimension of SQL Queries and Selectivity Estimation through Sampling

We develop a novel method, based on the statistical concept of VC-dimension, for evaluating the selectivity (output cardinality) of SQL queries - a crucial step in optimizing the execution of large scale database and data-mining operations. The major theoretical contribution of this work, which is of independent interest, is an explicit bound on the VC-dimension of a range space defined by all possible outcomes of a collection (class) of queries. We prove that the VC-dimension is a function of the maximum number of Boolean operations in the selection predicate, and of the maximum number of select and join operations in any individual query in the collection, but it is neither a function of the number of queries in the collection nor of the size of the database. We develop a method based on this result: given a class of queries, it constructs a concise random sample of a database, such that with high probability the execution of any query in the class on the sample provides an accurate estimate for the selectivity of the query on the original large database. The error probability holds simultaneously for the selectivity estimates of all queries in the collection, thus the same sample can be used to evaluate the selectivity of multiple queries, and the sample needs to be refreshed only following major changes in the database. The sample representation computed by our method is typically sufficiently small to be stored in main memory. We present extensive experimental results, validating our theoretical analysis and demonstrating the advantage of our technique when compared to complex selectivity estimation techniques used in PostgreSQL and the Microsoft SQL Server.

[1]  Vladimir Cherkassky,et al.  The Nature Of Statistical Learning Theory , 1997, IEEE Trans. Neural Networks.

[2]  David Gross-Amblard,et al.  Query-preserving watermarking of relational databases and XML documents , 2003, PODS '03.

[3]  Gautam Das Sampling Methods in Approximate Query Answering Systems Select State, Count (*) as Itemcount from Salesdata Where Productname = 'lawn Mower' Group by State Order by Itemcount Desc , .

[4]  Rupert G. Miller Simultaneous Statistical Inference , 1966 .

[5]  Frank Olken,et al.  Random Sampling from Databases , 1993 .

[6]  Bernard Chazelle,et al.  The discrepancy method - randomness and complexity , 2000 .

[7]  Jarek Gryz,et al.  Query Selectivity Estimation via Data Mining , 2004, Intelligent Information Systems.

[8]  Jeffrey F. Naughton,et al.  Selectivity and Cost Estimation for Joins Based on Random Sampling , 1996, J. Comput. Syst. Sci..

[9]  Noga Alon,et al.  The Probabilistic Method , 2015, Fundamentals of Ramsey Theory.

[10]  Surajit Chaudhuri,et al.  Conditional selectivity for statistics on query expressions , 2004, SIGMOD '04.

[11]  Peter J. Haas,et al.  Maintaining bernoulli samples over evolving multisets , 2007, PODS '07.

[12]  Jiri Matousek,et al.  Lectures on discrete geometry , 2002, Graduate texts in mathematics.

[13]  Rajeev Motwani,et al.  Random sampling for histogram construction: how much is enough? , 1998, SIGMOD '98.

[14]  ChaudhuriSurajit,et al.  Random sampling for histogram construction , 1998 .

[15]  Christopher Ré,et al.  Understanding cardinality estimation using entropy maximization , 2010, PODS '10.

[16]  Chris Jermaine,et al.  Online maintenance of very large random samples , 2004, SIGMOD '04.

[17]  N. Alon,et al.  The Probabilistic Method: Alon/Probabilistic , 2008 .

[18]  David Haussler,et al.  Epsilon-nets and simplex range queries , 1986, SCG '86.

[19]  Anne H. H. Ngu,et al.  Query Size Estimation for Joins Using Systematic Sampling , 2004, Distributed and Parallel Databases.

[20]  Anne H. H. Ngu,et al.  Query Size Estimation Using Machine Learning , 1997, DASFAA.

[21]  Vladimir Vapnik,et al.  Chervonenkis: On the uniform convergence of relative frequencies of events to their probabilities , 1971 .

[22]  Michael Benedikt,et al.  Aggregate Operators in Constraint Query Languages , 2002, J. Comput. Syst. Sci..

[23]  Christopher Ré,et al.  Understanding cardinality estimation using entropy maximization , 2012, ACM Trans. Database Syst..

[24]  Peter J. Haas,et al.  Improved histograms for selectivity estimation of range predicates , 1996, SIGMOD '96.

[25]  Jeffrey F. Naughton,et al.  Query Size Estimation by Adaptive Sampling , 1995, J. Comput. Syst. Sci..

[26]  Jeffrey F. Naughton,et al.  End-biased Samples for Join Cardinality Estimation , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[27]  Jeffrey F. Naughton,et al.  Fixed-precision estimation of join selectivity , 1993, PODS '93.

[28]  Rajeev Motwani,et al.  On random sampling over joins , 1999, SIGMOD '99.

[29]  Mong-Li Lee,et al.  ICICLES: Self-Tuning Samples for Approximate Query Answering , 2000, VLDB.

[30]  Raghu Ramakrishnan,et al.  Synopses for query optimization: A space-complexity perspective , 2004, TODS.

[31]  Peter J. Haas,et al.  A dip in the reservoir: maintaining sample synopses of evolving datasets , 2006, VLDB.

[32]  Wolfgang Lehner,et al.  Cardinality estimation using sample views with quality assurance , 2007, SIGMOD '07.

[33]  Sridhar Ramaswamy,et al.  Join synopses for approximate query answering , 1999, SIGMOD '99.

[34]  Peter L. Bartlett,et al.  Neural Network Learning - Theoretical Foundations , 1999 .

[35]  Luis Gravano,et al.  STHoles: a multidimensional workload-aware histogram , 2001, SIGMOD '01.

[36]  Micha Sharir,et al.  Relative (p,ε)-Approximations in Geometry , 2011, Discret. Comput. Geom..

[37]  Torsten Suel,et al.  Optimal Histograms with Quality Guarantees , 1998, VLDB.

[38]  Wen-Chi Hou,et al.  Statistical estimators for relational algebra expressions , 1988, PODS '88.

[39]  Maarten Löffler,et al.  Shape Fitting on Point Sets with Probability Distributions , 2008, ESA.

[40]  Yossi Matias,et al.  Bifocal sampling for skew-resistant join size estimation , 1996, SIGMOD '96.

[41]  Peter J. Haas,et al.  Sequential sampling procedures for query size estimation , 1992, SIGMOD '92.

[42]  Divyakant Agrawal,et al.  Applying the golden rule of sampling for query estimation , 2001, SIGMOD '01.

[43]  Yannis E. Ioannidis,et al.  Balancing histogram optimality and practicality for query result size estimation , 1995, SIGMOD '95.

[44]  Jennifer Widom,et al.  Database Systems: The Complete Book , 2001 .

[45]  Ameet Talwalkar,et al.  Foundations of Machine Learning , 2012, Adaptive computation and machine learning.

[46]  David Haussler,et al.  Learnability and the Vapnik-Chervonenkis dimension , 1989, JACM.

[47]  Peter J. Haas,et al.  Techniques for Warehousing of Sample Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[48]  Surajit Chaudhuri,et al.  Optimized stratified sampling for approximate query processing , 2007, TODS.

[49]  Alin Dobra Histograms revisited: when are histograms the best approximation method for aggregates over joins? , 2005, PODS '05.

[50]  Gautam Das,et al.  Sampling Methods in Approximate Query Answering Systems , 2009, Encyclopedia of Data Warehousing and Mining.

[51]  Jeffrey F. Naughton,et al.  Practical selectivity estimation through adaptive sampling , 1990, SIGMOD '90.

[52]  Amos Fiat,et al.  Algorithms - ESA 2009 , 2009, Lecture Notes in Computer Science.

[53]  Peter J. Haas,et al.  A bi-level Bernoulli scheme for database sampling , 2004, SIGMOD '04.

[54]  Chris Jermaine,et al.  Robust Stratified Sampling Plans for Low Selectivity Queries , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[55]  Jeffrey Scott Vitter,et al.  Selectivity estimation in the presence of alphanumeric correlations , 1997, Proceedings 13th International Conference on Data Engineering.

[56]  Yossi Matias,et al.  New sampling-based summary statistics for improving approximate query answers , 1998, SIGMOD '98.

[57]  Jeffrey F. Naughton,et al.  On the relative cost of sampling for join selectivity estimation , 1994, PODS '94.

[58]  Yannis E. Ioannidis,et al.  Selectivity Estimation Without the Attribute Value Independence Assumption , 1997, VLDB.

[59]  Wen-Chi Hou,et al.  Error-constrained COUNT query evaluation in relational databases , 1991, SIGMOD '91.

[60]  Hai Wang,et al.  A multi-dimensional histogram for selectivity estimation and fast approximate query answering , 2003, CASCON.

[61]  Aaron Roth,et al.  A learning theory approach to noninteractive database privacy , 2011, JACM.

[62]  Peter J. Haas,et al.  ISOMER: Consistent Histogram Construction Using Query Feedback , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[63]  Meng Chang Chen,et al.  Selectivity estimation using homogeneity measurement , 1990, [1990] Proceedings. Sixth International Conference on Data Engineering.

[64]  John Wang,et al.  Encyclopedia of Data Warehousing and Mining , 2005 .

[65]  Yi Li,et al.  Improved bounds on the sample complexity of learning , 2000, SODA '00.

[66]  Ruoming Jin,et al.  New Sampling-Based Estimators for OLAP Queries , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[67]  Peter J. Haas,et al.  Consistent selectivity estimation via maximum entropy , 2007, The VLDB Journal.

[68]  Jeffrey Scott Vitter,et al.  Wavelet-based histograms for selectivity estimation , 1998, SIGMOD '98.

[69]  Aaron Roth,et al.  A learning theory approach to non-interactive database privacy , 2008, STOC.

[70]  David Gross-Amblard,et al.  Query-preserving watermarking of relational databases and Xml documents , 2011, TODS.

[71]  Ben Taskar,et al.  Selectivity estimation using probabilistic models , 2001, SIGMOD '01.

[72]  Peter J. Haas,et al.  Hoeffding inequalities for join-selectivity estimation and online aggregation , 1996 .

[73]  B GibbonsPhillip,et al.  New sampling-based summary statistics for improving approximate query answers , 1998 .

[74]  P.J. Haas,et al.  Sampling-based selectivity estimation for joins using augmented frequent value statistics , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[75]  Surajit Chaudhuri,et al.  Dynamic sample selection for approximate query processing , 2003, SIGMOD '03.