Optimizing Sample Design for Approximate Query Processing

The rapid growth of data volumes makes sampling a crucial component of modern data management systems. Although there is a large body of work on database sampling, the problem of automatically determining the optimal sample for a given query has remained almost unaddressed. To tackle this problem, the authors propose a sample advisor based on a novel cost model. The advisor is primarily designed to recommend samples for a few queries specified by an expert; the authors additionally propose two extensions. The first extension broadens its applicability by utilizing recorded workload information and taking memory bounds into account. The second extension increases its effectiveness by merging samples when pieces of sample advice overlap. For both extensions, the authors present exact and heuristic solutions. In their evaluation, the authors analyze the properties of the cost model and demonstrate the effectiveness and efficiency of the heuristic solutions in a variety of experiments.
