A Framework for the Physical Design Problem for Data Synopses

Maintaining statistics on multidimensional data distributions is crucial for predicting the run-time and result size of queries and data analysis tasks with acceptable accuracy. To this end, a plethora of techniques have been proposed for maintaining a compact data "synopsis" on a single table, ranging from variants of histograms to methods based on wavelets and other transforms. However, the fundamental question of how to reconcile the synopses for large information sources with many tables has remained largely unexplored. This paper develops a general framework for reconciling the synopses on many tables, which may come from different information sources. It shows how to compute the optimal combination of synopses for a given workload and a limited amount of available memory. Experiments demonstrate the practicality of the approach and the accuracy of the proposed heuristics.
