Query Size Estimation for Joins Using Systematic Sampling

We propose a new approach to estimating the result sizes of join queries. The technique, which we call "systematic sampling" (SYSSMP), is a novel variant of the sampling-based approach. Its key novelty is that it exploits the sortedness of the data, so that the sample relation obtained represents well the underlying frequency distribution of the join attribute in the original relation.

We first develop a theoretical foundation for systematic sampling, which suggests that the method gives a more representative sample than traditional simple random sampling. Subsequent experimental analysis on a range of synthetic relations confirms that the sample relations yielded by systematic sampling are of higher quality than those produced by simple random sampling.

To verify that sample relations produced by systematic sampling do lead to more accurate query result size estimates, we compare systematic sampling with the most efficient simple random sampling method, called t_cross, over a variety of relation configurations. The results show that systematic sampling uses the same amount of sampling as t_cross yet provides more accurate query result sizes. Furthermore, the extra sampling cost incurred by systematic sampling pays off in a cheaper query execution cost at run-time.
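The idea of sampling from sorted data can be illustrated with a minimal Python sketch. It is not the paper's SYSSMP implementation: the function names, the fixed step `k`, the random starting offset, and the simple scale-up estimator are all assumptions chosen for illustration. Relations are modeled as plain lists of join-attribute values.

```python
import random
from collections import Counter


def systematic_sample(sorted_values, k, rng=None):
    """Take every k-th value from an already sorted list.

    The starting offset is drawn uniformly from [0, k), which is the
    textbook form of systematic sampling (an assumption here, not
    necessarily the exact procedure used in the paper).
    """
    if k < 1:
        raise ValueError("step k must be at least 1")
    rng = rng or random.Random()
    start = rng.randrange(k)
    return sorted_values[start::k]


def simple_random_sample(values, n, rng=None):
    """Draw n values without replacement, ignoring any ordering."""
    rng = rng or random.Random()
    return rng.sample(values, n)


def estimate_join_size(sample_r, sample_s, scale_r, scale_s):
    """Join the two samples on equality and scale the count up.

    scale_r and scale_s are the inverse sampling fractions
    (relation size divided by sample size). This scale-up rule is a
    hypothetical stand-in for the estimators compared in the paper.
    """
    freq_r = Counter(sample_r)
    freq_s = Counter(sample_s)
    matches = sum(count * freq_s[value] for value, count in freq_r.items())
    return matches * scale_r * scale_s


if __name__ == "__main__":
    # A sorted relation with 10 distinct join values, 10 tuples each.
    relation = sorted(v for v in range(10) for _ in range(10))
    sample = systematic_sample(relation, 10, random.Random(0))
    print(sample)
    print(estimate_join_size(sample, sample, 10, 10))
```

Because the relation is sorted, a step of 10 picks exactly one tuple per join value in this example regardless of the starting offset, so the sample mirrors the frequency distribution exactly. A simple random sample of the same size carries no such guarantee and may miss some values while repeating others.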
