Relational confidence bounds are easy with the bootstrap

Statistical estimation and approximate query processing have become increasingly prevalent applications for database systems. However, approximation is usually of little use without some sort of guarantee on estimation accuracy, or "confidence bound." Analytically deriving probabilistic guarantees for database queries over sampled data is a daunting task, not suitable for the faint of heart, and certainly beyond the expertise of the typical database system end-user. This paper considers the problem of incorporating into a database system a powerful "plug-in" method for computing confidence bounds on the answers to relational database queries over sampled or incomplete data. This statistical tool, called the bootstrap, is simple enough that it can be used by a database programmer with a rudimentary mathematical background, but general enough that it can be applied to almost any statistical inference problem. Given the power and ease of use of the bootstrap, we argue that the algorithms presented for supporting it should be incorporated into any database system that is intended to support analytic processing.
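To make the "plug-in" idea concrete, the following is a minimal sketch of a percentile bootstrap applied to an aggregate query over a sample. It is not the paper's algorithm; it only illustrates the general technique. The table contents, the `table_size` of 10,000 rows, the query, and the function names `bootstrap_ci` and `estimate_east_total` are all hypothetical and chosen for illustration.

```python
import random

def bootstrap_ci(sample, estimator, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for estimator(sample).

    Repeatedly resamples the sample with replacement, re-runs the
    estimator on each resample, and reads the interval off the
    empirical quantiles of the resulting estimates.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper

# Hypothetical uniform sample of 200 rows (region, amount) drawn from a
# table assumed to hold 10,000 rows.
table_size = 10_000
rows = [("east", 10.0 + (i % 7)) if i % 2 == 0 else ("west", 20.0 + (i % 5))
        for i in range(200)]

def estimate_east_total(sampled_rows):
    """Scaled-up estimate of SELECT SUM(amount) WHERE region = 'east'."""
    matching = sum(amount for region, amount in sampled_rows if region == "east")
    return table_size / len(sampled_rows) * matching

point = estimate_east_total(rows)
low, high = bootstrap_ci(rows, estimate_east_total)
print(f"estimate = {point:.0f}, 95% bootstrap interval = [{low:.0f}, {high:.0f}]")
```

The estimator is passed in as an ordinary function, so the same resampling loop could in principle be reused for a different query without any new analytical derivation, which is the appeal of the approach described above.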
