A disk-based join with probabilistic guarantees

One of the most common operations in analytic query processing is applying an aggregate function to the result of a relational join. We describe an algorithm for computing the answer to such a query over large, disk-based input tables. The key innovation of our algorithm is that, at all times, it provides an online statistical estimator for the eventual answer to the query, together with probabilistic confidence bounds. A user can therefore monitor the progress of the join throughout its execution and stop it once the estimate is accurate enough, or run the algorithm to completion in a total time that is not much longer than that of other common join algorithms. This contrasts with other online join algorithms, which either offer no such statistical guarantees or offer them only as long as the input data fits in core memory.
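
To make the idea of an online estimator with confidence bounds concrete, the following is a minimal sketch in Python. It is not the algorithm described here: it assumes a single in-memory list sampled uniformly with replacement, and it uses a normal-approximation interval, whereas the join setting needs a more careful estimator. The names `RunningEstimate`, `online_sum`, `target_half_width`, `min_samples`, and `max_samples` are hypothetical and chosen only for illustration.

```python
import math
import random


class RunningEstimate:
    """Running mean/variance (Welford's method) over sampled values."""

    def __init__(self, population_size):
        self.population_size = population_size
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def estimate(self, z=1.96):
        """Return (estimated SUM, half-width of an approximate 95% interval)."""
        total = self.population_size * self.mean
        if self.n < 2:
            return total, math.inf
        variance = self.m2 / (self.n - 1)
        half_width = z * self.population_size * math.sqrt(variance / self.n)
        return total, half_width


def online_sum(values, target_half_width, min_samples=30,
               max_samples=1_000_000, seed=0):
    """Sample with replacement until the interval is narrow enough.

    Assumption: uniform sampling with replacement from an in-memory list;
    this stands in for whatever sampling scheme a real system would use.
    """
    rng = random.Random(seed)
    est = RunningEstimate(len(values))
    while est.n < max_samples:
        est.add(rng.choice(values))
        total, half_width = est.estimate()
        # The estimate is available after every sample; stop when satisfied.
        if est.n >= min_samples and half_width <= target_half_width:
            break
    total, half_width = est.estimate()
    return total, half_width, est.n
```

A caller could, for example, run `online_sum(list(range(1, 1001)), target_half_width=20000)` and receive an estimate of the sum, the interval half-width, and the number of samples drawn. The interval narrows roughly in proportion to one over the square root of the sample count, which is why stopping early trades accuracy for time.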
