Paper Information - Sampling Issues in Parallel Database Systems

Sampling Issues in Parallel Database Systems

Sampling has proven useful in database systems in applications including query size estimation and, most recently, probabilistic parallel query evaluation algorithms. In order to apply the full power of modern multiprocessor database systems, sampling techniques must (1) distribute the sampling workload evenly among the processors in the system, and (2) make use of all the data on the pages brought into main memory during the course of the sampling. In this paper we show how to achieve these two goals by proving that for query size estimation, (1) stratified random sampling guarantees perfect load balancing without reducing the accuracy of the estimate, and that (2) for a given number of I/O operations, page-level sampling always produces a more accurate estimate than tuple-level sampling. For probabilistic parallel query evaluation algorithms, high performance requires tight bounds on the expected skew in the allocation of work to processors as a function of the number of samples. Toward this end we prove a new bound on this skew, and show that our new bound is better than previously known bounds.
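The first claim in the abstract can be illustrated with a minimal sketch of stratified sampling for query size estimation. It is not taken from the paper: the function name `stratified_count_estimate`, the in-memory list-of-lists layout, and the equal per-partition sample count are all assumptions made for illustration. Each inner list stands for the tuples stored at one processor, and every processor draws the same number of samples, so the sampling work is identical across processors.

```python
import random


def stratified_count_estimate(partitions, predicate, samples_per_partition, rng):
    """Estimate how many tuples across all partitions satisfy `predicate`.

    Hypothetical helper, not the paper's algorithm. Each partition is one
    stratum; every stratum contributes the same number of samples.
    """
    estimate = 0.0
    for tuples in partitions:
        k = min(samples_per_partition, len(tuples))
        if k == 0:
            continue  # an empty partition contributes nothing
        sample = rng.sample(tuples, k)  # simple random sample without replacement
        hits = sum(1 for t in sample if predicate(t))
        # Scale the sampled hit fraction up to the size of this partition.
        estimate += len(tuples) * hits / k
    return estimate


# Four hypothetical processors, each holding 1000 integer "tuples".
partitions = [list(range(i * 1000, (i + 1) * 1000)) for i in range(4)]
estimate = stratified_count_estimate(
    partitions, lambda x: x % 4 == 0, samples_per_partition=100, rng=random.Random(0)
)
print(estimate)
```

Sampling every tuple in every partition (here, `samples_per_partition=1000`) makes the estimate equal to the exact count, which is a convenient sanity check for the scaling step.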

Jeffrey F. Naughton | S. Seshadri

[1] Wen-Chi Hou, et al. Statistical estimators for relational algebra expressions, 1988, PODS '88.

[2] Guy E. Blelloch, et al. A comparison of sorting algorithms for the connection machine CM-2, 1991, SPAA '91.

[3] Doron Rotem, et al. Simple Random Sampling from Relational Databases, 1986, VLDB.

[4] David J. DeWitt, et al. Parallel database systems: the future of database processing or a passing fad?, 1990, SIGMOD Rec.

[5] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables, 1963.

[6] Gaston H. Gonnet, et al. Expected Length of the Longest Probe Sequence in Hash Code Searching, 1981, JACM.

[7] Jeffrey F. Naughton, et al. Practical selectivity estimation through adaptive sampling, 1990, SIGMOD '90.

[8] David J. DeWitt, et al. Parallel sorting on a shared-nothing architecture using probabilistic splitting, 1991, Proceedings of the First International Conference on Parallel and Distributed Information Systems.

[9] Wen-Chi Hou, et al. Error-constrained COUNT query evaluation in relational databases, 1991, SIGMOD '91.

[10] Wen-Chi Hou, et al. Processing aggregate relational queries with hard time constraints, 1989, SIGMOD '89.

[11] Michael Stonebraker, et al. The Case for Shared Nothing, 1985, HPTS.