Maintaining very large random samples using the geometric file

Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a “sample” is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples, gigabytes or terabytes in size, can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques currently exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. We also present algorithms for retrieving a small random sample from the large disk-based sample, which may be used for various purposes, including statistical analyses by a DBMS.
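
To make the "true random sample without replacement at all times" requirement concrete, the following minimal sketch shows the classic in-memory reservoir sampling technique. It is not the disk-based geometric file described in this paper; it only illustrates the invariant that a one-pass sampler must preserve. The class name `ReservoirSample` and its methods are hypothetical, chosen for illustration.

```python
import random


class ReservoirSample:
    """In-memory reservoir sampler (illustrative only, not the geometric file).

    After every call to add(), `items` is a uniform random sample without
    replacement of all records seen so far, of size min(capacity, seen).
    """

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.seen = 0                      # number of records processed so far
        self.items = []                    # the current sample
        self.rng = rng or random.Random()  # injectable for reproducibility

    def add(self, record):
        self.seen += 1
        if len(self.items) < self.capacity:
            # The reservoir is not yet full: every record is kept.
            self.items.append(record)
            return
        # Pick a slot uniformly from [0, seen). The new record is kept only
        # if the slot falls inside the reservoir, i.e. with probability
        # capacity / seen, and it then evicts the record in that slot.
        slot = self.rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = record


# Usage: maintain a 100-record sample over a stream of 10,000 integers.
sample = ReservoirSample(capacity=100, rng=random.Random(0))
for value in range(10_000):
    sample.add(value)
```

Each replacement in this sketch overwrites a random position, which is cheap in memory but would cost roughly one random disk access per accepted record if the reservoir lived on disk. That cost is the motivation for algorithms designed specifically for very large, disk-based samples.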
