Maintaining bounded-size sample synopses of evolving datasets

Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. The ability to bound the maximum size of a sample can be very convenient from a system-design point of view, because the task of memory management is simplified, especially when many samples are maintained simultaneously. In this paper, we study methods for incrementally maintaining a bounded-size uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions. For “stable” datasets whose size remains roughly constant over time, we provide a novel sampling scheme, called “random pairing” (RP), that maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the 45-year-old reservoir sampling algorithm to handle deletions; RP reduces to the “passive” algorithm of Babcock et al. when the insertions and deletions correspond to a moving window over a data stream. Experiments show that, when dataset-size fluctuations over time are not too extreme, RP is the algorithm of choice with respect to speed and sample-size stability. For “growing” datasets, we consider algorithms for periodically resizing a bounded-size random sample upwards. We prove that any such algorithm cannot avoid accessing the base data, and provide a novel resizing algorithm that minimizes the time needed to increase the sample size. We also show how to merge uniform samples from disjoint datasets to obtain a uniform sample of the union of the datasets; the merged sample can be incrementally maintained. Our new RPMerge algorithm extends the HRMerge algorithm of Brown and Haas to effectively deal with deletions, thereby facilitating efficient parallel sampling.
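The random-pairing idea described above can be illustrated with a minimal sketch. It assumes that items are distinct, that only items currently in the dataset are deleted, and that two counters track deletions not yet compensated by later insertions. The class name `RandomPairingSample`, its attributes, and the use of Python's `random` module are illustrative assumptions, not the authors' implementation.

```python
import random


class RandomPairingSample:
    """Sketch of a bounded-size uniform sample under insertions and deletions."""

    def __init__(self, bound, rng=None):
        self.bound = bound        # maximum sample size
        self.sample = []          # items currently in the sample
        self.dataset_size = 0     # current size of the underlying dataset
        self.c1 = 0               # uncompensated deletions that removed a sample item
        self.c2 = 0               # uncompensated deletions that missed the sample
        self.rng = rng or random.Random()

    def insert(self, item):
        self.dataset_size += 1
        if self.c1 + self.c2 == 0:
            # No pending deletions: ordinary reservoir-sampling step.
            if len(self.sample) < self.bound:
                self.sample.append(item)
            elif self.rng.random() < self.bound / self.dataset_size:
                victim = self.rng.randrange(len(self.sample))
                self.sample[victim] = item
        else:
            # Pair the new item with one earlier deletion; it enters the
            # sample with the probability that the paired deletion hit it.
            if self.rng.random() < self.c1 / (self.c1 + self.c2):
                self.sample.append(item)
                self.c1 -= 1
            else:
                self.c2 -= 1

    def delete(self, item):
        self.dataset_size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1
```

In this sketch the sample never grows beyond `bound`, and a deletion shrinks the sample only until later insertions compensate for it, which matches the sample-size stability described for stable datasets.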

[1]  Donald E. Knuth,et al.  The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.) , 1998 .

[2]  William H. Press,et al.  Numerical Recipes in C, 2nd Edition , 1992 .

[3]  Jeffrey Scott Vitter,et al.  Random sampling with a reservoir , 1985, TOMS.

[4]  F. Olken,et al.  Maintenance of materialized views of sampling queries , 1992, [1992] Eighth International Conference on Data Engineering.

[5]  Pat Langley,et al.  Static Versus Dynamic Sampling for Data Mining , 1996, KDD.

[6]  Yossi Matias,et al.  Fast incremental maintenance of approximate histograms , 1997, TODS.

[7]  Takuji Nishimura,et al.  Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator , 1998, TOMC.

[8]  Rajeev Motwani,et al.  On random sampling over joins , 1999, SIGMOD '99.

[9]  D. Vere-Jones Markov Chains , 1972, Nature.

[10]  Donald Ervin Knuth,et al.  The Art of Computer Programming , 1968 .

[11]  Graham Cormode,et al.  Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling , 2005, VLDB.

[12]  Frank Olken,et al.  Random Sampling from Databases , 1993 .

[13]  Yossi Matias,et al.  Aqua Project White Paper , 1997 .

[14]  Wolfgang Lehner,et al.  Data Management in a Connected World, Essays Dedicated to Hartmut Wedekind on the Occasion of His 70th Birthday , 2005, Data Management in a Connected World.

[15]  Peter J. Haas,et al.  A bi-level Bernoulli scheme for database sampling , 2004, SIGMOD '04.

[16]  Voratas Kachitvichyanukul,et al.  Computer generation of hypergeometric random variates , 1985 .

[17]  Chris Jermaine,et al.  Online maintenance of very large random samples , 2004, SIGMOD '04.

[18]  Heikki Mannila,et al.  The power of sampling in knowledge discovery , 1994, PODS '94.

[19]  Yossi Matias,et al.  New sampling-based summary statistics for improving approximate query answers , 1998, SIGMOD '98.

[20]  Pierre L'Ecuyer,et al.  Uniform random number generation , 1994, Ann. Oper. Res..

[21]  James C. Spall,et al.  Introduction to stochastic search and optimization - estimation, simulation, and control , 2003, Wiley-Interscience series in discrete mathematics and optimization.

[22]  Paul Brown,et al.  BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data , 2003, VLDB.

[23]  Peter J. Haas,et al.  Improved histograms for selectivity estimation of range predicates , 1996, SIGMOD '96.

[24]  Rajeev Motwani,et al.  Sampling from a moving window over streaming data , 2002, SODA '02.

[25]  Jeffrey Scott Vitter,et al.  Faster methods for random sampling , 1984, CACM.

[26]  William H. Press,et al.  Numerical Recipes in Fortran 77: The Art of Scientific Computing, 2nd Edition - Volume 1 of Fortran Numerical Recipes , 1992 .

[27]  Michael Stonebraker,et al.  Load Shedding in a Data Stream Manager , 2003, VLDB.

[28]  Peter J. Haas,et al.  Techniques for Warehousing of Sample Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[29]  L. Devroye Discrete Univariate Distributions , 1986 .

[30]  Mervin E. Muller,et al.  Development of Sampling Plans by Using Sequential (Item by Item) Selection Techniques and Digital Computers , 1962 .

[31]  Felix Naumann,et al.  (Almost) Hands-Off Information Integration for the Life Sciences , 2005, CIDR.

[32]  Averill M. Law,et al.  Simulation Modeling and Analysis , 1982 .

[33]  Peter J. Haas,et al.  Maintaining bernoulli samples over evolving multisets , 2007, PODS '07.

[34]  H. Robbins A Stochastic Approximation Method , 1951 .

[35]  C. N Bouza,et al.  Spall, J.C. Introduction to stochastic search and optimization. Estimation, simulation and control. Wiley Interscience Series in Discrete Mathematics and Optimization, 2003 , 2004 .

[36]  Helen J. Wang,et al.  Online aggregation , 1997, SIGMOD '97.

[37]  Paul Brown,et al.  CORDS: automatic discovery of correlations and soft functional dependencies , 2004, SIGMOD '04.

[38]  Paul Brown,et al.  Toward Automated Large-Scale Information Integration and Discovery , 2005, Data Management in a Connected World.

[39]  Sheldon M. Ross,et al.  Stochastic Processes , 2018, Gauge Integral Structures for Stochastic Calculus and Quantum Electrodynamics.

[40]  A. W. Kemp,et al.  Univariate Discrete Distributions , 1993 .

[41]  J. Doob Stochastic processes , 1953 .

[42]  Pierre L'Ecuyer,et al.  Chapter 3 Uniform Random Number Generation , 2006, Simulation.

[43]  Peter J. Haas,et al.  A dip in the reservoir: maintaining sample synopses of evolving datasets , 2006, VLDB.

[44]  Wolfgang Lehner,et al.  Deferred Maintenance of Disk-Based Random Samples , 2006, EDBT.

[45]  Carl-Erik Särndal,et al.  Model Assisted Survey Sampling , 1997 .

[46]  A. I. McLeod,et al.  A Convenient Algorithm for Drawing a Simple Random Sample , 1983 .

[47]  Sridhar Ramaswamy,et al.  Join synopses for approximate query answering , 1999, SIGMOD '99.

[48]  Piotr Indyk,et al.  Sampling in dynamic data streams and applications , 2005, Int. J. Comput. Geom. Appl..

[49]  Phillip B. Gibbons,et al.  New sampling-based summary statistics for improving approximate query answers , 1998 .