Sampling algorithms for evolving datasets

Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up the processing of analytic queries and data-mining tasks, to enhance query optimization, and to facilitate information integration. Most of the existing work on database sampling focuses on how to create or exploit a random sample of a static database, that is, a database that does not change over time. The assumption of a static database, however, severely limits the applicability of these techniques in practice, where data is often not static but continuously evolving. In order to maintain the statistical validity of the sample, any changes to the database have to be appropriately reflected in the sample. In this thesis, we study efficient methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions, updates, and deletions. We consider instances of the maintenance problem that arise when sampling from an evolving set, from an evolving multiset, from the distinct items in an evolving multiset, or from a sliding window over a data stream. Our algorithms completely avoid any accesses to the base data and can be several orders of magnitude faster than algorithms that do rely on such expensive accesses. The improved efficiency of our algorithms comes at virtually no cost: the resulting samples are provably uniform and only a small amount of auxiliary information is associated with the sample. We show that the auxiliary information not only facilitates efficient maintenance, but it can also be exploited to derive unbiased, low-variance estimators for counts, sums, averages, and the number of distinct items in the underlying dataset. In addition to sample maintenance, we discuss methods that greatly improve the flexibility of random sampling from a system’s point of view. 
More specifically, we initiate the study of algorithms that resize a random sample upwards or downwards. Our resizing algorithms can be exploited to dynamically control the size of the sample when the dataset grows or shrinks; they facilitate resource management and help to avoid under- or oversized samples. Furthermore, in large-scale databases whose data is distributed across several remote locations, it is usually infeasible to reconstruct the entire dataset for the purpose of sampling. To address this problem, we provide efficient algorithms that directly combine the local samples maintained at each location into a sample of the global dataset. We also consider a more general problem, where the global dataset is defined as an arbitrary set or multiset expression involving the local datasets, and provide efficient solutions based on hashing.
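
To make the maintenance problem concrete, the following minimal sketch shows the classic reservoir scheme for the simplest setting, a dataset that only receives insertions. It is a textbook baseline and not one of the algorithms developed in this thesis, which additionally handle updates and deletions; the function name `reservoir_sample` and its parameters are illustrative assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an insertion-only stream.

    Textbook reservoir sampling; it never revisits earlier items, so the
    base data is not accessed again once an item has been processed.
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir directly.
            sample.append(item)
        else:
            # Item n enters with probability k/n and evicts a uniformly
            # chosen current member, which keeps the sample uniform.
            j = rng.randrange(n)
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    print(reservoir_sample(range(1000), 10, random.Random(42)))
```

A deletion cannot be handled this simply: removing a sampled item shrinks the sample, and refilling it naively would require access to the base data, which is the cost the maintenance algorithms studied here are designed to avoid.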
