Aqua Project White Paper

In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries whenever possible. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. This white paper describes the Approximate QUery Answering (AQUA) Project underway in the Information Sciences Research Center at Bell Labs. We present a framework for an approximate query engine that observes new data as it arrives and maintains small synopsis data structures on that data. These data structures are used to provide fast, approximate answers to a broad class of queries. We describe metrics for evaluating approximate query answers. We also present new synopsis data structures and new techniques for approximate query answers. We report on the goals and status of the Aqua project, and plans for future work.

Email: gibbons@research.bell-labs.com. Current address: Tel Aviv University, Ramat Aviv, Tel Aviv, Israel. Email: matias@math.tau.ac.il. Email: poosala@research.bell-labs.com.
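To make the idea of a synopsis concrete, here is a minimal sketch of one generic kind of synopsis: a fixed-size uniform random sample maintained by reservoir sampling as rows arrive, then used to estimate a COUNT query without touching the base data. It is an illustration under stated assumptions, not the Aqua engine's actual design; the class name `ReservoirSynopsis` and its methods `observe` and `estimate_count` are hypothetical names chosen for this example.

```python
import random


class ReservoirSynopsis:
    """Hypothetical synopsis: a fixed-size uniform sample of a data stream."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity          # maximum number of rows kept
        self.sample = []                  # the synopsis itself
        self.seen = 0                     # rows observed so far
        self._rng = random.Random(seed)   # private generator, reproducible if seeded

    def observe(self, row):
        """Update the synopsis with one newly arrived row (reservoir sampling)."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(row)
            return
        # Keep the new row with probability capacity / seen,
        # evicting a uniformly chosen resident row.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.sample[j] = row

    def estimate_count(self, predicate):
        """Estimate how many observed rows satisfy predicate, using only the sample."""
        if not self.sample:
            return 0.0
        hits = sum(1 for row in self.sample if predicate(row))
        # Scale the sample hit rate up to the full stream size.
        return hits * self.seen / len(self.sample)


# Usage: observe a stream of integers, then ask an approximate COUNT query.
synopsis = ReservoirSynopsis(capacity=1000, seed=0)
for value in range(100_000):
    synopsis.observe(value)
approx_even = synopsis.estimate_count(lambda v: v % 2 == 0)
```

The estimate is computed from at most `capacity` stored rows, so its cost does not grow with the size of the base data; the price is sampling error, which is why metrics for evaluating approximate answers matter.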
