Approximate Query Answering by Model Averaging

In earlier work we introduced and explored a variety of probabilistic models for answering selectivity queries posed to large sparse binary data sets. These models scale directly to hundreds or thousands of dimensions, in contrast to other approximate querying techniques (such as histograms or wavelets) that are inherently limited to relatively small numbers of dimensions. In this paper we extend this work by applying probabilistic model averaging to query answering, a scheme that allows the query-answering algorithm to adapt automatically and optimally to both the specific nature of the data and the distribution of queries issued by any specific user. We demonstrate on real-world and simulated data sets that model averaging can reduce the prediction error of any single model by up to 50%. Learning the combining weights is a straightforward and scalable optimization problem that can be easily automated, providing a practical framework for approximate query answering with massive data sets.
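As a rough illustration of what learning the combining weights might look like, the sketch below fits non-negative weights that sum to one over the selectivity estimates of several models. The abstract does not specify the loss or the optimizer, so the squared-error objective, the exponentiated-gradient update, the function names, and the toy numbers are all assumptions made for illustration, not the procedure used in the paper.

```python
import math

def combine(weights, preds):
    """Weighted average of per-model selectivity estimates for one query."""
    return sum(w * p for w, p in zip(weights, preds))

def mse(weights, pred_rows, truths):
    """Mean squared error of the combined estimate over a set of queries."""
    errs = [combine(weights, p) - y for p, y in zip(pred_rows, truths)]
    return sum(e * e for e in errs) / len(errs)

def fit_weights(pred_rows, truths, steps=2000, eta=2.0):
    """Fit weights on the probability simplex by exponentiated gradient
    descent on mean squared error (an assumed loss and optimizer)."""
    n, k = len(pred_rows), len(pred_rows[0])
    w = [1.0 / k] * k  # start from a uniform average of the models
    for _ in range(steps):
        grad = [0.0] * k
        for preds, y in zip(pred_rows, truths):
            err = combine(w, preds) - y
            for j in range(k):
                grad[j] += 2.0 * err * preds[j] / n
        # Multiplicative update keeps every weight positive;
        # renormalizing keeps the weights summing to one.
        w = [wj * math.exp(-eta * gj) for wj, gj in zip(w, grad)]
        total = sum(w)
        w = [wj / total for wj in w]
    return w

# Hypothetical training queries: each row holds the selectivity estimates
# of three illustrative models for one query.
pred_rows = [
    [0.10, 0.20, 0.30],
    [0.40, 0.20, 0.10],
    [0.05, 0.15, 0.25],
    [0.30, 0.30, 0.10],
    [0.20, 0.10, 0.40],
    [0.50, 0.30, 0.20],
]
# Made-up true selectivities for the same queries.
truths = [0.15, 0.30, 0.10, 0.30, 0.15, 0.40]

weights = fit_weights(pred_rows, truths)
single_model_errors = [
    mse([1.0 if j == i else 0.0 for j in range(3)], pred_rows, truths)
    for i in range(3)
]
combined_error = mse(weights, pred_rows, truths)
```

Because the weights are fitted on queries drawn from a user's own workload, the same routine would in principle adapt the combination to that user's query distribution; in practice the weights should be evaluated on held-out queries rather than the ones used for fitting.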
