Approximate range–sum query answering on data cubes with probabilistic guarantees

Approximate range aggregate queries are among the most frequent and useful kinds of queries in Decision Support Systems (DSS), as they are widely used in many data analysis tasks. Traditionally, sampling-based techniques have been proposed to tackle this problem. However, their effectiveness degrades when the underlying data distribution is skewed. Another approach, based on outlier management, can limit the effect of data skew but fails to address other requirements of approximate range aggregate queries, such as error guarantees and query processing efficiency. In this paper, we present a technique that efficiently provides approximate answers to range aggregate queries on OLAP data cubes, with theoretical guarantees on the errors. Our basic idea is to build separate data structures to manage outliers and the rest of the data. Carefully chosen outliers are organized in a quad-tree based indexing data structure to provide efficient access for query processing. A query-workload adaptive, tree-like synopsis data structure, called TunablePartition-Tree (TP-Tree), is proposed to organize samples extracted from non-outlier data. Our experiments clearly demonstrate the merits of our technique in comparison with previous well-known techniques.
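The split between exactly stored outliers and sampled non-outlier data can be illustrated with a minimal sketch. It is not the quad-tree index or the TP-Tree described above: the function name `approx_range_sum`, the fixed magnitude threshold for choosing outliers, the query-time sampling with replacement, and the use of a Hoeffding-style bound are all assumptions made here for illustration on a two-dimensional cube.

```python
import math
import random


def approx_range_sum(cells, lo, hi, threshold, n_samples, delta=0.05, seed=0):
    """Estimate the sum of cells[(x, y)] over the range lo <= (x, y) <= hi.

    Hypothetical helper, not the paper's algorithm. Returns
    (estimate, error_bound): under the assumptions below, the true sum lies
    within estimate +/- error_bound with probability at least 1 - delta.
    """
    rng = random.Random(seed)
    in_range = [v for (x, y), v in cells.items()
                if lo[0] <= x <= hi[0] and lo[1] <= y <= hi[1]]

    # Assumption: an outlier is any cell whose magnitude exceeds `threshold`.
    # Outliers are summed exactly, so they contribute no estimation error.
    outlier_sum = sum(v for v in in_range if abs(v) > threshold)
    rest = [v for v in in_range if abs(v) <= threshold]
    if not rest:
        return float(outlier_sum), 0.0

    # Uniform sampling with replacement from the non-outlier cells,
    # scaled up by the number of non-outlier cells in the range.
    draws = [rng.choice(rest) for _ in range(n_samples)]
    estimate = outlier_sum + len(rest) * sum(draws) / n_samples

    # Hoeffding's inequality: sampled values lie in [-threshold, threshold],
    # so the interval width is 2 * threshold. Removing outliers is what
    # keeps this width, and therefore the bound, small.
    width = 2 * threshold
    bound = len(rest) * width * math.sqrt(math.log(2 / delta) / (2 * n_samples))
    return estimate, bound
```

In this sketch a lower `threshold` moves more cells into the exact outlier set and shrinks the error bound, at the cost of more storage for outliers. A real synopsis would draw the samples once at construction time rather than per query.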
