Improving range-sum query evaluation on data cubes via polynomial approximation

Inefficient query answering is the main drawback of Decision Support Systems (DSS), owing to the very large size of the multidimensional data stored in the underlying Data Warehouse Server (DWS). Aggregate queries are the most frequent and useful kind of query for such systems, as they support several analyses based on the multidimensionality and multi-resolution of data. As a consequence, providing fast answers to aggregate queries (by trading off accuracy for efficiency, where possible) has become a very important requirement for improving the effectiveness of DSS-based applications. In this paper we present a technique for supporting approximate aggregate query answering in OLAP, which represents the most common application interface for a DWS. The technique is based on an analytical interpretation of multidimensional data and on the well-known least squares approximation (LSA) method. It consists of building data synopses by interpreting the original data distributions as a set of discrete functions. These synopses, called Δ-Syn, are obtained by approximating the data with a set of polynomial coefficients and storing these coefficients instead of the original data. Queries are issued on the compressed representation, thus reducing the number of disk accesses needed to evaluate the answers.
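The general idea of replacing data with polynomial coefficients can be illustrated in one dimension. The following is a minimal sketch, not the paper's Δ-Syn construction: it assumes a single bucket covering the whole array, uses NumPy's least squares polynomial fit, and the names `build_synopsis` and `approx_range_sum` are hypothetical, chosen only for this illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def build_synopsis(values, degree):
    """Fit a least squares polynomial to a 1-D array of cell values.

    Returns the degree + 1 coefficients (lowest order first), which are
    stored in place of the original values. (Hypothetical helper.)
    """
    x = np.arange(len(values))
    return P.polyfit(x, values, degree)


def approx_range_sum(coeffs, lo, hi):
    """Estimate sum(values[lo..hi]), bounds inclusive, from coefficients only.

    The polynomial is evaluated at each integer position in the range,
    so the original data is never read. (Hypothetical helper.)
    """
    x = np.arange(lo, hi + 1)
    return float(P.polyval(x, coeffs).sum())


if __name__ == "__main__":
    # Synthetic data: a smooth trend plus a small deterministic perturbation.
    positions = np.arange(64)
    data = 5.0 + 0.8 * positions + 0.02 * positions**2 + np.sin(positions)

    synopsis = build_synopsis(data, degree=3)  # 4 numbers instead of 64
    exact = data[10:41].sum()
    estimate = approx_range_sum(synopsis, 10, 40)
    print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

In this sketch the accuracy depends on how well a low-degree polynomial follows the data; a practical synopsis would presumably partition the cube into several regions and fit each one separately, which is an assumption here rather than something the abstract states.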
