Storage estimation for multidimensional aggregates in OLAP

On-line analytical processing (OLAP) is an important technique for analyzing data in decision support systems. Most analytical queries aggregate the data of interest, and pre-aggregation is one of the most important techniques for reducing query response time. However, precomputing every aggregate takes a large amount of time and space, so it is important to decide which aggregates to precompute and how much space they require. By estimating the storage space required for each aggregate view, we can allocate space for aggregates efficiently and decide which aggregates to precompute. We investigate four existing strategies for this problem: two based on mathematical approximations, one based on sampling, and one hybrid approach that combines mathematical approximation with sampling. We propose a new hybrid strategy that also combines mathematical approximation with sampling and is easy to compute. We evaluate the accuracy of these algorithms in estimating the storage explosion due to aggregation for different data distributions and data densities. The results indicate that our proposed strategy approximates the explosion more accurately than the other strategies.
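The abstract does not say which mathematical approximations are evaluated. As an illustration of the general idea only, the sketch below uses one well-known approximation, Cardenas' formula, which gives the expected number of distinct cells occupied when n rows fall uniformly and independently into m possible cells: m(1 - (1 - 1/m)^n). The function names, the uniformity assumption, and the example cardinalities are assumptions made for this sketch and are not taken from the paper.

```python
import math


def cardenas_estimate(num_rows: int, num_cells: int) -> float:
    """Expected number of distinct cells occupied when num_rows rows are
    placed uniformly and independently into num_cells possible cells.

    Assumes a uniform distribution; skewed data can occupy fewer cells.
    """
    if num_rows <= 0 or num_cells <= 0:
        return 0.0
    return num_cells * (1.0 - (1.0 - 1.0 / num_cells) ** num_rows)


def estimate_aggregate_rows(num_rows: int, cardinalities: list[int]) -> float:
    """Estimate the row count of a group-by over the given dimensions.

    The number of possible cells is the product of the dimension
    cardinalities (hypothetical inputs chosen by the caller).
    """
    num_cells = math.prod(cardinalities)
    return cardenas_estimate(num_rows, num_cells)


if __name__ == "__main__":
    # Hypothetical fact table: 1,000,000 rows, grouped by two dimensions
    # with 1,000 and 500 distinct values respectively.
    rows = 1_000_000
    estimate = estimate_aggregate_rows(rows, [1_000, 500])
    print(f"estimated aggregate rows: {estimate:.0f}")
```

The estimate can never exceed either the number of input rows or the number of possible cells, which makes it a convenient sanity bound when comparing candidate aggregates. Because it assumes uniformly distributed data, it can be inaccurate on skewed data, where sampling can help.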
