Efficient Online Aggregates in Dense-Region-Based Data Cube Representations

In-memory OLAP systems require a space-efficient representation of sparse data cubes in order to accommodate large data sets. On the other hand, many efficient online aggregation techniques, such as prefix sums, are built on dense array-based representations. These are often not applicable to real-world data because of the size of the arrays, which usually cannot be compressed well, as most sparsity is removed during pre-processing. A possible solution is to identify dense regions in a sparse cube and represent only those using arrays, while storing sparse data separately, e.g. in a spatial index structure. Previous dense-region-based approaches have concentrated mainly on the effectiveness of the dense-region detection (i.e. on the space-efficiency of the result). However, especially in higher-dimensional cubes, data is usually more cluttered, resulting in a potentially large number of small dense regions, which negatively affects query performance on such a structure. In this article, our focus is not only on space-efficiency but also on time-efficiency, both for the initial dense-region extraction and for queries carried out on the resulting hybrid data structure. We first describe a pre-aggregation method for representing dense sub-cubes that supports efficient online aggregate queries as well as cell updates, and then outline our sub-cube extraction approach in detail. In addition, optimizations in our approach significantly reduce the time to build the initial data structure compared to earlier systems. Two methods to trade available memory for increased aggregate query performance are provided. We also present a straightforward adaptation of our approach to multi-core or multi-processor architectures, which can further enhance query performance. Experiments with different real-world data sets show how various parameter settings can be used to adjust the efficiency and effectiveness of our algorithms.
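To make the prefix-sum technique mentioned above concrete, the following is a minimal sketch of a range-sum query on a dense two-dimensional array. It illustrates the basic prefix-sum idea only, not the pre-aggregation method proposed in this article; the function names `build_prefix_sums` and `range_sum` are hypothetical and chosen for illustration.

```python
def build_prefix_sums(cells):
    """Return P where P[i][j] is the sum of cells[0..i-1][0..j-1].

    P has one extra row and column of zeros so that queries need no
    special handling at the array border.
    """
    rows, cols = len(cells), len(cells[0])
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        running = 0  # sum of the current row up to column j
        for j in range(cols):
            running += cells[i][j]
            prefix[i + 1][j + 1] = prefix[i][j + 1] + running
    return prefix


def range_sum(prefix, r0, c0, r1, c1):
    """Sum of cells[r0..r1][c0..c1] (inclusive) via inclusion-exclusion."""
    return (prefix[r1 + 1][c1 + 1] - prefix[r0][c1 + 1]
            - prefix[r1 + 1][c0] + prefix[r0][c0])


# Example: a small dense region of a cube.
cells = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
prefix = build_prefix_sums(cells)
total = range_sum(prefix, 0, 0, 2, 2)         # whole region
lower_right = range_sum(prefix, 1, 1, 2, 2)   # cells 5, 6, 8, 9
```

In this plain variant a query reads a constant number of array entries for a fixed number of dimensions, but changing a single cell can invalidate a large part of the prefix array, which is why representations that also support efficient cell updates are of interest.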
