Range queries in OLAP data cubes

A range query applies an aggregation operation over all selected cells of an OLAP data cube, where the selection is specified by ranges of values for numeric dimensions. We present fast algorithms for range queries for two types of aggregation operations: SUM and MAX. These two operations cover the techniques required for most popular aggregation operations, such as those supported by SQL. For range-sum queries, the essential idea is to precompute auxiliary information (prefix sums) that is used to answer ad hoc queries at run time. By maintaining auxiliary information of the same size as the data cube, all range queries for a given cube can be answered in constant time, irrespective of the size of the sub-cube circumscribed by a query. Alternatively, one can keep auxiliary information that is 1/b^d of the size of the d-dimensional data cube. Answering a range query may then require access to some cells of the data cube in addition to the auxiliary information, but the overall time complexity is typically reduced significantly. We also discuss how the precomputed information is incrementally updated by batching updates to the data cube. Finally, we present algorithms for choosing the subset of the data cube dimensions for which the auxiliary information is computed and the blocking factor to use for each such subset. Our approach to answering range-max queries is based on precomputed max values over balanced hierarchical tree structures. We use a branch-and-bound-like procedure to speed up finding the max in a region, and we show that with this procedure the average-case complexity is much smaller than the worst-case complexity.
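The prefix-sum idea for range-sum queries can be illustrated with a minimal two-dimensional sketch. This is not the paper's implementation: the function names `build_prefix_sums` and `range_sum`, the sample cube, and the extra row and column of zeros (a convenience that avoids boundary checks, making the auxiliary array slightly larger than the cube) are all assumptions made for illustration. Once the array is built, each query takes four lookups regardless of the size of the queried region.

```python
def build_prefix_sums(cube):
    """Return P with P[i][j] = sum of cube[0..i-1][0..j-1].

    P has one extra leading row and column of zeros (an illustrative
    convenience, not something the abstract specifies).
    """
    rows, cols = len(cube), len(cube[0])
    P = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        row_running = 0  # running sum of the current row up to column j
        for j in range(cols):
            row_running += cube[i][j]
            P[i + 1][j + 1] = P[i][j + 1] + row_running
    return P


def range_sum(P, lo_i, hi_i, lo_j, hi_j):
    """Sum of cube[lo_i..hi_i][lo_j..hi_j] (inclusive bounds).

    Inclusion-exclusion over four prefix sums: constant time in two
    dimensions, independent of the size of the queried region.
    """
    return (P[hi_i + 1][hi_j + 1]
            - P[lo_i][hi_j + 1]
            - P[hi_i + 1][lo_j]
            + P[lo_i][lo_j])


# Hypothetical 3 x 4 cube used only for this example.
cube = [
    [3, 5, 1, 2],
    [7, 3, 2, 6],
    [2, 4, 2, 3],
]
P = build_prefix_sums(cube)
print(range_sum(P, 1, 2, 1, 3))  # rows 1..2, columns 1..3 -> 20
```

In d dimensions the same inclusion-exclusion uses 2^d prefix-sum lookups, which is constant for a fixed d.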
