PnP: sequential, external memory, and parallel iceberg cube computation

Abstract: We present “Pipe ’n Prune” (PnP), a new hybrid method for iceberg-cube query computation. Its novelty is a tight integration of top-down piping for data aggregation with bottom-up a priori data pruning. A particular strength of PnP is that it is efficient in all of the following scenarios: (1) sequential iceberg-cube queries, (2) external-memory iceberg-cube queries, and (3) parallel iceberg-cube queries on shared-nothing PC clusters with multiple disks. An extensive performance analysis of PnP in these scenarios gave the following main results. In the first scenario, PnP performs very well on both dense and sparse data sets, providing an interesting alternative to BUC and Star-Cubing. In the second scenario, PnP handles disk I/O surprisingly efficiently: its external-memory running time is less than twice the running time of a full in-memory computation of the same iceberg-cube query. In the third scenario, PnP scales very well, providing near-linear speedup for a larger number of processors and thereby solving the scalability problem observed for the parallel iceberg-cube methods proposed by Ng et al.
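For readers unfamiliar with the terminology, the following minimal sketch illustrates what an iceberg-cube query computes and how a priori pruning discards cells early. It is not the PnP algorithm and does no piping, external-memory handling, or parallelism. It is a naive in-memory illustration that assumes COUNT as the measure; the function name `iceberg_cube`, the parameter `min_sup`, and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def iceberg_cube(rows, dims, min_sup):
    """Naive iceberg cube with COUNT as the measure (illustration only).

    rows    -- list of dicts mapping dimension name to value
    dims    -- list of dimension names
    min_sup -- minimum count a cell needs to be kept

    Returns {group_by: {cell_values: count}}, where group_by is a tuple
    of dimension names and only cells with count >= min_sup are kept.
    """
    result = {}
    # Visit group-bys from fewest to most dimensions, so every group-by
    # with one dimension fewer is already available for pruning.
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            counts = Counter()
            for row in rows:
                # A priori pruning: COUNT is anti-monotone, so a cell can
                # reach min_sup only if each of its projections onto one
                # dimension fewer already reached min_sup.
                survives = all(
                    tuple(row[d] for d in group if d != drop)
                    in result[tuple(d for d in group if d != drop)]
                    for drop in group
                )
                if survives:
                    counts[tuple(row[d] for d in group)] += 1
            result[group] = {c: n for c, n in counts.items() if n >= min_sup}
    return result


rows = [
    {"store": "A", "item": "x"},
    {"store": "A", "item": "x"},
    {"store": "A", "item": "y"},
    {"store": "B", "item": "x"},
]
cube = iceberg_cube(rows, ["store", "item"], min_sup=2)
print(cube[("store", "item")])  # {('A', 'x'): 2}
```

In this example the rows for store B and item y are never counted in the two-dimensional group-by, because their one-dimensional projections already fell below the threshold.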

[1]  Laks V. S. Lakshmanan,et al.  Quotient Cube: How to Summarize the Semantics of a Data Cube , 2002, VLDB.

[2]  Andrew Rau-Chaplin,et al.  Computing Partial Data Cubes for Parallel Data Warehousing Applications , 2001, PVM/MPI.

[3]  David J. DeWitt,et al.  Parallel database systems: the future of high performance database systems , 1992, CACM.

[4]  Nick Roussopoulos,et al.  Cubetree: organization of and bulk incremental updates on the data cube , 1997, SIGMOD '97.

[5]  Ying Chen,et al.  Parallel ROLAP Data Cube Construction on Shared-Nothing Multiprocessors , 2003, Proceedings International Parallel and Distributed Processing Symposium.

[6]  Raghu Ramakrishnan,et al.  Bottom-up computation of sparse and Iceberg CUBE , 1999, SIGMOD '99.

[7]  Laks V. S. Lakshmanan,et al.  QC-trees: an efficient summary structure for semantic OLAP , 2003, SIGMOD '03.

[8]  Hongjun Lu,et al.  Fully Dynamic Partitioning: Handling Data Skew in Parallel Data Cube Computation , 2004, Distributed and Parallel Databases.

[9]  Ying Chen,et al.  Building large ROLAP data cubes in parallel , 2004, Proceedings. International Database Engineering and Applications Symposium, 2004. IDEAS '04.

[10]  Yannis Sismanis,et al.  Dwarf: shrinking the PetaCube , 2002, SIGMOD '02.

[11]  Alok N. Choudhary,et al.  High Performance OLAP and Data Mining on Parallel Computers , 1997, Data Mining and Knowledge Discovery.

[12]  Rajeev Motwani,et al.  Computing Iceberg Queries Efficiently , 1998, VLDB.

[13]  Andrew Rau-Chaplin,et al.  A cluster architecture for parallel data warehousing , 2001, Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid.

[14]  Alok N. Choudhary,et al.  A parallel scalable infrastructure for OLAP and data mining , 1999, Proceedings. IDEAS'99. International Database Engineering and Applications Symposium (Cat. No.PR00265).

[15]  Sunita Sarawagi,et al.  On computing the data cube , 1996.

[16]  Susanne E. Hambrusch,et al.  Parallelizing the Data Cube , 2001, Distributed and Parallel Databases.

[17]  Jeffrey F. Naughton,et al.  On the Computation of Multidimensional Aggregates , 1996, VLDB.

[18]  Jiawei Han,et al.  Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration , 2003, Very Large Data Bases Conference.

[19]  Raghu Ramakrishnan,et al.  Bottom-up computation of sparse and Iceberg CUBE , 1999.

[20]  Ying Chen,et al.  Parallel ROLAP Data Cube Construction on Shared-Nothing Multiprocessors , 2004, Distributed and Parallel Databases.

[21]  Jeffrey F. Naughton,et al.  An array-based algorithm for simultaneous multidimensional aggregates , 1997, SIGMOD '97.

[22]  Hamid Pirahesh,et al.  Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals , 1996, Data Mining and Knowledge Discovery.

[23]  Alok N. Choudhary,et al.  High performance multidimensional analysis of large datasets , 1998, DOLAP '98.

[24]  Hongjun Lu,et al.  Condensed cube: an effective approach to reducing data cube size , 2002, Proceedings 18th International Conference on Data Engineering.

[25]  Raymond T. Ng,et al.  Iceberg-cube computation with PC clusters , 2001, SIGMOD '01.

[26]  Jeffrey D. Ullman,et al.  Implementing data cubes efficiently , 1996, SIGMOD '96.

[27]  Masaru Kitsuregawa,et al.  A dynamic load balancing strategy for parallel datacube computation , 1999, DOLAP '99.

[28]  Kenneth A. Ross,et al.  Fast Computation of Sparse Datacubes , 1997, VLDB.