Building large ROLAP data cubes in parallel

The pre-computation of data cubes is critical to improving the response time of on-line analytical processing (OLAP) systems and can be instrumental in accelerating data mining tasks in large data warehouses. However, as data warehouses grow, the time this pre-computation takes becomes a significant performance bottleneck. This work presents a fast parallel method for generating ROLAP data cubes on a shared-nothing multiprocessor, based on a novel optimized data partitioning technique. Since no shared disk is required, the method can be applied on highly scalable processor clusters consisting of standard PCs with local disks, connected via a data switch. The approach, which uses a ROLAP representation of the data cube, is well suited to large data warehouses with high-dimensional data and supports the generation of both fully materialized and partially materialized cubes. In comparison with previous approaches, our new method significantly improves scalability with respect to both the number of processors and the I/O bandwidth (number of parallel disks). We have implemented our new parallel shared-nothing data cube generation method and evaluated it on a PC cluster, exploring relative speedup, scaleup, sizeup, output sizes, and data skew. For a fact table with 16 million rows and 8 attributes, our parallel data cube generation method achieves close to optimal speedup for as many as 32 processors, generating a full data cube in under 7 minutes. For a fact table with 256 million rows and 8 attributes, our parallel method achieves optimal speedup for 32 processors, generating a full data cube consisting of approximately 7 billion rows (200 gigabytes) in under 88 minutes.
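To make the term "full data cube" concrete, the following is a minimal sequential sketch, not the parallel partitioning method described above. It assumes a small in-memory fact table and a sum aggregate. The function name `full_cube` and the sample attributes (`store`, `item`, `year`, `sales`) are hypothetical and chosen only for illustration. A full cube over d dimension attributes contains one group-by (view) per subset of the attributes, so 2^d views in total; with 8 attributes that is 256 views.

```python
from collections import defaultdict
from itertools import combinations


def full_cube(rows, dims, measure):
    """Compute every group-by over `dims`, summing `measure`.

    Returns a dict mapping each view (a tuple of attribute names)
    to a dict of {group key tuple: aggregated value}.
    Illustrative only: each view is computed by a separate scan,
    with no sharing between views and no parallelism.
    """
    cube = {}
    for k in range(len(dims) + 1):
        for view in combinations(dims, k):
            agg = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in view)
                agg[key] += row[measure]
            cube[view] = dict(agg)
    return cube


# Hypothetical three-attribute fact table.
facts = [
    {"store": "A", "item": "x", "year": 2001, "sales": 3},
    {"store": "A", "item": "y", "year": 2001, "sales": 5},
    {"store": "B", "item": "x", "year": 2002, "sales": 7},
]

cube = full_cube(facts, ("store", "item", "year"), "sales")
print(len(cube))                 # number of views: 2**3
print(cube[("store",)])          # sales totals per store
```

A partially materialized cube would correspond to computing only a chosen subset of these views. In a ROLAP setting each view would be stored as a relational table rather than an in-memory dictionary.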
