High Performance OLAP and Data Mining on Parallel Computers

On-Line Analytical Processing (OLAP) techniques are increasingly being used in decision support systems to provide analysis of data. Queries posed on such systems are quite complex and require different views of the data. Analytical models need to capture the multidimensionality of the underlying data, a task for which multidimensional databases are well suited. Multidimensional OLAP systems store data in multidimensional arrays on which analytical operations are performed. Knowledge discovery and data mining require complex operations on the underlying data, which can be very expensive in terms of computation time. High performance parallel systems can reduce this analysis time. Precomputed aggregate calculations in a Data Cube can provide efficient query processing for OLAP applications. In this article, we present algorithms for construction of data cubes on distributed-memory parallel computers. Data is loaded from a relational database into a multidimensional array. We present two methods for loading the base cube, sort-based and hash-based, and compare their performance. Data cubes are used to answer the consolidation queries that arise in roll-up operations along dimension hierarchies. Finally, we show how data cubes are used for data mining using Attribute Focusing techniques. We present results for these on the IBM-SP2 parallel machine. Results show that our algorithms and techniques for OLAP and data mining on parallel systems are scalable to a large number of processors, providing a high performance platform for such applications.
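To make the idea of a precomputed data cube concrete, the following is a minimal sequential sketch, not the parallel algorithm presented in the article. It assumes a tiny in-memory relation and uses a dictionary as a stand-in for a hash-based load of the base cube; the function names `load_base_cube` and `build_data_cube`, the sample rows, and the dimension names are all hypothetical. Each of the 2^d group-bys is computed naively from the base cube, whereas a practical implementation would typically derive an aggregate from a smaller, already computed parent.

```python
from collections import defaultdict
from itertools import combinations


def load_base_cube(rows, dims, measure):
    """Hash-based load (illustrative): sum the measure per coordinate tuple."""
    base = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)  # coordinates in dimension order
        base[key] += row[measure]
    return dict(base)


def build_data_cube(base, dims):
    """Compute every group-by (2^d aggregates) directly from the base cube."""
    cube = {}
    for k in range(len(dims) + 1):
        for kept in combinations(range(len(dims)), k):
            agg = defaultdict(float)
            for key, value in base.items():
                # Project the coordinate onto the kept dimensions and sum.
                agg[tuple(key[i] for i in kept)] += value
            cube[tuple(dims[i] for i in kept)] = dict(agg)
    return cube


# Hypothetical sample relation: three dimensions and one measure.
rows = [
    {"product": "pen", "store": "north", "year": 1996, "sales": 10.0},
    {"product": "pen", "store": "south", "year": 1996, "sales": 5.0},
    {"product": "ink", "store": "north", "year": 1997, "sales": 7.0},
    {"product": "pen", "store": "north", "year": 1996, "sales": 3.0},
]
dims = ("product", "store", "year")

base = load_base_cube(rows, dims, "sales")
cube = build_data_cube(base, dims)

# A roll-up query becomes a lookup in a precomputed aggregate.
print(cube[("product",)][("pen",)])
print(cube[()][()])
```

Once the cube is built, a consolidation query such as total sales per product is a dictionary lookup rather than a scan of the relation, which is the benefit the precomputed aggregates are meant to provide.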
