A parallel scalable infrastructure for OLAP and data mining

Decision support systems are important for leveraging the information held in data warehouses in businesses such as banking, insurance, retail and health care. The multidimensional aspects of a business can be expressed naturally using a multidimensional data model. Data analysis and data mining on these warehouses pose new challenges for traditional database systems. OLAP and data mining operations require summary information on these multidimensional data sets, and query processing for these applications requires different views of the data for analysis and effective decision making. Data mining techniques can be applied in conjunction with OLAP for an integrated business solution. As data warehouses grow, parallel processing techniques have been applied to enable the use of larger data sets and to reduce analysis time, thereby allowing many more options to be evaluated for decision making. We address: (1) scalability in multidimensional systems for OLAP and multidimensional analysis; (2) integration of data mining with the OLAP framework; and (3) high performance through parallel processing for OLAP and data mining. We describe our system, PARSIMONY (Parallel and Scalable Infrastructure for Multidimensional Online analytical processing), a platform used for both OLAP and data mining. Sparsity of data sets is handled with sparse chunks, which are compressed using a bit-encoded sparse structure. We present techniques for mining association rules and for decision-tree-based classification that make effective use of the summary information available in data cubes and take advantage of the data organization provided by the multidimensional data model. Performance results for high-dimensional data sets on a distributed-memory parallel machine (IBM SP-2) show good speedup and scalability.
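To make the idea of a bit-encoded sparse chunk concrete, the following is a minimal sketch, not the PARSIMONY implementation. It assumes that each non-empty cell of a chunk is stored as a single integer key, formed by packing its per-dimension indices into the fewest bits each dimension needs, alongside the cell's value. The function names (`bits_needed`, `encode`, `decode`, `compress_chunk`, `aggregate`) and the exact bit layout are hypothetical choices made for illustration.

```python
def bits_needed(size):
    """Bits required to represent indices 0..size-1 (at least 1 bit)."""
    return max(1, (size - 1).bit_length())


def encode(coords, dim_sizes):
    """Pack one cell's per-dimension indices into a single integer key."""
    key = 0
    for index, size in zip(coords, dim_sizes):
        if not 0 <= index < size:
            raise ValueError("index out of range for dimension")
        key = (key << bits_needed(size)) | index
    return key


def decode(key, dim_sizes):
    """Recover the per-dimension indices from a packed key."""
    coords = []
    for size in reversed(dim_sizes):
        width = bits_needed(size)
        coords.append(key & ((1 << width) - 1))
        key >>= width
    return tuple(reversed(coords))


def compress_chunk(cells, dim_sizes):
    """Store only the non-empty cells as sorted (key, value) pairs.

    `cells` maps coordinate tuples to values; empty cells are simply absent
    (assumed input format for this sketch).
    """
    return sorted((encode(c, dim_sizes), v) for c, v in cells.items())


def aggregate(sparse_cells, dim_sizes, keep):
    """Sum values over every dimension not listed in `keep`.

    This mimics computing one summary view (a group-by) from the sparse chunk.
    """
    totals = {}
    for key, value in sparse_cells:
        coords = decode(key, dim_sizes)
        group = tuple(coords[d] for d in keep)
        totals[group] = totals.get(group, 0) + value
    return totals


if __name__ == "__main__":
    # Hypothetical 3-dimensional chunk: 4 x 3 x 10 cells, only 3 non-empty.
    dim_sizes = (4, 3, 10)
    cells = {(2, 1, 7): 5, (0, 0, 3): 2, (2, 2, 7): 4}
    sparse = compress_chunk(cells, dim_sizes)
    print(sparse)
    print(aggregate(sparse, dim_sizes, keep=(0,)))
```

In this sketch the three dimensions need 2, 2 and 4 bits, so each non-empty cell costs one 8-bit key plus its value, rather than a slot in a 120-cell dense array. Aggregating directly over the packed keys illustrates how summary views could be derived without expanding the chunk.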
