Data Guided Approach to Generate Multi-dimensional Schema for Targeted Knowledge Discovery

Data mining and data warehousing are two key technologies that have made significant contributions to knowledge discovery in a variety of domains. More recently, the integrated use of traditional data mining techniques, such as clustering and pattern recognition, with the data warehousing technique of Online Analytical Processing (OLAP) has motivated diverse research into knowledge discovery from complex real-world datasets. A number of such integrated methodologies have been proposed to extract knowledge from datasets, but most of them lack automated and generic methods for schema generation and knowledge extraction. Data analysts mostly have to rely on domain-specific knowledge and cope with technological constraints in order to discover knowledge from high-dimensional datasets. In this paper, we present a generic methodology that incorporates semi-automated knowledge extraction methods to provide data-driven assistance towards knowledge discovery. In particular, we provide a method for constructing a binary tree of hierarchical clusters and annotating each node in the tree with significant numeric variables. Additionally, we propose automated methods to rank nominal variables and to generate candidate multidimensional schemas with highly significant dimensions. We have performed case studies on three real-world datasets taken from the UCI Machine Learning Repository in order to validate the generality and applicability of our proposed methodology.
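
The pipeline outlined above (a binary cluster tree, per-node annotation with numeric variables, and a ranking of nominal variables) can be illustrated with a minimal sketch. The abstract does not state which linkage, significance measure, or ranking criterion the methodology uses, so everything here is an assumption made for illustration: centroid-linkage agglomerative clustering, a standardized difference of child means as the "significance" score, and information gain against the root split as the ranking criterion. The toy rows and the names `income`, `age`, `region`, and `gender` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2
from statistics import mean, pstdev

def centroid(points, members):
    # Column-wise mean of the rows belonging to one cluster.
    return [mean(col) for col in zip(*[points[i] for i in members])]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def build_tree(points):
    # Assumed technique: agglomerative clustering with centroid linkage.
    # Every merge creates an internal node, so the result is a binary tree.
    nodes = [{"members": [i]} for i in range(len(points))]
    while len(nodes) > 1:
        a, b = min(combinations(nodes, 2), key=lambda pair: dist(
            centroid(points, pair[0]["members"]),
            centroid(points, pair[1]["members"])))
        nodes = [n for n in nodes if n is not a and n is not b]
        nodes.append({"members": a["members"] + b["members"],
                      "left": a, "right": b})
    return nodes[0]

def annotate(node, points, names):
    # Assumed score: gap between child means, in units of the overall
    # standard deviation; the top-scoring variable labels the node.
    if "left" not in node:
        return
    scores = {}
    for j, name in enumerate(names):
        spread = pstdev([p[j] for p in points]) or 1.0
        left = mean([points[i][j] for i in node["left"]["members"]])
        right = mean([points[i][j] for i in node["right"]["members"]])
        scores[name] = abs(left - right) / spread
    node["significant"] = max(scores, key=scores.get)
    annotate(node["left"], points, names)
    annotate(node["right"], points, names)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    groups = {}
    for v, l in zip(values, labels):
        groups.setdefault(v, []).append(l)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def rank_nominal(node, nominal):
    # Assumed criterion: information gain of each nominal variable with
    # respect to the left/right split at this node, highest first.
    idx = sorted(node["members"])
    left = set(node["left"]["members"])
    labels = ["L" if i in left else "R" for i in idx]
    gains = {name: info_gain([col[i] for i in idx], labels)
             for name, col in nominal.items()}
    return sorted(gains, key=gains.get, reverse=True)

# Hypothetical toy data: two numeric and two nominal variables.
points = [(1.0, 5.0), (1.5, 5.5), (1.2, 4.8), (9.0, 5.1), (9.4, 5.3), (8.8, 4.9)]
names = ["income", "age"]
nominal = {"gender": ["f", "m", "f", "m", "f", "m"],
           "region": ["n", "n", "n", "s", "s", "s"]}

root = build_tree(points)
annotate(root, points, names)
ranking = rank_nominal(root, nominal)
# Candidate schema: top-ranked nominal variable as dimension,
# the root's annotated numeric variable as measure.
candidate = {"dimensions": ranking[:1], "measures": [root["significant"]]}
print(candidate)
```

In this sketch the highest-ranked nominal variables would serve as candidate dimensions and the annotated numeric variables as candidate measures; a real schema generator would repeat the ranking at each level of the tree rather than only at the root.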
