Integration of Data Mining and Data Warehousing: A Practical Methodology

The ever-growing repository of data in all fields poses new challenges to modern analytical systems. Real-world datasets, with mixed numeric and nominal variables, are difficult to analyze and require effective visual exploration that conveys the semantic relationships in the data. Traditional data mining techniques such as clustering handle only numeric data. Little research has addressed the problem of clustering high-cardinality nominal variables to gain better insight into the underlying dataset. Several works in the literature have demonstrated the feasibility of integrating data mining with data warehousing to discover knowledge from data. For seamless integration, the mined data has to be modeled in the form of a data warehouse schema. Schema generation is a complex manual task that requires familiarity with both the domain and warehousing. Automated techniques are therefore required to generate the warehouse schema and remove these dependencies. To meet growing analytical needs and overcome the existing limitations, we propose a novel methodology that permits efficient analysis of mixed numeric and nominal data, effective visual data exploration, automatic warehouse schema generation, and integration of data mining and warehousing. The proposed methodology is evaluated through a case study on a real-world dataset. The results show that multidimensional analysis can be performed in an easier and more flexible way to discover meaningful knowledge from large datasets.
