Analysis and summarization of correlations in data cubes and its application in microarray data analysis

This paper presents a novel mechanism to analyze and summarize the statistical correlations among the attributes of a data cube. To perform the analysis and summarization, this paper proposes a new measure of statistical significance. The main reason for proposing the new measure of statistical significance is to have an essential closure property, which is exploited in the summarization stage of the data mining process. In addition to the closure property, the proposed measure of statistical significance has two other important properties. First, the proposed measure of statistical significance is more conservative than the well-known chi-square test in classical statistics and, therefore, inherits its statistical robustness. This paper does not simply employ the chi-square test due to lack of the desired closure property, which may lead to a precision problem in the summarization process. The second additional property is that, though the proposed measure of statistical significance is more conservative than the chi-square test, for most cases, the proposed measure yields a value that is almost equal to the z test, a conventional measurement of statistical significance based on the normal distribution. Based on the closure property addressed above, this paper develops an algorithm to summarize the results from performing statistical analysis in the data cube. Though the proposed measure of statistical significance avoids the precision problem due to having the closure property, its conservative nature may lead to a recall rate problem in the data mining process. On the other hand, if the chi-square test, which does not have the closure property, was employed, then the summarization process may suffer a precision problem. In this paper, we also show a possible application in bioinformatics. We applied the proposed mechanism on a microarray dataset, in order to identify groups of genes with similar expression patterns in subspaces of feature space.

[1]  Philip S. Yu,et al.  Data Mining: An Overview from a Database Perspective , 1996, IEEE Trans. Knowl. Data Eng..

[2]  Heikki Mannila,et al.  Fast Discovery of Association Rules , 1996, Advances in Knowledge Discovery and Data Mining.

[3]  Hsueh-Fen Juan,et al.  Biomic study of human myeloid leukemia cells differentiation to macrophages using DNA array, proteomic, and bioinformatic analytical methods , 2002, Electrophoresis.

[4]  Surajit Chaudhuri,et al.  An overview of data warehousing and OLAP technology , 1997, SGMD.

[5]  Jiawei Han,et al.  Data Mining: Concepts and Techniques , 2000 .

[6]  Shinichi Morishita,et al.  Parallel Branch-and-Bound Graph Search for Correlated Association Rules , 1999, Large-Scale Parallel Data Mining.

[7]  Rajeev Motwani,et al.  Beyond market baskets: generalizing association rules to correlations , 1997, SIGMOD '97.

[8]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[9]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[10]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[11]  Laks V. S. Lakshmanan,et al.  Efficient mining of constrained correlated sets , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[12]  P. J. Green,et al.  Probability and Statistical Inference , 1978 .

[13]  Hamid Pirahesh,et al.  Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals , 1996, Data Mining and Knowledge Discovery.

[14]  Philip S. Yu,et al.  A new framework for itemset generation , 1998, PODS '98.

[15]  Tomasz Imielinski,et al.  Database Mining: A Performance Perspective , 1993, IEEE Trans. Knowl. Data Eng..

[16]  Philip S. Yu,et al.  Clustering by pattern similarity in large data sets , 2002, SIGMOD '02.

[17]  Laks V. S. Lakshmanan,et al.  Interactive Mining of Correlations - A Constraint Perspective , 1999, 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.

[18]  Rajeev Motwani,et al.  Scalable Techniques for Mining Causal Structures , 1998, Data Mining and Knowledge Discovery.