Algorithms for Efficient Mining of Statistically Significant Attribute Association Information

Knowledge of the associations among the attributes of a data set provides insight into the underlying structure of the data and explains the relationships (independence, synergy, redundancy) among the attributes and the class attribute, if one is present. Complex models learned computationally from the data become more interpretable to a human analyst when such interdependencies are known. In this paper, we focus on mining two types of association information among the attributes, correlation information and interaction information, for both supervised analysis (class attribute present) and unsupervised analysis (class attribute absent). Identifying the statistically significant attribute associations is computationally challenging: the number of possible associations grows exponentially with the number of attributes, and many associations carry redundant information when several correlated attributes are present. We explore efficient data mining methods for discovering non-redundant attribute sets that contain significant association information, indicating the presence of informative patterns in the data.
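
To make the two kinds of association information concrete, the following is a minimal sketch, not the paper's algorithm. It assumes discrete attributes and plug-in (empirical) entropy estimates. It takes total correlation as the sum of marginal entropies minus the joint entropy, and interaction information as an alternating sum of subset entropies, with the sign chosen so that synergy is positive and redundancy is negative. The function names, the sign convention, and the toy data are illustrative assumptions; the measures used in the paper itself may be defined differently.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(rows, cols):
    """Empirical joint Shannon entropy (in bits) of the selected columns."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def total_correlation(rows, cols):
    """Sum of marginal entropies minus the joint entropy (always >= 0)."""
    return sum(entropy(rows, (c,)) for c in cols) - entropy(rows, cols)


def interaction_information(rows, cols):
    """Alternating sum of subset entropies over all non-empty subsets.

    Sign convention (an assumption of this sketch): positive values
    indicate synergy, negative values indicate redundancy. For two
    attributes it reduces to ordinary mutual information.
    """
    k = len(cols)
    total = 0.0
    for size in range(1, k + 1):
        sign = (-1) ** (k - size)
        for subset in combinations(cols, size):
            total += sign * entropy(rows, subset)
    return -total


# Toy data (hypothetical): the third attribute is the XOR of the first two,
# so no pair is informative but the triple is fully determined (synergy).
xor_rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]

# Toy data (hypothetical): three identical copies of one bit (redundancy).
copy_rows = [(a, a, a) for a in (0, 1)]

xor_ii = interaction_information(xor_rows, (0, 1, 2))
xor_tc = total_correlation(xor_rows, (0, 1, 2))
copy_ii = interaction_information(copy_rows, (0, 1, 2))
copy_tc = total_correlation(copy_rows, (0, 1, 2))
```

On the XOR data both measures come out at 1 bit, while on the copied-bit data the interaction information is -1 bit and the total correlation is 2 bits. The inner loop visits every non-empty subset of the chosen attributes, so the cost of this direct computation grows exponentially with the size of the attribute set, which is the blow-up the abstract refers to.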
