Selecting the Right Correlation Measure for Binary Data

Finding the most interesting correlations among items is essential for problems in many commercial, medical, and scientific domains. Although numerous measures are available for evaluating correlations, different measures can produce drastically different results on the same data. Piatetsky-Shapiro proposed three properties that any reasonable correlation measure must satisfy, and Tan et al. proposed several further properties for categorizing correlation measures; however, it is still hard for users to choose a measure that fits their needs. To address this problem, we study the effectiveness of correlation measures in three ways. First, we propose two desirable properties and two optional properties for correlation measure selection, and we study which measures satisfy them. Second, we study different techniques for adjusting correlation measures and propose two new measures: the Simplified χ² with Continuity Correction and the Simplified χ² with Support. Third, we analyze the upper and lower bounds of different measures and categorize the measures by the differences between their bounds. Combining these three directions, we provide guidelines that help users choose the proper measure for their needs.
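To make the claim that different measures give different results concrete, the following minimal sketch computes a few standard textbook measures from a 2x2 contingency table for two binary items A and B. It is an illustration only: the function name `pair_measures` and the example counts are hypothetical, and the χ² variants shown are the usual Pearson statistic and its Yates continuity correction, not the Simplified χ² measures proposed in the paper, whose definitions are not given in this abstract. The sketch assumes non-degenerate margins (neither item occurs in none or all of the transactions).

```python
from math import sqrt


def pair_measures(n11, n10, n01, n00):
    """Standard measures for a 2x2 table of binary items A and B.

    n11: transactions with both A and B
    n10: A only, n01: B only, n00: neither
    Assumes 0 < P(A) < 1 and 0 < P(B) < 1 (hypothetical helper, not from the paper).
    """
    n = n11 + n10 + n01 + n00
    p_ab = n11 / n
    p_a = (n11 + n10) / n
    p_b = (n11 + n01) / n

    # Leverage (the Piatetsky-Shapiro measure): zero under independence.
    leverage = p_ab - p_a * p_b
    # Lift: ratio of observed to expected co-occurrence, one under independence.
    lift = p_ab / (p_a * p_b)
    # Phi coefficient: Pearson correlation of the two binary indicators.
    phi = leverage / sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    # Pearson chi-square for a 2x2 table equals n * phi^2.
    chi2 = n * phi ** 2

    # Yates continuity correction: shrink |n11*n00 - n10*n01| by n/2, floored at 0.
    det = abs(n11 * n00 - n10 * n01)
    corrected = max(det - n / 2, 0.0)
    margins = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    chi2_yates = n * corrected ** 2 / margins

    return {
        "leverage": leverage,
        "lift": lift,
        "phi": phi,
        "chi2": chi2,
        "chi2_yates": chi2_yates,
    }


# A frequent pair and a rare pair (made-up counts, for illustration only).
common = pair_measures(30, 10, 10, 50)
rare = pair_measures(1, 0, 0, 999)
print("common:", common)
print("rare:  ", rare)
```

With these made-up counts, lift ranks the rare pair far above the frequent one, while leverage ranks them the other way around, which is the kind of disagreement that makes measure selection difficult.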

[1]  Joan Feigenbaum,et al.  Finding highly correlated pairs efficiently with powerful pruning , 2006, CIKM '06.

[2]  Frederick Mosteller,et al.  Association and Estimation in Contingency Tables , 1968 .

[3]  Yiyu Yao,et al.  Peculiarity Oriented Multi-database Mining , 1999, PKDD.

[4]  A. Bate,et al.  A Bayesian neural network method for adverse drug reaction signal generation , 1998, European Journal of Clinical Pharmacology.

[5]  William Nick Street,et al.  Finding Maximal Fully-Correlated Itemsets in Large Databases , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[6]  Charles E. Heckler,et al.  Applied Multivariate Statistical Analysis , 2005, Technometrics.

[7]  Hui Xiong,et al.  TOP-COP: Mining TOP-K Strongly Correlated Pairs in Large Databases , 2006, Sixth International Conference on Data Mining (ICDM'06).

[8]  M. Newman,et al.  Finding community structure in very large networks. , 2004, Physical review. E, Statistical, nonlinear, and soft matter physics.

[9]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[10]  H. Everett.  "Relative State" Formulation of Quantum Mechanics , 1957 .

[11]  Chris Jermaine,et al.  Finding the most interesting correlations in a database: how hard can it be? , 2005, Inf. Syst..

[12]  Jiawei Han,et al.  ACM Transactions on Knowledge Discovery from Data: Introduction , 2007 .

[13]  G. Niklas Norén,et al.  Temporal pattern discovery for trends and transient effects: its application to patient records , 2008, KDD.

[14]  Christophe G. Giraud-Carrier,et al.  Behavior-based clustering and analysis of interestingness measures for association rule mining , 2014, Data Mining and Knowledge Discovery.

[15]  Edward Omiecinski,et al.  Alternative Interest Measures for Mining Associations in Databases , 2003, IEEE Trans. Knowl. Data Eng..

[16]  C. Garvan,et al.  Proportions, odds, and risk. , 2004, Radiology.

[17]  Gregory Piatetsky-Shapiro,et al.  Discovery, Analysis, and Presentation of Strong Rules , 1991, Knowledge Discovery in Databases.

[18]  Yanchi Liu,et al.  Speeding up correlation search for binary data , 2013, Pattern Recognit. Lett..

[19]  H. T. Reynolds,et al.  The analysis of cross-classifications , 1977 .

[20]  Hua Xu,et al.  Comparative analysis of pharmacovigilance methods in the detection of adverse drug reactions using electronic medical records , 2013, J. Am. Medical Informatics Assoc..

[21]  Vipin Kumar,et al.  Introduction to Data Mining , 2022, Data Mining and Machine Learning Applications.

[22]  Ted Dunning,et al.  Accurate Methods for the Statistics of Surprise and Coincidence , 1993, CL.

[23]  Vipin Kumar,et al.  Introduction to Data Mining, (First Edition) , 2005 .

[24]  William DuMouchel,et al.  Bayesian Data Mining in Large Frequency Tables, with an Application to the FDA Spontaneous Reporting System , 1999 .

[25]  Jaideep Srivastava,et al.  Selecting the right objective measure for association analysis , 2004, Inf. Syst..

[26]  Howard J. Hamilton,et al.  Interestingness measures for data mining: A survey , 2006, CSUR.

[27]  Ning Zhong,et al.  Dynamically Organizing KDD Processes , 2001, Int. J. Pattern Recognit. Artif. Intell..

[28]  Pang-Ning Tan,et al.  Interestingness Measures for Association Patterns : A Perspective , 2000, KDD 2000.

[29]  Hui Xiong,et al.  TAPER: a two-step approach for all-strong-pairs correlation query in large databases , 2006, IEEE Transactions on Knowledge and Data Engineering.

[30]  Yanchi Liu,et al.  Community detection in graphs through correlation , 2014, KDD.

[31]  Rajeev Motwani,et al.  Beyond market baskets: generalizing association rules to correlations , 1997, SIGMOD '97.

[32]  Rajeev Motwani,et al.  Dynamic itemset counting and implication rules for market basket data , 1997, SIGMOD '97.