Speeding up correlation search for binary data

Searching for correlated pairs in a collection of items is essential for many problems in commercial, medical, and scientific domains. Recently, much progress has been made in speeding up the search for pairs that have a high Pearson correlation (φ-coefficient). However, the φ-coefficient is neither the only correlation measure nor the best one. In this paper, we address an alternative task: finding correlated pairs under any "good" correlation measure, that is, any measure that satisfies the three widely accepted correlation properties given in Section 2.1. We identify a 1-dimensional monotone property of the upper bound of any "good" correlation measure, and different 2-dimensional monotone properties for different types of correlation measures. These properties let us prune many pairs without computing their correlations, using either the 2-dimensional search algorithm to retrieve correlated pairs above a given threshold or our new token-ring algorithm to find the top-k correlated pairs. The experimental results show that our algorithm searches for correlated pairs efficiently and robustly across different settings and is an order of magnitude faster than the brute-force method.
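To make the pruning idea concrete, the following is a minimal sketch for the φ-coefficient only, not the general algorithm of this paper. It assumes the well-known support-based upper bound for φ: given only the two item counts, φ is largest when the rarer item always co-occurs with the more frequent one. For a fixed item A, that bound shrinks as the support of the partner item B falls, so with items sorted by decreasing support the inner loop can stop at the first pair whose bound drops below the threshold. The function names (`phi`, `phi_upper_bound`, `correlated_pairs`) and the toy data are hypothetical and chosen for illustration.

```python
from math import sqrt


def phi(n, n_a, n_b, n_ab):
    """Phi coefficient of two binary items from counts over n transactions."""
    denom = sqrt(n_a * (n - n_a) * n_b * (n - n_b))
    # Phi is undefined when an item occurs in none or all transactions;
    # this sketch returns 0.0 in that case (an assumption, not a convention
    # taken from the paper).
    return (n * n_ab - n_a * n_b) / denom if denom else 0.0


def phi_upper_bound(n, n_a, n_b):
    """Largest phi possible given only the two item counts."""
    hi, lo = max(n_a, n_b), min(n_a, n_b)
    # The co-occurrence count can be at most the smaller item count.
    return phi(n, hi, lo, lo)


def correlated_pairs(transactions, threshold):
    """Return (a, b, phi) for all pairs with phi >= threshold (threshold > 0)."""
    n = len(transactions)
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    # Sort by decreasing support; ties are broken by item name for determinism.
    order = sorted(tidsets, key=lambda i: (-len(tidsets[i]), i))
    result = []
    for pos, a in enumerate(order):
        n_a = len(tidsets[a])
        for b in order[pos + 1:]:
            n_b = len(tidsets[b])
            if phi_upper_bound(n, n_a, n_b) < threshold:
                # Every later b has support <= n_b, so its bound is no larger.
                break
            value = phi(n, n_a, n_b, len(tidsets[a] & tidsets[b]))
            if value >= threshold:
                result.append((a, b, value))
    return result


# Toy data: each transaction is a set of item names.
transactions = [
    {"a", "b"}, {"a", "b"}, {"a", "b", "c"},
    {"c"}, {"c", "d"}, {"d"},
]
pairs = correlated_pairs(transactions, threshold=0.8)
```

In this toy data, the pair (a, d) is skipped without counting co-occurrences, because its bound from the supports alone is already below 0.8. Other correlation measures would need their own upper bound in place of `phi_upper_bound`.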
