TOP-COP: Mining TOP-K Strongly Correlated Pairs in Large Databases

Recently, there has been considerable interest in computing strongly correlated pairs in large databases. Most previous studies require the user to specify a minimum correlation threshold. However, choosing an appropriate threshold can be difficult in practice, since different data sets typically have different characteristics. To this end, we propose an alternative task: mining the top-k strongly correlated pairs. In this paper, we identify a 2-D monotone property of an upper bound of Pearson's correlation coefficient and develop an efficient algorithm, called TOP-COP, that exploits this property to prune many pairs without computing their correlation coefficients. Our experimental results show that the TOP-COP algorithm can be orders of magnitude faster than brute-force alternatives for mining the top-k strongly correlated pairs.
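
To illustrate the pruning idea, the following is a minimal sketch, not the TOP-COP algorithm itself. It assumes binary transaction data, where Pearson's correlation reduces to the phi coefficient, and it uses the support-based bound that follows from supp(A,B) <= min(supp(A), supp(B)): for supp(A) >= supp(B), phi(A,B) <= sqrt(supp(B)/supp(A)) * sqrt((1 - supp(A))/(1 - supp(B))). The function names (`phi`, `upper_bound`, `top_k_pairs`) and the simple pair-by-pair scan are illustrative assumptions; the sketch does not reproduce the traversal order that the 2-D monotone property enables.

```python
import heapq
from itertools import combinations
from math import sqrt


def phi(sa, sb, sab):
    """Phi coefficient (Pearson's correlation for binary items) from supports."""
    return (sab - sa * sb) / sqrt(sa * sb * (1 - sa) * (1 - sb))


def upper_bound(sa, sb):
    """Support-based upper bound of phi; needs only the two item supports."""
    hi, lo = max(sa, sb), min(sa, sb)
    return sqrt(lo / hi) * sqrt((1 - hi) / (1 - lo))


def top_k_pairs(transactions, k):
    """Return (top-k list of (phi, a, b), number of exact phi computations).

    Hypothetical helper: a plain scan over all pairs that skips any pair
    whose upper bound cannot beat the current k-th best correlation.
    """
    n = len(transactions)
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    # Items present in every transaction have support 1, so phi is undefined.
    items = sorted(
        (i for i in tidsets if len(tidsets[i]) < n),
        key=lambda i: (-len(tidsets[i]), i),
    )

    heap = []      # min-heap; heap[0] holds the current k-th best pair
    computed = 0   # how many pairs needed an exact correlation
    for a, b in combinations(items, 2):
        sa, sb = len(tidsets[a]) / n, len(tidsets[b]) / n
        if len(heap) == k and upper_bound(sa, sb) <= heap[0][0]:
            continue  # pruned without touching the joint support
        sab = len(tidsets[a] & tidsets[b]) / n
        computed += 1
        entry = (phi(sa, sb, sab), a, b)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True), computed


if __name__ == "__main__":
    data = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"c"}, {"d"}, {"c", "d"}]
    top, computed = top_k_pairs(data, k=1)
    print(top, "exact computations:", computed)
```

In this toy data set, items `a` and `b` always co-occur, so once that pair is in the heap every remaining pair is skipped on the strength of its upper bound alone. Pairs that merely tie the current k-th best are also skipped, which is a design choice of this sketch.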
