Finding the most interesting correlations in a database: how hard can it be?

This paper addresses foundational issues in discovering the best few correlations in a database. Specifically, we consider the computational complexity of various definitions of the "top-k correlation problem," where the goal is to discover the few sets of events whose co-occurrence exhibits the smallest degree of independence. Our results show that many rigorous definitions of correlation lead to intractable and strongly inapproximable problems. The proof of inapproximability is significant, since similar problems studied by the computer science theory community have resisted such analysis. One goal of this paper, and of future research, is to develop alternative correlation metrics that both allow efficient search and produce results that satisfy users.
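
To make the problem statement concrete, the following is a minimal brute-force sketch of one possible top-k correlation search. It is not the paper's method: the function name `top_k_correlations`, the `max_size` cap, and the choice of lift (observed co-occurrence frequency divided by the frequency expected under independence) as the correlation score are all assumptions made for illustration. The abstract does not fix a particular metric.

```python
from itertools import combinations
from math import prod


def top_k_correlations(transactions, k, max_size=3):
    """Return the k itemsets with the highest lift (hypothetical metric).

    transactions: iterable of collections of hashable, sortable items.
    max_size: illustrative cap on itemset size to keep enumeration finite.
    """
    rows = [frozenset(t) for t in transactions]
    n = len(rows)
    items = sorted(set().union(*rows))
    # Marginal frequency of each item; positive for every item that occurs.
    marginal = {i: sum(i in r for r in rows) / n for i in items}

    scored = []
    for size in range(2, max_size + 1):
        # Exhaustive enumeration: the number of candidate sets grows
        # combinatorially with the number of items and the set size.
        for combo in combinations(items, size):
            observed = sum(all(i in r for i in combo) for r in rows) / n
            expected = prod(marginal[i] for i in combo)
            scored.append((observed / expected, combo))

    # Highest lift first; ties broken by itemset for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


example = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
best = top_k_correlations(example, k=1)
```

In this small example, `a` and `b` always occur together, so the pair `("a", "b")` receives the highest score. The sketch scores every candidate set explicitly, which is only feasible for very small inputs; it illustrates the search space rather than an efficient algorithm.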
