Efficient Discovery of Statistically Significant Association Rules

Searching for statistically significant association rules is an important but neglected problem. Traditional association rules do not capture statistical dependence, so the resulting rules can be spurious while the most significant rules may be missing. This leads to erroneous models and predictions, which often prove costly. The problem is computationally very difficult because significance is not a monotonic property. However, in this paper we prove several other properties that can be used to prune the search space. The properties are implemented in the StatApriori algorithm, which searches for statistically significant, non-redundant association rules. Both theoretical and empirical observations indicate that the resulting rules are very accurate compared to traditional association rules. In addition, StatApriori can work with extremely low frequencies, thus finding new interesting rules.
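To make the contrast between traditional rule measures and statistical dependence concrete, the following is a minimal sketch and not the StatApriori algorithm itself. It assumes a rule X -> A summarized by four counts and uses a one-sided Fisher exact test as a stand-in significance measure; the function name `rule_stats`, its parameters, and the example counts are all hypothetical and chosen only for illustration.

```python
from math import comb

def rule_stats(n, n_x, n_a, n_xa):
    """Return (confidence, lift, p_value) for a rule X -> A.

    n    : total number of rows
    n_x  : rows containing X
    n_a  : rows containing A
    n_xa : rows containing both X and A
    The p-value is the one-sided Fisher exact (hypergeometric) tail,
    used here as an illustrative significance measure (an assumption).
    """
    confidence = n_xa / n_x
    lift = confidence / (n_a / n)
    # Probability of seeing at least n_xa co-occurrences if X and A
    # were independent, given the fixed marginal counts.
    upper = min(n_x, n_a)
    tail = sum(comb(n_a, k) * comb(n - n_a, n_x - k)
               for k in range(n_xa, upper + 1))
    p_value = tail / comb(n, n_x)
    return confidence, lift, p_value

# A frequent, high-confidence rule that is nevertheless independent:
# A occurs in 90% of all rows, with or without X.
spurious = rule_stats(n=1000, n_x=100, n_a=900, n_xa=90)

# A rare rule (X and A co-occur in only 0.8% of rows) with strong dependence.
rare = rule_stats(n=1000, n_x=10, n_a=20, n_xa=8)
```

In this toy data the first rule has confidence 0.9 but lift 1.0, so a confidence threshold would accept it although X tells us nothing about A. The second rule would be discarded by a typical minimum-frequency threshold, yet its p-value is very small.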
