StatApriori: an efficient algorithm for searching statistically significant association rules

Searching for statistically significant association rules is an important but neglected problem. Traditional association rules do not capture the idea of statistical dependence, so the resulting rules can be spurious while the most significant rules may be missing. This leads to erroneous models and predictions, which are often costly. The problem is computationally very difficult because statistical significance is not a monotonic property. However, in this paper we prove several other properties that can be used for pruning the search space. These properties are implemented in the StatApriori algorithm, which searches for statistically significant, non-redundant association rules. Empirical experiments show that StatApriori is very efficient while at the same time finding good-quality rules.
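
To make the point about spurious rules concrete, the following minimal sketch computes support, confidence, and lift for a single rule A -> B. It is an illustration only, not part of StatApriori: the function name `rule_measures` and the transaction counts are hypothetical, chosen so that A and B are exactly independent.

```python
def rule_measures(n_ab, n_a, n_b, n):
    """Return (support, confidence, lift) of the rule A -> B from raw counts.

    n_ab: transactions containing both A and B
    n_a:  transactions containing A
    n_b:  transactions containing B
    n:    total number of transactions
    """
    support = n_ab / n                  # P(A, B)
    confidence = n_ab / n_a             # P(B | A)
    lift = (n_ab * n) / (n_a * n_b)     # P(A, B) / (P(A) * P(B))
    return support, confidence, lift


# Hypothetical data: 100 transactions, A occurs in 80, B in 90, both in 72.
support, confidence, lift = rule_measures(n_ab=72, n_a=80, n_b=90, n=100)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

With these counts the rule passes typical support and confidence thresholds, yet its lift is 1: B is just as likely without A as with it, so the rule expresses no statistical dependence.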
