Controlling False Positives in Association Rule Mining

Association rule mining is an important problem in data mining. It enumerates and tests a large number of rules on a dataset and outputs the rules that satisfy user-specified constraints. Because so many rules are tested, rules that do not represent any real systematic effect in the data can satisfy the given constraints purely by chance. Association rule mining therefore carries a high risk of false positive errors, yet there has been no comprehensive study of how to control false positives in this setting. In this paper, we adopt three multiple testing correction approaches (the direct adjustment approach, the permutation-based approach, and the holdout approach) to control false positives in association rule mining, and we conduct extensive experiments to study their performance. Our results show that (1) numerous spurious rules are generated if no correction is made, and (2) all three approaches control false positives effectively. Among the three, the permutation-based approach has the highest power to detect real association rules, but it is computationally very expensive. We employ several techniques to reduce its cost effectively.
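To make the direct adjustment idea concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes each rule X → c is scored with a one-sided Fisher exact test and that the significance level is divided by the number of rules tested (a Bonferroni correction, one common form of direct adjustment). The function names `fisher_right_tail` and `bonferroni_select` and the example counts are hypothetical.

```python
from math import comb


def fisher_right_tail(n, n_x, n_c, n_xc):
    """One-sided Fisher exact p-value for a rule X -> c (illustrative helper).

    n    : number of records
    n_x  : records containing the rule's left-hand side X
    n_c  : records with class/consequent c
    n_xc : records containing both X and c
    Returns P(overlap >= n_xc) under the hypergeometric null of independence.
    """
    upper = min(n_x, n_c)
    favourable = sum(
        comb(n_c, k) * comb(n - n_c, n_x - k) for k in range(n_xc, upper + 1)
    )
    return favourable / comb(n, n_x)


def bonferroni_select(p_values, alpha=0.05, n_tests=None):
    """Return indices of p-values that pass a Bonferroni-adjusted threshold.

    n_tests is the number of rules actually tested; it defaults to
    len(p_values) but should count every rule examined, not only those kept.
    """
    m = n_tests if n_tests is not None else len(p_values)
    threshold = alpha / m
    return [i for i, p in enumerate(p_values) if p <= threshold]


# Hypothetical counts (n, n_x, n_c, n_xc) for three candidate rules.
rules = [(100, 20, 30, 18), (100, 20, 30, 8), (100, 50, 50, 27)]
p_values = [fisher_right_tail(*r) for r in rules]
significant = bonferroni_select(p_values, alpha=0.05)
```

The key design point is the denominator: dividing `alpha` by the number of rules tested keeps the chance of reporting any spurious rule at or below `alpha`, at the price of reduced power when many rules are tested.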
