Discovering Significant Patterns

Abstract

Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy user-specified constraints. Because so many patterns are considered, these techniques suffer from an extreme risk of type-1 error, that is, of finding patterns that appear to satisfy the constraints on the sample data due to chance alone. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and, when applied to real-world data, produce large numbers of patterns that are rejected when subjected to sound statistical evaluation. The studies also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
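As an illustration of what bounding experimentwise error can look like in practice, the following minimal sketch applies two standard multiple-testing corrections to a set of candidate patterns. The abstract does not say which corrections the paper uses, so the choice of the Bonferroni and Holm procedures here is an assumption, and the function names, pattern labels, and p-values are hypothetical.

```python
from typing import Dict, List


def bonferroni_accept(p_values: Dict[str, float],
                      search_space_size: int,
                      alpha: float = 0.05) -> List[str]:
    """Keep patterns whose p-value passes a Bonferroni-adjusted threshold.

    search_space_size is the number of patterns considered during the
    search (an assumed input), not just the number that were reported.
    """
    if search_space_size < len(p_values):
        raise ValueError("search space cannot be smaller than the pattern set")
    threshold = alpha / search_space_size
    return [name for name, p in p_values.items() if p <= threshold]


def holm_accept(p_values: Dict[str, float], alpha: float = 0.05) -> List[str]:
    """Keep patterns accepted by the Holm step-down procedure."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    accepted = []
    for i, (name, p) in enumerate(ordered):
        # The i-th smallest p-value (0-indexed) is compared with alpha / (m - i).
        if p > alpha / (m - i):
            break  # all remaining, larger p-values are rejected as well
        accepted.append(name)
    return accepted


# Hypothetical p-values for four candidate rules.
candidates = {"rule_a": 0.001, "rule_b": 0.015, "rule_c": 0.02, "rule_d": 0.3}
bonferroni_kept = bonferroni_accept(candidates, search_space_size=4)
holm_kept = holm_accept(candidates)
```

Both procedures keep the probability of accepting any spurious pattern at or below alpha. Holm is never less powerful than Bonferroni over the same set of tests, which is one small example of how the choice of test procedure affects power.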
