Discovering significant patterns

Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Because so many patterns are considered, these techniques run an extreme risk of type-1 error, that is, of finding patterns that satisfy the constraints on the sample data by chance alone. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and, when applied to real-world data, yield large numbers of patterns that are rejected when subjected to sound statistical evaluation. The studies also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
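The abstract does not say which statistical practices are applied. As an illustration only, the sketch below shows two standard ways of bounding experimentwise (familywise) error when many candidate patterns are tested at once: the Bonferroni adjustment and Holm's sequentially rejective procedure. The function names, the example p-values, and the choice of these two procedures are assumptions made for this sketch, not details taken from the paper.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis i only if p_i <= alpha / m, where m is the number of tests.

    This bounds the probability of one or more false discoveries by alpha.
    """
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure (illustrative sketch).

    P-values are visited from smallest to largest. The k-th smallest
    (k = 0, 1, ...) is compared with alpha / (m - k); the procedure stops
    at the first failure and accepts all remaining hypotheses.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value is also not rejected
    return reject


# Hypothetical p-values for four candidate patterns.
example_p = [0.001, 0.04, 0.03, 0.015]
bonferroni_result = bonferroni_reject(example_p)
holm_result = holm_reject(example_p)
```

On these hypothetical values, Holm's procedure accepts one pattern that Bonferroni rejects, while both keep the experimentwise error rate at or below `alpha`. This is one way in which the choice of test procedure can affect power.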
