Statistical Emerging Pattern Mining with Multiple Testing Correction

Emerging patterns are patterns whose support differs significantly between two databases. We study the problem of listing emerging patterns with a multiple testing guarantee. Recently, Terada et al. proposed the Limitless Arity Multiple-testing Procedure (LAMP), which controls the family-wise error rate (FWER) in statistical association mining. LAMP reduces the number of "untestable" hypotheses without compromising its statistical power. Still, FWER is a restrictive criterion, and as a result the statistical power of LAMP is inherently unsatisfactory when the number of patterns is large. The false discovery rate (FDR), by contrast, is less restrictive than FWER, and thus controlling FDR yields a larger number of significant patterns. We propose two emerging pattern mining methods: the first controls FWER, and the second controls FDR. The effectiveness of both methods is verified in computer simulations with real-world datasets.
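To illustrate why FDR control tends to report more patterns than FWER control, here is a minimal Python sketch that applies the classical Bonferroni correction (FWER) and the Benjamini-Hochberg step-up procedure (FDR) to the same list of p-values. This is not the mining method proposed in the paper: the function names and the p-values are hypothetical and chosen only for illustration, and the Benjamini-Hochberg guarantee assumes independent or positively dependent tests.

```python
def bonferroni(pvalues, alpha=0.05):
    """FWER control: reject p_i when p_i <= alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """FDR control: reject the k smallest p-values, where k is the
    largest rank such that p_(k) <= (k / m) * alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


# Hypothetical p-values, one per candidate pattern (illustration only).
pvalues = [0.001, 0.008, 0.012, 0.020, 0.030, 0.200, 0.500, 0.900]

n_fwer = sum(bonferroni(pvalues))
n_fdr = sum(benjamini_hochberg(pvalues))
print(f"Bonferroni (FWER) rejections: {n_fwer}")
print(f"Benjamini-Hochberg (FDR) rejections: {n_fdr}")
```

On this toy input, Bonferroni rejects a single hypothesis while Benjamini-Hochberg rejects five, which mirrors the abstract's point that the less restrictive FDR criterion yields more significant patterns at the same nominal level.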
