Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing

We present a novel algorithm for significant pattern mining, Westfall-Young light. The target patterns are those that are statistically significantly enriched in one of two classes of objects. Our method corrects for multiple hypothesis testing and for correlations between patterns via the Westfall-Young permutation procedure, which empirically estimates the null distribution of pattern frequencies in each class by permuting the class labels. In our experiments on popular real-world benchmark datasets for pattern mining, Westfall-Young light dramatically outperforms the current state-of-the-art approach in both runtime and memory efficiency. The key to this efficiency is that, unlike all existing methods, our algorithm neither solves the underlying frequent pattern mining problem anew for each permutation nor stores the occurrence lists of all frequent patterns. Westfall-Young light opens the door to significant pattern mining on large datasets that previously involved prohibitive runtime or memory costs. Our code is available from http://www.bsse.ethz.ch/mlcb/research/machine-learning/wylight.html
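
For orientation, the permutation procedure that the abstract refers to can be sketched in its naive, brute-force form. This is not Westfall-Young light itself: the sketch assumes the patterns have already been mined and are held in memory as occurrence lists, which is exactly the cost the algorithm above is designed to avoid. The function names (`fisher_pvalue`, `westfall_young_adjusted`), the choice of a two-sided Fisher exact test, and the input layout are illustrative assumptions, not taken from the released code.

```python
import math
import random


def fisher_pvalue(a, x, n1, n):
    """Two-sided Fisher exact p-value for one pattern.

    a  : objects of class 1 that contain the pattern
    x  : objects (any class) that contain the pattern
    n1 : number of objects in class 1
    n  : total number of objects
    """
    lo, hi = max(0, x - (n - n1)), min(x, n1)
    denom = math.comb(n, x)

    def prob(k):
        # Hypergeometric probability of seeing k class-1 occurrences.
        return math.comb(n1, k) * math.comb(n - n1, x - k) / denom

    p_obs = prob(a)
    # Sum the probabilities of all tables at most as likely as the observed one.
    total = sum(p for p in (prob(k) for k in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def westfall_young_adjusted(occurrences, labels, n_perm=1000, seed=0):
    """Naive single-step min-p adjusted p-values (assumed baseline, not WY light).

    occurrences : list of index lists, one per pattern (hypothetical input layout)
    labels      : list of 0/1 class labels, one per object
    """
    rng = random.Random(seed)
    n, n1 = len(labels), sum(labels)

    def pvalues(lab):
        return [fisher_pvalue(sum(lab[i] for i in occ), len(occ), n1, n)
                for occ in occurrences]

    observed = pvalues(labels)

    # Null distribution of the minimum p-value over all patterns,
    # estimated by shuffling the class labels.
    permuted = list(labels)
    min_ps = []
    for _ in range(n_perm):
        rng.shuffle(permuted)
        min_ps.append(min(pvalues(permuted)))

    # Adjusted p-value: share of permutations whose minimum p-value
    # is at least as extreme as the observed p-value of the pattern.
    return [sum(m <= p for m in min_ps) / n_perm for p in observed]
```

A pattern would then be reported as significant when its adjusted p-value is at most the target family-wise error rate, for example 0.05. Each permutation here re-tests every pattern, so the cost grows with the number of patterns times the number of permutations, which illustrates why avoiding repeated mining and stored occurrence lists matters.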
