Significant Subgraph Mining with Multiple Testing Correction

Finding itemsets that are statistically significantly enriched in a class of transactions is complicated by the need to correct for multiple hypothesis testing. Pruning untestable hypotheses was recently proposed as a strategy for this task of significant itemset mining. On real-world datasets, it was shown to yield greater statistical power, that is, the discovery of more truly significant itemsets, than the standard Bonferroni correction. An open question, however, is whether excluding untestable hypotheses also yields greater statistical power in subgraph mining, where the number of hypotheses is much larger than in itemset mining. Here we answer this question through an empirical investigation on eight popular graph benchmark datasets. We propose a new, efficient search strategy that always returns the same solution as the state-of-the-art approach and is approximately two orders of magnitude faster. Moreover, we exploit the dependence between subgraphs by considering the effective number of tests, thereby further increasing the statistical power.
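To make the idea of pruning untestable hypotheses concrete, the following is a minimal sketch and not the search strategy proposed here. It assumes a one-sided Fisher exact test on a 2x2 table, where a pattern occurs in `x` of `n` objects and `n1` objects belong to the class of interest. Under that assumption, the smallest p-value a pattern can ever attain depends only on `x`, `n` and `n1`, so a pattern whose minimum attainable p-value already exceeds the corrected threshold can be excluded from the correction factor. The function names `min_p_value` and `tarone_threshold` and the example supports are hypothetical, chosen for illustration only.

```python
from math import comb


def min_p_value(x, n, n1):
    """Smallest one-sided Fisher exact p-value attainable by a pattern
    that occurs in x of n objects, n1 of which are in the target class.

    The most extreme table puts as many occurrences as possible into
    the target class, so the p-value equals one hypergeometric term.
    """
    a = min(x, n1)  # occurrences placed in the target class
    return comb(n1, a) * comb(n - n1, x - a) / comb(n, x)


def tarone_threshold(supports, n, n1, alpha=0.05):
    """Return (threshold, k): the smallest k such that at most k patterns
    are testable at level alpha / k, and the corrected threshold alpha / k.
    """
    p_mins = [min_p_value(x, n, n1) for x in supports]
    k = 1
    # The loop ends by k = len(supports) at the latest (the Bonferroni factor).
    while sum(p <= alpha / k for p in p_mins) > k:
        k += 1
    return alpha / k, k


if __name__ == "__main__":
    # Hypothetical supports of six patterns in 20 objects, 10 per class.
    supports = [1, 2, 3, 5, 8, 10]
    threshold, k = tarone_threshold(supports, n=20, n1=10)
    print(f"correction factor {k} instead of {len(supports)}")
    print(f"corrected significance threshold {threshold:.4f}")
```

In this toy setting, the patterns with supports 1, 2 and 3 can never reach significance, so only three hypotheses count towards the correction factor, whereas a Bonferroni correction would divide by all six.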
