Multiple Testing Correction in Graph Mining

We present a method for finding all subgraphs whose occurrence is significantly enriched in a particular class of graphs while correcting for multiple testing. Detecting such significant subgraphs is a crucial step for further analysis across application domains, yet multiple testing of subgraphs has not been investigated before, because it is computationally expensive and also leads to a great loss in statistical power. We solve both problems by examining only testable subgraphs, which dramatically reduces the number of candidate subgraphs while still detecting all significant ones. Moreover, we exploit the dependence between testable subgraphs by considering the effective number of tests, which further increases statistical power. Our experiments show that the proposed methods are faster and statistically more powerful than the current state-of-the-art approach.
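
The abstract does not say how testability is decided, so the following is only a minimal sketch under stated assumptions: each subgraph is assessed with a one-sided Fisher's exact test for enrichment in the positive class, and a subgraph counts as testable when the smallest p-value its frequency could ever produce falls below the corrected threshold (a Tarone-style modified Bonferroni criterion). The function names, the toy frequencies, and the choice of test are illustrative assumptions, not the authors' implementation.

```python
from math import comb


def min_p_value(freq, n_pos, n_total):
    """Smallest one-sided Fisher p-value attainable by a subgraph that
    occurs in `freq` of `n_total` graphs, `n_pos` of which are positive.

    The extreme case places as many occurrences as possible in the
    positive class; its hypergeometric probability is the minimum p-value.
    """
    a_max = min(freq, n_pos)  # most occurrences the positive class can hold
    return (comb(n_pos, a_max) * comb(n_total - n_pos, freq - a_max)
            / comb(n_total, freq))


def tarone_correction_factor(frequencies, n_pos, n_total, alpha=0.05):
    """Smallest k such that at most k subgraphs are testable at level alpha / k.

    `frequencies` holds one occurrence count per candidate subgraph
    (hypothetical input; a real miner would enumerate these on the fly).
    """
    min_ps = [min_p_value(f, n_pos, n_total) for f in frequencies]
    k = 1
    # Increase k until the number of testable subgraphs no longer exceeds it.
    while sum(p <= alpha / k for p in min_ps) > k:
        k += 1
    return k


# Toy example: 20 graphs, 10 positive, 8 candidate subgraphs.
freqs = [1, 1, 1, 2, 2, 5, 8, 10]
k = tarone_correction_factor(freqs, n_pos=10, n_total=20)
print(k, len(freqs))  # correction factor vs. plain Bonferroni factor
```

In this toy setting only the three most frequent subgraphs can reach significance at all, so the correction factor drops from 8 (plain Bonferroni over every candidate) to 3. The rare subgraphs could never be significant, so skipping them loses no discoveries.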
