Discriminative frequent subgraph mining with optimality guarantees

The goal of frequent subgraph mining is to detect subgraphs that occur frequently in a dataset of graphs. In classification settings, one is often interested in discovering discriminative frequent subgraphs, whose presence or absence is indicative of the class membership of a graph. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that greedy feature selection yields a near-optimal solution. Second, this submodular quality criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, where it helps to prune the search space for discriminative frequent subgraphs during the mining itself. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 302-318, 2010
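The greedy feature selection mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the quality criterion, so the code below assumes a correspondence-count criterion: the number of graph pairs from different classes that the selected subgraph features cannot tell apart (lower is better). The function names, the binary occurrence matrix, and the 0/1 labels are all hypothetical, and the gSpan pruning step is not shown.

```python
from typing import Dict, List, Sequence, Tuple


def correspondences(
    features: Sequence[Sequence[int]],
    labels: Sequence[int],
    selected: Sequence[int],
) -> int:
    """Count cross-class graph pairs with identical values on `selected`.

    features[i][j] is 1 if (hypothetical) subgraph j occurs in graph i.
    labels[i] is 1 for the positive class and 0 for the negative class.
    This criterion is an assumption made for illustration.
    """
    counts: Dict[Tuple[int, ...], Tuple[int, int]] = {}
    for row, y in zip(features, labels):
        key = tuple(row[j] for j in selected)
        pos, neg = counts.get(key, (0, 0))
        counts[key] = (pos + 1, neg) if y == 1 else (pos, neg + 1)
    # Every positive/negative pair sharing a key is still indistinguishable.
    return sum(pos * neg for pos, neg in counts.values())


def greedy_select(
    features: Sequence[Sequence[int]],
    labels: Sequence[int],
    k: int,
) -> List[int]:
    """Greedily add the feature that removes the most correspondences."""
    selected: List[int] = []
    remaining = set(range(len(features[0])))
    current = correspondences(features, labels, selected)
    while remaining and len(selected) < k:
        # Ties are broken in favour of the lowest feature index.
        best = min(
            sorted(remaining),
            key=lambda j: correspondences(features, labels, selected + [j]),
        )
        best_score = correspondences(features, labels, selected + [best])
        if best_score >= current:
            break  # no remaining feature resolves any further pair
        selected.append(best)
        remaining.remove(best)
        current = best_score
    return selected


if __name__ == "__main__":
    # Toy data: 4 graphs, 3 candidate subgraph features.
    X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1]]
    y = [1, 1, 0, 0]
    chosen = greedy_select(X, y, k=2)
    print(chosen, correspondences(X, y, chosen))
```

Each greedy step re-evaluates the criterion for every remaining candidate, which is simple but not efficient; it is meant only to show the selection loop, not how CORK is implemented inside gSpan.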
