Effective feature construction by maximum common subgraph sampling

The standard approach to feature construction and predictive learning on molecular datasets is to employ computationally expensive graph mining techniques and to bias the feature search using frequency or correlation measures. The resulting features are then typically used in predictive models built with, for example, SVMs or decision trees. We take a different approach: rather than mining for all optimal local patterns, we extract features from the set of pairwise maximum common subgraphs. For outerplanar examples, the maximum common subgraphs under block-and-bridge-preserving subgraph isomorphism are computed in polynomial time. We empirically observe a significant increase in predictive performance when using maximum common subgraph features instead of correlated local patterns on 60 benchmark datasets from NCI. Moreover, we show that when we randomly sample the pairs of graphs from which to extract the maximum common subgraphs, we obtain a smaller set of features that still achieves the same predictive performance as methods that exhaustively enumerate all possible patterns. The sampling strategy turns out to be a very good compromise: it trades a slight decrease in predictive performance (still comparable with state-of-the-art methods) for a significant runtime reduction (two orders of magnitude on a popular medium-sized chemoinformatics dataset). This suggests that maximum common subgraphs are interesting and meaningful features.
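The pair-sampling idea can be illustrated with a minimal sketch. It is not the authors' implementation: the names `sample_mcs_features`, `featurize`, `toy_common_subgraph`, and `toy_contains` are hypothetical, and a graph is reduced here to a set of labeled edges so that "common subgraph" becomes plain set intersection. A real pipeline would plug in a maximum common subgraph routine under block-and-bridge-preserving isomorphism and a matching containment test in place of the two toy functions.

```python
import random
from itertools import combinations
from typing import Callable, FrozenSet, List, Sequence

# Toy stand-in (assumption): a graph is just a set of labeled edges.
Graph = FrozenSet[str]


def toy_common_subgraph(g: Graph, h: Graph) -> Graph:
    """Placeholder for a real MCS routine: the labeled edges shared by g and h."""
    return g & h


def toy_contains(pattern: Graph, g: Graph) -> bool:
    """Placeholder for a real subgraph-isomorphism test."""
    return pattern <= g


def sample_mcs_features(
    graphs: Sequence[Graph],
    n_pairs: int,
    mcs: Callable[[Graph, Graph], Graph] = toy_common_subgraph,
    seed: int = 0,
) -> List[Graph]:
    """Draw n_pairs distinct graph pairs at random and collect their common subgraphs."""
    rng = random.Random(seed)
    # Enumerating all pairs is quadratic; acceptable for a sketch on small data.
    all_pairs = list(combinations(range(len(graphs)), 2))
    chosen = rng.sample(all_pairs, min(n_pairs, len(all_pairs)))
    features = set()
    for i, j in chosen:
        common = mcs(graphs[i], graphs[j])
        if common:  # skip pairs that share nothing
            features.add(common)
    # Sort for a deterministic feature order.
    return sorted(features, key=lambda f: (len(f), sorted(f)))


def featurize(
    graphs: Sequence[Graph],
    features: Sequence[Graph],
    contains: Callable[[Graph, Graph], bool] = toy_contains,
) -> List[List[int]]:
    """Binary matrix: entry [r][c] is 1 if feature c occurs in graph r."""
    return [[int(contains(f, g)) for f in features] for g in graphs]
```

The resulting binary matrix can be passed to any standard learner, such as an SVM or a decision tree. Lowering `n_pairs` shrinks the feature set and the number of common-subgraph computations, which is where the runtime saving described above would come from.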
