GPM: A graph pattern matching kernel with diffusion for chemical compound classification

Classifying chemical compounds is an active topic in drug design and other cheminformatics applications. Graphs are general tools for organizing information from heterogeneous sources and have been applied to modelling many kinds of biological data. With the rapid accumulation of chemical structure data, building highly accurate predictive models for chemical graphs has emerged as a new challenge. In this paper, we present a novel technique called the Graph Pattern Matching kernel (GPM). Our idea is to leverage existing frequent pattern discovery methods and explore their application to kernel classifiers (e.g., support vector machines) for graph classification. In our method, we first identify all frequent patterns in a graph database. We then map these frequent subgraphs to the graphs in the database and use a diffusion process to label the nodes of each graph. Finally, the kernel is computed using a set matching algorithm. We performed experiments on 16 chemical structure data sets and compared our method to other major graph kernels. The experimental results demonstrate excellent performance of our method.
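The three steps named in the abstract (frequent patterns, diffusion-based node labelling, set matching) can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several simplifying assumptions that the abstract does not specify: patterns are restricted to labelled edges rather than general frequent subgraphs, the diffusion is a fixed number of neighbour-averaging steps with a hypothetical `decay` weight, the node kernel is a plain dot product, and the set matching is a brute-force maximum-weight assignment that is only practical for very small graphs. All function names and the graph representation are illustrative.

```python
from itertools import permutations

# Hypothetical representation: a graph is (labels, edges), where labels maps
# node id -> atom label and edges is a set of frozenset({u, v}) pairs.

def edge_patterns(graph):
    """Map each labelled-edge pattern to the set of nodes it covers."""
    labels, edges = graph
    found = {}
    for edge in edges:
        u, v = tuple(edge)
        pattern = tuple(sorted((labels[u], labels[v])))
        found.setdefault(pattern, set()).update((u, v))
    return found

def frequent_patterns(graphs, min_support):
    """Keep patterns that occur in at least min_support graphs."""
    counts = {}
    for g in graphs:
        for p in edge_patterns(g):
            counts[p] = counts.get(p, 0) + 1
    return sorted(p for p, c in counts.items() if c >= min_support)

def node_features(graph, patterns, decay=0.5, steps=1):
    """Label nodes by pattern membership, then diffuse along edges."""
    labels, edges = graph
    covered = edge_patterns(graph)
    feats = {v: [1.0 if v in covered.get(p, ()) else 0.0 for p in patterns]
             for v in labels}
    neighbors = {v: [] for v in labels}
    for edge in edges:
        u, v = tuple(edge)
        neighbors[u].append(v)
        neighbors[v].append(u)
    for _ in range(steps):
        # Each node keeps its own vector and adds a decayed copy of its neighbours'.
        feats = {v: [x + decay * sum(feats[u][i] for u in neighbors[v])
                     for i, x in enumerate(feats[v])]
                 for v in labels}
    return list(feats.values())

def matching_kernel(fa, fb):
    """Best one-to-one matching of node vectors, scored by dot products."""
    if len(fa) > len(fb):
        fa, fb = fb, fa  # match the smaller node set into the larger one
    dot = lambda x, y: sum(a * b for a, b in zip(x, y))
    return max(sum(dot(x, fb[j]) for x, j in zip(fa, perm))
               for perm in permutations(range(len(fb)), len(fa)))

# Toy molecules: C-C-O, C-O and C-C-N.
g1 = ({0: "C", 1: "C", 2: "O"}, {frozenset({0, 1}), frozenset({1, 2})})
g2 = ({0: "C", 1: "O"}, {frozenset({0, 1})})
g3 = ({0: "C", 1: "C", 2: "N"}, {frozenset({0, 1}), frozenset({1, 2})})

patterns = frequent_patterns([g1, g2, g3], min_support=2)
f1, f2 = node_features(g1, patterns), node_features(g2, patterns)
k12 = matching_kernel(f1, f2)
k11 = matching_kernel(f1, f1)
```

With `min_support=2`, only the C-C and C-O edge patterns survive, so the nitrogen-containing edge of `g3` contributes no feature dimension. The resulting kernel values could be assembled into a Gram matrix and passed to a kernel classifier such as a support vector machine. For graphs of realistic size, the brute-force search in `matching_kernel` would need to be replaced by a polynomial-time assignment solver.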
