Approximation of Graph Kernel Similarities for Chemical Graphs by Kernel Principal Component Analysis

Graph kernels have been successfully applied to chemical graphs in small- to medium-sized machine learning problems. However, graph kernels often require a graph transformation before the kernel can be computed. Furthermore, the kernel calculation can have polynomial complexity of degree three or higher. As a result, graph kernels are impractical for large instance-based machine learning problems. Using kernel principal component analysis, we mapped the compounds onto the principal components, obtaining q-dimensional real-valued vectors. The goal of this study is to investigate the correlation between the graph kernel similarities and the similarities between these vectors. In the experiments, we compared the similarities on various data sets covering a wide range of typical chemical data mining problems. The similarity matrix of the vectorial projections was computed with the Jaccard and cosine similarity coefficients and correlated with the similarity matrix of the original graph kernel. The main result is a strong correlation between the similarities of the vectors and the original graph kernel, in terms of both rank correlation and linear correlation. The method appears to be robust and largely independent of the choice of the reference subset, with observed standard deviations below 5%. Important applications of the approach are instance-based data mining and machine learning tasks in which computing the original graph kernel would be prohibitive.
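The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: a Gaussian kernel on random vectors stands in for a graph kernel on molecules, and the function names (`rbf_kernel`, `kernel_pca_embed`, `tanimoto_similarity`), the reference subset size, and the value of q are all assumptions chosen for illustration. The Jaccard coefficient is taken here in its common real-valued (Tanimoto) form, which the abstract does not specify.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Stand-in for a graph kernel (assumption): Gaussian kernel on vectors."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_pca_embed(K_ref, K_new_ref, q, tol=1e-10):
    """Project items onto the top-q kernel principal components of a reference set.

    K_ref:     (m, m) kernel matrix among the reference items.
    K_new_ref: (n, m) kernel values between the items to embed and the reference items.
    Returns an (n, q') array with q' <= q (near-zero components are dropped).
    """
    m = K_ref.shape[0]
    one_m = np.full((m, m), 1.0 / m)
    # Center the reference kernel in feature space.
    Kc = K_ref - one_m @ K_ref - K_ref @ one_m + one_m @ K_ref @ one_m
    lam, vec = np.linalg.eigh(Kc)
    order = np.argsort(lam)[::-1][:q]          # largest eigenvalues first
    lam, vec = lam[order], vec[:, order]
    keep = lam > tol                           # avoid dividing by ~0 eigenvalues
    alpha = vec[:, keep] / np.sqrt(lam[keep])  # unit-norm components in feature space
    # Center the new kernel rows consistently with the reference set.
    one_n = np.full((K_new_ref.shape[0], m), 1.0 / m)
    Kc_new = K_new_ref - one_n @ K_ref - K_new_ref @ one_m + one_n @ K_ref @ one_m
    return Kc_new @ alpha

def cosine_similarity(Z):
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    Zn = Z / np.where(norms > 0, norms, 1.0)
    return Zn @ Zn.T

def tanimoto_similarity(Z):
    """Real-valued Jaccard (Tanimoto): <x,y> / (|x|^2 + |y|^2 - <x,y>)."""
    G = Z @ Z.T
    d = np.diag(G)
    denom = d[:, None] + d[None, :] - G
    return G / np.where(denom > 0, denom, 1.0)

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    # Rank correlation as Pearson correlation of ranks (ties are not averaged).
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(a), rank(b))

# Toy data in place of compounds; sizes are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
ref = rng.choice(len(X), size=40, replace=False)   # reference subset

# The full kernel matrix is computed here only to have something to compare with.
K_full = rbf_kernel(X, X)
Z = kernel_pca_embed(K_full[np.ix_(ref, ref)], K_full[:, ref], q=10)

iu = np.triu_indices(len(X), k=1)                  # off-diagonal pairs only
for name, S in [("cosine", cosine_similarity(Z)), ("Tanimoto", tanimoto_similarity(Z))]:
    print(name, "Pearson:", round(pearson(S[iu], K_full[iu]), 3),
          "Spearman:", round(spearman(S[iu], K_full[iu]), 3))
```

In this sketch, embedding n items requires only the n × m kernel values against the m reference items; afterwards all pairwise similarities are cheap vector operations. Repeating the run with different random reference subsets would give a rough picture of the sensitivity to that choice.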
