A structural cluster kernel for learning on graphs

In recent years, graph kernels have received considerable interest within the machine learning and data mining community. Here, we introduce an approach that enables kernel methods to use additional information hidden in the structural neighborhood of the graphs under consideration. Our structural cluster kernel (SCK) incorporates similarities induced by a structural clustering algorithm to improve state-of-the-art graph kernels. The approach is based on the idea that graph similarity can be described not only by the similarity between the graphs themselves, but also by the similarity they possess with respect to their structural neighborhood. We applied the kernel in a supervised and a semi-supervised setting to regression and classification problems on a number of real-world datasets of molecular graphs. Our results show that the structural cluster similarity information can indeed improve the prediction performance of the base kernel, particularly when the dataset is structurally sparse and consequently structurally diverse. Taking a large number of unlabeled instances into account improves the performance of the structural cluster kernel further.
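The idea of augmenting a base graph kernel with cluster-induced similarity can be illustrated with a minimal sketch. The abstract does not give the actual SCK formula, so everything below is an assumption made for illustration: the cluster similarity is taken to be the number of structural clusters two graphs share, both kernels are cosine-normalized, and they are mixed by a convex combination with a hypothetical `weight` parameter. The function names are likewise hypothetical.

```python
import numpy as np


def cosine_normalize(K):
    """Scale a kernel matrix so that every non-zero diagonal entry becomes 1."""
    d = np.sqrt(np.diag(K))
    d[d == 0] = 1.0  # leave all-zero rows untouched instead of dividing by zero
    return K / np.outer(d, d)


def cluster_kernel(membership):
    """Assumed cluster similarity: number of clusters two graphs share.

    membership is an (n_graphs, n_clusters) 0/1 matrix; clusters may overlap.
    M @ M.T is a Gram matrix and therefore positive semidefinite.
    """
    M = np.asarray(membership, dtype=float)
    return M @ M.T


def combined_kernel(K_base, membership, weight=0.5):
    """Convex combination of a base kernel and the cluster kernel.

    This mixing rule is an illustrative assumption, not the published SCK.
    A convex combination of two PSD matrices is again PSD.
    """
    K_b = cosine_normalize(np.asarray(K_base, dtype=float))
    K_c = cosine_normalize(cluster_kernel(membership))
    return (1.0 - weight) * K_b + weight * K_c
```

With an identity base kernel (no two graphs look alike to the base kernel), graphs that fall into the same structural cluster still receive a non-zero similarity, which is the effect the abstract describes for structurally diverse datasets. Cluster memberships computed on labeled and unlabeled graphs together would give the semi-supervised variant under the same assumptions.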
