Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity

MOTIVATION: Small molecules play a fundamental role in organic chemistry and biology. They can be used to probe biological systems and to discover new drugs and other useful compounds. As more large datasets of small molecules become available, computational methods are needed that can handle molecules of variable size and structure and predict their physical, chemical and biological properties.

RESULTS: Here we develop several new classes of kernels for small molecules using their 1D, 2D and 3D representations. In 1D, we consider string kernels based on SMILES strings. In 2D, we introduce several similarity kernels based on conventional or generalized fingerprints. Generalized fingerprints are derived by counting, in different ways, the subpaths of the bond graph found by depth-first search. In 3D, we consider similarity measures between histograms of pairwise distances between atom classes. These kernels can be computed efficiently and are applied to the classification and prediction of mutagenicity, toxicity and anti-cancer activity on three publicly available datasets. The results, obtained by cross-validation, are state-of-the-art. Tradeoffs between the various kernels are briefly discussed.

AVAILABILITY: Datasets are available from http://www.igb.uci.edu/servers/servers.html
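To make the 2D construction concrete, the following is a minimal sketch of a path-count fingerprint and two similarity measures built on it. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the path labeling, the depth limit, or the similarity functions, so the names `path_counts`, `tanimoto` and `minmax`, the string encoding of paths, and the choice of Tanimoto-style similarities are all assumptions made here. Molecules are given as a hypothetical list of atom symbols plus a list of `(i, j, bond_label)` tuples.

```python
from collections import Counter


def path_counts(atoms, bonds, max_depth):
    """Count labeled simple paths of at most max_depth bonds.

    Paths are enumerated by depth-first search from every atom, so each
    path is seen once per direction (e.g. "C-O" and "O-C" are separate keys).
    The string encoding of a path is an assumption of this sketch.
    """
    adj = {i: [] for i in range(len(atoms))}
    for i, j, label in bonds:
        adj[i].append((j, label))
        adj[j].append((i, label))

    counts = Counter()

    def dfs(node, visited, path, depth):
        counts[path] += 1
        if depth == max_depth:
            return
        for nxt, label in adj[node]:
            if nxt not in visited:  # simple paths only: never revisit an atom
                dfs(nxt, visited | {nxt}, path + label + atoms[nxt], depth + 1)

    for start in range(len(atoms)):
        dfs(start, {start}, atoms[start], 0)
    return counts


def tanimoto(a, b):
    """Binary similarity: shared paths divided by paths present in either molecule."""
    keys_a, keys_b = set(a), set(b)
    union = len(keys_a | keys_b)
    return len(keys_a & keys_b) / union if union else 0.0


def minmax(a, b):
    """Count-based similarity: sum of per-path minima over sum of per-path maxima."""
    keys = set(a) | set(b)
    den = sum(max(a[k], b[k]) for k in keys)
    num = sum(min(a[k], b[k]) for k in keys)
    return num / den if den else 0.0


# Toy heavy-atom graphs (hydrogens omitted): ethanol C-C-O and methanol C-O.
ethanol = path_counts(["C", "C", "O"], [(0, 1, "-"), (1, 2, "-")], max_depth=2)
methanol = path_counts(["C", "O"], [(0, 1, "-")], max_depth=2)

print(tanimoto(ethanol, methanol))
print(minmax(ethanol, methanol))
```

The binary variant only records whether a path occurs, while the count-based variant also reflects how often it occurs; which of the two is preferable for a given dataset is an empirical question.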
