Ligand Prediction for Orphan Targets Using Support Vector Machines and Various Target-Ligand Kernels Is Dominated by Nearest Neighbor Effects

Support vector machine (SVM) calculations combining protein and small molecule information have been applied to identify ligands for simulated orphan targets (i.e., targets for which no ligands were available). The combination of protein and ligand information was facilitated through the design of target-ligand kernel functions that account for pairwise ligand and target similarity. The design and biological information content of such kernel functions was expected to play a major role for target-directed ligand prediction. Therefore, a variety of target-ligand kernels were implemented to capture different types of target information including sequence, secondary structure, tertiary structure, biophysical properties, ontologies, or structural taxonomy. These kernels were tested in ligand predictions for simulated orphan targets in two target protein systems characterized by the presence of different intertarget relationships. Surprisingly, although there were target- and set-specific differences in prediction rates for alternative target-ligand kernels, the performance of these kernels was overall similar and also similar to SVM linear combinations. Test calculations designed to better understand possible reasons for these observations revealed that ligand information provided by nearest neighbors of orphan targets significantly influenced SVM performance, much more so than the inclusion of protein information. As long as ligands of closely related neighbors of orphan targets were available for SVM learning, orphan target ligands could be well predicted, regardless of the type and sophistication of the kernel function that was used. These findings suggest simplified strategies for SVM-based ligand prediction for orphan targets.

[1]  Silvio C. E. Tosatto,et al.  The SSEA server for protein secondary structure alignment , 2005, Bioinform..

[2]  Jean-Philippe Vert,et al.  Protein-ligand interaction prediction: an improved chemogenomics approach , 2008, Bioinform..

[3]  Thomas Gärtner,et al.  Support-Vector-Machine-Based Ranking Significantly Improves the Effectiveness of Similarity Searching Using 2D Fingerprints and Multiple Reference Compounds , 2008, J. Chem. Inf. Model..

[4]  Manfred J. Sippl,et al.  On distance and similarity in fold space , 2008, Bioinform..

[5]  Neil D. Rawlings,et al.  MEROPS: the peptidase database , 2009, Nucleic Acids Res..

[6]  L. Jiang,et al.  PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence , 2006, Nucleic Acids Res..

[7]  Thomas Gärtner,et al.  Ligand Prediction from Protein Sequence and Small Molecule Information Using Support Vector Machines and Fingerprint Descriptors , 2009, J. Chem. Inf. Model..

[8]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[9]  David A. Gough,et al.  Virtual Screen for Ligands of Orphan G Protein-Coupled Receptors , 2005, J. Chem. Inf. Model..

[10]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[12]  Bernhard Schölkopf,et al.  Learning with kernels , 2001 .

[13]  Jeffrey W. Smith,et al.  CutDB: a proteolytic event database , 2006, Nucleic Acids Res..

[14]  Pierre Baldi,et al.  Graph kernels for chemical informatics , 2005, Neural Networks.

[15]  Brian K. Shoichet,et al.  ZINC - A Free Database of Commercially Available Compounds for Virtual Screening , 2005, J. Chem. Inf. Model..

[16]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[17]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[18]  I. Longden,et al.  EMBOSS: the European Molecular Biology Open Software Suite. , 2000, Trends in genetics : TIG.

[19]  Xin Wen,et al.  BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities , 2006, Nucleic Acids Res..

[20]  Yoshua Bengio,et al.  Collaborative Filtering on a Family of Biological Targets , 2006, J. Chem. Inf. Model..

[21]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[22]  Z. R. Li,et al.  Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence , 2006, Nucleic Acids Res..

[23]  Jürgen Bajorath,et al.  Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches. , 2007, Drug discovery today.

[24]  Yang Dai,et al.  Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction , 2006, BMC Bioinformatics.

[25]  Gunnar Rätsch,et al.  An introduction to kernel-based learning algorithms , 2001, IEEE Trans. Neural Networks.

[26]  I. Muchnik,et al.  Prediction of protein folding class using global description of amino acid sequence. , 1995, Proceedings of the National Academy of Sciences of the United States of America.

[27]  Eleazar Eskin,et al.  The Spectrum Kernel: A String Kernel for SVM Protein Classification , 2001, Pacific Symposium on Biocomputing.

[28]  M. J. D. Powell,et al.  Radial basis functions for multivariable interpolation: a review , 1987 .

[29]  D. Barrell,et al.  The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. , 2003, Genome research.

[30]  John M. Barnard,et al.  Chemical Similarity Searching , 1998, J. Chem. Inf. Comput. Sci..

[31]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[32]  Manfred J. Sippl,et al.  A note on difficult structure alignment problems , 2008, Bioinform..

[33]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[34]  X Chen,et al.  BindingDB: a web-accessible molecular recognition database. , 2001, Combinatorial chemistry & high throughput screening.

[35]  Anne Mai Wassermann,et al.  Searching for Target-Selective Compounds Using Different Combinations of Multiclass Support Vector Machine Ranking Methods, Kernel Functions, and Fingerprint Descriptors , 2009, J. Chem. Inf. Model..

[36]  Tatsuya Akutsu,et al.  Protein homology detection using string alignment kernels , 2004, Bioinform..

[37]  Cathy H. Wu,et al.  UniProt: the Universal Protein knowledgebase , 2004, Nucleic Acids Res..